Data Preparation
Overview
This section explores and cleans the data to prepare them for the analysis of weather patterns in San Diego, CA. Specifically, in the next section we build a decision tree for predicting low humidity days, which are known to increase the risk of wildfires.
The dataset is described and imported in the previous section.
The analysis is described in the next section.
This project is based on assignments from Big Data Specialization by University of California San Diego on Coursera.
The analysis for this project was performed in Spark.
Data Exploration
The daily_weather.csv dataset has 11 features and 1,095 rows (or samples) as can be verified using len(df.columns)
and df.count()
commands, respectively.
To explore the dataset, we output feature names using df.columns
, the data type for each feature using df.printSchema()
and summary statistics using df.describe().toPandas().transpose()
. The output is as follows:
Feature names
['number',
'air_pressure_9am',
'air_temp_9am',
'avg_wind_direction_9am',
'avg_wind_speed_9am',
'max_wind_direction_9am',
'max_wind_speed_9am',
'rain_accumulation_9am',
'rain_duration_9am',
'relative_humidity_9am',
'relative_humidity_3pm']
Data type for each feature:
root
|-- number: integer (nullable = true)
|-- air_pressure_9am: double (nullable = true)
|-- air_temp_9am: double (nullable = true)
|-- avg_wind_direction_9am: double (nullable = true)
|-- avg_wind_speed_9am: double (nullable = true)
|-- max_wind_direction_9am: double (nullable = true)
|-- max_wind_speed_9am: double (nullable = true)
|-- rain_accumulation_9am: double (nullable = true)
|-- rain_duration_9am: double (nullable = true)
|-- relative_humidity_9am: double (nullable = true)
|-- relative_humidity_3pm: double (nullable = true)
Summary statistics
summary | count | mean | stddev | min | max |
---|---|---|---|---|---|
number | 1095 | 547.0 | 316.24 | 0 | 1094 |
air_pressure_9am | 1092 | 918.88 | 3.18 | 907.99 | 929.32 |
air_temp_9am | 1090 | 64.93 | 11.17 | 36.75 | 98.90 |
avg_wind_direction_9am | 1091 | 142.23 | 69.13 | 15.50 | 343.4 |
avg_wind_speed_9am | 1092 | 5.50 | 4.55 | 0.693 | 23.55 |
max_wind_direction_9am | 1092 | 148.95 | 67.23 | 28.89 | 312.19 |
max_wind_speed_9am | 1091 | 7.01 | 5.59 | 1.18 | 29.84 |
rain_accumulation_9am | 1089 | 0.20 | 1.59 | 0.0 | 24.01 |
rain_duration_9am | 1092 | 294.10 | 1598.07 | 0.0 | 17704.0 |
relative_humidity_9am | 1095 | 34.24 | 25.47 | 6.09 | 92.62 |
relative_humidity_3pm | 1095 | 35.34 | 22.52 | 5.30 | 92.25 |
Handling Missing Values
The summary statistics table indicates that some of the features have less than 1,095 rows (the total number of rows in the dataset).
For example, air_temp_9am has only 1,090 rows:
df.describe('air_temp_9am').show()
summary | air_temp_9am |
---|---|
count | 1090 |
mean | 64.93 |
stddev | 11.17 |
min | 36.75 |
max | 98.90 |
This means that five rows in air_temp_9am have missing values.
We can drop all the rows missing a value in any feature as follows:
removeAllDF = df.na.drop()
This leaves us with 1,064 rows in data frame removeAllDF, as can be verified using removeAllDF.count()
.
The summary statistics for the air temperature at 9am are now as follows:
removeAllDF.describe('air_temp_9am').show()
summary | air_temp_9am |
---|---|
count | 1064 |
mean | 65.02 |
stddev | 11.16 |
min | 36.75 |
max | 98.90 |
After the number of observations for air_temp_9am declined from 1,090 to 1,064, the mean and standard deviation of this feature are still close the original values: the mean is 64.933 vs. 65.022, and standard deviation is 11.175 vs. 11.168.
Alternatively, we can replace missing values in each feature with the mean value for that feature:
from pyspark.sql.functions import avg
imputeDF = df
for x in imputeDF.columns:
meanValue = removeAllDF.agg(avg(x)).first()[0]
print(x, meanValue)
imputeDF = imputeDF.na.fill(meanValue, [x])
The above code produces the following output of the mean values for each feature:
number 545.00
air_pressure_9am 918.90
air_temp_9am 65.02
avg_wind_direction_9am 142.30
avg_wind_speed_9am 5.48
max_wind_direction_9am 148.48
max_wind_speed_9am 6.99
rain_accumulation_9am 0.18
rain_duration_9am 266.39
relative_humidity_9am 34.07
relative_humidity_3pm 35.14
The summary statistics for air_temp_9am are now as follows:
imputeDF.describe('air_temp_9am').show()
summary | air_temp_9am |
---|---|
count | 1095 |
mean | 64.93 |
stddev | 11.14 |
min | 36.75 |
max | 98.90 |
The number of rows in air_temp_9am is now 1,095 (increased from 1,090) which means that the feature no longer has missing values.
In the analysis performed in the next section, we will drop all the rows missing a value in any feature:
df = removeAllDF
Next step: Analysis