Data Preparation

Overview

This section explores and cleans the data to prepare it for the analysis of weather patterns in San Diego, CA. Specifically, the next section builds a decision tree for predicting low-humidity days, which are known to increase the risk of wildfires.

The dataset is described and imported in the previous section; the analysis itself is presented in the next section.

This project is based on assignments from the Big Data Specialization by the University of California San Diego on Coursera.

The analysis for this project was performed in Apache Spark using PySpark.

Data Exploration

The daily_weather.csv dataset has 11 features and 1,095 rows (or samples), as can be verified with the len(df.columns) and df.count() commands, respectively.
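As a minimal sketch (assuming a SparkSession named spark is available, or created as below, and that daily_weather.csv sits in the working directory), the data can be loaded and these dimensions checked as follows:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name here is arbitrary
spark = SparkSession.builder.appName('san-diego-weather').getOrCreate()

# Load the CSV with headers and inferred column types
df = spark.read.csv('daily_weather.csv', header=True, inferSchema=True)

print(len(df.columns))  # 11 features
print(df.count())       # 1095 rows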

To explore the dataset, we output the feature names using df.columns, the data type for each feature using df.printSchema(), and summary statistics using df.describe().toPandas().transpose(). The output is as follows:

Feature names

['number',
 'air_pressure_9am',
 'air_temp_9am',
 'avg_wind_direction_9am',
 'avg_wind_speed_9am',
 'max_wind_direction_9am',
 'max_wind_speed_9am',
 'rain_accumulation_9am',
 'rain_duration_9am',
 'relative_humidity_9am',
 'relative_humidity_3pm']

Data type for each feature

root
 |-- number: integer (nullable = true)
 |-- air_pressure_9am: double (nullable = true)
 |-- air_temp_9am: double (nullable = true)
 |-- avg_wind_direction_9am: double (nullable = true)
 |-- avg_wind_speed_9am: double (nullable = true)
 |-- max_wind_direction_9am: double (nullable = true)
 |-- max_wind_speed_9am: double (nullable = true)
 |-- rain_accumulation_9am: double (nullable = true)
 |-- rain_duration_9am: double (nullable = true)
 |-- relative_humidity_9am: double (nullable = true)
 |-- relative_humidity_3pm: double (nullable = true)

Summary statistics

summary                  count   mean     stddev    min      max
number                   1095    547.0    316.24    0        1094
air_pressure_9am         1092    918.88   3.18      907.99   929.32
air_temp_9am             1090    64.93    11.17     36.75    98.90
avg_wind_direction_9am   1091    142.23   69.13     15.50    343.4
avg_wind_speed_9am       1092    5.50     4.55      0.693    23.55
max_wind_direction_9am   1092    148.95   67.23     28.89    312.19
max_wind_speed_9am       1091    7.01     5.59      1.18     29.84
rain_accumulation_9am    1089    0.20     1.59      0.0      24.01
rain_duration_9am        1092    294.10   1598.07   0.0      17704.0
relative_humidity_9am    1095    34.24    25.47     6.09     92.62
relative_humidity_3pm    1095    35.34    22.52     5.30     92.25

Handling Missing Values

The summary statistics table indicates that some of the features have fewer than 1,095 non-missing values (the total number of rows in the dataset), i.e., they contain missing values.

For example, air_temp_9am has only 1,090 non-missing values:

df.describe('air_temp_9am').show()
summary air_temp_9am
count 1090
mean 64.93
stddev 11.17
min 36.75
max 98.90

This means that five rows in air_temp_9am have missing values.
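This can be confirmed directly by counting the null entries in that column (a quick check, assuming missing values are stored as nulls):

df.filter(df['air_temp_9am'].isNull()).count()  # returns 5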

We can drop all the rows missing a value in any feature as follows:

removeAllDF = df.na.drop()

This leaves us with 1,064 rows in the data frame removeAllDF, as can be verified using removeAllDF.count().

The summary statistics for the air temperature at 9am are now as follows:

removeAllDF.describe('air_temp_9am').show()
summary air_temp_9am
count 1064
mean 65.02
stddev 11.16
min 36.75
max 98.90

Although the number of observations for air_temp_9am declined from 1,090 to 1,064, the mean and standard deviation of this feature remain close to the original values: the mean is 64.933 vs. 65.022, and the standard deviation is 11.175 vs. 11.168.

Alternatively, we can replace missing values in each feature with the mean value of that feature. In the code below, the means are computed from removeAllDF, i.e., from the rows with no missing values:

from pyspark.sql.functions import avg

imputeDF = df

for x in imputeDF.columns:
    meanValue = removeAllDF.agg(avg(x)).first()[0]
    print(x, meanValue)
    imputeDF = imputeDF.na.fill(meanValue, [x])

The above code produces the following output of the mean values for each feature:

number 545.00
air_pressure_9am 918.90
air_temp_9am 65.02
avg_wind_direction_9am 142.30
avg_wind_speed_9am 5.48
max_wind_direction_9am 148.48
max_wind_speed_9am 6.99
rain_accumulation_9am 0.18
rain_duration_9am 266.39
relative_humidity_9am 34.07
relative_humidity_3pm 35.14

The summary statistics for air_temp_9am are now as follows:

imputeDF.describe('air_temp_9am').show()
summary air_temp_9am
count 1095
mean 64.93
stddev 11.14
min 36.75
max 98.90

The count for air_temp_9am is now 1,095 (up from 1,090), which means that the feature no longer has missing values.
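As a final sanity check (a sketch, again assuming missing values are stored as nulls), we can confirm that no column of imputeDF still contains missing values:

from pyspark.sql.functions import col, count, when

# Count the null entries in every column of imputeDF; each count should now be 0
imputeDF.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in imputeDF.columns]
).show()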

In the analysis performed in the next section, we will drop all the rows missing a value in any feature:

df = removeAllDF

Next step: Analysis