Data Preparation

Overview

This section prepares the weather data imported in the previous section for analysis. The analysis in the subsequent sections shows that the range of temperatures in Silicon Valley has widened in 2015 as compared to the previous 10-year period (from 2005 to 2014).

Data acquisition is described in the previous section.

The analysis is described in the next section.

This project is based on assignments from Applied Plotting, Charting & Data Representation in Python by University of Michigan on Coursera

The analysis for this project was performed in Python.

Data Processing

The following code imports the data, then outputs the first five rows of the data and summary statistics:

df = read_weather_data()

print(df.head())

print('...\n\nSummary Statistics:\n\n', df.describe())
print('\nMissing values across all attributes and samples: ', df.isnull().sum().sum())

The output is shown below:

            ID     Date Element  Data_Value
0  USC00043244   7/1/12    TMAX         222
1  USC00047807  6/24/05    TMAX         211
2  USC00046599  6/25/06    TMIN         139
3  USR0000CALT  8/14/09    TMAX         256
4  USC00047339  1/23/08    TMIN          56
...

Summary Statistics:

           Data_Value
count  143226.000000
mean      147.061546
std        74.438715
min       -67.000000
25%        94.000000
50%       139.000000
75%       194.000000
max       456.000000

Missing values across all attributes and samples:  0

The original data has a column ‘Element’ with TMIN and TMAX as values. First, we reshape the data to convert the Element column into two TMIN and TMAX columns by creating two datasets for TMIN and TMAX, respectively, and then joining the datasets. Next, we remove leap days (i.e. February 29th) from the dataset for the purpose of subsequent analysis and visualization.

def transform_weather_data():
    
    # Read data
    df = read_weather_data()
    
    # Reshape data
    df_min = (df[df.Element == 'TMIN']
              .drop(['ID', 'Element'], axis = 1)
              .groupby('Date').min()
              .sort_index()
              .reset_index(drop = False)
              .rename(columns = {'Data_Value': 'TMIN'}))
    df_min.Date = list(map(pd.to_datetime, df_min.Date))


    df_max = (df[df.Element == 'TMAX']
              .drop(['ID', 'Element'], axis = 1)
              .groupby('Date').max()
              .sort_index()
              .reset_index(drop = False)
              .rename(columns = {'Data_Value': 'TMAX'}))
    df_max.Date = list(map(pd.to_datetime, df_max.Date))


    df_joined = df_min.merge(df_max, how = 'inner', on = 'Date')

    # Remove leap days
    df_joined = df_joined[~((df_joined.Date.dt.month == 2) & (df_joined.Date.dt.day == 29))]
    
    return df_joined

print(df_joined.head())

The output below shows the first five rows of the transformed data frame:

        Date  TMIN  TMAX
0 2005-01-01    22   144
1 2006-01-01    56   172
2 2007-01-01    11   189
3 2008-01-01   -11   222
4 2009-01-01    44   178

Next step: Analysis