Data Preparation and Cleaning
Overview
This section first describes the cleaning and processing of the data on single-family homes in Denver, CO. It then turns to constructing new features that are subsequently used for predicting home values, including a comparables-based valuation and a value forecast based on the history of prior sales. Home values are then predicted with a random forest model fitted to the training data and evaluated on the test data; the model achieves a test-set R-squared of 0.91. The subsequent sections discuss the steps that I took in more detail.
A description of the data is provided in the previous section.
Estimation of the model and evaluation of its performance are discussed in the next section.
The analysis for this project was performed in Python.
Data Cleaning and Processing
The summary statistics of the raw dataset are reported below:
count mean std min 25% 50% 75% max
id 15000.0 51762290.74 61908763.48 143367.00 10048022.75 25632408.50 51142218.75 3.209481e+08
zipcode 15000.0 80204.92 9.72 80022.00 80205.00 80206.00 80207.00 8.020900e+04
latitude 14985.0 39.74 0.02 39.61 39.73 39.75 39.76 3.989000e+01
longitude 14985.0 -104.96 0.04 -105.11 -104.98 -104.96 -104.94 -1.048300e+02
bedrooms 15000.0 2.71 0.90 0.00 2.00 3.00 3.00 1.500000e+01
bathrooms 15000.0 2.20 1.17 0.00 1.00 2.00 3.00 1.200000e+01
rooms 15000.0 6.16 1.96 0.00 5.00 6.00 7.00 3.900000e+01
squareFootage 15000.0 1514.50 830.64 350.00 986.00 1267.50 1766.25 1.090700e+04
lotSize 15000.0 5820.77 3013.28 278.00 4620.00 5950.00 6270.00 1.228390e+05
yearBuilt 14999.0 1929.52 29.94 1874.00 1907.00 1925.00 1949.00 2.016000e+03
lastSaleAmount 15000.0 405356.34 775699.85 259.00 194000.00 320000.00 463200.00 4.560000e+07
priorSaleAmount 11287.0 259435.01 337938.70 0.00 110000.00 210000.00 330240.00 1.600000e+07
estimated_value 15000.0 637162.55 504418.49 147767.00 398434.75 518357.50 687969.25 1.014531e+07
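These statistics were generated with pandas' describe(). A minimal sketch of the call, assuming the raw data are loaded from a CSV file (the file name below is illustrative):

import pandas as pd
import numpy as np

# load the raw data (file name is an assumption) and print summary statistics for the numeric columns
data = pd.read_csv('denver_single_family_homes.csv')
print(np.around(data.describe().transpose(), decimals=2).to_string())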
To show summary statistics for the dates of the last and prior sales, I first convert these features to datetime format:
data.lastSaleDate = pd.to_datetime(data.lastSaleDate)
data.priorSaleDate = pd.to_datetime(data.priorSaleDate)
print(np.around(data[['lastSaleDate', 'priorSaleDate']].describe().transpose(), decimals=2).to_string())
This results in the following output:
count unique top freq first last
lastSaleDate 15000 4347 2014-03-03 00:00:00 19 1997-08-01 00:00:00 2017-07-21 00:00:00
priorSaleDate 11173 4475 2008-01-03 00:00:00 15 1967-05-10 00:00:00 2017-07-11 00:00:00
To prepare the data for analysis, I evaluated the accuracy of the data, imputed missing values, and constructed new features.
1. Evaluation of data accuracy
This step fixes inaccuracies in the data. It removes one duplicate record, one outlier in terms of lotSize (a lot of only 278 sq. feet), and three outliers in terms of lastSaleAmount ($45.6 million each, a price inconsistent with the homes' other characteristics, such as square footage). Removal of these records is performed using the code below:
import pandas as pd

def check_data_accuracy(data):
    # check whether multiple samples share the same address (printed for inspection)
    print(data.address.unique().size)
    print(data[data.duplicated(subset='address', keep=False)])
    # drop one duplicate record
    if 4762 in data.index:
        data.drop(data.index[4762], axis=0, inplace=True)
    # remove 3 records that appear to have an incorrect lastSaleAmount
    data = data[data.lastSaleAmount != 45600000]
    # convert zipcode to string format
    data.zipcode = data.zipcode.astype(str)
    # remove one observation with lotSize < 500 sq. feet
    data = data[data.lotSize > 500]
    return data
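The outliers removed above can be identified by inspecting the extremes of the relevant columns of the raw data. A short sketch (the columns printed here are chosen for illustration):

# the three $45.6M sales have modest square footage and estimated value, suggesting a data error
print(data.nlargest(5, 'lastSaleAmount')[['lastSaleAmount', 'squareFootage', 'estimated_value']])
# the smallest lot sizes; a single record has a lot of only 278 sq. feet
print(data.nsmallest(5, 'lotSize')[['lotSize', 'squareFootage']])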
2. Imputing missing values
A look at the summary statistics above reveals that the most significant issue in the data is the large number of missing values for priorSaleDate and priorSaleAmount. In addition, there are 1,296 zero values for priorSaleAmount.
With respect to these zero values, the time between the last and prior sale dates is substantially shorter when priorSaleAmount is zero (611 days on average) than when it is not (2,080 days on average). It is therefore possible that such prior sales never actually occurred and instead represent an intermediate stage in the process of the last sale.
If the same sale sometimes appears in the data both as the last sale and as the prior sale, we would expect sale dates and amounts to match between the last-sale and prior-sale fields for at least a small portion of the dataset. Indeed, there are 155 samples in which priorSaleDate and priorSaleAmount are identical to lastSaleDate and lastSaleAmount, respectively.
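These figures can be checked with a few lines of pandas (a short sketch, assuming lastSaleDate and priorSaleDate have already been converted to datetime as shown earlier):

# average gap between the prior and last sale, by whether the prior sale amount is zero
gap = (data.lastSaleDate - data.priorSaleDate).dt.days
print(gap[data.priorSaleAmount == 0].mean())   # roughly 611 days
print(gap[data.priorSaleAmount > 0].mean())    # roughly 2,080 days
# number of records in which the prior sale duplicates the last sale
dup = (data.priorSaleDate == data.lastSaleDate) & (data.priorSaleAmount == data.lastSaleAmount)
print(dup.sum())                               # 155 records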
Based on the above observations, I assume that when priorSaleAmount is zero, there was only one sale. I set priorSaleAmount and priorSaleDate to lastSaleAmount and lastSaleDate when priorSaleAmount is either zero or missing, or when priorSaleDate is missing. I further assume that observations with identical lastSaleDate and priorSaleDate but different lastSaleAmount and priorSaleAmount also reflect a single sale, and change priorSaleAmount to lastSaleAmount. In records where priorSaleDate comes after lastSaleDate, I set priorSaleDate to lastSaleDate. Finally, I create a dummy variable that takes the value of 1 when a prior sale exists and 0 otherwise.
The other features with missing values are latitude, longitude, and yearBuilt. I impute the average latitude and longitude of the home's zipcode when the coordinates are missing, and I set the single missing value of yearBuilt to the average yearBuilt in the data.
The code implementing the above steps is shown below:
import pandas as pd
import numpy as np

def impute_missing_values(data):
    # convert zipcode to string format
    data.zipcode = data.zipcode.astype(str)
    # set priorSaleDate and priorSaleAmount to lastSaleDate and lastSaleAmount, respectively,
    # when priorSaleAmount is either zero or NaN
    no_prior_sale = (data.priorSaleAmount == 0) | (data.priorSaleAmount.isnull())
    data.loc[no_prior_sale, 'priorSaleDate'] = data.loc[no_prior_sale, 'lastSaleDate']
    data.loc[no_prior_sale, 'priorSaleAmount'] = data.loc[no_prior_sale, 'lastSaleAmount']
    # do the same when priorSaleDate is NaN (the mask is computed before the dates are overwritten)
    no_prior_date = data.priorSaleDate.isnull()
    data.loc[no_prior_date, 'priorSaleDate'] = data.loc[no_prior_date, 'lastSaleDate']
    data.loc[no_prior_date, 'priorSaleAmount'] = data.loc[no_prior_date, 'lastSaleAmount']
    # assume that an observation with identical lastSaleDate and priorSaleDate but different
    # lastSaleAmount and priorSaleAmount did not have a prior sale (i.e., set priorSaleAmount to lastSaleAmount)
    same_date = (data.priorSaleAmount != data.lastSaleAmount) & (data.priorSaleDate == data.lastSaleDate)
    data.loc[same_date, 'priorSaleAmount'] = data.loc[same_date, 'lastSaleAmount']
    # fix observations whose prior sale date comes after the last sale date
    inconsistent = data.lastSaleDate < data.priorSaleDate
    data.loc[inconsistent, 'priorSaleDate'] = data.loc[inconsistent, 'lastSaleDate']
    # create a dummy variable that equals 1 when a prior sale occurred and 0 otherwise
    data['priorSaleDummy'] = 1
    data.loc[data.priorSaleDate == data.lastSaleDate, 'priorSaleDummy'] = 0
    # assign the average longitude and latitude of the respective zipcode
    # to homes for which these attributes are missing
    temp = data[['zipcode', 'latitude', 'longitude']].groupby('zipcode').mean()
    temp = temp.rename(columns={'latitude': 'av_lat', 'longitude': 'av_lon'}).reset_index(drop=False)
    data = data.merge(temp, on='zipcode', how='left')
    data.loc[data.latitude.isnull(), 'latitude'] = data.av_lat
    data.loc[data.longitude.isnull(), 'longitude'] = data.av_lon
    data.drop(['av_lat', 'av_lon'], axis=1, inplace=True)
    # alternatively, delete records with missing latitude or longitude:
    # data = data[data.latitude.notnull() & data.longitude.notnull()]
    # impute the single missing value of yearBuilt with the sample mean
    data.loc[data.yearBuilt.isnull(), 'yearBuilt'] = data.yearBuilt.mean()
    return data
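A sketch of how the two functions above fit together; after imputation, none of the listed columns should contain missing values:

data = check_data_accuracy(data)
data = impute_missing_values(data)
# verify that no missing values remain in the imputed columns
print(data[['latitude', 'longitude', 'yearBuilt', 'priorSaleDate', 'priorSaleAmount']].isnull().sum())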
The refined summary statistics after cleaning the data, imputing missing values, and constructing the new features described in step 3 below are shown here:
count mean std min 25% 50% 75% max
id 14995.0 51747046.96 61886948.72 143367.00 10048014.50 25632409.00 51142171.50 3.209481e+08
latitude 14995.0 39.74 0.02 39.61 39.73 39.75 39.76 3.989000e+01
longitude 14995.0 -104.96 0.04 -105.11 -104.98 -104.96 -104.94 -1.048300e+02
bedrooms 14995.0 2.71 0.90 0.00 2.00 3.00 3.00 1.500000e+01
bathrooms 14995.0 2.19 1.17 0.00 1.00 2.00 3.00 1.200000e+01
rooms 14995.0 6.16 1.96 0.00 5.00 6.00 7.00 3.900000e+01
squareFootage 14995.0 1514.48 830.63 350.00 986.00 1268.00 1766.00 1.090700e+04
lotSize 14995.0 5821.33 3013.40 832.00 4620.00 5950.00 6270.00 1.228390e+05
yearBuilt 14995.0 1929.51 29.93 1874.00 1907.00 1925.00 1949.00 2.016000e+03
lastSaleAmount 14995.0 565579.91 602265.46 435.61 309350.55 449413.01 643237.79 2.923536e+07
priorSaleAmount 14995.0 522941.82 706421.18 183.33 260258.14 398366.98 609081.04 3.573380e+07
estimated_value 14995.0 637238.82 504478.69 147767.00 398465.50 518476.00 688162.00 1.014531e+07
priorSaleDummy 14995.0 0.64 0.48 0.00 0.00 1.00 1.00 1.000000e+00
rebuiltDummy 14995.0 0.01 0.08 0.00 0.00 0.00 0.00 1.000000e+00
yearsBetweenSales 14995.0 3.68 5.86 0.00 0.00 0.99 5.26 4.988000e+01
annAppreciation 14995.0 0.10 0.18 -0.50 0.00 0.03 0.11 5.000000e-01
Dummy2012ForLastSaleAmount 14995.0 0.41 0.49 0.00 0.00 0.00 1.00 1.000000e+00
lastSaleAmountAfter2012 14995.0 242000.54 455360.62 0.00 0.00 0.00 427672.49 1.773222e+07
Dummy2012ForPriorSaleAmount 14995.0 0.16 0.36 0.00 0.00 0.00 0.00 1.000000e+00
priorSaleAmountAfter2012 14995.0 75165.38 312688.04 0.00 0.00 0.00 0.00 1.773222e+07
80203 14995.0 0.03 0.16 0.00 0.00 0.00 0.00 1.000000e+00
80204 14995.0 0.17 0.38 0.00 0.00 0.00 0.00 1.000000e+00
80205 14995.0 0.24 0.43 0.00 0.00 0.00 0.00 1.000000e+00
80206 14995.0 0.16 0.37 0.00 0.00 0.00 0.00 1.000000e+00
80207 14995.0 0.27 0.44 0.00 0.00 0.00 1.00 1.000000e+00
80209 14995.0 0.12 0.33 0.00 0.00 0.00 0.00 1.000000e+00
80123 14995.0 0.01 0.11 0.00 0.00 0.00 0.00 1.000000e+00
3. Constructing new features
I construct several new features and make adjustments to the existing features in order to enhance the model’s performance. I discuss the steps that I took below.
First, while the history of sale prices is important for valuing a home, many sales in the data took place a long time ago, and sale amounts are not adjusted for the appreciation in home value between lastSaleDate and the present.
The following example demonstrates how this issue can affect the model's performance. Suppose that home A is currently valued at $2 million and home B at $1 million. The data also show that home A was last sold in 1990 for $600,000, while home B was last sold recently, in 2018. In this example, a model that uses lastSaleAmount as a feature may predict a lower value for home A, even though home A is twice as expensive as home B.
To fix this timing issue, I adjust lastSaleAmount for the housing price appreciation between lastSaleDate and the present using the S&P/Case-Shiller CO-Denver Home Price Index. This index increased from 49.856 in December 1987 to 206.014 in December 2017, which implies an annual appreciation rate of about 4.84 percent (calculated as (206.014/49.856)^(1/30) - 1). For example, if a home was last sold in 2010 (eight years ago), I adjust its sale amount as lastSaleAmount × 1.0484^8. I perform the same adjustment for priorSaleAmount.
data['priceAppreciation'] = 1.0484
data['lastSaleAmount'] = data.lastSaleAmount * data.priceAppreciation.pow(2018 - data.lastSaleDate.dt.year)
data['priorSaleAmount'] = data.priorSaleAmount * data.priceAppreciation.pow(2018 - data.priorSaleDate.dt.year)
data.drop(['priceAppreciation'], axis = 1, inplace = True)
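The 4.84 percent rate hard-coded above can be derived directly from the index values quoted earlier:

# S&P/Case-Shiller CO-Denver Home Price Index: 49.856 in December 1987, 206.014 in December 2017
annual_rate = (206.014 / 49.856) ** (1 / 30) - 1
print(round(annual_rate, 4))   # approximately 0.0484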
Second, because homes are typically valued using information from sales of comparable homes, I created a valuation measure based on the estimated_value (the Zestimate) of comparable homes. When the model is trained, each home in the training data is valued using the three most comparable homes in the training data (specifically, their average estimated_value per square foot multiplied by the home's square footage). When the model is tested, each home in the test data is compared to the three most comparable homes in the training set (not the test set, because the estimated_value of comparable homes cannot be used as a predictor at the test stage). I use geographical proximity and the numbers of bedrooms and bathrooms to assess comparability. This valuation approach is similar in spirit to the k-nearest-neighbors algorithm.
The implementation of the valuation approach using comparable homes is shown below:
import pandas as pd
import numpy as np
from math import sin, cos, sqrt, atan2, radians

# pre-compute coordinates in radians and values per square foot for the training set
train['lat'] = train.latitude.apply(radians)
train['lon'] = train.longitude.apply(radians)
train['estimated_value_sqft'] = train.estimated_value / train.squareFootage
train['lastSaleAmount_sqft'] = train.lastSaleAmount / train.squareFootage

def zestimate_comps(idTargetHome, sqftTargetHome, lonTargetHome, latTargetHome, bedTargetHome, bathTargetHome):
    # select comparables and compute an estimate of home value based on the Zestimate (estimated_value) of the comparables
    train_cp = train.copy()  # copy so that the global training frame is not modified
    # haversine distance (in km) between the target home and every home in the training set
    R = 6373.0  # approximate radius of the earth in km
    latTargetHome = radians(latTargetHome)
    lonTargetHome = radians(lonTargetHome)
    dlon = train_cp.lon - lonTargetHome
    dlat = train_cp.lat - latTargetHome
    a = dlat.apply(lambda i: sin(i / 2)**2) + cos(latTargetHome) * train_cp.lat.apply(cos) * dlon.apply(lambda i: sin(i / 2)**2)
    c = a.apply(lambda i: 2 * atan2(sqrt(i), sqrt(1 - i)))
    train_cp['distance'] = R * c
    # build a comparability index by adding points for shorter distance from the home being valued
    # and for identical numbers of bedrooms and bathrooms
    train_cp['ind'] = 0
    train_cp.loc[train_cp.distance <= 1, 'ind'] += 1
    train_cp.loc[train_cp.distance <= 2, 'ind'] += 1
    train_cp.loc[train_cp.bedrooms == bedTargetHome, 'ind'] += 1
    train_cp.loc[train_cp.bathrooms == bathTargetHome, 'ind'] += 1
    # invert the index so that the most comparable homes sort first, then break ties by distance
    train_cp.ind = train_cp.ind.max() - train_cp.ind
    train_cp = train_cp.sort_values(by=['ind', 'distance'], ascending=True)
    train_cp = train_cp.iloc[:4, :]
    train_cp = train_cp[train_cp.id != idTargetHome]  # exclude the home being valued from the set of comparables
    zestimateCompsValue = train_cp.estimated_value_sqft.mean() * sqftTargetHome
    return zestimateCompsValue

X['zestCompVal'] = np.nan
X.zestCompVal = X.apply(lambda x: zestimate_comps(x.id, x.squareFootage, x.longitude, x.latitude, x.bedrooms, x.bathrooms), axis=1)
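Here X stands for the feature matrix of a given data split. Because the comparables are always drawn from the training set, the same call can be applied to both splits; a sketch (the names X_train and X_test are illustrative):

# the comparables always come from `train`, so the feature is computed identically for both splits
for X in (X_train, X_test):
    X['zestCompVal'] = X.apply(lambda x: zestimate_comps(x.id, x.squareFootage, x.longitude,
                                                         x.latitude, x.bedrooms, x.bathrooms), axis=1)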
I also created a similar valuation metric that uses lastSaleAmount (already adjusted for appreciation using the S&P/Case-Shiller index, as shown above) instead of estimated_value:
def sold_comps(idTargetHome, sqftTargetHome, lonTargetHome, latTargetHome, bedTargetHome, bathTargetHome):
    # select comparables and compute an estimate of home value based on the lastSaleAmount of the comparables
    train_cp = train[train.id != idTargetHome].copy()  # exclude the home being valued from the set of potential comparables
    # haversine distance (in km) between the target home and every home in the training set
    R = 6373.0  # approximate radius of the earth in km
    latTargetHome = radians(latTargetHome)
    lonTargetHome = radians(lonTargetHome)
    dlon = train_cp.lon - lonTargetHome
    dlat = train_cp.lat - latTargetHome
    a = dlat.apply(lambda i: sin(i / 2)**2) + cos(latTargetHome) * train_cp.lat.apply(cos) * dlon.apply(lambda i: sin(i / 2)**2)
    c = a.apply(lambda i: 2 * atan2(sqrt(i), sqrt(1 - i)))
    train_cp['distance'] = R * c
    # build a comparability index, favoring comps sold within the last 3.5 years:
    # add points for shorter distance from the home being valued, identical numbers of
    # bedrooms and bathrooms, and shorter time elapsed since lastSaleDate
    train_cp['ind'] = 0
    train_cp.loc[(train_cp.distance <= 1) & (train_cp.lastSaleDate.dt.year > 2014), 'ind'] += 1
    train_cp.loc[(train_cp.distance <= 2) & (train_cp.lastSaleDate.dt.year > 2014), 'ind'] += 1
    train_cp.loc[(train_cp.bedrooms == bedTargetHome) & (train_cp.lastSaleDate.dt.year > 2014), 'ind'] += 1
    train_cp.loc[(train_cp.bathrooms == bathTargetHome) & (train_cp.lastSaleDate.dt.year > 2014), 'ind'] += 1
    train_cp.loc[train_cp.lastSaleDate.dt.year > 2015, 'ind'] += 1
    train_cp.loc[train_cp.lastSaleDate.dt.year > 2016, 'ind'] += 1
    # invert the index so that the most comparable homes sort first, then break ties by distance
    train_cp.ind = train_cp.ind.max() - train_cp.ind
    train_cp = train_cp.sort_values(by=['ind', 'distance'], ascending=True)
    train_cp = train_cp.iloc[:4, :]
    # use the comps' (appreciation-adjusted) last sale amount per square foot, not the full training set's
    soldCompsValue = train_cp.lastSaleAmount_sqft.mean() * sqftTargetHome
    return soldCompsValue

X['soldCompVal'] = np.nan
X.soldCompVal = X.apply(lambda x: sold_comps(x.id, x.squareFootage, x.longitude, x.latitude, x.bedrooms, x.bathrooms), axis=1)
Third, I construct the annualized home price appreciation between priorSaleDate and lastSaleDate. This feature has the potential to capture the variation in home price appreciation across homes.
However, this measure is noisy: for example, purchasing a fixer-upper, refurbishing it, and selling it one year later at a substantially higher price results in an appreciation rate that is not representative of the home's appreciation over its lifetime.
The code that implements the construction of home price appreciation is shown below:
# calculate values per sq. foot
data['estimated_value_sqft'] = data.estimated_value / data.squareFootage
data['lastSaleAmount_sqft'] = data.lastSaleAmount / data.squareFootage
data['priorSaleAmount_sqft'] = data.priorSaleAmount / data.squareFootage
data['yearsBetweenSales'] = (data.lastSaleDate - data.priorSaleDate).dt.days/365
data['annAppreciation'] = (data.lastSaleAmount_sqft / data.priorSaleAmount_sqft).pow(1/data.yearsBetweenSales) - 1
# cap the annualized appreciation rate: sales occurring within a few months of each other produce extreme annualized rates, and it is unlikely that home value rises or falls at an annualized rate of more than 50 percent
data['annAppreciation'] = data.annAppreciation.clip(lower=-0.5, upper=0.5)
Fourth, in order to account for the explosive growth in housing prices in Denver after 2012 (a pattern observed in other cities as well), I added a dummy variable that takes the value of one if lastSaleDate is after 2012, and zero otherwise. I also added such a dummy for priorSaleDate and the interactions of these dummies with lastSaleAmount and priorSaleAmount, respectively:
data['Dummy2012ForLastSaleAmount'] = (data.lastSaleDate.dt.year > 2012).astype(int)
data['lastSaleAmountAfter2012'] = data.lastSaleAmount * data.Dummy2012ForLastSaleAmount
data['Dummy2012ForPriorSaleAmount'] = (data.priorSaleDate.dt.year > 2012).astype(int)
data['priorSaleAmountAfter2012'] = data.priorSaleAmount * data.Dummy2012ForPriorSaleAmount
Fifth, I create a dummy for homes that were rebuilt after the last sale. These are homes with yearBuilt greater than the year of lastSaleDate. Identifying such homes is important because their estimated_value could be considerably higher than the last sale price:
# count homes with yearBuilt after the year of the last sale (i.e., rebuilt after the sale)
print(data[data.yearBuilt > data.lastSaleDate.dt.year].shape)
data['rebuiltDummy'] = (data.yearBuilt > data.lastSaleDate.dt.year).astype(int)
Sixth, there is a considerable variation in home prices per square foot across zipcodes:
priorSaleAmount_sqft lastSaleAmount_sqft estimated_value_sqft
zipcode
80123 325.989608 291.466454 259.225760
80203 362.742573 381.297086 395.870911
80204 247.842270 279.329178 361.540286
80205 265.983143 311.524722 394.511851
80206 430.499172 452.784866 481.895697
80207 324.652388 359.447086 414.608342
80209 479.952938 495.706467 517.771521
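The table above is a grouped average of the per-square-foot columns constructed earlier in this step:

# average prior sale, last sale, and estimated values per square foot, by zipcode
print(data.groupby('zipcode')[['priorSaleAmount_sqft', 'lastSaleAmount_sqft', 'estimated_value_sqft']].mean())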
Therefore, I added a dummy variable for each zipcode:
zipcode = pd.get_dummies(data.zipcode)
# keep only the zipcodes with a substantial number of home sales (zipcodes with very few sales are dropped)
z1 = ['80203', '80204', '80205', '80206', '80207', '80209', '80123']
z2 = zipcode.columns
z_intersection = [value for value in z1 if value in z2]
zipcode = zipcode[z_intersection]
data = pd.concat([data, zipcode], axis = 1)
Finally, I remove the features that are not going to be used as explanatory variables in fitting the model:
data.drop(['address', 'city', 'state', 'lastSaleAmount_sqft', 'priorSaleAmount_sqft', 'estimated_value_sqft'], axis = 1, inplace = True)
Next step: Analysis.