Analysis
Overview
This section describes a model for predicting home values in Denver, CO. The final model, a random forest fitted to the training data and evaluated on held-out test data, achieved a test-set R-squared of 0.91.
Cleaning and processing of the data are described in the previous section, and the results are reported in the next section. The analysis for this project was performed in Python.
Methodology for Assessing the Data
To predict home values, I fitted a linear regression, a lasso, and a random forest. The lasso was used to identify the most relevant features: with a sufficiently large penalty, it shrinks the coefficients of less relevant features exactly to zero. The features with non-zero lasso coefficients were then used to fit the random forest.
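To illustrate the selection step in isolation, here is a minimal sketch on synthetic data (all names and values are illustrative, not part of the actual analysis): only the first two features drive the outcome, and the lasso zeroes out the rest.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                              # five candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)  # only two of them matter

ls = Lasso(alpha=0.5).fit(StandardScaler().fit_transform(X), y)
selected = np.flatnonzero(ls.coef_)  # indices of features with non-zero coefficients
print(selected)                      # typically [0 1] for this setup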
The table below provides a preliminary look at which features are likely to be important predictors of home value, showing the correlation of each feature with estimated_value, lastSaleAmount, and priorSaleAmount:
feature                        estimated_value  lastSaleAmount  priorSaleAmount
estimated_value                           1.00            0.79             0.62
lastSaleAmount                            0.79            1.00             0.77
priorSaleAmount                           0.62            0.77             1.00
latitude                                 -0.27           -0.25            -0.23
longitude                                 0.12            0.10             0.08
bedrooms                                  0.37            0.28             0.24
bathrooms                                 0.72            0.57             0.45
rooms                                     0.58            0.46             0.39
squareFootage                             0.82            0.65             0.54
lotSize                                   0.46            0.39             0.34
yearBuilt                                 0.17            0.14             0.11
priorSaleDummy                            0.03            0.04            -0.01
rebuiltDummy                              0.18            0.06             0.06
yearsBetweenSales                         0.04            0.03             0.01
annAppreciation                          -0.03            0.01            -0.17
Dummy2012ForLastSaleAmount                0.01            0.03            -0.02
lastSaleAmountAfter2012                   0.38            0.47             0.33
Dummy2012ForPriorSaleAmount              -0.02            0.01            -0.03
priorSaleAmountAfter2012                  0.16            0.36             0.29
80203                                     0.01            0.02             0.02
80204                                    -0.23           -0.20            -0.16
80205                                    -0.15           -0.15            -0.14
80206                                     0.32            0.27             0.23
80207                                    -0.12           -0.09            -0.08
80209                                     0.27            0.24             0.21
80123                                    -0.04           -0.01             0.01
(The rows labeled 80203 through 80123 are zip-code dummy variables.)
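For reference, a table like this can be produced directly with pandas; a minimal sketch, assuming the cleaned dataset from the previous section is available as a CSV file (the file name here is hypothetical):

import pandas as pd

data = pd.read_csv('denver_data_clean.csv')  # hypothetical file name

# correlation of every numeric feature with the three value columns,
# rounded to two decimals as in the table above
corr = data.corr(numeric_only=True)[['estimated_value', 'lastSaleAmount', 'priorSaleAmount']]
print(corr.round(2))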
As a first step, I split the dataset into training and test sets. Each model was fitted using the training set and evaluated using the test set. The implementation is shown below:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from comps_features import comps_features
def train_test_evaluation(data, use_comps, non_linear, rseed, n_est=10, max_d=None,
                          min_samp_split=2, min_samp_leaf=1, max_f=1.0):
    y = data.estimated_value
    X = data  # keep the target (estimated_value) in the data for now and drop it later

    # split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rseed)
    y_train = X_train.estimated_value
    y_test = X_test.estimated_value

    # add home valuation features based on comparables
    if use_comps == 1:
        # create a dataset from which comparables for each home will be selected
        train = X_train[['id', 'latitude', 'longitude', 'estimated_value',
                         'lastSaleAmount', 'lastSaleDate', 'squareFootage',
                         'bedrooms', 'bathrooms']]
        X_train = comps_features(X_train, train)
        X_test = comps_features(X_test, train)

    # remove features that should not be used as independent variables
    drop_cols = ['estimated_value', 'id', 'lastSaleDate', 'priorSaleDate', 'zipcode']
    X_train = X_train.drop(drop_cols, axis=1)
    X_test = X_test.drop(drop_cols, axis=1)

    X_columns = X_test.columns

    # add non-linearities (squares and pairwise interactions)
    if non_linear == 1:
        poly = PolynomialFeatures(degree=2)
        X_train = poly.fit_transform(X_train)
        X_test = poly.transform(X_test)  # transform only: fit on training data alone
    # linear regression
    lm = LinearRegression().fit(X_train, y_train)
    R2_lm_train = r2_score(y_train, lm.predict(X_train))
    R2_lm_test = r2_score(y_test, lm.predict(X_test))

    # lasso, fitted on standardized features so the penalty treats them comparably
    scaler = StandardScaler().fit(X_train)
    X_train_st = scaler.transform(X_train)
    X_test_st = scaler.transform(X_test)
    ls = Lasso(alpha=100, max_iter=10000, random_state=rseed).fit(X_train_st, y_train)
    R2_ls_train = r2_score(y_train, ls.predict(X_train_st))
    R2_ls_test = r2_score(y_test, ls.predict(X_test_st))

    # map the lasso coefficients to feature names
    if non_linear == 1:
        # for a model with polynomial features
        names = poly.get_feature_names_out(X_columns)
    else:
        # for a model without polynomial features
        names = X_columns
    out = pd.DataFrame({'variables': names, 'coefficients': ls.coef_})

    # keep features with non-zero coefficients only
    relFeatures = out[out.coefficients.abs() > 0.0]

    # random forest, fitted on the lasso-selected features
    X_train_rf = pd.DataFrame(X_train, columns=out.variables.tolist())
    X_test_rf = pd.DataFrame(X_test, columns=out.variables.tolist())
    X_train_rf = X_train_rf[relFeatures.variables.tolist()]
    X_test_rf = X_test_rf[relFeatures.variables.tolist()]
    rf = RandomForestRegressor(n_estimators=n_est, max_depth=max_d,
                               min_samples_split=min_samp_split,
                               min_samples_leaf=min_samp_leaf,
                               max_features=max_f,
                               random_state=rseed).fit(X_train_rf, y_train)
    R2_rf_train = r2_score(y_train, rf.predict(X_train_rf))
    R2_rf_test = r2_score(y_test, rf.predict(X_test_rf))

    # feature importances of the random forest, largest first
    featureImportances = (pd.DataFrame({'variables': X_test_rf.columns,
                                        'featureImportances': rf.feature_importances_})
                          .sort_values('featureImportances', ascending=False)
                          .reset_index(drop=True))

    return (R2_lm_train, R2_lm_test, R2_ls_train, R2_ls_test,
            R2_rf_train, R2_rf_test, relFeatures, featureImportances)
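A hypothetical invocation of this function, assuming the cleaned dataset is loaded as data (the hyperparameter values below are illustrative, not necessarily those used for the reported results):

(R2_lm_train, R2_lm_test, R2_ls_train, R2_ls_test,
 R2_rf_train, R2_rf_test, relFeatures, featureImportances) = train_test_evaluation(
    data, use_comps=1, non_linear=0, rseed=0,
    n_est=100, max_d=None, min_samp_split=2, min_samp_leaf=1, max_f=1.0)
print('random forest test R-squared:', round(R2_rf_test, 2))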
The results of these models are presented in the next section.