Linear and Random Forest Regression Analysis on House Prices
Trouble finding affordable housing in your city? Setting aside the recent surge in inflation, house prices typically depend on the features attached to each house. This project focuses on training, testing, and comparing a linear regression and a random forest regressor on the well-known Ames Housing and Real Estate data, a data set often used to challenge data scientists to train regression models. The dataset I used came from a New York City Data Science Academy (NYCDSA) machine learning project folder. It draws on the Ames Housing Data for the main features of interest and on the Ames Real Estate Data for MapRefNo, which is used later to get coordinates and calculate distances from Iowa State University (ISU). Information on what each feature represents can be found here on Kaggle.
I. Data Cleaning Pipeline
First, I focus on the Ames Housing Data and convert MSSubClass and MoSold to strings. Afterwards, I fill all nan values in string columns with the string 'nan'. Once that is done, I make dummy columns for all string columns, dropping the first dummy column of each set. I make a dictionary called relatedDummiesDictionary that stores all dummy columns as keys and string lists as their values. Each of these string lists contains the column name and the column names of its sibling dummy columns. I then save relatedDummiesDictionary as relatedDummiesDictionary.json. Now that the data set has almost all the columns I need, I can perform the train-test split and save the train and test data as .csv files.
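The dummy-column step above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the function name make_dummies_with_siblings and the tiny example frame are hypothetical, and it assumes every non-numeric column should be treated as a string column.

```python
import pandas as pd

def make_dummies_with_siblings(df):
    """One-hot encode string columns and record each dummy's sibling group."""
    df = df.copy()
    # Treat these numeric codes as categories, as described in the text.
    for col in ("MSSubClass", "MoSold"):
        if col in df.columns:
            df[col] = df[col].astype(str)
    # Assumption: any non-numeric column counts as a string column.
    string_cols = [c for c in df.columns
                   if not pd.api.types.is_numeric_dtype(df[c])]
    related = {}
    pieces = [df.drop(columns=string_cols)]
    for col in string_cols:
        filled = df[col].fillna("nan")  # missing values become the string 'nan'
        dummies = pd.get_dummies(filled, prefix=col, drop_first=True)
        names = dummies.columns.tolist()
        for name in names:
            # Each dummy maps to the list holding itself and its siblings.
            related[name] = names
        pieces.append(dummies)
    return pd.concat(pieces, axis=1), related

# Hypothetical three-house example.
houses = pd.DataFrame({
    "MSSubClass": [20, 60, 20],
    "MoSold": [5, 6, 5],
    "GarageType": ["Attchd", None, "Detchd"],
    "GrLivArea": [1200, 1800, 950],
})
encoded, relatedDummiesDictionary = make_dummies_with_siblings(houses)
```

The resulting dictionary contains only strings and lists, so it can be written to disk with json.dump.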
Now it is time to get distances between each house and Iowa State University with the method util.returnDFWithISUDistance. In that method, I take the PID column of the Ames Housing Data and turn it into a data frame. I then merge this new data frame with the MapRefNo and Prop_Addr columns of the Ames Real Estate Data, joining Housing PID to Real Estate MapRefNo. Then, using geopy's GoogleV3 geocoder, I get the coordinates of ISU ('Iowa State University, Ames, USA') and each house address (Prop_Addr + ', Ames,IA, USA'). With those coordinates, I calculate the distance in miles between ISU and each house via geopy.distance.great_circle. These distances are saved as the column ISUDistance in the merged data frame. Lastly, I attach ISUDistance back to the original data frame, joining on PID and MapRefNo.
After acquiring distances, I can drop all string columns. I gather the modes of OverallQual and OverallCond and the means of YearBuilt, YearRemodAdd, GarageYrBlt, YrSold, and ISUDistance. The collected mode and mean values become the nan replacement values for the columns they represent. All other nan values become zeros. These nan replacement values are saved as a dictionary and a json file to be re-used by the method util.replaceNansWithTrainingDataValues during nan replacement for both training and testing datasets.
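To show what the great-circle distance computes, here is a minimal pure-Python haversine sketch. It is a stand-in for geopy.distance.great_circle, not a copy of it: the Earth radius constant is an approximation, and the coordinates in the example are rough placeholders rather than geocoder output.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumption)

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Placeholder coordinates for illustration only.
isu = (42.03, -93.65)
house = (42.05, -93.62)
isu_distance = great_circle_miles(*isu, *house)
```

In the real pipeline the coordinates would come from the geocoder, and the result would be stored in the ISUDistance column.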
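A minimal sketch of the nan replacement idea, assuming pandas. The helper names learn_nan_replacements and replace_nans are hypothetical and are not the project's util methods. The key point is that the replacement values are learned from the training data only and then applied unchanged to both sets.

```python
import pandas as pd

MODE_COLS = ["OverallQual", "OverallCond"]
MEAN_COLS = ["YearBuilt", "YearRemodAdd", "GarageYrBlt", "YrSold", "ISUDistance"]

def learn_nan_replacements(train):
    """Build a {column: fill value} dictionary from training data only."""
    values = {}
    for col in train.columns:
        if col in MODE_COLS:
            values[col] = float(train[col].mode().iloc[0])
        elif col in MEAN_COLS:
            values[col] = float(train[col].mean())
        else:
            values[col] = 0.0  # every other column falls back to zero
    return values

def replace_nans(df, values):
    """Apply the stored training values to any data set."""
    return df.fillna(value=values)

# Hypothetical miniature train and test frames.
nan = float("nan")
train = pd.DataFrame({
    "OverallQual": [5, 5, 7],
    "GarageYrBlt": [1990, nan, 2000],
    "LotFrontage": [60, nan, 80],
})
test = pd.DataFrame({
    "OverallQual": [nan],
    "GarageYrBlt": [nan],
    "LotFrontage": [nan],
})
nan_values = learn_nan_replacements(train)
test_filled = replace_nans(test, nan_values)
```

Because the dictionary holds plain floats, it can be saved as json and reloaded at test time.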
Here lies the fork in the road between Linear Regression Preparation (Methods II) and Random Forest Regression Preparation (Methods III).
II. Linear Regression Preparation
II.a) Elastic Net
Adding all columns to a linear regression is not optimal due to issues such as heteroskedasticity and collinearity. To ensure I am picking the best columns, I started with an elastic net with an alpha penalty of 1000 and an l1-ratio of 0.5, which is half Lasso and half Ridge. Originally, I wanted to use GridSearchCV to find the best alpha and l1-ratio parameters, but that led to an alpha of 1 and an l1 ratio of 0. As that is essentially a ridge regression with a very light penalty, it would make this step useless. I wanted to keep this step in to ensure model accuracy, so I imposed a heavy alpha penalty of 1000. As for the 0.5 l1 ratio, I wanted the best of both worlds between Lasso and Ridge. Pure lasso eliminates unnecessary columns by assigning them zero coefficients, while ridge keeps small non-zero coefficients on nonessential columns. Whatever survives the elastic net will be used for a pure, penalty-free linear regression. The final model is a pure linear regression because it is simpler to relate coefficient values to dollar values without having to explain error penalties.
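As a rough illustration of how a heavily penalized elastic net screens columns, the sketch below fits scikit-learn's ElasticNet with alpha = 1000 and l1_ratio = 0.5 on synthetic data. The data, the column named noise, and the price relation are invented for the example and have nothing to do with the Ames results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n = 200
# Synthetic data: one informative column and one pure-noise column.
X = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 500, n),
    "noise": rng.normal(0, 1, n),
})
y = 100 * X["GrLivArea"] + rng.normal(0, 500, n)

# Same penalty settings as in the text.
model = ElasticNet(alpha=1000, l1_ratio=0.5, max_iter=10000)
model.fit(X, y)

# Columns with a non-zero coefficient "survive" the elastic net.
survivors = [col for col, coef in zip(X.columns, model.coef_) if coef != 0]
```

Only the surviving columns would then be passed on to the penalty-free linear regression.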
II.a.i) Making Gross Living Area the base of the model
The base columns of this model start with Gross Living Area, the total finished living area above ground. The NYCDSA proposal folder suggested looking at Gross Living Area because it was the feature with the highest f_regression score in relation to price. The same folder also suggested that, within each neighborhood, homes above and below the 80th percentile in Gross Living Area are priced at different rates. It describes this difference as a discount for large houses in the neighborhood. Following the suggestion, I group the data frame by neighborhood and find the 80th percentile of each neighborhood. Afterward, I create two new columns, smallGrLivArea and largeGrLivArea. For each house, if the GrLivArea is smaller than the 80th percentile in its neighborhood, its smallGrLivArea is GrLivArea and its largeGrLivArea is zero. In contrast, houses above the 80th percentile of their neighborhood have a smallGrLivArea of zero and a largeGrLivArea of GrLivArea. Once both columns are established, I drop GrLivArea from the data frame.
The proposal document also states that both of these new columns follow power laws, which can be found from the slope of log-log plots. I find the power law relation for each column and create two new columns whose names are relabeled with the power laws. After the relabeling, I drop the old columns. These two columns form the base for the rest of the linear portion of the project.
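The neighborhood split can be sketched with a pandas groupby. This is a minimal sketch with a hypothetical function name and made-up areas. It assumes houses exactly at the percentile count as large, and in practice the cutoffs would be computed from the training data.

```python
import pandas as pd

def split_living_area(df, q=0.80):
    """Split GrLivArea into small/large columns per neighborhood percentile."""
    df = df.copy()
    # 80th percentile of GrLivArea within each house's own neighborhood.
    cutoff = df.groupby("Neighborhood")["GrLivArea"].transform(
        lambda s: s.quantile(q))
    is_large = df["GrLivArea"] >= cutoff  # assumption: ties count as large
    df["smallGrLivArea"] = df["GrLivArea"].where(~is_large, 0)
    df["largeGrLivArea"] = df["GrLivArea"].where(is_large, 0)
    return df.drop(columns="GrLivArea")

# Made-up example with two neighborhoods.
homes = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "A", "B", "B"],
    "GrLivArea": [1000, 1200, 1400, 1600, 3000, 800, 900],
})
split = split_living_area(homes)
```

In neighborhood A the cutoff is 1880 square feet, so only the 3000 square foot house lands in largeGrLivArea.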
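One common way to read a power law off a log-log plot is to fit a straight line to log(price) against log(area): the slope is the exponent. The sketch below assumes that approach. The function name fit_power_law is hypothetical and the numbers are synthetic, not the Ames estimates.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y ~ a * x**b by least squares on log(x) versus log(y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Zeros (houses that belong to the other size group) cannot be logged.
    mask = (x > 0) & (y > 0)
    b, log_a = np.polyfit(np.log(x[mask]), np.log(y[mask]), 1)
    return float(np.exp(log_a)), float(b)

# Synthetic check: data generated from a known power law.
area = np.array([0.0, 1000.0, 1500.0, 2000.0, 2500.0])
price = 40.0 * area ** 1.08
a, b = fit_power_law(area, price)
```

With the exponent in hand, the transformed column is simply area ** b, and the exponent can be written into the column name for later interpretation.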
II.a.ii) Storing significant correlations
I calculate Spearman correlations between every pair of variables. If the p value is greater than 0.05, the correlation is insignificant and discarded. The remaining correlations are ranked by absolute correlation value. I experiment and pick a percentile cutoff; for this study I chose the 97.5th percentile. The pairs with absolute correlation values larger than the 97.5th percentile are stored in significantCorrelationsDictionary and significantCorrelationsDictionary.json. Also, if one member of the pair is a dummy variable with siblings defined in relatedDummiesDictionary (described above), the pairings of those sibling dummies with the partnered variable are also recorded in the significantCorrelationsDictionary. For example, let's say FeatureA has a significant correlation with FeatureB_4, which has sibling dummies FeatureB_2, FeatureB_3, and FeatureB_5. We would then put the pairings FeatureA-FeatureB_2, FeatureA-FeatureB_3, FeatureA-FeatureB_4, and FeatureA-FeatureB_5 in the significantCorrelationsDictionary.
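The sibling expansion in the FeatureA and FeatureB_4 example can be sketched in a few lines. The function name and the use of frozensets for unordered pairs are illustrative choices, not the project's actual storage format.

```python
def expand_with_siblings(pairs, related_dummies):
    """Extend each significant pair to cover all sibling dummy columns."""
    expanded = set()
    for a, b in pairs:
        # A non-dummy feature has no entry, so it stands in for itself.
        for a_name in related_dummies.get(a, [a]):
            for b_name in related_dummies.get(b, [b]):
                expanded.add(frozenset((a_name, b_name)))
    return expanded

# The example from the text: FeatureB_4 has three sibling dummies.
siblings = ["FeatureB_2", "FeatureB_3", "FeatureB_4", "FeatureB_5"]
related = {name: siblings for name in siblings}
significant = expand_with_siblings([("FeatureA", "FeatureB_4")], related)
```

Here significant holds the four FeatureA pairings listed above, one for each FeatureB dummy.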
II.a.iii) Applying the model lift algorithm
Seeing that f_regression was the main reason why GrLivArea was suggested as the first feature to consider, I sorted the remaining features by f_regression score from highest to lowest. The features created from smallGrLivArea and largeGrLivArea served as the base model. I then looped through the sorted f_regression list with the following algorithm, which can be broken into two parts.
II.a.iii.A) Create a new model by adding the next variable(s) in the sorted f_regression list to the base model.
I first check whether the selected feature correlates with any feature in the base model. If the correlation in question, or a correlation involving one of its sibling dummies, is in the significantCorrelationsDictionary, I skip the feature and move on to the next one. After passing the correlation checks, I either add the feature(s) directly or transform the feature first, and the result goes into a new model composed of the base model and the new feature(s) in question. If it is a dummy variable, I add the dummy feature and its sibling dummies directly to the new model and move to Methods II.a.iii.B. If the feature has fewer than 15 unique values, I use util.engineerSmallFeature, which fits 3 simple linear regressions between price and three versions of the feature: price vs feature, price vs feature^power, and price vs boxcox(feature). The best-scoring of these 3 small regressions gets added to the new model. If the power or the boxcox version is chosen, I relabel the column name with the parameters so they can be used for descriptive analysis later.
All other continuous variables will go through a more rigorous process with util.IsHomoskedastic and util.engineerFeature. The method util.IsHomoskedastic fits a simple linear regression with the x and y values it's given, makes predictions on the x values, and sends the residuals to two methods, util.IsCentered and util.retrieveSpreadValue. util.IsCentered splits the residuals via three even bins along the predicted axis. If zero falls within the 95% confidence interval of the residuals of all 3 sections, then it passes the centering test. util.retrieveSpreadValue also sections the residual data the same way in the 3 bins as util.IsCentered, but it compares the largest residual variance and the smallest residual variance of the three sections with a spread value (largest variance divided by smallest variance).
If the spread value is less than 1.5, the feature passes the spread test. If the feature passes both the centering and spread tests in util.IsHomoskedastic, it is added directly to the new model. If it fails either test, it is analyzed via util.engineerFeature, which tries to fit a simple linear regression to price vs feature^power. The modified power feature is then retested with util.IsHomoskedastic. If it passes, the new power relation is passed to section II.a.iii.B. If not, the feature is boxcoxed and passed to II.a.iii.B.
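The centering and spread tests can be sketched with NumPy as below. This is one possible reading of the description, not the project's util code: it assumes the three bins hold equal numbers of points, and that the 95% confidence interval is a normal-approximation interval around each bin's mean residual. All function names and the example data are hypothetical.

```python
import numpy as np

def _three_bins(predicted, residuals):
    """Sort residuals by predicted value and split them into three bins."""
    order = np.argsort(predicted)
    return np.array_split(np.asarray(residuals, dtype=float)[order], 3)

def is_centered(predicted, residuals, z=1.96):
    """Pass if zero lies inside the 95% interval of every bin's mean residual."""
    for chunk in _three_bins(predicted, residuals):
        half_width = z * chunk.std(ddof=1) / np.sqrt(len(chunk))
        if abs(chunk.mean()) > half_width:
            return False
    return True

def spread_value(predicted, residuals):
    """Largest bin variance divided by smallest bin variance."""
    variances = [c.var(ddof=1) for c in _three_bins(predicted, residuals)]
    return max(variances) / min(variances)

def is_homoskedastic(x, y, max_spread=1.5):
    """Fit y on x with a straight line, then run both residual tests."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    predicted = slope * x + intercept
    residuals = y - predicted
    return bool(is_centered(predicted, residuals)
                and spread_value(predicted, residuals) < max_spread)

# Deterministic toy data: alternating-sign noise around the line y = 3x.
x = np.arange(1, 301, dtype=float)
signs = np.where(np.arange(300) % 2 == 0, 1.0, -1.0)
y_even = 3 * x + 5 * signs   # constant noise size
y_fan = 3 * x + signs * x    # noise grows with x (fan shape)
even_ok = is_homoskedastic(x, y_even)
fan_ok = is_homoskedastic(x, y_fan)
```

The constant-noise series passes both tests, while the fan-shaped series fails the spread test because the variance in the top bin is many times the variance in the bottom bin.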
II.a.iii.B) Cross validate new model and determine whether to keep or discard the new feature(s)
Now it is time to cross validate the model with 5-fold validation. If the test_score of the new model is better than the base, the new model becomes the new base. If not, the new feature(s) are discarded and the algorithm restarts with the next feature(s) in the list.
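A minimal sketch of the keep-or-discard loop, assuming scikit-learn's cross_validate with its default R^2 scoring for LinearRegression. The function names, the synthetic data, and the use of column indices are illustrative, and the correlation checks and feature transformations from II.a.iii.A are left out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

def mean_test_score(X, y, columns, cv=5):
    """Mean 5-fold test_score of a linear regression on the given columns."""
    scores = cross_validate(LinearRegression(), X[:, columns], y, cv=cv)
    return scores["test_score"].mean()

def add_features_with_lift(X, y, base, candidates, cv=5):
    """Keep a candidate only if it improves the cross-validated score."""
    base = list(base)
    best = mean_test_score(X, y, base, cv)
    for col in candidates:
        score = mean_test_score(X, y, base + [col], cv)
        if score > best:          # model lift: new model becomes the base
            base, best = base + [col], score
        # otherwise the candidate is discarded and the loop moves on
    return base, best

# Synthetic data: columns 0 and 1 matter, column 2 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.1, 100)
selected, final_score = add_features_with_lift(X, y, base=[0], candidates=[1, 2])
```

Candidates are visited in the order given, which in the project would be the sorted f_regression list.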
II.b) Regular linear regression
After the algorithm has been completed, I trained a regular linear regression model with the final collection of columns chosen in II.a and used the trained model for testing and feature description analysis.
III. Random Forest Regression Preparation
Originally, I was using GridSearchCV to optimize the random forest parameters; however, I got impatient and went with the following methodology instead. First, I looped over the number of estimators, n_estimators, with values of 100, 500, and 1000. Whichever value got the best test_score from cross validation, I kept as the n_estimators value. This value happened to be 1000.
Keeping that n_estimators value, I followed the same methodology, now looping through the maximum tree depth, max_depth, with values of 10, 20, and None. When this value is set to None, the trees in the regressor keep splitting until every leaf is pure or too small to split further. Providing None as an option for max_depth is one reason why GridSearchCV was taking very long.
The best max depth was 20. With n_estimators = 1000 and max_depth = 20, I did one final loop, now over max_features, the number of features considered at each split, with values of 'sqrt', 'log2', and None. At each split, 'sqrt' looks at sqrt(number of columns) features, 'log2' examines log2(number of columns) features, and None reviews every column. Providing the None option to max_features also bogged down GridSearchCV's speed. The final random forest used for the model had n_estimators = 1000, max_depth = 20, and max_features = None.
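The one-parameter-at-a-time search can be sketched as below, assuming scikit-learn. The function name tune_sequentially is hypothetical, and the demo uses a small synthetic data set with much smaller n_estimators values than the 100, 500, and 1000 used in the project so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

def tune_sequentially(X, y, search_order, cv=5, random_state=0):
    """Tune one parameter at a time, freezing each winner before the next."""
    chosen = {}
    for name, options in search_order:
        scores = []
        for option in options:
            params = {**chosen, name: option}
            model = RandomForestRegressor(random_state=random_state, **params)
            result = cross_validate(model, X, y, cv=cv)
            scores.append(result["test_score"].mean())
        # Keep the option with the best mean cross-validated test_score.
        chosen[name] = options[scores.index(max(scores))]
    return chosen

# Small demo grid (assumption: shrunk for speed, same parameter order).
demo_order = [
    ("n_estimators", [10, 30]),
    ("max_depth", [10, 20, None]),
    ("max_features", ["sqrt", "log2", None]),
]
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=10.0,
                                 random_state=0)
best_params = tune_sequentially(X_demo, y_demo, demo_order)
```

This fits far fewer models than a full grid (8 settings here instead of 18), at the cost of possibly missing combinations that only work well together.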
I. Random Forest Regression Feature Importances
Above are the top 10 features of importance used by the Random Forest to determine house prices. In order, these features represent overall quality, gross living area, 1st floor square footage, total basement square footage, garage area, finished basement type 1 square footage, lot area, year built, 2nd floor square footage, and garage size in car capacity. Unfortunately, many of these features could not be found in the linear regression due to the rigorous significant correlation analysis. However, their correlated partners did survive. For Results II, I focus on the top 5 deciding continuous variables (a-c), the top 2 categorical variables (d-e), and the surviving correlating variables that are related to the random forest top 10 features of importance (f-k).
II. Feature Analysis and Correlations
II.a) Gross Living Area
Houses with a Gross Living Area smaller than the 80th percentile size of their neighborhood had a Gross Living Area price contribution of 70.17 $/ft^2 x GrLivArea.
Homes with Gross Living Area at or above the 80th percentile of their neighborhood had a Gross Living Area contribution of 39.55 $/(ft^2.16) x GrLivArea^1.08.
II.b) Basement Full Bathrooms & Basement Half Bathrooms
The basement half bath contribution was $6136 x (number of basement half baths)^0.7.
For basement full bathrooms, the price increased by $15561 for each basement full bath.
II.c) Distance from Iowa State University
Ames house prices increase by $800.13 per mile in distance away from Iowa State University.
II.d) Exterior Quality
The model defaulted the exterior quality of the house to excellent. Average/Typical exterior quality dropped the price by $77k, fair exterior quality dropped the price by $70k, and good exterior quality dropped the price by $64k. There were no houses with poor exterior quality in the dataset.
II.e) Home Functionality
The home functionality contribution used Major Deductions 1 as its default. Homes with Major Deductions 2 dropped in price by $19k, and homes with Salvage only functionality dropped in price by $40k. Minor Deductions 2 cases had a price increase of $13k, Minor Deductions 1 cases increased by $14k, and Moderate Deductions cases increased by $15k.
II.f) Garage Type (correlates with 1st floor sqft and year built)
More than one type of garage was the default selected for Garage Type. Garages attached to the home increased the price by $12k. Garages detached from the home gained $9k. Homes with no garage gained $14k. Houses with a basement garage dropped $4k. Carport garages increased the price by $2k. Built-in garages dropped the default price by $75.
II.g) MSZoning Code (correlates with lot area)
Agricultural zoning is the default price value for this model. Building in the Floating Village Residential area gives a $157.87 boost. Houses in commercial areas have a $6k boost. Houses in industrial zones gained an additional $10k on top of the default price. Residential low, medium, and high density locations increased the price value by $30k, $19k, and $30k respectively.
II.h) Enclosed Porch (correlates with year built)
The house price value goes up by $2.53 for every square foot of the enclosed porch.
II.i) Overall Condition (correlates with year built)
For each rating increase of overall condition, the price increased by $4k.
II.j) Paved Driveway (correlates with year built)
The model defaulted with dirt/gravel driveways. Partial pavement driveways increased the price by $2k. Paved drives increased in price by $5k.
II.k) Garage Condition (correlates with garage size in car capacity, GarageCars)
The linear model selected Excellent garage condition as its default. Good garages lowered the price by $14k. Typical/Average garages lowered the price by $20k. Fair garages received a $23k discount. Poor garages reduced the price by $24k. Lastly, having no garage diminished the house price by $31k.
Taking the square root of the mean squared error gives us an average price error between the actual and predicted values of a house. The linear regression model had an average price error of $26501.33 for training data and $25268.16 for testing data. For the random forest, the training set had an average price error of $9489.71 and the testing set had an average price error of $20701.15.
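For reference, the error metric described above (root mean squared error) can be computed as in this small sketch. The two prices are made up for illustration and are not from the Ames data.

```python
import math

def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference, in the units of price."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up prices: the model is off by $3,000 and $4,000.
actual_prices = [200000, 150000]
predicted_prices = [203000, 146000]
rmse = root_mean_squared_error(actual_prices, predicted_prices)
```

Because the errors are squared before averaging, large misses weigh more heavily than they would in a plain average of absolute errors.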
I. Linear Regression Cost analysis
When comparing the Gross Living Area rates between small and large houses, the presence of a discount for larger houses does exist as the proposal suggested.
Originally, I assumed, since the proposal suggested that most residents worked at Iowa State University, that the closer one was to the campus, the more expensive it got. However, discussions with audience members when I presented this work suggested that closer houses were cheaper due to student housing rates.
With regard to MSZoning, the Floating Village Residential Area is a retirement community. Thus, the price increase from the agricultural default is small. Commercial and industrial areas are more desirable than agricultural areas, but less preferred than residential areas, so a $6k-$10k increase over agricultural land is understandable.
The interesting observation regarding zoning is that residential medium density is only a $19k increase over agricultural land, but the high and low density residential areas have a $30k increase over agricultural land. I speculate that this discount for residential medium density is due to balancing a combination of other factors, good and bad, that differ from its extreme counterparts. For this reason, it's safe in a Goldilocks zone.
Everything else mentioned is based on standard practice. Adding more bathrooms or improving the quality or condition of a certain part of the house will increase price values as expected. Fewer bathrooms, lower quality, or worse conditions in turn lower the price value as expected. More expensive features mean higher house value.
II. Model comparison
My Random Forest model predicts prices more accurately than my linear regression model. However, comparing the $9k error on training data with the $20k error on test data shows that the random forest has high variance and overfits. The linear regression model, in contrast, shows a slight bias, with a $27k error on training data and a $25k error on test data. This small bias is caused by combining cross validation with only including columns that give a model lift.
Conclusion and Future Works
It should be noted that random forests are good predictors but cannot be easily described. Linear regression models, on the other hand, can be useful for descriptive modeling; however, if columns with high feature importance are excluded from the linear regression, we might not get the true reason why the houses are priced at those values. Thus, for the next iteration of this project, I will start with a random forest regressor on training data and return to the original idea of optimizing the forest's parameters with grid search, but with finite values for the number of estimators, the maximum depth of the trees, and the maximum number of features analyzed at each split. In addition to excluding the None option from the previously mentioned parameters, I will add the splitting criterion as a parameter to the grid search.
After optimizing the random forest via the grid search, I will perform the same column addition process as mentioned in the methods, with the following changes:
- Any feature with fewer than 15 unique values should be considered categorical and turned into dummy columns. Analysis of these small numerical sets shows that ratings, qualities, and conditions might be ordinal, but their influence on price is not exactly linear.
- Dummy variable correlation analysis with other features will be excluded. Important features were excluded because of dummy variable correlation. Thus, I'll permit these features to skip correlation analysis and let them attempt cross validation.
- After starting the linear model with the Small and Large Gross Living Area contributions, column addition via elastic net (alpha = 1000, l1-ratio = 0.5) and cross validation will be done in order of the Random Forest's feature importances.
Once satisfied with the training of both models, I will evaluate both models on the testing data and look at the descriptive analysis of the linear model.
Nevertheless, the final models of this iteration were able to get price errors of $20k-25k; this model is safe to use for descriptive analysis, and the logic behind the price contributions makes sense. Bathroom counts, areas, conditions and quality of particular features, and location play a significant part in how a house is typically priced, excluding inflation. Hopefully, our economy will get better and people are able to buy houses that they could turn into homes.
All data used in this project comes from a NYCDSA Machine Learning Project proposal folder, but this data set can be acquired via combining the following documents:
Suggested Ames Real Estate Data: https://www.cityofames.org/home/showpublisheddocument/58715/637843112781470000
Suggested Ames Housing Data: https://www.kaggle.com/datasets/prevek18/ames-housing-dataset