The quest for R^2 ROI on Home Improvements
Link to GitHub https://github.com/jackparsons93/Ames_Housing_ML
As a house is often the biggest purchase many of us will make, we want to be confident about its value and strategic about improvements made with the intent to sell at a profit. Using the Kaggle Ames Housing dataset, I applied machine learning with the aim of finding the features and models that maximize R^2, which lets investors price houses more accurately and earn a greater ROI. The mean house price in the dataset is $178,059, the root mean squared error (RMSE) is $52,060, and an increase of .01 in R^2 corresponds to roughly a $543 decrease in RMSE. RMSE measures the average magnitude of the prediction error.
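To make the R^2-to-dollars conversion concrete, here is a minimal sketch of the arithmetic behind it (my own illustration, not code from the repo); sigma_y is a placeholder value, and the exact dollar figure per .01 of R^2 depends on where on the R^2 scale you start.

```python
import numpy as np

# On a given evaluation split,
#   R^2  = 1 - SS_res / SS_tot      RMSE = sqrt(SS_res / n)
# so      RMSE = sigma_y * sqrt(1 - R^2),
# where sigma_y is the standard deviation of SalePrice on that split.
sigma_y = 80_000.0  # placeholder; use the actual std of SalePrice on your test split

def rmse_from_r2(r2, sigma=sigma_y):
    return sigma * np.sqrt(1.0 - r2)

# Dollar value of one extra 0.01 of R^2 near R^2 = 0.90
print(rmse_from_r2(0.90) - rmse_from_r2(0.91))
```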
The following presentation is split into two parts. The first part is based on my original data science journey into the dataset. As I came near the end of that project, I realized that it was possible to merge the two Ames datasets with the help of the Geopy Nominatim library. The second part of this blog presents the findings based on that merged dataset.
To start, I used linear regression with several quantitative variables. The first set of features I selected is features = ['GrLivArea', 'LotArea', 'YearBuilt', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GarageArea'], with target = 'SalePrice'.
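Here is a minimal sketch of this first fit; the file name, train/test split, and random seed are my assumptions rather than details taken from the repo.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")  # assumed filename for the Kaggle Ames training data

features = ['GrLivArea', 'LotArea', 'YearBuilt', 'TotalBsmtSF',
            '1stFlrSF', '2ndFlrSF', 'GarageArea']
target = 'SalePrice'

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
print("train R^2:", linreg.score(X_train, y_train))
print("test  R^2:", linreg.score(X_test, y_test))
```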
- The top left graph shows actual vs predicted price.
- The top right shows residuals vs predicted price.
- The bottom left shows the Training MSE vs the Test MSE.
- The bottom right shows the Training and Test R^2.
Please note that in the previous image we saw that the Test R^2 score was actually higher than the Training R^2 score.
The training R^2 score was 0.76, and the test R^2 score was 0.78.
The next visualization is a correlation heatmap that shows the relationships between the variables. I address this correlation later using Ridge and Lasso, as well as sequential forward selection.
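A heatmap like this can be reproduced with a few lines of seaborn, building on the variables defined in the sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap over the selected features plus the target
corr = df[features + [target]].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation of selected features with SalePrice")
plt.show()
```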
The next thing I tried was polynomial regression. It increased the R^2 to .79, and I used the same four plots as before.
As we would now expect, the training score is higher than the test score.
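Here is a sketch of the degree-2 fit, reusing the split from the first example; switching the degree to 3 reproduces the severe overfitting described next.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# degree=2 gives the modest improvement described here;
# degree=3 reproduces the severe overfitting shown next
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
poly_model.fit(X_train, y_train)
print("train R^2:", poly_model.score(X_train, y_train))
print("test  R^2:", poly_model.score(X_test, y_test))
```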
We now move to a 3rd-degree polynomial that, as we will see, clearly overfits the data: it produces a high training R^2 and a negative R^2 on the test data. In the next slide, we use the same four graphs as before, with shockingly different results.
Notice how the training MSE is near 0, indicating a nearly perfect fit on the training data. The training R^2 score was about 0.867, while the test R^2 score was roughly -550,539.
In the next image we see several plots of the residuals:
- Top left: distribution of residuals
- Top right: Q-Q plot of residuals
- Bottom left: leverage vs. residuals
- Bottom right: prediction interval plot
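The first two panels are straightforward to reproduce; the post does not say exactly which model the residuals come from, so this sketch reuses the linear fit from earlier (leverage and prediction intervals need statsmodels and are omitted).

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

residuals = y_test - linreg.predict(X_test)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(residuals, bins=40)
axes[0].set_title("Distribution of residuals")
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title("Q-Q plot of residuals")
plt.tight_layout()
plt.show()
```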
As we can see, the residuals form an approximately Gaussian distribution around a mean of 0.
A Q-Q plot compares the quantiles of the sample data to the quantiles of a theoretical distribution. If the data follows the theoretical distribution, the points will lie approximately along a straight line. Here the Q-Q plot is close to a straight line, indicating that the residuals are approximately normally distributed.
As we can see from the leverage plot, a couple of data points severely affect the linear model. To adjust for that, I tried removing such high-leverage outliers; however, I found that doing so did not improve the results of the neural network.

I next look at how important each feature is to the model.
Overall Quality is the most important feature by a wide margin, followed by Year Built; of the features I examined, it reflects the target variable, sale price, the most. I later used Sequential Feature Selection and Lasso to select the variables most important to the model.
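One common way to produce a ranking like this is a random forest's feature_importances_; the post does not say which estimator generated the chart, so treat the following as an illustration. The chart evidently includes Overall Quality, so it was presumably computed on a broader feature set than the seven columns used above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit on whatever feature frame you want ranked (ideally one that includes
# OverallQual); here I reuse X_train from the earlier sketch for brevity.
rf = RandomForestRegressor(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```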
Next I use dummy encoding via pd.get_dummies to convert the categorical values of the dataset into dummy variables. With these included, linear regression reaches an R^2 of .91, a significant increase.
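A sketch of that step, with a crude fillna just to keep the example runnable; the repo's preprocessing is likely more careful.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Dummy-encode every categorical column and refit linear regression
X_full = pd.get_dummies(df.drop(columns=['SalePrice']), drop_first=True)
X_full = X_full.fillna(0)  # crude imputation, just to keep the sketch runnable

X_train, X_test, y_train, y_test = train_test_split(
    X_full, df['SalePrice'], test_size=0.2, random_state=42)
print("test R^2:", LinearRegression().fit(X_train, y_train).score(X_test, y_test))
```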
Now we try other regression models, such as support vector regression and random forests.

Linear regression is still king for this dataset so far. Random forest ranks second with an R^2 of .89. The poorest results come from the support vector regressor, which produced a negative R^2 score, failing to outperform even the null model that predicts the mean of the dataset.
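A quick comparison on the dummy-encoded split sketched above; note that the out-of-the-box SVR here is fit on unscaled features with default hyperparameters, which is one common reason its R^2 comes out negative.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.svm import SVR

for name, model in {
    "Random forest": RandomForestRegressor(n_estimators=300, random_state=42),
    "SVR (defaults)": SVR(),
}.items():
    model.fit(X_train, y_train)
    print(name, r2_score(y_test, model.predict(X_test)))
```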
The next diagram shows a decision tree based on the most important variable, Overall Quality. The following plot shows the tree's predicted prices against Overall Quality.
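A shallow tree like this can be drawn with scikit-learn; the max_depth here is my choice to keep the plot readable, not necessarily the depth used in the post.

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# A small tree on OverallQual alone, roughly matching the diagram described
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(df[['OverallQual']], df['SalePrice'])

plt.figure(figsize=(14, 6))
plot_tree(tree, feature_names=['OverallQual'], filled=True, rounded=True)
plt.show()
```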
Next we see Kaggle's favorite method, XGBoost. On this feature set XGBoost only reached an R^2 score of .907, which was disappointing. However, it does perform better when applied to the entire dataset, as we'll see later.
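A minimal XGBoost fit on the same split; the hyperparameters are placeholders, not values taken from the repo.

```python
from xgboost import XGBRegressor

# Placeholder hyperparameters; tune these for a serious comparison
xgb = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4, random_state=42)
xgb.fit(X_train, y_train)
print("XGBoost R^2:", xgb.score(X_test, y_test))
```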
Next we look at the Ridge and Lasso penalized methods. We see a slight improvement over plain linear regression. As the graphs look the same as before, I do not show them here.
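Cross-validated Ridge and Lasso fits can be sketched as follows; the alpha grid is my own choice.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

alphas = np.logspace(-2, 3, 50)
ridge = RidgeCV(alphas=alphas).fit(X_train, y_train)
lasso = LassoCV(alphas=alphas, max_iter=50_000).fit(X_train, y_train)
print("Ridge test R^2:", ridge.score(X_test, y_test))
print("Lasso test R^2:", lasso.score(X_test, y_test))
```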
Ridge gets an R^2 score of .918 and Lasso gets an R^2 score of .914; both perform slightly better than linear regression. Next we take a look at neural networks and use them to reach an R^2 score of .93.
I performed a grid search to find the best parameters for training a neural network. Using this method it is possible to reach an R^2 score of .93.
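The post does not say which library the Part I network uses, so here is a grid search sketched with scikit-learn's MLPRegressor; the parameter grid is illustrative.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), MLPRegressor(max_iter=2000, random_state=42))
param_grid = {
    "mlpregressor__hidden_layer_sizes": [(64,), (128, 64), (256, 128)],
    "mlpregressor__alpha": [1e-4, 1e-3, 1e-2],
    "mlpregressor__learning_rate_init": [1e-3, 1e-2],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
print("test R^2:", search.score(X_test, y_test))
```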
The next model I look at, in an attempt to reclaim the title for simple linear regression, is a sequential forward selector. With it, linear regression reaches an R^2 score of .918, which is pretty impressive.
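Sketched here with scikit-learn's SequentialFeatureSelector (the repo may use mlxtend's version instead); the number of features to keep is my guess.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=20, direction="forward", cv=5)
sfs.fit(X_train, y_train)

selected = X_train.columns[sfs.get_support()]
lr = LinearRegression().fit(X_train[selected], y_train)
print("test R^2 with selected features:", lr.score(X_test[selected], y_test))
```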
The next step is training the most effective neural network from above, the one with an R^2 score of .93, using the features chosen by sequential forward selection. Remember, the parameters were as shown below:
The R^2 score of the new neural network is 0.922, slightly higher than the linear model using SFS and slightly lower than the .93 of the neural network trained on all features.
Part II
Now we move on to the second part of my project, in which we merge the Ames housing datasets. Using Geopy's Nominatim geocoder allows us to use more data and achieve an even higher R^2 score.
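Geocoding with Nominatim looks roughly like this; the address column name and the exact merge keys are assumptions on my part, since the real merge logic lives in the repo.

```python
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="ames_housing_merge")  # user_agent string is arbitrary
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # be polite to Nominatim

def to_coords(address):
    loc = geocode(f"{address}, Ames, Iowa")
    if loc is None:
        return pd.Series({"lat": None, "lon": None})
    return pd.Series({"lat": loc.latitude, "lon": loc.longitude})

# other_df[["lat", "lon"]] = other_df["Address"].apply(to_coords)  # "Address" is a hypothetical column
# ...then merge the two tables on the geocoded coordinates (or a key derived from them)
```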
I tried to run Sequential Feature Selection on the new dataset. However, the dataset grew to over 10,000 columns after dummifying, and SFS had been running for more than 20 hours when I canceled it. Even though SFS failed, there is another method, Lasso, that performs feature selection by setting certain coefficients to zero. Lasso is a penalized linear regression model that works well on its own, but when we feature-select with Lasso and then pass the features with nonzero coefficients to XGBoost, we see much better performance; XGBoost ends up with the best R^2 score of all.
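The selection-then-boosting loop can be sketched as follows, assuming X_train/X_test now hold the dummy-encoded merged data; the alpha grid and XGBoost settings are my placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso
from xgboost import XGBRegressor

# Loop over Lasso alphas, keep the features with nonzero coefficients,
# and score XGBoost on each subset.
best_alpha, best_score = None, -np.inf
for alpha in range(1, 31):
    lasso = Lasso(alpha=alpha, max_iter=50_000).fit(X_train, y_train)
    keep = X_train.columns[lasso.coef_ != 0]
    if len(keep) == 0:
        continue
    xgb = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
    xgb.fit(X_train[keep], y_train)
    score = xgb.score(X_test[keep], y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score
print("best alpha for the downstream XGBoost:", best_alpha, best_score)
```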
The best alpha (lambda) parameter for Lasso on its own is 7, which gives an R^2 score of .931. However, as we will see with XGBoost, the Lasso with the highest R^2 does not provide the best feature set for XGBoost.
The best alpha for Ridge regression was 10, which gives an R^2 of .926, slightly worse than Lasso. Here we see Lasso vs. Ridge for mean squared error and R^2 score. Notice that Lasso has a lower MSE and therefore a higher R^2.
XGBoost gets an R^2 of .95. That makes it $51,614 more accurate per house than the null model (about 29% of the mean value of a home) and $2,716 more accurate per house than a model with an R^2 of .90 (about 1.5% of the mean of the dataset, which is roughly $178,000). To reach .95, I first had to select features using Lasso and then feed those features into XGBoost. Note that the Lasso with the highest R^2 score was not the best for XGBoost; I had to loop through many Lasso fits to find the one that maximized XGBoost's R^2. Next is a predicted vs. actual chart. As noted above, we can see that XGBoost fits the data much better than the earlier models.
Next is a Q-Q plot of the XGBoost residuals. For the most part the points lie along a straight line, deviating only at the beginning and end of the plot, so the fit is essentially good.
My final graph on XGBoost shows the relationship between the Lasso alpha used for feature selection and the resulting XGBoost R^2 score.
Finally, I would like to take a look at support vector regressors. I hyperparameter-tune the SVR and get an R^2 score of .913 with best parameters C=100, epsilon=.1, and kernel=linear.
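A grid search over those hyperparameters can be sketched as follows; the scaling step and the exact grid are my choices, but the reported best values fall inside it.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

param_grid = {
    "svr__C": [1, 10, 100],
    "svr__epsilon": [0.01, 0.1, 1.0],
    "svr__kernel": ["linear", "rbf"],
}
svr_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVR()),
    param_grid, cv=3, scoring="r2", n_jobs=-1)
svr_search.fit(X_train, y_train)
print(svr_search.best_params_)
print("test R^2:", svr_search.score(X_test, y_test))
```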
Here is a look at the graph created by the SVR. At first glance the SVR appears to overfit the data. In fact, though, the epsilon is only .1, meaning the epsilon-insensitive tube around the fitted hyperplane is very narrow.
The final method I tried is TensorFlow. Without feature selection, TensorFlow on the merged dataset got an R^2 of .92. I then did feature selection using Lasso, fed the features with nonzero coefficients to TensorFlow, and got an R^2 of .936, about the same as on the original dataset.
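A plain feed-forward Keras regressor along these lines would do the job; the architecture and training settings are my placeholders, and X_train_sel/X_test_sel are assumed to hold only the Lasso-selected columns.

```python
import tensorflow as tf
from sklearn.metrics import r2_score

# Simple dense regressor; layer sizes, epochs, and batch size are placeholders
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train_sel.shape[1],)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train_sel.astype("float32"), y_train, validation_split=0.2,
          epochs=100, batch_size=32, verbose=0)

preds = model.predict(X_test_sel.astype("float32")).ravel()
print("TensorFlow R^2:", r2_score(y_test, preds))
```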
The winner of the battle for R^2 is XGBoost, with an R^2 score of .95. As I stated earlier, each .01 of R^2 is worth roughly $560 per house in root mean squared error, so the edge of XGBoost over second-place TensorFlow at .936 is about $1,000 per house. Across the whole dataset, that is a net gain of roughly $2,680,000 over TensorFlow.
Where to go from here
Things that could be done in the future, with more time and computational power, include creating synthetic data using a generative adversarial network (GAN). GANs can generate new, fake images to additionally train neural networks on, an approach that has been explored for large-scale benchmarks such as ImageNet. Imagine an effectively unlimited supply of training data, all of which passed the discriminator's judgment of what is real and what is fake. That could increase the R^2 score of a neural network or XGBoost even further.