
The quest for R^2 ROI on Home Improvements

chris_wilson
Posted on Sep 25, 2024

Link to GitHub: https://github.com/jackparsons93/Ames_Housing_ML

 

A house is often the biggest purchase many of us will make, so we want to be certain about its value and strategic about improvements made with the intent to sell at a profit. Using the Kaggle Ames Housing dataset, I applied machine learning with the aim of finding the features that drive a model's R^2 as high as possible. This allows investors to price houses more accurately and so earn a greater ROI. The mean house price in the dataset is $178,059, the root mean squared error is $52,060, and an increase of .01 in R^2 leads to a roughly $543 decrease in root mean squared error. RMSE measures the average magnitude of the prediction error.
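For readers who want to see how these quantities relate, here is a minimal sketch, using made-up toy prices rather than the actual Ames data, of how RMSE, R^2, and the null model connect; scikit-learn is assumed.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy prices (hypothetical), just to show how the two metrics relate.
y_true = np.array([150_000, 210_000, 178_000, 95_000, 260_000], dtype=float)
y_pred = np.array([160_000, 195_000, 180_000, 110_000, 240_000], dtype=float)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# The null model always predicts the mean, so its RMSE is the standard
# deviation of the target and its R^2 is 0 by definition.
null_rmse = np.sqrt(mean_squared_error(y_true, np.full_like(y_true, y_true.mean())))

print(f"RMSE: ${rmse:,.0f}  R^2: {r2:.3f}  null-model RMSE: ${null_rmse:,.0f}")
```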

 

This presentation is split into two parts. The first part is based on my original data science journey into the dataset. As I neared the end of that project, I realized it was possible to merge two datasets with the help of the Geopy Nominatim library. The second part of this blog presents the findings based on that merged dataset.

To start, I used linear regression with several quantitative variables. The features I selected first were features = ['GrLivArea', 'LotArea', 'YearBuilt', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GarageArea'], with target = 'SalePrice'. A sketch of this first model appears below.
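The following is a minimal sketch of that first fit; the file name train.csv and the 70/30 split are my assumptions, not necessarily what the original notebook used.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')  # hypothetical path to the Kaggle Ames data
features = ['GrLivArea', 'LotArea', 'YearBuilt', 'TotalBsmtSF',
            '1stFlrSF', '2ndFlrSF', 'GarageArea']
target = 'SalePrice'

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.3, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
print('train R^2:', lr.score(X_train, y_train))
print('test R^2:', lr.score(X_test, y_test))
```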

 

  1. The top left graph shows actual vs predicted price.
  2. The top right shows residuals vs predicted price.
  3. The bottom left shows the Training MSE vs the Test MSE.
  4. The bottom right shows the Training and Test R^2.

 

Note that in the previous image, the test R^2 score was actually higher than the training R^2 score: the training R^2 was 0.76, while the test R^2 was 0.78.

 

The next visualization is a correlation heatmap that shows the relationships between the variables. I address this correlation later using Ridge and Lasso, as well as sequential forward selection.

 

Next, I tried polynomial regression. This increased the R^2 to .79, shown with the same four plots as before.

As we would now expect, the training score is higher than the test score.

 

We now fit a 3rd-degree polynomial that, as we will see, clearly overfits the data: it produces a high training R^2 and a negative R^2 on the test data. Below we use the same four graphs as before, with shockingly different results.

 

Notice how the training MSE is near 0, indicating an almost perfect fit on the training data. The training R^2 score was 0.867, while the test R^2 score was -550,539.27. A sketch reproducing this degree-by-degree collapse follows.
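The overfitting can be reproduced along these lines, reusing the X_train/X_test split from the sketch above; the pipeline details are assumptions, not copied from the original notebook.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

# Compare polynomial degrees; higher degrees fit the training set ever
# better while the test score eventually collapses.
for degree in (1, 2, 3):
    poly = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    ).fit(X_train, y_train)
    print(degree, 'train:', round(poly.score(X_train, y_train), 3),
          'test:', round(poly.score(X_test, y_test), 3))
```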

In the next image we see several plots of the residuals:

  1. Top left: distribution of residuals
  2. Top right: Q-Q plot of residuals
  3. Bottom left: leverage vs. residuals
  4. Bottom right: prediction interval plot

 

As we can see, the residuals form an approximately normal (Gaussian) distribution around a mean of 0.

A Q-Q plot compares the quantiles of the sample data to the quantiles of a theoretical distribution. If the data follows the theoretical distribution, the points will lie approximately along a straight line. Here the points do fall close to a straight line, indicating that the residuals are approximately normally distributed.

As we can see from the leverage plot, there are a couple of data points that severely affect the linear model. To adjust for that, I tried removing such outliers by leverage. However, removing high-leverage points did not yield a better result from a neural network.


Next, I look at how important each feature is to the model.

 

Overall quality is the most important feature in the model by a wide margin, followed by year built. Of the features I modeled first, overall quality reflects the target variable, sale price, the most. I later used sequential feature selection and Lasso to select the most important variables for the model.

 

Next, I used pd.get_dummies to convert the categorical values of the dataset into dummy variables. With linear regression we now get an R^2 of .91, a significant increase. A sketch of the encoding step follows.
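A minimal sketch of that step, under the assumption that the whole frame is dummified and missing values are filled crudely; the original notebook's preprocessing may differ.

```python
# Dummy-encode every categorical column, then refit the linear model on
# the expanded feature matrix.
X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
X = X.fillna(0)  # crude imputation; an assumption, not the post's method

X_train, X_test, y_train, y_test = train_test_split(
    X, df[target], test_size=0.3, random_state=42)
lr_full = LinearRegression().fit(X_train, y_train)
print('test R^2 with dummies:', lr_full.score(X_test, y_test))
```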

Now we try other regression models, such as support vector regression and random forests.


Linear regression is still king for this dataset so far. Random forest ranks second with an R^2 of .89. The poorest results come from the support vector regressor, which produced a negative R^2 score, not even outperforming the null model that predicts the mean of the dataset.

 

The next diagram shows a decision tree based on the most important variable, overall quality.

 

The following shows the decision tree's predictions plotted against overall quality.

 

Next we see Kaggle's favorite method, XGBoost. It got an R^2 score of only .907, which was disappointing.

However, it performs better when applied to the entire dataset, as we'll see later. A sketch of the fit appears below.
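A hedged sketch of the XGBoost fit; these hyperparameters are illustrative, since the post does not list the ones actually used.

```python
from xgboost import XGBRegressor

# Gradient-boosted trees on the dummified feature matrix from above.
xgb = XGBRegressor(n_estimators=500, learning_rate=0.05,
                   max_depth=4, random_state=42)
xgb.fit(X_train, y_train)
print('XGBoost test R^2:', xgb.score(X_test, y_test))
```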

 

Next we look at the penalized methods Ridge and Lasso. We actually see a slight improvement over plain linear regression from both. As the graphs look the same as before, I do not show them here.

Ridge gets an R^2 score of .918 and Lasso gets an R^2 score of .914; both perform slightly better than linear regression. Next we take a look at neural networks and use them to reach an R^2 score of .93.

I performed a grid search to find the best parameters for training a neural network; with those parameters it is possible to reach an R^2 score of .93. A sketch of such a search follows.
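The post does not show the network architecture or the grid, so the sketch below uses scikit-learn's MLPRegressor with an illustrative grid; the original may well have used a different framework entirely.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale inputs, then grid-search a small multilayer perceptron.
pipe = Pipeline([('scale', StandardScaler()),
                 ('mlp', MLPRegressor(max_iter=2000, random_state=42))])
grid = {'mlp__hidden_layer_sizes': [(64,), (128, 64)],
        'mlp__alpha': [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(pipe, grid, cv=5, scoring='r2', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, 'test R^2:', search.score(X_test, y_test))
```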

 

The next model I look at, in an attempt to reclaim the title for simple linear regression, is the sequential forward selector. With it, linear regression reaches an R^2 score of .918, which is pretty impressive; a sketch follows.
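A sketch of forward selection with scikit-learn's SequentialFeatureSelector; the number of features to keep is an assumption, as the post does not state it.

```python
from sklearn.feature_selection import SequentialFeatureSelector

# Greedily add the feature that most improves cross-validated R^2.
sfs = SequentialFeatureSelector(LinearRegression(), direction='forward',
                                n_features_to_select=20, cv=5, n_jobs=-1)
sfs.fit(X_train, y_train)
selected = X_train.columns[sfs.get_support()]

lr_sfs = LinearRegression().fit(X_train[selected], y_train)
print('test R^2 on selected features:', lr_sfs.score(X_test[selected], y_test))
```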

The next step is training the most effective neural network from above, the one with an R^2 score of .93, on the features chosen by forward stepwise selection, keeping the same grid-searched parameters as before.

The R^2 score of the new neural network is 0.922, slightly higher than the linear model using SFS and slightly lower than the .93 of the neural network trained on all features.

Part II

Now we move on to the second part of my project: merging the datasets on Ames house pricing. Using Geopy's Nominatim geocoder allows us to use more data and achieve an even higher R^2 score. A sketch of the geocoding step appears below.
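The post does not show the merge code, so this is a minimal sketch of how Nominatim geocoding could produce a join key; the column names (Address) and the second frame (other_df) are hypothetical.

```python
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="ames_housing_merge")
# Nominatim's usage policy requires throttling requests.
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

df['location'] = (df['Address'] + ', Ames, IA').apply(geocode)
df['lat'] = df['location'].apply(lambda loc: loc.latitude if loc else None)
df['lon'] = df['location'].apply(lambda loc: loc.longitude if loc else None)
# merged = df.merge(other_df, on=['lat', 'lon'])  # hypothetical join key
```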

 

I tried to run sequential feature selection on the new dataset. However, the dataset grew to over 10,000 columns after dummifying, and SFS had been running for more than 20 hours before I canceled it. Where SFS failed, Lasso offers another route: as a penalized linear regression, it performs feature selection by shrinking some coefficients to exactly zero, and it works well on its own. But when we select features with Lasso and pass on only the features whose coefficients are nonzero, we see much better performance from XGBoost, which achieved the best R^2 score. A sketch of that selection loop follows.
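A sketch of the Lasso-then-XGBoost loop, assuming the merged, dummified matrix has been split into X_train/X_test as before; the alpha grid is illustrative.

```python
from sklearn.linear_model import Lasso
from xgboost import XGBRegressor

# For each alpha, keep the features Lasso assigns nonzero coefficients
# and score XGBoost on that subset.
best_alpha, best_r2 = None, float('-inf')
for alpha in (1, 3, 5, 7, 10, 15):
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X_train, y_train)
    keep = X_train.columns[lasso.coef_ != 0]
    if len(keep) == 0:
        continue
    r2 = (XGBRegressor(random_state=42)
          .fit(X_train[keep], y_train)
          .score(X_test[keep], y_test))
    if r2 > best_r2:
        best_alpha, best_r2 = alpha, r2
print('best alpha for XGBoost:', best_alpha, 'R^2:', best_r2)
```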

 

The best alpha (lambda) parameter for Lasso is 7, giving an R^2 score of .931. However, as we will see later with XGBoost, the Lasso model with the highest R^2 does not provide the best feature set for XGBoost.

 

The best alpha for ridge regression was 10, giving an R^2 of .926, slightly worse than Lasso. Here we see Lasso vs. Ridge for mean squared error and R^2 score. Notice that Lasso has a lower MSE and therefore a higher R^2.

 

XGBoost gets an R^2 of .95. This means it is $51,614 more accurate per house than the null model, about 29% of the total value of a home, and $2,716 more accurate per house than a model with an R^2 of .90, about 1.5% of the dataset's mean price of roughly $178,000. To reach the XGBoost score of .95, I had to first select features using Lasso and then feed those features into XGBoost. Note that the Lasso with the highest R^2 score was not the best for XGBoost; I had to loop through many Lassos to find the one that produced the optimal R^2 for XGBoost. Next is a predicted vs. actual chart. As noted above, we can see that XGBoost fits the data much better than the earlier models.

Next is a Q-Q plot of the XGBoost residuals. For the most part the points fall along a straight line, deviating only at the extreme ends of the plot, so the fit is essentially good.

My final graph on XGBoost shows the relationship between the Lasso alpha and the XGBoost R^2 score.

Finally, I would like to take a look at support vector regressors. I hyperparameter-tuned the SVR and got an R^2 score of .913 with best parameters of C=100, epsilon=.1, and kernel=linear; a sketch of the tuning follows.
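A sketch of that tuning run; the grid below is built around the reported best parameters rather than copied from the original notebook.

```python
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# SVR is sensitive to feature scale, so scale before fitting.
svr_pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR())])
svr_grid = {'svr__C': [1, 10, 100],
            'svr__epsilon': [0.01, 0.1, 1.0],
            'svr__kernel': ['linear', 'rbf']}
svr_search = GridSearchCV(svr_pipe, svr_grid, cv=5, scoring='r2', n_jobs=-1)
svr_search.fit(X_train, y_train)
print(svr_search.best_params_, 'test R^2:', svr_search.score(X_test, y_test))
```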

 

Here is a look at the graph created by the SVR. At first glance the SVR appears to overfit the data. In fact, though, the epsilon is only .1, meaning the epsilon-insensitive tube around the regression hyperplane is very narrow.

The final method I tried is TensorFlow. Without feature selection, TensorFlow on the merged dataset got an R^2 of .92. I then performed feature selection using Lasso, feeding the features with nonzero coefficients to TensorFlow, and got an R^2 of .936, about the same as on the original dataset. A minimal sketch follows.
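A minimal Keras sketch of that final model; the architecture is an assumption, since the post does not show it, and the Lasso alpha of 7 is taken from the best value reported earlier.

```python
import tensorflow as tf
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# Re-derive the Lasso-selected columns on the merged, dummified matrix.
lasso = Lasso(alpha=7, max_iter=10_000).fit(X_train, y_train)
keep = X_train.columns[lasso.coef_ != 0]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1),  # single output: predicted sale price
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train[keep].to_numpy('float32'), y_train.to_numpy('float32'),
          validation_split=0.2, epochs=100, batch_size=32, verbose=0)

pred = model.predict(X_test[keep].to_numpy('float32')).ravel()
print('TensorFlow test R^2:', r2_score(y_test, pred))
```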

 

The winner of the battle for R^2 is XGBoost, with an R^2 score of .95. As I stated earlier, each .01 of R^2 is worth about $560 per house, as measured by root mean squared error, so the value of XGBoost over second place, TensorFlow at .936, is about $1,000 per house. Across the whole dataset, that is a net value gained of roughly $2,680,000 over TensorFlow.

 

Where to go from here

 

Things that could be done in the future, given more time and computational power, include creating synthetic data using a generative adversarial network (GAN). GAN-generated images have been used to augment training data for ImageNet's Large Scale Visual Recognition Challenge, producing new fake images for neural networks to train on. Imagine an effectively infinite training set, all of it passing the discriminator's judgment of what is real and what is fake. That could increase the R^2 score of a neural network or XGBoost even further.

 

About Author

chris_wilson

I used to be an amateur Ruby on Rails programmer, but Rails is dead except for senior programmers. I also have a background in mathematics; I used to do very well in math tournaments as a youngster, I...
