Analyzing Data to Predict Housing Prices in Ames, Iowa

Daniel Choy
Posted on Mar 9, 2021

Introduction

House flipping is a common real estate investment strategy: an investor purchases a property and resells it in the hope of making a profit. Often the temporary owner has to make substantial repairs or renovations before the property can sell for more than the total investment. The goal, in short, is to buy low and sell high.

However, house flipping can be financially risky because of the uncertainty of the market. As data scientists, we approached this machine learning project with a two-fold goal: first, to explore which housing characteristics are correlated with sale price per square foot in Ames; and second, to build a model for estimating future sale prices and identifying which features make for the most impactful renovations, ultimately providing greater transparency to homeowners and house flippers.

Background: Understanding Ames, Iowa and Ames' Housing Market

Before diving into the project's details, a brief background on Ames, Iowa helps in understanding its housing market. According to a 2010 United States Census Bureau report, Ames had a population of approximately 59,000. The city's economy and demographics are largely defined by Iowa State University, a public research university located in the middle of the city. More than 75% of Ames' population either studies at Iowa State University or works there as faculty, making Ames one large extended campus.

It is therefore unsurprising that Ames' largest employer is Iowa State University, which accounts for approximately 20% of total employment. As in many college towns, Ames' real estate market is defined by a substantial proportion of rental properties, which helps explain the stability of the housing market in Ames.

The Ames housing price distribution graph shows that housing prices have more outliers on the expensive side. The market trend from 2006 to 2010 shows that per-square-foot pricing in Ames remained relatively stable over those years.

The map shows that the cheaper homes are generally located in the city center or around the Iowa State University campus, while the more expensive houses are located in the northern part of the city. In general, cheaper houses cluster together and expensive houses cluster together.

About the Data

The data contains 2558 observations and 190 features on homes sold in Ames, Iowa from 2006 to 2010. From these, we carefully selected a subset of features and engineered some of our own to simplify and sharpen the focus of our subsequent models. We also ran random forest and Lasso regression to narrow the features further before finalizing the inputs to our linear and tree-based machine learning models.

Data Cleaning

After carefully reviewing the documentation for each variable, we began with imputation. Most of the work involved missing values: N/A entries that corresponded to the absence of a feature rather than to unrecorded data. These values were replaced with either the string None or 0, depending on the type of the variable. For example, a missing value in the continuous variable GarageArea was imputed as 0, on the assumption that a missing value most likely meant the house had no garage.
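The imputation rule described above can be sketched in pandas. This is a minimal illustration rather than the project's actual cleaning code: the rows are made up, the helper name `impute_absent_features` is hypothetical, and `FireplaceQu` is assumed here as a second example of a column where N/A means the feature is absent.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the Ames data; the values are made up for illustration.
homes = pd.DataFrame({
    "GarageArea": [480.0, np.nan, 312.0],
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "FireplaceQu": [np.nan, "Gd", np.nan],
})

def impute_absent_features(df: pd.DataFrame) -> pd.DataFrame:
    """Fill N/A values that signal an absent feature: 0 for numeric columns, "None" otherwise."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    other_cols = out.columns.difference(numeric_cols)
    out[numeric_cols] = out[numeric_cols].fillna(0)
    out[other_cols] = out[other_cols].fillna("None")
    return out

cleaned = impute_absent_features(homes)
print(cleaned)
```

A blanket rule like this is only appropriate for columns where the documentation confirms that N/A means "not present"; values that are missing for other reasons would need a different treatment.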

Exploratory Data Analysis

We conducted graphical and numerical exploratory data analysis to understand the dataset and the relationships between the features and our target variable, sale price per square foot. While no two homes are the same, price per square foot is helpful when comparing similar homes in the same neighborhood.

Because the dataset has so many features, not all features and analyses are discussed in this post. Instead, the post focuses on a select few features for exploratory data analysis, feature selection, and feature engineering, chosen based on the correlation heatmap. We explore several features that might affect sale price per square foot and group them into 5 categories: neighborhood, house size, house age, house features, and other features.

Neighborhood Data 

As mentioned above, the average Ames housing sale price differs by neighborhood. Neighborhoods around Iowa State University and the city center are generally cheaper, while the northern neighborhoods, Gilbert and Grand Ave/30th St, are more expensive. As an investor or a homeowner interested in house flipping, it is therefore important to understand the neighborhood you are investing in.

House Size Data 

The plot above shows the sale price per square foot against the total living area. Based on the graph, there is a strong positive relationship between these two variables: in general, the larger the living area, the higher the sale price per square foot.

House Age

Orange - old; Yellow - fairly old; Green - fairly new; Dark green - new

Again, the map shows a pattern similar to the earlier price maps. Setting aside the fairly new and fairly old houses, the new houses sit farther from the college campus and cluster in the northern neighborhoods, while the old houses cluster around the city center, the same pattern we saw with the average sale price per square foot. The map also shows that the more recent houses were built on the outskirts of Ames, which suggests that the city is expanding outward.

House Features and Others

Based on the graph above, in terms of additional house features such as heating quality, exterior quality, and fireplace quality, the better the quality the higher the sale price.

Feature Engineering

Based on what we observed in our exploratory data analysis, we created several new features to reduce dimensionality and to better explain and predict sale price.

For example, Basement Total Finished Square Feet is the total basement area that is finished and Building Age is calculated as Year Sold - Year Built. These newly created features are highly correlated with the sale price and these features will be used as our predictors for our models.
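The two engineered features mentioned above, along with the target variable, can be computed in a few lines of pandas. This is a sketch under assumptions: the column names (`YrSold`, `YearBuilt`, `BsmtFinSF1`, `BsmtFinSF2`, `GrLivArea`, `SalePrice`) follow the public Ames housing data dictionary, the new column names are hypothetical, the rows are made up, and dividing by above-ground living area to get price per square foot is an assumption, since the post does not say which area was used.

```python
import pandas as pd

# Two made-up homes; column names follow the public Ames data dictionary.
homes = pd.DataFrame({
    "YrSold": [2008, 2010],
    "YearBuilt": [1978, 2005],
    "BsmtFinSF1": [700, 0],
    "BsmtFinSF2": [100, 0],
    "GrLivArea": [1500, 2000],
    "SalePrice": [180000, 300000],
})

# Building Age = Year Sold - Year Built
homes["BuildingAge"] = homes["YrSold"] - homes["YearBuilt"]

# Basement Total Finished Square Feet = sum of the finished basement areas
homes["BsmtTotalFinSF"] = homes["BsmtFinSF1"] + homes["BsmtFinSF2"]

# Target: sale price per square foot (denominator is an assumption)
homes["PricePerSF"] = homes["SalePrice"] / homes["GrLivArea"]

print(homes[["BuildingAge", "BsmtTotalFinSF", "PricePerSF"]])
```

Combining related raw columns this way replaces several inputs with one, which is how feature engineering can reduce dimensionality while keeping the information the model needs.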

Machine Learning Models

We implemented several machine learning models for different purposes. We first started with Lasso for empirical feature selection. Then we created two predictive models - one linear and one non-linear model. Finally, we ran a multiple linear regression model to find which features make for the most impactful renovations.

1. Lasso

For empirical feature selection, we started with a Lasso model. Lasso favors simpler models by adding a penalty on the size of the predictor coefficients, so the coefficients shrink toward zero as the penalty increases. With an appropriate penalty strength, set by the hyperparameter lambda, some predictor coefficients are driven exactly to zero while others remain non-zero. Predictors that are correlated with other predictors have their overall impact regularized.

Based on our grid search with cross-validation, we selected the Lasso model that fit the dataset well without overfitting. The model reduced the number of predictors from the original dataset to 81 features. These include numerical variables that were highly correlated with the sale price per square foot, such as GrLivArea, OverallQual, OverallCond, and GarageArea, and categorical variables such as Neighborhood, GarageType, and HouseStyle. Please see our GitHub repository for more information.
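As a rough illustration of Lasso-based feature selection with a cross-validated grid search, the sketch below uses scikit-learn on synthetic data. It is not the project's code: the data, the grid of penalty values, and the 5-fold setting are all assumptions made for the example. Note that scikit-learn calls the penalty strength `alpha` rather than lambda.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data: only the first 3 of 10 columns matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Scale first so the penalty treats every predictor on the same footing.
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

# Assumed grid of penalty strengths; `alpha` plays the role of lambda.
param_grid = {"lasso__alpha": np.logspace(-3, 1, 20)}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

# Predictors whose coefficients survived the penalty are the selected features.
coefs = search.best_estimator_.named_steps["lasso"].coef_
selected = np.flatnonzero(coefs)
print("best alpha:", search.best_params_["lasso__alpha"])
print("selected columns:", selected)
```

On real data with categorical variables such as Neighborhood, the categories would first need to be encoded as numeric columns (for example, one-hot encoded) before fitting.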

2. Elastic Net

With the selected features from our Lasso model, we ran an elastic net model to predict sale price. Using grid search and cross-validation, we chose parameters that fit well without overfitting. Our best parameters were Lambda = 1e-6 and L1 ratio = 1.0. This means that our elastic net model ended up behaving like a lasso model.
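To see why an L1 ratio of 1.0 makes an elastic net behave like Lasso, the sketch below fits both scikit-learn estimators on the same synthetic data with the same penalty strength. The data and the value of `alpha` are assumptions chosen for illustration; they are not the values from this project.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data, made up for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=150)

# scikit-learn's elastic net penalty is
#   alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2,
# so with l1_ratio = 1.0 the L2 term vanishes and only the Lasso penalty remains.
alpha = 0.05
enet = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=alpha).fit(X, y)

same_fit = np.allclose(enet.coef_, lasso.coef_)
print("identical coefficients:", same_fit)
```

A grid search that lands on an L1 ratio of 1.0 is therefore telling you that adding a ridge-style L2 penalty did not improve the cross-validated fit.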

3. Random Forest

For our next model, we selected a random forest as our non-linear predictive model because it is a well-tested tree-based model that is robust to overfitting. However, compared to our regularized linear models, the random forest performed worse, mainly because house prices seem to be intrinsically linear. Intuitively, the value of a house typically increases as features are added or improved and decreases as features are removed.
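The intuition that a linear model can beat a random forest when the target is close to linear in the features can be checked on synthetic data. The sketch below is an illustration only: the data is generated from a made-up linear relationship, so it says nothing about the actual scores obtained on the Ames data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data in which the target really is a linear function of the features.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(400, 4))
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(scale=1.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# .score() returns R^2 on the held-out data.
linear_r2 = linear.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)
print(f"linear R^2: {linear_r2:.3f}  forest R^2: {forest_r2:.3f}")
```

A random forest approximates a sloped surface with piecewise-constant steps, so on a truly linear relationship it needs far more data than a linear model to reach the same accuracy.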

4. Multiple Linear Regression

What can a homeowner do to increase the value of their property? In order to successfully flip a house, or in other words, if a homeowner wanted to make some renovations for profit, which ones would have the greatest impact on Sale Price?

To answer these questions, we finally ran a multiple linear regression model on a particular subset of predictors. We chose multiple linear regression for the interpretability and simplicity of its coefficients. For every 1-unit increase in a given feature, with the other features held constant, you can expect the target variable to change by the value of that feature's coefficient. This allows for easy interpretation and therefore straightforward insight for homeowners.

We started with the list of 81 features provided by our Lasso model for house renovations. Because Lasso is nothing more than penalized linear regression, it makes sense to use Lasso's output features as the input features for our multiple linear regression model. The resulting model achieved a train score of 0.912, which gives us confidence in its ability to explain the data and, ultimately, in its choice of the most important features.
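The way coefficients are read off a multiple linear regression can be sketched as follows. Everything here is assumed for illustration: the three feature names are borrowed from the list above, but the data is synthetic and the dollar effects (50, 30, and 10,000) are made-up numbers, not findings from this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic homes with a known, made-up relationship to price.
rng = np.random.default_rng(3)
n = 300
features = pd.DataFrame({
    "GrLivArea": rng.uniform(800, 3000, size=n),
    "GarageArea": rng.uniform(0, 900, size=n),
    "OverallQual": rng.integers(1, 11, size=n),
})
price = (
    50 * features["GrLivArea"]
    + 30 * features["GarageArea"]
    + 10_000 * features["OverallQual"]
    + rng.normal(scale=5_000, size=n)
)

model = LinearRegression().fit(features, price)

# Each coefficient is the expected change in price for a 1-unit increase
# in that feature, holding the other features constant.
coef_table = pd.Series(model.coef_, index=features.columns).sort_values(ascending=False)
train_score = model.score(features, price)  # R^2 on the training data

print(coef_table)
print(f"train score: {train_score:.3f}")
```

Raw coefficients depend on each feature's units (one square foot versus one quality point), so ranking renovations by coefficient size is only meaningful when the features are on comparable scales or the units are kept in mind.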

Additional Insights

Based on the model, a homeowner or investor in Ames, Iowa who is deciding which renovations to make for a successful house-flipping project might consider the following features: total living area, distance from Iowa State University, overall quality and condition, garage area, number of bathrooms, kitchen quality, heating quality, basement exposure, fireplace, and exterior quality.

In terms of quality, the single most important factor in selling a home is the overall quality of the house's material and finish. When prioritizing areas to remodel, the best order may be outdoor finishes first, then indoor finishes, and finally basement finishes. If remodeling over several years with plans to sell the home in the future, Exterior Quality has the advantage of staying in style many decades longer than interior finishes. It may therefore be important to order the interior work so that the most outdated areas of the home are those that contribute least to Sale Price, given that the years since the last remodel also influence sale price.

For a simpler renovation, homeowners or investors could increase the finished percentage of their basement, which could attract buyers willing to spend more for a fully finished property.

Conclusion

Overall, our analysis showed that a regularized linear model makes better predictions than a tree-based model. We were also able to produce a list of features ranked by importance, both for homeowners looking to add value to their property through renovations and for investors looking for a profitable house-flipping project.
