Investment Opportunity in Ames, Iowa

Posted on Jul 4, 2020

Introduction: 

Home ownership has long been touted as part of the American dream. Unfortunately for those living in coastal cities, such as New York or San Francisco, home ownership is an increasingly distant dream. Luckily there are more affordable options in Ames, Iowa.

Why Ames? For those unfamiliar with the city, Ames has been named the best college town in America and CNN Money's 9th best place to live. It's close to a lot of nature reserves and is not too far from bigger cities like Minneapolis or Chicago. Because it is a university town, you can expect no shortage of incoming students or faculty looking to move to Ames, making a property purchase there an ideal rental investment.

I decided to look at real estate investment opportunities from two perspectives: 

  1. A home buyer
  2. A real estate developer 

The data I used for my analysis is from Kaggle.

Missing Data:

Before starting my analysis, I needed to address missing data. At first glance, many features have missing values, but the data dictionary shows that most of these simply indicate a home lacks the feature in question. For example, a missing pool quality means the home has no pool, and a missing fence value likewise means there is no fence. For these cases, I replaced the missing value with an explicit text value stating there is no pool, no garage, or no fence. For the remaining variables that were missing at random, I used mode imputation for categorical variables and median imputation for numerical variables.
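A rough sketch of this imputation step in pandas is shown below; the column lists are illustrative subsets using the Kaggle data dictionary names (PoolQC, Fence, GarageType, FireplaceQu), not the full set handled in the project.

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Missing values that actually mean "the home has no such feature"
none_cols = ["PoolQC", "Fence", "GarageType", "FireplaceQu"]  # illustrative subset
df[none_cols] = df[none_cols].fillna("None")

# Remaining gaps: mode for categorical columns, median for numeric columns
for col in df.columns[df.isna().any()]:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())
```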

Data Exploration and Feature Engineering:

One of the things I tried to do was consolidate information into as few variables as possible. The data provided five different variables reporting porch-related space, which I combined into a single variable: TotalPorchSF. Many of the variables are also related to each other; for example, high garage capacity (number of cars) usually indicates a large garage area, so I kept only the number of cars for my analysis. In total I removed 16 variables that either had low predictive power or duplicated information.
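A sketch of that consolidation is below; the porch columns follow the Ames data dictionary, while the drop list is only illustrative rather than the full 16 variables.

```python
# Combine the five porch-related areas into a single square-footage feature
porch_cols = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]
df["TotalPorchSF"] = df[porch_cols].sum(axis=1)

# Drop redundant or low-signal columns, e.g. GarageArea largely duplicates GarageCars
drop_cols = porch_cols + ["GarageArea", "Utilities", "Street"]  # illustrative, not the full 16
df = df.drop(columns=drop_cols)
```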

Another issue was outliers. I identified two homes that were priced very low relative to their size: the two data points at the bottom right of the plot below. In total I removed 5 observations from the training set. The remaining three were removed because they featured extremely rare exterior materials, and there simply was not enough data to say anything about the quality of those materials.
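The filtering might look like the sketch below; the price and size cutoffs and the rarity threshold are assumptions for illustration, since the post does not state the exact values used.

```python
# Drop the two large but underpriced homes (bottom right of the price-vs-area plot)
df = df[~((df["GrLivArea"] > 4000) & (df["SalePrice"] < 200000))]

# Drop homes whose exterior material is too rare to learn anything about
exterior_counts = df["Exterior1st"].value_counts()
rare_exteriors = exterior_counts[exterior_counts < 2].index
df = df[~df["Exterior1st"].isin(rare_exteriors)]
```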

Target variable:

The distribution of the sale price of homes in Ames, Iowa is right-skewed, so I took the log of the sale price. As you can see, the transformed values are much closer to normally distributed. I used the log of the sale price as my target variable when building my machine learning models.
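In numpy this transformation is a one-liner; predictions made on the log scale are converted back to dollars with the inverse transform.

```python
import numpy as np

# Model log(SalePrice) to tame the right skew of the raw prices
df["LogSalePrice"] = np.log(df["SalePrice"])

# After predicting on the log scale, convert back to dollars with np.exp(prediction)
```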

Machine Learning: 

I tried a variety of linear and tree-based models, including LASSO, Ridge, Random Forest, and Gradient Boosting. Of these, the tree-based models gave the best results. Ultimately, I chose Random Forest for my analysis, as it performed best when considering both speed and accuracy.
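A minimal sketch of such a comparison with scikit-learn is shown below, assuming the categoricals are one-hot encoded; the hyperparameters are placeholders, not the values tuned for this project.

```python
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# One-hot encode categoricals; target is the log sale price from the step above
X = pd.get_dummies(df.drop(columns=["SalePrice", "LogSalePrice"]))
y = df["LogSalePrice"]

models = {
    "lasso": Lasso(alpha=0.001),
    "ridge": Ridge(alpha=10),
    "random forest": RandomForestRegressor(n_estimators=500, random_state=42),
    "gradient boosting": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: CV RMSE = {rmse:.3f}")
```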

One of the benefits of the Random Forest model is that it does not overfit the data as the number of trees increases. The drawback of adding more trees is that computation slows down with little return in accuracy. I chose 500 trees for my model, as I did not see an improvement in accuracy when increasing the number of trees past 500.

# of trees    Test RMSE    Kaggle Score
10,000        0.119        0.141
500           0.120        0.141
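Continuing the sketch above, that check could be reproduced roughly as follows; the train/test split and random seed are illustrative assumptions.

```python
import time

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare a modest forest against a much larger one on speed and accuracy
for n_trees in (500, 10_000):
    rf = RandomForestRegressor(n_estimators=n_trees, n_jobs=-1, random_state=42)
    start = time.time()
    rf.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, rf.predict(X_test)))
    print(f"{n_trees} trees: test RMSE {rmse:.3f}, fit time {time.time() - start:.1f}s")
```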

The second model is trained on 12 variables related to the materials and style of the home. Its purpose was to identify areas of focus from a real estate development perspective. This Random Forest achieved an RMSE of 0.243, which is reasonable considering that it excludes many of the most important variables from the first model, since they do not relate directly to home construction.
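A sketch of the restricted model is below; the construction-related columns listed are an assumption, since the post does not enumerate the 12 variables it used.

```python
# Illustrative construction/style features (the post uses 12 such variables)
construction_cols = ["Foundation", "MasVnrType", "HouseStyle", "BldgType",
                     "Exterior1st", "Exterior2nd", "RoofStyle", "GarageType"]

X_constr = pd.get_dummies(df[construction_cols])
rf_constr = RandomForestRegressor(n_estimators=500, random_state=42)
rf_constr.fit(X_constr, y)
```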

Below are the most important features for the home sale price model and the construction model, respectively.
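Feature importances come straight from the fitted forests, for example:

```python
# Top drivers of sale price in the full model and in the construction-only model
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))

constr_importances = pd.Series(rf_constr.feature_importances_, index=X_constr.columns)
print(constr_importances.sort_values(ascending=False).head(10))
```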

Conclusion and Business Recommendations:

Based on the results of my model above, I recommend the following investment plan for home buyers:

  • Invest big, in terms of square feet (not price)
  • Improve quality and condition of the home through renovations
  • Only consider homes with fireplaces
  • Boosting overall home quality from 6 to 10 increases home value by $21,900 on average; the following specific renovations are the most recommended:
    • Improving kitchen quality increases home value by $11,100
    • Improving fireplace increases home value by $3,800
    • Improving basement increases home value by $7,300

I recommend the following home design for construction considerations:

  • Poured concrete or wood foundation
  • Stone masonry veneer
  • Two story home with built-in garage
  • Single-family detached home, or townhouse end unit
  • Cement board or vinyl siding exterior

Finally, I recommend both buying and selling in the summer, when the most homes are on the market and the most buyers are looking. Real estate developers should plan construction with this timeline in mind.
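The seasonality claim can be checked with a simple count of sales per calendar month, using the MoSold column from the Kaggle data:

```python
# Number of recorded sales in each month (1 = January, ..., 12 = December)
print(df.groupby("MoSold")["SalePrice"].count())
```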

Future Work:

One issue with my model is that it tends to overpredict the price of cheaper homes and underpredict the price of more expensive homes. I also plan to consolidate some of the related features in my main model.

About Author

Jessie Wang

Jessie is a graduate from the University of California, Santa Barbara with a degree in Actuarial Science. Upon graduation, she joined UnitedHealth Group as an actuary where she gained a wide array of experience in the healthcare industry....