Using Data to Predict Housing Prices in Ames, Iowa

Posted on Mar 31, 2021

The skills the authors demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Housing Prices in Ames, IA - Background

Many qualities of a residential home determine its worth and price beyond the number of rooms and the square footage. Taking a closer look at 81 residential variables from Ames, Iowa, recorded from 2006 to 2010, we used various machine learning methods to determine which features of a home in this region are important for predicting sale prices. In this regression analysis, the R² value of each regression model helped determine which model best fit the dataset and predicted housing prices.
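Since R² is the yardstick used throughout, here is a minimal sketch of how it is computed, using the standard definition 1 - SS_res / SS_tot. The function name and the sample prices are made up for illustration and are not taken from the Ames data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical sale prices (in thousands of dollars), not real Ames values.
actual = [150.0, 200.0, 250.0, 300.0]
close_fit = [155.0, 195.0, 255.0, 295.0]  # predictions that are off by 5 each
mean_only = [225.0] * 4                   # always predicting the mean price

r2_close = r_squared(actual, close_fit)
r2_mean = r_squared(actual, mean_only)    # a mean-only predictor scores 0
print(r2_close, r2_mean)
```

A value near 1 means the model explains most of the variation in price, while a model that only predicts the average price scores 0.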

About the Housing Prices Data

The dataset includes a range of residential features, from numeric variables such as OverallQual (an ordinal rating) and GrLivArea (a continuous measurement) to various string variables. In all, after combining the test and training data, over 2,900 observations were used for fitting various models.

Exploring the Housing Prices Data

A correlation heatmap of the data reveals which features are most and least correlated with the sale price (SalePrice). A heatmap of the 10 features most correlated with SalePrice holds no great surprise: they are the primary features people look for in a home they are considering buying, so it is no wonder that they most strongly affect sale prices. Converting the full heatmap to a bar chart also shows the features least correlated with the price, such as BsmtFinSF2, BsmtHalfBath, and MiscVal.
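The ranking behind such a heatmap can be sketched with pandas. This is a minimal illustration on a tiny made-up sample that borrows a few Ames column names. It is not the authors' code, and it ranks by signed Pearson correlation, which is an assumption about how "most" and "least" correlated were defined.

```python
import pandas as pd

# Tiny made-up sample using Ames column names; the values are illustrative only.
df = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 4],
    "GrLivArea": [1200, 1500, 1700, 2100, 2600, 1000],
    "MiscVal": [0, 500, 0, 0, 400, 0],
    "Street": ["Pave", "Pave", "Pave", "Pave", "Pave", "Grvl"],
    "SalePrice": [140000, 175000, 210000, 270000, 350000, 110000],
})

# Keep numeric columns only, then correlate every feature with SalePrice.
corr_with_price = (
    df.select_dtypes(include="number")
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

top_features = corr_with_price.head(10)    # most correlated (at most 10)
bottom_features = corr_with_price.tail(3)  # least correlated
print(corr_with_price)
```

String columns such as Street are excluded here because Pearson correlation needs numeric input. Encoding them first would bring them into the ranking.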


As noted, the overall quality (OverallQual) greatly affects the sale price: a box plot of SalePrice by OverallQual shows that the higher the OverallQual, the higher the price.


Data Pre-processing

Missing Data


A number of features had significant amounts of missing data. The row Id and the features with more than 80% missing values, including PoolQC, MiscFeature (with its counterpart, MiscVal), Alley, and Fence, were deleted from the dataset.

Other features with missing values were imputed based on their data type. Categorical features were imputed with “None.” Numerical features were imputed with zeros. Features including LotFrontage, MSZoning, Utilities, Electrical, KitchenQual, SaleType, Functional, Exterior1st, and Exterior2nd were imputed with the mode.

One difference between the simple and advanced datasets is whether GarageCars, GarageArea, and MasVnrArea received a “None” or a zero imputation, as the two imputation methods yielded different R² values.
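The three imputation rules described above could look roughly like this in pandas. The sketch uses a made-up four-row frame, and the assignment of columns to each rule is illustrative: only MSZoning is named in the text as mode-imputed, while GarageType and BsmtFinSF1 are stand-ins for a categorical and a numerical feature.

```python
import numpy as np
import pandas as pd

# Made-up rows; None / NaN mark the missing entries.
df = pd.DataFrame({
    "GarageType": ["Attchd", None, "Detchd", None],  # categorical stand-in
    "BsmtFinSF1": [700.0, np.nan, 0.0, 350.0],       # numerical stand-in
    "MSZoning": ["RL", "RL", None, "RM"],            # mode-imputed per the text
})

# Column groups are listed explicitly here; this grouping is an assumption.
none_cols = ["GarageType"]
zero_cols = ["BsmtFinSF1"]
mode_cols = ["MSZoning"]

df[none_cols] = df[none_cols].fillna("None")  # categorical -> "None"
df[zero_cols] = df[zero_cols].fillna(0)       # numerical -> 0
for col in mode_cols:
    # mode() returns a Series; its first entry is the most frequent value
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```

Switching a column such as GarageArea between the "None" group and the zero group is then a one-line change, which makes it easy to compare the two variants.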

Feature Engineering

To further improve the dataset, several features were transformed to yield more meaningful insight. For example, several quality and condition string features (including ExterQual, ExterCond, BsmtQual, BsmtCond, HeatingQC, KitchenQual, GarageQual, and GarageCond) had their values converted to ordinal numerical values (e.g., {None: 0, Po: 1, Fa: 2, TA: 3, Gd: 4, Ex: 5}). Other categorical variables were converted to ordinal values as well, as seen below:

  • LotShape - {IR3: 1, IR2: 2, IR1: 3, Reg: 4}
  • BsmtExposure - {None: 0, No: 1, Mn: 2, Av: 3, Gd: 4}
  • BsmtFinType1 and BsmtFinType2 - {None: 0, Unf: 1, LwQ: 2, Rec: 3, BLQ: 4, ALQ: 5, GLQ: 6}
  • Functional - {None: 0, Sal: 1, Sev: 2, Maj2: 3, Maj1: 4, Mod: 5, Min2: 6, Min1: 7, Typ: 8}
  • GarageFinish - {None: 0, Unf: 1, RFn: 2, Fin: 3}
  • PavedDrive - {N: 0, P: 1, Y: 2}
  • CentralAir - {N: 0, Y: 1}
  • LandSlope - {Gtl: 1, Mod: 2, Sev: 3}
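Mappings like those listed above can be applied with pandas `Series.map`. The sketch below uses the quality scale and the PavedDrive mapping from the text on a small made-up frame. The variable names are illustrative.

```python
import pandas as pd

# Mappings copied from the lists above.
QUALITY_MAP = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
PAVED_DRIVE_MAP = {"N": 0, "P": 1, "Y": 2}

# Made-up rows for illustration.
df = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex", "Fa"],
    "GarageQual": ["TA", "None", "Gd", "Po"],
    "PavedDrive": ["Y", "N", "P", "Y"],
})

# The same quality scale is shared by several columns.
for col in ["KitchenQual", "GarageQual"]:
    df[col] = df[col].map(QUALITY_MAP)
df["PavedDrive"] = df["PavedDrive"].map(PAVED_DRIVE_MAP)

print(df)
```

Any value missing from a mapping becomes NaN after `map`, so checking for NaN afterwards is a cheap way to catch unexpected category labels.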

Feature Selection

For the linear regression model, an F-test revealed six coefficients that were statistically insignificant. We opted to drop these features from the linear regression model.
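The exact test the authors ran is not shown. One way to screen features with an F-test is scikit-learn's univariate `f_regression`, sketched here on synthetic data. Both the choice of this function and the 0.05 cutoff are assumptions, and a univariate test is not identical to testing coefficients inside a fitted multiple regression.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 200

# Synthetic data: one feature drives the target, the other is pure noise.
informative = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])
y = 3.0 * informative + rng.normal(scale=0.5, size=n)

# One F statistic and one p-value per column of X.
f_stats, p_values = f_regression(X, y)

names = ["informative", "noise"]
alpha = 0.05  # hypothetical significance level
insignificant = [name for name, p in zip(names, p_values) if p >= alpha]
print(dict(zip(names, p_values)), insignificant)
```

Features whose p-value falls at or above the chosen level are the candidates to drop from the linear model.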


Since the F-test also revealed GrLivArea to be statistically significant, outliers in GrLivArea were filtered out for all models.
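The text does not state which cutoff was used for the outliers. As a sketch only, the filter below drops homes above a living-area threshold. The function name, the 4000 square foot default, and the sample rows are all assumptions.

```python
import pandas as pd


def drop_living_area_outliers(df, max_sqft=4000):
    """Keep rows whose GrLivArea is at or below max_sqft (hypothetical cutoff)."""
    return df[df["GrLivArea"] <= max_sqft].reset_index(drop=True)


# Made-up rows: two ordinary homes and two very large ones.
homes = pd.DataFrame({
    "GrLivArea": [1500, 2200, 4700, 5600],
    "SalePrice": [180000, 260000, 185000, 160000],
})

filtered = drop_living_area_outliers(homes)
print(filtered)
```

Very large homes that sold for unusually low prices can pull a regression line toward themselves, which is why a filter of this kind is usually applied before fitting.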


Model Assessment for Housing Prices Data

Linear Models

The SalePrice distribution is right-skewed. To improve model fit, SalePrice was log-transformed to bring its distribution closer to normal. ElasticNet with alpha 0.001 and rho 0.6 performed the best of all the linear models.
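A log-transformed target combined with ElasticNet might be set up as follows in scikit-learn. The sketch runs on synthetic data and assumes that "rho" corresponds to scikit-learn's `l1_ratio` mixing parameter. The use of `log1p`/`expm1` and the train/test split are also assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300

# Synthetic right-skewed "prices": exponential of a linear signal plus noise.
X = rng.normal(size=(n, 5))
coefs = np.array([0.3, 0.2, 0.0, 0.1, 0.0])
price = np.exp(12.0 + X @ coefs + rng.normal(scale=0.1, size=n))

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

# alpha and the mixing ratio are the values quoted in the text
# (rho is assumed to mean l1_ratio).
model = ElasticNet(alpha=0.001, l1_ratio=0.6)
model.fit(X_train, np.log1p(y_train))  # fit on log(1 + price)

pred_log = model.predict(X_test)
pred_price = np.expm1(pred_log)        # back-transform to the price scale
r2_log = r2_score(np.log1p(y_test), pred_log)
print(round(r2_log, 3))
```

Predictions come out on the log scale, so they need the inverse transform before being read as prices.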


Advanced Regression Models

The two advanced regression models tested were random forest and gradient boosting regression. Both algorithms can capture nonlinear relationships that linear regression misses. Random forest uses bagging: it fits many trees on random samples of the data and aggregates their predictions. Gradient boosting builds trees sequentially, each one correcting the errors of those before it, so that weak learners combine into a stronger one. After the best parameters for each model were obtained with grid search cross-validation, gradient boosting yielded the best R² score of all the models tested.
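Grid search cross-validation over both tree ensembles can be sketched with scikit-learn's `GridSearchCV`. The data below is synthetic, and the parameter grids are deliberately tiny, hypothetical choices that keep the example fast. They are not the grids the authors searched.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic nonlinear target so that tree ensembles have something to learn.
X = rng.uniform(-2, 2, size=(300, 4))
y = 2 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical, intentionally small grids.
searches = {
    "random_forest": GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [100], "max_depth": [None, 5]},
        cv=3,
        scoring="r2",
    ),
    "gradient_boosting": GridSearchCV(
        GradientBoostingRegressor(random_state=0),
        param_grid={"n_estimators": [100, 200], "learning_rate": [0.05, 0.1]},
        cv=3,
        scoring="r2",
    ),
}

test_scores = {}
for name, search in searches.items():
    search.fit(X_train, y_train)  # cross-validates each parameter combination
    test_scores[name] = search.score(X_test, y_test)  # R2 of the refit best model
    print(name, search.best_params_, round(test_scores[name], 3))
```

Scoring the refit best estimator on held-out data, as done here, gives the test R² used to compare models.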

Housing Prices Conclusion

As mentioned previously, gradient boosting yielded the best R² score, with a test R² of 0.924, compared to the linear regression and random forest models. Based on the gradient boosting model, the top housing variables for predicting sale price are HeatingQC and Street. In all, feature engineering played a key role in improving the R² score for each model. Simpler regression models, like multiple linear regression, achieved lower R² scores than gradient boosting. Therefore, of the models tested, gradient boosting regression is the best model to use when determining important features that predict house prices.
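A ranking of important variables from a gradient boosting model is typically read from its `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data with placeholder feature names, so it does not reproduce the HeatingQC and Street result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

# Placeholder feature names; feature_a is built to dominate the target.
feature_names = ["feature_a", "feature_b", "feature_c"]
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_names)
y = 5.0 * X["feature_a"] + 0.5 * X["feature_b"] + rng.normal(scale=0.1, size=200)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    gbr.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can vary with hyperparameters and with correlated features, so a check such as permutation importance on held-out data is a common complement.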

Future Directions

To improve the model, we would like to continue exploring methods that can minimize overfitting, including cross-validation and parameter tuning. Since a number of data manipulations remain to be explored, further feature engineering is another possibility, along with improved feature selection for each model.

About Authors

Kristin Teves

Kristin has a Masters in Business Administration with a concentration in Business Analytics from California State University, Fullerton and a Bachelor's in Biological Science from University of California, Irvine. She is excited to combine previous business domain knowledge...

Van Vu

Data science fellow with a background in clinical pharmacy. Demonstrated commitment in reducing hospital readmissions and improving patient health outcomes. Showcases expertise with over ten years of experience in healthcare. Eager to combine clinical expertise and data science...
