Home Sales The Smart Choice: Predicting Housing Prices

Posted on Apr 2, 2020
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.


The goal of this project was to predict sale prices for houses sold in Ames, IA between 2006 and 2010. The data consisted of approximately 80 housing characteristics and 1,460 observations. I completed the project with two teammates.

Together, we decided to use linear regression to maintain model interpretability. Our objective was to provide data-informed insights to realtors, prospective home buyers, and home sellers in Ames, IA. Beyond predicting housing prices, our team analyzed the features that most affect them. This gives our target audience useful information when buying or selling homes in the area.

In this post I will give a brief background on Ames, then discuss our exploratory analysis, feature selection process, and model results. Finally, I will conclude with ideas for future work on this project.


Home Background

In 2010, according to the United States Census Bureau (2010), Ames, IA had a population of approximately 59,000 individuals. The median population age was 23.8 and about 29% of people were aged 20 to 24. Most people identified as white (84.5%), followed by Asian (8.8%), Black or African American (3.4%), or other (3.3%). Additionally, there were about 23,000 households with an average household size of 2.25 and about 24,000 total housing units.

Ames’ motto is the Smart Choice (City of Ames, 2020). Several factors make Ames a smart choice of place to live. According to CNN Money (2008, 2010), Ames had low unemployment, low personal and property crime, short commute times, and great air quality. These, among other factors, led the magazine to rank Ames among the top 100 best small cities to live in for its 2008 and 2010 publications. The largest employer in the city is Iowa State University, which accounted for approximately 20.4% of total employment. The university could also explain why 29% of residents were aged 20 to 24, since many people come to Ames to attend it.


Home Analysis

Going into the project, we chose linear models both to predict housing prices and to assess feature importance, so that results would be easy for our clients to interpret. One major assumption of linear models is that the residuals are normally distributed with constant variance, and a strongly skewed target tends to violate it. The distribution of home sale price shows a clear right skew. To correct this, we took the log of sale price. The log transform also helps control the fanning seen in sale prices above the average price. As a result, the model comes closer to meeting its assumptions, which generally improves predictive accuracy.
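To make the effect of the log transform concrete, here is a minimal sketch. It uses synthetic, right-skewed prices rather than the Ames data, and the `skewness` helper and the distribution parameters are my own assumptions for illustration.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (about 0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# Synthetic, right-skewed "sale prices" (illustrative only, not the Ames data).
rng = np.random.default_rng(0)
sale_price = rng.lognormal(mean=12.0, sigma=0.4, size=1460)

# The transform applied to the target before modeling.
log_price = np.log(sale_price)

raw_skew = skewness(sale_price)
log_skew = skewness(log_price)
print(f"skew of raw price: {raw_skew:.2f}")
print(f"skew of log price: {log_skew:.2f}")
```

Predictions made on the log scale can be converted back to dollars with `np.exp`.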


Because of the large number of housing features, I will not discuss all of them here. Instead, I will focus on a select few for the exploratory analysis, feature selection, and feature engineering sections. Another important assumption of linear models is that the features are not strongly correlated with one another (no multicollinearity). With a large number of features, there is a good chance that several are correlated, which makes coefficient estimates unstable and harder to interpret.

Three variables captured square footage for the home: first floor square feet, second floor square feet, and basement square feet. These variables were one example of multicollinearity in the data set. To satisfy this assumption of linear models, we combined the three into a single square footage variable: total square feet.
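Combining the three area variables can be sketched in a few lines of pandas. The column names below follow the public Kaggle version of the Ames data and are an assumption on my part, as are the three toy rows.

```python
import pandas as pd

# Three toy rows; the column names are assumed, not taken from our project code.
homes = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
    "TotalBsmtSF": [856, 1262, 920],
})

sf_columns = ["1stFlrSF", "2ndFlrSF", "TotalBsmtSF"]

# Collapse the three related area columns into one feature,
# then drop the originals so they do not enter the model.
homes["TotalSF"] = homes[sf_columns].sum(axis=1)
homes = homes.drop(columns=sf_columns)

print(homes)
```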

The scatterplot of total square feet against log sale price shows a clear linear trend; however, there appear to be a few outliers with high square footage but low sale price. Linear regression is sensitive to outliers because it minimizes squared errors, so points far from the general trend pull the fitted line disproportionately.


Examining Outliers in Home Data

We examined outliers in our features using Cook’s distance and decided as a group which ones to remove. It is important to note that with a larger data set, outliers could be excluded from model training without sacrificing much valuable information. With a smaller number of observations, however, this has to be considered more carefully, because removing too many observations can leave a model that fits the remaining data well but generalizes poorly.
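For readers unfamiliar with Cook’s distance, the sketch below computes it directly with NumPy on synthetic data with one planted outlier (a very large, cheap home). The `cooks_distance` helper and the 4/n cutoff are assumptions for illustration; 4/n is a common rule of thumb, not necessarily the threshold our group used.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each row of an OLS fit of y on X (intercept added here)."""
    X = np.column_stack([np.ones(len(X)), X])      # design matrix with intercept
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients
    residuals = y - X @ beta
    # Leverage: diagonal of the hat matrix H = X @ pinv(X)
    leverage = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    mse = residuals @ residuals / (n - p)
    return (residuals ** 2 / (p * mse)) * leverage / (1.0 - leverage) ** 2

# Synthetic data (illustrative only): log price rises linearly with square footage.
rng = np.random.default_rng(1)
total_sf = rng.uniform(1000, 4000, size=200)
log_price = 11.0 + 0.0004 * total_sf + rng.normal(0, 0.1, size=200)

# Plant one outlier: very high square footage, low price.
total_sf[0], log_price[0] = 7500.0, 11.5

distances = cooks_distance(total_sf.reshape(-1, 1), log_price)
threshold = 4 / len(total_sf)                      # rule-of-thumb cutoff (assumed)
flagged = np.flatnonzero(distances > threshold)
print("rows flagged for review:", flagged)
```

Flagged rows are candidates for review, not automatic removals, which matches how we treated them as a group decision.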

Additional feature selection and feature engineering relied on examining variable correlations, forward and backward selection scored by AIC or BIC, and lasso regression. However, to examine feature importance using penalization, variables had to be turned into a format the model could accept. This included making some variables binary, combining other features to reduce multicollinearity, removing outliers in accordance with Cook’s distance, and one-hot encoding categorical variables.
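The binary and one-hot encoding steps might look like the sketch below. The column names and values are assumed for illustration and are not our exact preprocessing code.

```python
import pandas as pd

# Toy rows with assumed column names (illustrative only).
homes = pd.DataFrame({
    "CentralAir": ["Y", "N", "Y"],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes"],
})

# Binary flag: 1 if the home has central air, else 0.
homes["HasCentralAir"] = (homes["CentralAir"] == "Y").astype(int)

# One-hot encode the categorical column. drop_first=True omits one level
# so the dummy columns are not perfectly collinear with the intercept.
encoded = pd.get_dummies(
    homes.drop(columns="CentralAir"),
    columns=["Neighborhood"],
    drop_first=True,
    dtype=int,
)

print(encoded)
```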

Some notable important features resulting from the lasso penalization included:

  • overall quality,
  • total square feet, and
  • total baths
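As a rough illustration of how lasso penalization singles out features, the sketch below fits scikit-learn’s `Lasso` to synthetic data in which only two of five columns affect the target. The feature names, the data, and `alpha=0.05` are assumptions chosen for the example, not our tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 500

# Synthetic features: two that drive the target and three pure-noise columns.
X = rng.normal(size=(n, 5))
feature_names = ["overall_quality", "total_sf", "noise_1", "noise_2", "noise_3"]
y = 0.30 * X[:, 0] + 0.20 * X[:, 1] + rng.normal(0, 0.1, size=n)

# Standardize first so the penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty shrinks weak coefficients all the way to zero.
lasso = Lasso(alpha=0.05).fit(X_scaled, y)

selected = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0.0]
print("features kept by the lasso:", selected)
```

In practice the penalty strength would be chosen by cross-validation rather than fixed by hand.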

Ultimately, this reduced our feature space from 80 variables (79 if sale price is excluded) to 21 features. A smaller feature space helps ensure the model does not overfit the training data. An overfit model performs well on the data it was trained on but poorly on test and other unseen data.


Home Model Results

After preprocessing the data, it was time to fit the model. As mentioned previously, linear regression was chosen for ease of interpretation. Scikit-learn’s linear regression was used to fit and test the model. We split the training data into train and test sets, using 80% of the data for training and 20% for testing. The initial fit produced a test R squared of 89.4%. In other words, the model accounted for about 89% of the variance in the target (log sale price).
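The split-and-fit workflow can be sketched as follows. The data here is synthetic, and the coefficients and `random_state` are arbitrary assumptions, so the scores will not match the ones reported in this post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and log sale price.
rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 3))
y = 12.0 + X @ np.array([0.25, 0.15, 0.10]) + rng.normal(0, 0.1, size=n)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# .score returns R squared: the share of target variance the model explains.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}  test R^2: {test_r2:.3f}")
```

Similar train and test scores suggest the model is not overfit, whereas a large gap would suggest that it is.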

Furthermore, when we refit the model with the statsmodels regression package, a few variables did not warrant inclusion in the final model. The three variables removed were bedrooms above grade, home age, and lot frontage. Each was excluded because the t-test on its coefficient had a p-value greater than 5%. The final model was refit after dropping these three features. The final training and test scores were both about 89%, indicating the model is not overfit.
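The coefficient t-tests that statsmodels reports can be reproduced by hand, which shows what the p-values measure. The sketch below uses NumPy and SciPy on synthetic data; the `ols_p_values` helper and the feature names are hypothetical, and the third feature is constructed to have no real effect.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided p-values for OLS coefficients (intercept first)."""
    X = np.column_stack([np.ones(len(X)), X])
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - p)          # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), df=n - p)

# Synthetic data (illustrative only); the third column has no effect on y.
rng = np.random.default_rng(4)
n = 400
X = rng.normal(size=(n, 3))
y = 12.0 + 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.1, size=n)

names = ["total_sf", "overall_quality", "unrelated"]
p_values = ols_p_values(X, y)[1:]                     # skip the intercept

# Keep features whose p-value is at most 0.05. A feature with no real
# effect usually fails this test, though it can pass by chance.
keep = [name for name, p in zip(names, p_values) if p <= 0.05]
print(dict(zip(names, np.round(p_values, 4))))
```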

To check the assumptions of our model, we examined a few diagnostic visualizations, provided below. A quantile-quantile (Q-Q) plot was used to confirm that residuals were approximately normally distributed. Prediction errors were analyzed to confirm that residuals were evenly spread and that the relationship between predicted and actual sale price was linear. There was also no discernible pattern in the residuals that would suggest the errors are not independent.
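The numbers behind a Q-Q plot can be produced with SciPy’s `probplot`, as in this minimal sketch on synthetic residuals (not our model’s residuals). A correlation `r` close to 1 means the ordered residuals line up with the theoretical normal quantiles.

```python
import numpy as np
from scipy import stats

# Synthetic data (illustrative only) and a simple OLS fit.
rng = np.random.default_rng(6)
n = 300
x = rng.uniform(1000, 4000, size=n)
y = 11.0 + 0.0004 * x + rng.normal(0, 0.1, size=n)

design = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ beta

# probplot pairs the sorted residuals with theoretical normal quantiles
# and fits a line through them; r is the correlation of those Q-Q points.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)
print(f"Q-Q correlation: {r:.4f}")
```

Plotting `ordered_resid` against `theoretical_q` gives the familiar Q-Q plot, where points hugging a straight line support the normality assumption.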




Additionally, the variance inflation factor (VIF) was examined to confirm that no problematic multicollinearity remained among the features. As a rule of thumb, a VIF of five or less is generally accepted as indicating no serious multicollinearity. All features had a VIF of less than five.
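A VIF check can be sketched with NumPy alone, using the identity that each feature’s VIF equals the corresponding diagonal entry of the inverse correlation matrix. The data and helper name below are assumptions for illustration; the example also shows how combining two nearly duplicate area columns brings the VIF back under five.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column: the diagonal of the inverse feature correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic features (illustrative only).
rng = np.random.default_rng(5)
n = 500
first_floor = rng.normal(1200, 300, size=n)
basement = first_floor + rng.normal(0, 60, size=n)   # nearly duplicates first_floor
quality = rng.normal(6, 1.5, size=n)

# Separate area columns: the two correlated ones get large VIFs.
vifs = variance_inflation_factors(
    np.column_stack([first_floor, basement, quality])
)

# Combined area column: the collinearity is gone.
vifs_combined = variance_inflation_factors(
    np.column_stack([first_floor + basement, quality])
)

print("separate:", np.round(vifs, 1))
print("combined:", np.round(vifs_combined, 2))
```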

Finally, the features the model indicated as having a positive impact on sale price were house square footage, overall home quality, the number of high-quality features in the home, and bathrooms. The high-quality feature count was driven by kitchen and basement finish quality. Features that impacted sale price negatively were lack of air conditioning and home age. These are features that homeowners and buyers would want to consider when negotiating sale price and home value.



To summarize, Ames, the Smart Choice city, is a great place to live thanks to several social and economic factors that helped it earn a top 100 best places to live designation. To provide data-informed insights to realtors, home buyers, and home sellers, we analyzed housing sales from 2006 to 2010. Linear regression was chosen to maintain model interpretability for the target audience. The model suggested that important features affecting sale price included home square footage, home quality, and bathrooms, to name a few.


Future Work on Home Prices

Future work for this project includes additional feature selection and feature engineering. One feature to consider is the median income of each neighborhood. This data would have to be collected for each year and merged with the existing data. Non-linear models could also be explored to see whether they provide greater insight into important features or better performance. The trade-off with other models is increased complexity, which would have to be weighed when discussing model performance with the target audience.


City of Ames. (2010). Comprehensive annual financial report: For year ending June 30, 2010. Retrieved from https://www.cityofames.org/Home/ShowDocument?id=11004

City of Ames. (2020). About us. Retrieved from https://www.cityofames.org/about-ames/about-ames

CNN Money. (2008). Best places to live: Money’s list of America’s best small cities. Retrieved from https://money.cnn.com/magazines/moneymag/bplive/2008/states/IA.html

CNN Money. (2010). Best places to live: Money’s list of America’s best small cities. Retrieved from https://money.cnn.com/magazines/moneymag/bplive/2010/snapshots/PL1901855.html

United States Census Bureau. (2010). American fact finder. Retrieved from https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=C

About Author

Tyler Kotnour

Tyler Kotnour graduated with his Master of Public Administration degree in 2018. Upon graduation, Mr. Kotnour worked in consulting conducting research and program evaluation. His primary role involved analyzing public health data for governments and non-profits. A major...