Studying Data to Predict Housing Prices in Ames



     The Ames Housing data set supplies sale price information for close to 3,000 homes in Ames, IA, each described by 79 features. Such a feature-rich dataset provides an excellent opportunity to apply machine learning techniques to predict the sale price of houses. The main goal of this project is therefore to explore various predictive models and gain a better understanding of the mechanisms behind them. In the course of finding the model that gives the most accurate results, we hope to acquire deeper insight into the models we used, the dataset itself, and the process as a whole.

Data Exploration and Cleaning

     As a first step toward finding a suitable model to relate the sale price of houses to the other variables, we explored the data to get a better sense of the features present and how to handle them. One of the first things we noticed was that the histogram of the target variable, sale price, is clearly not normally distributed.


Distribution of Sale Price

     The plot has a distinct rightward skew. Such a skew is not surprising: most houses sell for less than the mean price, while a small number of expensive houses stretch the right tail. Given this, before fitting a linear model to sale price we should log-transform it so that the column is much closer to normally distributed.
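The effect of the log transform can be sketched on synthetic data. The example below is only an illustration: the lognormal sample stands in for the real SalePrice column, and the choice of log1p/expm1 is an assumption rather than something the analysis above specifies.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for the SalePrice column (not the real Ames data).
rng = np.random.default_rng(0)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

# log1p computes log(1 + x); expm1 is its inverse, used to turn
# predictions on the log scale back into dollar amounts.
log_price = np.log1p(sale_price)

skew_before = sale_price.skew()
skew_after = log_price.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# Round trip back to the original scale.
recovered = np.expm1(log_price)
```

A skewness near zero after the transform suggests the target is now roughly symmetric, which is the property the linear fit benefits from.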

 Correlation Between Various Features and Sale Prices

   Next, we investigated the correlation between the various features and sale price. The top ten features with a correlation of 0.5 or higher with sale price were plotted, and we noticed that many of these features were highly correlated not only with sale price but also with each other. This is something to keep in mind when we do feature selection and engineering later on.


Correlation Matrix for Top Ten Features with Sale Price
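A correlation screen of this kind can be sketched as follows. The data frame is synthetic: the column names imitate Ames fields, but the values and the relationships between them are made up for illustration, with GarageArea deliberately built to be collinear with GrLivArea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in columns (assumed names, invented values).
living_area = rng.normal(1500, 400, n)
overall_qual = rng.integers(1, 11, n).astype(float)
garage_area = 0.3 * living_area + rng.normal(0, 30, n)  # collinear with living area
noise = rng.normal(0, 1, n)                             # unrelated to price
price = 100 * living_area + 15000 * overall_qual + rng.normal(0, 10000, n)

df = pd.DataFrame({
    "GrLivArea": living_area,
    "OverallQual": overall_qual,
    "GarageArea": garage_area,
    "Noise": noise,
    "SalePrice": price,
})

corr = df.corr()

# Keep features whose absolute correlation with the target is at least 0.5.
target_corr = corr["SalePrice"].drop("SalePrice")
top = target_corr[target_corr.abs() >= 0.5].sort_values(ascending=False)
print(top)

# Correlations among the selected features reveal multicollinearity.
print(corr.loc[top.index, top.index].round(2))
```

The second printout is the part that matters for later feature selection: two features can both pass the 0.5 threshold while carrying nearly the same information.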

     Continuing our investigation of the dataset, we looked for missing values and decided how to deal with them. The first figure shows the percentage of each column that is filled (not null), and the second shows the actual number of missing values in each column. As we can see, several columns have quite a lot of missing values. In general, we classified them based on what the missing value represents and imputed them accordingly.



 The majority of the missing data correspond to “No such feature”, and we imputed ‘None’ or 0 for them. With the other missing values we took a more granular approach. For the missing numerical features, we grouped the data by Neighborhood and then imputed the neighborhood mean or median, whichever seemed more appropriate. This is reasonable because we expect houses in the same neighborhood to share similar features.

For the missing categorical features, we grouped the data similarly but imputed the mode of the Neighborhood instead. With these steps we were able to take care of the majority of the missing values. There were a few special cases that we handled individually, such as “Garage Year Built”, for which we imputed the year the house was built, and kitchen quality, which we scaled according to the overall quality of the house instead of using the neighborhood average.
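The imputation rules described above can be sketched with pandas. The tiny data frame and its column names (LotFrontage, MSZoning, PoolQC, GarageYrBlt) are assumptions chosen for illustration; the choice of median rather than mean for the numerical column is also just one of the two options mentioned.

```python
import numpy as np
import pandas as pd

# Toy data with assumed column names; values are invented.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown", "OldTown"],
    "LotFrontage":  [70.0, np.nan, 80.0, 50.0, 60.0, np.nan],
    "MSZoning":     ["RL", "RL", None, "RM", None, "RM"],
    "PoolQC":       [None, "Gd", None, None, None, None],
    "GarageYrBlt":  [1960.0, np.nan, 1975.0, 1920.0, np.nan, 1935.0],
    "YearBuilt":    [1960, 1968, 1975, 1915, 1925, 1935],
})

# "No such feature": a missing pool quality means the house has no pool.
df["PoolQC"] = df["PoolQC"].fillna("None")

# Numerical feature: fill with the median of the house's neighborhood.
hood_median = df.groupby("Neighborhood")["LotFrontage"].transform("median")
df["LotFrontage"] = df["LotFrontage"].fillna(hood_median)

# Categorical feature: fill with the most frequent value in the neighborhood.
hood_mode = df.groupby("Neighborhood")["MSZoning"].agg(lambda s: s.mode().iloc[0])
df["MSZoning"] = df["MSZoning"].fillna(df["Neighborhood"].map(hood_mode))

# Special case: a missing garage year falls back to the year the house was built.
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])

print(df)
```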

Feature Engineering

     For feature engineering we began by simplifying features. Because many features are related to each other and are highly correlated, as shown in the previous chart, we condensed several groups of related columns into single features. Altogether, we built five new features:

  1. Total Bathrooms: Sum of above-ground full and half baths and basement full and half baths.
  2. House Age: The difference between Year Sold and Year Remodeled.
  3. Remodeled: Binary value indicating whether the remodel year differs from the year the house was built.
  4. Is New: Binary value indicating whether Year Sold equals Year Built.
  5. Neighborhood Wealth: A categorical value (1-4) grouping houses by disparities in their neighborhood's median wealth.

New Engineered Features
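The five features listed above could be built roughly as follows. This is a sketch on invented data: the column names are assumptions, and since the text does not say how neighborhood wealth was measured, the example uses quartiles of the neighborhood median sale price as a stand-in.

```python
import pandas as pd

# Toy data with assumed column names; values are invented.
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "FullBath":     [1, 1, 2, 2, 2, 3, 1, 2],
    "HalfBath":     [0, 1, 0, 1, 1, 1, 0, 0],
    "BsmtFullBath": [0, 0, 1, 0, 1, 1, 0, 1],
    "BsmtHalfBath": [0, 0, 0, 0, 0, 1, 0, 0],
    "YearBuilt":    [1950, 1962, 1990, 2008, 2001, 2005, 1925, 1978],
    "YearRemodAdd": [1950, 1995, 1990, 2008, 2006, 2005, 1980, 1978],
    "YrSold":       [2008, 2009, 2007, 2008, 2010, 2006, 2008, 2009],
    "SalePrice":    [110000, 125000, 190000, 230000, 310000, 350000, 95000, 150000],
})

# 1. Total Bathrooms: plain sum of the four bathroom counts.
df["TotalBath"] = df[["FullBath", "HalfBath", "BsmtFullBath", "BsmtHalfBath"]].sum(axis=1)

# 2. House Age: years between the sale and the last remodel.
df["HouseAge"] = df["YrSold"] - df["YearRemodAdd"]

# 3. Remodeled: 1 if the remodel year differs from the build year.
df["Remodeled"] = (df["YearRemodAdd"] != df["YearBuilt"]).astype(int)

# 4. Is New: 1 if the house was sold in the year it was built.
df["IsNew"] = (df["YrSold"] == df["YearBuilt"]).astype(int)

# 5. Neighborhood Wealth (assumption): quartile of the neighborhood's median sale price.
hood_median = df.groupby("Neighborhood")["SalePrice"].median()
wealth_group = pd.qcut(hood_median, q=4, labels=[1, 2, 3, 4]).astype(int)
df["NeighborhoodWealth"] = df["Neighborhood"].map(wealth_group)

print(df[["TotalBath", "HouseAge", "Remodeled", "IsNew", "NeighborhoodWealth"]])
```

If the grouping is derived from sale price as in this sketch, it should be computed on the training data only, since it otherwise leaks the target into the features.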

Data Transforming / Scaling

     After noting that several of the variables, such as Ground Living Area, showed a mostly linear relation to sale price, we decided to use linear models to fit the dataset. We were unsure whether this would be the best method, but we wanted to give it a try. Because of this choice, we needed to scale our data and transform our categorical values into dummy variables. For this we used Scikit-Learn’s StandardScaler class to scale the data (subtract the mean and divide by the standard deviation) and Pandas’ get_dummies function to one-hot encode our categorical features. As mentioned earlier, we also performed a log transformation on the target variable to bring it closer to a normal distribution.
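A minimal sketch of the scaling and encoding step, using StandardScaler and get_dummies on a toy frame. The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with assumed column names.
df = pd.DataFrame({
    "GrLivArea":   [1200.0, 1500.0, 1800.0, 2400.0],
    "OverallQual": [5.0, 6.0, 7.0, 9.0],
    "MSZoning":    ["RL", "RM", "RL", "FV"],
})
numeric_cols = ["GrLivArea", "OverallQual"]

# Standardize numeric columns: subtract the mean, divide by the standard deviation.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)

# One-hot encode the categorical column into 0/1 dummy variables.
dummies = pd.get_dummies(df[["MSZoning"]], dtype=float)

# Final design matrix for the linear models.
X = pd.concat([scaled, dummies], axis=1)
print(X)
```

In a train/test setting the scaler would normally be fit on the training rows only and then applied to the test rows with `scaler.transform`, so that test statistics do not influence the fit.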


Below is a list of methods we used:

  1. Linear Models
    1. Ridge Regression
    2. Lasso Regression
    3. ElasticNet Regression
    4. Support Vector Machine
  2. Tree based
    1. Random Forest Regression

The results are summarized in the table below. A detailed discussion of each model follows.

Results Summary

Linear Models: Regularized Linear Regressions

     The linear models we tried were regularized models: the Ridge, Lasso, and ElasticNet regressions. Based on the general linear trend between the target and the predictors mentioned earlier, we expected linear models to work well with the data. Since the data had more than 200 features after dummy encoding and we had no exact way to rank them by their importance for predicting the house price, it would have been difficult to use ordinary linear regression. We decided to go directly to regularized regression models so that we could select the meaningful features for the prediction, mitigate overfitting, and overcome multicollinearity problems at the same time.

     Because Lasso and Ridge regressions put constraints on the size of the coefficients associated with each variable, and those coefficients depend on the magnitude of each variable, the standardization mentioned above was necessary. In addition, we removed the outliers that were significantly far off from the linear relationship between the target and some of the main predictors, as shown in the plots below. Although the outlier removal might have caused some information loss, comparing the results before and after removal showed that it did improve the performance of the models.


 The results of the Ridge, Lasso, and ElasticNet models, with the hyperparameters used, are shown below. The hyperparameters, λ and the L1 ratio, were optimized by grid search with the LassoCV/RidgeCV/ElasticNetCV (K=10) classes from the Scikit-Learn package. For model evaluation, 10-fold cross-validation was used for each model.

Linear Model Summary
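The tuning and evaluation procedure can be sketched with the three CV estimators. Everything below runs on synthetic data from make_regression; the alpha grid and the L1 ratios are illustrative assumptions, not the values used in the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the prepared Ames features.
X, y = make_regression(n_samples=200, n_features=30, n_informative=8,
                       noise=10.0, random_state=0)

# Candidate regularization strengths (called alpha in scikit-learn, lambda in the text).
alphas = np.logspace(-3, 2, 20)

models = {
    "ridge": RidgeCV(alphas=alphas, cv=10),
    "lasso": LassoCV(alphas=alphas, cv=10, max_iter=10000),
    "elasticnet": ElasticNetCV(alphas=alphas, l1_ratio=[0.1, 0.5, 0.9],
                               cv=10, max_iter=10000),
}

results = {}
for name, model in models.items():
    # Outer 10-fold CV estimates the error of the whole tune-then-fit procedure.
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_root_mean_squared_error")
    # Refit on all data to read off the selected hyperparameters.
    model.fit(X, y)
    results[name] = {"alpha": model.alpha_, "rmse": -scores.mean()}

for name, res in results.items():
    print(f"{name}: alpha={res['alpha']:.4f}, CV RMSE={res['rmse']:.2f}")
print("elasticnet l1_ratio:", models["elasticnet"].l1_ratio_)
```

When the target is the log of the sale price, the RMSE reported here corresponds to the log-scale error used on the Kaggle leaderboard.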

     In the plots comparing the predictions from the Ridge/Lasso models to the original target, all the models agreed quite well. All of them achieved an R² of around 0.92 and an RMSE of less than 0.12, and they obtained the best Kaggle leaderboard scores of all the models that we ran. As we expected, the target variable, sale price (log-transformed), showed a relatively linear relationship with the predictors. The Ridge model received the best Kaggle leaderboard score, but the other models showed similar performance.


 Coefficient Plots

   We also used the Lasso method to generate the coefficient plots below showing the importance of the different variables. In the plot, the later a variable turns to zero, the more it affects the target. The two variables that most influence the model are Total Square Ft. and Overall Quality. Other important variables are the Ground Living Area, Year Built, and Overall Condition.

     Another way to view feature importance is shown below. The top 20 variables ranked by the magnitude of the coefficients from our best Lasso model are plotted, showing that the same variables (Total Square Ft., Overall Quality, etc.) affect the house sale price the most. One should note that, in general, the size of a coefficient may not be an indicator of feature importance. But since we have scaled all our variables, we can use this metric as a measure of feature importance more readily.

Magnitude of Coefficients for Lasso Fit
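Both views of feature importance can be sketched on synthetic data: ranking standardized Lasso coefficients by magnitude, and tracing the coefficient path to see which variables survive the strongest regularization. The feature names, the generating coefficients, and the alpha values below are all invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, lasso_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic features with assumed names; "Noise" has no effect on the target.
X_raw = pd.DataFrame({
    "TotalSF":     rng.normal(2500, 700, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "YearBuilt":   rng.integers(1900, 2010, n).astype(float),
    "Noise":       rng.normal(0, 1, n),
})

# Synthetic log-price with a strong, a medium, and a weak driver.
y = (12.0
     + 0.30 * (X_raw["TotalSF"] - 2500) / 700
     + 0.20 * (X_raw["OverallQual"] - 5.5) / 2.9
     + 0.05 * (X_raw["YearBuilt"] - 1955) / 32
     + rng.normal(0, 0.05, n)).to_numpy()

# Standardizing puts all coefficients on a comparable scale.
X = StandardScaler().fit_transform(X_raw)

# View 1: rank features by the magnitude of the fitted Lasso coefficients.
lasso = Lasso(alpha=0.01).fit(X, y)
importance = pd.Series(np.abs(lasso.coef_), index=X_raw.columns)
importance = importance.sort_values(ascending=False)
print(importance)

# View 2: coefficient path. lasso_path fits no intercept, so center y first.
alphas, coefs, _ = lasso_path(X, y - y.mean(), alphas=np.logspace(-3, 0, 30))

# Largest alpha at which each coefficient is still non-zero:
# features that hold out longest matter most.
last_alive = pd.Series(
    [alphas[c != 0].max() if np.any(c != 0) else 0.0 for c in coefs],
    index=X_raw.columns,
)
print(last_alive.sort_values(ascending=False))
```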

Support Vector Machine Regression

     After trying the Ridge/Lasso-based linear models, we tried SVM-based regression (SVR) to see whether a different model, combined with parameter tuning methods such as grid search and cross-validation (CV), could give results that were just as good or better. For experimental purposes, we first tested SVR without parameter tuning, then obtained benchmark results with parameter tuning. We recorded the RMSLE benchmark from a 5-fold CV on the training set, the Kaggle leaderboard score, and finally the computational time of each configuration. Below is a table summary:


The results we obtained through SVR showed us several key points. First, the choice of kernel in the SVR model plays a critical role in all three statistics. The linear kernel produced an extremely small RMSLE even before parameter tuning, indicative of severe overfitting, while the Gaussian kernel had a relatively large RMSLE but actually showed an improved Kaggle score over the linear kernel.

     The next trend we observed was that, in general, SVR training time increases by at least one order of magnitude when GridSearchCV is used as the parameter tuning framework. This suggests that in projects involving larger datasets, it is advisable to first run the model without parameter tuning as a benchmark, since the relative performance of the different kernels before tuning corresponds well with their performance after tuning. For example, the RBF (Gaussian) kernel achieved the best Kaggle leaderboard result both before and after tuning. Conversely, the Poly and Sigmoid kernels performed poorly both before and after tuning.

     The last conclusion we can draw is that the 5-fold CV benchmark on the training set for the different kernels is a good indicator of their Kaggle performance. If a model performs well under the 5-fold CV benchmark, it is likely to perform well on the test set as well.
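The kernel comparison can be sketched as a loop that records the 5-fold CV error and wall-clock time with and without GridSearchCV. The data are synthetic and the parameter grid is a small illustrative assumption; with a standardized or log-scaled target the RMSE below plays the role of the RMSLE in the text.

```python
import time

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data standing in for the prepared Ames features.
X, y = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=0)
y = (y - y.mean()) / y.std()  # keep the target on a small scale, like a log price

# Illustrative grid; the default C=1, epsilon=0.1 is included on purpose.
param_grid = {"svr__C": [0.1, 1, 10], "svr__epsilon": [0.01, 0.1]}

summary = {}
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    pipeline = make_pipeline(StandardScaler(), SVR(kernel=kernel))

    # Benchmark without tuning: default hyperparameters, 5-fold CV.
    start = time.perf_counter()
    scores = cross_val_score(pipeline, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    untuned_time = time.perf_counter() - start

    # Same kernel with a grid search over C and epsilon.
    grid = GridSearchCV(pipeline, param_grid, cv=5,
                        scoring="neg_root_mean_squared_error")
    start = time.perf_counter()
    grid.fit(X, y)
    tuned_time = time.perf_counter() - start

    summary[kernel] = {
        "untuned_rmse": -scores.mean(),
        "tuned_rmse": -grid.best_score_,
        "best_params": grid.best_params_,
        "untuned_time": untuned_time,
        "tuned_time": tuned_time,
    }

for kernel, row in summary.items():
    print(kernel, {k: (round(v, 4) if isinstance(v, float) else v)
                   for k, v in row.items()})
```

Because the grid multiplies the number of fits by the number of parameter combinations, the tuned timing grows accordingly, which is the cost the paragraph above refers to.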

Random Forest

     Due to the high number of categorical features, we felt the next best course of action would be to train a Random Forest model because of its inherent resilience to non-scaled and categorical features. Even though it may not be the most time-efficient process, we implemented a grid search with cross-validation to tune the hyperparameters. We started with a fairly coarse grid search over large gaps in the parameters and ended with a very fine search to home in on the best parameters.

     To test the usefulness of these hyperparameters, we also modeled a base random forest estimator using just 10 trees and default settings otherwise. With this base estimator we achieved an accuracy of 99.20% with an average error rate of 0.0954. Our tuned model achieved an accuracy of 99.26% with an average error rate of 0.0888. With only a 0.06 percentage point increase in model accuracy, in most cases it would not have been worth spending the time on tuning, especially for large datasets. This shows that our hyperparameter optimization process is not as efficient as it could be.
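The comparison between a 10-tree baseline and a grid-searched forest can be sketched as below. The data are synthetic, the grid is a deliberately coarse assumption, and mean absolute error on a held-out split is used as a stand-in metric, since the text does not define how its accuracy figure was computed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the prepared Ames features.
X, y = make_regression(n_samples=300, n_features=12, n_informative=6,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Baseline: 10 trees, everything else left at the defaults.
base = RandomForestRegressor(n_estimators=10, random_state=0)
base.fit(X_train, y_train)

# Coarse grid (illustrative values); a finer grid around the winner would follow.
coarse_grid = {
    "n_estimators": [50, 150],
    "max_depth": [None, 8],
    "max_features": [0.3, 1.0],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), coarse_grid,
                      cv=3, scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

base_mae = mean_absolute_error(y_test, base.predict(X_test))
tuned_mae = mean_absolute_error(y_test, search.best_estimator_.predict(X_test))

print("best params:", search.best_params_)
print(f"baseline MAE: {base_mae:.2f}, tuned MAE: {tuned_mae:.2f}")
```

Comparing the two errors against the extra fitting time gives a direct way to judge whether the tuning effort pays off on a given dataset.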

Looking Forward / Summary

     As we completed our analysis of the dataset, we thought of ways to improve our model. One idea that we discussed but did not have time to implement was to perform some sort of classification before doing the modeling. We could add our own classes or groupings as variables and check feature importance to see if and how our models changed based on these new variables. This classification could even be done with unsupervised methods such as clustering, to discover hidden groupings within the data and use them as new variables. Finally, we could have used ensemble methods to combine our models to obtain the best results.
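The clustering idea was not implemented in this project, so the following is only a sketch of what it might look like: k-means labels computed on scaled features and appended as dummy variables. The two synthetic groups of houses, the column names, and the choice of two clusters are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two invented groups of houses: older and smaller vs. newer and larger.
features = pd.DataFrame({
    "GrLivArea": np.concatenate([rng.normal(1200, 100, 50),
                                 rng.normal(2600, 150, 50)]),
    "YearBuilt": np.concatenate([rng.normal(1950, 8, 50),
                                 rng.normal(2003, 4, 50)]),
})

# K-means is distance based, so scale the features first.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
features["Cluster"] = kmeans.fit_predict(scaled)

# Cluster ids are labels, not quantities, so one-hot encode them.
X = pd.get_dummies(features, columns=["Cluster"], dtype=float)
print(X.head())
```

The resulting dummy columns could then be passed to any of the models above, and their coefficients or importances inspected to see whether the discovered groupings add predictive value.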

     In conclusion, this is a basic analysis of the dataset using relatively rudimentary modeling techniques. Given the relative simplicity of the data, despite the large number of features, it is not surprising that we obtained the best results with our linear models. With more time, and now a greater understanding of what other modeling processes are out there, we feel that a much more in-depth analysis and subsequent modeling process can be done.

About Authors

Arthur Yu

I have a PhD in Physics from UC Irvine, following my B.A. degree in Physics and Mathematics from NYU. My main interests in data science are understanding and improving machine learning models and algorithms. I hope to learn...

Will Thurston

Will is currently a student at New York City Data Science Academy. He graduated from Rochester Institute of Technology with a BS in Computer Security in 2016. He then spent the following year as a Network Engineer gaining...

Jason Lai

I graduated from NYU with a bachelor's degree in Math.
