Predicting House Prices with Machine Learning Methods

Posted on Aug 21, 2020
The skills I demoed here can be learned through taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.

LinkedIn | GitHub | Email | Data | Web App

Background

The goal of this project was to compete in an introductory Kaggle competition to predict the price of houses using the Ames Housing data set. This data set consists of just about every feature you can think of to describe homes sold in Ames, Iowa. It has a total of 79 predictive variables with a mix of categorical and numerical features. Many features are highly related, with 9 variables alone describing different aspects of the basement.

The competition evaluates submissions on RMSLE (Root Mean Squared Log Error), which compares the log of predicted and actual sale prices so that errors on cheap and expensive houses impact the score equally. Building good predictive models therefore requires both capable regression algorithms and a solid understanding of how the features relate to sale price and influence the models.
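As a rough illustration of the metric, RMSLE can be computed as below (a minimal sketch in NumPy; the prices are made up and only show that the same relative error is penalized equally at either end of the price range):

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Root Mean Squared Log Error: differences are taken on the log scale,
    # so the metric penalizes relative (percentage) errors rather than dollar errors
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# A 10% overprediction scores the same whether the house is cheap or expensive
print(rmsle(np.array([100_000]), np.array([110_000])))  # ~0.095
print(rmsle(np.array([500_000]), np.array([550_000])))  # ~0.095
```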

Iterative Process

This project followed an iterative process: explore the data to understand what it looks like, conduct initial data processing based on intuition about what might improve the models, then build and score initial models. After seeing how the models performed, I returned to the data processing step to improve it, rebuilt the models, and checked how much they improved. Finally, I built ensembles of the final models using two different methods (averaging and stacking) and submitted the final predictions for evaluation.

Data Exploration

Goals:

  • Look at relationships between features and the target variable
  • Find variables with a strong correlation with the target variable
  • Explore multi-collinearity between predictor variables
  • Identify opportunities for feature engineering or feature selection
  • Explore the shape of the data

The data exploration stage is an opportunity to get a feel for the relationships between the features and the target variable and to find features strongly correlated with it. It is also the time to evaluate possible feature engineering options and features that could be dropped. Finally, this stage informs the data processing stage about how the data needs to be prepared before building models.

Above is a correlation matrix of the five features most positively correlated with sale price. From it, we can see that multi-collinearity is already evident in this subset of the features. In general, many of the predictors in the data set are related to one another, and features that describe different aspects of the same house amenity have especially strong relationships.
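As a sketch of how that exploration might look in pandas (assuming the Kaggle training file train.csv with its SalePrice column; the top-5 selection mirrors the matrix described above):

```python
import pandas as pd

train = pd.read_csv("train.csv")  # Kaggle Ames training data

# Correlation of every numeric feature with the target, strongest first
numeric = train.select_dtypes(include="number")
corr_with_target = (
    numeric.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
)
top5 = corr_with_target.head(5).index.tolist()

# Correlation matrix among the top 5 predictors, to check for multi-collinearity
print(train[top5 + ["SalePrice"]].corr())
```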

A handful of variables had a significant skew to the right, with a long tail of high values, as can be seen in the above histogram of sale prices and the below histogram of lot area. In the final models, the log of variables like these was used so that high values would introduce less bias into the models.
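For example, the transformation can be as simple as applying np.log1p to the skewed columns (a sketch assuming the training DataFrame `train` from above; SalePrice and LotArea follow the Kaggle column names):

```python
import numpy as np

# Log-transform the target and heavily right-skewed predictors so that a
# handful of very large values exert less leverage on the fitted models
train["SalePrice_log"] = np.log1p(train["SalePrice"])
train["LotArea_log"] = np.log1p(train["LotArea"])

# Note: predictions made on the log scale are mapped back to dollars with
# np.expm1 before building a Kaggle submission
```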

Categorical variables were tougher to explore en masse, but they made up about half the data set. The above graphs show the two categorical features with the strongest relationship to sale price (as determined by feature importance exploration in the later intermediary models).

Outliers are another source of potential bias in the final models. While processing the data, I decided to remove very large houses (ground living area > 4,000 sq ft) and houses that sold for very high prices (sale price > $700,000) to reduce bias. This removed only 4 observations from the data set.
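In pandas terms, that filter might look like the following sketch (thresholds as stated above, applied to the raw, untransformed prices; GrLivArea and SalePrice are the corresponding Kaggle column names):

```python
# Keep only houses below the size and price thresholds
mask = (train["GrLivArea"] <= 4000) & (train["SalePrice"] <= 700_000)
print(f"Removing {(~mask).sum()} outliers")
train = train[mask].reset_index(drop=True)
```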

Data Processing

This phase of the project involved the following major steps:

  • Categorical variables: dummify / encode (see the sketch after this list)
  • Handle missingness
  • Feature engineering
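
A minimal sketch of the first two steps, assuming pandas, the training DataFrame `train`, and the Kaggle column names (the actual fill rules varied by column):

```python
import numpy as np
import pandas as pd

# For many Ames columns, NA simply means the feature is absent (no basement,
# no garage, no fireplace), so a sentinel category is more honest than imputing
for col in ["BsmtQual", "GarageType", "FireplaceQu"]:
    train[col] = train[col].fillna("None")
# A truly missing numeric value gets a simple median fill
train["LotFrontage"] = train["LotFrontage"].fillna(train["LotFrontage"].median())

# Dummify categorical variables: one indicator column per category level,
# which is how 79 raw variables expand into 250+ model inputs
X = pd.get_dummies(train.drop(columns="SalePrice"), drop_first=True)
y = np.log1p(train["SalePrice"])  # log target, matching the RMSLE metric
```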

Once the initial models were built and evaluated, I went back to the data processing stage. I found the following additional steps beneficial for improving the models:

  • Feature Selection for important variables
  • Removing outliers
  • Transform skewed variables

Models

The models built to predict housing price included MLR, Penalized Regression, Random Forest, Gradient Boosting, and XGBoost. Overall, XGBoost performed the best, narrowly beating even the ensemble models. Below I compare the results from the different models.

Linear Regression Models

  • Checking the assumptions of MLR confirmed the data was a reasonable fit for a linear regression task
  • Required dummification: 79 variables expand into 250+ model inputs
  • Penalization reduced the effects of multi-collinearity on the model
  • Best performance came from ridge regression (see the sketch below)
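
A sketch of the ridge setup (assuming scikit-learn, the dummified matrix X from the processing step, and the log-transformed target y; the alpha grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize so the L2 penalty treats all 250+ dummified columns on the same
# scale, then let RidgeCV choose the penalty strength by cross-validation
ridge = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-2, 3, 50)),
)
ridge.fit(X, y)  # y = log(SalePrice), so squared error on y mirrors RMSLE
print(ridge.named_steps["ridgecv"].alpha_)  # chosen regularization strength
```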

Tree Based Models

  • No dummification required
  • Provide feature importance measures for interpretation
  • Built a custom scoring method to use RMSLE as the grid search / cross-validation tuning metric (sketch below)
  • This gave marginal improvement in CV results but vastly improved how robust the models were when submitted to Kaggle
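
One way to wire RMSLE into scikit-learn's grid search is make_scorer; a self-contained sketch (the estimator, parameter grid, and the assumption that X and y hold the features and the untransformed sale price are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def rmsle(y_true, y_pred):
    # clip protects the log against any negative predictions
    y_pred = np.clip(y_pred, 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# greater_is_better=False because grid search maximizes the score,
# while a lower RMSLE is better
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [300, 600], "max_depth": [None, 12]},
    scoring=rmsle_scorer,
    cv=5,
)
grid.fit(X, y)          # X, y: features and (untransformed) sale price
print(-grid.best_score_)  # best cross-validated RMSLE
```

If the target has already been log-transformed, a plain RMSE scorer on the log scale is equivalent.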

Submission Results

The Averaging Ensemble model had the best performance on my held-out test set, but it was narrowly outperformed by the XGBoost model on the Kaggle leaderboard. Averaging can help reduce the overfitting of any one model. A Stacking Ensemble, in theory, builds on the different strengths of the base models; its poorer performance here suggests to me that the base models rely on similar information to predict values.
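The two approaches can be sketched roughly as follows (scikit-learn's StackingRegressor is one way to implement stacking; the base models, hyperparameters, and the split names X_train, y_train, X_test are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from xgboost import XGBRegressor

base_models = [
    ("ridge", RidgeCV()),
    ("rf", RandomForestRegressor(n_estimators=500, random_state=42)),
    ("xgb", XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=42)),
]

# Averaging ensemble: fit each base model and average their predictions
avg_pred = np.mean(
    [model.fit(X_train, y_train).predict(X_test) for _, model in base_models],
    axis=0,
)

# Stacking ensemble: a meta-learner is trained on the base models'
# cross-validated predictions and combines them
stack = StackingRegressor(estimators=base_models, final_estimator=RidgeCV(), cv=5)
stack.fit(X_train, y_train)
stack_pred = stack.predict(X_test)
```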

The best Kaggle score, from the XGBoost model, placed in the top 25% of the leaderboard (position #1225).

The below graph shows the feature importances in the XGBoost model. The 'Overall Quality' feature was the single most important predictor in every model. Other important variables include additional quality-related and size-related features. Interestingly, most of these variables also showed the highest correlation with sale price in the initial data exploration.
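These importances can be read straight off the fitted model; a sketch (assuming a fitted XGBRegressor named xgb_model and the dummified feature matrix X from earlier):

```python
import pandas as pd

# Relative importance of each dummified feature, largest first
importances = pd.Series(xgb_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```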

Conclusion

  • There were diminishing returns on subsequent models, especially on tuning a single model for better performance
  • All models performed better on the unseen/set aside test set than on predicting the Kaggle test data set
  • Improvements may include more experimenting with how changes in data processing influence model performance.
  • Additionally, there were more model types and ensembling variations I was interested in applying to this data set that realistic constraints did not allow for.

About Author

Brenna Botzheim

Brenna Botzheim is an associate EOV Analyst at StormGeo. Brenna holds a Bachelor's degree from San Francisco State University where she studied sociology and mathematics. In her spare time, Brenna continues to develop her skills in statistical data...
