Machine Learning: Housing Data

Heather Kleypas
Posted on Feb 12, 2019


  1. Introduction
  2. Exploratory Data Analysis
  3. Data Transformations / Feature Engineering
  4. Modeling
  5. Results and Future Improvements


1. Introduction

When the time comes to buy or sell your first home, there are many factors to research and consider. A home is one of the most expensive investments in a person's life, so picking the right one is crucial. When selling, it would be beneficial to know which features of the home to improve in order to maximize its resale value. 

What are some factors that may impact the price of a home? Square footage, location, and a garage are some that initially come to mind. Although there are many others, the goal of this machine learning project is to analyze 79 explanatory variables and predict the sale price using a dataset of over 1,450 houses sold in Ames, Iowa.

The plan is to explore the data, drop multicollinear and insignificant features, impute missing values, create new features, and then apply different machine learning models to the cleaned dataset.

2. Exploratory Data Analysis

The dataset is sourced from a Kaggle competition and comes in two files: a training set and a testing set. Starting off, I studied the correlations between the variables and Sale Price using a heatmap. Some pairs are naturally correlated, like GarageCars and GarageArea, while others are correlated by deduction, such as TotalBsmtSF and 1stFlrSF. In the next section, I transform the variables most correlated with Sale Price and drop some of the least correlated ones.
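The correlation ranking can be sketched as follows. This is a minimal example on a hypothetical stand-in frame (the column names mirror the Kaggle data, but the values are synthetic), not the author's exact code:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle training frame.
rng = np.random.default_rng(0)
area = rng.uniform(500, 4000, 200)
df = pd.DataFrame({
    "GrLivArea": area,
    "GarageCars": (area // 1000).astype(int),
    "SalePrice": area * 100 + rng.normal(0, 5000, 200),
})

# Correlation of every numeric column with SalePrice, strongest first.
corr = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)
print(corr)
```

The heatmap itself is one more line on the full matrix, e.g. `sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm")`.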

Next, I want to see whether any features are skewed, using a pairplot. Features like SalePrice, TotalBsmtSF, YearBuilt, and TotRmsAbvGrd are clearly skewed and will be transformed in the next section. 
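Skewness can also be quantified directly rather than eyeballed from the pairplot. A small sketch on synthetic data (the lognormal column stands in for a right-skewed feature like TotalBsmtSF; the 0.75 threshold is a common rule of thumb, not from the original post):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    # Lognormal values mimic a right-skewed size feature.
    "TotalBsmtSF": np.exp(rng.normal(7, 0.5, 500)),
    # Uniform years are roughly symmetric.
    "YearBuilt": rng.integers(1900, 2010, 500),
})

# Rank features by skewness; |skew| > 0.75 is a common transform cutoff.
skewness = df.skew().sort_values(ascending=False)
print(skewness)
```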

3. Data Transformations / Feature Engineering

  • Log Transformations

The first order of business is to transform the target variable, Sale Price, by taking its log and then separating it from the training features. 
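That step is a one-liner with `log1p`, which is exactly invertible with `expm1`. A minimal sketch (the prices are made up):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"SalePrice": [130000, 250000, 755000]})

# log1p compresses the long right tail of prices; pop() removes the
# target column from the feature frame at the same time.
y = np.log1p(train.pop("SalePrice"))

print("SalePrice" in train.columns)            # target is out of the features
print(np.expm1(y).tolist())                    # expm1 recovers the dollars
```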

  • Missing Data

Next, I split the training set into numerical and categorical variables to look for any patterns in the missingness. For the categorical variables, I filled the NaNs with "No" and then transformed them into numerical variables by encoding them. Next, I combined existing features and dropped insignificant ones. For the remaining numerical variables, the median was used to fill missing values. 
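Both fill strategies can be sketched in a few lines. This is an illustrative example on a tiny hypothetical frame; the post doesn't specify the encoding used, so pandas category codes stand in here:

```python
import pandas as pd

df = pd.DataFrame({
    "PoolQC": [None, "Gd", None],        # NaN here really means "no pool"
    "LotFrontage": [60.0, None, 80.0],   # NaN here is genuinely unknown
})

# Categorical NaNs become an explicit "No" level, then get integer codes.
df["PoolQC"] = df["PoolQC"].fillna("No").astype("category").cat.codes

# Numeric gaps are filled with the column median.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

print(df)
```

Treating the two kinds of NaN differently matters: for quality columns like PoolQC, missing is informative ("no pool"), whereas for LotFrontage it is simply unobserved.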

  • Skewness

To figure out which variables have a nonlinear shape, I examined them with the pairplot above. A log transform of the skewed numerical features was then applied to lessen the impact of outliers. 
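Selecting and transforming only the skewed columns can be done programmatically. A sketch with made-up values and an assumed |skew| > 0.75 cutoff (the post doesn't state its threshold):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GrLivArea": [850.0, 1500.0, 5600.0],   # long right tail
    "YearBuilt": [1950.0, 1980.0, 2005.0],  # roughly symmetric
})

# Keep only the columns whose skewness exceeds the threshold...
sk = df.skew()
skewed = sk[sk.abs() > 0.75].index

# ...and log-transform just those, leaving the rest untouched.
df[skewed] = np.log1p(df[skewed])
print(skewed.tolist())
```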

4. Modeling

The first models used were linear models: Lasso, Ridge, and ElasticNet. 

Ridge and Lasso regression are regularization techniques that reduce model complexity and prevent the over-fitting that can result from simple linear regression. In Ridge regression, a penalty term shrinks the coefficients, which helps reduce model complexity and the effects of multicollinearity. In Lasso regression, the penalty can shrink coefficients all the way to zero, which helps with feature selection. ElasticNet is a combination of both that attempts to shrink and do a sparse selection at the same time. I optimized the model using cross-validation over different alphas. The results show it used only 44% of the features. 
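The cross-validated ElasticNet fit can be sketched with scikit-learn's `ElasticNetCV`. This runs on synthetic data (100 features, only 10 informative), so the fraction of features it keeps will not match the author's 44% figure; the alpha grid and l1_ratio values are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in: 100 features, only 10 carry signal.
X, y = make_regression(n_samples=300, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

# Cross-validate jointly over the L1/L2 mix and a log-spaced alpha grid.
model = ElasticNetCV(l1_ratio=[0.5, 0.9, 1.0],
                     alphas=np.logspace(-2, 1, 30),
                     cv=5, random_state=0).fit(X, y)

# The L1 part zeroes out coefficients, so we can count the surviving features.
used = np.count_nonzero(model.coef_) / X.shape[1]
print(f"best alpha: {model.alpha_:.3f}, features used: {used:.0%}")
```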


Next, I chose two ensemble models: Random Forests (RF) and Gradient Boosting.

They're easy to implement, and both predict by combining the outputs of individual decision trees. Gradient Boosting is usually preferred over RF because each new tree is trained to complement the ones already built. This normally gives better accuracy, which was the case here.
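A side-by-side comparison of the two ensembles is straightforward with cross-validation. A sketch on synthetic regression data (the scores here say nothing about the housing data itself):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)

results = {}
for name, model in [
    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("Gradient Boosting", GradientBoostingRegressor(n_estimators=100, random_state=0)),
]:
    # Negate sklearn's "higher is better" score back into a plain RMSE.
    rmse = -cross_val_score(model, X, y, cv=3,
                            scoring="neg_root_mean_squared_error").mean()
    results[name] = rmse
    print(f"{name}: CV RMSE = {rmse:.1f}")
```

The key structural difference: RF trees are grown independently on bootstrap samples and averaged, while boosting grows each tree on the residual errors of the ensemble so far.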

We can see from the above graphs that neighborhood features ranked much higher in the linear models than in the non-linear ones, owing to their different learning algorithms. The Gradient Boosting feature importances are also more evenly dispersed than those of the linear models.
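To compare the two kinds of ranking, linear coefficients have to be put on the same scale as tree importances. A sketch on synthetic data, normalizing the absolute Lasso coefficients so both vectors sum to 1 (one reasonable convention among several):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=300, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
gbm = GradientBoostingRegressor(random_state=0).fit(X, y)

# Normalize |coefficients| so both importance vectors sum to 1.
lin_imp = np.abs(lasso.coef_) / np.abs(lasso.coef_).sum()
gbm_imp = gbm.feature_importances_   # already normalized by sklearn

print("linear top 5:", np.argsort(lin_imp)[::-1][:5])
print("gbm    top 5:", np.argsort(gbm_imp)[::-1][:5])
```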

5. Results and Future Improvements

For the competition, submissions are evaluated on the root-mean-squared error (RMSE) between the logarithm of the predicted values and the logarithm of the observed sale prices, which are withheld from the test set. Overall, the Random Forest model gave me the lowest training RMSE of 0.05431. However, its Kaggle score was the worst! I'm sensing some overfitting happening here, which I will improve upon with further hyperparameter tuning. 
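The competition metric itself is simple to write down. A minimal sketch with made-up prices:

```python
import numpy as np

def log_rmse(y_true, y_pred):
    """Kaggle metric: RMSE between log(predicted) and log(actual) prices."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

actual = np.array([200000.0, 150000.0, 320000.0])
pred   = np.array([210000.0, 140000.0, 330000.0])
print(round(log_rmse(actual, pred), 4))
```

Taking logs first means a $10k miss on a cheap house is penalized more than the same miss on an expensive one, so cheap and expensive houses affect the score equally in relative terms.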

The ElasticNet model gave me the best Kaggle score of 0.12135. Its training RMSE wasn't too far off, and surely a bit more feature engineering will improve it.

For all preprocessing and modeling, the code is available on GitHub. 


