Predicting House Prices in a Kaggle Machine Learning Competition

Photo by Brendan Bell

I. Introduction

Our team, composed of Ansel Santos, Sal Lascano, Yicong Xu, and Moon Kang, entered the House Prices: Advanced Regression Techniques machine learning competition on Kaggle. Participants compete to build the most accurate model for predicting house prices from the data provided by the website. Our model scored 0.11599, which made us the champions of our cohort (the 12th cohort) and put our group in the top 9% of Kaggle's public leaderboard.

We used the following references for our work: Stacked Regressions Top 4 on Leaderboard by serigne for the modeling and stacking code, Comprehensive Data Exploration with Python by pmarcelino for our Exploratory Data Analysis (EDA), A Study on Regression Applied to the Ames Dataset by juliencs for our features engineering, and Regularized linear models by apapiu for setting up our first pipeline.

II. Exploratory Data Analysis

We began with an initial analysis of the data using Python's Pandas and Plotly. Graphs of the training data showed outliers that needed to be removed, and the distribution of the SalePrice variable was skewed enough to call for a log transformation. We dropped the outliers, which were the houses with a GrLivArea above 4,000 and a SalePrice below 300,000, and then log transformed SalePrice to reduce its skewness.
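The two steps above can be sketched as follows. This is only an illustration on made-up rows, not our actual notebook code, and the use of np.log1p (the log of 1 + x) rather than a plain logarithm is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle training data (values are made up).
train = pd.DataFrame({
    "GrLivArea": [1500, 2100, 4700, 5600, 1800],
    "SalePrice": [180000, 250000, 184000, 160000, 210000],
})

# Outliers: very large houses that sold for unusually low prices.
is_outlier = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)
train = train.loc[~is_outlier].reset_index(drop=True)

# Assumption: log1p is used for the log transform; np.expm1 inverts it.
train["LogSalePrice"] = np.log1p(train["SalePrice"])
```

Predictions made on the log scale have to be passed through np.expm1 before they are reported as prices.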

The correlation heat map gave us an overview of which numerical features are important, and which features are highly correlated with each other and can be combined. Cells that are more yellow indicate a higher correlation with the target variable, while cells that are more green indicate a lower (or negative) correlation.
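The numbers behind such a heat map come from a correlation matrix. Here is a minimal sketch on synthetic data; the column values and the hypothetical Noise column are invented for illustration, and the plotting step is left out.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)

# Synthetic data: GarageArea tracks GrLivArea, Noise is unrelated to anything.
df = pd.DataFrame({
    "GrLivArea": area,
    "GarageArea": 0.3 * area + rng.normal(0, 20, n),
    "Noise": rng.normal(0, 1, n),
    "SalePrice": 100 * area + rng.normal(0, 20000, n),
})

# Pairwise Pearson correlations; this matrix is what a heat map colors.
corr = df.corr()

# Features ranked by their correlation with the target.
target_corr = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
```

A pair such as GrLivArea and GarageArea, which correlate strongly with each other, is the kind of pair that might be combined.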

III. Features Engineering

Features engineering had three parts: filling in missing values, transforming variables, and applying the Box-Cox transformation to numerical variables while dummifying categorical variables.

Filling the Missingness

We counted the missing values in each column to see which features had the most missingness, and handled them in two ways. The first way was to use the description.txt provided by Kaggle. This file explains what an empty data point means for some of the columns, which guided our imputation.

The second way we addressed missingness was by determining what kind of missingness occurred and then deciding how to impute. There are three kinds of missingness, namely, missing at random, missing not at random, and missing completely at random. Based on this classification, we decided on the imputation method to use.
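A minimal sketch of both approaches on made-up rows. The choice of columns and the use of the median for LotFrontage are assumptions made for illustration, not a record of what we imputed.

```python
import numpy as np
import pandas as pd

# Toy rows; NaN marks an empty data point.
df = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, np.nan],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
})

# Count missing values per column, most affected first.
missing_counts = df.isnull().sum().sort_values(ascending=False)

# First way: the data description says an empty value means the house
# lacks the feature, so fill with an explicit "None" category.
for col in ["PoolQC", "GarageType"]:
    df[col] = df[col].fillna("None")

# Second way (assumption): a numeric gap treated as missing at random
# is filled with the column median.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
```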

Plots for Analyzing Transformations

Before moving to the transformations, we plotted each feature against SalePrice using scatterplots, boxplots, and distribution plots. These visualizations guided us in identifying which transformations could potentially increase the accuracy of our predictions.


The transformations are divided into two parts: numeric variables that needed to be transformed into categories, and categorical variables that needed to be transformed into numeric values.

The MSSubClass, MoSold, and YrSold features are stored as numbers, but on inspection they are really categories. Under a linear model, a house with a subclass of 180 is not nine times more valuable than a house of subclass 20, so this variable should be categorical. The same reasoning applies to MoSold and YrSold, because housing market prices do not move in only one direction over time.
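One way to make this conversion, sketched on made-up rows, is to cast the columns to strings so that later steps treat the values as labels rather than magnitudes.

```python
import pandas as pd

# Toy rows (values are made up).
df = pd.DataFrame({
    "MSSubClass": [20, 60, 180],
    "MoSold": [2, 7, 12],
    "YrSold": [2008, 2007, 2010],
})

# Cast numeric codes to strings so they are handled as categories.
for col in ["MSSubClass", "MoSold", "YrSold"]:
    df[col] = df[col].astype(str)

# Each subclass now gets its own indicator column instead of a single slope.
dummies = pd.get_dummies(df["MSSubClass"], prefix="MSSubClass")
```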

The team did the label encoding manually, looking for categorical features that could be simplified by converting their categories into integers. An example is the basement condition variable, whose categories were transformed as follows: no basement into 0, poor into 1, fair into 2, typical into 3, good into 4, and excellent into 5.
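The basement condition mapping above can be written as a dictionary lookup. This sketch assumes the column is named BsmtCond and uses the dataset's abbreviations (Po, Fa, TA, Gd, Ex), with an empty value standing for no basement.

```python
import pandas as pd

# Toy rows; None marks a house without a basement.
df = pd.DataFrame({"BsmtCond": ["TA", "Gd", None, "Fa", "Ex", "Po"]})

# Ordered mapping from category to integer, as described in the text.
bsmt_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Label the empty values first, then map every category to its integer.
df["BsmtCond"] = df["BsmtCond"].fillna("None").map(bsmt_order)
```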

We also created new variables that simplify our ordinal numeric variables by grouping values within a range together. For example, the variable describing the overall quality of the house was simplified by grouping 1 to 3, 4 to 6, and 7 to 10 together.

With most of the data set cleaned and transformed, we noticed that some variables could be combined. This was the case for overall quality and overall condition. Since these features are similar to each other, multiplying them together allows our models to interpret them as one, which can increase the accuracy of our predictions.
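The grouping and the combination described in the last two paragraphs can be sketched as follows. The new column names SimplOverallQual and OverallGrade are illustrative choices, and the rows are made up.

```python
import pandas as pd

# Toy rows (values are made up).
df = pd.DataFrame({
    "OverallQual": [2, 5, 7, 10],
    "OverallCond": [3, 5, 8, 5],
})

# Group quality scores: 1 to 3 -> 1, 4 to 6 -> 2, 7 to 10 -> 3.
# pd.cut uses right-closed bins: (0, 3], (3, 6], (6, 10].
df["SimplOverallQual"] = pd.cut(
    df["OverallQual"], bins=[0, 3, 6, 10], labels=[1, 2, 3]
).astype(int)

# Combine two related features into one by multiplying them.
df["OverallGrade"] = df["OverallQual"] * df["OverallCond"]
```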

Box-Cox Transformation and Dummifying Variables

After all of these transformations, the numeric variables whose distributions were highly skewed were transformed using a Box-Cox transformation, while the categorical variables that were not label encoded were dummified.
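A sketch of this final step on made-up rows. The skewness threshold of 0.75 and the Box-Cox lambda of 0.15 are assumptions chosen for illustration, since the text above does not state the values used. The example relies on scipy's boxcox1p, which applies the Box-Cox transform to 1 + x.

```python
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew

# Toy rows: LotArea has one extreme value, YearBuilt is roughly symmetric.
df = pd.DataFrame({
    "LotArea": [8000.0, 9500.0, 11000.0, 7000.0, 215000.0],
    "YearBuilt": [1995.0, 2001.0, 1978.0, 2005.0, 1989.0],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown", "CollgCr"],
})

# Measure the skewness of every numeric column.
numeric_cols = df.select_dtypes(include="number").columns
skewness = df[numeric_cols].apply(skew)

# Assumed threshold: only columns with |skew| > 0.75 are transformed.
skewed_cols = skewness[skewness.abs() > 0.75].index

# Assumed lambda of 0.15 for the Box-Cox transform of 1 + x.
for col in skewed_cols:
    df[col] = boxcox1p(df[col].to_numpy(), 0.15)

# Dummify the remaining categorical columns.
df = pd.get_dummies(df)
```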

IV. Modeling

Now that the data is ready, we can start creating the model for predicting the LogSalePrice! The group tested various models but ended up with two models that were stacked. We found that the Lasso Regression and Gradient Boosting models, when stacked, made the best prediction of the target variable.

Cross-validation scores were computed to help us decide which model to use. The team ran Lasso Regression, Ridge Regression, Elastic Net, Extreme Gradient Boosting, Gradient Boosting, Light Gradient Boosting, and Random Forest. The lowest cross-validation score came from Lasso Regression, a linear model. We were not surprised upon viewing the results as the exploratory data analysis we did showed that features had a noticeable linear relationship with the target variable.
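A minimal sketch of such a comparison, assuming the score is a root mean squared error averaged over five folds. It runs on synthetic linear data rather than the housing data, covers only three of the models, and uses illustrative parameter values; the helper rmse_cv is a name made up for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared features and the log sale price.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

def rmse_cv(model, X, y, n_folds=5):
    """Return the mean RMSE of a model over shuffled K-fold splits."""
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=folds)
    return np.sqrt(mse).mean()

models = {
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=1.0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

# Lower is better; the model with the smallest RMSE wins the comparison.
scores = {name: rmse_cv(model, X, y) for name, model in models.items()}
best_model = min(scores, key=scores.get)
```

Because the synthetic target is linear in the features, the linear models come out ahead here, which mirrors the pattern described above.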

The Gradient Boosting model was selected not because it provided the best cross-validation score, but because it improved our model when it was stacked with Lasso. Gradient Boosting, being a tree-based model, complemented Lasso regression on features which did not have a clear linear relationship with the target. We believe that this is the reason why stacking it with Lasso increased prediction accuracy.

We further improved the models by tuning their parameters with GridSearchCV and RandomizedSearchCV from Python's sklearn package. The grid search gave a value of 0.0001 for Lasso's alpha, and values of 0.11 and 13 for Gradient Boosting's learning_rate and min_samples_leaf, respectively. We then manually adjusted these to 0.0005, 0.05, and 13, respectively, because the values found by the search overfit the training sample and did not carry over well to the test data.
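A grid search over Lasso's alpha might look like the sketch below. The candidate values and the synthetic data are illustrative assumptions, not the grid we searched.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Hypothetical grid of candidate alpha values.
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0]}

# Every candidate is scored with 5-fold cross-validation.
search = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
```

RandomizedSearchCV takes a similar form but samples a fixed number of candidates from the given parameter distributions instead of trying every combination.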


With the models chosen and their parameters set, we ran the stacking code with Lasso and Gradient Boosting as the base models and Lasso as our meta model. Stacking is a type of ensembling that improves model accuracy by combining a list of base models using a meta model. Because our meta model is Lasso, each base model's predictions are multiplied by a coefficient (beta), and these coefficients are estimated by running a Lasso regression on the base models' predictions.
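The same idea can be sketched with sklearn's StackingRegressor. This is a swapped-in implementation for illustration, not the stacking code we adapted from the reference kernel. The alpha, learning_rate, and min_samples_leaf values are the adjusted ones reported above, while the synthetic data, n_estimators, and max_iter are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and the log sale price.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models: a linear model and a tree-based model.
base_models = [
    ("lasso", Lasso(alpha=0.0005, max_iter=50000)),
    ("gboost", GradientBoostingRegressor(
        learning_rate=0.05, min_samples_leaf=13,
        n_estimators=200, random_state=0)),
]

# The meta model is fit on out-of-fold predictions from the base models.
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=Lasso(alpha=0.0005, max_iter=50000),
    cv=5,
)
stack.fit(X_train, y_train)

# One coefficient (beta) per base model, learned by the meta Lasso.
meta_coefs = stack.final_estimator_.coef_
r2 = stack.score(X_test, y_test)
```

The fitted meta_coefs show how much weight the meta model gives to each base model's predictions.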

Stacking combines models in a way that can improve the score further. In our case, the cross-validation score of 0.1119 from a plain Lasso model improved to 0.1069 with stacking.

V. Results

The team got a score of 0.11599 when the test set predictions were uploaded to Kaggle, which is the best within our cohort and is in the top 9% in the public leaderboard.

VI. Conclusion

This exercise gave us the experience of working in a data science team environment. We realized how important it is not to let any member's ego get the better of the team. Constant communication is also essential, as it helps the team identify and resolve potential problems before they happen.

About Authors

Ansel Andro Santos

Ansel has a Master in Applied Mathematics from Ateneo de Manila University. He worked in the investments industry before deciding to become a data scientist. His interest in data science started while making the factor based investment strategy...

Sal Lascano

Sal received his B.S. from Saint Peter's University in Jersey City, NJ with a major in Mathematics and a minor in Secondary Education. He worked for four years as an account manager and sales manager in the Interior...

Yicong Xu

Master student in Electrical Engineering at New York University and graduate assistant in CAN lab. Research area focused on self-driving, reinforcement learning and computer vision. One-year work experience in a Vehicle-Safety Inspection company and Two-year machine learning experience...

Moon Kang

Moon graduated from Baruch College, where he studied finance. He is an analytical, dedicated, and success-driven professional with a solid understanding of data science, equipped with superior research, presentation, communication, and coding skills, coupled with exceptional creativity.