Posted on Mar 4, 2018


Most people have some intuition about which factors affect a house's sale price. Some of these intuitions are based on feeling, whereas others are grounded in learned domain knowledge. Either way, there are many factors to consider when predicting a house's final price. In this project, we explored the many features that can influence the sale price of a home by applying machine learning models to our dataset.


Our dataset is a well-known one from a Kaggle competition. It is a residential housing dataset for the city of Ames, Iowa, covering the years 2006 to 2010. The dataset has 79 features and 1460 observations, plus a column of sale prices that we used to train our models in order to predict sale prices for the test and validation datasets.


Figure 1. Wireframe of project


I will talk about the pre-processing step of imputing null values.

At first, we focused on the null values in our training data, but the test data also contained null values that should be filled.

Therefore, we merged the training data and the test data before imputing.

In total, 34 columns contained null values.
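
The merge-and-count step can be sketched as follows. This is a minimal illustration with pandas, not our original notebook code: the tiny `train` and `test` frames and their values are made up, and only the column names come from the Ames data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the Kaggle train/test files (values are made up).
train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "PoolQC": [np.nan, "Gd", np.nan],
    "SalePrice": [208500, 181500, 223500],
})
test = pd.DataFrame({
    "LotArea": [11622, 14267],
    "PoolQC": ["Ex", np.nan],
})

# Stack the feature columns of both sets so each imputation rule is applied once.
n_train = len(train)
features = pd.concat([train.drop(columns="SalePrice"), test], ignore_index=True)

# Columns that contain at least one null, with their null counts.
null_counts = features.isnull().sum()
null_columns = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_columns)

# After imputation the two sets can be separated again by position.
train_features = features.iloc[:n_train]
test_features = features.iloc[n_train:]
```

Keeping `n_train` around is one simple way to split the merged frame back into training and test rows once the nulls are filled.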

For each column, we considered how to fill the null values using intuition and the data description.

We divided the features into four groups and assigned one group to each of us. We each performed simple EDA and imputed the null values for our group, and finally we combined the fragmented features.

Then, we analyzed the relationship between columns by separating the categorical and numerical columns of each group of features.

If a column had no relationship to other columns and its null values simply meant "no" (the house lacks that feature), we filled the column with "NA" or "None" if it was categorical and with 0 if it was numerical.

For example, the PoolQC column is related to the pool area.

Since pool quality was null only when pool area was 0, we filled those nulls with the NA value that corresponds to "no pool."
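
A small sketch of this rule, assuming pandas. The data is made up, and `GarageArea` is used here only as an illustrative numerical column; the post does not list which columns were filled with 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PoolArea": [0, 512, 0, 0],
    "PoolQC": [np.nan, "Ex", np.nan, np.nan],
    "GarageArea": [548.0, np.nan, 460.0, 0.0],
})

# Check the assumption first: PoolQC is null only where PoolArea is 0.
assert (df.loc[df["PoolQC"].isnull(), "PoolArea"] == 0).all()

none_cols = ["PoolQC"]      # categorical: null means the item is absent
zero_cols = ["GarageArea"]  # numerical (illustrative choice): null treated as 0

df[none_cols] = df[none_cols].fillna("None")
df[zero_cols] = df[zero_cols].fillna(0)
print(df)
```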

Other methods of imputing null values are described next.

Six categorical columns should not contain null values at all, given the characteristics of the data.

Also, when we checked the distribution of the non-null data in these columns, we saw that the data was concentrated on particular values.

Because the mode is the most frequent value of a feature, it was a suitable choice, and we used it to impute the values of these six columns.
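
Mode imputation might look like the sketch below. The two column names are placeholders chosen for illustration, since the six columns are not named in this post, and the values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA", "SBrkr"],
    "KitchenQual": ["TA", np.nan, "TA", "Gd", "TA"],
})
mode_cols = ["Electrical", "KitchenQual"]  # hypothetical selection of columns

for col in mode_cols:
    # Share of the most frequent value among the non-null entries.
    top_share = df[col].value_counts(normalize=True).iloc[0]
    most_frequent = df[col].mode()[0]
    print(f"{col}: mode={most_frequent!r}, share of non-null rows={top_share:.2f}")
    df[col] = df[col].fillna(most_frequent)
```

Printing the share of the top value is a quick way to confirm that the column really is concentrated on one value before relying on the mode.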


Next, for two columns, lotfrontage and the garage year built, we checked the value counts and analyzed their correlation with other columns.

They could not be imputed with a simple "0". 


For lotfrontage, we considered two different methods of imputation:

The first case related lotfrontage to lotarea.

We calculated the ratio of the mean lotfrontage to the mean lotarea, excluding null values, and filled each null lotfrontage with the observation's lotarea multiplied by this lot_ratio.
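
As a rough sketch of case 1, assuming pandas and made-up values (only the `lot_ratio` name and the column names come from the project):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LotArea": [8000.0, 10000.0, 12000.0, 9000.0],
    "LotFrontage": [60.0, 80.0, 100.0, np.nan],
})

# Ratio of the two means, computed only on rows where LotFrontage is known.
known = df["LotFrontage"].notnull()
lot_ratio = df.loc[known, "LotFrontage"].mean() / df.loc[known, "LotArea"].mean()

# Fill each null with that row's LotArea scaled by the ratio.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotArea"] * lot_ratio)
print(lot_ratio)
print(df)
```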


The second case related lotfrontage to the neighborhood.

Because houses within a neighborhood are typically similar, we grouped the data by neighborhood and found the median lotfrontage of each group.

We then imputed each null value with the median of the observation's neighborhood.
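
Case 2 can be sketched with a pandas group-wise transform. The rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr", "CollgCr"],
    "LotFrontage": [70.0, 80.0, np.nan, 60.0, np.nan, 64.0],
})

# Median LotFrontage of each neighborhood, broadcast back to every row.
neighborhood_median = df.groupby("Neighborhood")["LotFrontage"].transform("median")

# Only the null entries are replaced; known values are left untouched.
df["LotFrontage"] = df["LotFrontage"].fillna(neighborhood_median)
print(df)
```

Using `transform` keeps the result aligned with the original rows, so no merge back onto the frame is needed.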

Because we didn't know which method would be better for our output, we ran our models with both cases and compared the results. Case 2 turned out to give the better results.


The garage year built feature was imputed by relating it to the year the house was built. 

After calculating the difference between the year built and the garage year built for each observation, we averaged these differences and rounded the average (aveDiff) so that the imputed years would remain integers.

Then the garage year built was imputed by adding aveDiff to the year built value.
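
A minimal sketch of this step, assuming pandas and made-up years; `ave_diff` plays the role of aveDiff.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "YearBuilt": [1990, 2000, 1960, 1975],
    "GarageYrBlt": [1990.0, 2001.0, 1965.0, np.nan],
})

# Per-row gap between garage year and house year (NaN where the garage year is missing).
diff = df["GarageYrBlt"] - df["YearBuilt"]

# Average gap over the known rows, rounded so imputed years stay whole numbers.
ave_diff = int(round(diff.mean()))

df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"] + ave_diff).astype(int)
print(ave_diff)
print(df)
```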

Exploratory Data Analysis


Feature Engineering

When starting this project, the general thought was that the better one tuned a model, the more accurate the final price would end up being. However, after repeated trial and error, we realized that the most pivotal aspect of this project was the feature engineering.

Model Fitting

This is how we applied the machine learning models to the dataset after we finished feature engineering.

The first model we used was a stacked ensemble model.

Stacked regression feeds the predictions of several submodels into a meta-regressor, with the aim of preventing overfitting and reducing bias.

When we included the random forest regressor, the accuracy of the whole model went down, so we decided to exclude it from the final model.

We chose SVR as the meta-regressor because it gave the most stable score.
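
For readers who want to see the shape of such a model, here is a hedged sketch using scikit-learn's `StackingRegressor`. It is not our exact setup: the submodels, their hyperparameters, the scaling step in front of the SVR, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the engineered housing features.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Submodels whose out-of-fold predictions become the meta-regressor's inputs.
base_models = [
    ("enet", ElasticNet(alpha=0.1)),
    ("ridge", Ridge(alpha=1.0)),
    ("gbr", GradientBoostingRegressor(random_state=0)),
]

# SVR as the meta-regressor; its inputs are standardized first.
meta = make_pipeline(StandardScaler(), SVR(kernel="linear", C=10.0))

stack = StackingRegressor(estimators=base_models, final_estimator=meta, cv=5)
stack.fit(X_train, y_train)
stacked_pred = stack.predict(X_test)
print(stacked_pred[:5])
```

With `cv=5`, the meta-regressor is trained on predictions the submodels made for rows they did not see during fitting, which is what limits overfitting in the second stage.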

Rather than using the stacked regression's output directly as the final output, we ensembled it with other models.

This was because we wanted to distribute weight to the other individual models as well.

This basic structure was not built all at once, but through repeated trial and error.

Model Tuning

Then, we tuned our models in three steps.

First, we re-selected the features. By features we mean not the original features but those created or modified through feature engineering. Not all of them produced good results, so we found the best combination by including or excluding features one at a time and keeping the choices that led to better scores.
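
One simple way to automate this kind of one-at-a-time selection is a single greedy forward pass scored by cross-validation. The sketch below is an illustration under assumptions, not our actual procedure: the Ridge model, the scoring metric, and the synthetic columns are all made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "useful_a": rng.normal(size=n),
    "useful_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 3.0 * X["useful_a"] - 2.0 * X["useful_b"] + rng.normal(scale=0.1, size=n)


def cv_score(columns):
    """Mean cross-validated negative RMSE for the given columns (higher is better)."""
    scores = cross_val_score(
        Ridge(alpha=1.0), X[columns], y, cv=5,
        scoring="neg_root_mean_squared_error",
    )
    return scores.mean()


selected = []
best = -np.inf
for col in X.columns:
    score = cv_score(selected + [col])
    if score > best:  # keep the feature only if the score improves
        selected.append(col)
        best = score

print(selected, best)
```

A single pass depends on the order of the columns, so in practice it is worth repeating the loop or also trying to drop features that were added earlier.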

Second, we found the best hyperparameters for each model. Because searching for all of the best parameters at once takes too long, we tuned them one at a time, rotating through the models from elastic-net to lightgbm.
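
The one-at-a-time idea can be sketched for a single model as follows. This assumes scikit-learn's `GridSearchCV` and an elastic-net on synthetic data; the candidate values and the starting point are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

search_space = {
    "alpha": [0.001, 0.01, 0.1, 1.0],
    "l1_ratio": [0.1, 0.5, 0.9],
}
best_params = {"alpha": 1.0, "l1_ratio": 0.5}  # arbitrary starting point

# Tune one hyperparameter at a time while the others stay at their current best.
for name, candidates in search_space.items():
    fixed = {k: v for k, v in best_params.items() if k != name}
    search = GridSearchCV(
        ElasticNet(max_iter=10000, **fixed),
        {name: candidates},
        cv=5,
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X, y)
    best_params[name] = search.best_params_[name]

print(best_params)
```

This fits 4 + 3 candidate settings instead of the 4 x 3 a full grid would need; the saving grows quickly with more parameters, at the cost of possibly missing the joint optimum.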

Finally, we adjusted the weights of the models. Stacked regression gave a good result by itself, but we wanted to give weight to the individual models as well. So we simply multiplied each model's predictions by its respective weight and combined them to produce a new result.

We repeated this process of estimating the score and tuning the hyperparameters to get a better final score.
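
A weighted blend of this kind is only a few lines with NumPy. The predictions and weights below are made-up numbers, not our submission values.

```python
import numpy as np

# Hypothetical per-model predictions for three houses.
predictions = {
    "stacked": np.array([200000.0, 150000.0, 310000.0]),
    "elastic_net": np.array([195000.0, 155000.0, 300000.0]),
    "lightgbm": np.array([205000.0, 148000.0, 320000.0]),
}

# Hypothetical weights; they sum to 1 so the blend stays on the price scale.
weights = {"stacked": 0.5, "elastic_net": 0.2, "lightgbm": 0.3}
assert abs(sum(weights.values()) - 1.0) < 1e-9

# Multiply each model's predictions by its weight and add them up.
blended = sum(weights[name] * pred for name, pred in predictions.items())
print(blended)
```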


This is our score on Kaggle. It is not great, but also not bad for our first Kaggle challenge.

Here are some important things we learned from this project:

Feature engineering is more important than model tuning. Model tuning is necessary to achieve very high scores, but it must be preceded by good feature engineering.

Feature engineering should be done using accurate analysis such as Z-score/VIF/chi^2, not merely guesswork.

Not every logical hunch brings good results, but those intuitions are still the place to start.

Some complex models like XGBoost or GBM bring good results even with little effort. But sometimes, simple models like ridge or lasso are better, depending on your data.

Some models are sensitive to feature engineering. For example, support vector regression can make incorrect predictions if the data is not standardized properly.
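
The last point can be illustrated with a small experiment. This is a sketch on synthetic data, assuming scikit-learn: one informative feature on a unit scale, one uninformative feature on a much larger scale, and SVR fitted with and without a `StandardScaler` in front of it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 300
small = rng.normal(size=n)                # informative feature on a unit scale
large = rng.normal(scale=1000.0, size=n)  # uninformative feature on a much larger scale
X = np.column_stack([small, large])
y = 2.0 * small + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same model, with and without standardization of the inputs.
unscaled = SVR().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SVR()).fit(X_train, y_train)

unscaled_r2 = unscaled.score(X_test, y_test)
scaled_r2 = scaled.score(X_test, y_test)
print(f"R^2 without scaling: {unscaled_r2:.3f}")
print(f"R^2 with scaling:    {scaled_r2:.3f}")
```

Putting the scaler inside the pipeline also means it is fitted on the training rows only, which avoids leaking test-set statistics into the model.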


For further improvement, the basic approaches are to try other combinations of models or hyperparameters.

We could also look into the post-processing techniques used in higher-scoring submissions.

We hope to try out more competitions and hone these skills.
