Using Data to Predict House Prices in Ames, Iowa

The skills the authors demonstrate here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.


One might wonder what drives the price of a house. Is it the neighborhood? The size of the house? The amenities? Or something else? We set out to find answers using machine learning techniques, working with the Ames Housing dataset from Kaggle to predict house prices in Ames, Iowa.

In this blog, we outline our approach to exploratory data analysis (EDA), data cleaning, feature engineering, and machine learning modeling, which enabled us to obtain the top Kaggle score out of the 12 competing groups at the NYC Data Science Academy boot camp.

The Dataset and Competition

The Ames Housing dataset was compiled by Dean De Cock and is commonly used in data science education. The train set has 1,460 observations with 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa; the test set comprises 1,459 observations with the same 79 variables. The dataset is part of an ongoing Kaggle housing-price competition that challenges participants to predict the final price of each home.

The goal of our project was to use supervised machine learning techniques to predict the housing price for each home in the dataset. It was clear that, with numerous predictors and a heterogeneous dataset, accurately predicting the response variable would be a non-trivial task. Our steps toward creating a highly accurate model were as follows:

  1. Exploratory Data Analysis (EDA)
  2. Data cleaning
    1. Missingness imputation
    2. Outlier removal
    3. Dummification
  3. Feature engineering
    1. Add new features
    2. Scaling
  4. Pre-modeling
  5. Cross-validation (hyperparameter tuning)
  6. Modeling

Exploratory Data Analysis (EDA)


We started by exploring and understanding the dataset. We divided our variables into categories: continuous, nominal categorical, ordinal categorical, and the target variable.


Target Variable:

Sale price is the value we are looking to predict, so it made sense to examine this variable first. It exhibited a right-skewed distribution, which we corrected by taking the log; after the transformation, we were no longer violating the normality assumption of linear regression.

Sale Price before log transformation:

[Figure: histogram of SalePrice before log transformation]

The distribution of price is right-skewed. Log transformation techniques will be used to make the distribution more normal.

[Figure: Quantile-Quantile plot of SalePrice before log transformation]

The Quantile-Quantile plot also shows that the price is not normally distributed.

Sale Price after log transformation:

[Figure: histogram of SalePrice after log transformation]

After the log transformation, the price distribution looks much closer to normal.

[Figure: Quantile-Quantile plot of SalePrice after log transformation]

After the log transformation, the Quantile-Quantile plot shows a much more linear pattern, indicating approximate normality.

This transformation can help improve the linear model’s performance.
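A minimal sketch of this step, using synthetic right-skewed prices as a stand-in for the SalePrice column (the Kaggle file is not bundled here):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed "sale prices" standing in for SalePrice
prices = rng.lognormal(mean=12, sigma=0.4, size=1460)

# np.log1p == log(1 + x); safe even if a value were 0
log_prices = np.log1p(prices)

print(f"skew before: {skew(prices):.2f}")     # strongly positive (right-skewed)
print(f"skew after:  {skew(log_prices):.2f}") # near zero (roughly normal)
```

Submissions are scored on the log of the price anyway, so modeling `log1p(SalePrice)` and inverting with `expm1` at prediction time is the usual pattern.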

Missingness and Imputation:

Next, we looked at missing values by feature in the train and test datasets. There was significant missingness by feature across both. Most of the missing data corresponded to the absence of a feature: for example, the garage features shown in the table below appear as "NA" if the house does not have a garage. These were imputed as 0 or "None" depending on the feature type.

We identified missingness type (MAR: Missing at random, MNAR: Missing not at random, MCAR: Missing completely at random) to decide upon the imputed value. We wanted to test multiple imputation methods on our model, so we designed two imputation methods for a few features. Below is a breakdown of how we handled imputation across all the features.
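A small pandas sketch of this imputation logic on a toy frame (column names follow the Ames data; treating Electrical's single missing value as MCAR and mode-filling it is an illustrative choice, not necessarily the authors' exact rule):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Ames data
df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageArea": [548.0, np.nan, 460.0],
    "Electrical": ["SBrkr", np.nan, "SBrkr"],
})

# NA in the garage columns means "no garage": impute "None" / 0, not a guess
df["GarageType"] = df["GarageType"].fillna("None")
df["GarageArea"] = df["GarageArea"].fillna(0)

# A genuinely missing categorical (MCAR) gets the mode instead
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

print(df)
```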

[Table: imputation strategy by feature]

Correlation Levels:

Another graphical view of our analysis is the following correlation plot that indicates levels of correlation amongst continuous variables, and between continuous features and the response variable (SalePrice).

[Figure: correlation plot of continuous variables and SalePrice]

This plot definitely aided our exploration of the data. We found that Sale Price is strongly correlated with the continuous variables listed below, so we focused our outlier search on these predictors.

Predictor      Correlation with SalePrice
GrLivArea      0.708624
GarageCars     0.640409
GarageArea     0.623431
TotalBsmtSF    0.613581
1stFlrSF       0.605852
FullBath       0.560664
TotRmsAbvGrd   0.533723
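The correlations with the target can be produced with a one-liner; here on a small synthetic frame (the same call works on the Ames numeric columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
area = rng.normal(1500, 400, n)
df = pd.DataFrame({
    "GrLivArea": area,
    "GarageCars": rng.integers(0, 4, n),
    "SalePrice": area * 100 + rng.normal(0, 20000, n),
})

# Correlation of every numeric column with the target, sorted descending
corr = df.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(corr)
```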


Categorical variable visualization:

There are more than 40 categorical variables in this dataset; a few plots below show how they affect the price. We imputed missing LotFrontage values with the mean for the corresponding neighborhood, since houses in the same neighborhood tend to be of similar structure and size.
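The neighborhood-wise fill can be written as a `groupby`/`transform`, as in this small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr"],
    "LotFrontage":  [70.0, 80.0, np.nan, 50.0, np.nan],
})

# Fill each missing LotFrontage with the mean of its own neighborhood
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```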

Below is the boxplot of the neighborhood against house prices.

[Figure: boxplot of SalePrice by Neighborhood]

Since neighborhood is a nominal variable, we do not expect an ordering in this boxplot, but it shows that different neighborhoods have different median values and price distributions.

Central air conditioning is an amenity that can increase the price of a house. The boxplot below shows that a house with central air conditioning is generally more expensive than one without.

[Figure: boxplot of SalePrice by CentralAir]


We also wanted to check for multicollinearity among the numeric columns. The plot below shows that some columns are highly multicollinear.

[Figure: multicollinearity among numeric columns]
Highly multicollinear columns are:

  1. BsmtFinSF1
  2. BsmtFinSF2
  3. BsmtUnfSF
  4. TotalBsmtSF
  5. 1stFlrSF
  6. 2ndFlrSF
  7. LowQualFinSF
  8. GrLivArea
  9. BsmtFullBath


We wrote a function to detect outliers as shown below:

[Figure: outlier-detection function code]

We used the interquartile range (IQR) to flag outliers: for each column, we computed IQR = Q3 - Q1 and marked any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as an outlier. We then set a threshold N and returned the indices of rows containing at least N outliers.
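The original function was shared as an image; a reconstruction following the description above (names and defaults are ours) might look like:

```python
import pandas as pd

def outlier_rows(df, cols, n_min=2, k=1.5):
    """Return indices of rows with at least `n_min` IQR outliers across `cols`."""
    flags = pd.DataFrame(index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # True where the value falls outside the IQR fences
        flags[col] = (df[col] < lower) | (df[col] > upper)
    counts = flags.sum(axis=1)
    return df.index[counts >= n_min].tolist()

# Toy check: row 4 is extreme in both columns
df = pd.DataFrame({
    "GrLivArea":   [1200, 1400, 1500, 1600, 6000],
    "TotalBsmtSF": [800, 900, 1000, 1100, 6100],
})
print(outlier_rows(df, ["GrLivArea", "TotalBsmtSF"], n_min=2))
```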

The following outliers identified in 'GrLivArea' were removed:

[Figures: GrLivArea vs. SalePrice, before and after outlier removal]

Transformation of Predictors:

We then checked the predictor variables for skewness, with the idea of applying a logarithm or Box-Cox transformation to the highly skewed ones. The following plot helped us visualize and identify them (we transformed the ones tagged with a star):

[Figure: skewness of predictor variables]


We applied logarithm transformation, which helped in fixing the skewness.

Before Transformation:

[Figure: predictor distribution before log transformation]

After Transformation:

[Figure: predictor distribution after log transformation]
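A sketch of the skew-then-transform step on synthetic right-skewed columns (the 0.75 threshold is an illustrative choice; the post does not state its exact cutoff):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "GrLivArea": rng.lognormal(7.3, 0.5, 1000),  # right-skewed, like the real column
    "LotArea":   rng.lognormal(9.1, 0.6, 1000),
})

# Log-transform any numeric column whose skewness exceeds the threshold
threshold = 0.75
skewed = [c for c in df.columns if skew(df[c]) > threshold]
df[skewed] = np.log1p(df[skewed])

print({c: round(skew(df[c]), 2) for c in df.columns})
```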


Feature Engineering

We added new variables to provide enriched information to our model. 

  1. Added boolean columns that indicate whether the house has a basement, garage, 2nd floor, etc.
  2. Added total house area, total full baths, total half baths, and total bathrooms above ground
  3. Added ratio between columns:
    1. Total bathroom number / total house area
    2. Total bedroom number / total house area
    3. Kitchen number / total house area
    4. Total room number / total house area
    5. Garage cars / total house area
    6. Bedroom number / total bathroom number
    7. Total room number / total bathroom number
    8. Kitchen number/total bedroom number
    9. Garage cars / total room number

[Table: engineered ratio features]

  4. Log-transformed skewed columns:
    1. 'GrLivArea', '1stFlrSF', 'LotArea', 'LotFrontage', 'GarageArea', 'BsmtUnfSF'
  5. Added the following new features:
    1. Yearsale = YrSold - YearBuilt
    2. Yearremod = YearRemodAdd - YearBuilt
    3. Remod = boolean column indicating whether 'Yearremod' is nonzero (i.e., whether the house was remodeled)
  6. Scaled the data using one of three scalers:
    1. MinMaxScaler
    2. StandardScaler
    3. RobustScaler (less sensitive to outliers)
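A sketch of steps 1, 2, 5, and 6 on a toy frame (column names come from the Ames data; the exact helper logic is our reconstruction, not the authors' code):

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({
    "TotalBsmtSF":  [800, 0, 1000],
    "1stFlrSF":     [900, 1100, 1000],
    "2ndFlrSF":     [0, 700, 800],
    "YrSold":       [2008, 2009, 2010],
    "YearBuilt":    [1990, 2005, 1970],
    "YearRemodAdd": [2000, 2005, 1995],
})

# Step 1: boolean indicators for presence of a basement / 2nd floor
df["HasBsmt"] = (df["TotalBsmtSF"] > 0).astype(int)
df["Has2ndFlr"] = (df["2ndFlrSF"] > 0).astype(int)

# Step 2: total area of the house
df["TotalArea"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

# Step 5: age-related features
df["Yearsale"] = df["YrSold"] - df["YearBuilt"]
df["Yearremod"] = df["YearRemodAdd"] - df["YearBuilt"]
df["Remod"] = (df["Yearremod"] != 0).astype(int)

# Step 6: RobustScaler centers on the median and scales by the IQR,
# so remaining outliers distort the scaling less
scaled = RobustScaler().fit_transform(df[["TotalArea", "Yearsale"]])
print(df[["HasBsmt", "TotalArea", "Yearsale", "Remod"]])
```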


Modeling

We applied six different models to our dataset:

  • Ridge model
  • Lasso model
  • Elastic Net model
  • Basic Tree model
  • Random Forest model
  • XGBoost model

For every model, we used grid-search cross-validation to find the optimal regularization parameter (lambda).

In the pre-modeling phase, the train set was further divided into training and validation subsets, so we could use the known sale prices to calculate RMSE and R^2 scores.
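The grid search plus hold-out evaluation can be sketched as follows, on a synthetic stand-in for the engineered Ames matrix (scikit-learn calls the regularization parameter `alpha` rather than lambda):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the engineered features
X, y = make_regression(n_samples=300, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Grid-search the regularization strength with 5-fold cross-validation
grid = GridSearchCV(
    Ridge(),
    {"alpha": np.logspace(-3, 3, 13)},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
rmse = mean_squared_error(y_test, pred) ** 0.5
print(f"best alpha: {grid.best_params_['alpha']:.4g}")
print(f"holdout RMSE: {rmse:.3f}, R^2: {r2_score(y_test, pred):.3f}")
```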

[Table: pre-modeling RMSE and R^2 by model, imputation method, and feature set]

This result shows how different imputation methods and feature-engineering choices affect the quality of the prediction. We compared the imputation methods against the feature-engineering steps using three different linear models. The best result in the pre-modeling phase was an RMSE of 0.1201, achieved by the Ridge model with imputation method 2 and feature-engineering steps 1, 2, and 3.

In the final results, we found that adding feature-engineering steps 4, 5, and 6 and switching to imputation method 1 improved the score; the record is shown in the final results section.

We also tried a reduced model, selecting the variables with the best chi-square scores.

[Figure: chi-square feature-selection results]

Compared with the full model, the R^2 decreased from 0.935592 to 0.848292.
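The post scored variables with chi-square; a comparable select-the-top-K pattern in scikit-learn is sketched below on synthetic data (using `f_regression`, since scikit-learn's `chi2` scorer expects non-negative features and a categorical target):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=40, n_informative=10,
                       noise=5, random_state=0)

# Keep only the 10 highest-scoring features
X_red = SelectKBest(f_regression, k=10).fit_transform(X, y)

full = cross_val_score(Ridge(), X, y, cv=5, scoring="r2").mean()
reduced = cross_val_score(Ridge(), X_red, y, cv=5, scoring="r2").mean()
print(f"R^2 full: {full:.3f}  reduced: {reduced:.3f}")
```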

Coefficients of the Ridge and Lasso models:

After a cross-validated grid search, the best lambda for the Ridge model was 22.222, and for the Lasso model, 1e-10.

Coefficients of the Ridge model as the hyperparameter lambda increases:

[Figure: Ridge coefficient paths as lambda increases]

Notice that none of the coefficients becomes exactly 0; they only approach 0 asymptotically.

Coefficients of the Lasso model as the hyperparameter lambda increases:

[Figure: Lasso coefficient paths as lambda increases]

In the Lasso model, many coefficients became 0 within a lambda range of 0 to 1e-5; the coefficients that remain correspond to the important features. The pink and brown lines are two of them.
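This sparsity behavior is easy to demonstrate on synthetic data: as alpha (lambda) grows, Lasso drives more coefficients exactly to zero, which is why it doubles as a feature selector.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5, random_state=0)

# Count the surviving (non-zero) coefficients at increasing penalties
counts = {}
for alpha in [0.01, 1.0, 100.0]:
    model = Lasso(alpha=alpha, max_iter=50000).fit(X, y)
    counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha:<6} non-zero coefficients: {counts[alpha]}")
```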

Model complexity analysis:

Ridge Model:

[Figure: Ridge train/test RMSE vs. number of features]

As expected, the RMSE decreases as the number of features increases.

Lasso Model:

[Figure: Lasso train/test RMSE vs. number of features]

Compared with the Ridge model, the gap between the Lasso model's train and test RMSE is smaller, which suggests the Ridge model is more prone to overfitting. The Lasso model's RMSE also decreases much faster. In theory, then, the Lasso model should perform better; in our project, however, the Ridge model achieved the lower RMSE.

Final Results: Kaggle Submission

[Table: Kaggle submission scores by model]

In the final submission, we submitted six different model predictions to Kaggle. Compared with the RMSEs calculated in our pre-modeling phase, the Kaggle RMSEs are much higher, which indicates that our models are overfitting.

The final RMSE is visualized in the following graph (the lower, the better):

[Figure: final Kaggle RMSE by model]

The Ridge model has the lowest RMSE at 0.11694, while the Tree model has the worst at 0.19273. XGBoost is second-lowest, and the Lasso and Elastic Net models land around 0.137. From this graph, we can tell that the relationships in this dataset are close to linear.


Conclusion

This was definitely a rewarding project. Our participation in this competition exposed us to the challenges of machine learning projects and the mindset needed to approach data science problems.

For data cleaning and imputation, the most important thing was to distinguish categorical from numeric variables: a variable like MS SubClass is stored as a number but is actually categorical. For feature engineering, the normality of both the features and the target variable matters for prediction accuracy. For modeling, linear models tended to outperform tree-based models on this regression problem in both speed and score, and Lasso helped with feature selection by shrinking relatively unimportant coefficients to zero.

Another consideration that would actually expand the scope of the problem and its solution is to include and analyze external data involving local policy changes and economic trends in the housing market specific to Ames, Iowa. Perhaps, adding even more data such as school zoning or transportation and commercial information would produce models with more predictive power.

Additionally, as we hone our craft and expand our skills, one aspect we would have liked to explore is the use of more models and different approaches to identify the best solution for this problem. We chose to keep our methods simple and robust in order to learn and ensure our understanding, but perhaps applying newer methods and models would yield better results.

In the future, we would like to apply a stacking technique to improve our model’s score.

Thank you for taking the time to learn about our work. We welcome constructive feedback.


About Authors

Priya Srivastava

Priya Srivastava is an analytical thinker with business acumen. Her first love was STEM, which she pursued in earning a bachelor's degree in Engineering and building a career as a Software Engineer and data warehousing consultant in the technology...

Zhuoyi Liu

Zhuoyi is an aspiring data scientist who likes the challenge of drawing on creative solutions to problems. Alongside completing a Master's degree at New York University (expected Dec. 2019), he is also a fellow at the NYC data science...
