Statistical Data Inference on Housing Prices

Posted on Aug 3, 2021
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Background and Motivation

Home buyers and home sellers often face the difficult task of purchasing or selling a house at a good value. But how can we determine the value of a house? Which features of a house matter most when estimating its price? What is the dollar impact on price of one particular feature compared to another? Answering these questions would make the lives of home buyers and home sellers a lot easier. In this project, I try to use data to answer them.

The Dataset

The dataset used for this project is taken from Kaggle. The dataset was already broken up into a train and test set, each set containing 79 feature variables and 1,460 observations. Each observation in the dataset describes the sale of a house in Ames, Iowa from 2006 to 2010. The dataset contains feature variables that are numeric, ordinal categorical, and nominal categorical.

Getting to Know the Data

To get a feel for the data I was working with, I started with some basic exploratory analysis. First, I wanted to see whether any values in the dataset were missing. The plot below shows the columns in the training set that contained missing values, along with the number of missing values in each.


While there may appear to be a large number of missing values, in many columns a missing value simply means that the house does not have that feature. For example, the missing values for pool quality ("PoolQC") just represent houses that do not have pools. This is also the case for feature variables such as fireplace, alley, and fence. For values such as these, I either replaced them with 0 or dropped them from the dataset. Feature variables whose missing values did not fall into this category were imputed using a random forest model. The inputs to the random forest model were all other independent variables in the dataset.
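The two kinds of missing values described above can be handled roughly as follows. This is a minimal sketch on a made-up toy frame, not the original project code: the column names follow the Kaggle data, while the values, the hypothetical `quality_scale` mapping, and the choice of predictor columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data with made-up values; column names follow the Kaggle Ames dataset.
df = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, np.nan, "Ex", np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, np.nan],
})

# Case 1: NaN means "the house has no pool", so map quality to integers
# and fill the missing entries with 0 (hypothetical quality scale).
quality_scale = {"Fa": 1, "Gd": 2, "Ex": 3}
df["PoolQC"] = df["PoolQC"].map(quality_scale).fillna(0).astype(int)

# Case 2: NaN is genuinely unknown, so predict it from other columns
# with a random forest trained on the rows where the value is observed.
predictors = ["LotArea", "GrLivArea"]
known = df["LotFrontage"].notna()
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(df.loc[known, predictors], df.loc[known, "LotFrontage"])
df.loc[~known, "LotFrontage"] = rf.predict(df.loc[~known, predictors])
```

In the real data the predictor set would be all other independent variables, as described above, rather than the two columns used here.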

Scatterplots of Numeric Features Data

The plots below show some of the numeric features in our dataset plotted against the target variable, sale price.


Looking at the plots above, we can see that a lot of the independent variables in our dataset appear to have positive correlation with sale price. One other thing to notice is the presence of outliers. We can definitely see some points that fall very far away from the other points.

Feature Value Counts Data

The plots below show the value counts for some of the feature variables in the dataset.


Looking at the plots above, we can see that some of the feature variables have value counts that are very imbalanced, where one value appears a lot more frequently than the other values. It may be worth dropping such features, as they will likely not provide much information.

Distribution of the Target Variable

Below is a plot of the distribution of the target variable sale price.

Looking at the above plot, we can see that the distribution of sale price is a little right-skewed. Performing a log transformation on the sale price column can help make its distribution closer to normal. This in turn can help make the residuals of our regression models closer to normal as well. Normally distributed residuals are an underlying assumption of inference in linear regression. Meeting this assumption improves the reliability of our coefficient estimates and can help the accuracy of our regression models.
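The effect of the log transformation on skewness can be illustrated with synthetic data. The sketch below does not use the Ames prices: it draws made-up right-skewed values from a lognormal distribution as a stand-in, and `skewness` is a small helper defined here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" standing in for the sale price column.
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)


def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())


log_prices = np.log(prices)
skew_raw = skewness(prices)      # clearly positive (right-skewed)
skew_log = skewness(log_prices)  # close to zero after the transform

# Predictions made on the log scale are mapped back to dollars with exp.
recovered = np.exp(log_prices)
```

A model fit on the log scale predicts log prices, so its predictions need the inverse transform (`np.exp`) before they can be read as dollar amounts.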

Data on Feature Engineering

I created additional feature variables by combining other variables in the dataset. One such variable was the total square footage of the house, which I created by adding up the square footage of all floors and all outside property of the house. I also combined the full-bath and half-bath variables by adding the number of full baths to one-half the number of half baths. In addition, I generated binary variables that were 1 if a house had a particular feature and 0 if it did not. For example, I created a column called 'hasPool' where a value of 0 represented a house with no pool and a value of 1 represented a house with a pool.
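The engineered features described above might look like this in pandas. This is a sketch under assumptions: the column names come from the Kaggle data, the rows are made up, and the exact set of columns summed into the total square footage is a guess, since the post does not list them. `TotalSF` and `TotalBath` are illustrative names; only 'hasPool' is named in the text.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the Kaggle Ames dataset.
df = pd.DataFrame({
    "TotalBsmtSF": [856, 1262, 920],
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
    "WoodDeckSF": [0, 298, 0],
    "OpenPorchSF": [61, 0, 42],
    "FullBath": [2, 2, 1],
    "HalfBath": [1, 0, 1],
    "PoolArea": [0, 0, 512],
})

# Total square footage: floors plus outside areas (assumed column list).
area_cols = ["TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "WoodDeckSF", "OpenPorchSF"]
df["TotalSF"] = df[area_cols].sum(axis=1)

# A half bath counts as one-half of a full bath.
df["TotalBath"] = df["FullBath"] + 0.5 * df["HalfBath"]

# Binary indicator: 1 if the house has a pool, 0 otherwise.
df["hasPool"] = (df["PoolArea"] > 0).astype(int)
```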

Data on Feature Selection


Looking at the image below, we see the pairs of variables that have the highest absolute correlations. There is a high amount of multicollinearity in our data.

While multicollinearity might not be a huge issue for prediction purposes, it will definitely be an issue for statistical inference purposes. Multicollinearity will increase the variance of our coefficient estimates and make them unreliable. Thus, we want to select features in a way where multicollinearity is reduced and only the most important variables are chosen.
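One way to find highly correlated pairs, and to quantify how much they inflate coefficient variance, is sketched below on synthetic data. The variance inflation factor (VIF) is an addition for illustration and is not mentioned in the original analysis; it is computed here as the diagonal of the inverse correlation matrix. The column names are borrowed from the Ames data, but the values are simulated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
area = rng.normal(1500, 400, n)
X = pd.DataFrame({
    "GrLivArea": area,
    # Simulated room count that is nearly a linear function of area.
    "TotRmsAbvGrd": area / 250 + rng.normal(0, 0.5, n),
    # Simulated independent variable.
    "YearBuilt": rng.normal(1975, 25, n),
})

corr = X.corr()

# Keep only the upper triangle so that each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.abs().where(mask).stack().dropna().sort_values(ascending=False)

# VIF of each predictor: diagonal of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=X.columns)
```

A VIF near 1 means a predictor is almost uncorrelated with the others, while a large VIF means the variance of its coefficient estimate is inflated by that factor.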

Univariate Selection

I used scikit-learn's SelectKBest class to test the individual effect of each feature variable. The correlation of each variable with the target variable was computed and then converted to an F score. The sorted bar chart below shows the returned scores. A higher score indicates a more important variable.
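A minimal sketch of this univariate scoring with `SelectKBest` and `f_regression` is shown below. The data are simulated and the value `k=2` is arbitrary; the post does not say how many features were kept at this step.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 300
# Simulated features: two that drive the target and one pure-noise column.
X = pd.DataFrame({
    "TotalSF": rng.normal(2500, 600, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "Noise": rng.normal(0, 1, n),
})
y = 50 * X["TotalSF"] + 10000 * X["OverallQual"] + rng.normal(0, 20000, n)

# f_regression converts each feature's correlation r with the target into
# F = r^2 / (1 - r^2) * (n - 2); SelectKBest keeps the k highest scores.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = list(X.columns[selector.get_support()])
```

Because each feature is scored on its own, this method cannot tell that two highly correlated features carry the same information, which is why it is paired with the multivariate methods in the next section.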

From the above plot, we can see that variables related to the house's square footage appear to be most important for predicting sale price.

Multivariate Selection

In addition to the univariate selection method above, I also used the importance scores returned by models such as ridge regression, lasso regression, and gradient boosting regression. For ridge and lasso regression, the magnitudes of the returned coefficients are used as the importance scores. For gradient boosting regression, importance scores are based on how much a variable reduced variance during decision tree building. The scores returned for all three models are shown below.
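The three kinds of importance scores can be collected side by side, as in the sketch below. The data are simulated and the regularization strengths (`alpha`) are arbitrary placeholder values, not the ones used in the project. The features are standardized first so that the coefficient magnitudes are comparable across variables.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
# Simulated features: two informative columns and one pure-noise column.
X = pd.DataFrame({
    "TotalSF": rng.normal(2500, 600, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "Noise": rng.normal(0, 1, n),
})
# Simulated target on a log-price-like scale.
y = 0.0004 * X["TotalSF"] + 0.08 * X["OverallQual"] + rng.normal(0, 0.1, n)

X_std = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=1.0).fit(X_std, y)
lasso = Lasso(alpha=0.02).fit(X_std, y)
gbr = GradientBoostingRegressor(random_state=0).fit(X_std, y)

importance = pd.DataFrame({
    "ridge": np.abs(ridge.coef_),        # coefficient magnitude
    "lasso": np.abs(lasso.coef_),        # coefficient magnitude
    "gbr": gbr.feature_importances_,     # impurity-based, sums to 1
}, index=X.columns)
```

Lasso can shrink the coefficients of uninformative features all the way to zero, which is what makes it usable as a feature selector.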

Ridge Regression Scores
Lasso Regression Scores
Gradient Boosting Regression Scores

The above plots generally choose the same features as being important. Variables such as total square footage, overall quality, and the year a house was built appear to be important in determining price.

Tree Based Models

Encoding Categorical Columns

In preparing my data for the tree based models, I ordinal encoded all categorical variables so that each value in a column was represented by an integer.


After ordinal encoding the categorical variables, I removed outliers using Cook's distance. Cook's distance essentially measures the influence of a data point. More specifically, it tells us how much a regression model changes when an observation is deleted. A higher Cook's distance indicates that deleting a particular point changes the regression model a lot. I removed all observations with a Cook's distance greater than one. The deleted observations are shown below.
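Cook's distance can be computed directly from an ordinary least squares fit. The sketch below uses plain NumPy on simulated data with one planted outlier; it is one way to do the computation and not necessarily how it was done in the project. For observation i with residual e_i and leverage h_ii, the distance is D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2, where p is the number of fitted parameters and s^2 is the residual variance estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(500, 3000, n)
y = 100 * x + rng.normal(0, 20000, n)
# Plant one high-leverage observation far below the trend.
x[0], y[0] = 6000.0, 100000.0

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - p)

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'.
leverage = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

cooks_d = resid ** 2 / (p * s2) * leverage / (1 - leverage) ** 2

# Keep only observations with Cook's distance of at most one.
keep = cooks_d <= 1.0
X_clean, y_clean = X[keep], y[keep]
```

A point needs both a large residual and high leverage to get a large Cook's distance, so ordinary noisy points near the center of the data are left alone.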


The two tree based models I used were a random forest regressor and a gradient boosting regressor. Using 5-fold cross validation, the random forest regressor resulted in a mean r-squared of 0.892. The standard deviation of the r-squared across all 5 folds was 0.019. The gradient boosting regressor resulted in a mean r-squared of 0.910 and a standard deviation of 0.019.
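The cross-validation procedure for the two tree based models can be sketched as follows. The data are simulated, the hyperparameters are scikit-learn defaults, and shuffling the folds is an assumption, so the scores from this example are unrelated to the ones reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Simulated data with one linear and one nonlinear effect.
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(0, 0.5, 300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    # One r-squared value per fold; summarize with mean and standard deviation.
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
    results[name] = (r2.mean(), r2.std())
```

Reporting the standard deviation alongside the mean shows how much the score depends on which houses end up in the held-out fold.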

Linear Models

Encoding Categorical Columns

In preparing my data for the linear models, I ordinal encoded all ordinal categorical columns and dummy encoded all nominal categorical columns. Dummy encoding a nominal categorical column creates a new column for each category in the variable. Each new column contains a 1 if the observation belongs to that category and a 0 if it does not. For each variable, I dropped one of the dummy encoded columns to reduce multicollinearity.
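The two encodings can be sketched in pandas as follows. The rows are made up, and the `quality_order` mapping is an assumed ordering of the quality codes used in the Kaggle data.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the Kaggle Ames dataset.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "OldTown", "NAmes"],  # nominal
    "HeatingQC": ["Ex", "TA", "Gd", "Fa"],                     # ordinal
})

# Ordinal column: map each level to an integer that preserves its order.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["HeatingQC"] = df["HeatingQC"].map(quality_order)

# Nominal column: one 0/1 column per category, dropping the first category
# so that the dummy columns are not perfectly collinear with the intercept.
encoded = pd.get_dummies(df, columns=["Neighborhood"], drop_first=True, dtype=int)
```

The dropped category (here "CollgCr", the first in sorted order) becomes the baseline, and each remaining dummy coefficient is read as a difference from that baseline.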

Scaling the Data

To make the coefficient estimates comparable as measures of importance, I standardized all the feature variables in the dataset so that each column had a mean of 0 and a standard deviation of 1.
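Standardization can be done with scikit-learn's `StandardScaler`, as in this small sketch on made-up numbers. The fitted scaler stores each column's mean and standard deviation, which are needed later to translate standardized coefficients back into original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features on very different scales.
X = np.array([
    [1500.0, 5.0],
    [2000.0, 7.0],
    [2500.0, 6.0],
    [3000.0, 8.0],
])

scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)

# Per-column statistics learned from the data.
col_means = scaler.mean_
col_sds = scaler.scale_
```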

Log Transforming Sale Price

As mentioned earlier, the distribution of the target variable sale price was slightly right-skewed as shown below.

When running a regression model on sale price, and then plotting the fitted values against the residuals, we obtain the results shown below.

The above plot shows a pattern in the residuals, which suggests that the regression assumptions, including normally distributed residuals, are not met. Running a regression model after we perform a log transformation of the target variable results in the residual plot below.

The plot above shows no pattern in the residuals and better meets the assumption that the residuals follow a normal distribution. Thus, I log transformed the target variable sale price.


The same process described above for removing outliers using Cook's distance was also applied here.

Feature Selection

For all linear models, I used a subset of features that reduced multicollinearity and kept only important features.


I tested numerous linear models. The linear model I believed to be the best in terms of prediction accuracy and reliability of coefficient estimates was a linear regression model that used the features selected by lasso regression. The model had a mean r-squared of 0.914. The standard deviation of the r-squared across all 5 folds of cross validation was 0.009. The coefficient estimates are shown below.

Each coefficient estimate can be interpreted as the increase in the log sale price for a one standard deviation increase in the explanatory variable. Because this interpretation is somewhat difficult to understand, I converted these results into actual dollar impacts, which are shown in the table below.

The values in the right columns can be interpreted as the dollar impact of a one-unit change in each explanatory variable on the average house price. For example, a one-unit increase in a house's total square footage (one additional square foot) will increase the average house price by around 19 dollars.
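One way to turn a standardized log-scale coefficient into a dollar figure is sketched below. The post does not spell out the exact conversion that was used, so this is an assumed method, and all three input numbers are hypothetical placeholders rather than values from the fitted model. Dividing the coefficient by the feature's standard deviation gives the change in log price per original unit, and exponentiating gives the multiplicative change in price.

```python
import numpy as np

# Hypothetical inputs, chosen only to illustrate the arithmetic.
coef_std = 0.10           # change in log price per one SD of the feature
feature_sd = 800.0        # standard deviation of the feature (square feet)
average_price = 180000.0  # average sale price in dollars

# Change in log price for a one-unit (one square foot) increase.
coef_per_unit = coef_std / feature_sd

# Multiplicative effect on price, then the dollar change at the average price.
pct_change = np.exp(coef_per_unit) - 1
dollar_impact = average_price * pct_change
```

Because the model is linear in the log of price, the dollar impact depends on the price at which it is evaluated; this sketch evaluates it at the average price.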


As one might expect, more recently built houses tend to sell for more than older houses. The number of fireplaces in a house and the house's heating quality also appear to be important predictors of its price. Since Iowa can get very cold in the winter, these features may be important to people there. The most important feature for determining house price is the total square footage, with each additional square foot increasing the average house price by around 19 dollars.
