Housing Prices Prediction using Machine Learning

The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Project Overview


What makes a house a good buy? Which particular aspects of your house or the house that you’re looking to buy can make it more or less expensive, given all other things stay intact? When is the best season to sell, or best season to buy? How much less will your house cost if you wait another year to sell it? Is having a large basement makes a house more attractive to potential buyers? 

The Data

Data itself was obtained from the Kaggle website, particularly from the House Prices: Advanced Regression Techniques competition. The dataset contains 79 variables that describe almost every possible aspect of a residential property. Features range from the size of the property to its geographical positioning relative to city facilities, from the presence of the additional amenities to their overall quality, from the materials used to build the roof to the year when the house was last renovated

Data Preprocessing

Missingness and Imputation

Like any data collected from in the ‘real world’, some of the variables in the dataset had a fair share of missing values. However, not all missingness is created equal. Therefore each of the variables required a closer examination. 

First type of missingness, which we called “fully explained missingness”, is mainly connected to the presence or absence of certain amenities. For instance, if the house does not have a garage, it has an “NA” in the column “Garage Type”. This means that the house simply does not have a garage, and data is not actually missing. Therefore, such NA values have been substituted with “No Feature” variables.

Another type of missingness is entirely derived from the previous one. In other words, the value in the particular column was missing or for numeric values was equal to zero solely based on the values of another column. Continuing with the Garage example, if the house had no Garage, therefore has a zero in the column “Garage Area” or “Garage Year Built”. Thus, it didn’t require additional imputation.

Third type of missingness is what is called “Missingness Completely at Random”, which means that several values in some of the columns seem to be missing without any evident pattern and any connection to other features. Their missingness can best be explained by mistake during data collection. In this case, our team used ‘median’ imputation for numeric variables, and ‘majority vote’ imputation for categorical variables.


One of the most important parts of the data preprocessing is the removal of outliers. Outliers, even when their amount is insignificant, can noticeably skew the model, and, especially for linear models. That’s why their examination and subsequent removal or transformation is often crucial.

By plotting the relationship between independent variables vs. the price, we were able to observe the outliers mentioned above.

Some of the outliers were observed in variables such as the Living Room area and First Floor Area.

Another type of outliers occurred in categorical variables (both nominal and ordinal) and could be mostly explained by category having a very small number of homes in it. 

Insignificant Features

Sometimes, even after taking into account all the types of missingness and performing all the appropriate imputations, variables still contain such an enormous amount of missing values, that using them will not only not contribute to a better model, but will only make it less accurate in predicting the actual values. 

One of the examples of such features is the Pool Area and Pool Quality. Both of which are missing 99.65% of all the values. They have virtually no variance, and cannot add any information to our future model. Therefore, these kinds of variables were dropped.

Another example of insignificant features includes LowQualitySF, where 98% of values are zeros, or Heating Type, where 98% of the values belong to one category.

New Features Creation

Some of the features, while being useful overall, are not presented in the form best suitable to perform model building. Also, some of the features are too sparse, which makes them have significant missingness on their own, but when combined, they can give us a much better picture and can significantly aid the analysis.

Some of the features created from the manipulation of other features included:

Age = Year Sold - Year BuiltTotal porch = Open Porch + Enclosed Porch + Screen Porch + 3Ssn Porch

Garage Age = Year Sold - Year Built

Data Transformation

Before data can be “fed” into the model, we need to examine it for normality, and then try to transform it if it is skewed,  or not normally distributed. Some of the transformations applied to our features included log transformation, square root transformation, as well as squared log transformations.

For example, the log transformation has been applied to a newly created Total Porch variable in order to remedy its skewness of the data. 

For the purposes of analysis, the dependent variable (House Price)  has also been log-transformed to ensure greater normality and linearity.

Preliminary Assumptions

First, before starting model building, it is crucial to take a general look at the data. What are some of the assumptions that we can make about features’ influence on the housing price before we go on with creating a model? Is there any apparent linearity that can give us a hint on how this feature makes the price go up or down?

We can observe a degree of correlation between some of the variables and the final house price. Many of the variables seem to confirm our intuition. For example, the older the house, the lower on average its price appears to be. Another great linearity example is the “Overall Quality” variable, whose correlation with the price seems to follow our reasonable assumption. The higher the quality of the house, the higher the price of the house.

Standardization and Encoding

The final step before training the model is to standardize some variables and encode the others. 

Ordinal variables, such as the ones that demonstrate the quality, for example, were transformed into numeric variables, as they have a definite order in their structure. (e.g., “excellent” is clearly greater than “good”).

Numerical and ordinal variables have been standardized using the sklearn StandardScaler, which subtracts the mean and then transforms all variables to unit variance. The mean of the standardized collection of observations is equal to zero, and observations follow a standard normal distribution.

For the categorical values, the OneHotEncoder was used to encode values and prepare them for model tuning.

Machine Learning

Model Performance

For predictions, the team has built, tuned, and trained four models. The models are Ridge, Lasso, Gradient Boosting, and Ensemble model, which included all three models mentioned prior. The best Kaggle score that the team was able to achieve is 0.12163, which is the Root Mean Squared Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. The model that gave the best result is the Ridge Penalized Regression. Other models provided us with some other decent scores.

Key Findings

Arguably, an even more important part than the model accuracy is the model interpretation. In other words, by looking at variables and coefficients, how appropriately and comprehensively can we explain the importance of a particular variable, and how can we measure its importance.

Building several models, with different accuracy, allowed us to examine differences in “feature importance”, and compare which variables seem to stay important between models. While some of the models give us slightly less accurate predictions, they can tell us more about which aspects of the residential property are more important.

Model Interpretation

After the model is ready, it is time to find out what the coefficients actually mean. Since we transformed, merged, and standardized our features, it is now crucial to convert the coefficients back into the more “direct” numeric form.

The table below demonstrates some of the examples of how coefficients can be transformed into the “price percentage increase”.

Feature importance

The picture below demonstrates the importance of features according to the two most interpretable models.

We can observe that features like the Size of the Living Area, Age of the House, and Overall Quality stay very significant across the models. 

Some of the most important features have been grouped into four categories, presented in the picture below. Key variable groups are Location, Size(sq. footage), Presentability, Time & Circumstance.


Ways to add value

Finally, we want to see what are the actual, straightforward suggestions that the model can give us regarding the changes that can be made to the house, in order to increase its values or suggestions that can help us to choose the time to buy a home when prices are generally lower.

One of the variables we can take a look at is the Month of the Sale. The base month, to which we are going to compare price variations, is June. The worst month to sell a house is October, since prices in October tend to be, on average, 3% lower than the price for the exact house in June. On the other hand, the best time to sell, and hence probably the worst time to buy is May, owing to the fact that the average price of the house in May tends to be more than 2% higher than in June.

Another variable that has a direct influence on the price of the house, and is the feature that the house owner can actually control is the “Roof Material”. Changing the material with which the roof was built can directly increase the average sale price of the house. 

In our case, we can compare the price of different roof option installation, to the most common roof material, which is Composite Shingle. The average price of the installation of Composite Shingle Roof for the 1500 sq.ft size house will is - $5,500. At the same time, the installation of the Wood Shingle roof, for the house of the same size, will cost, on average, $9,000, but at the same time, it will increase the average sale price by almost $25,000. We conclude that changing the roof material is, in most cases, absolutely worth it.

Another example of a useful exterior modification is Exterior Covering material. According to the model, changing your house covering can increase the final price of your house, with other things equally considered. Most noticeably, covering your house in Brick Face, can increase the price of the house by more than $13,000. While installing the brick face costs only $6,500 more than the most common exterior covering - Vinyl Sliding.


In general, the models built, allow us to both make reasonably confident predictions of the actual prices, and understand the importance and the role of the features crucial in making the predictions. 

In many ways, the outcome has proven right many of our intuitive assumptions regarding what determines the value of the house. Features like Overall Quality, Total Living Area, and Age are expected to have the strongest influence on the price of the house. On top of the other features directly or indirectly connected to the area, such as Lot Area, Garage and Basement Area. Finally, as it is often said, “Location, Location, Location”, often plays an important role. Neighborhood in which the house is located, as well as its proximity to various amenities or disturbances, have a significant influence on the final house price.

Thank You

Thank you for taking the time to read our blog. You can follow the link to the GitHub if you’re interested in learning more about the process behind the analysis. 

The Team

The group “Tea-Mates” consists of Ting Yan, Marina Ma, Alex (Oleksii) Khomov, Lanqing Yang.

About Authors

Oleksii Khomov

Data Scientist with a strong analytical background in the fields of marketing, research and management consulting. Experienced in providing actionable insights derived from data to client organizations, including senior management. Oleksii holds a masters degree in marketing from...
View all posts by Oleksii Khomov >

Ting Yan

Ting Yan is a Data Science fellow at NYC Data Science Academy. She holds a Ph.D. in Astrophysics with a solid background in mathematics developing analytical models on diverse subjects. She has excellent programming skills with applications on...
View all posts by Ting Yan >

Lanqing Yang

Lanqing graduated from Fordham University with a bachelor's degree in New Media and Digital Design. She is currently pursuing a master's degree in the quantitative field to further strengthen her knowledge and skills. Through her problem-solving experience as...
View all posts by Lanqing Yang >

Related Articles

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI