Using Data Science to Predict Ames House Prices

Posted on Feb 10, 2022

Thanks to advances in technology, people can explore almost anything online. According to the data collected, when it comes to real estate, many people still prefer seeing a property in person, but they often start with a virtual visit. Pictures alone don't provide all the information buyers need, such as the actual measurements of the rooms. Other factors, such as the year the house was built, the size of the lot, the number of bathrooms, and the quality of the basement, are also carefully weighed by buyers.

The purpose of this project was to study housing feature data and find out what affects the price of a house. The house price dataset from Kaggle contains 2,930 observations of houses in Ames, Iowa, with 79 explanatory variables and one target variable, the sale price.

Looking at the dataset more closely, the 79 variables break down into 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables. The categorical variables range from 2 to 28 classes, from the STREET variable, which has only gravel and paved, to the NEIGHBORHOOD variable, which covers all areas of Ames. The discrete variables mostly count the rooms in the house. The continuous variables record the square footage of each area of a house, which can be tricky because some of them express the same measurement in different terms.

Prior to fitting a model, the target variable, sale price, had to be investigated, followed by feature engineering. I also performed feature selection to remove outlier observations and duplicated variables, and created dummy variables from the categorical ones.

First of all, I made a histogram of sale price. It was right-skewed, so a logarithm transformation was applied, as shown in Figure 1. I then calculated the correlation between the sale price and the dummy features derived from each categorical variable, keeping the features with at least a moderate correlation. Finally, the correlations among the variables themselves were examined to detect multicollinearity and remove it where present. Figure 2 shows the correlations of all variables after dummification.
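The transformation and correlation-filtering step can be sketched as below. The column names and toy values are illustrative stand-ins, not the actual Ames data:

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the Kaggle Ames data (assumed columns).
df = pd.DataFrame({
    "SalePrice": [120000, 200000, 350000, 90000, 260000, 180000],
    "Neighborhood": ["NAmes", "CollgCr", "NoRidge", "NAmes", "NoRidge", "CollgCr"],
    "Street": ["Pave", "Pave", "Pave", "Grvl", "Pave", "Pave"],
})

# Sale price is right-skewed, so model its logarithm instead.
df["LogSalePrice"] = np.log1p(df["SalePrice"])

# Dummify the categorical variables, then correlate each dummy with the target.
dummies = pd.get_dummies(df[["Neighborhood", "Street"]], dtype=float)
corr = dummies.corrwith(df["LogSalePrice"]).abs()

# Keep only dummies with at least a moderate correlation to sale price.
kept = corr[corr > 0.2].index.tolist()
print(sorted(kept))
```

On this toy frame, the weakly correlated CollgCr dummy is dropped while the strongly price-associated neighborhoods and street types survive.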

Figure 1: histograms of sale price, original and log transformed

Figure 2: correlation heat map of sale price with all variables
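One common way to carry out the multicollinearity check is to scan the upper triangle of the correlation matrix and drop one feature from every highly correlated pair. The 0.8 threshold and the toy data below are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Synthetic example of two features that measure nearly the same thing
# (garage size in cars vs. in square feet; names are illustrative).
df = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 1, 2],
    "GarageArea": [280, 520, 510, 780, 300, 500],
    "OverallQual": [5, 8, 4, 7, 6, 3],
})

# Upper triangle of the absolute correlation matrix: each pair appears once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from any pair correlated above 0.8 to curb multicollinearity.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```

Here GarageArea nearly duplicates GarageCars, so one of the pair is removed while the independent quality feature stays.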

After handling the given categorical variables, each discrete variable was also checked to determine whether it behaves more like a categorical type, which requires dummification, or like a continuous one.

For example, overall house quality was treated as continuous while the number of full bathrooms was treated as categorical. Moreover, since some of the continuous variables included "none" among their values, they could make the model inaccurate without appropriate processing. Figure 3 shows the garage year built: a missing garage would register as if the garage had been built in year 0 (zero) rather than as "none", meaning no garage. I therefore converted such variables into categorical ones with levels such as none, older, and newer.

Figure 3: garage year built. Left untreated, this would produce an inaccurate regression model.
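As a sketch of that recoding (the cutoff year and sample values are my own illustration, not the exact thresholds used in the project):

```python
import pandas as pd

# Hypothetical GarageYrBlt values; 0 stands for "no garage" in the raw data.
garage_yr = pd.Series([0, 1955, 1998, 0, 2005, 1940])

# Treating 0 as a real year would distort a regression, so recode the
# column as a three-level categorical instead (the cutoff year is assumed).
def garage_age_group(year, cutoff=1980):
    if year == 0:
        return "none"
    return "older" if year < cutoff else "newer"

garage_cat = garage_yr.apply(garage_age_group)
print(garage_cat.tolist())  # → ['none', 'older', 'newer', 'none', 'newer', 'older']
```

The resulting categorical column can then be dummified like the other categorical variables.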

In the end, I chose 46 features, each with a correlation to sale price higher than 0.2.

In this project, I used Linear, Lasso, Ridge, ElasticNet, SGDRegressor, RandomForestRegressor, SVR, KernelRidge, and XGBoost models to find the best one. Grid search was used to obtain the best estimator for each model, and the resulting R2 score and RMSE (root mean square error) are shown in the table below.

Model                   RMSE     R2
LinearRegression        0.158    0.835
Lasso                   0.158    0.834
Ridge                   0.157    0.837
ElasticNet              0.157    0.836
SVR                     0.198    0.742
KernelRidge             0.158    0.834
SGDRegressor            2.3e12   -4.968
RandomForestRegressor   0.155    0.845
XGBoost                 0.150    0.856
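The grid-search procedure behind this table can be sketched as follows, using synthetic data and illustrative hyperparameter grids (only two of the nine models are shown; the rest follow the same pattern):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the engineered Ames features.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One parameter grid per model; these alpha grids are illustrative only.
candidates = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "lasso": (Lasso(), {"alpha": [0.001, 0.01, 0.1]}),
}

results = {}
for name, (model, grid) in candidates.items():
    # Cross-validated grid search picks the best estimator for each model.
    search = GridSearchCV(model, grid, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    pred = search.best_estimator_.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "rmse": np.sqrt(mean_squared_error(y_test, pred)),
    }
print(results)
```

Each model is then compared on the held-out test set, exactly as in the table above.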

As the results show, SGDRegressor didn't work well: its R2 score is negative and its RMSE is unreasonably high. The other models have decent R2 scores and low RMSE. Since XGBoost had the highest R2 and the lowest RMSE, I chose it for house price prediction, with the parameters shown in Figure 4.

Figure 4: XGBoost parameter values
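A final fit on the log-transformed target looks roughly like this. Since the xgboost package may not be installed everywhere, scikit-learn's GradientBoostingRegressor stands in below; xgboost.XGBRegressor exposes the same fit/predict interface and slots in unchanged. The hyperparameter values and synthetic data are illustrative, not the tuned values from Figure 4:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic features with a target rescaled to a log-price-like range (~11-14).
X, y_log = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=1)
y_log = (y_log - y_log.min()) / (y_log.max() - y_log.min()) * 3 + 11

X_train, X_test, y_train, y_test = train_test_split(X, y_log, random_state=1)

# Stand-in for the tuned XGBoost model; hyperparameters are illustrative.
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                  max_depth=3, random_state=1)
model.fit(X_train, y_train)

# The model predicts log prices, so invert the transform for dollar values.
dollar_pred = np.expm1(model.predict(X_test))
```

Remember that because the target was log-transformed up front, predictions must be mapped back with expm1 before they can be read as dollar amounts.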

To sum up, if you are a seller in Ames, you may want to focus on the quality of features such as the kitchen and basement to increase the price. Other features matter as well, but things like the total square footage of the living area or the number of full bathrooms are difficult to change without major renovations.

One possible way to improve the prediction results would be to try different features.

The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

About Author

Jungu Kang

Passionate about tackling challenging problems, certified as a Data Scientist, with an engineering background and project-management experience in the food industry. Detail-oriented, eclectic, industrious, easy-going.

