Kaggle Competition: House Price Prediction 2017

Posted on Jan 24, 2017
Contributed by Wann-Jiun Ma and Sharan Naribole. They are currently attending the NYC Data Science Academy Online Data Science Bootcamp program. This post is based on their fourth class project - Machine Learning. The two of us balanced the workload and finished the project (excluding writing this blog post) in two weeks, working part-time.

Introduction

We have learned many different machine learning algorithms, from supervised and unsupervised learning to reinforcement learning. Now it's time to use them to solve a real problem. We found this new and interesting competition on Kaggle. It is not a fancy competition: its goal is to predict house prices in Ames, Iowa using different features of houses collected in 2010. There are 79 explanatory features describing nearly every aspect of residential homes in Ames, Iowa. We found this competition friendly because the detailed explanatory features are fully provided to the participants, and it is a great opportunity to practice advanced machine learning techniques such as XGBoost and other ensemble and stacking approaches. Our strategy is to stand on the shoulders of giants, i.e., to build on the public feature engineering and machine learning models posted on Kaggle, as well as blog posts from other data scientists, and to write our own code to further improve the prediction score. At the time of the project submission, we are placed in the top 4% of more than 3,000 teams in this open Kaggle competition.

Exploratory Data Analysis

Let's first do some EDA to gain insights from our data. We start by plotting the distribution of the sale price (the target). The figure shows that only a few houses are worth more than $500,000.
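As a minimal sketch, assuming the Kaggle training file train.csv with the target column SalePrice, the distribution can be plotted like this:

```python
import matplotlib.pyplot as plt
import pandas as pd

train = pd.read_csv("train.csv")  # Kaggle training data; SalePrice is the target

train["SalePrice"].hist(bins=50)
plt.xlabel("Sale price ($)")
plt.ylabel("Number of houses")
plt.title("Distribution of sale price")
plt.show()
```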

The size of the living area may be an indicator of house price. The figure shows that only a few houses have more than 4,000 square feet of living area. This information may be used to filter out outliers.

Also, the linear feet of street connected to the property may be a useful feature. We group the houses by neighborhood and fill missing values with the median of each group. The figure shows that only a few outliers have a distance of less than 40 feet.

histo_streetIt may be useful to characterize the properties by the months in which they are sold. Figure shows that May to August are the hottest months in terms of number of sales.

We also found some numerical features that are highly correlated with the sale price and plotted the correlation matrix of these features. This information is useful for assessing the correlation between features: multicollinearity may make it more difficult to make inferences about the relationships between our independent and dependent variables.
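A sketch of how such a correlation matrix can be computed and visualized; the choice of the ten features most correlated with SalePrice is an assumption for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

train = pd.read_csv("train.csv")

# correlation of numeric features, ranked by absolute correlation with SalePrice
corr = train.select_dtypes("number").corr()
top = corr["SalePrice"].abs().sort_values(ascending=False).head(10).index

sns.heatmap(train[top].corr(), annot=True, fmt=".2f", cmap="viridis")
plt.show()
```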


That concludes the basic EDA of our house price data set. In the next section, we perform feature engineering to prepare our training and test sets for machine learning.

Feature Engineering

We treat numerical and categorical features separately. The numerical features of our data set do not directly lend themselves to a linear model: they violate some of the assumptions required for regression, such as linearity, constant variance, and normality. Therefore, we apply a log(x+1) transformation to our numerical features, where x is any numerical feature, to make them more normally distributed.
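A minimal sketch of this step, assuming the features live in a pandas DataFrame; the skewness threshold of 0.75 is an assumption for illustration, not necessarily the value we used:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

train = pd.read_csv("train.csv")

# numeric feature columns (excluding the target)
numeric_cols = train.select_dtypes("number").columns.drop("SalePrice")

# apply log(x + 1) only to noticeably skewed features
skewness = train[numeric_cols].apply(lambda s: skew(s.dropna()))
skewed_cols = skewness[skewness > 0.75].index
train[skewed_cols] = np.log1p(train[skewed_cols])
```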

Also, it is a good idea to scale our numerical features.
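For example, a standardization sketch with scikit-learn, continuing from the previous snippet and assuming a test DataFrame with the same columns (the scaler is fit on the training set only to avoid leakage):

```python
from sklearn.preprocessing import StandardScaler

# train / test are the feature DataFrames; numeric_cols as defined above
scaler = StandardScaler()
train[numeric_cols] = scaler.fit_transform(train[numeric_cols])
test[numeric_cols] = scaler.transform(test[numeric_cols])
```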

For categorical features, we perform several transformations as summarized below.

-- Fill NA using zero.

-- Group by neighborhood for linear feet of street connected to property and fill NA using the median of each group (neighborhood).

-- Transform "Yes" and "No" features such as having central air conditioning or not to one and zero, respectively.

-- Use "map" to transform quality measurements into ordinal numerical features (see the sketch after this list).

-- Perform one-hot encoding on nominal features.

-- Sharan's three strategies. [Add the strategies here]
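A short sketch of a few of these steps; the column names follow the Ames data dictionary, and the particular quality scale shown is our assumption for illustration:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# fill missing numeric values with zero
train["MasVnrArea"] = train["MasVnrArea"].fillna(0)

# fill missing lot frontage with the median of its neighborhood
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"] \
                            .transform(lambda s: s.fillna(s.median()))

# Yes/No feature to 1/0
train["CentralAir"] = train["CentralAir"].map({"Y": 1, "N": 0})

# quality ratings to ordinal numbers
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
train["ExterQual"] = train["ExterQual"].map(quality_map)

# one-hot encode the remaining nominal features
train = pd.get_dummies(train)
```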

We also generate several new features summarized below.

-- Generate several "Is..." or "Has..." features based on whether a property "is..." or "has..." something. For example, since most properties have standard circuit breakers, we create a column "Is_SBrkr" to flag properties that have them (see the sketch after this list).

-- Generate some aggregated quality measures to simplify the existing quality features. We aggregate those features into three broad classes, bad/average/good, and encode them to values 1/2/3, respectively.

-- Generate features related to time. For example, we generate a "New_House" column by considering if the house was built and sold in the same year.
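A sketch of two of these derived features, applied before one-hot encoding and using the standard Ames column names (Electrical, YearBuilt, YrSold):

```python
# indicator for standard circuit breakers
train["Is_SBrkr"] = (train["Electrical"] == "SBrkr").astype(int)

# flag houses that were built and sold in the same year
train["New_House"] = (train["YearBuilt"] == train["YrSold"]).astype(int)
```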

For dealing with outliers, we filter out the properties having a living area of more than 4,000 square feet above grade (ground).
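In code, assuming GrLivArea is the above-grade living area column as in the Ames data:

```python
# drop properties with above-grade living area over 4,000 square feet
train = train[train["GrLivArea"] <= 4000]
```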

We also consider a few other minor features not detailed here. The total number of features is 389, and we have 1,456 and 1,459 samples in the training and test sets, respectively. Now, let's do machine learning.

Ensemble Methods

We consider six machine learning models: XGBoost, Lasso, Ridge, Extra Trees, Random Forest, and GBM.

For each model, we perform a grid search with cross-validation to find its best parameters. For example, for Kernel Ridge:
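A sketch of the grid search for Kernel Ridge; the parameter grid shown here is illustrative rather than the exact grid we searched, and X_train, y_train are the engineered features and the sale price:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": [0.05, 0.1, 0.3, 1.0],
    "kernel": ["polynomial"],
    "degree": [2, 3],
    "coef0": [0.5, 1.0, 2.5],
}

grid = GridSearchCV(KernelRidge(), param_grid,
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print("CV RMSE:", np.sqrt(-grid.best_score_))
```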

We found that Random Forest, GBM, and Extra Trees suffer from serious overfitting.

Finally, we use an ensemble model consisting of Lasso, Ridge, and XGBoost with equal weights as our final model.
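The ensemble itself is simply an equal-weight average of the three models' predictions, along these lines (the fitted models and the test matrix are assumed to exist):

```python
# each model has already been tuned and fit on the training set
pred_lasso = lasso_model.predict(X_test)
pred_ridge = ridge_model.predict(X_test)
pred_xgb = xgb_model.predict(X_test)

# equal weights across the three models
final_pred = (pred_lasso + pred_ridge + pred_xgb) / 3.0
```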

Stacking

We consider out-of-fold stacking. At the first level, we use XGBoost, Random Forest, Lasso, and GBM as our models. At the second level, we use the out-of-fold predictions of the first-level models as new features and use XGBoost as the combiner to train the final model. We perform cross-validation for each model to find its best set of parameters.
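A minimal sketch of the out-of-fold procedure; the first-level models and the XGBoost combiner are as described above, while the helper function and variable names are our illustration (X_train, y_train, X_test are NumPy arrays of the engineered features):

```python
import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

def out_of_fold(model, X, y, X_test, n_splits=5):
    """Out-of-fold predictions on the training set, plus fold-averaged test predictions."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    oof_train = np.zeros(X.shape[0])
    oof_test = np.zeros(X_test.shape[0])
    for train_idx, valid_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof_train[valid_idx] = model.predict(X[valid_idx])
        oof_test += model.predict(X_test) / n_splits
    return oof_train, oof_test

# first level: generate out-of-fold predictions for each base model
level_one = [xgb_model, rf_model, lasso_model, gbm_model]
oof_pairs = [out_of_fold(m, X_train, y_train, X_test) for m in level_one]

train_meta = np.column_stack([p[0] for p in oof_pairs])
test_meta = np.column_stack([p[1] for p in oof_pairs])

# second level: XGBoost as the combiner on the stacked predictions
combiner = XGBRegressor()
combiner.fit(train_meta, y_train)
stacked_pred = combiner.predict(test_meta)
```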

Feature Selection

We consider feature importance provided by XGboost to select the important features.

The feature importance plot shows that importance decreases roughly exponentially, so we should consider training our model using only the most important features.

We use a loop to see how the score varies with the number of features included in the training set, and set a threshold to decide which features to drop from the data set.
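A sketch of that loop, assuming xgb_model is an already-fitted XGBRegressor, feature_names lists the columns of the training DataFrame X_train_df, and y_train is the target; the step size of 50 features is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score

# rank features by XGBoost importance
importances = pd.Series(xgb_model.feature_importances_,
                        index=feature_names).sort_values(ascending=False)

# keep the top k features and watch how the cross-validated RMSE changes
for k in range(50, len(importances) + 1, 50):
    cols = importances.index[:k]
    mse = -cross_val_score(xgb_model, X_train_df[cols], y_train,
                           scoring="neg_mean_squared_error", cv=5).mean()
    print(k, np.sqrt(mse))
```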

Conclusions

In two weeks (two people, part-time), we completed EDA, feature engineering, ensembling, stacking, and feature selection. We observed a huge jump from the score without feature engineering to the score with feature engineering. The second jump came from ensembling. Out-of-fold stacking didn't improve the score much, perhaps because the base models are already statistically equivalent. Since the data set is very small, to improve the prediction score we could consider different feature engineering approaches, such as using different distributions to create new features or using feature interactions to generate new features automatically.

About Authors

Wann-Jiun Ma

Wann-Jiun Ma (PhD Electrical Engineering) is a Postdoctoral Associate at Duke University. His research is focused on mathematical modeling, algorithm design, and software/experiment implementation for large-scale systems such as wireless sensor networks and energy analytics. After having exposed...

Sharan Naribole

Sharan Naribole is a PhD Candidate in Electrical & Computer Engineering department at Rice University supported by Texas Instruments Fellowship. Sharan's research focuses on next-generation Wireless Networks protocol design and experimentation. Sharan has undertaken data science internship at...
