House Prices: Advanced Regression Techniques (Kaggle)

Samuel Odulaja
Posted on Feb 1, 2020

Intro

In this post, I walk through my approach to the Kaggle competition House Prices: Advanced Regression Techniques. The goal of the competition is to predict the sale price of houses in Ames, Iowa, given 79 explanatory variables, which are described in the competition's data description.

The full code and data for this article are available on my GitHub.

Read Data

Concatenate training & test features

I concatenated the training and test features so that I wouldn't have to impute missing values, transform features, etc. separately for each set. I also removed houses with a ground living area greater than 4,500 square feet from the training set.
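As a rough sketch of this step, the snippet below uses two tiny stand-in frames in place of Kaggle's train.csv and test.csv. The variable names (train, test, features, n_train) are illustrative assumptions; GrLivArea and SalePrice are the dataset's own column names.

```python
import pandas as pd

# Tiny stand-in frames; the real data comes from the competition's CSV files.
train = pd.DataFrame({
    "Id": [1, 2, 3],
    "GrLivArea": [1710, 5642, 1262],
    "SalePrice": [208500, 160000, 181500],
})
test = pd.DataFrame({"Id": [4, 5], "GrLivArea": [896, 4676]})

# Drop very large houses from the training set only; test rows are kept,
# because every test house still needs a prediction.
train = train[train["GrLivArea"] <= 4500].reset_index(drop=True)

# Set the target aside, then stack the feature columns of both sets.
y_train = train["SalePrice"]
features = pd.concat(
    [train.drop(columns="SalePrice"), test], ignore_index=True
)

# Remember where the training rows end so the frame can be split again later.
n_train = len(train)
```

After preprocessing, features.iloc[:n_train] gives back the training rows and features.iloc[n_train:] the test rows.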

SalePrice Distribution


Impute missing values

The plot below shows the number of missing values in columns with at least one missing value. 
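The counts behind such a plot can be obtained with a couple of pandas calls. This is a minimal sketch on made-up rows, assuming the combined feature frame is called features (a hypothetical name); it does not show how the missing values were then imputed.

```python
import numpy as np
import pandas as pd

# Made-up rows using real Ames column names.
features = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# Count missing values per column and keep only columns that have any.
missing = features.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# missing.plot.bar() would draw the bar chart (requires matplotlib).
```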

Engineer features

I created three new features for the dataset: TotalSF, TotalPorchSF, and TotalBath. The code is shown below:
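One plausible way to build these three features is sketched here. The input columns are real Ames column names, but the formulas themselves are assumptions (in particular, counting a half bath as 0.5 and which porch columns are summed); the definitions used in the actual notebook may differ.

```python
import pandas as pd

# Two made-up houses with the Ames columns the sketch relies on.
features = pd.DataFrame({
    "TotalBsmtSF": [856, 1262],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
    "OpenPorchSF": [61, 0],
    "EnclosedPorch": [0, 0],
    "3SsnPorch": [0, 0],
    "ScreenPorch": [0, 0],
    "FullBath": [2, 2],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
})

# Assumed definition: basement plus both above-ground floors.
features["TotalSF"] = (
    features["TotalBsmtSF"] + features["1stFlrSF"] + features["2ndFlrSF"]
)

# Assumed definition: sum of the four porch-area columns.
porch_cols = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]
features["TotalPorchSF"] = features[porch_cols].sum(axis=1)

# Assumed definition: full baths count 1, half baths count 0.5.
features["TotalBath"] = (
    features["FullBath"]
    + features["BsmtFullBath"]
    + 0.5 * (features["HalfBath"] + features["BsmtHalfBath"])
)
```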

Categorize MSSubClass and YrSold

Looking at the description below, the levels of MSSubClass don't seem to have a natural order, so I decided to represent MSSubClass as a categorical feature rather than a numerical one. I also decided to represent YrSold as a categorical feature because that allows for a more flexible relationship with SalePrice.
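A simple way to do this in pandas is to cast the two columns to strings, so that later encoding steps treat each value as a label instead of a magnitude. This is a sketch on made-up rows; casting to str is an assumption, and a pandas category dtype would work as well.

```python
import pandas as pd

features = pd.DataFrame({
    "MSSubClass": [60, 20, 70],
    "YrSold": [2008, 2007, 2006],
})

# Turn the numeric codes into labels so they are no longer treated as numbers.
for col in ["MSSubClass", "YrSold"]:
    features[col] = features[col].astype(str)
```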

Transform features

To better highlight the recurring patterns in SalePrice, I transformed the MoSold feature using the code below:
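The exact transformation is not reproduced in this text. A common choice for a month-of-year feature is a cyclical sine/cosine encoding, sketched here as an assumption; the new column names MoSold_sin and MoSold_cos are hypothetical.

```python
import numpy as np
import pandas as pd

features = pd.DataFrame({"MoSold": [1, 6, 12]})

# Map months onto a circle so that December and January end up close together.
angle = 2 * np.pi * features["MoSold"] / 12
features["MoSold_sin"] = np.sin(angle)
features["MoSold_cos"] = np.cos(angle)
```

With this encoding, the distance between January and December is smaller than the distance between January and June, which a raw 1 to 12 scale cannot express.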

I also transformed the highly skewed features using the code below:
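A typical version of this step is sketched below. Both the skewness threshold of 0.5 and the use of log1p are assumptions for illustration; a Box-Cox transform is another frequent choice for this dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows: LotArea has one extreme value, OverallQual is symmetric.
features = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 215245, 14115],
    "OverallQual": [5, 6, 7, 7, 8, 9],
})

# Measure sample skewness of every numeric column.
numeric_cols = features.select_dtypes(include="number").columns
skewness = features[numeric_cols].skew()

# Assumed rule: transform columns whose absolute skewness exceeds 0.5.
skewed_cols = skewness[skewness.abs() > 0.5].index

for col in skewed_cols:
    # log1p(x) = log(1 + x), which stays defined when x is zero.
    features[col] = np.log1p(features[col])
```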

Finally, I used pd.get_dummies to convert all categorical features into dummy variables.
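For illustration, here is what pd.get_dummies does on a small made-up frame: numeric columns pass through unchanged, and each categorical column is replaced by one indicator column per level.

```python
import pandas as pd

features = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "GrLivArea": [1710, 1262, 1786],
})

# String columns are expanded into indicator columns; numeric ones are kept.
features = pd.get_dummies(features)
```

Because the training and test features were encoded together, both sets end up with the same dummy columns.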

Removing outliers from training data

I fitted a linear model to the training data and removed examples with a studentized residual greater than 3. 
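The idea can be sketched with NumPy alone on synthetic data. This sketch assumes internally studentized residuals and a cutoff applied to their absolute value; the post does not say which variant was used, and a library such as statsmodels could compute the same quantity.

```python
import numpy as np

# Synthetic data with one planted outlier at index 0.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.1, size=n)
y[0] += 5.0

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage h_ii: the diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
leverage = np.einsum("ij,ji->i", X, np.linalg.pinv(X))

# Internally studentized residual: e_i / (s * sqrt(1 - h_ii)).
dof = n - X.shape[1]
s = np.sqrt(resid @ resid / dof)
studentized = resid / (s * np.sqrt(1.0 - leverage))

# Keep only the rows whose studentized residual is within the cutoff.
keep = np.abs(studentized) <= 3
X_clean, y_clean = X[keep], y[keep]
```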

Define random search

The code below defines random search as a function, which I used to optimize the hyperparameters of each model. I scored each iteration with 5-fold cross-validation.
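A function along these lines can be built on scikit-learn's RandomizedSearchCV. The function name random_search, its parameters, the RMSE scoring, and the Ridge search space are all assumptions made for this sketch, which runs on synthetic data.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV


def random_search(model, param_distributions, X, y, n_iter=20, seed=0):
    """Sample n_iter hyperparameter settings and score each with 5-fold CV."""
    search = RandomizedSearchCV(
        model,
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=5,
        scoring="neg_root_mean_squared_error",  # assumed metric
        random_state=seed,
    )
    search.fit(X, y)
    # scikit-learn maximizes scores, so flip the sign to recover the RMSE.
    return search.best_estimator_, -search.best_score_


# Synthetic regression problem standing in for the housing data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

best_model, best_rmse = random_search(
    Ridge(), {"alpha": loguniform(1e-3, 1e2)}, X, y, n_iter=10
)
```

The same function can be reused for the other models by passing a different estimator and search space.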

Model Scores

Overall, the models did well, with Gradient Boosting performing the best. The score for each model is listed below (lower is better):

  • Ridge: 0.0778
  • Lasso: 0.0796
  • SVR: 0.0712
  • LGBM: 0.0640
  • GBM: 0.0436

Creating Predictions and RMSE

I stored the predictions of the base learners and the stacked ensemble in a list. Then I took a weighted average of the predictions, giving a weight of 0.13 to each base learner and 0.35 to the stacked ensemble. My RMSE score was 0.3848232.
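The blending step can be sketched as follows. With five base learners, the weights add up to one (5 × 0.13 + 0.35 = 1.00). The prediction values, the true values, and the rmse helper are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


# Hypothetical predictions for two houses: five base learners, then the stack.
predictions = [
    np.array([12.0, 11.0]),  # Ridge
    np.array([12.0, 11.0]),  # Lasso
    np.array([12.0, 11.0]),  # SVR
    np.array([12.0, 11.0]),  # LGBM
    np.array([12.0, 11.0]),  # GBM
    np.array([13.0, 11.0]),  # stacked ensemble
]

# 0.13 for each base learner and 0.35 for the stacked ensemble.
weights = np.array([0.13] * 5 + [0.35])

# Weighted average across models, one blended value per house.
blended = weights @ np.vstack(predictions)

# Score against made-up true values.
y_true = np.array([12.35, 12.0])
score = rmse(y_true, blended)
```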

Conclusion

Overall, the models seemed to perform well, but the final RMSE seemed a little high, which might be due to an error in the code. In the future, I will see what I can do to improve the RMSE.
