A Home Pricing Model for Educated Sellers

Posted on Feb 1, 2022


Real estate marketplace pricing models are notoriously erratic. As Sam Chandan, Dean of NYU’s Schack Institute of Real Estate, observed in the WSJ article “What Went Wrong With Zillow? A Real-Estate Algorithm Derailed Its Big Bet,” a number of factors can affect house prices. One specific example Chandan mentioned was the difficulty of pricing the layout of a house.

The complete aesthetic of a single-family home cannot be captured by a simple data point. In Kris Frieswick’s WSJ article “Zillow’s Zestimate Is the Algorithm We Love to Hate. Why Can’t We Quit It?” the algorithm is described as finicky, oscillating so much that people have even sued Zillow on the grounds that the company misrepresented the prices of their homes. Even though Zillow’s Zestimate or Redfin’s pricing tool may be used by hundreds of millions of people shopping for homes across the country, no pricing model can be 100% accurate, and the oscillations are evidence of those inaccuracies.


Instead of adjusting the price every day or whenever a new feature becomes available, the aim of this project is to set a baseline price that does not fluctuate often or drastically. A stable baseline price will ultimately allow sellers to be more confident about how much their home is worth. The fluctuations of other pricing models can then be read as confidence intervals around that baseline: increases show the potential return on investment, while decreases gauge the current condition of the market. The eventual aim is to use selective feature engineering to expand this pricing model to the rest of the country.



The dataset - containing 81 features (including ID and sale price) and 1,460 rows - and the validation set - containing 80 features (including ID) and 1,459 rows - were downloaded from Kaggle. In the initial analysis, two rows (bottom right of Graph 1) were deemed outliers: based on price per square foot, those two homes sold for far less than expected, so they were removed.

Graph 1: House Prices in Ames, IA
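The outlier removal can be sketched with pandas. The post identified the two points visually from Graph 1, so the cutoffs below are illustrative assumptions, not the post's actual rule; the column names `GrLivArea` and `SalePrice` come from the Kaggle dataset, and the rows here are toy stand-ins:

```python
import pandas as pd

# Toy rows standing in for the Kaggle train set. The two outliers have very
# large living areas (GrLivArea) but unusually low sale prices.
df = pd.DataFrame({
    "GrLivArea": [1500, 1800, 4676, 5642],
    "SalePrice": [180000, 210000, 184750, 160000],
})

# Illustrative rule (assumed cutoffs): flag huge homes that sold cheaply.
outliers = df[(df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000)]
clean = df.drop(outliers.index)
print(len(clean))  # 2 rows remain
```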


The price distribution for homes sold from 2006 to 2010 (Graph 2) shows that, although most homes were priced around $150K, the overall average price was pulled higher by a long right tail of expensive homes. To reduce this skew, the logarithm of the prices was taken to create a more even distribution (Graph 3), which allows for more accurate predictions on the validation set.

Graph 2: House Price Distribution
Graph 3: House Price Distribution Adjusted Using Logarithm
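The log transform itself is a one-liner. A minimal sketch with numpy and scipy, using simulated right-skewed prices as a stand-in for the real `SalePrice` label:

```python
import numpy as np
from scipy.stats import skew

# Simulated right-skewed sale prices standing in for the SalePrice column.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=0.4, size=1458)

log_prices = np.log(prices)  # taking the logarithm evens out the distribution

print(round(skew(prices), 2), round(skew(log_prices), 2))
```

The log-transformed distribution is far closer to symmetric, which is what makes the linear models behave better downstream.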


Some simple exploratory analysis was done to determine whether any features had an obvious impact on home prices. Graph 4 below is one example. Based on the trend in the boxplot, we can expect the median price of a home to be at least somewhat affected by the neighborhood in which it is located.

Graph 4: How Area Affects Prices
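A boxplot like Graph 4 can be reproduced with pandas' built-in plotting. A sketch assuming the Kaggle columns `Neighborhood` and `SalePrice`, with toy rows:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "CollgCr", "CollgCr", "NoRidge", "NoRidge"],
    "SalePrice": [140000, 150000, 195000, 210000, 310000, 335000],
})

# One box per neighborhood makes the median-price differences visible.
ax = df.boxplot(column="SalePrice", by="Neighborhood")
ax.figure.savefig("neighborhood_prices.png")
```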


The null values were then imputed. A function (named my_fillna) was written to fill nulls with the median for numerical features or the mode for categorical features. Some imputations took into account that certain columns may depend on another column; one such example is Masonry Veneer Type (MasVnrType), which may depend on Neighborhood.

With some MasVnrType values missing, my_fillna was used to impute them using the most common MasVnrType found in each neighborhood (Table 1). This dependence of veneer type on neighborhood can be seen where new cookie-cutter neighborhoods are constructed. The features were then dummified and scaled.

Table 1: Pre- and Post-Imputation Using the Function my_fillna
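The original my_fillna is not shown in the post, so the version below is a hypothetical reconstruction of the behavior described: median for numeric columns, mode for categorical ones, optionally computed within a grouping column such as Neighborhood:

```python
import pandas as pd

def my_fillna(df, col, group_col=None):
    """Hypothetical reconstruction: impute nulls with the median (numeric)
    or mode (categorical); if group_col is given, impute within each group."""
    numeric = pd.api.types.is_numeric_dtype(df[col])
    if group_col is not None:
        fill = (lambda s: s.fillna(s.median())) if numeric \
            else (lambda s: s.fillna(s.mode().iloc[0]))
        df[col] = df.groupby(group_col)[col].transform(fill)
    else:
        df[col] = df[col].fillna(df[col].median() if numeric
                                 else df[col].mode().iloc[0])
    return df

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "CollgCr", "CollgCr"],
    "MasVnrType": ["BrkFace", None, "None", "None"],
})
df = my_fillna(df, "MasVnrType", group_col="Neighborhood")
print(df["MasVnrType"].tolist())  # ['BrkFace', 'BrkFace', 'None', 'None']
```

Note that the missing value in NAmes is filled with that neighborhood's most common veneer type, exactly the Table 1 behavior.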


After null values for each feature were imputed, model training was performed using multilinear regression, Lasso regression, ElasticNet regression, random forest regression, gradient boosting regression, and XGBoost regression to establish a baseline Root Mean Squared Error (RMSE). Graph 5 shows an example from the first round of training using ElasticNet: the blue dots compare predicted vs. actual log price on the train portion of the train-test split, and the red dots do the same for the test portion.

Graph 5: Actual vs Predicted Log Prices of Train and Test Splits Using ElasticNet Regression
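The baseline loop can be sketched with scikit-learn (XGBoost is left out here to keep the sketch dependency-free, and synthetic data stands in for the prepared Ames features):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Synthetic stand-in for the engineered features and log prices.
X, y = make_regression(n_samples=300, n_features=20, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "MLR": LinearRegression(),
    "Lasso": Lasso(alpha=0.01),
    "ElasticNet": ElasticNet(alpha=0.01),
    "RandomForest": RandomForestRegressor(random_state=0),
    "GBM": GradientBoostingRegressor(random_state=0),
}
rmses = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmses[name] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: test RMSE = {rmses[name]:.2f}")
```

The same fit/predict/score loop was repeated after each round of feature engineering to see whether the baseline RMSEs moved.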


After the initial model training, any features that could be ordinal were converted to ordered numeric values. Some string values were rewritten for consistency across features that used the same values. The features were then dummified and scaled.
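As a concrete sketch of the ordinal conversion: the Ames quality codes (Ex, Gd, TA, Fa, Po) share one scale across several features, so they can be mapped to ordered integers. The column names below are from the Kaggle dataset, though the mapping shown is one reasonable choice rather than the post's exact encoding:

```python
import pandas as pd

# Ames quality codes share one ordered scale: Ex > Gd > TA > Fa > Po.
quality_map = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}

df = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex"],
                   "KitchenQual": ["TA", "TA", "Gd"]})
for col in ["ExterQual", "KitchenQual"]:
    df[col] = df[col].map(quality_map)
print(df.values.tolist())  # [[4, 3], [3, 3], [5, 4]]
```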

Model training was performed again using the 6 previously mentioned regression models to check whether any engineered features caused a significant increase in RMSE, which would argue for removing them. RMSE improvements were noted on the test set for multilinear regression, gradient boosting regression, and XGBoost regression.


In order to find the best feature engineering for the dataset, a somewhat unusual method was used. First, 9 functions were written, each applying some kind of feature engineering to the dataset, and collected in a list. 9 more functions were then written to remove the features used by each of the first 9 functions (where such features exist) and collected in a second list.

A powerset (the list of all possible combinations of a list's contents) was created for each of the aforementioned lists. The permutation of the resulting 2 lists was performed to create a list of 92,378 combinations of functions. Since these functions apply actual feature engineering, each combination was applied to a copy of the dataset in a loop.


To clarify the permutation of powersets, there is a simplified explanation in Figure 1. In this example, ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, and ‘f’ are functions: list 1 contains a, b, and c, and list 2 contains d, e, and f. A new list was created from the first list containing all of its possible combinations (the powerset), and the same was done for the second list.

Then a list was created from the permutation of the two powersets, which allowed the functions to be paired and executed without constant manual input. The full list was used to engineer features on copies of the dataset, and the RMSEs for the 6 models were calculated. The outcome pointed towards improved scores across the models.

Figure 1: Permutation of Powersets
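The mechanics of Figure 1 can be sketched with itertools. Note that a raw pairing of two 9-function powersets would yield 512 × 512 combinations; since the post reports 92,378, the actual pairing rule presumably filtered or combined differently, so the sketch below only illustrates the powerset-and-pairing idea using the 3+3 toy example:

```python
from itertools import chain, combinations, product

def powerset(items):
    """All possible combinations of a list, including the empty set."""
    return list(chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)))

list1 = ["a", "b", "c"]   # stand-ins for feature-adding functions
list2 = ["d", "e", "f"]   # stand-ins for feature-removing functions

# Pair every subset of list1 with every subset of list2.
pairs = list(product(powerset(list1), powerset(list2)))
print(len(pairs))  # 8 * 8 = 64 pairs for the 3+3 toy example
```

Each pair of subsets would then be applied, function by function, to a fresh copy of the dataset before scoring.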


The models were tuned after the best-engineered datasets were selected. The WSJ Prime Lending Rate and the season of the year were also added as features during tuning, but only gradient boosting's RMSE improved with those features.


Table 2 shows the RMSEs for each model after imputation, after the permutation of powersets, and after tuning, along with the Kaggle score. One interesting observation: Graph 5 above shows that ElasticNet's initial train and test RMSEs were the closest of any model - 0.1025 and 0.1029, respectively - making it one of the least overfit models.

Table 2: Table of Scores

Another observation was that despite Gradient Boosting having a worse post-tuning RMSE than Lasso regression, it ended up with the best Kaggle score. We can see in Graphs 6a and 6b that the Kaggle score does not quite track scikit-learn's RMSE. We can also see the difference in scores across models: the RMSEs for the tree-based regressions are ever so slightly lower than those of the linear models, and the Kaggle scores reflect this trend.

Although we were not able to obtain a perfect Kaggle Score, we will most likely be able to use the information gathered to improve our scores in the future using ElasticNet, Lasso, GBM, and XGBoost. 

Graph 6a: Line Graph of Scores
Graph 6b: Bar Graph of Scores Grouped by Model


In the future, we will improve the pricing model's scores and reduce the RMSE. We could use scikit-learn's KNNImputer to impute null values rather than the complicated custom function written here. We might also improve scores by examining the skew of each feature rather than just the skew of the label. Combining the predictions from several models by averaging them may also help.
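The KNNImputer idea can be sketched with scikit-learn: each missing value is filled with the mean of that feature across the nearest rows, with distances computed on the features both rows actually have:

```python
import numpy as np
from sklearn.impute import KNNImputer

# A tiny numeric matrix with missing values (NaN).
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# Each NaN is replaced by the mean of the 2 nearest rows' observed values.
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)
```

Unlike the median/mode approach, this fills each gap from similar rows rather than a single column-wide statistic.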

We will also look at adjusting prices based on inflation but still provide the baseline price as a comparison so that sellers will know the comparative value of their homes. We will gather data from other housing markets to determine which features actually impact the prices of homes in those areas and use that information to create pricing models similar to this one.

The skills I demoed here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

About Author


Theodore is a jack of many trades and an expert in overthinking. He has worked in healthcare, healthcare administration, and finance and has experience in medical research. Having volunteered with medical missions abroad, managed building a new primary...