Creating an End-to-End Machine Learning Pipeline for a Nation-wide Homebuilder

, and
Posted on Jan 11, 2022

The skills the authors demonstrated here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Background Data

A nation-wide home construction company in operation for over 50 years approached our data science team to collaborate in creating a pricing analytics tool. Their research and sales teams needed a way to benchmark the prices of homes they build to the overall market.

Our team’s assignment was to help create an end-to-end machine learning pipeline that incorporated the following three features: 

(1) data extraction from public multiple listing services (MLS)

(2) data cleaning and storage on Azure cloud

(3) Machine learning predictions on both the price of houses and the desirability of living in certain geographic areas 

The final deliverable was an easy-to-use interactive application for the stakeholder's teams. A public version of the app is located here.

Data Extraction

The client gave us a list of 7000 target ZIP codes to cover the range of the eight states in which it operates. We obtained data on houses in those ZIP codes from public multiple listing services (MLS) sites such as Zillow, Redfin, Trulia, etc. by writing scraping bots that automatically downloaded csvs. Individual house data included features such as listing price, bedrooms, bathrooms, square footage, etc.

To further enrich data and to be able to create a desirability score for geographical areas, we obtained the following Metro Area or ZIP code level data:

  • School zone ratings scraped from
  • population and other demographic data
  • Personal income data from U.S. Bureau of Economic Analysis API
  • Economic and Home Price Index data from the Federal Reserve (FRED) API
  • Zillow House Value Index

Exploratory Data Analysis

For the exploratory data analysis and mapping portions of our project, we worked with two underlying Redfin datasets* — both scraped via Selenium — one with current active listings (left) and the other with historical transactions (right, from last 3 years)

Due to state specific guidance, some states’ (ex: Florida) historical sale data is not available or ready to download by csv on Redfin.

Visualizing Active Listings

Distribution of active listings on Redfin by State and Property Type as of Dec 2021

Visualizing Relevant School Districts

In order to make the most meaning out of available information regarding school districts, our team synthesized two sources of data to create a series of mapping visualizations: 

  • School District Boundaries data in JSON form from NCES
  • Information at the individual school level on enrollment, rating, % of low income population as well as the parent school district that each school belongs to — originally scraped from for the 7000 zip codes of interest

Although the school district names were not a one-to-one match between the two datasets, we were able to join the two sets of information through some data processing and cleaning. That allowed us to visualize patterns of enrollment and socioeconomic status grouped by the school districts recognized by the NCES.

Combining School District Data with existing Redfin Data

With the insights that datasets (1) and (2) generated, next, the team looked into layering our collected data from Redfin on historical and active listings to visualize the relationship or any patterns between the quality of school district and the listing price.

Creating the Desirability Score

In order to create a desirability score and understand the characteristics of different geographical areas, we decided to use K-Means clustering using zip code level data. K-Means would allow us to group by similar variables and analyze how the different clusters are unique from one another. In order to successfully create these clusters, we used data from the sources mentioned above and tested many different combinations. Ultimately, the best combination for creating clusters was using population data, median household income data and average house value per zip code. 

After scaling these features using MinMaxScaler, we created four clusters using K-Means. Below is a scatterplot of the results with the x-axis representing population, the y-axis representing income per household and the size representing average house value. 

In addition, we can easily understand the characteristics of each ZIP code with the following chart: 

After analyzing the two above charts, we ranked the desirability of each area by the ZIP codes. Below is a representation of how we interpreted each cluster. This interpretation allowed us to better understand and rank the desirability of each ZIP code. 


The above analysis and interpretation allows end users at the stakeholder company to better understand the areas in which they are pricing homes. In addition, further analysis can allow for these clusters to aid in price optimization and opportunity identification. The sales team can determine the price elasticity of each group based on the above characteristics and even use the clusters to identify where they can build their next homes to maximize profit. 

The Affordability Index Data

The client asked us to develop an app feature that acted as an affordability index. The home prices and home prices scraped from Redfin made up the data used in this index. The end goal was to deliver an index that showed the stakeholder where their houses were too expensive. 

The crux of this index was calculating the difference between Redfin house prices and the client’s house prices. If Redfin house prices were higher than the client’s house prices, we would consider that area affordable, and vice versa. Rather than directly comparing house prices, we chose to compare the premium of a house, which took depreciation into account. The client provided us with a premium guide for Redfin houses.

After getting the stakeholder and Redfin premium for a specified area, we would output an affordability ranking. If the stakeholder’s premium was higher than the Redfin premium the app would output the area as unaffordable for the stakeholder’s developments

Building the Machine Learning Data Model

The final piece of the interactive dashboard was a machine learning model used by the sales team to price new building projects, trained on competitor MLS pricing information as well as the ZIP code level data on neighborhood desirability. We started with a base model including only basic house features (SF, beds, baths, etc) and ZIP code, which will be used as a benchmark for all other models. 

As the ZIP code feature is categorical, and there were thousands of codes in the dataset, the train-test splitting had to be manually coded such that each set of data points per ZIP code was split (test size = 0.25).Results using Linear Regression, Random Forest, and Gradient Boosting (Catboost) are summarized in table 1. Since train-test splitting needed to be manually coded, k-fold cross-validation was also manually performed by running the model multiple times after randomizing the split. The scores below are the average R2 values of all iterations.

Base Model Scores Linear Regression Random Forest Gradient Boosting
Train R2  0.758 0.832 0.816
Test R2 0.664 0.757 0.791
Table 1: Base Model Train and Test scores

Gradient Boosting (GB) resulted in the best score as well as the smallest difference between train and test scores. Random Forest (RF) had the best train score but a lower test score than GB which is a sign of overfitting (Note: As train-test splitting was a manual process, RF hyper-parameter tuning was also performed manually. Since RF scores did not come close to beating the GB model scores, and Catboost’s GB model does not require parameter tuning, going forward we no longer model using Random Forest). 

Next, we created a ‘full’ model which included all house and ZIP code level data. The feature importances based on the GB model are shown in the following chart:

We ran the same model multiple times, starting with only the single most important feature and iteratively adding features one by one. Chart 2 below graphs the test scores for each model iteration and shows that after 10 or so features there are marginal returns to R2 for any additional feature.


In order to simplify the model and overcome issues with multicollinearity of features, we consolidated the Zip code level information using k-means clustering and principal component analysis (PCA) and added the resulting data on top of the base model features (model scores shown in table 2).

Model Details Linear Model Test Score Gradient Boost Test Score
Base + Cluster Clustered on all ZIP code Level Data 0.718 0.793
Base + Cluster 4 Clusters (Pop, Econ, Age, FRED) 0.674 0.827
Base + 7 PCAs 4 PCA sets (Pop, Econ, Age, FRED) 0.709 0.850
Base + 5 PCAs 5 PCAs for All ZIP Level Data 0.733 0.839
Table 2: Cluster and PCA model test scores

The selected model was the GB model using base features plus 5 principal components as it was the simplest model but had a test score just as high as the full model and a mean absolute percentage error of roughly 22%. We then performed another model test on the stakeholder’s newest projects (about 400 houses). Chart 3 below depicts the distributions of actual vs. predicted prices on the left, as well as the density plot of prediction errors on the right. The model tended to overprice the stakeholder’s homes and had an average absolute error of +$90,000.

We believe this is due to the stakeholder standing by its mission statement of providing “affordable homes” to the communities in which it operates. Furthermore, the training dataset used scraped MLS listings containing expensive condominiums or luxury homes, which skewed the model. The stakeholder was made aware of the discrepancy but advised us to retain all the original data.



Overall, our work for the stakeholder proved to add value for the sales and pricing team by providing insights and accurate benchmarking. The tools we created will help a variety of teams in the following ways: 

  • The price recommendation model will provide a price range based on user input, similar houses in the dataset, and geographical areas for the pricing team to use as a reference point.
  • The affordability index allows the sales team to understand different geographical areas and the cost of living in each area. 
  • The desirability index will allow the pricing and sales teams to understand where and why people want to live where they do. This information will help guide pricing decisions based on the price elasticity in each cluster. 
  • The data visualization and school district breakdown allows for competitive analysis and opportunity identification from the sales team.

The work done by our group will help the stakeholder in the above ways and can be expanded upon for further work. It is also an example of how data science can be used in the real estate industry to help optimize prices, understand the characteristics of different geographical areas, and identify opportunities. 

About Authors

Daniel Nie

Data Scientist with background experience in both Healthcare Administration and Finance. A versatile thinker that enjoys deep data exploration and generating business value with machine learning in both Python and R
View all posts by Daniel Nie >

Tony Pennoyer

Data Scientist who brings a unique creative and analytical approach to the table. Interested in data story telling and creating actionable insights through data analysis.
View all posts by Tony Pennoyer >

Jack Copeland

After graduating from the University of Virginia in 2019 with a degree in Computer Science, I went on to join Anheuser-Busch as a Global Management Trainee. I received cross functional training in sales, marketing, supply and more before...
View all posts by Jack Copeland >

Related Articles

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI