NYC Data Science Academy| Blog

NYC Real Estate - A Data Webscraping Project

Sabbir Mohammed
Posted on Mar 7, 2019

This web scraping project focuses on more than 13,000 completed residential real estate sales in New York City, as listed on the Trulia website from May 2018 to January 2019. The website was scraped using a Python Scrapy "spider".

The code for this project can be found here.

Project Motivation


Real estate across the United States is a popular topic of discussion. My hometown of New York regularly makes headlines for its rising real estate prices, which rank among the highest in the world. There is plenty of discussion in the media about the causes and effects. Data reports state that "New York ranks No. 1 in losing residents to other states" with an estimated drop of 186,000 residents between 2015 and 2016.

However, for someone who is actually a long-time New York City resident and a first-generation immigrant in the country, the American dream of owning a home feels like an ever-increasing impossibility. This is a difficult reality underneath all of the numbers and statistics. That is where I got the idea for this project.

Aside from the potential causes and effects of the current state of the real estate market, I was curious about existing trends. While comprehensive and thorough data on New York housing is available, I was interested in analyzing the data held and displayed by the real estate listing search tools we use, and perhaps in gleaning trends beyond what is readily shown on these sites. I decided to scrape trulia.com, and more specifically its "sold" section for New York City.

Project Data Source


I used the Python Scrapy framework to build what is called a "spider". A Scrapy spider "crawls" a website and scrapes the desired data from the URLs and pages outlined within the program. The source code for my project is available here.

The starting point for my spider is the page shown below within the Trulia site, as described earlier. It is the "recently sold" section for New York City. I had my spider re-initialize for each of the five boroughs of New York City and scrape the data from all of the result pages.


Screenshot of Trulia website

 

The spider would pull the following information for each listing being displayed on the result page (as marked with red boxes above) - there are 30 listings on each Trulia result page:

  • Sale Price (in USD)
  • Sale Date
  • Area (in sqft.)
  • Street Address
  • City
  • Postal Code

The city and zip code were actually present in the page's HTML source.
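Scraped values arrive as display text, so they have to be converted to numbers before any analysis. The following is a minimal sketch of that step, not the project's actual code: the raw string formats (`"$705,000"`, `"1,520 sqft"`), the `Listing` field names, and the example values are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listing:
    """One scraped result (field names are illustrative, not the project's)."""
    sale_price: Optional[float]  # USD; None when the site shows no price
    sale_date: str
    area_sqft: Optional[float]
    street_address: str
    city: str
    postal_code: str


def to_number(text: Optional[str]) -> Optional[float]:
    """Extract the first number from text such as '$705,000' or '1,520 sqft'.

    Returns None when the text is missing or contains no digits.
    """
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


# Made-up raw strings, shaped the way a result card might display them.
example = Listing(
    sale_price=to_number("$705,000"),
    sale_date="Jan 15, 2019",
    area_sqft=to_number("1,520 sqft"),
    street_address="123 Example St",
    city="Brooklyn",
    postal_code="11201",
)
print(example.sale_price, example.area_sqft)  # 705000.0 1520.0
```

Returning `None` instead of raising keeps listings with missing prices in the data set, so they can be counted and handled explicitly during cleaning.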

The original plan was to also visit each listing page from the results page and scrape some additional data for analysis such as:

  • Property Type
  • Property Age

However, the spider was terminated after it responded to the site's requests. As such, I proceeded with fewer variables than desired. Once the data was scraped, I added two new variables:

  • Borough
  • Price per Area (in USD/ sqft)

The borough variable was generated based on the city variable and the price per area is simply the final sale price divided by the total property size of the listing.
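As a rough sketch of how these two derived variables could be computed: the city-to-borough lookup below is a hypothetical, incomplete mapping (the post does not show the actual city labels in the scraped data), and the function names are illustrative.

```python
from typing import Optional

# Hypothetical, partial lookup from a listing's city label to its borough.
# Queens addresses are commonly labeled by neighborhood, hence the extra keys.
CITY_TO_BOROUGH = {
    "New York": "Manhattan",
    "Brooklyn": "Brooklyn",
    "Bronx": "Bronx",
    "Staten Island": "Staten Island",
    "Flushing": "Queens",
    "Astoria": "Queens",
    "Jamaica": "Queens",
}


def borough_for(city: str) -> Optional[str]:
    """Return the borough for a city label, or None if it is not in the lookup."""
    return CITY_TO_BOROUGH.get(city.strip())


def price_per_sqft(sale_price: Optional[float],
                   area_sqft: Optional[float]) -> Optional[float]:
    """Sale price divided by property size; None if either value is unusable."""
    if sale_price is None or not area_sqft:
        return None
    return sale_price / area_sqft


print(borough_for("Flushing"))             # Queens
print(price_per_sqft(705_000.0, 1_500.0))  # 470.0
```

Guarding against a missing or zero area avoids a division error on listings where the size was not shown.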

After fully scraping the desired portions of the site, there were over 13,000 listings in the data set, with sale dates ranging from May 2018 to January 2019 (the site itself restricts sale data availability to nine months).

Project Data Cleaning (Handling Missingness and Errors)


Missingness:

Prior to analyzing the scraped data, the missingness and errors need to be addressed in order to reduce bias. The first step is to take a look at the data frame:


Sample observations (first ten) of the data frame

 

At this point, there are 13,267 listings, of which 1,952 are missing final sale price information. These are listings that appeared in the Trulia website's 'Sold' section but either had no Sale Price or were shown as re-listed on the market. Further investigation showed that these were most likely listings whose sale was not ultimately completed, or that were removed from the site entirely.

Since their final sale price or status cannot be determined, they can be removed for the purposes of this project. These missing values appear to be random, so removing them should not discard any underlying pattern.
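Dropping those rows is a simple filter. The sketch below assumes each listing is a dictionary with a `sale_price` key that is `None` when the price is missing; the three example rows are made up.

```python
def drop_missing_price(listings):
    """Return (kept, n_dropped), keeping rows whose sale_price is not None."""
    kept = [row for row in listings if row.get("sale_price") is not None]
    return kept, len(listings) - len(kept)


# Made-up rows for illustration only.
rows = [
    {"address": "A", "sale_price": 705_000.0},
    {"address": "B", "sale_price": None},
    {"address": "C", "sale_price": 1_200_000.0},
]
kept, n_dropped = drop_missing_price(rows)
print(len(kept), n_dropped)  # 2 1
```

Returning the number of dropped rows makes it easy to confirm the count against what was expected before moving on.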

Sale Price Errors:

A visual study of all Sale Prices in the dataset has shown that there are 23 unusual listings with a final sale price below $3,000 (actually grouped far below a $10K "boundary") that do not make sense as real estate prices in New York City during this nine-month period.

Sale Prices below $10,000

 

Additionally, for these same listings, the price/ sqft of the property is roughly below $3/ sqft, which is very far from the dataset mean of $573.92/ sqft (for all of NYC).

Price/ Area of Listings below $10,000

 

As such, I decided to remove these listings as potential data entry errors. A closer inspection of the trulia.com site shows these entry errors are random (no clear trend with other variables and neighborhood prices).
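A minimal sketch of this check, assuming the $10K "boundary" is used as the cut-off and that rows are dictionaries with `sale_price` and `area_sqft` keys (both names and the example rows are illustrative):

```python
MIN_PLAUSIBLE_PRICE = 10_000  # USD; the "boundary" discussed above


def split_by_price_floor(listings, floor=MIN_PLAUSIBLE_PRICE):
    """Split rows into (plausible, suspect) by comparing sale_price to floor."""
    plausible, suspect = [], []
    for row in listings:
        target = suspect if row["sale_price"] < floor else plausible
        target.append(row)
    return plausible, suspect


# Made-up rows for illustration only.
rows = [
    {"sale_price": 705_000.0, "area_sqft": 1_500.0},
    {"sale_price": 2_500.0, "area_sqft": 1_000.0},
]
plausible, suspect = split_by_price_floor(rows)
for row in suspect:
    # A price per sqft of a few dollars is a second sign of an entry error.
    print(row["sale_price"], row["sale_price"] / row["area_sqft"])
```

Keeping the suspect rows in a separate list, rather than deleting them outright, allows them to be inspected on the site before the final decision to remove them.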

Property Size Errors:

On the lower end of the spectrum of property sizes, there seems to be a clear outlier with a Sale Price north of $40M. That listing will be investigated individually later, in the next section.

Property Size vs. Sale Price

 

In this section, we will look at the very largest properties with unusually low sale prices.


Property Size vs Sale Price (Listings above 500,000 sqft.)

 

The properties above 500,000 sqft. are in the upper range of sale price as well. Furthermore, the largest listings seem to be grouped by their zip codes, meaning they might be similar properties in a similar neighborhood.

However, the two largest properties, both above 850,000 sqft., should certainly be closely investigated. Both properties are actually in the same zip code, 11201 in Brooklyn, NY.

Two largest properties in the data set

 

A quick look in that specific neighborhood reveals:

  • There are 121 sold listings from 11201 on the Trulia site.
  • The median property size was 1,160 sqft.
  • The 95th percentile of property size was 171,000 sqft.

These two properties are far above that range. As such, they were individually investigated on the Trulia site.
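The neighborhood summary above can be sketched with the standard library. This is an illustration under assumptions: rows are dictionaries with `postal_code` and `area_sqft` keys, the sizes are made up, and the percentile uses the "inclusive" interpolation method, which may differ slightly from the method used for the figures in this post.

```python
from statistics import median, quantiles


def size_summary(listings, postal_code):
    """Count, median and 95th percentile of property size for one zip code."""
    sizes = sorted(
        row["area_sqft"]
        for row in listings
        if row["postal_code"] == postal_code and row["area_sqft"] is not None
    )
    if len(sizes) < 2:
        raise ValueError("need at least two sizes to estimate a percentile")
    # n=20 yields cut points at 5%, 10%, ..., 95%; the last one is the 95th.
    p95 = quantiles(sizes, n=20, method="inclusive")[-1]
    return {"count": len(sizes), "median": median(sizes), "p95": p95}


# Made-up sizes for illustration only.
rows = [{"postal_code": "11201", "area_sqft": s}
        for s in (800, 1000, 1160, 1300, 2000)]
rows.append({"postal_code": "10001", "area_sqft": 900})

summary = size_summary(rows, "11201")
print(summary)  # {'count': 5, 'median': 1160, 'p95': 1860.0}
```

Any listing far above the `p95` value for its zip code is then a candidate for the kind of individual follow-up described here.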

One listing was a studio apartment and the other was a 1 bedroom co-op. The sizes listed for those properties are not reasonable at all. There is a chance that the property sizes shown belong to the entire apartment complex and not just the unit being sold. Since we do not have accurate values for the sizes of those properties, those listings can be removed for our purposes.

Price Per Unit Area Errors:

Visualizing the price per unit area feature below clearly shows an unusual listing.

Plot of Sale Price Per Unit Area of Listing

 

Checking the specific listing, following up with it on the Trulia site and comparing the other listings in the same neighborhood, this was clearly a data entry error and as such, the listing must be removed (as shown below):

Sale price per unit area outlier

 

Coincidentally, this was also the largest Sale Price listed at $44M (by a very large margin: the next highest listing is $18M), which lends support to the notion that this property really is an error.

Project Data Analysis and Visualization


A few summary statistics of the final, cleaned data set are as follows:

  • Remaining number of listings after removing errors: 11,289
  • Mean sale price: $878,191.73
  • Median sale price: $705,000.00
  • Maximum sale price: $18,000,000.00
  • Median property size: 1,520 sqft.
  • Median price per unit area: $451.61/ sqft.
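The remaining listing count is consistent with the removals described in the cleaning section, assuming each step removed exactly the rows counted in the text and no listing was removed twice:

```python
total_scraped = 13_267         # listings before cleaning
missing_sale_price = 1_952     # no final sale price
low_price_errors = 23          # sale price far below $10K
property_size_errors = 2       # the two largest listings in zip code 11201
price_per_area_errors = 1      # the $44M outlier

remaining = (
    total_scraped
    - missing_sale_price
    - low_price_errors
    - property_size_errors
    - price_per_area_errors
)
print(remaining)  # 11289
```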

Data Analysis by Sale Price and Property Size

Generating a scatterplot of the logarithmic (base 10) values of the sale price and property size of the cleaned data set reveals a very telling picture. A trend becomes apparent in two very distinct groupings of NYC listings. The log values are used to account for the large range of magnitudes in the values. Additionally, the listings have been colored by their borough locations to reveal interesting trends and groupings.
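To illustrate why the log scale helps, the short sketch below applies a base-10 logarithm to a few of the summary values quoted in this post. The helper name `log10_or_none` is illustrative; it returns `None` for missing or non-positive values, which have no logarithm.

```python
import math


def log10_or_none(value):
    """Base-10 logarithm, or None for missing or non-positive input."""
    if value is None or value <= 0:
        return None
    return math.log10(value)


examples = [
    ("median property size (sqft)", 1_520),
    ("median sale price (USD)", 705_000),
    ("maximum sale price (USD)", 18_000_000),
]
for label, value in examples:
    print(f"{label}: {value:,} -> {log10_or_none(value):.2f}")
```

Values that differ by several orders of magnitude end up only a few units apart on the log scale, so the common listings and the very large ones can share one readable plot.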

Property size vs. sale price

 

The scatterplot reveals one large grouping of home listings centered around median sale price and property size, with no clear or distinct linear correlation. There seems to be another loosely scattered grouping of larger property sizes, whose sale prices are not that far outside the overall range.

This seems to depict a picture where property size is not a clear or ultimate describing factor in real estate prices in New York. In fact, the colors of the boroughs seem to show that trend. The correlation between size and price is a low -0.05. If data were available on the number of bedrooms and bathrooms rather than just the total area of the listings, perhaps a more explanatory image would emerge.
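For reference, a correlation coefficient like the one quoted above can be computed as follows. This is a plain Pearson correlation written out by hand; the post does not state whether the -0.05 figure was computed on the raw or the log-transformed values, and the input lists here are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: a perfectly linear relationship gives r = 1.
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(sizes, prices))
```

A value near zero, as reported for this data set, means a straight line through size explains almost none of the variation in price.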

However, the scatterplot does seem to show that Manhattan accounts for the majority of the higher sale prices, while Queens and Brooklyn occupy most of the center of the main grouping of listings. The data also reflect the larger borough populations of Queens and Brooklyn through the number of listings.

The separate histograms of the log values of sale price and property size reflect the distribution of data shown in this scatterplot.

Histogram of Log 10 of Sale Price

 

The sale prices across all the boroughs seem rather centralized towards just below $1 million as confirmed by the closely-tied median and mean sale prices ($705K and $878K respectively).

Histogram of Log 10 of Property Size

 

Unlike sale price, the property sizes actually reflect the distinction between common and larger listings. The majority of the properties cluster around the median property size of 1,520 sqft. while the fewer larger properties skew the mean property size up to 5,881 sqft.

Data Analysis by Location

As the scatterplot above indicates, location might be a better indicator of final sale price for the city of New York than property size. The following is the latest census data (from here) on the population in each borough to give us a general background of the numbers in this dataset.

Borough Populations (US Census Bureau, 2010)

To start with, here are some visualizations:

Number of Listings per Borough

 

Right away, looking at the spread of just the number of listings across the boroughs, we see that Queens and Brooklyn have the most sold homes on Trulia, reflecting their larger populations. Manhattan had the fewest sales of all the boroughs, despite having a larger population than the Bronx and Staten Island.

Median Sale Price per Borough

 

The median sale prices for each of the boroughs start to show a different trend with Manhattan vastly out-stripping the rest of the city in terms of cost. The median sale price there was about $1.2 million, whereas Brooklyn and Queens were $875K and $730K respectively.

Median Property Size per Borough

 

Again, looking at the median property size by borough, we see that the Bronx typically had the largest listings, followed by Brooklyn and Queens, with Staten Island and Manhattan at the bottom. Taken together with the median price and number of listings by borough, this graph shows that Manhattan, despite having a large population, had the fewest but most expensive home sales. It seems that neither the size of the properties nor the size of the population can explain home sale prices better than the boroughs themselves. Location seems to be the dominating factor.

Median Price per Area by Borough

 

To summarize the previous graphs, plotting the median price per area for each of the boroughs paints the most complete picture. We see the effects of the median sale price and property size reflected here, ranking the boroughs by cost. We can truly visualize how expensive the island of Manhattan is within the city, which is in and of itself tremendously expensive by all measures.
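The per-borough medians behind these bar charts amount to a group-by followed by a median. A minimal sketch, assuming rows are dictionaries with a `borough` key plus the numeric field of interest (the rows and the `price_per_sqft` values below are made up):

```python
from collections import defaultdict
from statistics import median


def median_by_borough(listings, field):
    """Median of `field` per borough, skipping rows where it is missing."""
    groups = defaultdict(list)
    for row in listings:
        value = row.get(field)
        if value is not None:
            groups[row["borough"]].append(value)
    return {borough: median(values) for borough, values in groups.items()}


# Made-up rows for illustration only.
rows = [
    {"borough": "Manhattan", "price_per_sqft": 1200.0},
    {"borough": "Manhattan", "price_per_sqft": 1000.0},
    {"borough": "Queens", "price_per_sqft": 400.0},
    {"borough": "Queens", "price_per_sqft": 500.0},
    {"borough": "Queens", "price_per_sqft": 450.0},
]
medians = median_by_borough(rows, "price_per_sqft")

# Rank boroughs from most to least expensive per square foot.
ranking = sorted(medians, key=medians.get, reverse=True)
print(medians, ranking)
```

The same function works for sale price or property size by passing a different `field`, which is how the three median charts relate to one another.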

Project Conclusions


The Trulia website is very extensive and clearly contains a lot of data for the New York metropolitan area. However, the site does come with its limitations.

The cap on the highest sale price seems to be $18 million, meaning (quite obviously) higher priced, more exclusive listings are not publicly displayed/ listed on the site. This definitely limits any view into the potential "trickle-down" effects of the more expensive listings on the rest of the market. 

Additionally, I found a significant number of data entry errors that limited the overall analysis. These ranged mainly from the status of the listing or the actual sale price to the size of the property listed.

In essence, this project is a study on the site itself and what it shows as a search tool for the NYC real estate market, rather than a thorough market analysis. It also revealed tangible insights into the more common portions of the market according to price, location and size.

Future Work


In terms of future work, I would be interested in applying some more advanced machine learning techniques and interactive visualizations to make further use of this data.

Another useful venture might be to execute scheduled "scrapes" of the site every nine months to build a continuous and seasonal/ yearly data set that would allow an analysis of the market over the sale dates.

From the perspective of web scraping, I want to be able to scrape the more detailed data from the individual listing pages, a level deeper into the site than the results page. Those pages hold more information on the sale history, the number of bedrooms and bathrooms, and other amenities and features of each listing, which could support a richer and more detailed analysis.

 


 

Thank you for viewing my project!

-  All suggestions and comments are welcome  -

About Author

Sabbir Mohammed

Sabbir is an aspiring data scientist with a recent certification from the NYC Data Science Academy. He obtained his BS in Mechanical Engineering from Rensselaer Polytechnic Institute and has since spent several years in logistics and procurement for...

