NYC Real Estate - A Data Webscraping Project

Posted on Mar 7, 2019

This web scraping project focuses on over 13,000 completed residential real estate sales in New York City, as listed on the Trulia website between May 2018 and January 2019. The website was scraped using a Python Scrapy "spider".

The code for this project can be found here.

Project Motivation


Real estate across the United States is a popular topic of discussion. My hometown of New York is making headlines for its rising real estate prices, which rank among the highest in the world. There is plenty of discussion in the media about the causes and effects. Data reports state that "New York ranks No. 1 in losing residents to other states" with an estimated drop of 186,000 residents between 2015 and 2016.

However, for someone who is actually a long-time New York City resident and a first-generation immigrant in the country, the American dream of owning a home feels like an ever-increasing impossibility. This is a difficult reality underneath all of the numbers and statistics. That is where I got the idea for this project.

Aside from the potential causes and effects of the current state of the real estate market, I was curious about existing trends. While very comprehensive data on New York housing is available, I was interested in analyzing the data held and displayed on the real estate listing search tools we use, and perhaps in gleaning trends beyond what is readily shown on these sites. I decided to scrape trulia.com and, more specifically, its "sold" section for New York City.

Project Data Source


I used the Python Scrapy module to build what is called a "spider". A Scrapy spider "crawls" a website and scrapes the desired data from the URLs and pages outlined within the program. The source code for my project is available here.

The starting point for my spider is the page shown below, the "recently sold" section of the Trulia site for New York City. I had my spider re-initialize for each of the five boroughs of New York City and scrape the data from all of the result pages.

Screenshot of Trulia website

 

Each Trulia result page shows 30 listings, and the spider pulled the following information for each one (as marked with red boxes above):

  • Sale Price (in USD)
  • Sale Date
  • Area (in sqft.)
  • Street Address
  • City
  • Postal Code

The city and postal code were actually present in the page's HTML source.

The original plan was to also visit each listing page from the results page and scrape some additional data for analysis such as:

  • Property Type
  • Property Age

However, the spider was terminated when it made these additional requests to the site. As such, I proceeded with fewer variables than desired. Once the data was scraped, I added two new variables:

  • Borough
  • Price per Area (in USD/ sqft)

The borough variable was generated based on the city variable and the price per area is simply the final sale price divided by the total property size of the listing.
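The two derived variables can be sketched with pandas as below. The column names, the sample rows and the city-to-borough lookup are assumptions made for this example; in particular, the lookup is only a fragment, since listing "city" values in Queens are usually neighborhood names and a real table would need many more entries.

```python
import pandas as pd

# Hypothetical city-to-borough lookup (incomplete, for illustration only).
CITY_TO_BOROUGH = {
    "New York": "Manhattan",
    "Brooklyn": "Brooklyn",
    "Bronx": "Bronx",
    "Staten Island": "Staten Island",
    "Flushing": "Queens",
    "Astoria": "Queens",
}

# Made-up sample rows standing in for the scraped listings.
listings = pd.DataFrame({
    "city": ["New York", "Flushing", "Brooklyn"],
    "sale_price": [1_200_000, 730_000, 875_000],
    "area_sqft": [1_000, 1_460, 1_750],
})

# Borough comes from the city; cities missing from the lookup become NaN.
listings["borough"] = listings["city"].map(CITY_TO_BOROUGH)

# Price per area is the sale price divided by the property size.
listings["price_per_sqft"] = listings["sale_price"] / listings["area_sqft"]

print(listings)
```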

After fully scraping the desired portions of the site, there were over 13,000 listings in the data set, with sale dates ranging from May 2018 to January 2019 (the site itself restricts sale data availability to nine months).

Project Data Cleaning (Handling Missingness and Errors)


Missingness:

Prior to analyzing the scraped data, the missingness and errors need to be addressed in order to reduce bias.  The first step is taking a look at the data frame:

Sample observations (first ten) of the data frame

 

At this point, there are 13,267 listings, of which 1,952 are missing final sale price information. These are listings that were shown in the Trulia website's 'Sold' section but either had no sale price or showed that the property had been re-listed on the market. Further investigation showed that these were most likely listings whose sale was not ultimately completed, or that were removed from the site entirely.

Since their final sale price or status cannot be determined, these listings were removed for the purposes of this project. The missing values appear to be random, so removing them should not hide any underlying pattern.
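As a minimal sketch of this step, assuming the scraped data sits in a pandas data frame with a `sale_price` column (the column names and the sample rows here are made up):

```python
import pandas as pd

# Made-up sample rows; None marks a listing with no final sale price.
listings = pd.DataFrame({
    "street_address": ["1 Example St", "2 Example St", "3 Example St", "4 Example St"],
    "sale_price": [705_000.0, None, 1_200_000.0, None],
})

# Count the listings without a sale price, then drop them.
n_missing = int(listings["sale_price"].isna().sum())
cleaned = listings.dropna(subset=["sale_price"]).reset_index(drop=True)

print(f"{n_missing} of {len(listings)} listings have no sale price")
print(f"{len(cleaned)} listings remain")
```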

Sale Price Errors:

A visual study of all sale prices in the data set shows 23 unusual listings with a final sale price below $3,000 (grouped far below a $10K "boundary"). These prices do not make sense for real estate in New York City during this nine-month period.

Sale Prices below $10,000

 

Additionally, for these same listings, the price per area is roughly below $3/sqft, which is very far from the data set mean of $573.92/sqft (for all of NYC).

Price/ Area of Listings below $10,000

 

As such, I removed these listings as probable data entry errors. A closer inspection on the trulia.com site suggests these errors are random (no clear relationship with other variables or with neighborhood prices).
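A sketch of this filter, again with assumed column names and made-up rows. The $10K threshold is the "boundary" mentioned above; it is a judgment call rather than a fixed rule.

```python
import pandas as pd

MIN_PLAUSIBLE_PRICE = 10_000  # the $10K "boundary" used as a cutoff

# Made-up sample rows: two implausibly cheap listings and two ordinary ones.
listings = pd.DataFrame({
    "sale_price": [2_500, 705_000, 1_800, 1_200_000],
    "area_sqft": [1_200, 1_520, 900, 1_000],
})
listings["price_per_sqft"] = listings["sale_price"] / listings["area_sqft"]

# Set the suspect listings aside for inspection before dropping them.
suspect = listings[listings["sale_price"] < MIN_PLAUSIBLE_PRICE]
cleaned = listings[listings["sale_price"] >= MIN_PLAUSIBLE_PRICE]

print(suspect[["sale_price", "price_per_sqft"]])
```

Keeping the suspect rows in their own data frame makes it easy to check them against the site by hand before they are discarded.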

Property Size Errors:

On the lower end of the spectrum of property sizes, there seems to be a clear outlier with a sale price north of $40M. That listing is investigated individually later, under Price Per Unit Area Errors.

Property Size vs. Sale Price

 

In this section, we will look at the very largest properties with unusually low sale prices.

Property Size vs Sale Price (Listings above 500,000 sqft.)

 

The price range for the properties above 500,000 sqft. is in the upper range of sale price as well. Furthermore, the largest listings seem to be grouped by their zip codes, meaning they might be similar properties in a similar neighborhood.

However, the two largest properties, both above 850,000 sqft., should certainly be closely investigated. Both properties are in the same zip code, 11201 in Brooklyn, NY.

Two largest properties in the data set

 

A quick look in that specific neighborhood reveals:

  • There are 121 sold listings from 11201 on the Trulia site.
  • The median property size was 1,160 sqft.
  • The 95th percentile of property size was 171,000 sqft.

These two properties are far above that range. As such, they were individually investigated on the Trulia site.

One listing was a studio apartment and the other was a one-bedroom co-op. The sizes listed for those properties are not plausible for such units. There is a chance that the sizes shown belong to the entire apartment complex and not just the unit being sold. Since we do not have accurate sizes for those properties, those listings can be removed for our purposes.
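The neighborhood check described above can be sketched as a small helper that compares each listing in a zip code against that zip code's median and 95th percentile size. The function name, column names and sample values are assumptions for this example, not the project's code or data.

```python
import pandas as pd


def flag_oversized(listings, postal_code, quantile=0.95):
    """Return (flagged rows, median size, cutoff) for one postal code."""
    local = listings[listings["postal_code"] == postal_code]
    median_size = local["area_sqft"].median()
    cutoff = local["area_sqft"].quantile(quantile)
    flagged = local[local["area_sqft"] > cutoff]
    return flagged, median_size, cutoff


# Made-up sizes: twenty ordinary listings and one implausibly large one in
# 11201, plus a listing from another zip code that should be ignored.
sizes = [1_000 + 10 * i for i in range(19)] + [1_180, 900_000, 2_000]
codes = ["11201"] * 21 + ["10001"]
listings = pd.DataFrame({"postal_code": codes, "area_sqft": sizes})

flagged, median_size, cutoff = flag_oversized(listings, "11201")
print(f"median: {median_size:,.0f} sqft, 95th percentile: {cutoff:,.0f} sqft")
print(flagged)
```

Listings flagged this way are candidates for manual review rather than automatic removal, since a genuinely large property would also exceed the cutoff.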

Price Per Unit Area Errors:

Visualizing the price per unit area feature below clearly shows an unusual listing.

Plot of Sale Price Per Unit Area of Listing

 

After checking the specific listing on the Trulia site and comparing it with the other listings in the same neighborhood, I concluded that this was clearly a data entry error, so the listing was removed (as shown below):

Sale price per unit area outlier

 

This was also the largest sale price listed, at $44M (by a very large margin: the next highest listing is $18M), which lends support to the notion that this entry really is an error.

Project Data Analysis and Visualization


A few summary statistics of the final, cleaned data set are as follows:

  • Remaining number of listings after removing errors: 11,289
  • Mean sale price: $878,191.73
  • Median sale price: $705,000.00
  • Maximum sale price: $18,000,000.00
  • Median property size: 1,520 sqft.
  • Median price per unit area: $451.61/ sqft.
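As a quick consistency check, the remaining count can be reproduced from the removals described in the cleaning section, assuming those four steps were the only removals and did not overlap:

```python
initial_listings = 13_267

# Removals described in the data cleaning section.
removed = {
    "missing sale price": 1_952,
    "sale price below $10K": 23,
    "implausible property size": 2,
    "price per area outlier": 1,
}

remaining = initial_listings - sum(removed.values())
print(remaining)  # 11289
```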

Data Analysis by Sale Price and Property Size

Generating a scatterplot of the logarithmic (base 10) values of the sale price and property size of the cleaned data set reveals a very telling picture. A trend becomes apparent in two very distinct groupings of NYC listings. The log values are used to account for the large range of magnitudes in the values. Additionally, the listings have been colored by borough to reveal further trends and groupings.

Property size vs. sale price

 

The scatterplot reveals one large grouping of home listings centered around median sale price and property size, with no clear or distinct linear correlation. There seems to be another loosely scattered grouping of larger property sizes, whose sale prices are not that far outside the overall range.

This seems to depict a picture where property size is not a clear or decisive factor in real estate prices in New York. In fact, the colors of the boroughs seem to show that trend. The correlation between size and price is a weak -0.05. If data were available on the number of bedrooms and bathrooms rather than just the total area of the listings, perhaps a more explanatory picture would emerge.
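The log transform and the correlation can be sketched as follows. The text does not say whether the -0.05 figure was computed on the raw or the log values, so this example computes both; the column names and sample rows are made up and will not reproduce that figure.

```python
import numpy as np
import pandas as pd

# Made-up sample rows, including one very large property.
listings = pd.DataFrame({
    "sale_price": [450_000, 705_000, 875_000, 1_000_000, 3_500_000],
    "area_sqft": [1_800, 1_520, 1_300, 900, 250_000],
})

# Base-10 logs compress the large range of magnitudes in both variables.
listings["log_price"] = np.log10(listings["sale_price"])
listings["log_area"] = np.log10(listings["area_sqft"])

# Series.corr computes the Pearson correlation by default.
raw_corr = listings["sale_price"].corr(listings["area_sqft"])
log_corr = listings["log_price"].corr(listings["log_area"])

print(f"raw: {raw_corr:.2f}, log: {log_corr:.2f}")
```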

However, the scatterplot does seem to show that Manhattan accounts for the majority of the higher sale prices, while Queens and Brooklyn occupy most of the center of the main grouping of listings. The data also reflects the larger borough populations of Queens and Brooklyn through the number of listings.

The separate histograms of the log values of sale price and property size reflect the distribution of data shown in this scatterplot.

Histogram of Log 10 of Sale Price

 

The sale prices across all the boroughs seem rather centralized towards just below $1 million as confirmed by the closely-tied median and mean sale prices ($705K and $878K respectively).

Histogram of Log 10 of Property Size

 

Unlike sale price, the property sizes actually reflect the distinction between common and larger listings. The majority of the properties cluster around the median property size of 1,520 sqft., while the fewer, larger properties skew the mean property size up to 5,881 sqft.

Data Analysis by Location

As the scatterplot above indicates, location might be a better indicator of final sale price for the city of New York than property size. The following is the latest census data (from here) on the population in each borough to give us a general background of the numbers in this data set.

Borough Populations (US Census Bureau, 2010)

To start with, here are some visualizations:

Number of Listings per Borough

 

Right away, looking at the spread of just the number of listings across the boroughs, we see that Queens and Brooklyn have the most sold homes on Trulia, reflecting their larger populations. Manhattan had the fewest sales of any borough, despite having a larger population than the Bronx and Staten Island.

Median Sale Price per Borough

 

The median sale prices for each of the boroughs start to show a different trend, with Manhattan vastly outstripping the rest of the city in terms of cost. The median sale price there was about $1.2 million, whereas Brooklyn and Queens were at $875K and $730K respectively.

Median Property Size per Borough

 

Looking at the median property size by borough, we see that the Bronx typically had the largest listings, followed by Brooklyn and Queens, with Staten Island and Manhattan at the bottom. Taken together with the median price and number of listings by borough, this graph shows that Manhattan, despite its large population, had the fewest but most expensive home sales. It seems that neither the size of the properties nor the size of the population can explain home sale prices better than the boroughs themselves. Location seems to be the dominating factor.

Median Price per Area by Borough

 

To summarize the previous graphs, plotting the median price per area for each of the boroughs paints the most complete picture. We see the effects of the median sale price and property size reflected here, ranking the boroughs by cost. We can truly visualize how expensive the island of Manhattan is within the city, which is in and of itself tremendously expensive by all measures.
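The per-borough figures behind these charts can be computed in one grouped aggregation. This is a sketch with assumed column names and made-up rows, so the numbers it prints are not the project's results.

```python
import pandas as pd

# Made-up sample rows standing in for the cleaned listings.
listings = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Brooklyn", "Queens"],
    "sale_price": [1_000_000, 1_400_000, 800_000, 900_000, 1_000_000, 730_000],
    "area_sqft": [800, 1_000, 1_500, 1_600, 2_000, 1_460],
})
listings["price_per_sqft"] = listings["sale_price"] / listings["area_sqft"]

# One row per borough: listing count plus the three medians, most
# expensive borough (by price per area) first.
by_borough = (
    listings.groupby("borough")
    .agg(
        n_listings=("sale_price", "size"),
        median_price=("sale_price", "median"),
        median_area=("area_sqft", "median"),
        median_price_per_sqft=("price_per_sqft", "median"),
    )
    .sort_values("median_price_per_sqft", ascending=False)
)

print(by_borough)
```

Each column of `by_borough` corresponds to one of the bar charts above, so the table can be passed directly to a plotting library.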

Project Conclusions


The Trulia website is very extensive and clearly contains a lot of data for the New York metropolitan area. However, the site does come with its limitations.

The highest sale price in the data is $18 million, which suggests that higher-priced, more exclusive listings are not publicly listed on the site. This limits any view into the potential "trickle-down" effects of the more expensive listings on the rest of the market.

Additionally, I found a significant number of data entry errors that limited the overall analysis. These mainly concerned the status of the listing, the actual sale price and the size of the property listed.

In essence, this project is a study on the site itself and what it shows as a search tool for the NYC real estate market, rather than a thorough market analysis. It also revealed tangible insights into the more common portions of the market according to price, location and size.

Future Work


In terms of future work, I would be interested in applying some more advanced machine learning techniques and interactive visualizations to make further use of this data.

Another useful venture might be to execute scheduled "scrapes" of the site every nine months to build a continuous, seasonal and yearly data set that would allow an analysis of the market over the sale dates.

From the perspective of web scraping, I want to be able to scrape the more detailed data from the individual listing pages, a level deeper into the site than the results page. Those pages hold more information on the sale history, the number of bedrooms and bathrooms, and other amenities and features of each listing, which would allow a richer and more detailed analysis.

 


 

Thank you for viewing my project!

-  All suggestions and comments are welcome  -

About Author

Sabbir Mohammed

Sabbir is an aspiring data scientist with a recent certification from the NYC Data Science Academy. He obtained his BS in Mechanical Engineering from Rensselaer Polytechnic Institute and has since spent several years in logistics and procurement for...