NYC Real Estate - A Data Webscraping Project
This is a web scraping project that focused on completed residential real estate sales (over 13,000) in New York City, as listed on the Trulia website (from May 2018 - January 2019). The website was scraped using a Python Scrapy "spider".
The code for this project can be found here.
Real estate across the United States is a popular topic of discussion. My hometown of New York regularly makes headlines for its rising real estate prices, which rank among the highest in the world. There is plenty of discussion in the media about the causes and effects. Data reports state that "New York ranks No. 1 in losing residents to other states" with an estimated drop of 186,000 residents between 2015 and 2016.
However, for someone who is actually a long-time New York City resident and a first-generation immigrant in the country, the American dream of owning a home feels like an ever-increasing impossibility. This is a difficult reality underneath all of the numbers and statistics. That is where I got the idea for this project.
Aside from the potential causes and effects of the current state of the real estate market, I was curious about existing trends. While there is very comprehensive and thorough data on New York housing available, I was interested in analyzing the data being held and displayed on the real estate listing search tools we use and perhaps even being able to glean further trends than what is readily available on these sites. I decided to scrape trulia.com and more specifically, their "sold" section for New York City.
Project Data Source
I used the Python scrapy module to build what is called a "spider". A scrapy spider "crawls" a website and scrapes the desired data from the URLs and pages outlined within the program. The source code for my project is available here.
The starting point for my spider is this page below within the Trulia site, as described earlier. It is the "recently sold" section for New York City. I actually had my spider re-initialize for each of the five boroughs of New York City and scrape the data from all of the result pages.
The spider would pull the following information for each listing being displayed on the result page (as marked with red boxes above) - there are 30 listings on each Trulia result page:
- Sale Price (in USD)
- Sale Date
- Area (in sqft.)
- Street Address
- Postal Code
The city and zip code were actually present in the HTML source of the page.
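Values scraped from a results page arrive as display strings, so they need converting to numbers before any analysis. The following is a minimal sketch of that step, not the project's actual code: the function names and the example formats (`"$705,000"`, `"1,520 sqft"`) are assumptions made for illustration.

```python
import re
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[int]:
    """Convert a display price such as '$705,000' to whole dollars.

    Returns None when no digits are present (for example, a missing price).
    Assumes prices are written out in full, not abbreviated like '$1.2M'.
    """
    match = re.search(r"\d[\d,]*", text or "")
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))


def parse_area(text: Optional[str]) -> Optional[int]:
    """Convert a display area such as '1,520 sqft' to whole square feet."""
    match = re.search(r"(\d[\d,]*)\s*sq", text or "", flags=re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

Returning `None` for unparseable text keeps missing values visible, so they can be counted and handled explicitly during cleaning.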
The original plan was to also visit each listing page from the results page and scrape some additional data for analysis such as:
- Property Type
- Property Age
However, the spider was terminated after it responded to the site's requests. As such, I proceeded with fewer variables than desired. Once the data was scraped, I actually added two new variables:
- Borough
- Price per Area (in USD/sqft)
The borough variable was generated based on the city variable and the price per area is simply the final sale price divided by the total property size of the listing.
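The two derived variables could be computed along these lines. This is a sketch under stated assumptions: the city-to-borough mapping below is a small, hypothetical sample (the real city values in the scraped data are not shown in this post), and the function names are illustrative.

```python
from typing import Optional

# Hypothetical, incomplete mapping from the scraped city field to a borough.
# The real data set would need an entry for every city value that occurs.
CITY_TO_BOROUGH = {
    "New York": "Manhattan",
    "Brooklyn": "Brooklyn",
    "Bronx": "Bronx",
    "Staten Island": "Staten Island",
    "Flushing": "Queens",
    "Jamaica": "Queens",
    "Astoria": "Queens",
}


def borough_for(city: str) -> Optional[str]:
    """Look up the borough for a city name; None if the city is not mapped."""
    return CITY_TO_BOROUGH.get(city.strip())


def price_per_area(sale_price: Optional[float],
                   area_sqft: Optional[float]) -> Optional[float]:
    """Final sale price divided by property size, in USD per sqft.

    Returns None when either value is missing or the area is zero.
    """
    if sale_price is None or not area_sqft:
        return None
    return sale_price / area_sqft
```

For example, a $705,000 sale of a 1,520 sqft property works out to roughly $464 per sqft.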
After fully scraping the desired portions of the site, there were over 13,000 listings in the data set and the sale dates ranged from May 2018 through January 2019 (a nine-month sale data availability restriction from the site itself).
Project Data Cleaning (Handling Missingness and Errors)
Prior to analyzing the scraped data, the missingness and errors need to be addressed in order to reduce bias. The first step is taking a look at the data frame:
At this point, there are 13,267 listings, of which 1,952 are missing final sale price information. These are listings that were shown in the 'Sold' section of the Trulia website but either did not have a Sale Price or showed that the property had been re-listed on the market. Further investigation showed that these were most likely listings whose sale was not ultimately completed, or that were removed from the site entirely.
Since their final sale price or status cannot be determined, for the purposes of this project, they can be removed. These missing values appear to be random, so removing them should not dismiss any potential patterns.
Sale Price Errors:
A visual study of all Sale Prices in the dataset has shown that there are 23 unusual listings with a final sale price below $3,000 (actually grouped far below a $10K "boundary") that do not make sense as real estate prices in New York City during this nine-month period.
Additionally, for these same listings, the price per sqft is roughly below $3/sqft, which is very far from the dataset mean of $573.92/sqft (for all of NYC).
As such, I decided to remove these listings as potential data entry errors. A closer inspection on trulia.com suggests these entry errors are random (no clear trend with other variables or neighborhood prices).
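The two removal rules so far (a missing sale price, and a sale price far below the $10K boundary) can be sketched as one pass over the listings. This is an illustrative sketch, not the project's code: it assumes each listing is a dict with a `sale_price` key, and the function name is hypothetical.

```python
PRICE_FLOOR_USD = 10_000  # the "$10K boundary" noted above


def split_by_sale_price(listings, floor=PRICE_FLOOR_USD):
    """Split listings into (kept, missing_price, suspiciously_low)."""
    kept, missing_price, suspiciously_low = [], [], []
    for listing in listings:
        price = listing.get("sale_price")
        if price is None:
            # No final sale price: the sale status cannot be determined.
            missing_price.append(listing)
        elif price < floor:
            # Implausibly cheap for NYC: a candidate data entry error.
            suspiciously_low.append(listing)
        else:
            kept.append(listing)
    return kept, missing_price, suspiciously_low
```

Returning the flagged groups instead of silently dropping them makes it possible to inspect each one on the site before deciding to remove it.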
Property Size Errors:
On the lower end of the spectrum of property sizes, there seems to be a clear outlier with a Sale Price north of $40M. That listing will be investigated individually later, in the next section.
In this section, we will look at the very largest properties with unusually low sale prices.
The sale prices for the properties above 500,000 sqft. are in the upper range as well. Furthermore, the largest listings seem to be grouped by their zip codes, meaning they might be similar properties in a similar neighborhood.
However, the two largest properties, above 850,000 sqft., should certainly be closely investigated. Both properties are actually in the same zip code, 11201 in Brooklyn, NY.
A quick look in that specific neighborhood reveals:
- There are 121 sold listings from 11201 on the Trulia site.
- The median property size was 1,160 sqft.
- The 95th percentile of property size was 171,000 sqft.
These two properties are far above that range. As such, they were individually investigated on the Trulia site.
One listing was a studio apartment and the other was a 1 bedroom co-op. The sizes listed for those properties are not reasonable at all. There is a chance that the property sizes that are shown belong to the entire apartment complex and not just the unit being sold. As such, since we do not have accurate values of the sizes of those properties, those listings can be removed for our purposes.
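The neighborhood check described above (median and 95th percentile of property size within one zip code) can be sketched with the standard library. The dict keys `zip` and `area_sqft` and the function names are assumptions for illustration, and the percentile uses the default interpolation of `statistics.quantiles`, which may differ slightly from the method used in the original analysis.

```python
import statistics


def size_summary(sizes_sqft):
    """Return (median, 95th percentile) for a list of property sizes."""
    ordered = sorted(sizes_sqft)
    median = statistics.median(ordered)
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=20)[-1]
    return median, p95


def flag_oversized(listings, zip_code):
    """Return listings in one zip code whose size exceeds its 95th percentile."""
    in_zip = [
        item for item in listings
        if item.get("zip") == zip_code and item.get("area_sqft")
    ]
    _, p95 = size_summary([item["area_sqft"] for item in in_zip])
    return [item for item in in_zip if item["area_sqft"] > p95]
```

A listing flagged this way is only a candidate for review; as with the two 11201 listings, the final decision came from checking each one individually.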
Price Per Unit Area Errors:
Visualizing the price per unit area feature below clearly shows an unusual listing.
Checking the specific listing, following up with it on the Trulia site and comparing the other listings in the same neighborhood, this was clearly a data entry error and as such, the listing must be removed (as shown below):
Coincidentally, this was also the largest Sale Price listed, at $44M (by a very large margin: the next highest listing is $18M), which lends support to the notion that this listing really is an error.
Project Data Analysis and Visualization
A few summary statistics of the final, cleaned data set are as follows:
- Remaining number of listings after removing errors: 11,289
- Mean sale price: $878,191.73
- Median sale price: $705,000.00
- Maximum sale price: $18,000,000.00
- Median property size: 1,520 sqft.
- Median price per unit area: $451.61/ sqft.
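The remaining listing count can be cross-checked against the removals described in the cleaning section. The short calculation below assumes that no listing was counted under more than one removal rule; the variable names are illustrative.

```python
scraped_listings = 13_267        # listings before cleaning
missing_sale_price = 1_952       # no final sale price
implausibly_low_price = 23       # sale price below $3,000
oversized_in_11201 = 2           # studio and 1 bedroom co-op with unreasonable sizes
price_per_area_outlier = 1       # the $44M listing

remaining = (
    scraped_listings
    - missing_sale_price
    - implausibly_low_price
    - oversized_in_11201
    - price_per_area_outlier
)
print(remaining)  # 11289
```

The result matches the 11,289 listings reported above.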
Data Analysis by Sale Price and Property Size
Generating a scatterplot of the logarithmic (base 10) values of the sale price and property size of the cleaned data set reveals a very telling picture. A trend becomes apparent in two very distinct groupings of NYC listings. The log values are used to account for the large range of magnitudes in the values. Additionally, the listings have been colored by their borough locations to also reveal interesting trends and groupings.
The scatterplot reveals one large grouping of home listings centered around median sale price and property size, with no clear or distinct linear correlation. There seems to be another loosely scattered grouping of larger property sizes, whose sale prices are not that far outside the overall range.
This seems to depict a picture where property size is not a clear or ultimate describing factor in real estate prices in New York. In fact, the colors of the boroughs seem to show that trend. The correlation between size and price is a low -0.05. If the data were available on the number of bedrooms and bathrooms rather than just the total area of the listings, perhaps a more explanatory image would emerge.
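A correlation like the one quoted above, and the log transform used for the scatterplot, can be sketched as follows. This is a generic Pearson correlation written with the standard library; the post does not say whether the -0.05 figure was computed on raw or log values, and the dict keys `area_sqft` and `sale_price` are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def log10_points(listings):
    """Return (log10 size, log10 price) pairs for listings with positive values."""
    return [
        (math.log10(item["area_sqft"]), math.log10(item["sale_price"]))
        for item in listings
        if item.get("area_sqft", 0) > 0 and item.get("sale_price", 0) > 0
    ]
```

On a log scale, a 1,000 sqft home and a 1,000,000 sqft property sit at 3 and 6 respectively, which is what lets both groupings appear in a single plot.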
However, the scatterplot does seem to show that Manhattan accounts for the majority of the higher sale prices, while Queens and Brooklyn occupy most of the center of the main grouping of listings. The data also reflects the larger borough populations of Queens and Brooklyn through the number of listings.
The histograms of the log values of sale price and property size, taken separately, reflect the distribution of data shown in this scatterplot.
The sale prices across all the boroughs seem rather centralized towards just below $1 million as confirmed by the closely-tied median and mean sale prices ($705K and $878K respectively).
Unlike sale price, the property sizes actually reflect the distinction between common and larger listings. The majority of the properties cluster around the median property size of 1,520 sqft., while the fewer larger properties skew the mean property size up to 5,881 sqft.
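One quick way to quantify this kind of skew is the ratio of mean to median. The sketch below applies it to the reported figures and to a small made-up sample; the sample values are purely illustrative and are not taken from the data set.

```python
import statistics


def mean_to_median_ratio(values):
    """Ratio above 1 indicates a few large values pulling the mean upward."""
    return statistics.mean(values) / statistics.median(values)


# Reported figures for property size: mean 5,881 sqft., median 1,520 sqft.
reported_ratio = 5_881 / 1_520  # a little under 4

# Made-up sample: five typical homes and one very large property.
illustrative_sizes = [1_200, 1_400, 1_520, 1_600, 1_800, 250_000]
illustrative_ratio = mean_to_median_ratio(illustrative_sizes)
```

For comparison, the same ratio for sale price is about 1.25 ($878K over $705K), which is consistent with the tighter price distribution described above.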
Data Analysis by Location
As the scatterplot above indicates, location might be a better indicator of final sale price for the city of New York than property size. The following is the latest census data (from here) on the population in each borough to give us a general background of the numbers in this dataset.
To start with, here are some visualizations:
Right away, looking at the spread of just the number of listings across the boroughs, we see that Queens and Brooklyn have the most sold homes on Trulia, reflecting their larger populations. Manhattan had the fewest sales of all the boroughs, despite having a larger population than the Bronx and Staten Island.
The median sale prices for each of the boroughs start to show a different trend with Manhattan vastly out-stripping the rest of the city in terms of cost. The median sale price there was about $1.2 million, whereas Brooklyn and Queens were $875K and $730K respectively.
Again, looking at the median property size by borough, we see that the Bronx typically had the largest listings, followed by Brooklyn and Queens, with Staten Island and Manhattan at the bottom. Taken alongside this graph, the median price and number of listings by borough show that Manhattan, despite its large population, had the fewest but most expensive home sales. It seems that neither the size of the properties nor the size of the population explains home sale prices better than the boroughs themselves. Location seems to be the dominating factor.
To summarize the previous graphs, plotting the median price per area for each of the boroughs paints the most complete picture. We see the effects of the median sale price and property size reflected here, ranking the boroughs by cost. We can truly visualize how expensive the island of Manhattan is within the city, which is in and of itself tremendously expensive by all measures.
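The per-borough ranking behind such a plot amounts to a group-by followed by a median. A minimal sketch with the standard library follows; the dict keys `borough` and `price_per_area` and the function name are assumptions for illustration.

```python
import statistics
from collections import defaultdict


def median_price_per_area_by_borough(listings):
    """Return (borough, median USD/sqft) pairs, most expensive borough first."""
    grouped = defaultdict(list)
    for item in listings:
        borough = item.get("borough")
        value = item.get("price_per_area")
        if borough and value is not None:
            grouped[borough].append(value)
    medians = {b: statistics.median(vals) for b, vals in grouped.items()}
    return sorted(medians.items(), key=lambda pair: pair[1], reverse=True)
```

The median is used here instead of the mean so that a handful of very large or very expensive listings does not dominate a borough's figure.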
The Trulia website is very extensive and clearly contains a lot of data for the New York metropolitan area. However, the site does come with its limitations.
The cap on the highest sale price seems to be $18 million, meaning (quite obviously) higher priced, more exclusive listings are not publicly listed on the site. This definitely limits any view into the potential "trickle-down" effects of the more expensive listings on the rest of the market.
Additionally, I found a significant number of data entry errors that limited the overall analysis. These mainly involved the status of the listing, the actual sale price, or the size of the property listed.
In essence, this project is a study on the site itself and what it shows as a search tool for the NYC real estate market, rather than a thorough market analysis. It also revealed tangible insights into the more common portions of the market according to price, location and size.
In terms of future work, I would be interested in applying some more advanced machine learning techniques and interactive visualizations to make further use of this data.
Another useful venture might be to execute scheduled "scrapes" of the site every nine months to build a continuous seasonal or yearly data set that would allow an analysis of the market over the sale dates.
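If scheduled scrapes overlap in their date windows, the same sale could appear in more than one run, so the runs would need to be merged without duplicates. One possible sketch is below; the choice of key (street address, postal code, sale date) and the dict key names are assumptions, not something the site or the project defines.

```python
def merge_scrapes(*scrapes):
    """Merge several scrape results, keeping one record per unique sale.

    Each scrape is a list of dicts with 'address', 'zip' and 'sale_date' keys.
    When the same sale appears in several scrapes, the latest scrape wins.
    """
    merged = {}
    for scrape in scrapes:
        for listing in scrape:
            key = (listing["address"], listing["zip"], listing["sale_date"])
            merged[key] = listing
    return list(merged.values())
```

Keeping the record from the latest scrape means that any corrections made on the site between runs are carried into the combined data set.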
From the perspective of web scraping, I want to be able to scrape the more detailed data from the individual listing pages, a level deeper into the site than the results page. Those pages hold more information on the sale history, the number of bedrooms and bathrooms, and other amenities and features of each listing, which could support a richer and more detailed analysis.