Scraping the NYC Rental Market

Posted on Aug 21, 2019

Introduction

    NYC is constantly flooded with advertisements for apartments and condos across a diverse market. If you've had to scramble for an apartment in NYC, you understand just how hectic the process is here. Plenty of postings are available on well-known sites like Zillow, StreetEasy, NakedApartments and others that seek to provide a platform for housing in New York. While these sites offer significant depth in their searches, they lack certain features or filters needed for a deeper dive. Most potential renters want to find housing within their budget regardless of the agency. But what about those looking for high-end apartments and avoiding websites like StreetEasy? Comparing prices among different agencies is a more arduous task.

    Corcoran, Douglas Elliman, and Compass are three of New York's biggest real estate firms and have dominated the market landscape for years. If you manage to get your head out of your phone on the subway, look up and you might just see one of their advertisements. Compass, in particular, has been acquiring other firms and has gained footing in the revenue rankings, not far behind Corcoran and Elliman. These firms, amongst others at the top of the rankings, account for much of what's available in the NYC market and extend beyond it to places like the Hamptons and Florida. How do these agencies compare against one another?

Data Acquisition

    It would be a tedious task to go into every StreetEasy or Zillow posting to find out which agency listed the apartment. Oftentimes, this information isn't available, particularly for lesser-known agencies. With this in mind, I decided to scrape the websites for Corcoran, Douglas Elliman, and Compass to see how each one presents its rental listings, with the ultimate goal of collecting the data needed for analysis.

    Regardless of each agency's unique web design, all sites had the relevant information: address, neighborhood, number of bedrooms, number of baths, price, and occasionally features like square footage. All of this information is publicly available, provided you have a login to each site. As such, I used Selenium to systematically log in to the relevant sites, scrape the data of interest (only for Manhattan, Brooklyn and Queens), and write it to a CSV file to begin data cleaning and preprocessing for analysis.
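    The scrape-then-write step described above could be sketched roughly as follows. This is not the original scraper: the CSS selectors, the column names, and the `scrape_cards` and `write_csv` helpers are hypothetical, and each real site would need its own selectors plus a login step that is not shown here.

```python
import csv

# Hypothetical column names, based on the attributes listed in the text.
FIELDS = ["agency", "address", "neighborhood", "beds", "baths", "price"]

# Hypothetical CSS selectors; every agency site needs its own set.
SELECTORS = {
    "address": ".listing-address",
    "neighborhood": ".listing-neighborhood",
    "beds": ".listing-beds",
    "baths": ".listing-baths",
    "price": ".listing-price",
}


def scrape_cards(driver, agency, url):
    """Collect one dict per listing card found on the page at `url`.

    `driver` is expected to be a Selenium WebDriver that is already logged in.
    """
    driver.get(url)
    rows = []
    # "css selector" is the string value behind Selenium's By.CSS_SELECTOR.
    for card in driver.find_elements("css selector", ".listing-card"):
        row = {"agency": agency}
        for field, selector in SELECTORS.items():
            row[field] = card.find_element("css selector", selector).text.strip()
        rows.append(row)
    return rows


def write_csv(rows, path):
    """Write the scraped rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

    With Selenium installed, a driver could be created with `webdriver.Chrome()` and passed to `scrape_cards` once per results page, with the rows from all three agencies combined before calling `write_csv`.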

Data Manipulation

    After scraping the data off each of the three websites, I needed to normalize it across the three agencies so that everything could be analyzed uniformly without losing any important information. Here's how the data looked before normalization:

    As we can see, some of the values for our metrics of interest are written in different formats, which complicates analysis because the same measure shows up under several redundant labels. After cleaning the original data with Python, the data is uniform and ready for analysis:
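    The kind of cleaning described above might look like the following. The raw formats handled here (such as "$3,500/mo", "Studio" or "2 BR") are assumptions made for illustration, since the actual formats on each site are not listed in this post, and `parse_price` and `parse_beds` are hypothetical helper names.

```python
import re


def parse_price(raw):
    """Turn strings like '$3,500/mo' or '3500' into an integer dollar amount.

    Returns None when the string contains no digits.
    """
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


def parse_beds(raw):
    """Map 'Studio' to 0 and strings like '2 BR' or '2 Beds' to 2.

    Returns None when no bedroom count can be found.
    """
    text = raw.strip().lower()
    if "studio" in text:
        return 0
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None
```

    Applying the same two functions to every agency's rows gives a single numeric price column and a single numeric bedroom column, so that "Studio" on one site and "0 Beds" on another end up in the same group.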

Analysis

    With the data normalized, we can begin to explore what general trends between the different agencies might look like. We have to remember that these three agencies are the top three in all of New York based on revenue, so though we anticipate similarities in trends between the three firms, we're looking to explore the nuances that might separate them. What's the average price of each agency by borough? What about the average price difference for agencies by number of beds?

    Exploratory data analysis shows trends that we might expect to see (with some prior knowledge of NYC prices): average prices vary by borough and number of bedrooms more than they differ by firm. In the above bar charts, we seem to have some conflicting evidence: Douglas Elliman seems to have the most expensive pricing in Manhattan and Brooklyn, yet it seems to be cheaper than Corcoran for every bedroom count except studios. Why?

    This is known as Simpson's Paradox, a phenomenon in which a trend that appears in aggregated data weakens or reverses once the data is split into groups. Digging deeper, we can see why this might be the case by looking at firm inventory rather than average price.

    So why is Douglas Elliman's average price higher in Manhattan and Brooklyn? In the above bar chart, we can see that Elliman has more 3, 4, and 5 bedroom listings than any other agency. Since each additional bedroom raises the price, this larger share of big apartments inflates Elliman's average price by borough. Once prices are grouped by bedroom, that inventory effect disappears, which is why Corcoran's average price by bedroom comes out higher than Elliman's.
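    A toy example makes the effect concrete. The listings below are invented for illustration and are not taken from the scraped dataset: firm "A" is cheaper than firm "B" at every bedroom count, yet its overall average is higher because most of its inventory is 3 bedroom apartments.

```python
from collections import defaultdict
from statistics import mean

# Invented (firm, bedrooms, monthly price) listings, for illustration only.
listings = [
    ("A", 1, 3000), ("A", 3, 7000), ("A", 3, 7000), ("A", 3, 7000),
    ("B", 1, 3200), ("B", 1, 3200), ("B", 1, 3200), ("B", 3, 7500),
]


def average_price(rows, key):
    """Average the prices after grouping rows with the given key function."""
    groups = defaultdict(list)
    for firm, beds, price in rows:
        groups[key(firm, beds)].append(price)
    return {group: mean(prices) for group, prices in groups.items()}


# Grouping by firm alone mixes bedroom counts together.
overall = average_price(listings, lambda firm, beds: firm)
# overall["A"] is 6000 and overall["B"] is 4275, so A looks pricier.

# Grouping by firm and bedroom count removes the inventory effect.
by_beds = average_price(listings, lambda firm, beds: (firm, beds))
# A is cheaper in both groups: 3000 vs 3200 and 7000 vs 7500.
```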

    Moving further into data exploration, I began to realize that this kind of data (comparison of firms) doesn't quite satisfy consumer demand. Platforms like Zillow and StreetEasy allow renters to filter their apartment search in a variety of ways. By contrast, the real estate companies listing apartments have less transparency into how their prices compare to their competitors'. Below, I've created a search capability for this dataset that allows the user to search by neighborhood, where the output shows the average price by number of bedrooms for each agency.
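    A neighborhood search of this kind could be sketched as below. This is an illustration rather than the tool itself: the `search_neighborhood` function and the row keys (`agency`, `neighborhood`, `beds`, `price`) are assumed names, and the listing count is an extra included here because inventory matters elsewhere in the analysis.

```python
from collections import defaultdict
from statistics import mean


def search_neighborhood(rows, neighborhood):
    """Return {(agency, beds): (average_price, listing_count)} for one neighborhood.

    `rows` is a list of dicts with cleaned numeric `beds` and `price` values.
    The neighborhood match ignores case and surrounding whitespace.
    """
    wanted = neighborhood.strip().lower()
    prices = defaultdict(list)
    for row in rows:
        if row["neighborhood"].strip().lower() == wanted:
            prices[(row["agency"], row["beds"])].append(row["price"])
    return {key: (mean(vals), len(vals)) for key, vals in sorted(prices.items())}
```

    Calling `search_neighborhood(rows, "Chelsea")` would return one entry per agency and bedroom count found in that neighborhood, which is enough to see where one agency prices above or below the others.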

    Although it's useful to visualize trends that exist on a macro level, the data becomes more useful as we get more granular, particularly at the neighborhood level. New York's housing market varies widely between areas, often changing between bordering neighborhoods. As such, firms interested in their market position, beyond the scope visible to the consumer, can look at which neighborhoods and apartment sizes give them a stronghold over their competitors.

Conclusion

    This tool can be developed into a comprehensive, real-time application for the sell-side. By creating a daily scraper, alongside a preprocessing framework and pipeline, we could build a growing database of unique listings for any number of real estate agencies. Increased transparency allows companies to assess their strongholds and work on their weaknesses, and I aim to develop this concept by providing a platform for sellers to do so.
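    One way the "growing database of unique listings" could work is sketched below with SQLite. This is a possible design, not something built for this project: the table layout and the `store_listings` helper are hypothetical, and treating the agency and address pair as the identity of a listing is an assumption that would break down if unit numbers are not part of the address.

```python
import sqlite3

# Hypothetical schema; UNIQUE (agency, address) is the assumed listing identity.
SCHEMA = """
CREATE TABLE IF NOT EXISTS listings (
    agency TEXT NOT NULL,
    address TEXT NOT NULL,
    neighborhood TEXT,
    beds INTEGER,
    price INTEGER,
    first_seen TEXT NOT NULL,
    UNIQUE (agency, address)
)
"""


def store_listings(conn, rows, scrape_date):
    """Insert listings that have not been seen before; return how many were new."""
    conn.execute(SCHEMA)
    before = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
    conn.executemany(
        # OR IGNORE skips rows that violate the UNIQUE constraint.
        "INSERT OR IGNORE INTO listings VALUES (?, ?, ?, ?, ?, ?)",
        [
            (r["agency"], r["address"], r["neighborhood"], r["beds"], r["price"], scrape_date)
            for r in rows
        ],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
    return after - before
```

    Running `store_listings` once per day on the cleaned scrape output would keep one row per listing along with the date it first appeared, so repeated scrapes add only new inventory.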

    Such information can be viewed from a corporate strategy perspective in many ways. For example, Compass' inventory in 1 bedroom and Studio apartments is highest, while their average price for 1 bedroom and Studio apartments is lowest. On the other hand, Corcoran has the highest average price for 2 and 3 bedroom apartments, but only the highest inventory for 2 bedrooms, which could be a reason to take a closer look at their 3 bedroom listings. The search portion of this tool further allows these agencies to take another look at their strengths and weaknesses in terms of inventory and average price. The analysis itself may seem trivial, but this information is not yet available for comparison on any of the major sites, which brings value to the organizations that want to use a tool like the one developed here.
