NYC Rental Market, Scraping and Data Analysis

Posted on Aug 21, 2019
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.




    NYC is constantly flooded with advertisements for apartments and condos across a diverse market. If you've had to scramble for an apartment in NYC, you know just how hectic the process is. Plenty of postings are available on well-known sites like Zillow, StreetEasy, NakedApartments and others that provide a platform for housing in New York. While these sites offer significant depth in their searches, their filters don't allow for certain deeper dives. Most potential renters want to find housing within their budget regardless of the agency. But what about those looking for high-end apartments who avoid websites like StreetEasy? Comparing prices across different agencies is a more arduous task.

    Corcoran, Douglas Elliman, and Compass are three of New York's biggest real estate firms and have dominated the market landscape for years. If you manage to get your head out of your phone on the subway, look up and you might just see one of their advertisements. Compass, in particular, has been acquiring other firms and has gained footing in the revenue rankings, not far behind Corcoran and Elliman. These firms, along with others at the top of the rankings, account for much of what's available in the NYC market and extend beyond it to places like the Hamptons and Florida. How does each of these agencies compare against the others?

Data Acquisition

    It would be tedious to open every StreetEasy or Zillow posting to find out which agency listed the apartment. Oftentimes this information isn't available, particularly for lesser-known agencies. With this in mind, I decided to scrape the websites for Corcoran, Douglas Elliman, and Compass directly, see how each one presents its rental listings, and collect the data needed for analysis.

    Despite each agency's unique web design, all three sites had the relevant information: address, neighborhood, number of bedrooms, number of baths, price, and occasionally features like square footage. All of this information is publicly available, provided you have a login for each site. As such, I used Selenium to systematically log in to each site, scrape the data of interest (only for Manhattan, Brooklyn and Queens), and write it to a CSV file for cleaning and preprocessing.
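A minimal sketch of the scrape-and-save step, under stated assumptions: the CSS selectors (`div.listing-card`, `.address`, and so on), the column names, and both function names are hypothetical placeholders, since each agency's real page structure is not described here. The login step is omitted. Only the Selenium calls themselves (`driver.get`, `find_elements`, `By.CSS_SELECTOR`, `.text`) are real API.

```python
import csv
import io

# Assumed column layout, based on the fields named in the text.
FIELDS = ["agency", "address", "neighborhood", "beds", "baths", "price", "sqft"]


def _first_text(card, by, selector):
    """Return the text of the first matching child element, or '' if absent."""
    found = card.find_elements(by, selector)
    return found[0].text.strip() if found else ""


def scrape_listing_cards(driver, url, agency):
    """Collect raw text fields from every listing card on one results page."""
    # Imported here so the CSV helper below also works without Selenium installed.
    from selenium.webdriver.common.by import By

    driver.get(url)
    rows = []
    # All selectors below are hypothetical; real ones differ per site.
    for card in driver.find_elements(By.CSS_SELECTOR, "div.listing-card"):
        rows.append({
            "agency": agency,
            "address": _first_text(card, By.CSS_SELECTOR, ".address"),
            "neighborhood": _first_text(card, By.CSS_SELECTOR, ".neighborhood"),
            "beds": _first_text(card, By.CSS_SELECTOR, ".beds"),
            "baths": _first_text(card, By.CSS_SELECTOR, ".baths"),
            "price": _first_text(card, By.CSS_SELECTOR, ".price"),
            "sqft": _first_text(card, By.CSS_SELECTOR, ".sqft"),  # rarely present
        })
    return rows


def write_listings(rows, handle):
    """Write scraped rows as CSV; fields missing from a row are left blank."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Usage with a made-up row, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_listings([{"agency": "Compass", "address": "1 Example St", "price": "$3,000"}], buffer)
csv_lines = buffer.getvalue().splitlines()
```

Keeping the raw strings at this stage, and deferring all parsing to the cleaning step, means a change in one site's formatting does not require re-scraping.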

Data Manipulation

    After scraping the data from each of the three websites, I needed to normalize it so that everything could be analyzed uniformly without losing important information from any agency. Here's how the data looked before normalization:

    As we can see, some of our metrics of interest are written in different formats, which complicates analysis because the same value can appear as several redundant categories. After cleaning the original data with Python, it is uniform and ready for analysis:
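The cleaning code itself is not shown in this post. As a rough illustration of this kind of normalization, here is a minimal sketch; the raw formats it handles (such as `$3,500/mo`, `Studio`, `2 BR`) and the function names are assumptions, not the actual formats found on the three sites.

```python
import re


def parse_price(raw):
    """Turn strings such as '$3,500/mo' or '4200' into an integer dollar amount.

    Returns None when the string contains no number (e.g. 'Call for price').
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return int(float(match.group().replace(",", "")))


def parse_beds(raw):
    """Map bedroom labels to an integer count, treating a studio as 0 bedrooms."""
    text = raw.strip().lower()
    if "studio" in text:
        return 0
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None
```

Mapping every variant onto one integer scale is what lets listings from different agencies land in the same group later on.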


    With the data normalized, we can begin to explore general trends across the agencies. Remember that these three agencies are the top three in all of New York based on revenue, so although we anticipate similar trends, we're looking for the nuances that separate them. What's the average price for each agency by borough? How does the average price differ between agencies by number of bedrooms?

    Exploratory data analysis shows trends we might expect (with some prior knowledge of NYC prices): average prices vary by borough and number of bedrooms more than they differ by firm. In the above bar charts, we seem to have some conflicting evidence: Douglas Elliman seems to have the most expensive pricing in Manhattan and Brooklyn, yet it is cheaper than Corcoran for every bedroom count except studios. Why?

    This is known as Simpson's Paradox, a phenomenon in which a trend that appears in aggregated data weakens or reverses once the data is split into groups. Digging deeper, we can see why this happens here by looking at firm inventory rather than average price.

    So why is the average price of Douglas Elliman more expensive in Manhattan and Brooklyn? In the above bar chart, we can see that Elliman has more 3, 4, and 5 bedroom listings than any other agency. Since price rises with each additional bedroom, Elliman's average price by borough is pulled up by its larger share of big apartments. Once prices are grouped by bedroom count, that inventory effect disappears, which is why Corcoran's average price by bedroom is higher than Elliman's.
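The effect is easy to reproduce with a toy example. The numbers below are made up for illustration and are not taken from the scraped dataset: firm A is cheaper than firm B at every bedroom count, yet its overall average is higher because most of its listings are 3 bedrooms.

```python
from collections import defaultdict
from statistics import mean

# Made-up listings: (firm, bedrooms, monthly_price)
listings = [
    ("A", 1, 3000),
    ("A", 3, 7000), ("A", 3, 7000), ("A", 3, 7000),
    ("B", 1, 3200), ("B", 1, 3200), ("B", 1, 3200),
    ("B", 3, 7500),
]


def average_price_by(rows, key):
    """Average the price (third field) of rows grouped by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {group: mean(prices) for group, prices in groups.items()}


# Aggregated view: firm A looks more expensive (6000 vs 4275).
overall = average_price_by(listings, key=lambda row: row[0])

# Grouped view: firm B is more expensive at every bedroom count.
by_bedroom = average_price_by(listings, key=lambda row: (row[0], row[1]))
```

The reversal comes entirely from the inventory mix: three of firm A's four listings are 3 bedrooms, while three of firm B's four are 1 bedrooms.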

    Moving further into data exploration, I began to realize that this kind of data (a comparison of firms) doesn't quite address a consumer need. Platforms like Zillow and StreetEasy already allow renters to filter their apartment search in a variety of ways. By contrast, the real estate companies listing apartments have less transparency into how their prices compare to their competitors'. Below, I've created a search capability for this dataset that lets the user search by neighborhood; the output shows average prices by number of bedrooms for each agency.
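A lookup of this kind could be sketched as follows. This is not the app's actual code: the record layout (dictionaries with `agency`, `neighborhood`, `beds`, and `price` keys) and the function name are assumptions, and the sample rows are made up.

```python
from collections import defaultdict
from statistics import mean


def search_neighborhood(listings, neighborhood):
    """Summarize one neighborhood: listing count and average price
    for each (agency, bedroom count) pair. Matching is case-insensitive."""
    wanted = neighborhood.strip().lower()
    groups = defaultdict(list)
    for row in listings:
        if row["neighborhood"].strip().lower() == wanted:
            groups[(row["agency"], row["beds"])].append(row["price"])
    return {
        key: {"listings": len(prices), "avg_price": mean(prices)}
        for key, prices in sorted(groups.items())
    }


# Made-up sample rows for illustration only.
sample = [
    {"agency": "Compass", "neighborhood": "Chelsea", "beds": 1, "price": 4000},
    {"agency": "Compass", "neighborhood": "Chelsea", "beds": 1, "price": 4400},
    {"agency": "Corcoran", "neighborhood": "Chelsea", "beds": 1, "price": 4500},
    {"agency": "Corcoran", "neighborhood": "Astoria", "beds": 2, "price": 3100},
]
result = search_neighborhood(sample, "chelsea")
```

Returning the listing count alongside the average matters here, because an average built on one or two listings says little about an agency's position in that neighborhood.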

    Although it's useful to visualize trends on a macro level, the data becomes more useful as it gets more granular, particularly at the neighborhood level. New York's housing market varies widely between areas, often changing between bordering neighborhoods. As such, firms interested in their market position, beyond what is visible to the consumer, can learn in which neighborhoods, and for which apartment sizes, they hold an advantage over their competitors.


    This tool can be developed into a comprehensive, real-time application for the sell-side. By creating a daily scraper, alongside a preprocessing framework and pipeline, we could create a growing database of unique listings for any number of real estate agencies. Increased transparency allows companies to assess their strongholds and work on their weaknesses, and I aim to build on this concept by providing a platform for sellers to do so.
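One way such a growing store of unique listings might work is sketched below with SQLite from the Python standard library. The table layout, the function names, and the choice of (agency, address) as the uniqueness key are all assumptions; a real pipeline would likely need a unit number or listing ID in the key as well.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the listings database. Defaults to an in-memory store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listings (
               agency     TEXT NOT NULL,
               address    TEXT NOT NULL,
               beds       INTEGER,
               price      INTEGER,
               first_seen TEXT NOT NULL,
               PRIMARY KEY (agency, address)
           )"""
    )
    return conn


def add_daily_scrape(conn, rows, scrape_date):
    """Insert one day's scrape, skipping listings already stored.

    Returns the number of listings that were new on this date.
    """
    before = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO listings VALUES (?, ?, ?, ?, ?)",
        [(r["agency"], r["address"], r["beds"], r["price"], scrape_date) for r in rows],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
    return after - before


# Usage with made-up rows: the second day repeats one listing and adds one.
store = open_store()
day_one = [
    {"agency": "Compass", "address": "1 Example St", "beds": 1, "price": 3000},
    {"agency": "Corcoran", "address": "2 Sample Ave", "beds": 2, "price": 5000},
]
day_two = [
    {"agency": "Compass", "address": "1 Example St", "beds": 1, "price": 3000},
    {"agency": "Compass", "address": "3 Placeholder Pl", "beds": 0, "price": 2500},
]
new_on_day_one = add_daily_scrape(store, day_one, "2019-08-20")
new_on_day_two = add_daily_scrape(store, day_two, "2019-08-21")
```

Because repeated listings are ignored rather than overwritten, `first_seen` keeps the date a listing originally appeared, which would also allow tracking how long listings stay on the market.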

    Such information can inform corporate strategy in many ways. For example, Compass has the highest inventory of studio and 1 bedroom apartments, while its average price for those apartments is the lowest. On the other hand, Corcoran has the highest average price for 2 and 3 bedroom apartments but the highest inventory only for 2 bedrooms, which could prompt a closer look at its 3 bedroom listings. The search portion of this tool lets these agencies examine their strengths and weaknesses in terms of inventory and average price. The analysis itself may seem trivial, but this information is not yet available for comparison on any of the major sites, which gives the tool value for organizations that want to use it.
