NYC Rental Market, Scraping and Data Analysis
The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction
NYC is constantly flooded with advertisements for apartments and condos across a diverse market. If you've had to scramble for an apartment in NYC, you know just how hectic the process is. Plenty of postings are available on well-known sites like Zillow, StreetEasy, NakedApartments, and others that seek to provide a platform for housing in New York. While these sites offer significant depth in their searches, their filters don't support every kind of deeper dive. Most potential renters want to find housing within their budget regardless of the agency. But what about those looking for high-end apartments who avoid websites like StreetEasy? For them, comparing prices among different agencies is a more arduous task.
Corcoran, Douglas Elliman, and Compass are three of New York's biggest real estate firms and have dominated the market landscape for years. If you manage to get your head out of your phone on the subway, look up and you might just see one of their advertisements. Compass, in particular, has been acquiring other firms and gaining ground in the revenue rankings, not far behind Corcoran and Elliman. These firms, among others at the top of the rankings, account for much of what's available in the NYC market and extend beyond it to places like the Hamptons and Florida. How can each of these agencies compare itself against the others?
Data Acquisition
It would be a tedious task to go into every StreetEasy or Zillow posting to find out which agency listed the apartment. Oftentimes, this information isn't available, particularly for lesser-known agencies. With this in mind, I decided to scrape the websites of Corcoran, Douglas Elliman, and Compass to see how each presents its rental listings, with the ultimate goal of collecting the data needed for analysis.
Regardless of each agency's unique web design, all sites had the relevant information: address, neighborhood, number of bedrooms, number of baths, price, and occasionally features like square footage (though rarely). All of this information is publicly available, provided you have a login to each site. As such, I used Selenium to systematically log in to the relevant sites, scrape the data of interest (only for Manhattan, Brooklyn, and Queens), and write it to a CSV file to begin data cleaning and preprocessing for analysis.
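The scrape-and-write step described above could be sketched roughly as follows. This is a minimal illustration, not the scraper actually used for this project: the CSS selectors, the column list in `FIELDS`, and the function names are all assumptions, and the login step is left out because it differs for every site.

```python
import csv

# Assumed column layout; the real scrape may have used different names.
FIELDS = ["agency", "address", "neighborhood", "borough",
          "beds", "baths", "price", "sqft"]


def scrape_cards(driver, url, card_selector, field_selectors):
    """Collect one dict per listing card on a page.

    `driver` is an already logged-in Selenium WebDriver. `card_selector`
    and `field_selectors` are hypothetical CSS selectors that would have
    to be looked up on each agency's site.
    """
    # Imported here so the CSV helper below works without Selenium installed.
    from selenium.webdriver.common.by import By

    driver.get(url)
    rows = []
    for card in driver.find_elements(By.CSS_SELECTOR, card_selector):
        row = {}
        for field, selector in field_selectors.items():
            matches = card.find_elements(By.CSS_SELECTOR, selector)
            # Fields such as square footage are often missing, so default to "".
            row[field] = matches[0].text.strip() if matches else ""
        rows.append(row)
    return rows


def write_rows(rows, path):
    """Write scraped rows to a CSV file, leaving absent fields blank."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the raw text in the CSV and deferring all cleaning to a later step means a site redesign only breaks the selectors, not the analysis.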
Data Manipulation
After scraping the data from each of the three websites, I needed to normalize it across the three agencies so everything could be analyzed uniformly without losing any important information. Here's how the data looked before normalization:

As we can see, some of the values for our metrics of interest are written in different formats, which is a problem because the analysis would then return redundant measures for what is really the same value. After using Python to clean the original data, the data is uniform and ready for analysis:
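To give a sense of what this cleaning involves, here is a minimal sketch of normalizing price and bedroom fields. The raw formats it handles (such as `$3,500/mo`, `Studio`, or `2 BR`) are assumptions chosen for illustration, since the actual formats varied by site, and the function names are hypothetical.

```python
import re


def parse_price(raw):
    """Extract a monthly rent as an int from strings such as '$3,500/mo'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None  # e.g. "Price on request"
    return int(float(match.group().replace(",", "")))


def parse_beds(raw):
    """Map 'Studio' to 0 and strings such as '2 BR' or '2 Beds' to 2."""
    text = raw.strip().lower()
    if "studio" in text:
        return 0
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def normalize(row):
    """Return a copy of one scraped row with uniform price and bed fields."""
    clean = dict(row)
    clean["price"] = parse_price(row["price"])
    clean["beds"] = parse_beds(row["beds"])
    clean["neighborhood"] = row["neighborhood"].strip()
    return clean
```

Treating a studio as zero bedrooms keeps the column numeric, so it can be grouped and sorted alongside the other bedroom counts.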

Analysis
With the data normalized, we can begin to explore what general trends between the different agencies might look like. We have to remember that these three agencies are the top three in all of New York based on revenue, so though we anticipate similarities in trends between the three firms, we're looking to explore the nuances that might separate them. What's the average price of each agency by borough? What about the average price difference for agencies by number of beds?
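Both questions come down to grouping the same rows by different keys. A minimal sketch using only the standard library is below; the rows are invented placeholders for two hypothetical firms, not values from the scraped dataset, and `average_price` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def average_price(rows, *keys):
    """Average the 'price' field for every combination of the given keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["price"])
    return {group: mean(prices) for group, prices in groups.items()}


# Invented rows for illustration only.
rows = [
    {"agency": "Firm A", "borough": "Manhattan", "beds": 1, "price": 4000},
    {"agency": "Firm A", "borough": "Brooklyn", "beds": 1, "price": 3000},
    {"agency": "Firm B", "borough": "Manhattan", "beds": 2, "price": 6000},
    {"agency": "Firm B", "borough": "Manhattan", "beds": 2, "price": 7000},
]

by_borough = average_price(rows, "agency", "borough")
by_beds = average_price(rows, "agency", "beds")
```

The same listings feed both summaries, which is why the two views can disagree about which firm looks more expensive.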

Exploratory data analysis shows trends that we might expect to see (with some prior knowledge of NYC prices): average prices vary by borough and number of bedrooms more than they differ by firm. In the above bar charts, we seem to have some conflicting evidence: Douglas Elliman seems to have the most expensive pricing in Manhattan and Brooklyn, yet it seems to be cheaper than Corcoran for every bedroom count except studios. Why?
This is an instance of Simpson's Paradox, a phenomenon in which a trend that appears in aggregated data weakens or reverses once the data is split into groups. Digging deeper, we can see why this might be the case by looking at firm inventory rather than the average price.

So why is Douglas Elliman's average price higher in Manhattan and Brooklyn? In the above bar chart, we can see that Elliman has more 3, 4, and 5 bedroom listings than any other agency. Since each additional bedroom raises the price, this larger share of big apartments inflates Elliman's average price by borough. Once prices are grouped by bedroom count, that mix effect disappears, which is why Corcoran's average price by bedroom is higher than Elliman's.
Moving further into data exploration, I began to realize that this kind of data (a comparison of firms) doesn't quite satisfy consumer demand. Platforms like Zillow and StreetEasy allow renters to filter their apartment search in a variety of ways. By contrast, the real estate companies listing apartments have less transparency into how their prices compare to those of their competitors. Below, I've created a search capability for this dataset that lets the user search by neighborhood, where the output shows average prices by number of bedrooms for each agency.
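The effect is easy to reproduce with a small worked example. The numbers below are invented for two hypothetical firms and are not taken from the scraped data: Firm A is cheaper than Firm B at every bedroom count, yet its overall average is higher because most of its inventory is large apartments.

```python
# Invented numbers for two hypothetical firms; not from the scraped data.
# Average monthly rent by bedroom count.
avg_price = {
    "Firm A": {1: 3000, 4: 9000},
    "Firm B": {1: 3200, 4: 9500},
}
# Number of listings by bedroom count.
inventory = {
    "Firm A": {1: 20, 4: 80},
    "Firm B": {1: 80, 4: 20},
}


def overall_average(firm):
    """Inventory-weighted average rent across all bedroom counts."""
    total_rent = sum(avg_price[firm][beds] * count
                     for beds, count in inventory[firm].items())
    return total_rent / sum(inventory[firm].values())


for firm in avg_price:
    print(firm, overall_average(firm))
# Firm A 7800.0
# Firm B 4460.0
```

Firm A's overall figure is (20 × 3000 + 80 × 9000) / 100 = 7800, while Firm B's is (80 × 3200 + 20 × 9500) / 100 = 4460, even though Firm B charges more for both apartment sizes.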
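A neighborhood search of this kind could be sketched as a filter followed by a per-agency summary. This is an illustration under assumed field names (`neighborhood`, `agency`, `beds`, `price`), not the code behind the web app, and `search_neighborhood` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean


def search_neighborhood(rows, neighborhood):
    """Summarize listings in one neighborhood by agency and bedroom count."""
    wanted = neighborhood.strip().lower()
    prices = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # Case-insensitive match so "chelsea" and "Chelsea" both work.
        if row["neighborhood"].strip().lower() == wanted:
            prices[row["agency"]][row["beds"]].append(row["price"])
    return {
        agency: {
            beds: {"listings": len(found), "avg_price": mean(found)}
            for beds, found in sorted(by_beds.items())
        }
        for agency, by_beds in prices.items()
    }
```

Returning the listing count next to each average makes it clear when a price is based on only one or two apartments.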

Although it's useful to visualize trends that exist on a macro level, the data becomes more useful as we get more granular, and this holds particularly true at the neighborhood level. New York's housing market varies widely from one area to the next, often changing between bordering neighborhoods. As such, firms interested in their market position, beyond what is visible to the consumer, can use this view to understand in which neighborhoods, and for which apartment sizes, they hold an advantage over their competitors.
Conclusion
This tool can be developed into a comprehensive, real-time application for the sell-side. By creating a daily scraper, alongside a preprocessing framework and pipeline, we could build a growing database of unique listings for any number of real estate agencies. Increased transparency allows companies to assess their strongholds and work on their weaknesses, and I aim to build on this concept by providing a platform for sellers to do so.
Such information can be viewed from a corporate strategy perspective in many ways. For example, Compass' inventory in 1 bedroom and Studio apartments is the highest, while their average price for 1 bedroom and Studio apartments is the lowest. On the other hand, Corcoran has the highest average price for 2 and 3 bedroom apartments, but the highest inventory only for 2 bedrooms, which could be a reason to take a closer look at their 3 bedroom listings. The search portion of this tool further allows these agencies to take another look at their strengths and weaknesses in terms of inventory and average price. The analysis itself may seem trivial, but this information is not yet available for comparison on any of the major sites, which brings value to organizations that want to use a tool like the one developed here.
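One way the "growing database of unique listings" could work is to let the database itself reject duplicates on each daily run. The sketch below uses SQLite from the standard library; the table layout and the choice of (agency, address) as the uniqueness key are assumptions for illustration, and `store_listings` is a hypothetical name.

```python
import sqlite3
from datetime import date

# Assumed schema: a listing is unique per agency and address.
SCHEMA = """
CREATE TABLE IF NOT EXISTS listings (
    agency       TEXT NOT NULL,
    address      TEXT NOT NULL,
    neighborhood TEXT,
    beds         INTEGER,
    price        INTEGER,
    first_seen   TEXT NOT NULL,
    PRIMARY KEY (agency, address)
)
"""


def store_listings(conn, rows, seen_on=None):
    """Insert new listings, skip ones already stored, return how many were added."""
    seen_on = seen_on or date.today().isoformat()
    conn.execute(SCHEMA)
    before = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO listings "
        "VALUES (:agency, :address, :neighborhood, :beds, :price, :first_seen)",
        [{**row, "first_seen": seen_on} for row in rows],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
    return after - before
```

Because repeated listings are ignored rather than overwritten, `first_seen` keeps the date a listing first appeared, which would also allow tracking how long apartments stay on the market.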