A Feature Data Analysis of Brooklyn Apartment Prices
Introduction
Shelter is one of the most basic necessities for all humans. For those of us in a city who can't afford to purchase our own home, we meet this need by renting, usually an apartment. Every year or two we must again make the decision to extend our lease or strike out to find someplace that better meets our needs. But can we find an apartment we like that we can afford and what wants are we willing to sacrifice to meet our budget? In this data study project I set out to find the expected costs of the different features of an apartment in the Brooklyn market, as that's where I live and will be looking in the future.
Data Collection
For this project, I set out to gather and analyze as much data as I could on apartments in Brooklyn, NY. I used the Scrapy package in Python to scrape the data I needed from the web. I chose to scrape trulia.com because it's the site I've used the most, its information and layout are consistent, and reaching the data doesn't require navigating JavaScript.
On Trulia I was able to consistently get the price, neighborhood, number of bedrooms, number of bathrooms, relative crime rate, age on Trulia, and a description. I was able to pull over 2,000 complete observations.
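A minimal sketch of how scraped listings might be reduced to complete observations, assuming each scraped item arrives as a dictionary of raw strings. The field names, the price format, and the sample records are all hypothetical and are not taken from the project's code.

```python
import re

# Hypothetical field names for the seven values collected per listing.
REQUIRED_FIELDS = ("price", "neighborhood", "bedrooms", "bathrooms",
                   "crime", "age", "description")


def parse_price(raw):
    """Turn a string such as '$2,450/mo' into the integer 2450, or None."""
    digits = re.sub(r"[^\d]", "", raw or "")
    return int(digits) if digits else None


def is_present(value):
    """A field counts as present if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def complete_observations(items):
    """Keep only listings where every required field is present."""
    kept = []
    for item in items:
        if not all(is_present(item.get(field)) for field in REQUIRED_FIELDS):
            continue
        price = parse_price(item["price"])
        if price is None:
            continue  # price text contained no digits
        cleaned = dict(item)
        cleaned["price"] = price
        kept.append(cleaned)
    return kept


# Made-up sample records for illustration only.
sample = [
    {"price": "$2,450/mo", "neighborhood": "Bushwick", "bedrooms": "2",
     "bathrooms": "1", "crime": "low", "age": "12",
     "description": "Hardwood floors, stainless steel appliances."},
    {"price": "$1,900/mo", "neighborhood": "Flatbush", "bedrooms": "1",
     "bathrooms": "1", "crime": "", "age": "3", "description": "Sunny unit."},
]
rows = complete_observations(sample)
```

Here the second record is dropped because its crime field is blank, which is the sense in which an observation is treated as incomplete in this sketch.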
Data Analysis
Word Cloud
I started off with a word cloud of the words in the descriptions to get an idea of what is most frequently used.
Obviously the word "apartment" is used frequently in apartment listings. After that we have amenities like stainless steel appliances and hardwood floors. We also see phrases like "hot water", which probably indicates that it is included in the rent, and "washer dryer", which likely indicates that the unit has that luxury most only dream of: in-unit laundry facilities.
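The counts behind a word cloud can be sketched with the standard library alone. The example below tallies single words and adjacent word pairs (so that phrases like "hot water" show up as a unit). The stop-word list and the sample descriptions are made up for illustration, and this is not necessarily how the original word cloud was produced.

```python
import re
from collections import Counter

# A few generic words to ignore; a real stop-word list would be longer.
STOP_WORDS = {"a", "an", "and", "the", "in", "is", "of", "with", "to", "this"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def term_counts(descriptions):
    """Count single words and adjacent word pairs across all descriptions.

    Pairs are formed after stop words are removed, so a pair can bridge
    a dropped word (e.g. 'apartment with hardwood' -> 'apartment hardwood').
    """
    words, pairs = Counter(), Counter()
    for text in descriptions:
        tokens = tokenize(text)
        words.update(tokens)
        pairs.update(" ".join(p) for p in zip(tokens, tokens[1:]))
    return words, pairs


# Made-up descriptions for illustration only.
descriptions = [
    "Sunny apartment with hardwood floors and washer dryer in unit.",
    "Renovated apartment, stainless steel appliances, heat and hot water included.",
    "Spacious apartment with hardwood floors; hot water included.",
]
words, pairs = term_counts(descriptions)
print(words.most_common(3))
print(pairs.most_common(3))
```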
Pricing
To understand and break down the cost of an apartment, I regressed the price on number of bedrooms, number of bathrooms, age on Trulia, and dummy variables for levels of crime.
variable | coef | std err | t | P>|t| | [0.025 | 0.975] |
const | 236.9014 | 266.013 | 0.891 | 0.373 | -284.793 | 758.596 |
age | 1.5520 | 0.674 | 2.301 | 0.021 | 0.229 | 2.875 |
bath | 711.1585 | 41.546 | 17.117 | 0.000 | 629.679 | 792.638 |
bedrooms | 184.6081 | 20.374 | 9.061 | 0.000 | 144.652 | 224.564 |
lowestCrime | 161.9924 | 79.121 | 2.047 | 0.041 | 6.823 | 317.162 |
lowCrime | 11.8988 | 68.596 | 0.173 | 0.862 | -122.628 | 146.426 |
highCrime | 34.1149 | 73.176 | 0.466 | 0.641 | -109.395 | 177.625 |
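As a reading aid for the table, the t column is the coefficient divided by its standard error, and the last two columns are a 95% confidence interval. The sketch below recomputes both for a few rows, using 1.96 as a large-sample stand-in for the exact t critical value, so the interval endpoints match the table only approximately.

```python
# Coefficients and standard errors copied from the regression table.
estimates = {
    "bath": (711.1585, 41.546),
    "bedrooms": (184.6081, 20.374),
    "lowestCrime": (161.9924, 79.121),
    "lowCrime": (11.8988, 68.596),
}

Z_95 = 1.96  # assumption: normal approximation to the t critical value


def summarize(coef, std_err):
    """Return the t statistic and an approximate 95% confidence interval."""
    t_stat = coef / std_err
    half_width = Z_95 * std_err
    return t_stat, (coef - half_width, coef + half_width)


for name, (coef, std_err) in estimates.items():
    t_stat, (low, high) = summarize(coef, std_err)
    significant = abs(t_stat) > Z_95  # equivalent to the interval excluding zero
    print(f"{name:12s} t={t_stat:7.3f}  CI=({low:8.2f}, {high:8.2f})  "
          f"significant at 5%: {significant}")
```

Under this approximation, bath, bedrooms, and lowestCrime are distinguishable from zero at the 5% level, while lowCrime is not, which agrees with the P>|t| column.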
Neighborhoods and Cost
I also included dummy variables for neighborhoods in the regression. Each coefficient shows the additional cost of living in that neighborhood relative to Brownsville, the baseline neighborhood, for which all of the neighborhood dummies are zero. I chose Brownsville as the baseline because it had the lowest mean price of any neighborhood. Only 27 of the original 49 neighborhoods showed significance. Among those, the price increase ranged from $544 in Sheepshead Bay to $2,846 in Vinegar Hill.
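To make the baseline idea concrete, here is a small sketch of dummy encoding with a chosen reference category. The function name and the four sample listings are hypothetical; the point is only that Brownsville gets no column of its own and is represented by a row of zeros.

```python
def dummy_encode(values, baseline):
    """Build one 0/1 column per category except the baseline.

    Rows belonging to the baseline category are all zeros, so every
    fitted coefficient is read as a difference from that category.
    """
    levels = sorted(set(values) - {baseline})
    rows = [[1 if value == level else 0 for level in levels]
            for value in values]
    return levels, rows


# Made-up listings for illustration only.
neighborhoods = ["Brownsville", "Park Slope", "Bushwick", "Park Slope"]
levels, rows = dummy_encode(neighborhoods, baseline="Brownsville")

for neighborhood, row in zip(neighborhoods, rows):
    print(f"{neighborhood:12s} {dict(zip(levels, row))}")
```

Dropping one category is also what keeps the dummy columns from being perfectly collinear with the regression's constant term.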
The additional cost of the significant neighborhoods and their standard error are below:
neighborhood | coef | std err | neighborhood | coef | std err | neighborhood | coef | std err |
Bay Ridge & Ft Hamilton | 566.43 | 273.64 | Crown Heights | 1048.15 | 272.43 | Park Slope | 1572.01 | 267.98 |
Bed-Stuy | 709.26 | 259.1 | Dtwn Bklyn | 2076.51 | 277.79 | Prospect Heights | 1195.4 | 289.27 |
Boerum Hill | 1692.57 | 290.77 | Flatbush - Ditmas Park | 689.75 | 270.47 | Lefferts Gdns | 1048.79 | 282.1 |
Brighton | 870.33 | 272.18 | Fort Greene | 1673 | 297.57 | Prospect Park S | 975.94 | 412.07 |
Bushwick | 652.52 | 261.4 | Gowanus | 1572.53 | 294.3 | Red Hook | 798.28 | 343.91 |
Carroll Gardens | 1743.02 | 289.89 | Greenpoint | 1670.9 | 264.09 | Sheepshead Bay | 543.97 | 267.65 |
Clinton Hill | 1225.16 | 280.11 | Greenwood | 830.49 | 288.9 | Vinegar Hill | 2846.16 | 315.97 |
Cobble Hill | 1291.15 | 326.7 | Kensington & Parkville | 725.34 | 298.14 | Williamsburg | 1477.84 | 264.79 |
Coney Island | 732.62 | 330.95 | Marine Park | 658.9 | 323.55 | Windsor Terrace | 983.85 | 317.43 |
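Putting the two tables together, an expected rent can be sketched as the constant plus the relevant coefficients. This assumes both tables come from the same fitted regression and that the omitted crime level is the highest-crime category; only a handful of neighborhoods are copied in, and the function and dictionary names are hypothetical. The units of the age variable are not stated in the post, so the example leaves it at zero.

```python
CONST = 236.9014
SLOPES = {"age": 1.5520, "bath": 711.1585, "bedrooms": 184.6081}

# Crime dummies; "highest" is assumed to be the omitted baseline level.
CRIME = {"lowest": 161.9924, "low": 11.8988, "high": 34.1149, "highest": 0.0}

# A subset of the neighborhood coefficients; Brownsville is the baseline.
NEIGHBORHOOD = {
    "Brownsville": 0.0,
    "Bushwick": 652.52,
    "Williamsburg": 1477.84,
    "Park Slope": 1572.01,
    "Vinegar Hill": 2846.16,
}


def expected_rent(bedrooms, bath, crime, neighborhood, age=0):
    """Point estimate of monthly rent from the reported coefficients."""
    return (CONST
            + SLOPES["bedrooms"] * bedrooms
            + SLOPES["bath"] * bath
            + SLOPES["age"] * age
            + CRIME[crime]
            + NEIGHBORHOOD[neighborhood])


estimate = expected_rent(bedrooms=2, bath=1, crime="low", neighborhood="Bushwick")
print(f"Expected rent: ${estimate:,.0f}")
```

Note that this uses every coefficient as reported, including the crime levels that were not statistically significant, and it gives a point estimate only, with no measure of uncertainty.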
Conclusion
On average across Brooklyn, an additional bedroom raises the monthly rent by approximately $185, while an additional bathroom adds about $711. Both effects have the sign we would expect. Additionally, living in a lowest-crime area costs about $162 more a month than living in the highest-crime areas.
While this analysis is a good start, there is more that can be done in this area. First, a full analysis of the descriptions could yield additional features to include in the regression, such as laundry facilities, included utilities, and appliances. A reliable source for the square footage of each apartment would also be useful. Although some of its effect may already be captured by the bedroom and bathroom variables, it would likely still explain much of the remaining variation in price.
If I had more time on this project, a per neighborhood analysis would also likely prove very helpful. As there are around 50 distinct neighborhoods in Brooklyn, it creates too much information for a presentation like this. However, an app where a user could choose the neighborhood they are interested in would allow a filter to cut down the noise and give the user the information they need. This could be enhanced by allowing the user to pick the features they are looking for, e.g. number of bedrooms, etc., and give them an expected price for the particular neighborhood, so they know what to expect and look for.
The relevant code and the accompanying slide show are available in the project's GitHub repository.