Data Comparison Between NY & LA

Kweku Ulzen

Posted on Feb 5, 2018

The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Introduction:

As an avid traveler, I have always been interested in discovering what makes a city unique. Data tools informing travelers of unique landmarks and activities in the places to which they venture have been ubiquitous for ages. While I appreciate the different elements that make a city unique, I also have grown to understand that there are certain types of activities I would like to engage in when I visit a new place.

For my project, I set out to visualize demographic and quality-of-life data for two cities and compare them at the neighborhood level. If you have ever found yourself traveling to a new city, looking for an area similar to one you knew well, you may have asked some of the questions that drove me to work on this project.

Say you love curry and have decided that in your hometown of New York, Flushing has all of the curry houses you love. Which neighborhoods in Los Angeles have similar restaurants that are also highly rated by their patrons? Or say I were moving to Toronto from Sao Paulo while still enrolled in English courses. I might look for a neighborhood with a large native Portuguese-speaking population to help me settle into my new environment. How does that neighborhood compare to my favorites in Sao Paulo in terms of available green space and public transportation quality?

Goal

My goal was to build a tool that could take these questions and return an answer quickly for a user. This tool would find congruence between any two data points at the neighborhood level. I chose to test this functionality using data from New York and Los Angeles--a city that I know very well and one where I can't find the airport without a map, respectively.

Los Angeles, California, USA

New York, New York, USA

Methodology:

My application sources US Census data for demographic and quality-of-life information. I augmented this data set with data from WalkScore, a website that rates a neighborhood's walkability and quality of transportation options. The rating is derived from a weighted analysis of the requested area's features and is not a relative index. Thus one neighborhood's WalkScore does not depend on another neighborhood's score, although it may be affected when services in a nearby area are within reach.

I acquired my map polygons from public GeoJSON files, which posed a bit of a challenge in data manipulation. As the project was wrapping up, I found a more effective strategy for manipulating JSON data in R, and I discuss later how I would incorporate it if the project were to continue.

Data

The specific categories in my question would be tough to answer without web scraping skills, which I am acquiring in a later module of the bootcamp. To accommodate my knowledge gap, I chose a smaller set of data to run the project as a proof of concept. My data fields for this project are:

Neighborhood Name
City
Neighborhood Population
Median Household Income
Average Household Size
Violent Crime Rate (per 1,000)
Property Crime Rate (per 1,000)
Median Educational Attainment
Median Age
US Census Racial Categories
- White
- Black
- Asian
- Hispanic (non-race, ethnic)
Percent Foreign Born
Data from WalkScore.com
- WalkScore
- TransitScore
- BikeScore

The data for Los Angeles was simple to acquire, as the Los Angeles Times began mapping the city and publishing these statistics at neighborhood granularity in 2009. Collecting data by neighborhood in New York posed a greater, more time-consuming challenge: the census collects data at the census tract level, and tracts are drawn independently of neighborhoods. A tract generally contains a portion of a single neighborhood or portions of several.

As an alternative, I found a data set prepared by the Furman Center, which applied census data to the combined neighborhoods defined by the City of New York. This gave me New York data at the same level as the data prepared by the LA Times. At this stage, I decided that I would be better served to test functionality on demographic information and WalkScore's ratings, which use data types similar to those in my original questions. I won't find out about restaurants in this proof of concept, but maybe I can find out which neighborhoods have young immigrant populations in each city, then match them.

Cities

To try and find a good representation of neighborhoods, I chose 20 per city: 5 which I anecdotally knew to be affluent, 5 which I anecdotally knew to be under-served, and 10 totally at random. The neighborhoods in this demo are:

New York
- Flushing/Whitestone
- Central Harlem
- Upper East Side
- Upper West Side
- Greenpoint/Williamsburg
- Fort Greene/Brooklyn Heights
- Coney Island
- St. George/Stapleton
- Elmhurst/Corona
- Morrisania/Crotona
- Hillcrest/Fresh Meadows
- Astoria
- Brownsville
- Crown Heights/Prospect Heights
- Clinton/Chelsea
- Lower East Side/Chinatown
- Greenwich Village/SoHo
- Washington Heights/Inwood
- Riverdale/Fieldston
- Bushwick

Los Angeles
- Hollywood
- Studio City
- Beverly Hills
- Brentwood
- Culver City
- Baldwin Hills/Crenshaw
- Silver Lake
- Echo Park
- Van Nuys
- Santa Monica
- Venice
- Chatsworth
- Florence
- Vermont-Slauson
- Watts
- East Los Angeles
- Historic South Central
- Downtown
- Eagle Rock
- Koreatown

The next steps after acquiring the data are to manipulate the information to provide answers to my new burning questions!

R Shiny Data Application:

I created a dashboard application in R Shiny to visualize neighborhood comparisons and to allow for user input. The dashboard consists of three sections:

Map:
a map created with the leaflet package for R, using GeoJSON files for the neighborhood boundaries. I created a choropleth map which separated neighborhood rankings by percentile into five groups (0-20%, 21-40%, 41-60%, 61-80%, 81-100%). Each percentile group was shown in the color defined in the map legend. This interface allowed a user to pick a statistic and see how the neighborhoods in each city rank on it. Hovering over a neighborhood polygon pops up the neighborhood name and statistic.
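
The percentile grouping described above can be sketched in base R. This is a minimal sketch rather than the application's actual code: the function name percentile_group and the sample values are hypothetical, and it assumes a percentile is simply a neighborhood's rank within its city divided by the number of neighborhoods.

```r
# Hypothetical helper: assign each neighborhood to one of five percentile groups.
# Assumption: percentile = 100 * rank / n within a single city; ties share the average rank.
percentile_group <- function(x) {
  pct <- 100 * rank(x, ties.method = "average") / length(x)
  cut(pct,
      breaks = c(0, 20, 40, 60, 80, 100),
      labels = c("0-20%", "21-40%", "41-60%", "61-80%", "81-100%"),
      include.lowest = TRUE)
}

# Made-up statistic for ten neighborhoods in one city
stat <- c(12, 45, 7, 88, 53, 31, 64, 95, 20, 76)
groups <- percentile_group(stat)
table(groups)
```

The resulting factor could then feed a color palette for the choropleth, with one color per group.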

Matchmaker:
the key interface for answering the questions stated earlier. I created a function to find the three nearest comparisons in percentile by taking the minimum absolute difference between a neighborhood's rank and the ranks of neighborhoods in another city. It also uses an observe event to ensure that the neighborhoods searched match the city specified in the prompt.
- The function takes a value "z" (the chosen neighborhood's percentile rank) and subtracts from it the comparison city's neighborhood percentile rankings "a." It looks for the three closest percentile points to find "matches." For example, a user who would like to know the neighborhoods most similar to Brentwood, Los Angeles, could search either New York or the rest of Los Angeles. The function is shown below:

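matchmaker = function(a, z){
  # a: percentile ranks of the candidate neighborhoods; z: rank to match
  # First match: index of the smallest absolute difference
  b = which.min(abs(z - a))
  mymatch = c(b)
  # Second match: search with b removed, then shift back to an index into a
  c2 = which.min(abs(z - a[-b]))
  if (c2 >= b){
    c2 = c2 + 1
  }
  mymatch = c(mymatch, c2)
  # Third match: search with both removed, then shift once for each
  # removed index at or below the result, lower index first
  d = which.min(abs(z - a[c(-b, -c2)]))
  if (d >= min(b, c2)){
    d = d + 1
  }
  if (d >= max(b, c2)){
    d = d + 1
  }
  mymatch = c(mymatch, d)
  mymatch
}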
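
The same three-nearest search can also be written more compactly. The following is a sketch of an alternative, not the function used in the application: the name nearest_matches, its parameter n, and the sample ranks are hypothetical. It relies on order(), which returns indices sorted by distance, so no index shifting is needed.

```r
# Hypothetical alternative to the step-by-step search:
# order() sorts indices by absolute distance, so the first n are the nearest.
nearest_matches <- function(a, z, n = 3) {
  head(order(abs(z - a)), n)
}

# Made-up percentile ranks for six neighborhoods in the comparison city
ranks <- c(5, 40, 62, 58, 90, 75)
nearest_matches(ranks, z = 60)
```

Ties are broken by position, so when two neighborhoods are equally close the one listed first is returned first.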

Insights: I had additional questions upon seeing the data which I address below.

Data Insights:

As a child, I was always fond of rebuffing norms and typical structure. My mother always insisted that if I wanted to achieve my (vain) childhood goal of making a lot of money, I would need to stay in school. I preferred curating my own learning experiences.

Over time, I relented and fulfilled my mother's wishes of scholastic pursuit. I realized that across the various neighborhoods in this data set, I could make a simple visualization showing median household income for neighborhoods grouped by the level of education attained by the typical resident. As expected, my mother's hypothesis was backed up by the data, as income level tended to rise across neighborhoods with higher educational attainment.

Findings

I noticed that Los Angeles neighborhoods consistently had higher rates of property and violent crime. I dug a bit deeper into the data and discovered that it was a poor basis for per capita comparisons between cities: the LA Times reported both misdemeanor and felony crimes, while the Furman Center reported only felonies in New York. However, I felt that the data would still be valid for percentile comparisons (relative to other neighborhoods in the same city), so I kept these statistics in the data set.

When I plotted the data, I noticed a positive relationship between property and violent crime rates, but what shocked me was an extreme outlier. I recognized that this point represented the crime rate in Downtown Los Angeles. Being unfamiliar with Downtown Los Angeles, I wondered why this single part of the city appeared to have a crime rate that essentially guaranteed any resident or visitor would become a crime victim. Was Downtown LA that much more dangerous than the other neighborhoods in my data set?

Maybe I was biased in my neighborhood choice and had neglected the more dangerous neighborhoods. I then remembered that Midtown Manhattan showed a similar phenomenon, where the reported crime rate was much higher than expected because of its daytime population. Downtown Los Angeles has a residential population of 34,811. The Los Angeles Times reported that the daytime population of the area exceeded 280,000!

Frequency

I was measuring the crimes of a neighborhood with the daytime population of a mid-sized American city against its small group of residents, as if the crimes committed in that area were perpetrated only against its own residents. This accounted for the bizarre numbers. Measured against the daytime population, more accurate crime numbers for the neighborhood would be 7.82 and 20.1 per 1,000 people for violent crime and property crime respectively. This is much more indicative of a typical--even somewhat safe--Los Angeles neighborhood.
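
The adjustment described above amounts to rescaling a per-resident rate by the ratio of residents to daytime population. Here is a minimal sketch, assuming the reported rate is expressed per 1,000 residents. The function name rescale_rate and the example rate of 100 are hypothetical placeholders, while the two population figures are the ones quoted in this post (with 280,000 used as a lower bound for the daytime population).

```r
# Hypothetical helper: re-express a per-1,000-resident rate per 1,000 daytime population.
rescale_rate <- function(rate_per_1000_residents, residents, daytime_population) {
  incidents <- rate_per_1000_residents * residents / 1000  # implied incident count
  incidents / daytime_population * 1000
}

# Population figures from the text; the reported rate of 100 is a placeholder, not real data
adjusted <- rescale_rate(100, residents = 34811, daytime_population = 280000)
adjusted
```

With these inputs the rate shrinks to roughly an eighth of its reported value, which is the scale of correction the Downtown numbers needed.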

Upon discovering that my crime data was flawed and would need to be re-investigated, I turned to median age. The youngest neighborhood was Watts, Los Angeles (median age of 21) and the oldest was the Upper East Side in New York (median age of 47). Typically, the older neighborhoods experienced lower crime rates. Of the selected neighborhoods, the median age across both cities was 35, slightly lower than the US median of 38. With regard to this data set, New York and Los Angeles appear to be slightly younger than typical American cities.

Future Use Cases and Functionality:

While I did not answer my initial question and wanted to build much more, I am happy with how my first R project came together. In 10 days, I was able to read my information from a SQLite database into a dynamic user interface and write a function to answer similar questions. This shows me that my intended use case would be possible with a bit more development time and data wrangling.

I could see the audience for an application such as this one coming from a wide array of perspectives. Businesses could use it to find the best neighborhoods in which to open a location, in areas heavily populated with their core customer base. Vacationers could use it to decide which neighborhood would be a great place to look for a hotel or bed and breakfast. A family moving from one city to another could use it to find the best school district and the most green space. Because of this potential, I would endeavor to add more fields in the future such as:

Traffic data
- Delays
- Collisions
Cultural data
- Languages spoken
- Restaurants
- Places of Worship
Home value
Rent value
311 information

Conclusion

I found the JSON files difficult to manipulate in bulk with my additional data fields. I resorted to importing the data manually, one step at a time, which still allowed me to give a representation of the functionality across different variables. Toward the end of the project, one of my colleagues showed me the merge command in rgdal, which allows for manipulation of the JSON file. This would make it much easier to add more features, neighborhoods, and cities to the data set. I also wanted to add Atlanta, Toronto, and Montreal to my project and would like to continue investigating that after the bootcamp.
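
The idea of merging extra fields onto the neighborhood data can be illustrated without any spatial packages. Below is a minimal sketch using base R's merge() on ordinary data frames: the column names are hypothetical, and the first table only stands in for the attribute data attached to the polygons. The two median ages are the ones quoted earlier in this post.

```r
# Stand-in for the attribute table attached to the neighborhood polygons
polygons <- data.frame(
  neighborhood = c("Watts", "Upper East Side", "Koreatown"),
  city = c("Los Angeles", "New York", "Los Angeles"),
  stringsAsFactors = FALSE
)

# Extra field to attach, keyed by neighborhood name
stats <- data.frame(
  neighborhood = c("Upper East Side", "Watts"),
  median_age = c(47, 21),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every polygon row, filling NA where no statistic exists
joined <- merge(polygons, stats, by = "neighborhood", all.x = TRUE)
joined
```

Note that merge() on data frames sorts rows by the key by default, so row order should not be relied on to line up with the polygons afterward.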

I would also like to offer options to compare on both the percentile level, relative to the rest of the city, and also on a pure per capita or raw count basis. Optimally, I could expand the functionality to create a score between two chosen neighborhoods that quantifies the match.
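
One way such a score might look, purely as a sketch: the function name match_score, the 0 to 100 scale, the equal weighting of statistics, and the sample percentiles are all assumptions made for illustration, not a design I have settled on.

```r
# Hypothetical score: 100 minus the mean absolute gap in percentile rank
# across the statistics two neighborhoods share (equal weights assumed).
match_score <- function(pct_a, pct_b) {
  stopifnot(length(pct_a) == length(pct_b))
  100 - mean(abs(pct_a - pct_b))
}

# Made-up percentile ranks (0-100) for three statistics
hood_a <- c(income = 80, median_age = 55, transit = 90)
hood_b <- c(income = 70, median_age = 65, transit = 85)
match_score(hood_a, hood_b)
```

Under this definition, identical percentile profiles score 100 and the score falls as the neighborhoods diverge on any statistic.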

Tools used:

R Packages:

shinyjs
shinydashboard
DT
data.table
googleVis
dplyr
leaflet
sp
ggmap
maptools
broom
httr
rgdal
V8
geojsonio
RColorBrewer

Thank you for reading and feel free to access my project from my GitHub repository. I have included all of the files to enable anyone to reproduce the application. Have a whirl by testing out the deployed prototype here!

About Author

Kweku Ulzen

A lover of technology and data, Kweku is always interested in exploring how the two intersect to affect society and provide insights into the most pressing and interesting issues. He is a graduate of the University of Alabama...
