Data Comparison Between NY & LA

Posted on Feb 5, 2018


As an avid traveler, I have always been interested in discovering what makes a city unique. Data tools informing travelers of unique landmarks and activities in the places to which they venture have been ubiquitous for ages. While I appreciate the different elements that make a city unique, I also have grown to understand that there are certain types of activities I would like to engage in when I visit a new place.

For my project, I set out to visualize demographic and quality of life data between two cities to compare them at the neighborhood level. If you have ever found yourself traveling to a new city, looking for an area similar to one you knew well, you may have some of the questions which drove me to work on this project.

Say you love curry and you have decided that in your hometown of New York, Flushing has all of the curry houses you love. What neighborhoods in Los Angeles would have similar types of restaurants which may also be highly rated by their patrons? Say I was moving to Toronto from Sao Paulo, but was still enrolled in English courses as I was adapting to the new city. Maybe I can try to find a neighborhood with a high native Portuguese-speaking population to help me get settled into my new environment. Does this neighborhood compare to my favorites in Sao Paulo in terms of available green space and public transportation quality?


My goal was to build a tool that could take these questions and return an answer quickly for a user. This tool would find congruence between any two data points at the neighborhood level. I chose to test this functionality using data from New York and Los Angeles--a city that I know very well and one where I can't find the airport without a map, respectively.

Los Angeles, California, USA


New York, New York, USA



My application sources US Census data for demographic and quality-of-life information. I augmented this data set with data from Walkscore, a website which rates the quality of a neighborhood's transportation options. The rating is derived from a weighted analysis of the requested area's features and is not a relative index. One neighborhood's Walkscore therefore does not depend on the value of another's, although it may be affected when services in a nearby area are reachable from it.

I acquired my map polygons from public GeoJSON files, which posed a bit of a challenge in data manipulation. As the project reached its conclusion, I found a more effective strategy for manipulating JSON data in R, and I discuss further down how I would incorporate it if the project were to continue.


The specific categories in my question would be tough to answer without web scraping skills, which I am acquiring in a later module of the bootcamp. To accommodate my knowledge gap, I chose a smaller set of data to run the project as a proof of concept. My data fields for this project are:

  • Neighborhood Name
  • City
  • Neighborhood Population
  • Median Household Income
  • Average Household Size
  • Violent Crime Rate (per 1,000)
  • Property Crime Rate (per 1,000)
  • Median Educational Attainment
  • Median Age
  • US Census Racial Categories
    • White
    • Black
    • Asian
    • Hispanic (non-race, ethnic)
  • Percent Foreign Born
  • Data from
    • WalkScore
    • TransitScore
    • BikeScore

The data for Los Angeles was very simple to acquire as the Los Angeles Times began mapping Los Angeles and gaining these insights with neighborhood granularity in 2009. Data collected by neighborhood in New York posed a greater, more time-consuming challenge as the census collects data at the census tract level, which is independent of neighborhoods and generally contains a portion of a single neighborhood or portions of multiple neighborhoods.

As an alternative, I found a data set from the Furman Center, which had analyzed census data for the combined neighborhoods defined by the City of New York. This brought the New York data to the same level of granularity as the data prepared by the LA Times. At this stage, I decided that I would be better served testing functionality on demographic information and WalkScore's ratings, which use data types similar to those behind my questions. I won't find out about restaurants in this proof of concept, but maybe I can find out which neighborhoods have young immigrant populations in each city, then match them.


To try and find a good representation of neighborhoods, I chose 20 per city: 5 which I anecdotally knew to be affluent, 5 which I anecdotally knew to be under-served, and 10 totally at random. The neighborhoods in this demo are:

New York:

  • Central Harlem
  • Upper East Side
  • Upper West Side
  • Fort Greene/Brooklyn Heights
  • Coney Island
  • St. George/Stapleton
  • Hillcrest/Fresh Meadows
  • Crown Heights/Prospect Heights
  • Lower East Side/Chinatown
  • Greenwich Village/SoHo
  • Washington Heights/Inwood

Los Angeles:

  • Studio City
  • Beverly Hills
  • Culver City
  • Baldwin Hills/Crenshaw
  • Silver Lake
  • Echo Park
  • Van Nuys
  • Santa Monica
  • East Los Angeles
  • Historic South Central
  • Eagle Rock

The next steps after acquiring the data are to manipulate the information to provide answers to my new burning questions!

R Shiny Data Application:

I created a dashboard application in R Shiny to visualize neighborhood comparisons and to allow for user input. The dashboard consists of three sections:

  • Map: a choropleth map built with the leaflet package for R, using GeoJSON files for the neighborhood boundaries. The map separates neighborhood rankings by percentile into five groups (0-20%, 21-40%, 41-60%, 61-80%, 81-100%), with each group's color defined in the map legend. The interface lets a user pick a statistic and see how the neighborhoods in each city rank on it. Hovering over a neighborhood polygon pops up the neighborhood name and the statistic.

  • Matchmaker: the key interface for answering the questions posed earlier. I created a function to find the three nearest comparisons in percentile by taking the minimum absolute difference between a neighborhood's rank and the corresponding ranks in another city. It also uses an observe event to ensure that the neighborhoods searched match the city specified in the prompt.
    • The function takes a neighborhood's percentile value "z" and a vector "a" of percentile rankings for the neighborhoods in the comparison city, and takes the difference between them. It then looks for the three closest percentile points to find "matches." For example, a user who would like to know the most similar neighborhoods to Brentwood, Los Angeles could search either New York or the rest of Los Angeles using the function below:

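# a: numeric vector of percentile ranks for the candidate neighborhoods
# z: percentile rank of the neighborhood being matched
# Returns the positions in `a` of the (up to) three closest ranks.
matchmaker = function(a, z) {
  # Absolute gap between the target rank and every candidate rank
  gap = abs(z - a)
  # order() sorts the original positions by gap, so no index has to be
  # shifted after removing an earlier match (the step-by-step version
  # could return the same position twice when a later match sat
  # directly after an earlier one)
  mymatch = head(order(gap), 3)
  mymatch
}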

  • Insights: I had additional questions upon seeing the data which I address below.
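
The Map and Matchmaker steps above can be sketched end to end in a few lines of R. This is a minimal illustration rather than the application's actual code: the neighborhood labels and median ages are made up, and the helper names `pct_rank`, `quintile`, and `closest_three` are hypothetical.

```r
# Hypothetical median ages for a few neighborhoods (made-up values)
la <- data.frame(
  neighborhood = c("A", "B", "C", "D", "E"),
  median_age   = c(29, 41, 35, 33, 38),
  stringsAsFactors = FALSE
)
ny <- data.frame(
  neighborhood = c("V", "W", "X", "Y", "Z"),
  median_age   = c(31, 47, 36, 28, 40),
  stringsAsFactors = FALSE
)

# Percentile rank (0-100) of each value relative to its own city
pct_rank <- function(x) {
  100 * (rank(x, ties.method = "average") - 1) / (length(x) - 1)
}

# Assign a percentile to one of the five map groups
quintile <- function(p) {
  cut(p,
      breaks = c(0, 20, 40, 60, 80, 100),
      labels = c("0-20%", "21-40%", "41-60%", "61-80%", "81-100%"),
      include.lowest = TRUE)
}

# Positions of the three candidate ranks closest to the target rank
closest_three <- function(a, z) head(order(abs(z - a)), 3)

la$pct <- pct_rank(la$median_age)
ny$pct <- pct_rank(ny$median_age)
la$bin <- quintile(la$pct)

# Which New York neighborhoods sit closest to LA neighborhood "C"?
target  <- la$pct[la$neighborhood == "C"]
matches <- ny$neighborhood[closest_three(ny$pct, target)]
print(matches)
```

Because the ranks are computed within each city, the comparison is relative: a neighborhood at the 50th percentile in Los Angeles is matched to neighborhoods near the 50th percentile in New York, whatever the raw values are.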

Data Insights:

As a child, I was always fond of rebuffing norms and typical structure. My mother always insisted that if I wanted to achieve my (vain) childhood goal of making a lot of money, I would need to stay in school. I preferred curating my own learning experiences.

Over time, I relented and fulfilled my mother's wishes of scholastic pursuit. I realized that across the various neighborhoods in this data set, I could make a simple visualization showing the median household income for neighborhoods with particular levels of degrees attained by the typical person. As expected, my mother's hypothesis was backed up by the data, as income level tended to rise across neighborhoods with higher educational attainment.


I noticed that Los Angeles neighborhoods consistently had higher levels of property and violent crime. I researched the data a bit deeper and discovered that this comparison was imperfect for per capita comparisons between cities as LA Times reported misdemeanor and felony crimes while Furman Center only reported felonies in New York. However, I felt that this comparison would still be valid for percentile comparisons (relative to other neighborhoods in the same city) so I kept these statistics in the data set.

When I plotted the data, I noticed a positive relationship between property crime and violent crime rates, but what shocked me was an extreme outlier. I recognized that this point represented the crime rate in Downtown Los Angeles. Having no familiarity with Downtown Los Angeles, I wondered why this single part of the city appeared to have a crime rate that would all but guarantee that any resident or visitor became a crime victim. Was Downtown LA that much more dangerous than the other neighborhoods in my data set?

Maybe I was biased in my neighborhood choice and had neglected the more dangerous neighborhoods. I then remembered that Midtown Manhattan shows a similar phenomenon, where the reported crime rate is much higher than expected because of its daytime population. Downtown Los Angeles has a residential population of 34,811. The Los Angeles Times reported that the daytime population of the area exceeded 280,000!


We were measuring the crime frequency of a neighborhood with a mid-sized American city's population against a small group of people as if the crimes committed in that area were only perpetrated against its own residents. This accounted for the bizarre numbers. More accurate crime numbers for the neighborhood would be 7.82 and 20.1 per 1,000 residents for violent crime and property crime respectively. This is much more indicative of a typical--even somewhat safe--Los Angeles neighborhood.
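
One way to make this kind of correction is to rescale a per-resident rate by the number of people actually present. The sketch below assumes a simple population rescaling, which the post does not confirm was the method behind the adjusted figures above. The function name `adjust_rate` and the input rate of 100 are hypothetical; only the two population figures come from the text.

```r
# Convert a rate per 1,000 residents into a rate per 1,000 people present
adjust_rate <- function(rate_per_1000, resident_pop, present_pop) {
  # Recover the implied number of incidents from the resident-based rate
  incidents <- rate_per_1000 * resident_pop / 1000
  # Re-express those incidents against the larger daytime population
  incidents / present_pop * 1000
}

resident_pop <- 34811   # Downtown LA residential population (from the post)
daytime_pop  <- 280000  # reported lower bound on the daytime population

# Hypothetical reported rate, used only to show the size of the effect
reported_rate <- 100
adjusted <- adjust_rate(reported_rate, resident_pop, daytime_pop)
print(adjusted)
```

With these populations, any resident-based rate shrinks to roughly one eighth of its reported value.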

Upon discovering that my crime data was flawed and would need to be re-investigated, I turned to median age. The youngest neighborhood was Watts, Los Angeles (median age of 21) and the oldest was the Upper East Side in New York (median age of 47). Typically, the older neighborhoods experienced lower crime rates. Of the selected neighborhoods, the median age across both cities was 35, slightly lower than the US median of 38. With regard to this data set, New York and Los Angeles appear to be slightly younger than the country as a whole.

Future Use Cases and Functionality:

While I did not answer my initial question and wanted to build much more, I am happy with how my first R project came together. In 10 days, I was able to read my information from a SQLite database into a dynamic user interface and write a function to answer similar questions. This shows me that my intended use case would be possible with a bit more development time and data wrangling.

I could see the core audience of an application such as this one coming from a wide array of perspectives. This can be useful to businesses looking for the best neighborhoods to open a location in an area heavily populated with their core customer base. Vacationers could use this to decide which neighborhood would be a great place to look for a hotel or bed and breakfast. A family moving from one city to another could use this to find the best school district and most green space. Because of this potential, I would endeavor to add more fields in the future such as:

  • Traffic data
    • Delays
    • Collisions
  • Cultural data
    • Languages spoken
    • Restaurants
    • Places of Worship
  • Home value
  • Rent value
  • 311 information


I found the JSON files difficult to manipulate in bulk with my additional data fields. I resorted to methodically importing the data by hand, which allowed me to show the functionality across different variables. Toward the end of the project, one of my colleagues showed me the merge command in rgdal, which allows for manipulation of the JSON file. This would make it much easier to add more features, neighborhoods, and cities to the data set. I also wanted to add Atlanta, Toronto, and Montreal to my project and would like to continue investigating that after the bootcamp.
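
The idea behind a merge is a keyed join: attach the statistics to the polygon attributes by neighborhood name instead of entering them by hand. The sketch below shows that idea with base R's `merge()` on plain data frames. It does not reproduce the spatial merge my colleague showed me, and the column names and scores are hypothetical.

```r
# Stand-in for the attribute table that accompanies the map polygons
polygons <- data.frame(
  name       = c("Echo Park", "Silver Lake", "Van Nuys"),
  polygon_id = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Extra statistics keyed by neighborhood name (made-up values)
stats <- data.frame(
  name       = c("Silver Lake", "Echo Park"),
  walk_score = c(80, 85),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every polygon, filling NA where no statistic exists
joined <- merge(polygons, stats, by = "name", all.x = TRUE)
print(joined)
```

Keeping unmatched polygons with NA values, rather than dropping them, makes missing neighborhoods visible on the map instead of silently removing them.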

I would also like to offer options to compare on both the percentile level, relative to the rest of the city, and also on a pure per capita or raw count basis. Optimally, I could expand the functionality to create a score between two chosen neighborhoods that quantifies the match.
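
As a rough sketch of what such a score could look like, one option is to average the percentile gaps across every statistic two neighborhoods share and subtract that from 100. This is one possible definition, not something the application implements. The function name `match_score` and the percentile values are hypothetical.

```r
# Score two neighborhoods from 0 (opposite ends) to 100 (identical ranks)
# p1, p2: named numeric vectors of within-city percentile ranks (0-100)
match_score <- function(p1, p2) {
  shared <- intersect(names(p1), names(p2))
  stopifnot(length(shared) > 0)
  # Mean absolute percentile gap over the shared statistics
  100 - mean(abs(p1[shared] - p2[shared]))
}

# Made-up percentile ranks for two neighborhoods
nbhd_a <- c(median_age = 50, income = 80, walk_score = 90)
nbhd_b <- c(median_age = 40, income = 70, walk_score = 60)

score <- match_score(nbhd_a, nbhd_b)
print(round(score, 1))
```

An unweighted mean treats every statistic as equally important. A user who cares mostly about transit, for example, would need a weighted version.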

Tools used:

R Packages:

  • shinyjs
  • shinydashboard
  • DT
  • data.table
  • googleVis
  • dplyr
  • leaflet
  • sp
  • ggmap
  • maptools
  • broom
  • httr
  • rgdal
  • V8
  • geojsonio
  • RColorBrewer


Thank you for reading, and feel free to access my project from my GitHub repository. I have included all of the files to enable anyone to reproduce the application. Have a whirl by testing out the deployed prototype here!

About Author

Kweku Ulzen

A lover of technology and data, Kweku is always interested in exploring how the two intersect to affect society and provide insights into the most pressing and interesting issues. He is a graduate of the University of Alabama...