Data Study on Crime and Demographics in New York City
The skills the author demonstrates here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
Some American fixations: football, taxes, and crime. Like its kin, crime sits squarely in the national consciousness; untold resources have been devoted to understanding, dissecting, and analyzing all its facets. Obsession over criminal activity is perhaps nowhere more salient than in New York City, which found itself mired in crisis in the 1970s and 80s. The 4/5/6 subway line, which today handles the greatest share of riders, was affectionately called the "Mugger's Express" due to the high incidence of daylight robbery. Meanwhile, gangs, prostitutes, and corrupt officials roamed the city unchecked.
Of course, if you're reading this, you know the end of this story already. With mayors David Dinkins, Rudy Giuliani, and Michael Bloomberg in office, New York crime plunged to unprecedented lows. Soho, once an industrial wasteland of sweatshops and abandoned factories, is now one of the most gentrified neighborhoods on the Eastern Seaboard. Brooklyn, once afflicted with staggering amounts of criminal activity, is now a hot zone for the new generation of yuppies. Indeed, The Economist ranked New York City the 10th-safest city in the world on its Safe Cities Index, all but memorializing the Big Apple's transformation into an alpha city.
The Data
So how did New York dramatically reduce its crime rate? Any prospective analyst would find the challenge not in finding an answer (of which there are many), but rather in crafting a succinct narrative from the enormous hoard of American crime data. Approaches could be as varied as measuring the effectiveness of stop-and-frisk or evaluating the impact of the strict gun control introduced under Michael Bloomberg.
For my project, I chose to look at two separate data sets: the New York Police Department's (NYPD) Historical Crime Data, and the Census Bureau's American Community Survey (ACS).
The NYPD dataset grew out of Rudy Giuliani's Compstat initiative, introduced in 1994. This initiative enforced a statistically driven approach to crime reduction; since its inception, all criminal offenses have been logged in a central database, along with relevant data on geographical location, offense type, and time. These data are further grouped by precinct. Datasets are updated weekly, providing impressive granularity and access to New York's crime trends. The currently available data span from 2000 to 2016.
The ACS is a nationwide demographic survey conducted by the United States Census Bureau. It was launched in 2005 out of a need for annually aggregated household data. The ACS contacts approximately 3.5 million households per year and presents the data in an open, easily accessible format. Data are gathered on multiple categories, including income, education, and ethnicity. The Census Bureau also publishes the data at a finer geographic resolution in the form of Public Use Microdata Areas (PUMAs), statistical areas that each contain at least 100,000 residents. Interestingly, these areas do not line up with NYPD precinct boundaries.
Vision and Limitations of Data
My initial vision was to unify the NYPD and ACS datasets. In doing so I would construct a longitudinal study comparing demographic data with crime rate, grouped by geographic sub-areas within New York City.
Ideally, I would have analyzed the initial decline in crime rate, which occurred throughout the late 80s and 90s. However, the criminal offense data for those years were either not available online or never recorded. Thus any study seeking to use NYPD data could only feasibly catch the tail end of the crime decline, from 2000 on.
I ran into further limitations with the ACS data. While ACS data covering New York are available online from 2000, data standardized to PUMAs are not available until 2011. Any longitudinal study combining NYPD and ACS data grouped by geography could then only cover 2011 and later.
But the most serious limitation came when I discovered that the two sets of geodata I had been using, precinct boundaries and PUMA boundaries, could not be overlaid on each other in my data visualization. And while it was indeed possible to collate the data in a different format, I discovered the problem too close to the project deadline to make a change. When I revisit this project, I will seek to rectify this problem and give the visualization the treatment it deserves.
Ultimately, I could not combine the data geographically, and I could not compare the datasets directly. But I decided that I could construct two separate studies and qualitatively assess the impact of certain variables. What you see below is an amalgamation of two different data visualization studies: a longitudinal study of crime in New York grouped by NYPD precinct, and a demographic snapshot of the city grouped by ACS PUMA.
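One common way to reconcile two mismatched sets of boundaries is area-weighted apportionment: each precinct's count is split among the PUMAs it overlaps, in proportion to the overlapping area. The sketch below is not what this project did. It is a toy illustration that assumes offenses are spread evenly within a precinct and uses axis-aligned rectangles in place of real polygons. All names, shapes, and counts are hypothetical.

```python
from typing import Dict, Tuple

# A rectangle is (min_x, min_y, max_x, max_y). Real boundaries would be polygons.
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    """Area of a rectangle; zero if the rectangle is empty or inverted."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap_area(a: Rect, b: Rect) -> float:
    """Area shared by two rectangles."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter)


def apportion(counts: Dict[str, float],
              sources: Dict[str, Rect],
              targets: Dict[str, Rect]) -> Dict[str, float]:
    """Spread each source count over the targets in proportion to shared area."""
    result = {name: 0.0 for name in targets}
    for src, count in counts.items():
        src_area = area(sources[src])
        for tgt, rect in targets.items():
            share = overlap_area(sources[src], rect) / src_area
            result[tgt] += count * share
    return result


# Hypothetical layout: two precincts side by side, two PUMAs with a shifted border.
precincts = {"P1": (0, 0, 2, 2), "P2": (2, 0, 4, 2)}
pumas = {"A": (0, 0, 3, 2), "B": (3, 0, 4, 2)}
offenses = {"P1": 100.0, "P2": 300.0}

puma_offenses = apportion(offenses, precincts, pumas)
print(puma_offenses)
```

Here precinct P1 lies entirely inside PUMA A, while P2 straddles the border and is split evenly, so the citywide total is preserved. The even-spread assumption is the weak point: crime clusters within a precinct, so any real analysis would want to check it against point-level data.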
Data Visualization
My first goal was to visualize crime and demographic data in choropleth maps. Below you can see each precinct color-coded by crime rate, with a drop-down menu for selecting the type of crime (major felonies, minor felonies, misdemeanors, or violations) and a slider for selecting a year from 2000 to 2016. Figure 1.1 depicts the crime map and showcases the hover-over function I implemented.
Figure 1.2 is another close-up of the crime data choropleth. The data generally agree with commonly held assumptions: deep Brooklyn and the Bronx exhibit high rates of crime. An interesting outlier is Manhattan's Midtown and NoHo, where crime rates fall into the highest bucket. I could not glean a reason from the data that I had, but it presents an interesting problem for future analysis, should I revisit the data.
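A choropleth colors each area according to the bucket its value falls into. The binning scheme behind the maps here is not stated, so the following is only a sketch of one common choice, quantile buckets, in which each color covers roughly the same number of areas. The precinct labels and rates are made up for illustration.

```python
from bisect import bisect_right
from typing import Dict, List


def quantile_breaks(values: List[float], n_buckets: int) -> List[float]:
    """Return the n_buckets - 1 cut points that split the sorted values evenly."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_buckets] for k in range(1, n_buckets)]


def assign_buckets(rates: Dict[str, float], n_buckets: int = 4) -> Dict[str, int]:
    """Map each area to a bucket index: 0 is the lowest, n_buckets - 1 the highest."""
    breaks = quantile_breaks(list(rates.values()), n_buckets)
    return {name: bisect_right(breaks, rate) for name, rate in rates.items()}


# Hypothetical offenses per 1,000 residents for eight made-up precincts.
rates = {"A": 5.0, "B": 12.0, "C": 7.5, "D": 30.0,
         "E": 9.0, "F": 18.0, "G": 3.0, "H": 22.0}

buckets = assign_buckets(rates)
for name in sorted(buckets, key=buckets.get):
    print(name, buckets[name])
```

With quantile buckets, some areas always land in the top bucket no matter how low crime is overall, so the colors show relative rank within the city rather than absolute level. Equal-width buckets would behave differently and might be the better choice when comparing years on a fixed scale.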
Highest Crime Rates
My next step was to repeat the process with the ACS data. Figures 1.3 and 1.4 show the same kind of map, but with PUMAs instead of NYPD precincts. You will notice that the areas of high crime in Figures 1.1 and 1.2 (i.e. the Bronx and South Brooklyn) roughly coincide with areas of high unemployment and high labor force disengagement, with the exception of Midtown Manhattan. The outer edges of New York proper also exhibit high rates of labor force disengagement. I posit this is because the outskirts are more suitable residential areas for retirees and families raising children, a trend we also see among suburban commuters.
Crimes
My second goal was to construct a handful of graphs that visually represented the data and exposed interesting bivariate trends. I first confirmed that the New York crime rate had indeed dropped significantly (see Figure 2.1). What's astounding is that since the turn of the millennium, the citywide total dropped from just under 250,000 offenses per year to a little above 150,000 offenses per year, almost a 40% decrease since the beginning of the NYPD data set. Some inter-borough disparities in crime volume can be explained by each borough's population size, with Brooklyn having by far the largest population. In hindsight, a similar graph adjusting for population size would have been interesting to ponder.
Note: Staten Island data were incomplete between 2000 and 2012, so I excluded those years for that borough.
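The arithmetic behind the drop, and the population adjustment mentioned above, can be sketched in a few lines. The citywide totals below are the rounded figures quoted in the text. The borough names, offense counts, and populations are hypothetical placeholders, not real data.

```python
def percent_change(start: float, end: float) -> float:
    """Percentage change from start to end; negative means a decrease."""
    return (end - start) * 100.0 / start


def rate_per_100k(offenses: float, population: float) -> float:
    """Offenses per 100,000 residents."""
    return offenses * 100_000 / population


# Rounded citywide totals quoted in the text (approximate).
start_total = 250_000
end_total = 150_000
drop = percent_change(start_total, end_total)
print(f"Citywide change: {drop:.1f}%")

# Hypothetical boroughs, for illustration only (not real figures).
boroughs = {
    "Borough X": {"offenses": 50_000, "population": 2_500_000},
    "Borough Y": {"offenses": 30_000, "population": 1_000_000},
}
per_capita = {
    name: rate_per_100k(b["offenses"], b["population"])
    for name, b in boroughs.items()
}
for name, rate in per_capita.items():
    print(f"{name}: {rate:.0f} offenses per 100,000 residents")
```

In this made-up example, Borough X records more offenses in total but has the lower per-capita rate, which is exactly the kind of reversal a population-adjusted graph would surface.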
I next looked at income bracket distributions throughout the city to see if income correlated with crime rate. Not surprisingly, in 2015, Manhattan had the largest number of families making more than $200,000 a year. The Bronx stands apart, with the most households in the lowest income brackets and the fewest in the highest. Brooklyn exhibits a similar pattern, but with a heavier right tail, probably due to the gentrification of neighborhoods such as Williamsburg and Brooklyn Heights. Qualitatively, neighborhoods with fewer rich households in proportion to poor households seem to have a higher crime rate.
Median Income vs Unemployment
Finally, I plotted mean and median income against unemployment rate. The plot shows a relatively strong correlation between income and unemployment.
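A claim like "relatively strong correlation" can be backed with a number, for instance the Pearson correlation coefficient. The sketch below computes it from scratch. The income and unemployment values are invented for illustration and are not ACS data, so the sign and size of the result say nothing about the actual plot.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical PUMA-level values, for illustration only (not ACS data).
median_income = [32_000, 45_000, 58_000, 71_000, 96_000]
unemployment_pct = [12.5, 10.1, 8.0, 6.4, 4.2]

r = pearson_r(median_income, unemployment_pct)
print(f"Pearson r = {r:.3f}")
```

A value near -1 or +1 indicates a strong linear relationship and a value near 0 a weak one. Since mean income is sensitive to a few very high earners, computing the coefficient separately for mean and median income would show whether the relationship depends on that choice.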