An Analysis of 311 Complaint Data
Background:
Beginning in 2010, NYC launched an initiative to expose government data via NYC Open Data. As the site puts it, in an effort to "improve the accessibility, transparency, and accountability of City government, this catalog offers access to a repository of government-produced, machine-readable data sets."
For my Data Visualization project, I took inspiration from Ben Wellington's TED talk on storytelling through data and sought to create an interactive dashboard to unearth any notable insights on 311 complaints for the year of 2015.
The dashboard has also been published to shinyapps.io, where you can interact with the data yourself!
Data Pre-processing:
The initial dataset of all 2015 311 complaint data contained over 2 million rows coming from the various government departments, and within those rows were over 250 different types of complaints. I looked for a way to consolidate these complaint types to make the analysis more interpretable and quickly found that each department had its own labels for complaints, so I grouped related types into broader categories by matching keywords. For example, for the "tree" category I used regular expressions to group any type that contained the words "tree" or "branch", which returned the list below:
- Damaged Tree
- Overgrown Tree/Branches
- Dead Tree
- New Tree Request
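To make the grouping step concrete, here is a minimal sketch in Python (not the R code behind the app). The `CATEGORY_PATTERNS` table, the `consolidate` helper, and the "other" fallback are hypothetical names; only the "tree" and "branch" keywords come from the text above. The word boundary `\b` is my own assumption: without it, a plain search for "tree" would also match any complaint type containing the word "Street".

```python
import re

# Hypothetical pattern table; only the "tree"/"branch" keywords are from the post.
# \b keeps "tree" from matching inside words such as "Street" (an assumption
# about how the author's patterns behaved, not a confirmed detail).
CATEGORY_PATTERNS = {
    "tree": re.compile(r"\b(?:tree|branch)", re.IGNORECASE),
}

def consolidate(complaint_type: str) -> str:
    """Return the first consolidated category whose pattern matches."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(complaint_type):
            return category
    return "other"  # hypothetical fallback label for unmatched types

# The four tree-related types listed above, plus one illustrative non-match.
raw_types = [
    "Damaged Tree",
    "Overgrown Tree/Branches",
    "Dead Tree",
    "New Tree Request",
    "Street Light Condition",
]
for raw in raw_types:
    print(f"{raw!r} -> {consolidate(raw)}")
```

Adding a category in this sketch means adding one more entry to the table, which is roughly how the remaining groups could be built up.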
After grouping the bulk of the remaining categories this way, I ordered them by frequency to arrive at the top ten categories below:
- heating
- noise
- construction
- plumbing
- paint
- unsanitary
- car_related
- tree
- vermin
- street_sidewalk
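Ranking the consolidated categories by frequency can be sketched as follows, again in Python and with toy labels rather than the real data. The `top_categories` helper is a hypothetical name.

```python
from collections import Counter

def top_categories(labels, n=10):
    """Return the n most frequent category labels, most frequent first."""
    return [category for category, _ in Counter(labels).most_common(n)]

# Toy input: one consolidated label per complaint row (not real counts).
toy_labels = ["heating", "noise", "heating", "tree", "noise", "heating"]
print(top_categories(toy_labels, n=2))  # ['heating', 'noise']
```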
After this consolidation, I wanted to uncover more insight into the seasonality and timing of incident reporting. To that end, I generated additional features from the timestamps so that I could filter by month, week of the year, day of the week, day of the year, and hour.
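The timestamp features could be derived along these lines. This is a Python sketch under assumptions: the `time_features` helper is hypothetical, the input format (`MM/DD/YYYY hh:mm:ss AM/PM`) is my guess at how the created-date column is exported, and I use ISO week numbering, which may differ from what the app uses.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def time_features(created: str) -> dict:
    """Split one timestamp string into the filterable time features."""
    # Assumed export format, e.g. "01/15/2015 11:30:00 PM".
    ts = datetime.strptime(created, "%m/%d/%Y %I:%M:%S %p")
    return {
        "month": ts.month,
        "week_of_year": ts.isocalendar()[1],   # ISO week number (assumption)
        "day_of_week": DAY_NAMES[ts.weekday()],
        "day_of_year": ts.timetuple().tm_yday,
        "hour": ts.hour,                       # 0-23
    }

print(time_features("01/15/2015 11:30:00 PM"))
```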
Lastly, I wanted to identify whether any particular areas in New York had a disproportionate amount of incident calls within a given timeframe, so I retained the latitude/longitude data to be able to generate a geographical visualization.
Visualizations and Insights:
The app itself contains two interactive visualizations: a line chart and a heatmap. I opted to use the R implementation of Plotly for the line chart given its interactivity and clear interface. For the geographic data, I used Leaflet with additional add-ons for custom tiles and layers.
To show some high-level details of the dataset, one can apply chart filters to view the entire dataset's incident volume by month:
As you can see, in the left-hand navigation pane, one can filter by borough, complaint type, and time scale. Additionally, some basic measures of mean, median, maximum, minimum, and total are shown dynamically beneath the graph.
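The summary measures under the graph amount to a handful of aggregates over whatever counts the current filter produces. A minimal Python sketch, with a hypothetical `summarize` helper and made-up counts:

```python
from statistics import mean, median

def summarize(counts):
    """Compute the five measures for a series of incident counts."""
    return {
        "mean": mean(counts),
        "median": median(counts),
        "maximum": max(counts),
        "minimum": min(counts),
        "total": sum(counts),
    }

# Toy counts per time bucket (illustrative only, not from the dataset).
print(summarize([120, 95, 80, 105]))
```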
A fairly straightforward interpretation of the above graph's 1,266,555 incidents is that complaints tend to rise in the winter months, which may help a department forecast its staffing needs.
Moreover, the timescale and complaint types can be filtered on a more granular level to uncover further trends:
From this chart filtering by weekday, one can see that complaint volume is considerably lower on the weekends for environmental incidents such as vermin sightings, fallen trees, or sidewalk repairs. One could then investigate further as to whether these reports are more common in commercial or residential areas and understand if any of these factors could be affecting business.
For the second visualization, I developed a dynamic heatmap that can highlight areas of incident concentration from borough to borough. For a uniform user experience, the filtering options are again present on the left-hand side. Along the top, the app displays the total number of incidents as well as the count of incidents associated with the top 50 addresses for the specified filter criteria. This proportion was added to highlight whether a few locations were repeat offenders for a certain complaint compared to the average.
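The top-50 proportion can be sketched as counting incidents per address and comparing the busiest addresses against the total. This Python version uses a hypothetical `top_address_share` helper and toy addresses; it is not the app's actual code.

```python
from collections import Counter

def top_address_share(addresses, n=50):
    """Return (incidents at top-n addresses, total incidents, share)."""
    counts = Counter(addresses)
    total = sum(counts.values())
    top_total = sum(count for _, count in counts.most_common(n))
    share = top_total / total if total else 0.0
    return top_total, total, share

# Toy example with n=2: addresses "A" and "B" hold 8 of 10 incidents.
toy_addresses = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
print(top_address_share(toy_addresses, n=2))  # (8, 10, 0.8)
```

A share close to 1 would suggest that a handful of addresses drive most of the complaints for that filter, while a small share would point to a more even spread.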
In addition to the heatmap, one can also zoom in and toggle the "clusters" and "top_50" displays to view a more granular distribution of the incidents. Upon clicking on the house icon, additional data will appear in a popup:
Here I've zoomed in on a particular address in Northern Manhattan that appeared in the top 50 given my filtering criteria. One can additionally double-click on the cluster itself to see the full distribution of complaints at a given address. In the image below, hovering over the pins shows that while the top complaint at this address was "heating", a fair number of "unsanitary" and "noise" complaints were also filed:
On the whole, certain Manhattan neighborhoods such as Inwood and the Lower East Side show a particularly dense concentration of complaints by most measures. This visualization tool could therefore be used to investigate a given address when considering a relocation or an event in the area, and to determine whether it has historically received a large number of complaints.
Closing thoughts:
I found developing this Shiny app to be a deeply rewarding experience, since I was not only working with a real-world, messy dataset but could also produce tangible results in a dashboard that can be easily extended to other types of analyses. I would love to extend this app further by bringing in external data sources such as pricing or crime data to unearth any correlations. I would also want to add functionality that integrates directly with the NYC Open Data API to retrieve up-to-date data, allowing this dashboard to provide more timely insights.
Thanks for reading!
Don't hesitate to reach out if you have any additional questions or feedback on my approach and the techniques used in this project. Feel free to access my code repository on GitHub.