An Analysis of 311 Complaint Data

Posted on Dec 28, 2017




Beginning in 2010, NYC launched an initiative to expose government data via NYC Open Data. In an effort to "improve the accessibility, transparency, and accountability of City government," the catalog "offers access to a repository of government-produced, machine-readable data sets."

For my Data Visualization project, I took inspiration from Ben Wellington's TED talk on storytelling through data and sought to create an interactive dashboard to unearth any notable insights on 311 complaints for the year of 2015.

The dashboard has also been published online, so you can interact with the data yourself!

Data Pre-processing:

The initial dataset of all 2015 311 complaints contained over 2 million rows from various government departments, and those rows recorded over 250 different complaint types. I looked for a way to consolidate these types to make the analysis more interpretable and quickly found that each department had its own labels for similar complaints, so I grouped related types with regular expressions. For example, for the "tree" category I matched any type that contained the words "tree" or "branch", which returned the list below:

  • Damaged Tree
  • Overgrown Tree/Branches
  • Dead Tree
  • New Tree Request
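The grouping step can be sketched as follows. This is a minimal illustration in Python rather than the R used for the project, and it is not the author's actual code. Only the "tree"/"branch" rule comes from the post; the "noise" rule, the "other" fallback, the helper name `consolidate`, and the last two sample labels are assumptions made for the example. The word-boundary anchor is also an assumption, added because a plain substring match on "tree" would also catch "Street".

```python
import re

# Hypothetical category rules. Only the "tree"/"branch" rule is described
# in the post; the "noise" rule is an assumed example.
CATEGORY_PATTERNS = {
    # \b keeps "Street" from matching "tree" (an assumption of this sketch).
    "tree": re.compile(r"\b(tree|branch)", re.IGNORECASE),
    "noise": re.compile(r"\bnoise", re.IGNORECASE),
}


def consolidate(complaint_type):
    """Return the first category whose pattern matches, else 'other'."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(complaint_type):
            return category
    return "other"


raw_types = [
    "Damaged Tree",
    "Overgrown Tree/Branches",
    "Dead Tree",
    "New Tree Request",
    "Noise - Residential",     # illustrative label
    "Street Light Condition",  # illustrative label; must not land in "tree"
]
grouped = [consolidate(t) for t in raw_types]
print(grouped)
```

Because the rules are checked in order, a label that could match several patterns is assigned to the first one listed, so the order of the dictionary matters.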

After grouping the bulk of the remaining categories this way, I ordered them by frequency to arrive at the top ten categories below:

  • heating
  • noise
  • construction
  • plumbing
  • paint
  • unsanitary
  • car_related
  • tree
  • vermin
  • street_sidewalk

After this consolidation, I wanted to uncover more insight into the seasonality and timing of incident reporting. To that end, I generated additional features from the timestamps so that the data could be filtered by month, week of the year, day of the week, day of the year, and hour.
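A minimal sketch of deriving those time features from a single timestamp, written in Python for illustration (the project itself was built in R). The function name `time_features`, the timestamp format, and the sample value are assumptions; the week number here follows the ISO 8601 convention, which may differ from the convention used in the original app.

```python
from datetime import datetime


def time_features(created):
    """Derive calendar features from a complaint's creation timestamp."""
    return {
        "month": created.month,
        "week_of_year": created.isocalendar()[1],  # ISO 8601 week number
        "day_of_week": created.weekday(),          # Monday = 0 ... Sunday = 6
        "day_of_year": created.timetuple().tm_yday,
        "hour": created.hour,
    }


# Sample timestamp and its format are assumed for the example.
created = datetime.strptime("2015-01-15 14:30:00", "%Y-%m-%d %H:%M:%S")
features = time_features(created)
print(features)
```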

Lastly, I wanted to identify whether any particular areas in New York had a disproportionate amount of incident calls within a given timeframe, so I retained the latitude/longitude data to be able to generate a geographical visualization.

Visualizations and Insights:

The app itself contains two interactive visualizations: a line chart and a heatmap. I built the line chart with an R charting library, chosen for its interactivity and clear interface. For the geographic data, I used Leaflet with additional add-ons for custom tiles and layers.

To show some high-level details of the dataset, one can apply the chart filters to view the entire dataset's incident volume by month:

In the left-hand navigation pane, one can filter by borough, complaint type, and time scale. Additionally, some basic measures (mean, median, maximum, minimum, and total) are shown dynamically beneath the graph.
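The summary measures beneath the graph could be computed along these lines. This is a Python sketch for illustration only, not the app's R code; the function name `summarize` and the monthly counts are made up and do not reflect the real 311 data.

```python
from statistics import mean, median


def summarize(counts):
    """Return the basic measures for a mapping of period -> incident count."""
    values = list(counts.values())
    return {
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
        "min": min(values),
        "total": sum(values),
    }


# Made-up counts for four months, used only to exercise the function.
monthly_counts = {"Jan": 120, "Feb": 110, "Mar": 90, "Apr": 70}
summary = summarize(monthly_counts)
print(summary)
```

In a dashboard, the same function would be re-run on whatever subset remains after the borough, complaint type, and time scale filters are applied.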

A straightforward reading of the above graph's 1,266,555 incidents shows that complaints tend to rise in the winter months, which may help a department forecast its staffing needs.

Moreover, the timescale and complaint types can be filtered on a more granular level to uncover further trends:

Filtering this chart by weekday, one can see that complaint volume is considerably lower on weekends for environmental incidents such as vermin sightings, fallen trees, or sidewalk repairs. One could then investigate whether these reports are more common in commercial or residential areas and whether any of these factors could be affecting business.

For the second visualization, I developed a dynamic heatmap that highlights areas of incident concentration from borough to borough. For a uniform user experience, the filtering options are again present on the left-hand side. Along the top, the app displays the total number of incidents as well as the number of incidents associated with the top 50 addresses under the specified filter criteria. This proportion was added to highlight whether a few locations were repeat offenders for a certain complaint compared to the average.
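The top-addresses figure can be sketched as below. This is an illustrative Python version, not the app's R code; the function name `top_n_share` and the sample addresses are hypothetical, and `n` is set to 2 here only so that the small sample produces a meaningful result (the app uses 50).

```python
from collections import Counter


def top_n_share(addresses, n=50):
    """Return (total incidents, incidents at the n busiest addresses, share)."""
    counts = Counter(addresses)
    total = sum(counts.values())
    top_total = sum(count for _, count in counts.most_common(n))
    share = top_total / total if total else 0.0
    return total, top_total, share


# Hypothetical incident addresses after filtering.
addresses = ["1 Main St"] * 5 + ["2 Oak Ave"] * 3 + ["3 Pine Rd"] * 2
total, top_total, share = top_n_share(addresses, n=2)
print(total, top_total, share)
```

A share close to 1 would suggest that a handful of addresses account for most complaints of that type, while a small share points to a more even spread.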

In addition to the heatmap, one can zoom in and toggle the "clusters" and "top_50" displays to view a more granular distribution of the incidents. Clicking on the house icon brings up additional data in a popup:

Here I've zoomed in on a particular address in Northern Manhattan that appeared in the top 50 given my filtering criteria. One can additionally double-click on the cluster itself to see the full distribution of complaints at a given address. Hovering over a pin in the image below shows that while the top complaint at this address was "heating", a fair number of "unsanitary" and "noise" complaints were also filed:

Overall, one can observe that certain Manhattan neighborhoods, such as Inwood and the Lower East Side, have a particularly dense concentration of complaints by most measures. This visualization tool could therefore be used to investigate a given address when considering a relocation or an event in an area, and to determine whether it has historically received a large number of complaints.

Closing thoughts:

I found developing this Shiny app to be a deeply rewarding experience, since I was not only working with a real-world, messy dataset but could also produce tangible results in a dashboard that can be easily extended to other types of analyses. I would love to extend this app further by bringing in external data sources, such as pricing or crime data, to unearth any correlations. I would also want to add functionality that integrates directly with the NYC Open Data API to retrieve up-to-date data, which would allow this dashboard to provide more timely insights.

Thanks for reading!

Thanks for browsing my work, and don't hesitate to reach out if you have any questions or feedback on the approach and techniques used in this project. Feel free to access my code repository on GitHub.



About Author

Michael Chuang

Data enthusiast who is excited at the prospect of finding truth in data to guide direction and maintain accountability. Graduated from Duke University with a degree in Mechanical Engineering. From there, worked several years in consulting and technical...