Data Study on NYPD Vehicle Collision Report

Gregory Brucchieri

Posted on Feb 5, 2018

The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Introduction

Every year there are more than 200,000 vehicle collisions in the five boroughs of New York City. The NYPD began publishing a data set of every recorded vehicle collision in the city in July of 2011. I have created an app to gain some insight into the data using the Shiny package for R.

Data Set

The data set, at the time of download, contained around 1.2 million observations. Each observation records the date, the time, usually some form of location data (e.g. longitude and latitude, street intersection, or address), the number of injuries, the vehicles involved and the contributing factors. I used the information from July 2012 through July 2018.

One problem with the data is that there is no uniform method for officers or data entrants to record the location, the type of vehicles or the contributing factors. This limits the accuracy of statistics based on location, and makes analysis of the latter two features too difficult for the scope of this project. The data set is still large enough, however, that the location-based statistics remain usable.

Data Analysis App

The app lets the user view a line chart of collisions per day for the collision type and time range they select. Info boxes below the chart give the maximum, minimum, mean and variance of daily collisions in that time range for the selected category. All years are shown at the same time to provide contrast and make the statistics more robust and meaningful.
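As a rough illustration of the four statistics shown in the info boxes, the sketch below counts collisions per day and summarises the counts in base R. The sample data, the column name date and the helper summarise_daily are assumptions made for this example, not code from the app.

```r
# Hypothetical sample: one row per collision, with the date it occurred.
collisions <- data.frame(
  date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02",
                   "2016-01-03", "2016-01-03", "2016-01-03"))
)

# Count collisions per day. Days with no collisions do not appear here,
# so a real implementation may want to fill in zero counts first.
daily_counts <- as.vector(table(collisions$date))

# Summarise the daily counts, mirroring the four info boxes.
summarise_daily <- function(counts) {
  c(max = max(counts), min = min(counts),
    mean = mean(counts), variance = var(counts))
}

stats <- summarise_daily(daily_counts)
print(stats)
```

Note that var() returns the sample variance (dividing by n - 1), which is one reasonable choice for a statistic of this kind.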

On the left is a sidebar where the user can choose how to filter the data.

Feature Selection

A number of variables can be changed to select which subset of the data to analyze. The first variable is the category, where the user can choose to view all of the data or just collisions involving injuries, deaths, cyclists or pedestrians. Next is the borough selection; only one borough can be selected at a time. Next, the user can choose to view data from the entire year or focus on a particular month. When the monthly radio button is chosen, an additional drop down for month selection appears. Finally, the user can focus on a particular time range within the day. The hours selected are shown below the drop down.
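The selections described above amount to a chain of row filters. The following base R sketch shows one way such filtering could look; the data frame, its column names (borough, month, hour, injured) and the function filter_collisions are hypothetical and only stand in for whatever the app actually uses.

```r
# Hypothetical sample rows; the column names are assumptions for this sketch.
collisions <- data.frame(
  borough = c("BROOKLYN", "BROOKLYN", "QUEENS", "BROOKLYN"),
  month   = c(1, 1, 1, 2),
  hour    = c(8, 23, 9, 17),
  injured = c(1, 0, 2, 0),
  stringsAsFactors = FALSE
)

# Keep rows for one borough, an optional month, and an inclusive hour range.
filter_collisions <- function(df, borough, month = NULL, hours = c(0, 23),
                              injuries_only = FALSE) {
  keep <- df$borough == borough &
    df$hour >= hours[1] & df$hour <= hours[2]
  if (!is.null(month)) keep <- keep & df$month == month  # monthly view only
  if (injuries_only) keep <- keep & df$injured > 0       # category filter
  df[keep, , drop = FALSE]
}

# January morning collisions in Brooklyn.
morning <- filter_collisions(collisions, "BROOKLYN", month = 1, hours = c(6, 10))
print(morning)

# Whole-year view for the same borough.
all_brooklyn <- filter_collisions(collisions, "BROOKLYN")
```

Leaving month as NULL corresponds to the whole-year view, while passing a month number corresponds to the monthly radio button.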

Code

I built the app in Shiny for R, starting with a navbar page so I could add tabs with their own pages if necessary. I originally had separate pages for each category, but condensed them into one page using the drop down box and a switch statement, to simplify the app and try to cut down on memory usage. However, I did have to import the CSS dependencies from the shinydashboard package to be able to use the infoBox function to showcase the statistics, as that function is specific to that package.
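To illustrate how a single switch statement can replace separate pages per category, here is a minimal base R sketch that maps the drop down value to a filter condition. The category labels, the column names and the function category_condition are all hypothetical; the app's own switch may branch on different values.

```r
# Map the category chosen in the drop down to a filter condition.
# The labels and column names below are assumptions for this sketch.
category_condition <- function(category) {
  switch(category,
    "All"         = "1 = 1",
    "Injuries"    = "persons_injured > 0",
    "Deaths"      = "persons_killed > 0",
    "Cyclists"    = "cyclists_injured + cyclists_killed > 0",
    "Pedestrians" = "pedestrians_injured + pedestrians_killed > 0",
    stop("Unknown category: ", category)  # default branch
  )
}

print(category_condition("Deaths"))
```

Because every category goes through the same function, the chart and info boxes only need to be defined once.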

I converted the CSV from NYC Open Data to a SQLite database and built functions to construct the queries used to retrieve the data. Compared to using the CSV directly, this decreases memory usage by limiting how much data is loaded into memory. The app does take a second on the initial load to get everything into memory, but it loads quickly thereafter. The functions that connect to the database and submit the query can be found in the queryfunc.R file.
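A query-building function of this kind might look roughly like the sketch below. It is not the code from queryfunc.R; the table name, column names and the function build_query are assumptions chosen for illustration. It returns the SQL text together with its parameters so that user input is bound as values rather than pasted into the query.

```r
# Build a parameterised query for daily collision counts.
# Table and column names are hypothetical.
build_query <- function(borough, month = NULL, table = "collisions") {
  sql <- sprintf("SELECT date, COUNT(*) AS n FROM %s WHERE borough = ?", table)
  params <- list(borough)
  if (!is.null(month)) {
    # Add the month filter only when the monthly view is selected.
    sql <- paste(sql, "AND month = ?")
    params <- c(params, list(month))
  }
  sql <- paste(sql, "GROUP BY date ORDER BY date")
  list(sql = sql, params = params)
}

q <- build_query("BROOKLYN", month = 7)
cat(q$sql, "\n")

# With the DBI and RSQLite packages installed, the query could then be sent as:
#   con <- DBI::dbConnect(RSQLite::SQLite(), "collisions.sqlite")  # hypothetical file
#   daily <- DBI::dbGetQuery(con, q$sql, params = q$params)
#   DBI::dbDisconnect(con)
```

Only the aggregated rows returned by the query end up in the R session, which is where the memory saving over reading the full CSV would come from.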

Future Analysis and Updates to App

In the app.R code you can see that I have built a map page that shows either the location of each collision or a heatmap of collisions for the time range, category and borough selected by the user. I have limited the selections to individual month/year combinations to keep the map as readable as possible. It helps to zoom in to get a clearer picture. The map works exactly as intended on its own. However, the app crashes when the map page and the chart page are combined. Due to time constraints I chose to push the app as is and try to fix this issue in future updates.

On the data side, I would like to fully clean the location data so that all observations can be used to provide a more thorough view of the collisions. Time series analysis and predictions could also be explored and added.

Shiny app: https://gregmb.shinyapps.io/GregoryBrucchieriProj1/

Github repository: https://github.com/gregmb/TrafficCollisionApp

About Author

Gregory Brucchieri

Gregory has a Master of Arts in Economics from NYU. He is a former business analyst with Humana, Inc, where he maintained provider relations and contract databases for smaller, local networks Humana had paired with. He is driven...
