Data Study on NYPD Vehicle Collision Report

Posted on Feb 5, 2018


Every year there are more than 200,000 vehicle collisions in the five boroughs of New York City. The NYPD began publishing a data set of every recorded vehicle collision in the city in July of 2011. I have created an app to gain some insight into the data using the Shiny package for R.

Data Set

At the time of download, the data set contained about 1.2 million observations. Each observation records the date, the time, usually some form of location data (longitude and latitude, street intersection, or address), the number of injuries, the vehicles involved and the contributing factors. I used the information from July 2012 through July 2018.

The main problem is that there is no uniform method for officers or data entrants to record the location, the types of vehicles or the contributing factors. This limits the accuracy of location-based statistics and makes analysis of vehicle types and contributing factors too difficult for the scope of this project. Even so, the data set is large enough that the location data remain usable.

Data Analysis App

The app lets the user view a line chart of collisions per day for the collision type and time range they select. Info boxes below the chart give the maximum, minimum, mean and variance of daily collisions in that time range for the selected category. All years are shown together to provide contrast and make the statistics more robust and meaningful.
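As a rough illustration of what the info boxes report, here is a minimal sketch in Python (not the app's R code). The function name `daily_summary` and the sample dates are made up for the example, and it assumes the variance is the sample variance (dividing by n - 1), which is what R's `var()` computes by default.

```python
from collections import Counter
from statistics import mean, variance


def daily_summary(dates):
    """Count collisions per day, then summarize the daily counts."""
    counts = Counter(dates)            # one entry per calendar day that appears
    per_day = list(counts.values())
    return {
        "max": max(per_day),
        "min": min(per_day),
        "mean": mean(per_day),
        # Sample variance needs at least two days of data
        "variance": variance(per_day) if len(per_day) > 1 else 0.0,
    }


# Hypothetical sample: one ISO date string per recorded collision
sample = ["2017-01-01"] * 3 + ["2017-01-02"] * 5 + ["2017-01-03"] * 4
summary = daily_summary(sample)
print(summary)
```

Note that a day with no recorded collisions never appears in the counts, so a real implementation might need to fill in zero-count days before computing the minimum and mean.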

On the left is a sidebar where the user can choose how to filter the data.

Feature Selection

A number of variables can be changed to select which subset of the data to analyze. The first is the category, where the user can choose to view all of the data or only collisions involving injuries, deaths, cyclists or pedestrians. Next is the borough selection; only one borough can be selected at a time. The user can then choose to view data from the entire year or focus on a particular month. When the monthly radio button is chosen, an additional drop-down for month selection appears. Finally, the user can focus on a particular time range within the day. The hours selected are shown below the drop-down.
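To make the filtering concrete, here is a minimal Python sketch (again, not the app's R code) of how sidebar selections like these could be turned into a parametrized SQL query. The table name `collisions`, the column names and the `CATEGORY_FILTERS` mapping are all assumptions made for the example; the real data set uses different column names.

```python
# Hypothetical mapping from the category drop-down to a SQL condition
CATEGORY_FILTERS = {
    "all": None,
    "injuries": "persons_injured > 0",
    "deaths": "persons_killed > 0",
    "cyclists": "(cyclists_injured + cyclists_killed) > 0",
    "pedestrians": "(pedestrians_injured + pedestrians_killed) > 0",
}


def build_query(category, borough, month=None, hours=(0, 23)):
    """Return (sql, params) for daily collision counts under the chosen filters."""
    if category not in CATEGORY_FILTERS:
        raise ValueError(f"unknown category: {category}")
    # Borough and hour range always apply; values go in as bound parameters
    where = ["borough = ?", "hour BETWEEN ? AND ?"]
    params = [borough, hours[0], hours[1]]
    if CATEGORY_FILTERS[category]:
        where.append(CATEGORY_FILTERS[category])
    if month is not None:              # only set when the monthly option is chosen
        where.append("month = ?")
        params.append(month)
    sql = ("SELECT date, COUNT(*) AS n FROM collisions WHERE "
           + " AND ".join(where) + " GROUP BY date ORDER BY date")
    return sql, params


sql, params = build_query("cyclists", "BROOKLYN", month=7, hours=(8, 10))
print(sql)
print(params)
```

Keeping user-chosen values in `params` rather than pasting them into the SQL string avoids quoting problems, and the category conditions come only from a fixed dictionary.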


I built the app in Shiny for R, starting with a navbar page so I could add tabs with their own pages if necessary. I originally had a separate page for each category, but condensed them into one page using the drop-down box and a switch command to simplify the app and try to cut down on memory usage. However, I did have to import the CSS dependencies from shinydashboard to use the infoBox tool that showcases the statistics, as that command is unique to that library.

I converted the CSV from NYC Open Data to a SQLite database and built functions to construct the queries used to call the data. Compared with reading the CSV directly, this reduces memory usage by limiting how much data is loaded into memory. The app does take a second on the initial load to get everything into memory, but it loads quickly thereafter. The functions that connect to the database and submit the queries can be found in the queryfunc.R file.
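The general idea of moving a CSV into SQLite and fetching only aggregated rows can be sketched as follows. This is a Python illustration under stated assumptions, not the contents of queryfunc.R: the two-column sample, the table name `collisions` and the helper names `load_csv` and `daily_counts` are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical two-column extract standing in for the full NYC Open Data CSV
CSV_TEXT = """date,borough
2017-01-01,BROOKLYN
2017-01-01,QUEENS
2017-01-02,BROOKLYN
2017-01-02,BROOKLYN
"""


def load_csv(conn, text):
    """Create the collisions table and insert every CSV row."""
    conn.execute("CREATE TABLE collisions (date TEXT, borough TEXT)")
    rows = [(r["date"], r["borough"]) for r in csv.DictReader(io.StringIO(text))]
    conn.executemany("INSERT INTO collisions VALUES (?, ?)", rows)
    conn.commit()


def daily_counts(conn, borough):
    """Fetch only the aggregated rows needed for the chart."""
    cur = conn.execute(
        "SELECT date, COUNT(*) FROM collisions WHERE borough = ? "
        "GROUP BY date ORDER BY date",
        (borough,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")     # a real app would point at a file on disk
load_csv(conn, CSV_TEXT)
counts = daily_counts(conn, "BROOKLYN")
print(counts)
```

Because the grouping happens inside SQLite, the program only ever holds one row per day for the selected borough, rather than every individual collision record.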

Future Analysis and Updates to App

In the app.R code you can see that I have built a map page that shows either each location of a collision or a heatmap of collisions in the time range and for the category and borough selected by the user. I have limited the selections to individual month/year combinations to keep the map as readable as possible. It helps to zoom in to get a clearer picture. The map works exactly as intended on its own. However, it causes the app to crash when both parts are brought together. Due to time constraints I chose to push the app as is and try to fix this issue in future updates.
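For readers curious about what a heatmap aggregates, here is a minimal Python sketch that bins coordinates into a grid and counts collisions per cell. It is not how the app's map is drawn; the function `bin_points`, the cell size and the sample coordinates are assumptions for illustration only.

```python
import math
from collections import Counter


def bin_points(points, cell_deg=0.01):
    """Count collisions per grid cell; each cell is cell_deg degrees on a side."""
    grid = Counter()
    for lat, lon in points:
        if lat is None or lon is None:
            continue                   # skip records with no usable coordinates
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        grid[cell] += 1
    return grid


# Hypothetical (latitude, longitude) pairs; the last record has no coordinates
points = [
    (40.7128, -74.0060),
    (40.7131, -74.0065),
    (40.7305, -73.9866),
    (None, None),
]
grid = bin_points(points)
print(grid.most_common(1))
```

A larger `cell_deg` gives a coarser, smoother picture, while a smaller one shows individual intersections at the cost of noisier counts.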

On the data side, I would like to fully clean the location data so that all observations can be used to give a more thorough view of where collisions occur. Time series analysis and predictions are also worth looking into and adding.

Shiny app:

GitHub repository:

About Author

Gregory Brucchieri

Gregory has a Master of Arts in Economics from NYU. He is a former business analyst with Humana, Inc, where he maintained provider relations and contract databases for smaller, local networks Humana had paired with. He is driven...