Data Analysis on Car Accidents in the United States

Posted on Jan 31, 2020

App | Github | Data

Introduction

You have probably witnessed or encountered at least a few car accidents in your life. When my friend was driving me home one snowy night in New York City, we saw 2 car accidents within a few miles on the highway. I started wondering about the data behind such accidents: Which state in the US has the most car accidents? How do weather conditions contribute to the number of car accidents? Do more severe car accidents happen during the day or at night?

I analyzed a Kaggle dataset of about 3 million car accidents and created an interactive app that gives users insight into car accidents in the US through various visualization tools. This web app may be useful to a range of people, from day-to-day drivers to traffic enforcement agencies to car insurance companies.

 

The Project

I adopted a three-file R structure for this project: global.R, ui.R, and server.R. I also created a helper.R file to pre-process the huge raw data file (1GB) from Kaggle. helper.R cleans and processes the data and writes the result to a csv file for the actual project. global.R reads the csv output of helper.R into a data table and assigns it to a variable for easy access.

global.R also imports all the needed libraries, so they do not have to be imported repeatedly. The ui.R file is responsible for the entire user interface; I chose the navbarPage layout to organize this web app. server.R is the brain of the project, where all the calculations and chart construction occur. It also ensures the outputs are rendered properly for display in the UI.
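
The pre-processing step described above can be illustrated with a minimal sketch. The app itself is written in R, so this is only a Python stand-in for the idea behind helper.R: stream the raw file, keep a few columns, drop incomplete rows, and write a smaller csv. The column names and the `preprocess` function are assumptions for illustration, not taken from the actual project.

```python
import csv
import io

# Hypothetical column names; the real raw file may use different ones.
KEEP_COLUMNS = ["State", "Severity", "Start_Time"]

def preprocess(raw_file, out_file, keep=KEEP_COLUMNS):
    """Stream rows from raw_file, keep selected columns, skip incomplete rows.

    Returns the number of rows written to out_file.
    """
    reader = csv.DictReader(raw_file)
    writer = csv.DictWriter(out_file, fieldnames=keep, lineterminator="\n")
    writer.writeheader()
    written = 0
    for row in reader:
        slim = {col: (row.get(col) or "").strip() for col in keep}
        if all(slim.values()):  # drop rows missing any kept field
            writer.writerow(slim)
            written += 1
    return written

# Tiny in-memory stand-in for the large raw file (made-up rows).
raw = io.StringIO(
    "ID,State,Severity,Start_Time,Description\n"
    "A-1,NY,2,2019-12-01 08:00:00,Accident on I-95\n"
    "A-2,,3,2019-12-02 09:30:00,Missing state\n"
    "A-3,CA,4,2019-12-03 22:15:00,Accident on US-101\n"
)
cleaned = io.StringIO()
rows_written = preprocess(raw, cleaned)
print(rows_written)
print(cleaned.getvalue())
```

Processing the file row by row keeps memory use flat regardless of file size, which is one reasonable way to handle a 1GB input before the app ever loads it.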

 

The App

In this section, I will discuss the main features of the app and the logic of the calculations. The app can be divided into 3 main parts: the overview panel, explore panel, and the data panel. The “Overview” panel provides a holistic view of the car accidents nation-wide.

The Data 

As shown in Figure 1, this panel shows a US map with the number of car accidents aggregated for each of 49 jurisdictions (the 48 contiguous states, which excludes Hawaii and Alaska, plus DC). Users can select a year range from 2017 to 2019, inclusive, and can choose to view either the count of accidents or accidents per 1000 residents. The per capita option lets users view car accidents relative to each state’s population. The overview also displays the top three states with the most car accidents based on the user’s selection.


 

Figure 1: Car accident counts and counts per capita for each of the contiguous US states.
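
To make the difference between the two views concrete, here is a small Python sketch (the app itself is in R) of the per-1000-residents calculation and the top-three ranking. The state codes, counts, and populations are made up for illustration, and the function names are hypothetical.

```python
def accidents_per_1k(counts, population):
    """Return accidents per 1,000 residents for states with a known population."""
    return {
        state: 1000 * counts[state] / population[state]
        for state in counts
        if population.get(state)
    }

def top_n(values, n=3):
    """Return the n (state, value) pairs with the largest values."""
    return sorted(values.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Made-up numbers, not real accident or census data.
counts = {"AA": 500, "BB": 300, "CC": 120, "DD": 90}
population = {"AA": 100_000, "BB": 20_000, "CC": 10_000, "DD": 30_000}

rates = accidents_per_1k(counts, population)
top_by_count = [state for state, _ in top_n(counts)]
top_by_rate = [state for state, _ in top_n(rates)]
print(top_by_count)  # ['AA', 'BB', 'CC']
print(top_by_rate)   # ['BB', 'CC', 'AA']
```

The two rankings differ because a large state can lead on raw counts while a smaller state leads once counts are scaled by population.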

The “Explore” panel allows users to dive into a specific state for additional details. For the most part, this panel lets users slice the data by State, Date Range, and Severity Level. It also offers various visualizations based on the user’s selection, one of which is the leaflet map view (see Figure 2 below). The map lets users zoom in to a more granular level, such as county, city, highway, or street, to see where car accidents occurred for the specified state, date range, and severity level. The Top 3 counties output also updates according to the user’s selection.


Figure 2: Car accidents map in New York City from 12/1/2019 to 12/31/2019.

Other tabs include summary, variables, trends, and interesting findings. The summary tab displays a bar chart of monthly average car accident counts for the selected state and severity. It’s worth noting that the monthly average chart ignores the Date Range selection and instead calculates the monthly average from all the available data. For example, the January average for car accident counts in NY is the mean of the car accident counts in Jan 2017, Jan 2018, and Jan 2019.

The second bar chart shows similar information but divides each bar into Day and Night, so a user can see how time of day relates to accidents. The variables tab shows a heatmap and histogram for the user’s choice of weather variables. Users can discover the weather conditions under which car accidents happen most frequently and may uncover interesting underlying relationships between weather conditions and car accidents.

The trends tab presents a time series chart of accident occurrences over time for the specified state, date range, and severity. The Google chart offers great zoom options (as shown in Figure 3), so users can quickly focus the chart on the desired time period. There is also a one-off tab, “interesting findings”, that requires no user input and displays the relationship between each state’s car accidents per capita (%) and its car insurance premium.
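
The monthly average described above can be sketched as follows. This is a Python illustration rather than the app's R code, and the counts are made up; the point is only that each calendar month is averaged across all available years.

```python
from collections import defaultdict

def monthly_average(monthly_counts):
    """Map {(year, month): count} to {month: mean count across years}."""
    by_month = defaultdict(list)
    for (_year, month), count in monthly_counts.items():
        by_month[month].append(count)
    return {month: sum(vals) / len(vals) for month, vals in sorted(by_month.items())}

# Made-up counts for one state, keyed by (year, month).
monthly_counts = {
    (2017, 1): 900, (2018, 1): 1200, (2019, 1): 1500,
    (2017, 2): 800, (2018, 2): 1000, (2019, 2): 1200,
}
averages = monthly_average(monthly_counts)
print(averages)  # {1: 1200.0, 2: 1000.0}
```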


Figure 3: Time series chart for car accident count.

The “Data” panel displays all the data used for this project. Shiny’s renderDataTable provides great search functionality, so users can quickly search for data of interest. I used 3 different datasets for this project: car accident data from Kaggle, population data from census.gov, and insurance data from insure.com.

 

Findings from the Data

Although California appears to have the most car accidents in the US, at around 523k from 2017 to 2019, South Carolina actually beats California in car accidents per 1k residents at around 28.19, meaning around 28 car accidents occurred per 1k residents over those 3 years. Oregon has also seen a drastic increase in car accidents per 1k residents, from 1.49 in 2017 to 5.38 in 2018 to 9.76 in 2019! Here are the top 7 states with the highest car accidents per 1k residents and their changes over the 3 years:

[Figure: top 7 states by car accidents per 1k residents and their changes from 2017 to 2019]

From the leaflet map view in the Explore panel, I also observed that most car accidents occur on major highways and in major cities. From the bar charts, we can see that the most accidents occur in October, and more severe car accidents occur during the night time.

If you play around with the weather variables, you will find that most accidents occur at higher humidity levels. But it would be naive to conclude that humidity is directly related to accidents without knowing the distribution of humidity in a given year. This is one of the aspects I wish I had more time to work on: overlaying the distribution of each weather variable to understand whether there could be a correlation between any of the weather variables and car accident occurrence. Furthermore, it would be even more effective to reduce the weather variables using dimension reduction, as most of them will be correlated with one another.
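
A small Python sketch shows why the humidity distribution matters. All numbers here are invented, and the "hours observed per humidity bin" baseline is an assumption about what such an overlay might use, not data from the project. Dividing accident counts by how often each humidity level occurs turns raw counts into rates that can be compared fairly.

```python
def rate_per_exposure(accidents_by_bin, hours_by_bin):
    """Return accidents per observed hour for each bin with nonzero exposure."""
    return {
        bin_label: accidents_by_bin[bin_label] / hours_by_bin[bin_label]
        for bin_label in accidents_by_bin
        if hours_by_bin.get(bin_label)
    }

# Invented data: raw counts peak in the highest humidity bin...
accidents_by_bin = {"0-40%": 100, "40-70%": 300, "70-100%": 600}
# ...but so does the (hypothetical) number of hours spent at that humidity.
hours_by_bin = {"0-40%": 1000, "40-70%": 3000, "70-100%": 6000}

humidity_rates = rate_per_exposure(accidents_by_bin, hours_by_bin)
busiest_bin = max(accidents_by_bin, key=accidents_by_bin.get)
print(busiest_bin)     # 70-100%
print(humidity_rates)  # every bin works out to about 0.1 accidents per hour
```

In this made-up case the high-humidity bin has the most accidents only because high humidity is the most common condition; the per-hour rate is the same in every bin.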

 

Conclusion and Future Improvements

It was an interesting experiment building my first Shiny app. I uncovered many interesting facts about car accidents in the US. While working on this app, I learned the importance of data cleaning and exception handling, as well as the importance of storing data in an intelligent way to improve efficiency while building an application.

In the future, I wish to obtain more relevant data, such as the number of drivers and average driver age for each state and the age of the drivers involved in accidents. Joining this extra data would let me analyze further and unveil more interesting relationships between driver information and car accident data. I’d also like to apply the aforementioned dimension reduction technique to the redundant weather variables, which would allow us to better determine the relationship between weather variables and car accident occurrence.

About Author

Melanie Zheng

Melanie is currently enrolled in Georgia Institute of Technology for Master's Degree in Computer Science with Machine Learning specialization. She previously worked as a product manager at Viacom and project manager at Citigroup.