Data Analysis on Car Accidents in the United States
You have probably witnessed or encountered at least a few car accidents in your life. When my friend was driving me home one snowy night in New York City, we saw 2 car accidents within a few miles on the highway. I started wondering about the data behind them: Which state in the US has the most car accidents? How do weather conditions contribute to the number of car accidents? Do more severe car accidents happen during the day or at night?
I analyzed a Kaggle dataset of about 3 million car accidents and created an interactive app that gives users insight into car accidents in the US through various visualization tools. This web app may be useful for a range of people, from day-to-day drivers to traffic enforcement agencies to car insurance companies.
I adopted the three-file R structure for this project: global.R, ui.R, and server.R. I also created a helper.R file to pre-process the huge raw data file (1GB) from Kaggle. helper.R cleans and processes the data and writes the result to a csv file for the actual project. global.R takes the csv output file from helper.R, imports the data into a data table, and assigns it to a variable for easy access.
global.R also imports all the needed libraries, so they can be used without being imported repeatedly. The ui.R file is responsible for the entire user interface. I chose the navbarPage layout to organize this web app. server.R is the brain of the project, where all the calculations and chart construction occur. It also ensures the outputs are rendered properly for display in the UI.
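The pre-processing idea behind helper.R (read the raw file once, keep only what the app needs, write a smaller csv) can be sketched as follows. The project itself is written in R; this Python version is only an illustration, and the column names, the sample rows, and the rule of skipping rows with blank fields are assumptions rather than the actual cleaning logic.

```python
import csv
import io

# Hypothetical column names; the real project may keep different ones.
KEEP = ["State", "Start_Time", "Severity"]

def preprocess(raw_file, out_file, keep=KEEP):
    """Copy only the `keep` columns, skipping rows where any of them is blank."""
    reader = csv.DictReader(raw_file)
    writer = csv.DictWriter(out_file, fieldnames=keep, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        # Treat a missing or whitespace-only value as blank.
        if all((row.get(col) or "").strip() for col in keep):
            writer.writerow({col: row[col].strip() for col in keep})
            kept += 1
    return kept

# Made-up sample rows standing in for the large raw file.
raw = io.StringIO(
    "ID,State,Start_Time,Severity,Notes\n"
    "1,NY,2019-12-01 08:15:00,2,unused\n"
    "2,,2019-12-02 17:40:00,3,missing state\n"
    "3,CA,2018-06-11 23:05:00,4,unused\n"
)
cleaned = io.StringIO()
rows_kept = preprocess(raw, cleaned)
print(rows_kept)
print(cleaned.getvalue())
```

Doing this once up front means the app only ever loads the smaller cleaned file at startup.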
In this section, I will discuss the main features of the app and the logic behind its calculations. The app has three main parts: the Overview panel, the Explore panel, and the Data panel. The “Overview” panel provides a holistic, nationwide view of car accidents.
As shown in Figure 1, this panel shows a US map with car accident numbers aggregated for each of 49 regions: the 48 contiguous states (Hawaii and Alaska are excluded) plus DC. Users can select a range of years from 2017 to 2019, inclusive, and can view either the raw count of accidents or accidents per 1,000 residents. The per-capita option lets users compare states relative to their population. The overview also displays the three states with the most car accidents under the user’s selection.
Figure 1: Car accident counts and counts per capita for each of the contiguous US states.
The “Explore” panel allows users to dive into a specific state for additional details. For the most part, this panel lets users slice the data by State, Date Range, and Severity Level. It also offers various visualizations based on the user’s selection, one of which is the leaflet map view (see Figure 2 below). The map allows users to zoom in to a more granular level, such as county, city, highway, or street, to see where car accidents occur for the specified state, date range, and severity level. The Top 3 counties output also updates according to the user’s selection.
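The per-capita view comes down to dividing each state's accident count by its population and scaling to 1,000 residents. The app does this in R; the Python sketch below only illustrates the arithmetic and the top-three ranking. The state codes, counts, and populations are made-up placeholders, not values from the Kaggle or census data.

```python
def accidents_per_1k(counts, populations):
    """Accidents per 1,000 residents for each state present in both inputs."""
    return {
        state: 1000 * counts[state] / populations[state]
        for state in counts
        if state in populations and populations[state] > 0
    }

def top_states(values, n=3):
    """Return the n states with the highest value, largest first."""
    return sorted(values, key=values.get, reverse=True)[:n]

# Placeholder numbers for illustration only (hypothetical, not real data).
counts = {"AA": 500_000, "BB": 140_000, "CC": 20_000, "DD": 90_000}
populations = {"AA": 40_000_000, "BB": 5_000_000, "CC": 4_000_000, "DD": 10_000_000}

rates = accidents_per_1k(counts, populations)
print(top_states(counts))  # ranking by raw count
print(top_states(rates))   # ranking by accidents per 1,000 residents
```

With these placeholder numbers, the state with the largest raw count is not the one with the highest rate, which is exactly the kind of difference the per-capita option is meant to expose.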
Figure 2: Car accidents map in New York City from 12/1/2019 to 12/31/2019.
Other tabs include summary, variables, trends, and interesting findings. The summary tab displays a bar chart of monthly average car accident counts for the selected state and severity. It’s worth noting that this chart ignores the Date Range choice and instead calculates the monthly average from all the available data. For example, the January average for car accident counts in NY is the mean of the car accident counts in Jan 2017, Jan 2018, and Jan 2019.
The second bar chart shows similar information but divides each bar into Day and Night, enabling users to see how time of day relates to accidents. The variables tab shows a heatmap and histogram for the user’s choice of weather variables. Users can discover the weather conditions under which car accidents happen most frequently and may uncover some interesting underlying relationships between weather conditions and car accidents.
The trends tab presents a time series chart of accident occurrence over time for the specified state, date range, and severity. The Google chart offers convenient zoom options (as shown in Figure 3), so users can quickly focus on the desired time period. There is also a one-off tab, “interesting findings”, that requires no user input and displays the relationship between each state’s car accidents per capita (%) and its car insurance premium.
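The monthly average described above can be sketched in two steps: count accidents per (year, month), then average the counts that share a calendar month. This Python version is only an illustration of that logic (the app computes it in R), and the sample dates are made up. One simplifying assumption: a month with no accidents in a given year is left out of the average rather than counted as zero.

```python
from collections import Counter, defaultdict
from datetime import date

def monthly_average(accident_dates):
    """Average accidents per calendar month across all years in the data."""
    # Step 1: count accidents for every (year, month) pair.
    per_year_month = Counter((d.year, d.month) for d in accident_dates)
    # Step 2: group those counts by calendar month and take the mean.
    by_month = defaultdict(list)
    for (year, month), n in per_year_month.items():
        by_month[month].append(n)
    return {month: sum(ns) / len(ns) for month, ns in sorted(by_month.items())}

# Made-up dates: 2, 4, and 3 accidents in January 2017, 2018, and 2019,
# plus a single accident in February 2017.
dates = (
    [date(2017, 1, 5)] * 2
    + [date(2018, 1, 9)] * 4
    + [date(2019, 1, 20)] * 3
    + [date(2017, 2, 1)]
)
averages = monthly_average(dates)
print(averages)
```

Here the January value is the mean of 2, 4, and 3, mirroring the Jan 2017 / Jan 2018 / Jan 2019 example.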
Figure 3: Time series chart for car accident count.
The “Data” panel displays all the data used for this project. Shiny’s renderDataTable provides a handy search box, so users can quickly find the data of interest. I used 3 different datasets for this project: car accident data from Kaggle, population data from census.gov, and insurance data from insure.com.
Findings from the Data
Although California appears to have the most car accidents in the US, at around 523k from 2017 to 2019, South Carolina actually beats California in car accidents per 1k residents at around 28.19, meaning that around 28 car accidents occurred per 1k residents over those 3 years. Oregon has also seen a drastic increase in car accidents per 1k residents, from 1.49 in 2017 to 5.38 in 2018 to 9.76 in 2019! Here are the top 7 states with the highest car accidents per 1k and their changes over the past 3 years:
From the leaflet map view in the Explore panel, I also observed that most car accidents occur on major highways and in major cities. From the bar charts, we can see that the most accidents occur in October, and that more severe car accidents occur at night.
If you play around with the weather variables, you will find that most accidents occur at higher humidity levels. But it would be naive to conclude that there is a direct relationship here without knowing how humidity itself is distributed over a given year: if humid conditions are simply more common, more accidents will fall there regardless. This is one of the aspects I wish I had more time to work on: overlaying the distribution of each weather variable to understand whether it is actually correlated with car accident occurrence. Furthermore, it would be even more effective to reduce the weather variables using dimension reduction, as most of them are likely correlated with one another.
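One way to account for the underlying humidity distribution, sketched here in Python purely as an illustration (the app does not do this), is to bin the accidents by humidity, bin a baseline of general weather observations the same way, and divide one by the other. The baseline observations are a hypothetical extra dataset that the project does not currently include, and all numbers below are made up.

```python
from collections import Counter

def bin_label(humidity, width=20):
    """Map a humidity percentage (0-100) to the lower edge of its bin."""
    # A reading of exactly 100 is folded into the top bin.
    return min(int(humidity // width) * width, 100 - width)

def accidents_per_observation(accident_humidity, baseline_humidity, width=20):
    """Accident count in each humidity bin divided by baseline observations there."""
    acc = Counter(bin_label(h, width) for h in accident_humidity)
    base = Counter(bin_label(h, width) for h in baseline_humidity)
    return {b: acc.get(b, 0) / base[b] for b in sorted(base)}

# Made-up readings: humid conditions dominate both lists.
accident_humidity = [85, 90, 95, 88, 30, 35]  # humidity at each accident
baseline_humidity = [90] * 8 + [25] * 2       # hypothetical weather log

ratio = accidents_per_observation(accident_humidity, baseline_humidity)
print(ratio)
```

In this toy data, most accidents fall in the humid bin, yet the accidents-per-observation ratio is higher in the drier bin, because humid readings are simply far more common in the baseline.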
Conclusion and Future Improvements
Building my first Shiny app was an interesting experiment. I uncovered many interesting facts about car accidents in the US. While working on this app, I learned the importance of data cleaning and exception handling, as well as the importance of storing data intelligently to improve efficiency while building an application.
In the future, I would like to obtain more relevant data, such as the number of drivers and the average driver age for each state, as well as the age of the drivers involved in each accident. Joining these to the existing data would let me analyze and unveil more relationships between driver information and car accidents. I’d also like to apply the dimension reduction mentioned above to remove redundant weather variables, which should make it easier to determine the relationship between the weather variables and car accident occurrence.