Data Analysis on Car Accidents in the US

Zachary MacTaggart

Posted on Jan 4, 2024

Exploring Car Accidents through Data Analysis

Project Introduction

Thousands of car accidents occur every single day in the United States. An analysis of the causes of these accidents could help drivers avoid danger and, possibly, save lives. This project aims to provide insight into which variables affect car accident frequency. I drew on a Kaggle dataset of approximately 7.7 million accident records from 2016 to 2023. The data includes features such as accident severity, proximity to traffic features, location details, time details, accident descriptions, and weather conditions.

Dataset Transformation

The original dataset was reduced so that it would perform well in an R-Shiny application. All records with missing or incomplete values were removed, and the analysis focused on the most complete years, 2020 to 2022. That left about 3.1 million accident records, which was still too large for the R-Shiny application. To reduce the size further, the data was sampled with stratification by state. This yielded a more streamlined dataset of 313,000 accident records that allowed the application to run relatively smoothly.
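The cleaning and sampling steps can be sketched as follows. This is a minimal illustration in Python rather than the R code behind the app, and the field names (`state`, `year`, `severity`) and the toy records are assumptions. The figures above imply a sampling fraction of roughly 10% (313,000 out of about 3.1 million); the exact fraction and sampling routine used in the project are not stated.

```python
import random
from collections import defaultdict


def clean_and_filter(records, years=(2020, 2021, 2022)):
    """Drop records with any missing field, then keep only the chosen years."""
    return [
        r for r in records
        if all(v is not None and v != "" for v in r.values())
        and r["year"] in years
    ]


def stratified_sample(records, key, fraction, seed=0):
    """Sample roughly `fraction` of the records within each group of `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    sample = []
    for group in sorted(groups):
        rows = groups[group]
        # Keep at least one record per group so small states are not dropped.
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample


# Toy data with hypothetical field names.
records = (
    [{"state": "NY", "year": 2021, "severity": 2} for _ in range(100)]
    + [{"state": "SC", "year": 2022, "severity": 3} for _ in range(50)]
    + [{"state": "NY", "year": 2017, "severity": 2} for _ in range(30)]     # outside 2020-2022
    + [{"state": "SC", "year": 2021, "severity": None} for _ in range(20)]  # incomplete
)

cleaned = clean_and_filter(records)
sample = stratified_sample(cleaned, key="state", fraction=0.1)
print(len(cleaned), len(sample))
```

Sampling within each state keeps every state's share of the data about the same as in the full dataset, which matters for the per-state views later in the app.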

Variable Explanation

Some variables included in this dataset are accident severity, proximity to traffic features, location details, time details, accident description, and weather conditions at the time of the accident.

Accident Severity is a rating based on the extent to which the accident impacted nearby traffic on a scale of 1 to 4.
Traffic Feature Proximity indicates whether the accident occurred at a traffic control feature such as a stop light or junction.
Location Details for each accident included the latitude and longitude of the accident.
Time Details included the hour, day, month, and year of the accident.
The accident records also came with a brief description of the accident, often including the road names and other noteworthy information about nearby traffic impact.
Weather Conditions at the time of the accident were also included. This variable was condensed into broader categories for clarity and ease of use. Conditions such as light rain and heavy rain were grouped under their broader category, 'Rainy Conditions', and so on for other types of weather.
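One way to condense detailed weather labels into broader categories is simple keyword matching, sketched below in Python. Only 'Rainy Conditions' comes from the project itself; the other category names, the keywords, and the function name are illustrative assumptions, not the mapping used in the app.

```python
def broad_weather_category(condition):
    """Map a detailed weather label to a broader category by keyword."""
    # Keyword/category pairs are assumptions for illustration, checked in order.
    rules = [
        ("thunder", "Stormy Conditions"),
        ("rain", "Rainy Conditions"),
        ("drizzle", "Rainy Conditions"),
        ("snow", "Snowy Conditions"),
        ("fog", "Foggy Conditions"),
        ("cloud", "Cloudy Conditions"),
        ("clear", "Clear Conditions"),
    ]
    label = condition.lower()
    for keyword, category in rules:
        if keyword in label:
            return category
    return "Other"


for raw in ["Light Rain", "Heavy Rain", "Light Snow", "Smoke"]:
    print(raw, "->", broad_weather_category(raw))
```

Because the rules are checked in order, a label that matches several keywords falls into the first matching category, so more specific keywords should come first.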

Leaflet Map

The first tab of the application features a Leaflet map showing all 313,000 sampled accidents by latitude and longitude. Clusters are used to manage the number of data points, and users can zoom in to see individual accidents. Severity is color-coded for quick reference (see key in the bottom right). This map allows users to see locations with frequent accidents and areas where severe incidents are more prevalent. Below is a glimpse of clusters near NYC, emphasizing the high frequency of accidents in the city.

Exploring the Data with Variable Manipulation

The remaining data tabs in the application allow users to manipulate different variables, letting them observe how the resulting graphs change. Below is a snippet of the variables that users can change and select.

Accidents by Day

The first data tab explores accident frequency by the day of the week as a bar graph with weekdays in light purple and weekends in a darker shade of purple. The example below displays accidents at 2 AM, showing the increase in accidents during this timeframe on the weekends.
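The counting behind a chart like this can be sketched as follows: keep only accidents in the selected hour, then tally them by day of the week. This is a Python illustration with made-up timestamps, not the app's R code, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def accidents_by_weekday(timestamps, hour):
    """Count accidents per weekday, restricted to one hour of the day (0-23)."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    counts = Counter(ts.weekday() for ts in timestamps if ts.hour == hour)
    return {day: counts[i] for i, day in enumerate(DAYS)}


# Made-up timestamps; 2022-01-01 was a Saturday.
timestamps = [
    datetime(2022, 1, 1, 2, 15),
    datetime(2022, 1, 1, 2, 40),
    datetime(2022, 1, 2, 2, 5),
    datetime(2022, 1, 3, 2, 30),
    datetime(2022, 1, 3, 14, 0),   # excluded: not in the 2 AM hour
]

by_day = accidents_by_weekday(timestamps, hour=2)
weekend_total = by_day["Saturday"] + by_day["Sunday"]
print(by_day, weekend_total)
```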

Accidents by Month

The second data tab explores accident frequency by month of the year, with seasonal color distinctions. Below is a snippet of this graph displaying the frequency for each month. The graph illustrates an increase in accidents towards late fall and early winter, suggesting weather conditions may impact accident frequency during those months.

Accidents by Year

The third data tab displays accident frequency trends over the years, indicating an overall increase, likely influenced by factors such as COVID-related travel restrictions easing.

Traffic Control Features

The fourth data tab explores accident frequency by proximity to traffic control features, showing the top four features across all accidents. Users can filter by variables such as state to see which feature has the highest accident frequency under that selection.
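A ranking like this can be sketched by counting, for each feature, how many accidents were flagged as occurring near it, with an optional state filter. The Python below is an illustration only; the feature names, field names, and toy records are assumptions and not the app's actual columns.

```python
from collections import Counter

# Hypothetical boolean feature flags on each accident record.
FEATURES = ["junction", "traffic_signal", "crossing", "stop"]


def top_features(records, state=None, n=4):
    """Return the n most frequent traffic features, optionally for one state."""
    counts = Counter()
    for rec in records:
        if state is not None and rec["state"] != state:
            continue
        for feature in FEATURES:
            if rec.get(feature):
                counts[feature] += 1
    return counts.most_common(n)


# Toy records for illustration.
records = [
    {"state": "NY", "junction": True, "traffic_signal": True},
    {"state": "NY", "traffic_signal": True},
    {"state": "NY", "traffic_signal": True, "crossing": True},
    {"state": "SC", "junction": True},
    {"state": "SC", "junction": True, "stop": True},
]

print(top_features(records, state="NY"))
print(top_features(records, state="SC"))
```

Raw counts like these do not account for how many of each feature a state has, which is why ratios would be more informative.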

Accidents by Population

The last data tab explores accidents adjusted for the population of each state and displays the top five and bottom five states by accident frequency.
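Adjusting for population amounts to dividing each state's accident count by its population and then ranking the results. The sketch below uses Python with fictional states and made-up numbers. The per-100,000 scaling and the function names are assumptions, since the exact normalization used in the app is not stated.

```python
def accidents_per_100k(accident_counts, populations):
    """Return (state, rate) pairs sorted from highest to lowest rate."""
    rates = {
        state: accident_counts[state] * 100_000 / populations[state]
        for state in accident_counts
        if state in populations
    }
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


def top_and_bottom(ranked, n=5):
    """Split a ranked list into its first n and last n entries."""
    return ranked[:n], ranked[-n:]


# Fictional states and made-up numbers, for illustration only.
accident_counts = {"State A": 500, "State B": 300, "State C": 50}
populations = {"State A": 1_000_000, "State B": 200_000, "State C": 500_000}

ranked = accidents_per_100k(accident_counts, populations)
top, bottom = top_and_bottom(ranked, n=1)
print(top, bottom)
```

In this toy example State A has the most accidents in absolute terms, but State B ranks first once population is taken into account.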

Conclusions and Insights

Insights:

Accidents during the night increase on the weekends compared to the weekdays. Drivers should take extra caution when driving at night, specifically during late-night and early-morning hours.
Accidents during the day happen more frequently on weekdays. Specific times, such as morning hours or rush hours, show higher accident rates on weekdays when more people are commuting to work. Drivers should take extra caution during these times.
Different states have different accident frequencies at different traffic control features. This is likely because certain states have more of these features than others. Adding ratios for this feature from a geographic API would provide more insight. Drivers should still use extra caution at the traffic control feature that has the most frequent accidents in their state.
Filtering by year allows users to see the change in accident trends from the height of the pandemic to 2022. Overall accident frequency increased over that period.
One unexpected state in the top five for accidents per population was South Carolina. Looking deeper into the data for SC, we can see that it has a steadier frequency of accidents later at night, which could be one of the reasons it ranks among the top states for accidents per population. More features and ratios need to be explored to get a better idea of why.

This information, along with an analysis of how accident patterns change over the years, can help drivers make more informed decisions about when and where to drive. It can also help police and medical professionals who respond to accidents plan their response and schedule staff. Adding other information, such as driver age, might also help determine whether and when older drivers should be re-assessed for their license so that they can continue to drive safely.

Future Work

Future work on this project could involve incorporating additional information from traffic or geographic APIs. For example, data on the number of lanes at each accident location could indicate whether more lanes have an impact on accident rates. Additional factors that could correlate with accidents include the number of traffic control features in each state (to calculate ratios for the insights above), posted speed limits at each accident location (to see if speed has any effect), and the proximity of the accident to places of interest, such as bars or other locations with higher alcohol use (to see if accidents occur more frequently near these locations).

Demographic information, such as age, could also provide valuable insights, though ethical considerations would need to be taken into account beforehand. Other future work includes using machine learning techniques to help predict accidents. For example, classification models could be used to predict accident severity from the other accident features in the dataset, which could lead to more robust and actionable insights.

I would also like to revisit this project in the future to do more extensive data grouping and gain more insight from the data. Finally, I would like to clean up the application by condensing the information in it and by removing unnecessary features in the data tabs.

Acknowledgments and Links

Links:

Github: zmactag/US-Car-Accidents-Shiny-App: Shiny app exploring a data set of car accident records in the United States. (github.com)

App Link: US Car Accidents (shinyapps.io)

Kaggle Dataset: US Accidents (2016 - 2023) (kaggle.com)

Acknowledgments:

Thank you to the authors for making this data publicly available and allowing me to transform and visualize it.

Sources:

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

