Data Study on Tornado Damage Since 1996
Introduction - Data Set Overview
Data shows that tornadoes occur all across the US almost all year round. Some states, such as Alabama and Oklahoma, see them frequently, while others, like Massachusetts, see only a couple every year. Because of this, different states have different levels of preparedness for tornadoes.
To analyze this, I started with a data set from the Storm Prediction Center (SPC), consisting of tornadoes across the United States from 1950 to 2015. The set consisted of over 60,000 rows of storms with twenty-two columns. These columns included (but were not limited to) attributes such as date and time, state, starting and ending coordinates, injuries, fatalities, and property loss.
However, not all of this data proved to be useful. According to the set description, prior to 1996 property loss was recorded on a 1-9 scale, where each step represented a range of dollar values. The digit 2 represented $50-$500, 3 represented $500-$5,000, and so on in powers of 10. From 1996 onward, property loss was listed in millions of dollars. Because the two encodings cannot be compared directly, the set was limited to storms from 1996 onward.
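To make the pre-1996 encoding concrete, here is a minimal sketch in Python (not the code used for this project) that turns a scale digit into its dollar range. The function name `loss_range_for_code` is hypothetical, and the sketch assumes the powers-of-10 pattern holds for every step from 2 upward. Step 1 is not described above, so it is rejected rather than guessed.

```python
def loss_range_for_code(code):
    """Return the (low, high) dollar range for a pre-1996 property loss code.

    Assumption: every step from 2 to 9 follows the pattern quoted in the text
    (2 -> $50-$500, 3 -> $500-$5,000, ...). Step 1 is not described there,
    so it is treated as unsupported here.
    """
    if not 2 <= code <= 9:
        raise ValueError(f"unsupported loss code: {code}")
    low = 5 * 10 ** (code - 1)  # 2 -> 50, 3 -> 500, ...
    return low, low * 10        # each range spans one power of 10


if __name__ == "__main__":
    for code in (2, 3, 4):
        print(code, loss_range_for_code(code))
```

Because each code only narrows the loss down to a factor-of-ten range, averaging these values alongside the exact post-1996 figures would not be meaningful, which is the reason for the filter.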
A second filter of the data was also added so that only tornadoes in the continental US were considered. This was largely so that the country could be viewed as a whole more easily, as in the map below. This led to a reduction of only thirty-three storms between 1996 and 2015.
One interesting thing to note from this map is that there are very few tornadoes west of the Rocky Mountains, although not none; most are concentrated in the Great Plains. Viewing the country at this distance, however, does not paint as interesting a picture as the state level does. This is the level on which the Shiny application focuses.
The Shiny Data Application
The main purpose of this application is to allow users to see how different states are affected by tornadoes. It is broken up into three sections: Map, State Comparison, and State at a Glance. The user can select these on the left side of the application. The left panel also allows the user to select the state they wish to focus on and to change options such as grouping data by year or by storm and including zero-valued data. These options are covered in more detail where they become relevant.
The map feature of the application shows the path taken by all tornadoes within the selected ranges in the map box, for the state selected in the left panel. The map box allows users to filter by severity, on the Fujita scale, as well as by the year the tornado took place. The default values here are all severity levels, and the years 2005-2015.
Not all tornadoes in the data set for these selections will show up. Tornadoes whose data appeared to be in error were left off the map. A tornado had to pass two tests to be shown. First, its coordinates had to be between the latitudes of 10° and 60° and the longitudes of -130° and -50°. Second, the Haversine distance between its starting and ending locations had to be within twenty percent of the recorded distance traveled. This filtered out some of the storms, but not so many that the map stopped being useful.
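The two tests can be sketched as follows. This is an illustrative Python version, not the application's own code. The function names are hypothetical, and the sketch assumes the recorded path length is in miles and that "within twenty percent" means the absolute difference is at most 20% of the recorded value.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumption: recorded path length is in miles


def haversine(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_MILES):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def in_bounds(lat, lon):
    """Test 1: the point lies inside the bounding box described in the text."""
    return 10.0 <= lat <= 60.0 and -130.0 <= lon <= -50.0


def is_plottable(start_lat, start_lon, end_lat, end_lon, recorded_length, tolerance=0.20):
    """Apply both tests to one tornado track."""
    if not (in_bounds(start_lat, start_lon) and in_bounds(end_lat, end_lon)):
        return False
    computed = haversine(start_lat, start_lon, end_lat, end_lon)
    # Test 2: computed distance within `tolerance` of the recorded distance.
    return abs(computed - recorded_length) <= tolerance * recorded_length


if __name__ == "__main__":
    # One degree of latitude is roughly 69 miles.
    print(is_plottable(35.0, -97.0, 36.0, -97.0, recorded_length=69.0))  # consistent track
    print(is_plottable(35.0, -97.0, 0.0, 0.0, recorded_length=69.0))     # end point out of bounds
```

The bounding-box test catches coordinates that are far outside the continental US, such as an end point stored as zeros, while the distance test catches tracks whose end points disagree with the recorded length.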
State Comparison Data
One interesting way to see how tornadoes affect a state is to compare the damage it endures, both financially and through casualties, with that of other states. This comparison is shown on the State Comparison page, which is itself broken up into three sections: Average Financial Loss Bar Plot, Average Casualties Bar Plot, and Position Scatter Plot. Here, casualties are defined as injuries plus fatalities. Also of note is that, for both financial loss and casualties, NA values were always removed. This is similar to how apparently incorrect data was filtered out of the map.
This section makes use of the options on the left-hand panel. First the user can change whether or not they include zero-values for attributes. By default, cases where there was no property loss and cases where there were no casualties were both left out. These were left out by default as they would artificially bring down the averages and it was more important to see what actual damage (either financial or casualties) was being done by the storms.
The options panel also allows the user to group by year or by storm. Grouping by year gives the average total damage per year (i.e., sum up all storms in each year, and then average those yearly totals), while grouping by storm gives the average damage done per storm. By default this is set to per year, so the user can compare it to yearly state budgets or similar expenditures.
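The two groupings, together with the NA and zero-value handling, can be sketched in Python as below. This is an illustration rather than the application's code. The record layout (`year` and `value` keys) and the function name `average_damage` are assumptions, and dropping zero values before the yearly sums are taken is one possible reading of the option.

```python
from collections import defaultdict
from statistics import mean


def average_damage(storms, group_by="year", include_zeros=False):
    """Average a damage measure (financial loss or casualties) for one state.

    `storms` is a list of dicts with hypothetical keys "year" and "value".
    A value of None stands in for NA and is always dropped.
    """
    rows = [(s["year"], s["value"]) for s in storms if s["value"] is not None]
    if not include_zeros:
        rows = [(year, value) for year, value in rows if value != 0]
    if not rows:
        return None

    if group_by == "storm":
        # Average damage done per storm.
        return mean([value for _, value in rows])
    if group_by == "year":
        # Sum all storms in each year, then average the yearly totals.
        totals = defaultdict(float)
        for year, value in rows:
            totals[year] += value
        return mean(list(totals.values()))
    raise ValueError("group_by must be 'year' or 'storm'")


if __name__ == "__main__":
    sample = [
        {"year": 2010, "value": 2.0},
        {"year": 2010, "value": 0.0},
        {"year": 2010, "value": None},
        {"year": 2011, "value": 4.0},
        {"year": 2011, "value": 6.0},
    ]
    print(average_damage(sample, group_by="year"))
    print(average_damage(sample, group_by="storm"))
    print(average_damage(sample, group_by="storm", include_zeros=True))
```

In this made-up sample, including the zero-valued storm lowers the per-storm average, which is the effect the default setting is meant to avoid.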
For ease of use, the selected state is always highlighted in the graphs. Further, in the bar plots, the bars shown are those four above and four below the selected state. In edge cases, more bars are shown on the unconstrained side. Alabama, for example, is all the way at the right, so more bars are shown to the left.
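One way to pick which bars to show is sketched below in Python. It is not the application's code. The function name `bar_window` is hypothetical, and the sketch assumes that "more bars on the unconstrained side" means the plot keeps a fixed total of nine bars by sliding the window away from the edge.

```python
def bar_window(ranked_states, selected, span=4):
    """Return the slice of a ranked state list to draw around `selected`.

    Normally this is `span` states on either side of the selection. Near an
    edge, the window slides so the total number of bars stays the same
    (an assumption about how the edge cases are handled).
    """
    index = ranked_states.index(selected)
    size = min(len(ranked_states), 2 * span + 1)
    # Clamp the window start so the slice never runs off either end.
    start = max(0, min(index - span, len(ranked_states) - size))
    return ranked_states[start:start + size]


if __name__ == "__main__":
    ranked = list("ABCDEFGHIJKL")   # placeholder labels, lowest to highest
    print(bar_window(ranked, "F"))  # centered window
    print(bar_window(ranked, "L"))  # selection at the right edge
```

With the selection at the far right, as with Alabama above, all of the extra bars fall to its left.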
By using these graphs, several interesting things can be seen. For both financial loss and casualties per year, Alabama, Oklahoma, and Missouri see the most damage. Texas is fourth for financial loss but drops rather far down for casualties, where it is replaced, somewhat unexpectedly, by Massachusetts, which is not even in the top 10 for damage per year. If grouping is changed to per storm instead of per year, some possible explanations start to show up.
As can be seen in the graphs, Massachusetts shoots to the top when looking at states on a per-storm basis. For average casualties per storm, Massachusetts has far more than any other state at 102, more than four times the next highest, Alabama, with only 25.
The final chart on this page shows the relative positions of each state from the graphs above, along with a trend line showing, as could be expected, that as financial damage increases, so does the number of casualties. However, some states see more financial damage, while others see more casualties. There is more scatter here when grouping per storm (left) as opposed to per year (right). In the charts below, Massachusetts is highlighted.
State at a Glance
The final page in the application is the State at a Glance page. This allows the user to see more state-specific data. Along the top, the average storms per year, average casualties per year, and average financial loss per year are shown. The latter two can be switched to per storm using the options on the left panel. As with the previous page, NA values are dropped, and zero-values are omitted by default but can be added in the options.
In the picture above, Massachusetts is shown to have only two storms a year on average. Compare this to other states, such as Alabama, which has 53. Note that in the graph above, the chart is in standard scale, while the chart below is set to log-log. This is because the lower-severity storms overlap and become harder to read.
Of note is that Alabama sees more tornadoes, and more severe tornadoes, but has fewer casualties per storm and less financial loss per storm. The graphs allow us to dig deeper and see that only the F4s and F5s in Alabama are more damaging than the F3s in Massachusetts, which are that state's most damaging. It is important to look at the per-storm damage here, as it shows more about how each storm affects the state.
Most likely, the effects we see in Massachusetts are the result of the state's preparedness rather than of how severe the storms are. Since states like Massachusetts rarely see tornadoes, they are less prepared for them, especially their citizens. States that see many storms are hurt only by the most severe ones.
By grouping data by storm rather than by year, interesting trends start to emerge. States that see little damage per year can see great damage per storm. It is clear that different states handle single storms very differently. This is where it would be worthwhile to dive deeper. For example, it would be interesting to look more into the frequency with which different states see different severities of storms.
This could also be followed up by looking at how much states spent during that time on storm preparedness. Are states that see less severe storms less prepared for any level of storm? Are there states that see roughly the same severity of storm, with the same frequency, but spend different amounts? Do they take different amounts of damage? By looking more into how each state fares on a strictly by-storm basis, these and many other questions could be answered.
The original data set can be found on Kaggle, with a description available here.