Food and health demographics in the USA
Contributed by Yannick Kimmel. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which runs from April 11th to July 1st, 2016. This post is based on his second class project, R Shiny (due in the 4th week of the program).
The culture of food and health, like other aspects of culture, is constantly changing and is diverse (that is, it has high variance) across the USA. Obesity affects roughly 1 in 3 Americans, while diabetes affects roughly 1 in 10. I wanted to understand the relationship between food and health demographics. I thought this data would be valuable to policy makers and civic leaders interested in improving health outcomes in their communities. The USDA's Food Environment Atlas is a great resource for county-specific data on this subject. The data can be found here. The Shiny app can be found here, and the project code here.
Three R files were used in this project. The helpers.R file is used to manipulate the county demographic data, create a county map, fit a regression model, and use the fitted model to predict obesity based on values inputted by the user.
Data is loaded from six CSV files and combined into a single data frame.
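A minimal sketch of how several county-level CSV files might be combined into one data frame. It assumes each file shares a county identifier column called `FIPS`. The file names, column names, and values here are hypothetical toy stand-ins, not the atlas data or the project's actual code, and only three files are used to keep the example short.

```r
# Toy stand-ins for the atlas tables (made-up values, hypothetical names).
tmp <- tempdir()
health <- data.frame(FIPS = c(1001, 1003, 1005), PCT_OBESE = c(30, 27, 37))
stores <- data.frame(FIPS = c(1001, 1003, 1005), GROC_PER_1000 = c(0.1, 0.2, 0.3))
socio  <- data.frame(FIPS = c(1001, 1003), POVRATE = c(12, 14))  # one county missing

write.csv(health, file.path(tmp, "health.csv"), row.names = FALSE)
write.csv(stores, file.path(tmp, "stores.csv"), row.names = FALSE)
write.csv(socio,  file.path(tmp, "socio.csv"),  row.names = FALSE)

# Read every file, then merge them pairwise on the shared county key.
files  <- file.path(tmp, c("health.csv", "stores.csv", "socio.csv"))
tables <- lapply(files, read.csv)
counties <- Reduce(function(x, y) merge(x, y, by = "FIPS", all = TRUE), tables)

print(counties)
```

Using `all = TRUE` keeps counties that are missing from one of the files and fills the gap with `NA`, which is one way incomplete rows can arise in the combined table.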
Then stepwise regression is used to fit a model of county obesity rates, and a prediction function is created so that obesity can be predicted from parameters changed by the user. Finally, the model diagnostics are checked.
A choropleth map is created using county-specific health parameters.
The server.R file has several functions: 1) Change the county map created in the helpers.R file depending on input by the user. 2) Create a Plotly scatter plot for the different demographic data. 3) Call the obesity model created in the helpers.R file, and change the predicted obesity rate given inputs by the user. 4) Render the data table that allows a user to look up data on a specific county. The code for the county map, Plotly scatter plot, and obesity prediction is available in the project code linked above.
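The fit-then-predict step can be sketched with base R. This is an illustration under stated assumptions, not the project's code: the data are simulated, the variable names are hypothetical, and AIC-based selection with `step()` is assumed because the post does not say which stepwise procedure was used.

```r
set.seed(1)
n <- 200

# Simulated county-like data (hypothetical variables, not the atlas data).
toy <- data.frame(
  poverty   = rnorm(n, mean = 15, sd = 5),
  fast_food = rnorm(n, mean = 0.6, sd = 0.2),
  noise     = rnorm(n)                      # unrelated to the response
)
toy$obesity <- 20 + 0.5 * toy$poverty + 5 * toy$fast_food + rnorm(n)

# Start from the full model and let step() add or drop terms by AIC.
full <- lm(obesity ~ ., data = toy)
fit  <- step(full, direction = "both", trace = 0)

# Hypothetical helper: predict obesity for one set of user-chosen values.
predict_obesity <- function(model, new_values) {
  unname(predict(model, newdata = as.data.frame(new_values)))
}

pred <- predict_obesity(fit, list(poverty = 20, fast_food = 0.8, noise = 0))
print(formula(fit))
print(pred)
```

In a Shiny app, the list passed to `predict_obesity()` would typically be built from the current slider inputs.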
USA county map
The first page of the Shiny app is a county-specific map of the continental USA. I chose to display three indicators of particular interest: obesity, diabetes, and poverty rates. Obesity and diabetes rates are generally higher in the Midwest and Southeast and lower on the East and West Coasts. Poverty rates generally seem higher in the southern part of the USA.
On the second page, I selected 20 variables from the 9 categories in the Food Environment Atlas so that the general relationships among them can be explored. The scatter plot was created with the Plotly package. Although the y variable can be changed from obesity rates, I fixed obesity as the color variable because it is the most important metric for this project, and this keeps it visible on every plot. If the mouse pointer hovers over an individual data point, the specific county is displayed. As the map showed, diabetes and obesity rates seem to have a strong positive relationship. Obesity rates also seem to be related to poverty rates, median household income, convenience stores, and fast-food and full-service restaurants, while the relationship between obesity and grocery stores, farms, farmers' markets, and age is less apparent.
The end goal of this project is to make a predictive tool for policy makers interested in seeing how various factors could affect their obesity rates, so the third page lets users predict obesity rates. As a preliminary prediction analysis, stepwise multiple linear regression was applied to 17 variables of interest to predict obesity rates. Stepwise regression showed that at least 10 variables are significant. Basic diagnostics indicated that the model assumptions were not violated, so multiple linear regression is valid. 76% of the rows were complete cases, while the rest had at least one NA; only the complete cases were used for prediction. Each slider starts at the mean of its variable in the dataset and ranges over the mean +/- 3 standard deviations, so the user can choose reasonable values. The page includes the coefficients, their variance inflation factors, residual plots, a QQ plot, and a leverage plot.
The prediction page also includes all the diagnostics used to determine that multiple linear regression is valid for the obesity model. Some of the diagnostics are shown below; please see the app for the full list.
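The complete-case filter and the slider settings described above can be sketched in base R. The data frame and the `slider_range()` helper are hypothetical illustrations with made-up values, not the project's code.

```r
# Toy data with some missing values (made-up numbers).
toy <- data.frame(
  poverty = c(10, 15, NA, 20, 25),
  income  = c(40, 50, 60, NA, 70)
)

# Share of rows with no NA, and the rows that would be kept for modelling.
share_complete <- mean(complete.cases(toy))
complete <- toy[complete.cases(toy), ]

# Hypothetical helper: start each slider at the mean, span mean +/- 3 sd.
slider_range <- function(x) {
  m <- mean(x)
  s <- sd(x)
  c(min = m - 3 * s, value = m, max = m + 3 * s)
}

# One column per variable, with rows "min", "value", "max".
ranges <- sapply(complete, slider_range)
print(share_complete)
print(ranges)
```

Each column of `ranges` could be passed to the `min`, `value`, and `max` arguments of a Shiny `sliderInput()`. For variables that cannot be negative, such as rates, the lower bound may need to be clamped at zero.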
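As a rough illustration of one of these diagnostics, the variance inflation factor of a predictor can be computed by hand as 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The data below are simulated and the `vif_by_hand()` helper is hypothetical. The app may well use a package function instead, which the post does not specify.

```r
set.seed(42)
n <- 300

# Simulated predictors; income is deliberately correlated with poverty.
toy <- data.frame(poverty = rnorm(n, mean = 15, sd = 5))
toy$income    <- 60 - 1.5 * toy$poverty + rnorm(n, sd = 4)
toy$fast_food <- rnorm(n, mean = 0.6, sd = 0.2)
toy$obesity   <- 20 + 0.4 * toy$poverty - 0.1 * toy$income +
                 4 * toy$fast_food + rnorm(n)

fit <- lm(obesity ~ poverty + income + fast_food, data = toy)

# Hypothetical helper: regress each predictor on the rest, VIF = 1 / (1 - R^2).
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(toy, c("poverty", "income", "fast_food"))
print(round(vifs, 2))

# Base R draws the standard diagnostic plots for a fitted lm:
# residuals vs fitted (1), normal Q-Q (2), residuals vs leverage (5).
# plot(fit, which = c(1, 2, 5))
```

A VIF near 1 suggests a predictor is nearly uncorrelated with the others, while larger values flag collinearity that can inflate coefficient standard errors.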
The last page is a data table of the counties where a user can search for a specific county in the USA.
The USA has many diverse counties, and this spread results in many demographic differences in health and food. The scope of this project is to explore the relationship between food, health, and economic factors across US counties. Data was taken from the USDA's Food Environment Atlas. The diversity (or variance) of factors across counties allowed correlations with obesity rates to be identified. Users can map health indicators across the US, explore the relationships among different factors, and predict how changes in demographic factors will affect obesity rates.