Food and health demographics in the USA
Contributed by Yannick Kimmel. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from April 11th to July 1st, 2016. This post is based on his second class project, R Shiny (due in the 4th week of the program).
Introduction
The culture of food and health, like other aspects of culture, is constantly changing and highly diverse (meaning high variance) across the USA. Obesity affects roughly 1 in 3 Americans, while diabetes
R code
Three R files were used in this project. The helpers.R file is used to manipulate the county demographic data, create a county map, fit a regression model, and use the fitted model to predict obesity based on values inputted by the user.
Data is read from 6 CSV files and combined into a single data frame:
https://gist.github.com/Yankim/5ea82527b5f651681ed922fc2f17813b
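As a rough illustration of this loading step (not the code in the gist above), the sketch below reads several CSV files and merges them on a shared county identifier. The file names, the column names, the `FIPS` key, and the tiny made-up values are all assumptions for the example; only three files are used instead of six to keep it short.

```r
# Toy stand-ins for the real CSV files (hypothetical names and made-up values).
tmp <- tempdir()
health <- data.frame(FIPS = c(1001, 1003, 1005), PCT_OBESE = c(30.5, 26.6, 37.3))
stores <- data.frame(FIPS = c(1001, 1003, 1005), GROCERY_PER_1000 = c(0.09, 0.15, 0.18))
income <- data.frame(FIPS = c(1001, 1003), MEDIAN_INCOME = c(53049, 47618))
write.csv(health, file.path(tmp, "health.csv"), row.names = FALSE)
write.csv(stores, file.path(tmp, "stores.csv"), row.names = FALSE)
write.csv(income, file.path(tmp, "income.csv"), row.names = FALSE)

# Read every file into a list of data frames.
paths <- file.path(tmp, c("health.csv", "stores.csv", "income.csv"))
tables <- lapply(paths, read.csv)

# Merge the tables pairwise on the county key. all = TRUE keeps counties
# that are missing from some files and fills the gaps with NA.
counties <- Reduce(function(x, y) merge(x, y, by = "FIPS", all = TRUE), tables)
print(counties)
```

Keeping `all = TRUE` means no county is silently dropped at the merge stage, so missing values can be dealt with explicitly later, for example by restricting the model to complete cases.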
Then stepwise regression is used to fit a model of county obesity rates, and a prediction function is created so that obesity can be predicted from parameters changed by the user. Finally, the model diagnostics are checked.
https://gist.github.com/Yankim/cea1802bc1134649f99328d9308c7b3d
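To make the idea concrete, here is a minimal sketch of stepwise selection followed by a small prediction wrapper. It assumes base R's `step()` (AIC-based) with `direction = "both"`; the gist above may use a different function or direction. The data are simulated, and the variable names (`poverty`, `rec_facilities`, `noise`) and the helper `predict_obesity` are hypothetical.

```r
# Simulated county-level data: two real effects plus one irrelevant variable.
set.seed(1)
n <- 200
toy <- data.frame(
  poverty = rnorm(n, mean = 15, sd = 5),
  rec_facilities = rnorm(n, mean = 0.1, sd = 0.03),
  noise = rnorm(n)
)
toy$obesity <- 20 + 0.6 * toy$poverty - 30 * toy$rec_facilities + rnorm(n, sd = 1)

# Start from the full model and let step() add or drop terms by AIC.
full_model <- lm(obesity ~ ., data = toy)
step_model <- step(full_model, direction = "both", trace = 0)

# Hypothetical wrapper: takes named predictor values and returns one prediction.
# Columns the final model does not use are simply ignored by predict().
predict_obesity <- function(model, ...) {
  newdata <- data.frame(...)
  unname(predict(model, newdata = newdata))
}

pred <- predict_obesity(step_model, poverty = 20, rec_facilities = 0.1, noise = 0)
print(formula(step_model))
print(pred)
```

Wrapping `predict()` this way keeps the Shiny side simple: the server only has to pass the current slider values by name.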
A choropleth map is created using county-specific health parameters.
https://gist.github.com/Yankim/ae93f25a1281c96582edb2b926188e3d
The server.R file has several functions. 1) Change the county map created in the helpers.R file depending on input by the user. 2) Create a Plotly scatter plot for the different demographic data. 3) Call the obesity model created in the helpers.R file, and change the predicted obesity rate given inputs by the user. 4) Render the data table that allows a user to look up data on a specific county. The code for the county map, Plotly scatter plot, and obesity prediction is shown below:
https://gist.github.com/Yankim/9b665c7ea726b6fdf3867eba2feaa551
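For readers new to Shiny, the sketch below shows only the general shape of the prediction part of a server function: a render block that re-runs whenever a slider value changes. It is not the server.R from the gist above. The input ID `poverty`, the output ID `prediction`, the helper `format_prediction`, and the toy model are assumptions made for the example.

```r
# Toy model standing in for the one fitted in helpers.R (made-up values).
toy <- data.frame(poverty = c(10, 15, 20, 25, 30),
                  obesity = c(25, 27, 30, 32, 35))
model <- lm(obesity ~ poverty, data = toy)

# Plain function that turns a slider value into display text.
# Keeping it outside the server makes it easy to test on its own.
format_prediction <- function(model, poverty) {
  pred <- predict(model, newdata = data.frame(poverty = poverty))
  sprintf("Predicted obesity rate: %.1f%%", pred)
}

# Server function: renderText() re-evaluates its body every time
# input$poverty changes, so the displayed prediction tracks the slider.
server <- function(input, output, session) {
  output$prediction <- shiny::renderText({
    format_prediction(model, input$poverty)
  })
}

# Calling the helper directly, outside of Shiny:
print(format_prediction(model, 20))
```

In a full app this would be paired with a `shiny::sliderInput("poverty", ...)` and a `shiny::textOutput("prediction")` in ui.R.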
Shiny app
USA county map
The first page of the Shiny app is a county-specific map of the continental USA. I chose to display three indicators of particular interest: obesity, diabetes
Scatter plot
On the second page, I selected 20 variables among the 9 categories in the Food Environment Atlas so that the general relationship among them can be explored. The scatter plot was created with the Plotly package. Although the y variable can be changed from obesity rates, I fixed obesity as the color variable because it is the most important metric for this project, which means it is always visible on the plot. If the mouse pointer hovers over an individual data point, the specific county is displayed. As was shown on the map, diabetes
Prediction
The end goal of this project is to make a predictive tool for policy makers interested in seeing how different factors could affect their obesity rates. On the third page, I therefore allow users to predict obesity rates. As a preliminary prediction analysis, stepwise multiple linear regression was used on 17 variables of interest to predict obesity rates. Stepwise regression showed that at least 10 variables are significant. Basic diagnostics indicate that the model assumptions were not violated and that multiple linear regression is valid. 76% of the records were complete cases, while the rest had at least one NA. Only the complete cases were used in prediction. The initial value of each slider is the mean of that variable in the dataset, and the range of each slider is the mean +/- 3*standard deviation, allowing the user to choose reasonable values. The page includes coefficients, their variance inflation factors, residual plots, a QQ plot, and a leverage plot.
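The complete-case filtering and the slider ranges described above can be sketched in a few lines of base R. The data frame, its column names, and the helper `slider_spec` are made up for illustration; the app's actual code may compute these differently.

```r
# Toy data with a couple of missing values (made-up numbers).
toy <- data.frame(
  poverty = c(12, 18, NA, 25, 15, 20),
  income  = c(52, 41, 47, NA, 58, 44)
)

# Keep only rows with no NA in any column, and report the share retained.
complete <- toy[complete.cases(toy), ]
share_complete <- nrow(complete) / nrow(toy)

# Hypothetical helper: slider start value is the mean,
# and the range is the mean plus or minus three standard deviations.
slider_spec <- function(x) {
  m <- mean(x)
  s <- sd(x)
  c(min = m - 3 * s, value = m, max = m + 3 * s)
}

# One column of (min, value, max) per variable.
specs <- sapply(complete, slider_spec)
print(share_complete)
print(specs)
```

Each column of `specs` could then be passed to a slider's `min`, `value`, and `max` arguments. For variables that cannot be negative, such as percentages, the lower bound may need to be clamped at zero.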
Model Diagnostics
The prediction page also includes all the diagnostics used to determine that multiple linear regression is valid for the obesity model. Some of the diagnostics are shown below; please see the app for the full list.
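As a sketch of how such diagnostics can be produced, the example below computes variance inflation factors by hand (regress each predictor on the others and take 1 / (1 - R^2)) and draws the standard `lm` diagnostic plots. The simulated data, the variable names, and the helper `vif_by_hand` are assumptions for the example; packages such as car provide a ready-made `vif()` function, and the app may well use one of those instead.

```r
# Simulated data in which two predictors are deliberately correlated.
set.seed(2)
n <- 150
toy <- data.frame(poverty = rnorm(n, mean = 15, sd = 5))
toy$snap <- 0.8 * toy$poverty + rnorm(n, sd = 2)
toy$income <- rnorm(n, mean = 50, sd = 10)
toy$obesity <- 20 + 0.4 * toy$poverty + 0.2 * toy$snap -
  0.1 * toy$income + rnorm(n)

model <- lm(obesity ~ poverty + snap + income, data = toy)

# VIF for predictor p: regress p on the remaining predictors,
# then compute 1 / (1 - R^2) from that auxiliary regression.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}
vifs <- vif_by_hand(toy, c("poverty", "snap", "income"))
print(round(vifs, 2))

# Residuals vs fitted, normal Q-Q, and residuals vs leverage plots.
if (interactive()) {
  par(mfrow = c(1, 3))
  plot(model, which = c(1, 2, 5))
}
```

A VIF close to 1 suggests a predictor is nearly uncorrelated with the others, while larger values flag multicollinearity that can inflate coefficient standard errors.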

Data table
The last page is a data table of the counties where a user can search for a specific county in the USA.
Conclusions
The USA has many diverse counties, and this spread results in many demographic differences in health and food. The scope of this project is to explore the relationship between food and economic factors across US counties. Data was taken from the USDA's Food Environment Atlas. The diversity (or variance) of factors across counties allowed correlations with obesity rates to be identified. Users can map health indicators across the US, explore the relationship among different factors, and predict how changes in demographic factors will affect obesity rates.