Food and health demographics in the USA

Yannick Kimmel
Posted on May 17, 2016

Contributed by Yannick Kimmel. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from April 11th to July 1st, 2016. This post is based on his second class project, R Shiny (due in the 4th week of the program).

Introduction

The culture of food and health, like other aspects of culture, is constantly changing and varies widely across the USA. Obesity affects roughly 1 in 3 Americans, while diabetes affects roughly 1 in 10. I wanted to understand the relationship between food and health demographics. I thought this data would be important for policy makers and civic leaders interested in improving the health demographics of their communities. The USDA's Food Environment Atlas is a great resource for county-level data on this subject. The data can be found here. The Shiny app can be found here, and the project code here.

R code

Three R files were used in this project. The helpers.R file manipulates the county demographic data, creates a county map, fits a regression model, and uses the fitted model to predict obesity from values entered by the user.

Data is loaded from 6 CSV files and combined into a single data frame.
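As a rough sketch of this step (not the project's actual code), several county tables can be merged on a shared county identifier with base R. The column names, the `FIPS` key, and the toy values below are assumptions made up for illustration, and only three small tables are used in place of the six CSV files.

```r
# Hypothetical stand-ins for tables that would normally come from
# read.csv("<file>.csv"). Column names and values are made up.
health <- data.frame(FIPS = c(1001, 1003, 1005),
                     obesity_rate = c(30.5, 26.6, 37.3))
poverty <- data.frame(FIPS = c(1001, 1003, 1005),
                      poverty_rate = c(12.1, 13.9, 26.7))
stores <- data.frame(FIPS = c(1001, 1003),
                     grocery_per_1000 = c(0.09, 0.13))

# Merge every table on the shared county identifier. all = TRUE keeps
# counties that are missing from some tables and fills the gaps with NA.
tables <- list(health, poverty, stores)
county_data <- Reduce(function(x, y) merge(x, y, by = "FIPS", all = TRUE),
                      tables)

print(county_data)
```

Keeping unmatched counties as rows with NA values makes it easy to count complete cases later instead of silently losing counties during the merge.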

Then stepwise regression is used to fit a model of county obesity rates, and a prediction function is created so that obesity can be predicted from parameters changed by the user. Finally, the model diagnostics are checked.
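The fit-then-predict workflow described above might look something like the following sketch. It assumes AIC-based selection with base R's `step()`, which may differ from the selection procedure used in the project, and it runs on simulated data; the variable names and the `predict_obesity` helper are hypothetical.

```r
set.seed(42)  # reproducible simulated data

# Simulated county-level data; variable names are illustrative assumptions.
n <- 200
counties <- data.frame(
  poverty_rate = rnorm(n, 15, 5),
  fast_food    = rnorm(n, 0.6, 0.2),
  noise        = rnorm(n)            # unrelated to the response
)
counties$obesity_rate <- 20 + 0.5 * counties$poverty_rate +
  5 * counties$fast_food + rnorm(n, sd = 2)

# Start from the full model and let step() drop or re-add terms by AIC.
full_model <- lm(obesity_rate ~ poverty_rate + fast_food + noise,
                 data = counties)
fitted_model <- step(full_model, direction = "both", trace = 0)

# Hypothetical helper: wrap predict() so that user-chosen inputs
# return a single predicted obesity rate.
predict_obesity <- function(model, ...) {
  new_data <- data.frame(...)
  unname(predict(model, newdata = new_data))
}

prediction <- predict_obesity(fitted_model,
                              poverty_rate = 20, fast_food = 0.8, noise = 0)
print(prediction)
```

Passing every candidate variable to the helper keeps it usable whichever terms survive selection, since `predict()` ignores columns the final model does not need.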

A choropleth map is created using county-specific health parameters.

The server.R file has several functions: 1) change the county map created in the helpers.R file depending on input by the user; 2) create a Plotly scatter plot for the different demographic data; 3) call the obesity model created in the helpers.R file and update the predicted obesity rate given inputs by the user; 4) render the data table that allows a user to look up data on a specific county. The code for the county map, Plotly scatter plot, and obesity prediction is part of the project code linked above.

Shiny app

USA county map

The first page of the Shiny app is a county-specific map of the continental USA. I chose to display three indicators of particular interest: obesity, diabetes, and poverty rates. Obesity and diabetes rates are generally higher in the Midwest and Southeast and lower on the East and West Coasts. Poverty rates generally seem higher in the southern part of the USA.

[Screenshot: USA county map page]

Scatter plot

On the second page, I selected 20 variables from the 9 categories in the Food Environment Atlas so that the general relationships among them can be explored. The scatter plot was created with the Plotly package. Although the y variable can be changed from obesity rates, I fixed obesity as the color variable because it is the most important metric for this project, which means it can always be considered on the plot. If the mouse pointer hovers over an individual data point, the specific county is displayed. As the map showed, diabetes and obesity rates seem to have a strong positive relationship. There also seems to be a relationship between obesity rates and poverty rates, median household income, convenience stores, and fast-food and full-service restaurants. The relationship between obesity and grocery stores, farms, farmers' markets, and age seems less apparent.

[Screenshot: scatter plot page]

Prediction

The end goal of this project is a predictive tool for policy makers interested in seeing how different factors could affect their obesity rates, so the third page allows users to predict obesity rates. As a preliminary analysis, stepwise multiple linear regression was applied to 17 variables of interest to predict obesity rates. Stepwise regression showed that at least 10 variables are significant. Basic diagnostics indicate that the model assumptions were not violated and that multiple linear regression is valid. 76% of the data consisted of complete cases, while the rest had at least one NA; only the complete cases were used in prediction. The initial value of each slider is the mean of that variable in the dataset, and the slider range is the mean +/- 3 standard deviations, allowing the user to choose reasonable values. The page includes the coefficients, their variance inflation factors, residual plots, a QQ plot, and a leverage plot.
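The complete-case filtering and the slider defaults described above can be sketched in base R as follows. The data frame, its column names, and the `slider_settings` helper are hypothetical and only illustrate the mean +/- 3 standard deviations rule.

```r
# Toy data frame with missing values; column names and values are made up.
county_data <- data.frame(
  poverty_rate = c(12, 18, NA, 25, 15),
  fast_food    = c(0.5, 0.7, 0.6, NA, 0.9)
)

# Keep only rows with no missing values.
complete <- county_data[complete.cases(county_data), ]

# Hypothetical helper: slider start = mean, slider range = mean +/- 3 * sd.
slider_settings <- function(x) {
  m <- mean(x)
  s <- sd(x)
  c(min = m - 3 * s, value = m, max = m + 3 * s)
}

# One column of settings per variable, with rows min, value, and max.
settings <- sapply(complete, slider_settings)
print(settings)
```

Each column of `settings` could then supply the `min`, `value`, and `max` arguments of a Shiny `sliderInput()`. For variables that cannot be negative, the lower bound might additionally need to be clamped at zero.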

[Screenshot: prediction page]

Model Diagnostics

The prediction page also includes all the diagnostics used to determine that multiple linear regression is valid for the obesity model. Some of the diagnostics are shown below; please see the app for the full list.

[Screenshot: model diagnostics]

[Screenshot: model diagnostics]
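One of the diagnostics listed on the prediction page, the variance inflation factor, can be computed by hand: for predictor j it is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below uses simulated predictors and a hypothetical `vif_manual` helper; it is not the project's code.

```r
set.seed(1)  # reproducible simulated data

# Simulated predictors: x2 is built to correlate with x1, x3 is independent.
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# Hypothetical helper: regress each predictor on the others and
# convert the resulting R^2 into a variance inflation factor.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(predictors)
print(round(vifs, 2))
```

A value near 1 means a predictor is nearly uncorrelated with the others, while larger values signal collinearity that inflates the variance of the coefficient estimates.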

Data table

The last page is a data table of the counties where a user can search for a specific county in the USA.

[Screenshot: data table page]

Conclusions

The USA has many diverse counties, and this spread can result in many demographic differences in health and food. The scope of this project is to explore the relationship between food and economic factors across US counties. Data was taken from the USDA's Food Environment Atlas. The diversity (or variance) of factors across counties allowed correlations with obesity rates to be identified. Users can map health indicators across the US, explore the relationships among different factors, and predict how changes in demographic factors would affect obesity rates.

About Author

Yannick Kimmel

Yannick is drawn to solving a wide range of problems - from the traditional sciences to current challenges in data science and machine learning. Yannick holds a PhD in chemical engineering from the University of Delaware, and a...