American Food Deserts: Analyzing the development of "unhealthy" neighborhoods

Posted on Feb 3, 2018

This is my Shiny project examining the development of food deserts in American neighborhoods. Please click here to go directly to the app.


Minorities today disproportionately suffer from poor health outcomes. Data show that African Americans are twice as likely to have diabetes as whites [1], and nearly 40 percent of Latinos are overweight or obese [2].

Much of the existing debate over why such stark disparities exist has focused on issues such as healthcare coverage and socioeconomic status. However, galvanized by Michelle Obama's "Let's Move" initiative, attention has increasingly turned to the prevalence of food deserts and how their existence may help explain why nearly 40% of children are overweight or obese in black and Hispanic communities.

Food deserts are commonly characterized as places that have limited access to affordable and nutritious foods and a surplus of restaurants, fast food chains, bars, and convenience stores (instead of grocery stores). It stands to reason that such access to food sources may contribute to these persistent health differences across groups.

Therefore, in this project I aim to:

  1. explore the frequency of food deserts in major metropolitan cities
  2. examine how such food deserts vary both across cities and over time
  3. test whether variation in food deserts is patterned by neighborhood racial composition



To accomplish the goals at hand, I needed a data source that had several characteristics, namely:

  • had some sort of record of food sources in a neighborhood in order to measure food deserts
  • recorded key neighborhood socioeconomic characteristics like race, poverty, unemployment, etc.
  • had the above characteristics BOTH over time and across many cities

A tall task indeed!

Fortunately for me, one of the richest public data sources available is the U.S. national census. In addition to the decennial census, it offers neighborhood-level information from the annually collected American Community Survey (which began roughly in 2009) and the Zip Code Business Patterns dataset, which contains information on the number of different types of businesses (categorized by the NAICS) since the 1990s for every zip code in America.

Unfortunately for me, one of the messiest public data sources available is also the U.S. national census. In addition to the sheer size of the data, which makes processing and cleaning slow and tedious, the census also regularly changes coding schemes and concept definitions from year to year. On top of that, the NAICS has over 10,000 categories. Consider what that amounts to with over 40,000 zip codes in America and nearly 15 years of data used for this study. The data required A LOT of cleaning.

Nonetheless, after much cleaning, I had a dataset that consisted of 20 cities with:

  • 15 years (2000-2014) of data on the number of businesses in a zip code. Specifically, for the purposes of my research question:
    1. the number of fast food restaurants
    2. fresh grocery stores
    3. nonperishable food sources (i.e. convenience stores, and other places that mostly provided snacks and frozen foods)
    4. liquor stores/bars.
  • 7 years (2000, 2009, 2010, 2011, 2012, 2013, 2014) of census data on zip code sociodemographic information
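To give a sense of what the business-count cleaning involves, here is a minimal Python sketch of collapsing NAICS rows into the four food source categories. The post does not say which NAICS codes were used, so the mapping below is an assumption (the codes themselves are real industries), and the sample rows are made up. Note how limited-service restaurants appear under two codes, which is the kind of coding scheme change mentioned above.

```python
from collections import defaultdict

# Assumed mapping from NAICS codes to the four categories used in this post.
# The post does not list the codes it used; these are illustrative choices.
NAICS_TO_CATEGORY = {
    "722513": "fast_food",      # Limited-service restaurants (2012 NAICS)
    "722211": "fast_food",      # Limited-service restaurants (2002/2007 NAICS)
    "445110": "fresh_grocery",  # Supermarkets and other grocery stores
    "445120": "nonperishable",  # Convenience stores
    "445310": "alcohol",        # Beer, wine, and liquor stores
    "722410": "alcohol",        # Drinking places (alcoholic beverages)
}


def count_by_category(rows):
    """Sum establishments per (zip, year, category), dropping other industries."""
    totals = defaultdict(int)
    for zip_code, year, naics, establishments in rows:
        category = NAICS_TO_CATEGORY.get(naics)
        if category is not None:
            totals[(zip_code, year, category)] += establishments
    return dict(totals)


# Made-up rows shaped as (zip, year, NAICS code, number of establishments).
sample_rows = [
    ("10001", 2014, "722513", 40),
    ("10001", 2014, "445310", 5),
    ("10001", 2014, "722410", 12),
    ("10001", 2014, "541110", 300),  # Offices of lawyers: not a food source
]
counts = count_by_category(sample_rows)
```

Here liquor stores and bars are summed into a single "alcohol" count, matching the fourth category in the list above.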

Project Summary


1)  Explore the frequency of food deserts in major metropolitan cities

Knowing the number of fast food restaurants, fresh grocery stores, nonperishable food sources, and liquor stores, in conjunction with the shapefiles associated with each zip code, allowed me to create a choropleth map that demonstrates how areas in a city differ on these food source categories.


Caption: From left to right and top to bottom, choropleth maps of fast food, nonperishable groceries, fresh food, and alcohol in the San Francisco Bay area


The colors on the map indicate how much of that resource is in a given area. The darker the shade, the higher the frequency, and the lighter the shade, the lower the frequency. The bins for the color scheme were constructed using the distribution of each outcome; therefore, they change from graph to graph.
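One common way to build bins from the distribution of each outcome is to use quantiles, so every map gets its own cut points. The sketch below assumes quantile bins with five shades; the post does not state the exact binning rule, and the counts are made up.

```python
import bisect
import statistics


def quantile_breaks(values, n_bins=5):
    """Return the n_bins - 1 interior cut points that split values into quantile bins."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")


def assign_bin(value, breaks):
    """Return the 0-based bin index for a value (higher index = darker shade)."""
    return bisect.bisect_right(breaks, value)


# Made-up counts per zip code for two outcomes on very different scales.
fast_food = [0, 2, 3, 5, 8, 13, 21, 34, 55, 89]
liquor = [0, 0, 1, 1, 2, 2, 3, 4, 6, 9]

fast_food_breaks = quantile_breaks(fast_food)
liquor_breaks = quantile_breaks(liquor)
```

Because the cut points come from each outcome's own distribution, the same shade can mean very different counts on different maps, which is why shades should only be compared within a single map.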


2) Examine how such food deserts vary both across cities and over time

Having 20 cities and 15 waves in my dataset allows me to explore differences in food deserts both across different cities and over time.

Caption: Map comparing number of fast food restaurants in San Francisco (L) and New York (R).


In the graph above, we can compare the number of fast food restaurants in the San Francisco Bay area with those in New York City.


Caption: Changes in the number of fast food restaurants in New York (Left to right; top to bottom - 2000, 2003, 2006, 2009, 2012, 2014).


In this graph, we can analyze how the number of fast food restaurants changes over time in New York. In particular, there seems to be large growth in Brooklyn.

3) Test whether variation in food deserts is patterned by neighborhood racial composition

Given that the data contain sociodemographic information alongside the number of food resources in a neighborhood, it is easy to observe the racial and socioeconomic characteristics of these neighborhoods.


Clicking on the marker in a zip code reveals its zip code number, racial composition and median income, thus allowing users to explore the descriptive characteristics of areas they want to learn about. However, the basis for incorporating sociodemographic characteristics of zip codes was to garner insight into how food deserts correlated with neighborhood demographic composition. Unless users want to individually click every marker in the graph, the choropleth map fails to provide insight into that relationship.

While a choropleth map may be one of the more appealing visualizations, it was not the best one for examining the association between my two variables of interest. I proceeded to try some bivariate analyses to see if I could unpack the relationship between neighborhood racial composition and food resources. Given the size of all the waves of my data and the nature of bivariate analyses and visualizations, I took a cross-section of my data and examined only the year 2014.

If the narrative that minority neighborhoods are deprived of access to healthy food and instead have a plethora of unhealthy food options is true, we would expect to see either a positive or a negative correlation between percent white and the outcome variable. But as the graph above suggests, it is difficult to infer any type of relationship. Is that to say that we should reject the narrative that food deserts are concentrated in minority neighborhoods?

Not necessarily. The lack of a distinct pattern in the scatter plot may be more a function of a non-linear relationship between neighborhood demographics and food resources. That is, an increase in percent white from, say, 20% to 22% may not significantly affect the number of fast food restaurants, but crossing some threshold may.



These graphs best illustrate this point. When I transform my x-axis from the continuous variable percent white to a categorical variable where neighborhoods are categorized as predominantly white, black, Hispanic, or heterogeneous, the previous scatter plot turns into a box plot where we can examine the differences in the means and distributions of these groups.
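The recoding from a continuous share to a category can be sketched as follows. The cutoff for "predominantly" is not given in the post, so the 50% threshold here is an assumption, as are the function and key names.

```python
def classify_neighborhood(shares, threshold=0.5):
    """Label a zip code by its largest group if that group's share exceeds the threshold.

    `shares` maps group name -> population share (0 to 1). The 0.5 threshold
    is an assumed cutoff, not one taken from the post.
    """
    group, share = max(shares.items(), key=lambda item: item[1])
    if share > threshold:
        return f"predominantly {group}"
    return "heterogeneous"


# Made-up zip codes.
label_a = classify_neighborhood({"white": 0.70, "black": 0.10, "hispanic": 0.20})
label_b = classify_neighborhood({"white": 0.40, "black": 0.30, "hispanic": 0.30})
```

With these inputs, the first zip code is labeled predominantly white and the second heterogeneous, since no group crosses the cutoff.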

Yet still, these findings lack robustness. Though we can now see that predominantly white neighborhoods have the lowest number of fast food restaurants, we cannot rule out confounding factors. Namely, do minority neighborhoods have more fast food restaurants because they also tend to have lower socioeconomic status? What about population density? Midtown Manhattan likely has the largest overall number of fast food restaurants, but could the overall foot traffic and population density be inflating the box plot results?



To control for these confounding factors, I ran a negative binomial regression instead of OLS. My outcome is a count variable and therefore, by definition, has a lower bound of zero. The effects of this are twofold: 1) the conditional errors cannot follow a normal distribution, and 2) the errors are inherently heteroskedastic.

Furthermore, I chose a negative binomial regression over a Poisson model because it relaxes the assumption that an outcome's mean is equal to its variance. In my model, I used city-level fixed effects and controlled for the area of a zip code, log of median income, and poverty level. I used zip code population as my exposure variable, which fixes the coefficient on its log at one, thus essentially making the outcome a rate.
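A quick, informal way to see why the Poisson assumption may be too strict is to compare the variance of a count outcome with its mean. The sketch below uses made-up counts and only looks at the raw (unconditional) ratio; the model's assumption is about the variance conditional on covariates, so treat this as a rough diagnostic rather than a test.

```python
import statistics


def dispersion_ratio(counts):
    """Sample variance divided by mean: near 1 is Poisson-like, well above 1 suggests overdispersion."""
    return statistics.variance(counts) / statistics.mean(counts)


# Made-up fast food counts for ten zip codes, with a long right tail.
counts = [0, 1, 1, 2, 3, 4, 6, 9, 15, 40]
ratio = dispersion_ratio(counts)
```

A ratio far above one, as in this toy data, is the situation where a negative binomial model is usually preferred over Poisson.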

Finally, for my variable of interest, neighborhood racial composition, I used percent black, percent Asian, percent Hispanic, and percent other, with white as the reference category. Consequently, the effect of each group must be interpreted relative to the reference group, percent white.


Negative binomial regressions model the outcome as the log of a rate, which detracts from their interpretability. Therefore, I constructed bar charts in which I used my negative binomial regression model to compute predicted values. The sidebar slider manipulates one demographic group (while holding the others at their mean ratios), thus allowing for insight into the effect of increasing a given demographic group in a neighborhood.
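The idea behind these predicted values can be sketched in a few lines of Python. Everything numeric below is hypothetical: the coefficients, intercept, and mean shares are placeholders, not estimates from my model, and the other controls are omitted for brevity. Holding the other groups at their means is also a simplification of how the slider works.

```python
import math

# Hypothetical coefficients on the log-rate scale (white is the reference group).
COEFS = {"pct_black": -0.4, "pct_hispanic": 0.3, "pct_asian": 0.2}
INTERCEPT = -7.0
# Hypothetical sample means for each group's share.
MEAN_SHARES = {"pct_black": 0.15, "pct_hispanic": 0.20, "pct_asian": 0.08}


def predicted_count(shares, population):
    """Expected count under a log link with log(population) as an offset."""
    linear = INTERCEPT + sum(COEFS[g] * shares[g] for g in COEFS)
    # exp(linear + log(population)) simplifies to population * exp(linear).
    return population * math.exp(linear)


def predict_for_share(group, share, population=10_000):
    """Set one group's share and hold the others at their means."""
    shares = dict(MEAN_SHARES)
    shares[group] = share
    return predicted_count(shares, population)


low = predict_for_share("pct_black", 0.2)
high = predict_for_share("pct_black", 0.5)
```

Because population enters with its log coefficient fixed at one, doubling the population doubles the predicted count, which is what makes the outcome behave like a rate.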

For example, in the graphs above, the green bars represent the predicted values for fast food restaurants, fresh food stores, alcohol-serving institutions, and nonperishable groceries when percent white is manipulated. The green bars in the first graph display the predicted values when a neighborhood is 50 percent white. In the subsequent graph, the green bars represent the predicted values when a neighborhood is 70 percent white.

Therefore, the predicted values portion of my app allows for analysis of both the relative effect of increases in a demographic group (e.g. the effect of increasing percent white from 50% to 70%) and the comparison of effects across demographic groups (e.g. the first graph, where the predicted values are compared at 50% white vs. 50% black vs. 50% Asian vs. 50% Hispanic).



The following analyses are not included in my Shiny app because the nature of the results did not lend itself favorably to interactive graphs. However, I felt it was important to run these additional analyses. The above statistical analyses were all run cross-sectionally. That is, I used only data from 2014. That was necessary due to constraints on the Shiny server, the nature of the models, and the need to optimize the visualizations.

But in doing so, I dramatically reduced my sample size (and therefore the power of my models) and, more importantly, lost the immense leverage of using panel data to further tease out relationships. Therefore, I conclude my analyses for this project using latent growth curve models to examine differences in the trajectory of growth of different neighborhoods.

In contrast to the more commonly used fixed effects models, growth curve models allow for analysis of between-group instead of within-group effects. In this case, that means that instead of having within-neighborhood changes (e.g. 20% black in 2000 to 25% in 2014) as the basis of my model, I examine differences in the overall trajectories of growth between groups (e.g. predominantly white neighborhoods in 2000 vs. predominantly black neighborhoods in 2000).


For ease of understanding, the graph above shows imaginary data for how this growth would appear. Each line represents a neighborhood, so each has its own distinct growth pattern. However, each individual line can also be categorized into a group: Asian, black, Hispanic, or white. So the question becomes: are there features of these groups that distinctly affect the slopes and intercepts of growth? This is the foundation of latent growth curve models in a nutshell.

y = intercept + slope × time


Growth curve modeling takes the simple notion of growth as an intercept plus a slope over time and expands on it by examining the latent factors that affect slope and intercept.
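As a rough intuition only, the intercept-and-slope idea can be sketched with a two-stage shortcut in Python: fit a straight line to each neighborhood, then average intercepts and slopes within each group. This is not a latent growth curve model (which estimates everything jointly and treats intercept and slope as latent variables), and the data, group labels, and function names below are made up for illustration.

```python
from collections import defaultdict


def fit_line(years, counts):
    """Ordinary least squares fit of counts on years; returns (intercept, slope)."""
    n = len(years)
    mean_t = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((t - mean_t) ** 2 for t in years)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(years, counts))
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope


def group_trajectories(neighborhoods, years):
    """Average per-neighborhood intercepts and slopes within each group."""
    fits = defaultdict(list)
    for group, counts in neighborhoods.values():
        fits[group].append(fit_line(years, counts))
    return {
        group: (
            sum(i for i, _ in pairs) / len(pairs),
            sum(s for _, s in pairs) / len(pairs),
        )
        for group, pairs in fits.items()
    }


# Time is coded as years since the first wave, so the intercept is the starting level.
years = [0, 1, 2, 3]
# Imaginary neighborhoods: (group label, fast food counts per wave).
neighborhoods = {
    "A": ("white", [20, 21, 22, 23]),
    "B": ("white", [22, 23, 24, 25]),
    "C": ("black", [10, 12, 14, 16]),
    "D": ("black", [12, 14, 16, 18]),
}
trajectories = group_trajectories(neighborhoods, years)
```

In this imaginary data one group starts lower (smaller intercept) but grows faster (larger slope), which is the kind of between-group pattern the growth curve model is designed to detect.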


In the case of my model, I use the classification of neighborhoods as predominantly black, Hispanic, or integrated (with white as the reference category) to predict the latent variables for slope and intercept, and use poverty level, log of median income, and population density as time-varying covariates.


In the figures above, I examine only fast food restaurants and nonperishable groceries (and include only the estimates for neighborhood racial composition for simplicity). The results for the other outcomes can be viewed in the supplemental materials section, but I focus on these two figures because of the additional insights they contribute to my negative binomial regression model.

For fast food restaurants, the results from the cross sectional negative binomial regression suggest that an increase in percent black was associated with less access to these food sources. However, the question remains, has this disparity always existed? The results from the latent growth curve model shed light on this question. The effect of predominantly black neighborhoods on the intercept of growth for fast food restaurants is negative; however, the effect on the slope is positive.

This shows that historically, predominantly black neighborhoods have had less access to fast food restaurants (compared to whites), but these differences have decreased over time. Given that the cross sectional models show that black communities have less access to fast food restaurants in 2014, we can infer that while there has been an increase in fast food restaurants concentrating in black communities, this increase is still overshadowed by initial differences in starting points.

For nonperishable groceries, the cross sectional negative binomial regression demonstrated that black communities have more access to these resources.  However, the growth curve models suggest a different pattern for how this difference has developed over time. Predominantly black neighborhoods have a positive and significant effect on the intercept and a non significant effect on the slope.

This indicates that, with regard to nonperishable groceries, black communities have historically had more nonperishable grocery stores than white communities. However, racial composition has no effect on the rate of growth of such grocery stores. This suggests that the differences in access to nonperishable food stores between white neighborhoods and black neighborhoods result from initial starting point differences rather than differences in growth, analogous to two parallel lines with different intercepts.



When I began this project, I had three goals in mind.

  1. provide a tool that allowed for exploration into the existence of food deserts in major metropolitan cities
  2. allow for examination into how food deserts vary across cities and over time
  3. provide insight into the narrative that variation in food deserts is patterned by neighborhood racial composition

Using a choropleth map within a shiny application, I was able to successfully accomplish the first two goals. The third goal proved more challenging. Simple bivariate analyses proved inappropriate because of the extent of confounding factors. Negative binomial regressions provided insight into a snapshot of how neighborhood composition, controlling for socioeconomic status, affected access to food resources but ignored questions of how food deserts develop over time.

Finally, latent growth curve models, while the most computationally demanding and statistically complex, reveal differences in trajectories of growth, defined by their intercept and slope, for analysis of between group variation in food deserts.

So back to the original question, are minorities disproportionately living in food deserts? I believe that my project shows mixed support for this narrative. For black communities, the results show that they have less access to fresh food and more to nonperishable groceries compared to whites, but they also have less access to fast food restaurants.

This suggests that black food deserts exist in the sense that they are more likely to be characterized by 7-Elevens and corner stores than by Whole Foods and Costcos. But at the same time, the image of McDonald's, Arby's, Taco Bells, and Applebee's overflowing in these neighborhoods may also be false, and in reality they may just be communities barren of stores, restaurants, and businesses altogether.

However, the narrative of minority food deserts may more accurately describe Hispanics and Asians. Consistent with that narrative, these groups have more access to both fast food restaurants and nonperishable groceries. Yet they also have more access to fresh food stores. Again, this suggests that for these neighborhoods, food deserts may exist in the sense of easy access to fast food chains and convenience-store-style groceries, but they also have increased access to places primarily engaged in selling fresh fruit, vegetables, etc., like bodegas and family markets.

The predicted values and growth curve models add an additional element: we can see the practical effects of each group and analyze how changes over time are affected by neighborhood racial composition. Namely, we saw that differences in fast food restaurants have existed historically between black and white communities but have actually been decreasing within the past decade. Historical differences in nonperishable groceries have also existed, but these disparities have remained stagnant over the past 15 years.

This has important implications for how we now understand and talk about food deserts. We can begin to examine structural differences in food resources that exist in communities today, as well as evaluate whether these differences have been exacerbated over time. And as my analyses show, answers to these questions involve a lot of nuance.

Indeed, broad-strokes answers and methods can overlook important patterns and phenomena in the data. As my project shows, there are many different ways to approach a problem, each garnering different insights. I highlighted only select models and groups in this blog for this reason, so I invite users to browse through my app as well as my growth curve models to gain a more detailed view of food deserts. Please feel free to reach out with any comments or suggestions!


Supplemental Graphs:





About Author

William Kye

I love using data to answer questions! Being a former PHD student in sociology at the University of Notre Dame, I am inherently fascinated, and trained, in analyzing and understanding human behavior. More than looking at data strictly...