Using Public Behavioral Health Data

Posted on Dec 5, 2014

Contributed by Robert Deng. Rob took the Intro To R, R007 class with Charlie Redmon from September to October 2014. The post was based on his final project submission.

Goal: Coming from a healthcare advertising agency perspective, I wanted to influence our pitch process by collaborating with our strategy team to provide audience research for program design.

The Behavioral Risk Factor Surveillance System (BRFSS) survey is a public dataset from the CDC. It is used to understand how disease prevalence varies across socio-economic factors, in order to plan preventative measures and build health promotion activities. For this project, I used a dataset called landline and cellphone screening (LLCP) with survey questions that can be found here:

Here's a cleaned version of the data set after factorizing the columns: Download Link

Here's a cleaned version of the text read out after identifying the condition percentage per response: Download Link

Factor dimension and category lookup: Download Link

Process: Factorize qualitative categorical responses, identified as columns having at most 10 distinct values. This way we can more easily break out the condition of interest by factor response.

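make_factors <- function(data, max_levels = 10) {
  for (n in names(data)) {
    if (!is.factor(data[[n]]) &&
        length(unique(data[[n]])) <= max_levels) {
      # Assumed conversion: non-numeric columns become factors, numeric ones ordered factors
      data[[n]] <- if (!is.numeric(data[[n]])) {
        as.factor(data[[n]])
      } else {
        ordered(data[[n]])
      }
    }
  }
  data  # return the converted data frame
}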

Flag the condition of interest. This can easily be scaled to any other disease target.

paste(round(100*table(LLCPf$COND_OBESITY)[2] / nrow(LLCPf), 2), "%", sep="")
[1] "27.39%"

This gives the baseline condition rate: 27.4% of survey respondents were classified as obese.
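The flagging step itself is not shown in the post. A minimal sketch of what it might look like on toy data follows, assuming the flag is derived from a body mass index of 30 or more (the usual adult obesity cutoff). The column name `BMI` and all values are hypothetical, not the BRFSS variable names.

```r
# Toy respondents; BMI is a hypothetical column name with invented values
toy <- data.frame(BMI = c(22.1, 31.4, 27.8, 35.0, 29.9, 30.0, 24.3, 41.2))

# 1 = has the condition, 0 = does not
toy$COND_OBESITY <- as.integer(toy$BMI >= 30)

# Baseline condition rate across all respondents
baseline <- mean(toy$COND_OBESITY)
paste0(round(100 * baseline, 2), "%")
```

Swapping the cutoff or the source column is all it would take to flag a different condition.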

Identify factors that skew higher for the disease condition by looping ddply over each factor column and row-binding the results into a single data frame. The first iteration summarises the first column: for each response to that question, it returns a "condition percentage," the share of respondents giving that response who have the condition. Repeating this for every factor gives the condition percentage for each response to each question. The results are then cleaned up by removing outliers based on upper and lower condition-percentage limits, and each question is categorized into a dimension and a category.

Current limitations lie in columns where the responses vary beyond 10 levels (e.g. how many fruits did you consume, 1-23); each such column needs to be "cut" into bins before it can be treated as a factor.
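As a rough illustration of "cutting" such a column, base R's `cut()` can bin a many-valued numeric response into a handful of levels. The values, break points, and labels below are invented for the example and are not taken from the BRFSS codebook.

```r
# Hypothetical daily fruit counts for seven respondents
fruit_per_day <- c(0, 1, 2, 5, 8, 12, 23)

# Bin into four levels; intervals are right-closed: (-Inf,0], (0,2], (2,5], (5,Inf)
fruit_bin <- cut(fruit_per_day,
                 breaks = c(-Inf, 0, 2, 5, Inf),
                 labels = c("none", "1-2", "3-5", "6+"))

table(fruit_bin)
```

The resulting factor has four levels, so it would pass the 10-level threshold used above.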

# Factor identification process
library(plyr)  # provides ddply() and summarise()
factoroutput <- NULL
factoroutputTotal <- NULL
for (i in seq_len(ncol(LLCPf))) {
  if (is.factor(LLCPf[[i]])) {
    # Respondent count, condition count, and condition rate for each response in column i
    factoroutput <- ddply(LLCPf, names(LLCPf)[i], summarise, TotalSurvey = length(Condition),
                          Condition = sum(Condition), ConditionPercentage = round(Condition/TotalSurvey, 3))
    factoroutput$Factor <- names(LLCPf)[i]
    names(factoroutput) <- c("Factorvalue", "TotalSurvey", "Condition", "ConditionPercentage", "Factor")
    # rbind() with NULL returns the other argument, so the first column needs no special case
    factoroutputTotal <- rbind(factoroutput, factoroutputTotal)
  }
}

# Remove the extremes and require a minimum number of survey respondents to qualify the audience; adjust these values to the specificity you need
ConditionPerLower <- 0.05
ConditionPerUpper <- 0.95
TotalSurveyMin <- 100
# Overall condition rate (assumed definition of BaselineCondition)
BaselineCondition <- sum(LLCPf$Condition) / nrow(LLCPf)
factoroutputTotal$Spread <- factoroutputTotal$ConditionPercentage - BaselineCondition
factoroutputTotalFilter <- factoroutputTotal[(!(factoroutputTotal$ConditionPercentage < ConditionPerLower) & !(factoroutputTotal$ConditionPercentage > ConditionPerUpper)), ]
factoroutputTotalFilter <- factoroutputTotalFilter[(factoroutputTotalFilter$TotalSurvey > TotalSurveyMin),]

#Require a minimum spread by specifying in SkewMin
SkewMin <- 0.05
factoroutputTotalFilter$IntFlag <- 0
factoroutputTotalFilter$IntFlag[abs(factoroutputTotalFilter$Spread) > SkewMin] <- 1
factoroutputTotalFilter$AbsSpreadIndex <- abs(factoroutputTotalFilter$Spread)*100
winners <- factoroutputTotalFilter[factoroutputTotalFilter$IntFlag == 1,-c(7,8)]

# Append category, dimension, and label lookup
FactorLookup <- read.csv("Dataset/BRFSS_Codebase_Lookup.csv")
# match() on Factor finds each winner's row in FactorLookup and keeps winners in its
# original row order; columns 3, 4, and 2 of the lookup hold category, dimension, and label
idx <- match(winners$Factor, FactorLookup$Factor)
winners$Category <- FactorLookup[idx, 3]
winners$Dimension <- FactorLookup[idx, 4]
winners$Label <- FactorLookup[idx, 2]
# Assign the whole column at once so every row gets a value
winners$Correlation <- ifelse(winners$Spread > 0, "Positive", "Negative")
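
To make the shape of the output concrete, here is a small self-contained run of the same idea on invented data. It is a sketch rather than the code used for the project: it uses base R `tapply()` in place of `ddply()` so that it runs without extra packages, and the toy columns and the 0.05 threshold are assumptions chosen for illustration.

```r
# Toy survey: two factor questions and a 0/1 condition flag (all values invented)
toy <- data.frame(
  LSATISFY  = factor(c(1, 1, 1, 1, 3, 3, 3, 3)),
  EXERCISE  = factor(c("yes", "no", "yes", "no", "yes", "no", "yes", "no")),
  Condition = c(0, 0, 0, 1, 1, 1, 1, 0)
)
baseline <- mean(toy$Condition)

# One block of rows per factor column, one row per response level
factor_cols <- names(toy)[sapply(toy, is.factor)]
rates <- do.call(rbind, lapply(factor_cols, function(col) {
  total <- tapply(toy$Condition, toy[[col]], length)
  cond  <- tapply(toy$Condition, toy[[col]], sum)
  data.frame(Factor = col, Factorvalue = names(total),
             TotalSurvey = as.vector(total), Condition = as.vector(cond),
             ConditionPercentage = round(as.vector(cond / total), 3),
             stringsAsFactors = FALSE)
}))

# Spread from the baseline, then keep responses that skew by more than 0.05
rates$Spread <- rates$ConditionPercentage - baseline
winners_toy <- rates[abs(rates$Spread) > 0.05, ]
winners_toy
```

In this toy run only the two LSATISFY responses survive the filter, because both EXERCISE responses sit exactly at the baseline.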


Readouts: The output looks like this in Excel form; the factor and label responses can be found in the codebook.


Attitudinal and emotional factors were sorted first. To read this, LSATISFY corresponds to the question “In general, how satisfied are you with your life?” The responses that skewed high, ‘3’ and ‘4’, were ‘Dissatisfied’ and ‘Very Dissatisfied’ respectively. The rest of the codes can be found in the public codebook above. Isolate all the interesting, significant factors where the spread, defined as Abs(Condition Percentage - mean survey condition rate (27.4%)), is greater than 5%.

Here's a summary readout of the audience description:

Disease States: People with obesity have a plethora of other disease condition factors

  • They are likely to have arthritis, where their doctor specifically suggested losing weight for joint symptoms
  • Every day, they experience symptoms of asthma including coughing, wheezing, shortness of breath, chest tightness and phlegm production
  • They’ve been told they have diabetes and are taking insulin
  • They’re likely to use smokeless tobacco products every day

Attitudinal: Helping them address their emotional needs will improve motivation

  • They tend to be dissatisfied with life and only sometimes get the emotional support they need
  • They very frequently feel worthless, depressed, and hopeless
  • Everything they do feels like an effort

Physical: They are predominantly physically inactive, but some of them are exercising

  • Dressing, walking, climbing stairs, and running errands alone are difficult
  • Most of the time they feel physically restless
  • 20% responded that they are physically “highly active” but do not engage in vigorous activities

Dietary: Nutrition issues point less to an alcohol problem than to the quality of what they eat

  • They do not identify themselves as heavy drinkers or binge drinkers
  • They always or usually worry about having enough money to buy nutritious meals
  • They drink on average 2 soft drinks per day
  • They eat vegetables more frequently than fruit

Demo Profile: They have financial problems

  • 34% of obese people make less than $15K per year; they are likely to be in the age groups 18-24 and 24-36
  • Financially, they usually worry about having enough money to pay rent and mortgage
  • They are unable to get the medicine they need due to cost
  • If they’ve married, they’re likely to be separated; they have an average of 2.5 children per household

Let’s plot the condition population on a US map with ggplot2

library(ggplot2)
# choropleth: state polygon data joined to the obesity_scale band; state.df: state outlines
ggplot(choropleth, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = obesity_scale), colour = "white", size = 0.2) +
  geom_polygon(data = state.df, colour = "white", fill = NA) +
  scale_fill_brewer(palette = "Purples") + coord_equal()

Using ggplot with our obesity percentages, we see that Montana, Colorado, Pennsylvania, and a few of the southern states skew towards obese (0.30+). Most of the western states skew towards overweight (0.25-0.30). These are the states where preventative programs could be implemented. Further segmentation could identify people who have both attitudinal issues and the condition.
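
The post does not show how the `obesity_scale` fill variable was built. One plausible way, sketched here on invented respondent-level data, is to compute the condition rate per state and band it at the 0.25 and 0.30 cutoffs mentioned above. The state counts, the `state` column name, and the band labels are all assumptions for illustration.

```r
# Toy respondent-level data: 10, 8, and 5 respondents in three states (invented)
state_toy <- data.frame(
  state = rep(c("MT", "CA", "VT"), times = c(10, 8, 5)),
  Condition = c(rep(c(1, 0), times = c(4, 6)),
                rep(c(1, 0), times = c(2, 6)),
                rep(c(1, 0), times = c(1, 4)))
)

# Condition rate per state (tapply orders the states alphabetically)
state_rate <- tapply(state_toy$Condition, state_toy$state, mean)

# Band the rates; intervals are left-closed, so 0.25 falls in the middle band
state_summary <- data.frame(
  state = names(state_rate),
  rate = as.vector(state_rate),
  obesity_scale = cut(as.vector(state_rate),
                      breaks = c(0, 0.25, 0.30, 1),
                      labels = c("<0.25", "0.25-0.30", "0.30+"),
                      right = FALSE, include.lowest = TRUE)
)
state_summary
```

A table like `state_summary` could then be joined to state polygon data to produce the discrete fill that `scale_fill_brewer()` expects.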


Looking forward

Geographic audience profiling could inform media targeting by overlaying the geographic audience-level data with private media data.

Further audience segmentation could identify “active” patients with the condition who are ready for change, using other segmentation strategies in the future.

Obesity is my first test; other diseases such as COPD could be next.
