Contributed by Robert Deng. Rob took the Intro to R (R007) class with Charlie Redmon from September to October 2014. This post is based on his final project submission.
Goal: Coming from a healthcare advertising agency perspective, I wanted to influence our pitch process by collaborating with our strategy team to provide audience research for program design.
The Behavioral Risk Factor Surveillance System (BRFSS) survey is a public dataset from the CDC. It is used to understand disease prevalence across different socio-economic factors, which informs preventative measures and health promotion activities. For this project, I used a dataset called landline and cellphone screening (LLCP) with survey questions that can be found here:
Process: Factorize qualitative categorical responses, identified as columns having fewer than 10 levels. This way we can more easily break out the condition of interest by factor response.
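A minimal sketch of this factorizing step, assuming a toy data frame in place of the real LLCP data. The column names and values are made up for illustration, and `factorize_low_cardinality` is a hypothetical helper, not a function from the original project:

```r
# Toy stand-in for the survey data; names and values are illustrative only.
survey <- data.frame(
  health_code = c(1, 2, 3, 2, 1, 5, 4, 3, 2, 1, 2, 3),  # 5 distinct codes
  weight_lbs  = c(150, 182, 210, 175, 160, 240, 133, 198, 221, 167, 189, 204)
)

# Convert every column with fewer than `max_levels` distinct values to a factor.
factorize_low_cardinality <- function(df, max_levels = 10) {
  for (col in names(df)) {
    n_distinct <- length(unique(df[[col]]))
    if (n_distinct < max_levels) {
      df[[col]] <- factor(df[[col]])
    }
  }
  df
}

survey_f <- factorize_low_cardinality(survey)
sapply(survey_f, class)  # health_code becomes a factor, weight_lbs stays numeric
```

The threshold of 10 comes from the rule of thumb above; a different cutoff may suit other surveys better.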
From this, we know that 27.4% of survey respondents were considered obese; this is the baseline condition mean.
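If the condition is stored as a 0/1 flag, the baseline is simply its mean. The sketch below uses made-up flags, so it does not reproduce the 27.4% figure. The name `BaselineCondition` is borrowed from the code later in this post, but how it was originally computed is an assumption on my part:

```r
# Condition is a 0/1 flag: 1 = respondent meets the condition (here, obese).
# These ten values are illustrative, not survey data.
Condition <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

# Share of respondents with the condition; na.rm guards against missing answers.
BaselineCondition <- mean(Condition, na.rm = TRUE)
round(BaselineCondition, 3)
```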
Identify factors that skew higher for the disease condition by running ddply inside a for loop and row-binding each result onto a running data frame. The first iteration applies ddply to the first column, counting the rows with the condition of interest. This returns a "condition percentage" skew for each response to the question. Continue this process for all the factors to see the condition percentage for each response. Then clean up the result by removing outliers based on upper and lower condition percentages, and categorize each question into dimensions and categories.
Current limitations lie in columns where the responses vary beyond 10 levels (e.g. how many fruits did you consume, 1-23), where each column needs to be "cut." However
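As a rough illustration of what "cutting" such a column could look like, base R's `cut()` turns a numeric vector into a factor of ranges. The answers and break points below are arbitrary choices for the sketch, not values from the project:

```r
# Hypothetical answers to "how many fruits did you consume" (1-23).
fruit_count <- c(1, 2, 4, 7, 9, 12, 15, 18, 21, 23)

# Bin into four ranges; intervals are right-closed, e.g. (0,3], (3,7], ...
fruit_band <- cut(
  fruit_count,
  breaks = c(0, 3, 7, 14, 23),
  labels = c("1-3", "4-7", "8-14", "15-23")
)

table(fruit_band)
```

Once binned this way, the column has fewer than 10 levels and can go through the same per-factor summary as the other questions.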
#Factor Identification Process
library(plyr)  # provides ddply() and summarise()
factoroutput <- NULL
factoroutputTotal <- NULL
for (i in seq_len(ncol(LLCPf))) {
  # Only summarise the factor columns
  if (is.factor(LLCPf[, i])) {
    # Per response level: respondents, respondents with the condition, and the ratio
    factoroutput <- ddply(LLCPf, names(LLCPf)[i], summarise,
                          TotalSurvey = length(Condition),
                          Condition = sum(Condition),
                          ConditionPercentage = round(Condition/TotalSurvey, 3))
    factoroutput$Factor <- names(LLCPf)[i]
    names(factoroutput) <- c("Factorvalue", "TotalSurvey", "Condition", "ConditionPercentage", "Factor")
    # rbind() with the initial NULL returns factoroutput, so the first factor needs no special case
    factoroutputTotal <- rbind(factoroutput, factoroutputTotal)
  }
}
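To see the shape of the output without the full dataset, here is a sketch of the same per-factor summary in base R on a toy data frame. This is a stand-in for the ddply approach, not the project code. `toy`, its columns, and the helper `summarise_factor` are all hypothetical:

```r
# Toy data: two factor questions plus a 0/1 Condition flag (values made up).
toy <- data.frame(
  smoker    = factor(c("yes", "no", "no", "yes", "no", "yes")),
  exercise  = factor(c("none", "some", "some", "none", "none", "some")),
  Condition = c(1, 0, 0, 1, 1, 0)
)

# Summarise one factor column: respondents and condition rate per response level.
summarise_factor <- function(df, col) {
  total     <- tapply(df$Condition, df[[col]], length)
  condition <- tapply(df$Condition, df[[col]], sum)
  data.frame(
    Factorvalue         = names(total),
    TotalSurvey         = as.vector(total),
    Condition           = as.vector(condition),
    ConditionPercentage = round(as.vector(condition) / as.vector(total), 3),
    Factor              = col,
    stringsAsFactors    = FALSE
  )
}

# Apply to every factor column and stack the results.
factor_cols <- names(toy)[sapply(toy, is.factor)]
toy_output  <- do.call(rbind, lapply(factor_cols, summarise_factor, df = toy))
toy_output
```

Each row is one response to one question, which is the long format the filtering steps work on.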
#Remove the ends and require a minimum number of survey respondents to qualify the audience; adjust these values according to your specificity
ConditionPerLower <- 0.05
ConditionPerUpper <- 0.95
TotalSurveyMin <- 100
# BaselineCondition is the overall condition rate (0.274 here) and must already be defined
factoroutputTotal$Spread <- factoroutputTotal$ConditionPercentage - BaselineCondition
# Keep rows whose condition percentage lies between the lower and upper bounds
factoroutputTotalFilter <- factoroutputTotal[factoroutputTotal$ConditionPercentage >= ConditionPerLower &
                                             factoroutputTotal$ConditionPercentage <= ConditionPerUpper, ]
factoroutputTotalFilter <- factoroutputTotalFilter[factoroutputTotalFilter$TotalSurvey > TotalSurveyMin, ]
#Require a minimum spread by specifying it in SkewMin
SkewMin <- 0.05
# IntFlag marks rows whose absolute spread from the baseline exceeds SkewMin
factoroutputTotalFilter$IntFlag <- 0
factoroutputTotalFilter$IntFlag[abs(factoroutputTotalFilter$Spread) > SkewMin] <- 1
factoroutputTotalFilter$AbsSpreadIndex <- abs(factoroutputTotalFilter$Spread)*100
# Keep the flagged rows and drop helper columns 7 and 8 (IntFlag, AbsSpreadIndex)
winners <- factoroutputTotalFilter[factoroutputTotalFilter$IntFlag == 1, -c(7,8)]
#Append Category, Dimension, and Label from the lookup file
FactorLookup <- read.csv("Dataset/BRFSS_Codebase_Lookup.csv")
# match() finds each winner's row in the lookup and keeps winners in its existing row order
# Lookup columns by position: 1 = Factor, 2 = Label, 3 = Category, 4 = Dimension
lookupRow <- match(winners$Factor, FactorLookup$Factor)
winners$Category <- FactorLookup[lookupRow, 3]
winners$Dimension <- FactorLookup[lookupRow, 4]
winners$Label <- FactorLookup[lookupRow, 2]
#Label the direction of the spread relative to the baseline
winners$Correlation <- ifelse(winners$Spread > 0, "Positive", "Negative")
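One way to attach lookup columns while keeping rows in their existing order is `match()`. The sketch below uses made-up tables (`winners_toy` and `lookup_toy` are hypothetical, as are their contents) and also shows what happens when a factor is missing from the lookup:

```r
# Toy winners table; "unknown_q" has no entry in the lookup on purpose.
winners_toy <- data.frame(
  Factor = c("smoker", "exercise", "smoker", "unknown_q"),
  Spread = c(0.12, -0.08, -0.06, 0.07),
  stringsAsFactors = FALSE
)

# Toy lookup: one row per question.
lookup_toy <- data.frame(
  Factor   = c("exercise", "smoker"),
  Category = c("Physical", "Disease States"),
  stringsAsFactors = FALSE
)

# For each winners row, find the position of its Factor in the lookup (NA if absent).
idx <- match(winners_toy$Factor, lookup_toy$Factor)
winners_toy$Category <- lookup_toy$Category[idx]

# Direction of the spread relative to the baseline.
winners_toy$Correlation <- ifelse(winners_toy$Spread > 0, "Positive", "Negative")
winners_toy
```

Unmatched factors end up with `NA` in the lookup columns, which makes gaps in the lookup file easy to spot.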
Readouts: The output looks like this in Excel form, where the factor and label responses can be found in the codebook.
Attitudinal and emotional factors were sorted first. To read this, LSATISFY corresponds to the question “In general, how satisfied are you with your life?” The responses with high rates, ‘3’ and ‘4’, were ‘Dissatisfied’ and ‘Very Dissatisfied’ respectively. The rest of the codes can be found in the public codebook above. Isolate all the interesting, significant factors where the spread, defined as abs(condition percentage - mean survey condition rate (27.4%)), is greater than 5%.
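The spread rule can be checked by hand on a few numbers. In this sketch the baseline (27.4%) and the 5% threshold come from the post, while the four response-level rates are hypothetical:

```r
BaselineCondition <- 0.274   # overall condition rate reported above
SkewMin <- 0.05              # minimum absolute spread to count as interesting

# Hypothetical condition percentages for four responses.
ConditionPercentage <- c(0.41, 0.30, 0.21, 0.26)

# Spread is the signed distance from the baseline; the flag uses its absolute value.
Spread  <- ConditionPercentage - BaselineCondition
IntFlag <- as.integer(abs(Spread) > SkewMin)

data.frame(ConditionPercentage, Spread = round(Spread, 3), IntFlag)
```

Responses well above or well below the baseline are both flagged, since either direction says something about the audience.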
Here's a summary readout of the audience description:
Disease States: People with obesity have a plethora of other disease condition factors
They are likely to have arthritis, where their doctor specifically suggested losing weight for joint symptoms
Every day, they experience symptoms of asthma including coughing, wheezing, shortness of breath, chest tightness and phlegm production
They’ve been told they have diabetes and are taking insulin
They’re likely to use smokeless tobacco products every day
Attitudinal: Helping them address their emotional needs will improve motivation
They tend to be dissatisfied with life and only sometimes get the emotional support they need
They very frequently feel worthless, depressed, and hopeless
Everything they do feels like an effort
Physical: They are predominantly physically inactive, but some of them are exercising
Dressing, walking, climbing stairs, and running errands alone are difficult
Most of the time they feel physically restless
20% responded that they are physically “highly active” but do not engage in vigorous activities
Dietary: Nutrition issues point less to an alcohol problem and more to the quality of what they eat
They are not identifying themselves as heavy drinkers, or binge drinkers
They always or usually worry about having enough money to buy nutritious meals
They drink on average 2 soft drinks per day
They eat vegetables more frequently than fruit
Demo Profile: They have financial problems
34% of obese people make less than $15K / year; likely to be in the age groups 18-24 and 24–36
Financially, they usually worry about having enough money to pay rent and mortgage
Unable to get the medicine they need due to cost
If they’ve married, they’re likely to be separated; they have an average of 2.5 children per household
Let’s plot the condition population on a US map with ggplot
Using ggplot with our obesity percentages, we see that Montana, Colorado, Pennsylvania, and a few of the southern states skew towards obese (.30+). Most of the western states skew towards overweight (0.25-0.3). These are the states where preventative programs could be implemented. Further segmentation can be done to identify segments of people with attitudinal issues AND the condition state.
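The plotting code isn't shown here, so the following is only a sketch of how a map like this could be built. It assumes respondent-level data with a lower-case state name and a 0/1 `Condition` flag. The data frame `resp` and its values are invented and say nothing about real state rates. The map step runs only if the ggplot2 and maps packages are installed:

```r
# Hypothetical respondent-level data: state name plus 0/1 condition flag.
resp <- data.frame(
  region    = c(rep("montana", 4), rep("oregon", 4), rep("texas", 5)),
  Condition = c(1, 1, 0, 0,   1, 0, 0, 0,   1, 0, 0, 0, 0)
)

# Condition rate per state.
state_rates <- aggregate(Condition ~ region, data = resp, FUN = mean)
names(state_rates)[2] <- "ObesityRate"

# Band the rates the way the map is read: below 0.25, 0.25-0.30, and 0.30+.
state_rates$band <- cut(
  state_rates$ObesityRate,
  breaks = c(0, 0.25, 0.30, 1),
  labels = c("< 0.25", "0.25-0.30", "0.30+"),
  right = FALSE, include.lowest = TRUE
)

# Optional: draw the choropleth if the plotting packages are available.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("maps", quietly = TRUE)) {
  library(ggplot2)
  states_map <- map_data("state")  # polygon outlines keyed by lower-case state name
  p <- ggplot(state_rates, aes(map_id = region)) +
    geom_map(aes(fill = band), map = states_map) +
    expand_limits(x = states_map$long, y = states_map$lat)
  # print(p) draws the map
}
```

Banding the rates before plotting keeps the legend aligned with the thresholds used in the readout above.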
Utilize geographic audience profiling for media targeting implications by overlaying the geographic audience level data with private media data
Further audience segmentation could be done to identify opportunistic “active” condition patients who are ready for change with other future segmentation strategies
Obesity is my first test; other diseases like COPD could be next