HarvardX / MITx Online Courses - Year 1

Posted on Nov 13, 2015

Contributed by John Montroy. John took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his first class project, due in the second week of the program.

(Course) Introduction

Online courses are often touted as the savior of higher education, serving as an accessible and affordable alternative to four-year colleges. Proponents point to that accessibility and affordability; skeptics are quick to note poor completion rates and scant evidence of any clear job-market advantage.

Using a student-level dataset provided by HarvardX/MITx, I explore the skeptics' former point in the analysis below. Specifically, I seek to discuss:

  • How do completion rates fare among various sample groups?
  • Are there any groups with higher completion rates than other groups, on average?
  • How do subgroups perform grade-wise, comparatively?

To summarize, we can simply ask: how can we better engage our students through online courses?

In this analysis, we tackle this question using R and ggplot2.

The Data

The dataset analyzed was provided in an open release by HarvardX and MITx, publicly available here. The data contain student-level observations per course - a student who enrolled in multiple courses would have multiple records in the dataset. The data release included all de-identified student records as well as an accompanying codebook detailing column interpretation. Some high-level statistics:

  • All courses launched on edX for the 2012 - 2013 academic year
  • 17 courses across 3 semesters (e.g. "The Ancient Greek Hero", "Intro to Solid State Chemistry")
  • 1,055,562 total registrants
  • 597,692 unique users
  • 43,196 certificates of completion issued

One glance at these numbers outlines the issue nicely - only 7.2% of users received a certificate of completion (without factoring in multiple certificates per user, which would only reduce this percentage!). With this as our touchpoint, we can dive into analysis using R and ggplot2.
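The headline figure is simple arithmetic on the release's own totals (certificates issued over unique users):

```r
# certificate completion rate, straight from the release's headline counts
certificates <- 43196
unique_users <- 597692
completion_rate <- round(100 * certificates / unique_users, 1)
completion_rate  # 7.2 (percent)
```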

Wrangling - dynamically!

R makes it easy to wrangle your data appropriately, and with a little tweaking you can produce more functionally-oriented code that increases reproducibility. I've included a few highlights of re-usable/dynamic code developed for this analysis - you can find the rest of the code (with comments) at the bottom under "R Code Walks: Wrangling - dynamically!".

After basic cleaning/wrangling, we would like to start aggregating user counts. Specifically, we are interested in percentages - what percentage of students got an A? What percentage stayed engaged with the course for more than a month? What percentage of students with a college degree failed to complete a given course?
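As a toy illustration of the kind of aggregation we want - using hypothetical records and base R only - per-subgroup percentages can be computed with `prop.table`:

```r
# hypothetical records: for each gender, what fraction of students
# falls into each engagement band?
toy <- data.frame(gender = c("m", "m", "f", "f", "f"),
                  band   = c("short", "long", "short", "short", "long"))
# margin = 1 normalizes each row (each gender) to sum to 1
percs <- prop.table(table(toy$gender, toy$band), margin = 1)
percs  # f: 1/3 long, 2/3 short; m: 1/2 long, 1/2 short
```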

In order to generate these percentages from individual records, we need a re-usable function that dynamically calculates percentages per subgroup. The function makes use of dplyr's Standard Evaluation (SE) functionality to pass a column name into a function. Here, we've focused on faceting based on the "active_length" variable, but this can be changed easily:

getPercs <- function(df, colname, colpiv, colvals) {
 if(!missing(colvals)) { 
   df.sub <- df[df[[colname]] %in% colvals,]
 } else {
   df.sub <- df
 } # this checks for optional colvals param
 df.sub <- df.sub[!is.na(df.sub[[colname]]),] # remove NA rows
 df.sub.groups <- df.sub %>%
   group_by_(interp(~x, x = as.name(colname)), interp(~y, y = as.name(colpiv))) %>% 
   summarise(count = n()) # get group totals

 df.sub.totals <- df.sub %>%
   group_by_(interp(~x, x = as.name(colname))) %>% 
   summarise(totcount = n()) # get totals
 df.perc <- df.sub.groups %>% inner_join(df.sub.totals, by = c(colname))
 df.perc <- df.perc %>% mutate(perc = count / totcount)
 df.perc$count <- NULL; df.perc$totcount <- NULL
 df.perc # return the per-subgroup percentages
}

With this function in hand, we can quickly generate percentages of active lengths per subgroup, and use this data for plotting (as will become clearer momentarily). Our dataset contains many potentially interesting facets:

[Figure: facet-able features of the edX dataset]

Let's try a bunch, and celebrate re-usable code!

getPercs(edx.filt, 'gender', 'active_length', c('m','f'))
getPercs(edx.filt, 'age_cat', 'active_length') # 18 - 25, 25 - 32, etc.
getPercs(edx.filt, 'LoE_DI', 'active_length') # Bachelor's, Master's
getPercs(edx.filt, 'final_cc_cname_DI', 'active_length') # country
getPercs(edx.filt, 'school', 'active_length') # harvard / mit
getPercs(edx.filt, 'course', 'active_length')
getPercs(edx.filt, 'CourseCat', 'active_length') # humanities / sciences

Course Participation - The Bad, the Bad, and the still Bad

Now that we have percentages, let's visualize how course participation behaves over time. We do this through the "active_length" variable created above, bearing in mind that some people register for the course before the actual start date. We choose to visualize the subset of users who had a positive, non-zero active length. This tells us nothing about how frequently users interacted with their course, but it does allow us to see general trends in participation.

We create a function to plot attrition rate over "time", where time here is a user's active length. Note the use of "aes_string" for dynamic columns.

createAttritionPlot <- function(df, colorcol, title, legendtitle) {
  ggplot(df, aes_string(x = 'active_length', y = 'perc', color = colorcol)) +
    geom_line() +
    labs(title = paste0("User Engagement Attrition Rate (by ", title, ")"), 
         y = "Percentage of Total Users", x = "Days Active") +
    theme_bw() +
    scale_colour_discrete(name = legendtitle)
}

With this, we begin plotting. With the code set up as such, it's as easy as:

grid.arrange(
   createAttritionPlot(
     getPercs(edx.filt, 'gender', 'active_length', c('m','f')), 
     'gender', 'gender', 'Gender'),
   createAttritionPlot(
     getPercs(edx.filt, 'CourseCat', 'active_length'), 
     'CourseCat', 'course type', 'Course Type'),
   createAttritionPlot(
     getPercs(edx.filt, 'age_cat', 'active_length'), 
     'age_cat', 'age', 'Age Category'),
   createAttritionPlot(
     getPercs(edx.filt, 'LoE_DI', 'active_length'), 
     'LoE_DI', 'level of education', 'Level of Education'),
   ncol = 2) # 2 x 2 grid

(Note: grid.arrange comes courtesy of the gridExtra package)

And finally, this code produces:

[Figure: user engagement attrition rate by gender, course type, age category, and level of education]

Wait a sec! That's not too encouraging - or helpful. It looks like everyone is pretty much equally bad at committing to online courses - or, more importantly, online courses fail to capture students regardless of background.

Let's try faceting out one or two particular variables to look for trends:

Age category?

[Figure: attrition rate faceted by age category]

Course?

[Figure: attrition rate faceted by course]

Neither of these facets is particularly revelatory, other than reinforcing what we've already seen. At the beginning of a course, hopes and ambitions are high, but they fall off very quickly. There is an odd spike around day 25 in Electricity and Magnetism, which motivates the idea of targeted campaigns.

Besides verifying our fears, what can we do with these visualizations in hand? The best course of action is targeted campaigns. Set up A/B tests with different styles of campaign (email, incentives, etc.) and carefully observe any "lift" in the above graphs. We can then find the best way to increase course retention through targeted campaigns.

Course Performance - Find your target

We now turn to a different question: how do various subgroups perform in terms of final grade? We've already seen that retention rates are universally abysmal, regardless of who you are. Perhaps performance comparisons will yield more useful differences.

With our data already clean, plotting becomes relatively straightforward. We start with the plotting results gridded together:


Let's break these graphs down, starting with the top-left.


This first plot shows certificates issued, as a percentage, per gender and level of education. Worth noting is the ratio of male to female certificates per educational level. Lower educational levels tilt towards higher completion rates for men, but this flips at the college/master's level towards women. At the doctorate level, completion rates are about equal.


The second plot shows the number of interaction events per user versus the number of days that user was active. The scatterplot is color-scaled by final grade and log-scaled on the x-axis. This yields a rather pretty trend showing what we'd expect: the longer a user stays engaged and the more they interact, the higher their final grade tends to be.

More interesting here, however, is the subgroup stretching from ~3 - 7 on the x-axis. This starkly-defined cluster can be interpreted as very high performers who nonetheless did not interact with the course material very much. These over-achievers are ripe candidates to be targeted! Flatter them, incentivize them, get them involved with the course, as they clearly understand the material!


This third plot is similar to the second, but instead plots number of chapters completed against number of forum posts (which is essentially a discrete variable, so we almost end up with a bar chart). We see again that more interaction corresponds with a higher grade, and can trace the rise in grade by following the color gradient along a horizontal line drawn almost anywhere.


The final plot is a simple violin plot of final grades, by education level and gender. We see once again that women generally outperform men, but the distributions equalize at the highest education levels. This can be seen in the comparative flatness of each violin plot (or plunger, if you will) at the bottom, and the thickness at the top ends.

Conclusions and Future Work

What can we conclude here? As feared, completion rates are abysmal in online courses. A method of improving retention is targeted campaigns, whose effects we could hopefully observe in visualizations like the ones above.

Course performance is a bit more interesting. On the whole, women seem to outperform men except at the very highest level of educational attainment. Further, there are distinct clusters of users visible simply through observation of scatter plots. One of these groups is a set of high-performing but unengaged students - these students need to be identified and engaged, as they could be invaluable for overall success!

On the technical side - well-designed functions take a bit of time, but save much more on the other side. I've actually used the percentage function shown here in other projects, so it's proven worth the investment.

To conclude - there's plenty more to be done here. We can actually perform some data science on this data: clustering algorithms may reveal even more subgroups of users that we could selectively and uniquely target in email campaigns. Classification models can help us predict the likelihood of completion for users based on various statistics as the course progresses, and regression can be done afterwards to understand the relationship between user interaction and final grade. A fuller dataset would also be more interesting in such an undertaking, of course.
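As a rough sketch of that clustering idea - using simulated stand-ins for the real `nevents`, `ndays_act`, and `grade` columns, since this is future work rather than part of the analysis above - k-means could surface candidate subgroups like the high-grade, low-interaction cluster seen in the scatter plots:

```r
# sketch: k-means on (simulated) engagement features; the real run would
# use the corresponding columns of the edx data frame instead
set.seed(1)
features <- data.frame(nevents   = rpois(300, 50),
                       ndays_act = rpois(300, 20),
                       grade     = runif(300))
# scale() so no single feature dominates the distance metric
km <- kmeans(scale(features), centers = 3)
# km$cluster labels each user; km$centers describes each candidate subgroup
```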

Appendix: R Code Walks

This section is dedicated to all code not already covered above. Some might call this the less interesting code, but who knows?

Wrangling - dynamically!

I begin cleaning by importing the data with appropriate column classes and NA values accounted for. Reading in a snippet of your data and then adjusting your import is crucial for saving time later.

fileloc = '/path/to/dataset.csv'

# get classes, NAs from sample, adjust/account for
edx.head <- read.csv(fileloc, nrows = 1000)
classes <- sapply(edx.head, class)

classes[c("LoE_DI","YoB","gender","final_cc_cname_DI","userid_DI","course_id", "start_time_DI","last_event_DI")] <- "character" 

# re-read with colClasses
edx <- read.csv(fileloc, colClasses = classes, na.strings = c("", " ", NA))

The provided dataset documentation reveals which columns are of interest - we can briefly dispense with useless columns and known bad data.

edx$roles <- NULL
edx.incomplete <- filter(edx, !is.na(incomplete_flag))
edx <- filter(edx, is.na(incomplete_flag))
edx$incomplete_flag = NULL; edx.incomplete$incomplete_flag = NULL;

The next few lines create three new columns based on school/course/semester, a new age category variable, and a new "active_length" variable. This variable is the difference between the course start time and the last time the user interacted with said course. We also convert the date columns [start_time_DI, last_event_DI] to R's Date class. Lastly, we create a mapping of course codes to course type ['Humanities', 'Sciences'] and join that up to create new columns per record. These course category labels are obviously subjective - feel free to yell at me if you disagree.

edx <- edx %>% 
 mutate(start_time_DI = as.Date(start_time_DI, format = "%Y-%m-%d")) %>%
 mutate(., last_event_DI = as.Date(last_event_DI, format = "%Y-%m-%d"))

edx <- edx %>%
 separate(course_id, into = c("school", "course", "semester"), sep = "/", remove=TRUE)

edx <- edx %>%
 mutate(active_length = as.integer(last_event_DI - start_time_DI))

# create age category: approximate age from year of birth, cut into bins
edx <- edx %>%
 mutate(age_cat = ifelse(!is.na(YoB),
   as.character(cut(
     as.numeric(Sys.Date() - as.Date(as.character(YoB), format = "%Y")) / 365,
     breaks = c(0,18,25,35,45,65,Inf),
     labels = c("0 - 18", "18 - 25", "25 - 35", "35 - 45", "45 - 65", "65+")
   )), NA))

# create course type map, no getting around this
coursetype_map <- data.frame(rbind(
 c('CB22x','The Ancient Greek Hero', 'Humanities'),
 c('CS50x','Introduction to Computer Science 1', 'Sciences'),
 c('PH207x','Health in Numbers: Quantitative Methods', 'Sciences'),
 c('PH278x','Human Health and Global Environmental Change','Humanities'),
 c('14.73x','The Challenges of Global Poverty','Humanities'),
 c('2.01x','Elements of Structures','Sciences'),
 c('3.091x','Introduction to Solid State Chemistry','Sciences'),
 c('6.002x','Circuits and Electronics','Sciences'),
 c('6.00x','Introduction to Computer Science and Programming','Sciences'),
 c('7.00x','Introduction to Biology – The Secret of Life','Sciences'),
 c('8.02x','Electricity and Magnetism','Sciences'),
 c('8M.ReV','Mechanics Review','Sciences')
))

# rename cols, convert to strings, join it up
coursetype_map <- coursetype_map %>% rename(CourseCode = X1, CourseName = X2,CourseCat = X3)
coursetype_map <- data.frame(sapply(coursetype_map, as.character), stringsAsFactors = FALSE)
edx <- edx %>% inner_join(coursetype_map, by = c("course" = "CourseCode"))

Course Performance - Find your Target

The below code generates the four plots (bar plot, scatter1, scatter2, violin plot) discussed in the corresponding section above:

# number of completions
gg.bar <- ggplot(getPercs(subset(edx, certified == 1), 'gender', 'LoE_DI'), aes(x = LoE_DI, y = perc)) + 
 geom_bar(stat = "identity", position = "dodge", aes(fill = gender)) +
 labs(title = "Certifications Issued (by education and gender)", 
 y = "Percent Certified", x = "Level of Education") +
 theme_economist() + scale_fill_economist()

# interactivity with color scaling for final grade
gg.point <- ggplot(subset(edx, grade > 0 & grade <= 1), aes(x = log(nevents), y = ndays_act)) +
 geom_point(aes(color = grade)) + 
 scale_color_gradient() +
 labs(title = "User Interactivity (color-scaled by final grade)", 
 y = "Number of Days Active ",x = "Number of interaction events (log)") +
 theme_economist() + scale_fill_economist()

# interactivity re: chapters and forum posts -- why such convergences of chapters read?
gg.jitter <- ggplot(subset(edx, grade > 0 & grade <= 1), aes(x = nforum_posts, y = nchapters)) +
 geom_jitter(aes(color = grade)) + 
 scale_color_gradient() +
 labs(title = "User Interactivity (color-scaled by final grade)", 
 y = "Number of chapters completed ", x = "Number of forum posts") +
 theme_economist() + scale_fill_economist()


# violin of grade by education
gg.violin <- ggplot(subset(edx, !is.na(LoE_DI) & grade > 0 & grade <= 1 & gender %in% c("f","m")), 
 aes(x = factor(LoE_DI), y = grade, fill = gender)) +
 geom_violin() +
 labs(title = "Grade (by education, gender)", 
 y = "Grade", x = "Level of Education") +
 theme_economist() ##+ scale_fill_economist()

About Author

John Montroy

John Montroy is a graduate of Middlebury College with a B.A. in Physics. After a summer of particle physics at CERN with the Harvard ATLAS team, he began his career as a data analyst in the auto industry....

