Analysis of WebMD Most Popular Supplement Reviews

Xingwang Chen
Posted on Oct 20, 2019

Introduction:

Supplements are a multibillion-dollar industry. Broadly defined, supplements are vitamins, minerals, herbs, amino acids, and enzymes taken to supplement the body's supply of vital substances and to reduce the risk of disease. Since so many of us take supplements, the question arises: do supplements actually work?

Supplements are regulated by the FDA, but not under the same framework as drugs, which must undergo human clinical trials to establish efficacy and safety before they can be marketed. The FDA does not review dietary supplement products for safety and effectiveness before they are marketed. In short, the answer to the question "do supplements work?" is that we do not know.

In an effort to gain insight into the efficacy of supplements, I did a web scraping project on the most commonly reviewed supplements on WebMD. WebMD is one of the top health-related websites; it publishes articles and reviews written by industry professionals. It also allows users to post reviews of drugs and supplements, which gives us a look at how a broader segment of the population experiences them.

 

Methods:

The web scraping project was done using Scrapy. The following data points were scraped from each user review:

  • Reason for taking
  • Reviewer information
  • Star Rating
    • Effectiveness
    • Ease of Use
    • Satisfaction
  • Comment
  • Helpful, which is the number of people that found the review to be helpful.
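
As a rough illustration of how one scraped review could be held in code, here is a minimal sketch using a plain dataclass. The class name, the field names, the extra `supplement` field, the `parse_stars` helper, and the sample values are all assumptions made for this example; they are not taken from the project's spider or from WebMD's markup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SupplementReview:
    """One scraped user review. Field names are illustrative, not WebMD's."""
    supplement: str                 # supplement being reviewed (assumed extra field)
    reason: str                     # reason for taking
    reviewer: str                   # raw reviewer information string
    effectiveness: Optional[int]    # star rating, None if missing
    ease_of_use: Optional[int]      # star rating, None if missing
    satisfaction: Optional[int]     # star rating, None if missing
    comment: str = ""               # free-text comment
    helpful: int = 0                # number of people who found the review helpful


def parse_stars(text: str) -> Optional[int]:
    """Convert a scraped rating string such as '4' or ' 5 ' to an int, or None."""
    text = text.strip()
    return int(text) if text.isdigit() else None


# Hypothetical values, written for this example only.
review = SupplementReview(
    supplement="Example supplement",
    reason="General health and wellness",
    reviewer="55-64 Female",
    effectiveness=parse_stars("4"),
    ease_of_use=parse_stars(" 5 "),
    satisfaction=parse_stars(""),
)
```

Keeping missing ratings as `None` rather than 0 avoids dragging down averages later in the analysis.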

Analysis of the data set was done using the Python packages numpy, pandas, matplotlib, nltk, textblob, and wordcloud.

 

Project Findings:

A total of 7139 user reviews of the most common supplements were scraped from WebMD. I wanted to focus on the top 10 supplements by effectiveness rating and see whether I could gain insight into their effectiveness. To ensure the ratings were not skewed by small sample sizes, only supplements whose number of ratings was greater than average were considered.
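
The filtering and ranking step can be sketched as follows. This is a minimal version in plain Python under stated assumptions: "average" is read as the mean number of ratings per supplement, the ranking uses the mean effectiveness rating, and the `ratings` list is made-up data. In pandas the same idea would typically be a `groupby` followed by a filter and a sort.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (supplement, effectiveness) pairs standing in for the scraped data.
ratings = [
    ("A", 5), ("A", 4), ("A", 5), ("A", 4),
    ("B", 3), ("B", 4), ("B", 3),
    ("C", 5),  # a single rating: too small a sample to trust
]


def top_by_effectiveness(pairs, n=10):
    """Rank supplements by mean effectiveness, keeping only those whose
    rating count is above the average count per supplement."""
    by_supplement = defaultdict(list)
    for name, stars in pairs:
        by_supplement[name].append(stars)
    avg_count = mean(len(v) for v in by_supplement.values())
    eligible = {k: mean(v) for k, v in by_supplement.items() if len(v) > avg_count}
    return sorted(eligible.items(), key=lambda kv: kv[1], reverse=True)[:n]


top = top_by_effectiveness(ratings)
```

In this toy data, supplement "C" has a perfect score but only one rating, so it is excluded before ranking.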

To understand whether effectiveness was correlated with the other ratings, namely ease of use and satisfaction, I made a correlation plot of the three star ratings.

Based on this, there appears to be a strong correlation among ease of use, effectiveness, and satisfaction. This correlation seems logical, as the biggest consideration for the effectiveness of a drug, or in this case a supplement, is whether the patient actually takes it. Satisfaction and effectiveness should be directly correlated, since users' satisfaction is based in large part on the effectiveness of the supplement.
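
For readers who want to see what sits behind a correlation plot, here is a small sketch that computes pairwise Pearson correlations by hand. The choice of Pearson is an assumption (it is the default in pandas' `DataFrame.corr`), and the three rating lists are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical star ratings for five reviews (not real WebMD data).
columns = {
    "effectiveness": [5, 4, 2, 1, 3],
    "ease_of_use": [5, 5, 3, 2, 4],
    "satisfaction": [5, 4, 1, 1, 3],
}

# Full correlation matrix as a dict keyed by (row, column) name pairs.
corr = {(a, b): pearson(columns[a], columns[b]) for a in columns for b in columns}
```

Each entry of `corr` corresponds to one cell of the correlation plot; the diagonal is 1 by construction.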

To gain more insight into how users might be using these supplements, I applied natural language processing to the comments left by reviewers. For this, I chose the top-ranked supplement by effectiveness, which was colloidal silver, and generated a word cloud. The word cloud suggests that the most common use for colloidal silver was as an antibiotic against infections, and that the length of use ranged from months to years.
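
A word cloud is essentially a picture of word frequencies, so the counting step can be sketched without any plotting. The version below is a simplified stand-in, not the project's actual pipeline: the stop-word list is deliberately tiny, the tokenizer is a single regular expression, and the two comments are invented for this example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {
    "i", "it", "the", "a", "an", "and", "for", "to", "of",
    "my", "have", "been", "in", "on", "this", "was",
}


def word_frequencies(comments):
    """Lowercase each comment, split it into words, drop stop words, and count the rest."""
    counts = Counter()
    for text in comments:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


# Hypothetical comments, written for this example only.
comments = [
    "I have been taking it for months for a sinus infection.",
    "Used it on the infection in my ear.",
]
freq = word_frequencies(comments)
```

The words with the highest counts in `freq` are the ones a word cloud would draw largest.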

 

Demographic studies:

The dataset contained a significant number of data points relevant to user demographics. Analyzing them is useful for determining who the users of WebMD are and what they use the website for. This can inform the design of WebMD and allow better targeting of ads for a better user experience.

Based on the top 10 conditions users entered as their reason for taking supplements, we can see that general health and wellness makes up a significant portion. This makes sense, as supplements are meant to supplement, not to treat acute diseases or conditions. Insomnia, arthritis, anxiety, and weight loss are among the more specific conditions that users take supplements for. The website can engage these user segments by increasing the number of articles about these conditions.

 

One of the most valuable data points a company can gather is user demographics. Here I did a breakdown of the reviewers by age bracket and gender. The top 4 bins are all female, in different age brackets, and the top age bracket is 55-64 (including both male and female reviewers). This suggests that WebMD reviewers are majority female and that the 45-64 age range covers a significant segment of all reviewers. The reviewers here can serve as a stand-in for the average WebMD user, and having this data can help improve the user experience by presenting content and ads relevant to them.
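
The breakdown described above amounts to counting reviewers per (age bracket, gender) bin. A minimal sketch, assuming the reviewer information has already been split into those two fields; the `reviewers` list is made-up data, not the scraped values.

```python
from collections import Counter

# Hypothetical (age bracket, gender) pairs; the real values come from the scraped reviewer field.
reviewers = [
    ("55-64", "Female"), ("55-64", "Female"), ("45-54", "Female"),
    ("55-64", "Male"), ("35-44", "Female"), ("45-54", "Male"),
]

by_bin = Counter(reviewers)                      # count per (age bracket, gender) bin
by_age = Counter(age for age, _ in reviewers)    # count per age bracket, both genders
female_share = sum(1 for _, g in reviewers if g == "Female") / len(reviewers)
```

Sorting `by_bin.most_common()` gives the ordering of bars in a breakdown chart, while `by_age` answers which age bracket is largest overall.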

 

Future work:

For future work, I would like to expand on three fronts. First, I would like to do a cross-analysis between user demographics and the conditions and supplements they review most. Again, this additional layer of data allows for a better user experience and better targeting.

Second, I would like to improve the natural language processing of the reviews for the most effective supplements, as well as of the reviews deemed most helpful. The NLP might give us insight into how users are using the supplements and how that might in turn assist other visitors to the site.

Lastly, I would like to use the data gathered to generate an information sheet, similar to a package insert, for the most common supplements. Package inserts are the FDA-mandated sheets included with prescription drugs that give the drug's indications, clinical trial data, and safety information. Visitors to WebMD are looking for information, and with the aid of an information sheet they will be able to make better-informed decisions about their supplements.

 

Thank you for reading! For a more extensive look at my project, please visit https://github.com/xingwchen/webmd.

 

 

 

About Author

Xingwang Chen

Xingwang is a data scientist with a background in the biotech and pharmaceutical space. Xingwang obtained his Ph.D. in Molecular and Cellular Pharmacology from Stony Brook University. He is interested in leveraging data science and machine learning techniques to advance medical...