Data Analysis of WebMD Most Popular Supplement Reviews

Posted on Oct 20, 2019


Supplements are a multibillion-dollar industry. Broadly defined, supplements are vitamins, minerals, herbs, amino acids, and enzymes taken to meet the body's need for vital substances and to reduce the risk of disease. Since so many of us take supplements, the question arises: do supplements actually work?

Supplements are regulated by the FDA, but not under the same framework as drugs, which must undergo human clinical trials to establish efficacy and safety before they can be marketed. The FDA does not review dietary supplement products for safety and effectiveness before they are marketed. In short, the answer to the question "do supplements work?" is that we do not know.

Gaining Insight

To gain insight into the efficacy of supplements, I carried out a web scraping project on the most commonly reviewed supplements on WebMD. WebMD is one of the top health-related websites; it publishes articles and reviews written by industry professionals. It also allows users to post reviews of drugs and supplements, which gives us a look at how a broader segment of the population experiences them.




The web scraping project was done using Scrapy. The following data points were scraped from the user reviews:

  • Reason for taking
  • Reviewer information
  • Star Rating
    • Effectiveness
    • Ease of Use
    • Satisfaction
  • Comment
  • Helpful, which is the number of people that found the review to be helpful.
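
As a rough illustration of the fields listed above, one scraped review could be held in a record like the one below. This is only a sketch using the standard library, not the Scrapy item from the project: the class name, the field names, the extra `supplement` field, and the 1-5 star range are all assumptions, and the example values are made up.

```python
from dataclasses import dataclass


@dataclass
class SupplementReview:
    """One scraped user review. All field names here are hypothetical."""
    supplement: str      # assumed extra field: which supplement was reviewed
    reason: str          # "Reason for taking"
    reviewer_info: str   # raw reviewer text, e.g. age bracket and gender
    effectiveness: int   # star rating, assumed to run from 1 to 5
    ease_of_use: int     # star rating, assumed to run from 1 to 5
    satisfaction: int    # star rating, assumed to run from 1 to 5
    comment: str
    helpful: int = 0     # number of people who found the review helpful

    def __post_init__(self):
        # Reject star ratings outside the assumed 1-5 range.
        for name in ("effectiveness", "ease_of_use", "satisfaction"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


# Made-up example values, for illustration only.
review = SupplementReview(
    supplement="Example supplement",
    reason="General health",
    reviewer_info="55-64 Female",
    effectiveness=4,
    ease_of_use=5,
    satisfaction=4,
    comment="Made-up comment for illustration.",
    helpful=3,
)
```

Validating the ratings when the record is created keeps malformed rows out of the later analysis steps.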

Analysis of the data set was done using the Python packages numpy, pandas, matplotlib, nltk, textblob, and wordcloud.


Project Data Findings:

A total of 7139 user reviews of the most common supplements were scraped from WebMD. I wanted to focus on the top 10 supplements by effectiveness rating and see whether I could gain insight into their effectiveness. To ensure the ratings were not skewed by small sample sizes, only supplements with a greater-than-average number of ratings were considered.
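
The filtering and ranking step described here might look roughly like the sketch below. It uses only the standard library rather than pandas, the function name `top_supplements` is hypothetical, and the ratings are made-up values, not the scraped data.

```python
from collections import defaultdict
from statistics import mean

# (supplement, effectiveness rating) pairs; made-up values for illustration.
ratings = [
    ("A", 5), ("A", 4), ("A", 5), ("A", 4),
    ("B", 3), ("B", 2), ("B", 3),
    ("C", 5),  # a single 5-star review: too few ratings to trust
    ("D", 4), ("D", 4), ("D", 5), ("D", 3),
]


def top_supplements(pairs, n=10):
    """Rank supplements by mean effectiveness, keeping only those whose
    number of ratings is greater than the average number per supplement."""
    by_supplement = defaultdict(list)
    for name, stars in pairs:
        by_supplement[name].append(stars)

    avg_count = mean(len(stars) for stars in by_supplement.values())
    eligible = {
        name: mean(stars)
        for name, stars in by_supplement.items()
        if len(stars) > avg_count
    }
    # Highest mean effectiveness first, truncated to the top n.
    return sorted(eligible.items(), key=lambda kv: kv[1], reverse=True)[:n]


ranking = top_supplements(ratings)
```

In this toy data, supplement "C" has a perfect score from a single review and is dropped by the count filter, which is the situation the above-average cutoff is meant to guard against.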


To understand whether effectiveness was correlated with the other ratings, namely ease of use and satisfaction, I made a correlation plot of the three star ratings.
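
The numbers behind such a correlation plot are pairwise correlation coefficients. A minimal sketch follows, assuming Pearson correlation (the post does not say which coefficient was used) and using made-up ratings; the `pearson` helper and the variable names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up star ratings for six reviews, for illustration only.
star_ratings = {
    "effectiveness": [5, 4, 1, 3, 2, 5],
    "ease_of_use": [5, 5, 2, 4, 3, 4],
    "satisfaction": [5, 4, 1, 3, 1, 5],
}

# Pairwise correlation matrix as a nested dict: matrix[a][b].
correlation_matrix = {
    a: {b: pearson(star_ratings[a], star_ratings[b]) for b in star_ratings}
    for a in star_ratings
}
```

With pandas, the same matrix can be obtained from `DataFrame.corr()`, whose default method is also Pearson.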


Based on this, there appears to be a strong correlation between ease of use, effectiveness, and satisfaction. This correlation seems logical, as the biggest consideration for the effectiveness of a drug, or in this case a supplement, is whether the patient actually takes it. Satisfaction and effectiveness should be directly correlated, since users' satisfaction is in large part based on the effectiveness of the supplement.

To gain more insight into how users might be using these supplements, I applied natural language processing to the comments left by reviewers. For this, I chose the top-ranked supplement by effectiveness, colloidal silver, and generated a word cloud. The word cloud suggests that the most common use of colloidal silver was as an antibiotic against infections, and that the length of use ranged from months to years.
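
A word cloud is essentially a picture of word frequencies, so the underlying counts can be sketched with the standard library alone. This is a simplified stand-in for the nltk, textblob, and wordcloud pipeline, not the code used in the project: the stopword list is a small hypothetical one, and the comments are invented.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list; nltk ships a far more complete one.
STOPWORDS = {
    "a", "an", "and", "the", "for", "i", "it", "my", "of", "to",
    "was", "have", "been", "this", "in", "on", "is",
}


def word_frequencies(comments):
    """Count non-stopword tokens across all comments, case-insensitively."""
    counts = Counter()
    for comment in comments:
        tokens = re.findall(r"[a-z']+", comment.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


# Invented comments, for illustration only.
comments = [
    "I have been taking this for months for a sinus infection.",
    "Used it on a skin infection and it cleared up.",
    "Taking it for a year now.",
]

freqs = word_frequencies(comments)
```

The words with the largest counts are the ones a word cloud would draw largest.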


Demographic Data studies:

The dataset contained a significant number of data points relevant to user demographics. Analyzing them is useful for determining who WebMD's users are and what they use the website for. This can inform the design of WebMD as well as better ad targeting, for a better user experience.

Based on the top 10 conditions users entered as their reason for taking supplements, general health and wellness accounts for a significant portion. This makes sense, as supplements are meant to supplement, not to treat acute diseases or conditions. Insomnia, arthritis, anxiety, and weight loss are among the more specific conditions that users take supplements for. The website can engage these user segments by increasing the number of articles about these conditions.


One of the most valuable data points a company can gather is user demographics. Here I broke down the reviewers by age bracket and gender. The top 4 bins are all female, in different age brackets, and the top age bracket is 55-64 (including both male and female reviewers). This suggests that the majority of WebMD reviewers are female, and that the 45-64 age range covers a significant segment of all reviewers. The reviewers here can serve as a stand-in for the average WebMD user, and having this data can help improve the user experience by presenting users with relevant content and ads.
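
A breakdown like this comes down to counting reviewers per (age bracket, gender) bin. The sketch below assumes the reviewer information has already been parsed into such pairs; the format of the raw field is not shown in this post, and the values are made up.

```python
from collections import Counter

# (age bracket, gender) per reviewer; made-up values for illustration.
reviewers = [
    ("55-64", "Female"),
    ("45-54", "Female"),
    ("55-64", "Female"),
    ("55-64", "Male"),
    ("35-44", "Female"),
    ("45-54", "Female"),
]

by_bin = Counter(reviewers)                                  # age bracket and gender together
by_age = Counter(age for age, _ in reviewers)                # both genders combined
by_gender = Counter(gender for _, gender in reviewers)

top_bins = by_bin.most_common(4)  # the four largest (age bracket, gender) bins
```

Counting by the combined bin and by age alone separately mirrors the two statements in the text: the largest bins, and the largest age bracket across both genders.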


Future work:

For future work, I would like to expand on three fronts. Firstly, I would like to do a cross analysis between the user demographics and the conditions and supplements they reviewed most. This additional layer of data would allow for a better user experience and better targeting.

Secondly, I would like to improve the natural language processing of the reviews with the highest effectiveness ratings as well as the reviews deemed most helpful. The NLP might give us insight into how users are using the supplements and how that might in turn assist other visitors to the site.

Lastly, I would like to use the data gathered to generate an information sheet, similar to a package insert, for the most common supplements. Package inserts are the FDA-mandated sheets included with prescription drugs that give the drug's indications, clinical trial data, and safety information. Visitors to WebMD are looking for information, and with the aid of an information sheet, they will be able to make better-informed decisions about their supplements.


Thank you for reading! For a more extensive look at my project, please visit




About Author

Xingwang Chen

Xingwang is a data scientist with a background in the biotech and pharmaceutical space. Xingwang obtained his Ph.D. in Molecular and Cellular Pharmacology from Stony Brook University. He is interested in leveraging data science and machine learning techniques to advance medical...