Data Analysis of WebMD Most Popular Supplement Reviews
Data shows that supplements are a multibillion-dollar industry. Broadly defined, supplements are vitamins, minerals, herbs, amino acids, and enzymes that are taken to meet the body's need for a vital substance and to reduce the risk of disease. Since so many of us take supplements, the question arises: do supplements actually work?
Supplements are regulated by the FDA, but not under the same framework as drugs, which must undergo human clinical trials to establish efficacy and safety before they can be marketed. The FDA does not review dietary supplement products for safety and effectiveness before they are marketed. In short, the answer to the question "do supplements work?" is that we do not know.
In an effort to gain insight into the efficacy of supplements, I did a web scraping project on the most commonly reviewed supplements on WebMD. WebMD is one of the top health-related websites; it publishes articles and reviews written by industry professionals. It also allows users to post reviews of drugs and supplements, which gives us a critical look at a broader segment of the population.
The web scraping was done using Scrapy. The following data points were scraped from the user reviews:
- Reason for taking
- Reviewer information
- Star Rating
- Ease of Use
- Helpful, which is the number of people that found the review to be helpful.
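As a rough sketch of how one scraped review could be represented, the record below mirrors the fields listed above. The class name and field names are hypothetical, and the separate effectiveness, satisfaction, and comment fields are assumptions based on the ratings and comments discussed later in this post. The sample values are made up.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SupplementReview:
    """One scraped user review. Field names are illustrative, not WebMD's."""
    supplement: str                 # name of the supplement being reviewed
    reason: str                     # "Reason for taking"
    reviewer_info: str              # raw reviewer text (age bracket, gender, ...)
    effectiveness: Optional[int]    # star rating, assumed to be on a 1-5 scale
    ease_of_use: Optional[int]      # star rating, assumed to be on a 1-5 scale
    satisfaction: Optional[int]     # star rating, assumed to be on a 1-5 scale
    helpful: int = 0                # number of people who found the review helpful
    comment: str = ""               # free-text comment, if the reviewer left one


# Made-up example values, purely for illustration.
review = SupplementReview(
    supplement="example supplement",
    reason="General Health & Wellness",
    reviewer_info="55-64 Female",
    effectiveness=4,
    ease_of_use=5,
    satisfaction=4,
    helpful=3,
)

# A plain dict is easy to write to CSV or load into a pandas DataFrame.
row = asdict(review)
print(row)
```

Keeping each review as a flat record like this makes the later grouping and counting steps straightforward.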
Analysis of the data set was done using the Python packages numpy, pandas, matplotlib, nltk, textblob, and wordcloud.
Project Data Findings:
A total of 7139 user reviews of the most common supplements were scraped from WebMD. I wanted to focus on the top 10 supplements by effectiveness rating and see whether I could gain insight into their effectiveness. To ensure the ratings were not skewed by small sample sizes, only supplements with a greater-than-average number of ratings were considered.
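The filtering and ranking step described above could look something like the following minimal sketch. It uses only the standard library; the function name `top_supplements` and the sample ratings are made up for illustration, and the actual project may well have done this with a pandas groupby instead.

```python
from collections import defaultdict
from statistics import mean


def top_supplements(reviews, n=10):
    """Rank supplements by mean effectiveness, keeping only well-reviewed ones.

    `reviews` is an iterable of (supplement, effectiveness) pairs.
    """
    ratings = defaultdict(list)
    for supplement, effectiveness in reviews:
        ratings[supplement].append(effectiveness)

    # Average number of ratings per supplement, used as the cutoff.
    avg_count = mean(len(r) for r in ratings.values())

    # Keep only supplements with more ratings than average.
    kept = {s: mean(r) for s, r in ratings.items() if len(r) > avg_count}

    # Highest mean effectiveness first, truncated to the top n.
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up ratings, purely for illustration.
sample = (
    [("A", 5), ("A", 4), ("A", 5), ("A", 4)]     # 4 ratings
    + [("B", 3), ("B", 4), ("B", 2), ("B", 3)]   # 4 ratings
    + [("C", 5)]                                 # 1 rating, below the cutoff
)
ranking = top_supplements(sample)
print(ranking)
```

In the sample, supplement "C" has a perfect score but only one rating, so it is dropped before ranking. That is the small-sample effect the cutoff is meant to guard against.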
To understand whether effectiveness correlates with the other ratings, namely ease of use and satisfaction, I did a correlation plot of the three star ratings.
Based on this, there appears to be a strong correlation between ease of use, effectiveness, and satisfaction. This correlation seems logical, as the biggest consideration for the effectiveness of a drug, or in this case a supplement, is whether the patient actually takes it. Satisfaction and effectiveness should show a direct correlation, since users' satisfaction is in large part based on the effectiveness of the supplement.
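For readers who want to see what sits behind such a correlation plot, here is a small sketch that computes pairwise Pearson correlations between the three ratings. The numbers are made up and the helper `pearson` is written out by hand for clarity. With the data in a pandas DataFrame, `DataFrame.corr()` would give the same kind of matrix.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up star ratings for five reviews, purely for illustration.
ratings = {
    "effectiveness": [5, 4, 2, 1, 3],
    "ease_of_use":   [5, 5, 3, 2, 4],
    "satisfaction":  [5, 4, 1, 1, 3],
}

# Pairwise correlation matrix as a nested dict.
corr = {a: {b: pearson(ratings[a], ratings[b]) for b in ratings} for a in ratings}

for name, row in corr.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

A value close to 1 means two ratings tend to rise and fall together. It does not by itself show that one causes the other.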
To gain more insight into how users might be using these supplements, I applied natural language processing to the comments left by reviewers. For this, I chose the top-ranked supplement by effectiveness, which was colloidal silver, and generated a word cloud. The word cloud suggests that the most common use of colloidal silver was as an antibiotic against infections, and that the length of use ranged from months to years.
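A word cloud is essentially a picture of word frequencies, so the underlying step can be sketched as a simple count. The comments and the tiny stop-word list below are made up for illustration; nltk ships a much fuller stop-word list, and this is not necessarily how the project tokenized its text.

```python
import re
from collections import Counter

# Tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "i", "it", "for", "of", "to", "my",
              "in", "on", "is", "was", "have", "has", "this", "that", "with"}


def word_frequencies(comments):
    """Count non-stop-words across a list of review comments."""
    counts = Counter()
    for comment in comments:
        # Lowercase, then keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Made-up comments, purely for illustration.
comments = [
    "I have used this for months for a sinus infection.",
    "Took it for an infection for about a week.",
    "Taking it for years with no infection since.",
]
freqs = word_frequencies(comments)
print(freqs.most_common(3))
```

A frequency mapping like `freqs` is the kind of input the wordcloud package's `generate_from_frequencies` method accepts, if you prefer to control the counting yourself.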
Demographic Data Studies:
The dataset contains a significant number of data points relevant to user demographics. Analyzing them would be useful in determining who the users of WebMD are, and how and for what they use the website. This can inform the design of WebMD as well as better targeting of ads for a better user experience.
Based on the top 10 conditions users entered as their reason for taking supplements, we can see that general health and wellness accounts for a significant portion. This makes sense, as supplements are meant to supplement, not to treat acute diseases or conditions. Insomnia, arthritis, anxiety, and weight loss are among the more specific conditions that users take supplements for. The website can engage these user segments by increasing the number of articles about these conditions.
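A top-10 list of reasons like the one discussed above can be produced with a simple tally. The sketch below is an assumption about how one might do it, not the project's actual code. The function name `top_reasons` and the sample values are made up, and the light normalization (trimming and lowercasing) is one possible way to merge near-duplicate entries.

```python
from collections import Counter


def top_reasons(reasons, n=10):
    """Return the n most common reasons with their count and share of reviews."""
    # Normalize case and whitespace so near-duplicates are counted together,
    # and skip reviews where no reason was entered.
    cleaned = [r.strip().lower() for r in reasons if r and r.strip()]
    total = len(cleaned)
    return [(reason, count, count / total)
            for reason, count in Counter(cleaned).most_common(n)]


# Made-up "Reason for taking" values, purely for illustration.
reasons = ["General Health", "general health ", "Insomnia",
           "Arthritis", "Insomnia", "general health", ""]
top = top_reasons(reasons)
for reason, count, share in top:
    print(f"{reason}: {count} ({share:.0%})")
```

Reporting the share alongside the raw count makes it easier to judge how dominant a category such as general health really is.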
One of the most valuable data points a company can gather is user demographics. Here I did a breakdown of the reviewers by age bracket and gender. The top 4 bins are all female, in different age brackets, and the top age bracket is 55-64 (including both male and female reviewers). This suggests that the majority of WebMD reviewers are female, and that the 45-64 age range covers a significant segment of the total reviewers. The reviewers here can serve as a stand-in for the average WebMD user, and having this data can help improve the user experience by presenting users with relevant content as well as ads.
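The breakdown described above involves two views of the same data: counts per (age bracket, gender) bin, and totals per age bracket across genders. A minimal sketch follows, assuming the reviewer information has already been parsed into an age bracket and a gender. The pairs below are made up and do not reflect the real counts.

```python
from collections import Counter

# Made-up (age bracket, gender) pairs, purely for illustration.
reviewers = [
    ("55-64", "Female"), ("55-64", "Female"), ("55-64", "Female"),
    ("55-64", "Male"),
    ("45-54", "Female"), ("45-54", "Female"),
    ("65-74", "Male"),
]

# Counts per combined bin, e.g. ("55-64", "Female").
by_bin = Counter(reviewers)

# Totals per age bracket, combining male and female reviewers.
by_age = Counter(age for age, _ in reviewers)

# Totals per gender, used to estimate the overall split.
by_gender = Counter(gender for _, gender in reviewers)
female_share = by_gender["Female"] / len(reviewers)

print(by_bin.most_common(4))
print(by_age.most_common(1))
print(f"female share: {female_share:.0%}")
```

Looking at the combined bins and the per-bracket totals separately helps avoid confusing "largest bin" with "largest age bracket", which need not coincide.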
For future work, I would like to expand on three fronts. Firstly, I would like to do a cross analysis between the user demographics and the conditions and supplements they most reviewed. Again, this additional layer of data allows for better user experience and targeting.
Secondly, I would like to improve the natural language processing of the reviews rated most effective as well as those deemed most helpful. The NLP might give us insight into how users are using the supplements and how that might in turn assist other visitors to the site.
Lastly, I would like to use the data gathered to generate an information sheet, similar to a package insert, for the most common supplements. Package inserts are the printed sheets included with prescription drugs that give the indications, clinical trial data, and safety information for the drug, as mandated by the FDA. Visitors to WebMD are looking for information, and with the aid of an information sheet, they will be able to make better-informed decisions about their supplements.
Thank you for reading! For a more extensive look at my project, please visit https://github.com/xingwchen/webmd.