Scraping WebMD Vitamin Supplements

Karthik Uppulury
Posted on Jun 6, 2019

The blog post describes a Web Scraping workflow and an R Shiny Application.

MOTIVATION: Webmd.com hosts useful information regarding medical drugs, supplements, pathological conditions, inter alia, aimed at improving public health and awareness. In this blog post I'll discuss and present a simple application that is aimed at providing a consumer with necessary and useful information regarding a specific Vitamin Supplement and/or a pathological condition of interest. Webmd.com hosts much structured data from user/consumer feedback. However, owing to the sheer volume of such vital consumer feedback potential new consumers end up scrolling several of these web pages to be able to make a personal decision about the product which is a time consuming activity. In order to make the search process more easy and convenient for the consumer, I've collected the data from Webmd.com to better summarize the data to help a consumer make better decisions regarding a specific Vitamin Supplement or a pathological condition of interest.

WORK FLOW & QUESTIONS OF INTEREST: I've developed and implemented a simple web scraping approach to collect user feedback pertinent to several of the Vitamin Supplements. Next, I developed an R Shiny application to investigate how consumers can benefit from such consumer feedback. Specifically, the frequency distribution of parameters such as user "Satisfaction" is evaluated with respect to user provided information such as "Gender" and "Age-Group". Also another R Shiny application was developed that offers consumers with the top Vitamin Supplements for consumption for a specific pathological condition. 

The code for the Python Spider and the R Shiny apps can be found here: https://github.com/uppulury/webmd

METHODOLOGY:

  • DATA EXTRACTION: In order to evaluate specific consumer information, the available data from the source web pages were all scraped using the (Python) Scrapy tool. For a given Vitamin Supplement, the number of web pages to be scraped to collect data pertinent to each user review is a variable number. The Spider evaluated all the web pages that needed to be scraped and extracted the following information from each user review: (1) Supplement Name, (2) Pathological Condition, (3) the Ease-Of-Use of the Supplement, (4) the user Satisfaction, (5) the Supplement Effectiveness, (6) the user Gender, (7) the Date the Review was posted, (8) the Age-Group of the user, and (9) the Time Duration the user consumed the Vitamin Supplement.
  • A total of 6928 user posts were scraped, so the overall sample size had 6928 rows and 9 columns. 
  • Missing Data: A peek into the data set revealed missing records for the fields of user "Gender" and "Age-Group". Specifically, a total of 959 instances of user's Gender were missing and 763 instances of the user's Age were unavailable. Of these, 261 records were in common and therefore pertained to the same user.
  • Handling Missing Data: From the raw data set the rows containing the missing records of Age and Gender were removed. Hence, a total of 5467 records were analyzed.
  • DESCRIPTIVE STATISTICS: A simple histogram distribution of the Gender and Age-Group across all other fields indicated that - (1) Female users are twice as many as Male users and (2) a broad distribution of the number of users across different Age-Groups, specifically - "0-2", "3-6", "7-12", "13-18", "19-24", "25-34", "35-44". "45-54", "55-64", "65-74", and the % number of users from these distinct Age-Groups are approx. 0.1%, 0.2%, 0.3%, 1.2%, 4.3%, 12.0%, 14.4%, 24.3%, 26.0% and 17.2% respectively. It is intuitive the % number of users increase with increase in Age with the maximum number of users in the Age-Group of "55-64". The users in the Age-Group range of 45 to 64 years comprise 50% of the consumers. 

    Age-Group    % Number of Users
        0-2                0.1
        3-6                0.2
      7-12                0.3
    13-18                1.2
    19-24                4.3
    25-34              12.0
    35-44              14.4
    45-54              24.3
    55-64              26.0
    65-74              17.2

ANALYSIS & R SHINY APPLICATION: The R Shiny applications are made available mentioned from the aforementioned Github link. An illustrative figure is shown below. The red histograms indicate the frequency distribution of users who rated Melatonin's Ease-Of-Use on a scale of 1-5 (including all genders and age-groups). The blue histogram is a subset of the red histogram in that it depicts the Female users (including all age-groups). Quite clearly, the % Female consumers of Melatonin is quite high compared to Male consumers of Melatonin. Finally, the green histogram is a subset of the blue histogram in that it depicts the Female users from the 45-54 Age-Group. Hence, this R Shiny app offers a specific breakdown of the distributions that may interest a potential new consumer.

 

About Author

Leave a Comment

No comments found.

View Posts by Categories


Our Recent Popular Posts


View Posts by Tags

2019 airbnb Alex Baransky alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus API Application artist aws beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep Bundles California Cancer Research capstone Career Career Day citibike clustering Coding Course Demo Course Report D3.js data Data Analyst data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization Deep Learning Demo Classes Demo Day Demo Lesson Discount dplyr employer networking feature engineering Finance Financial Data Science Flask gbm Get Hired ggplot2 googleVis Hadoop higgs boson Hiring hiring partner events Hiring Partners Industry Experts Instructor Blog Instructor Interview Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter lasso regression Lead Data Scienctist Lead Data Scientist leaflet Lectures linear regression Live Chat Live Online Bootcamp Logistic Regression machine learning Maps matplotlib Medical Research Meet the team meetup Networking neural network Neural networks New Courses nlp NYC NYC Data Science nyc data science academy NYC Open Data NYCDSA NYCDSA Alumni Online Online Bootcamp Online Lectures Online Training Open Data painter pandas Part-time Portfolio Development prediction Prework Programming PwC python python machine learning python scrapy python web scraping python webscraping Python Workshop R R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking Realtime Interaction recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn Selenium sentiment analysis Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau team TensorFlow Testimonial tf-idf Top Data Science Bootcamp twitter visualization web scraping Weekend Course What to expect word cloud word2vec XGBoost yelp