Scraping WebMD Vitamin Supplements
This blog post describes a web scraping workflow and an R Shiny application.
MOTIVATION: Webmd.com hosts useful information on medical drugs, supplements, and pathological conditions, among other topics, aimed at improving public health and awareness. In this blog post I present a simple application that gives a consumer useful information about a specific Vitamin Supplement and/or a pathological condition of interest. Webmd.com also hosts a large amount of structured user feedback. Because of the sheer volume of that feedback, a potential new consumer ends up scrolling through many web pages before making a personal decision about a product, which is time consuming. To make this search easier and more convenient, I collected the data from Webmd.com and summarized it to help a consumer make better decisions about a specific Vitamin Supplement or a pathological condition of interest.
WORK FLOW & QUESTIONS OF INTEREST: I developed and implemented a simple web scraping approach to collect user feedback on several Vitamin Supplements. Next, I developed an R Shiny application to investigate how consumers can benefit from this feedback. Specifically, the frequency distribution of ratings such as user "Satisfaction" is evaluated with respect to user-provided information such as "Gender" and "Age-Group". A second R Shiny application shows consumers the top Vitamin Supplements for a specific pathological condition.
The code for the Python Spider and the R Shiny apps can be found here: https://github.com/uppulury/webmd
- DATA EXTRACTION: To evaluate specific consumer information, the available data from the source web pages were scraped using the (Python) Scrapy tool. The number of web pages that hold the user reviews varies from one Vitamin Supplement to another. The Spider determined all the web pages that needed to be scraped and extracted the following information from each user review: (1) Supplement Name, (2) Pathological Condition, (3) the Ease-Of-Use of the Supplement, (4) the user Satisfaction, (5) the Supplement Effectiveness, (6) the user Gender, (7) the Date the Review was posted, (8) the Age-Group of the user, and (9) the Time Duration the user consumed the Vitamin Supplement.
- A total of 6928 user posts were scraped, so the raw data set had 6928 rows and 9 columns.
- Missing Data: A peek into the data set revealed missing records for the fields of user "Gender" and "Age-Group". Specifically, Gender was missing in 959 records and Age-Group in 763 records. Of these, 261 records were missing both fields.
- Handling Missing Data: The rows with a missing Age-Group or Gender were removed from the raw data set. That is 959 + 763 - 261 = 1461 rows, so a total of 6928 - 1461 = 5467 records were analyzed.
- DESCRIPTIVE STATISTICS: Simple histograms of Gender and Age-Group across all other fields indicated that (1) there are twice as many Female users as Male users and (2) users are spread broadly across the Age-Groups "0-2", "3-6", "7-12", "13-18", "19-24", "25-34", "35-44", "45-54", "55-64", and "65-74", which account for approx. 0.1%, 0.2%, 0.3%, 1.2%, 4.3%, 12.0%, 14.4%, 24.3%, 26.0% and 17.2% of the users respectively. The share of users increases with age up to the "55-64" Age-Group, which has the maximum number of users, and then drops for "65-74". Users in the Age-Group range of 45 to 64 years comprise about 50% of the consumers (24.3% + 26.0% = 50.3%).
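|Age-Group|% Number of Users|
|---|---|
|0-2|0.1|
|3-6|0.2|
|7-12|0.3|
|13-18|1.2|
|19-24|4.3|
|25-34|12.0|
|35-44|14.4|
|45-54|24.3|
|55-64|26.0|
|65-74|17.2|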
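The missing-data handling in the list above can be sketched in plain Python. This is only an illustration under stated assumptions: the `Review` record, its field names, and the helpers `drop_incomplete` and `rows_remaining` are hypothetical and are not taken from the linked repository, and `None` is assumed to mark a value the user did not report. The final lines check the arithmetic using the counts reported in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    """One scraped user review; the field names are illustrative assumptions."""
    supplement: str
    condition: str
    ease_of_use: int
    satisfaction: int
    effectiveness: int
    gender: Optional[str]      # None when the user did not report it
    date: str
    age_group: Optional[str]   # None when the user did not report it
    duration: str


def drop_incomplete(reviews):
    """Keep only the reviews that report both Gender and Age-Group."""
    return [r for r in reviews
            if r.gender is not None and r.age_group is not None]


def rows_remaining(total, missing_gender, missing_age, missing_both):
    """Inclusion-exclusion: a row missing both fields is removed only once."""
    missing_either = missing_gender + missing_age - missing_both
    return total - missing_either


# Counts reported in the post: 6928 rows, 959 without Gender,
# 763 without Age-Group, 261 without either.
remaining = rows_remaining(6928, 959, 763, 261)
print(remaining)  # 5467
```

Subtracting the 261 overlapping rows matters: simply removing 959 + 763 rows would count the reviews that lack both fields twice.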
ANALYSIS & R SHINY APPLICATION: The R Shiny applications are available from the aforementioned GitHub link. An illustrative figure is shown below. The red histogram shows the frequency distribution of users who rated Melatonin's Ease-Of-Use on a scale of 1-5 (including all genders and age-groups). The blue histogram is a subset of the red histogram in that it depicts only the Female users (including all age-groups). Clearly, the % of Female consumers of Melatonin is high compared to Male consumers of Melatonin. Finally, the green histogram is a subset of the blue histogram in that it depicts the Female users from the 45-54 Age-Group. Hence, this R Shiny app offers a specific breakdown of the distributions that may interest a potential new consumer.
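The nesting of the three histograms (all users, then Female users, then Female users aged 45-54) can be illustrated with a small Python sketch. It is not the R Shiny code from the repository: the `rating_histogram` helper, the dictionary keys, and the sample rows are all made up for illustration, and the rows are not scraped data.

```python
from collections import Counter


def rating_histogram(reviews, field, supplement, gender=None, age_group=None):
    """Count the 1-5 ratings of `field` for one supplement.

    `gender` and `age_group` are optional filters; None means "include all".
    """
    counts = Counter()
    for r in reviews:
        if r["supplement"] != supplement:
            continue
        if gender is not None and r["gender"] != gender:
            continue
        if age_group is not None and r["age_group"] != age_group:
            continue
        counts[r[field]] += 1
    # Report every rating from 1 to 5, including those with zero reviews.
    return {rating: counts.get(rating, 0) for rating in range(1, 6)}


# Made-up sample rows, for illustration only.
sample = [
    {"supplement": "Melatonin", "gender": "Female", "age_group": "45-54", "ease_of_use": 5},
    {"supplement": "Melatonin", "gender": "Female", "age_group": "25-34", "ease_of_use": 5},
    {"supplement": "Melatonin", "gender": "Male", "age_group": "45-54", "ease_of_use": 4},
    {"supplement": "Melatonin", "gender": "Female", "age_group": "45-54", "ease_of_use": 3},
    {"supplement": "Vitamin D", "gender": "Female", "age_group": "45-54", "ease_of_use": 5},
]

all_users = rating_histogram(sample, "ease_of_use", "Melatonin")            # "red"
female = rating_histogram(sample, "ease_of_use", "Melatonin",
                          gender="Female")                                   # "blue"
female_45_54 = rating_histogram(sample, "ease_of_use", "Melatonin",
                                gender="Female", age_group="45-54")          # "green"

print(all_users)
print(female)
print(female_45_54)
```

Because each filter only removes rows, every bar in a narrower histogram is at most as tall as the same bar in the broader one, which is why the three can be overlaid in a single figure.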