Vitamin Supplements: Scraping WebMD.com Information

Posted on Jun 6, 2019


The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

This blog post describes a web scraping workflow and an R Shiny application.

MOTIVATION:

WebMD.com hosts useful information on medical drugs, supplements, and pathological conditions, among other topics, aimed at improving public health and awareness. In this blog post I present a simple application that gives a consumer useful information about a specific Vitamin Supplement and/or a pathological condition of interest.

WebMD.com hosts a large amount of structured data from user/consumer feedback. However, because of the sheer volume of this feedback, potential new consumers end up scrolling through many web pages before they can make a personal decision about a product, which is time consuming. To make the search easier and more convenient, I collected the data from WebMD.com and summarized it to help a consumer make better decisions regarding a specific Vitamin Supplement or a pathological condition of interest.

WORK FLOW & QUESTIONS OF INTEREST:

I developed and implemented a simple web scraping approach to collect user feedback on several Vitamin Supplements. Next, I developed an R Shiny application to investigate how consumers can benefit from this feedback. Specifically, the frequency distribution of parameters such as user "Satisfaction" is evaluated with respect to user-provided information such as "Gender" and "Age-Group". A second R Shiny application shows consumers the top Vitamin Supplements for a specific pathological condition.

The code for the Python Spider and the R Shiny apps can be found here: https://github.com/uppulury/webmd

METHODOLOGY:

  • DATA EXTRACTION: All of the relevant data on the source web pages were scraped using Scrapy, a Python framework. The number of pages of user reviews varies from one Vitamin Supplement to another, so the number of pages to scrape is not fixed.
  • The Spider worked out which web pages needed to be scraped and extracted the following information from each user review: (1) Supplement Name, (2) Pathological Condition, (3) the Ease-Of-Use of the Supplement, (4) the user Satisfaction, (5) the Supplement Effectiveness, (6) the user Gender, (7) the Date the Review was posted, (8) the Age-Group of the user, and (9) the Time Duration the user consumed the Vitamin Supplement.
  • A total of 6928 user posts were scraped, so the raw data set had 6928 rows and 9 columns.
  • Missing Data: A look at the data set revealed missing values in the user "Gender" and "Age-Group" fields. Specifically, the Gender was missing in 959 records and the Age-Group was missing in 763 records. Of these, 261 records were missing both fields.
  • Handling Missing Data: Rows missing either Age-Group or Gender were removed from the raw data set. That is 959 + 763 - 261 = 1461 rows, leaving a total of 5467 records for analysis.
  • DESCRIPTIVE STATISTICS: Simple histograms of Gender and Age-Group, pooled over all other fields, indicated that (1) there are twice as many Female users as Male users and (2) users are spread broadly across the Age-Groups "0-2", "3-6", "7-12", "13-18", "19-24", "25-34", "35-44", "45-54", "55-64", and "65-74", which account for approximately 0.1%, 0.2%, 0.3%, 1.2%, 4.3%, 12.0%, 14.4%, 24.3%, 26.0% and 17.2% of users respectively. The percentage of users rises with age up to a maximum in the "55-64" Age-Group and then falls for "65-74". Users aged 45 to 64 make up about 50% of the consumers (24.3% + 26.0% = 50.3%).

    Age-Group    % of Users
    0-2            0.1
    3-6            0.2
    7-12           0.3
    13-18          1.2
    19-24          4.3
    25-34         12.0
    35-44         14.4
    45-54         24.3
    55-64         26.0
    65-74         17.2
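
The missing-data bookkeeping and the age-group shares above can be checked with a few lines of Python. This is only a minimal sketch, not the code from the repository: the `Review` record and the `drop_incomplete` helper are hypothetical names introduced for illustration, and only the counts (6928, 959, 763, 261) and the age-group percentages are taken from this post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Review:
    """Hypothetical record for the nine scraped fields (field names are assumptions)."""
    supplement: str
    condition: str
    ease_of_use: int
    satisfaction: int
    effectiveness: int
    gender: Optional[str]      # None when the reviewer left it blank
    review_date: str
    age_group: Optional[str]   # None when the reviewer left it blank
    duration: str


def drop_incomplete(reviews: List[Review]) -> List[Review]:
    """Keep only the reviews that report both Gender and Age-Group."""
    return [r for r in reviews if r.gender is not None and r.age_group is not None]


# Counts reported in the post.
total_rows = 6928
missing_gender = 959
missing_age = 763
missing_both = 261

# Inclusion-exclusion: rows missing both fields must not be counted twice.
rows_removed = missing_gender + missing_age - missing_both
rows_kept = total_rows - rows_removed

# Age-group percentages from the table above.
age_share = {
    "0-2": 0.1, "3-6": 0.2, "7-12": 0.3, "13-18": 1.2, "19-24": 4.3,
    "25-34": 12.0, "35-44": 14.4, "45-54": 24.3, "55-64": 26.0, "65-74": 17.2,
}
share_45_to_64 = age_share["45-54"] + age_share["55-64"]

print(rows_removed, rows_kept)      # 1461 5467
print(round(share_45_to_64, 1))     # 50.3
```

Counting the 261 doubly-missing rows only once is what reconciles the 959 and 763 figures with the 5467 records that remain.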

ANALYSIS & R SHINY APPLICATION:

The R Shiny applications are available from the GitHub link given above. An illustrative figure is shown below. The red histogram shows the frequency distribution of users who rated Melatonin's Ease-Of-Use on a scale of 1-5 (including all genders and age-groups). The blue histogram is a subset of the red histogram: it shows only the Female users (including all age-groups).

The share of Female consumers of Melatonin is clearly much higher than that of Male consumers. Finally, the green histogram is a subset of the blue histogram: it shows only the Female users in the 45-54 Age-Group. In this way the R Shiny app offers a specific breakdown of the distributions that may interest a potential new consumer.
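
The nested red/blue/green breakdown amounts to applying successively narrower filters before counting ratings. A minimal Python sketch of that idea follows. The apps themselves are written in R Shiny, so this is not the author's code; the `rating_counts` helper is a hypothetical name, and the handful of rows below are made-up placeholders rather than scraped data.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# (supplement, gender, age_group, ease_of_use): made-up rows for illustration only
Row = Tuple[str, str, str, int]

SAMPLE_ROWS = [
    ("Melatonin", "Female", "45-54", 5),
    ("Melatonin", "Female", "45-54", 4),
    ("Melatonin", "Female", "25-34", 5),
    ("Melatonin", "Male", "45-54", 3),
    ("Melatonin", "Male", "55-64", 5),
    ("Vitamin D", "Female", "45-54", 2),
]


def rating_counts(
    rows: Iterable[Row],
    supplement: str,
    gender: Optional[str] = None,
    age_group: Optional[str] = None,
) -> Counter:
    """Count Ease-Of-Use ratings for one supplement, optionally narrowed
    by gender and/or age group (None means 'do not filter on this field')."""
    counts: Counter = Counter()
    for name, row_gender, row_age, rating in rows:
        if name != supplement:
            continue
        if gender is not None and row_gender != gender:
            continue
        if age_group is not None and row_age != age_group:
            continue
        counts[rating] += 1
    return counts


all_users = rating_counts(SAMPLE_ROWS, "Melatonin")                      # "red"
female = rating_counts(SAMPLE_ROWS, "Melatonin", gender="Female")        # "blue"
female_45_54 = rating_counts(SAMPLE_ROWS, "Melatonin",
                             gender="Female", age_group="45-54")         # "green"

# Each narrower selection is a subset of the broader one, rating by rating.
for rating in range(1, 6):
    assert female_45_54[rating] <= female[rating] <= all_users[rating]
```

Because each filter only removes rows, every bar in the narrower histogram can be no taller than the matching bar in the broader one, which is why the three histograms can be overlaid on the same axes.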

 
