Data Scraping WebMD and Creating an OTC Drug Finder

Posted on Feb 20, 2017

Introduction

Choosing the right medicine should be an easy process. For common ailments such as a cold or a minor ache, one should be able to walk into a pharmacy and know exactly what to buy without consulting a doctor or pharmacist. However, with so many different drugs treating the same symptoms, how do we know we're selecting the best one?

WebMD

WebMD, the leading source for trustworthy and timely health and medical news and information, provides a large database for users to look up and learn more about medications that they may be taking. While there is a lot of useful information here, finding a drug for your symptoms is not the simplest process.


WebMD allows users to search for a drug by its name or by medical condition. This is very helpful when you want to learn more about a drug that you have been prescribed. With its large database, it is more than likely to have information on whatever medication you may be taking. However, if you are trying to use it to find a drug to treat something like a runny nose, you may run into some issues.


As you might imagine, there are hundreds, if not thousands, of drugs that can treat a runny nose. After you select a drug, WebMD brings you to its next page, where you can read the details of the drug: how to use it, precautions, side effects, and so on. Here, you have to read through multiple paragraphs to determine whether you have any known allergies to, or bad interactions with, the medication. If all is well, you can then move on to the Reviews section or Find Lowest Prices.


However, if you find that the ratings or price do not meet your criteria, you have to restart your search from the beginning. This can be quite a tedious process if you're flipping through hundreds of drugs, so some type of automation would definitely be beneficial here.

Web Scraping

While WebMD is an extremely rich resource, it does not make its data available for visitors to download. Before we can create our Robot Pharmacist Drug Finder, we have to figure out how to retrieve the data it will require.

So how can we avoid the painful process of copying and pasting every single page on WebMD into an Excel file? Thankfully, we can use a spider to crawl these pages for us. With some simple coding and Python's web scraping framework, Scrapy, we can download all the viewable drug information from WebMD into a convenient CSV file in about 6 hours.

OTC Drug Finder

After some basic data cleaning, we are now ready to create our app! With the help of ipywidgets, we are able to create a simple, yet functional app right inside our Jupyter notebook.


The functionality of this app is simple: the user inputs their symptoms, age, allergies, current medications, and known illnesses. The user can also specify whether they want to sort by price, by effectiveness, ease of use, or satisfaction rating, or by number of reviews. Then, with the power of regular expressions and basic NLP, the app returns a table of drugs that the user can take to help with their symptoms.
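
To give a feel for the regular-expression part, here is a minimal sketch of the matching step. The records, field names, and the `find_drugs` helper are hypothetical stand-ins for the scraped data and the app's internals, not the actual implementation.

```python
import re

# Hypothetical records standing in for rows of the scraped dataset.
DRUGS = [
    {"name": "Drug A", "uses": "Relieves runny nose and sneezing.",
     "warnings": "Do not use if allergic to aspirin.", "price": 7.50},
    {"name": "Drug B", "uses": "Treats headache and minor aches.",
     "warnings": "", "price": 4.00},
    {"name": "Drug C", "uses": "Temporary relief of runny nose.",
     "warnings": "", "price": 5.25},
]


def _any_term(terms):
    """Compile a case-insensitive pattern matching any whole term."""
    alternatives = "|".join(r"\b" + re.escape(t) + r"\b" for t in terms)
    return re.compile(alternatives, re.IGNORECASE)


def find_drugs(drugs, symptoms, allergies=(), sort_by="price"):
    """Keep drugs whose uses mention a symptom and whose warnings
    mention none of the user's allergies, sorted by one column."""
    wanted = _any_term(symptoms)
    avoid = _any_term(allergies) if allergies else None
    matches = [
        d for d in drugs
        if wanted.search(d["uses"])
        and not (avoid and avoid.search(d["warnings"]))
    ]
    return sorted(matches, key=lambda d: d[sort_by])


for drug in find_drugs(DRUGS, ["runny nose"], allergies=["aspirin"]):
    print(drug["name"], drug["price"])
```

Escaping each term with `re.escape` keeps user input from being interpreted as regex syntax, and the word boundaries avoid matching a term inside a longer word.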

Since the focus of this project was developing web scraping skills, this app was designed as a proof of concept rather than a marketable product. Future work could include porting the functionality to a standalone app that users can access from their phone or web browser.

Beyond displaying a table, the app could also offer options to show detailed content such as a drug's precautions, side effects, and other warnings. Although prescription drugs and OTC drugs were both scraped, this app was designed with the intent of helping users find OTC drugs only. However, prescription drugs could also be included to assist doctors in prescribing drugs to their patients.

Exploratory Data Analysis

This data science project wouldn't be complete without some EDA. So what insights might we find in this very large dataset of medications?

One question that I wanted to answer was whether the form of a drug affects how well it works for its users. Are certain forms of drugs more effective than others? Is it better to buy cough medicine in liquid form or pill form? Does it make a difference?

In this analysis, I used only the 13 most common drug forms in my dataset. Although a total of 51 forms were scraped from WebMD, the top 13 made up over 95% of the drugs that appeared in the data. Future work will be required to clean the data and reduce this number, as I noticed that some forms can be grouped together (e.g., liquids and solutions, lotions and creams, tablets and tabs). However, the current form should be sufficient for the purpose of our analysis.
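
A cutoff like this can be found by counting each form and keeping the most common ones until their cumulative share passes a threshold. The sketch below shows the idea on made-up data; the `top_forms` helper and the sample form names are illustrative, not taken from the project code.

```python
from collections import Counter


def top_forms(forms, coverage=0.95):
    """Return the most common forms that together reach the requested
    share of all records, plus the share they actually cover."""
    counts = Counter(forms)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    kept, running = [], 0
    for form, n in counts.most_common():
        if running / total >= coverage:
            break
        kept.append(form)
        running += n
    return kept, running / total


# Made-up sample: one form label per scraped drug.
sample = ["tablet"] * 60 + ["capsule"] * 30 + ["cream"] * 6 + ["gel"] * 4
kept, share = top_forms(sample)
print(kept, round(share, 2))
```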


Price

A comparison of prices across drug forms makes it quite clear that they differ greatly.

[Figure: prices by drug form]

Effectiveness Rating

However, when we compare the effectiveness rating of the different forms, it isn't very obvious that there is a significant difference.

[Figure: effectiveness ratings by drug form]

Statistical Test

A statistical test can tell us whether the differences between forms are larger than we would expect from chance alone. Luckily for us, the stats module in the SciPy package allows us to run statistical tests without much effort.


While ANOVA might be the natural choice for checking whether average effectiveness differs between the groups of drugs, it assumes roughly equal variances across groups, and the variances of our groups differ. Instead, we use the Kruskal-Wallis test, a rank-based alternative that is commonly interpreted as testing for a difference in median effectiveness. The median is a good alternative to the mean in this situation. Since our test returned an extremely small p-value, there is strong evidence that effectiveness differs between drug forms.
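
For readers who want to try this themselves, here is a minimal sketch of the two checks using scipy.stats. The ratings are made up for illustration, and using Levene's test to compare the variances is my assumption for this example; it is one common way to check the equal-variance assumption.

```python
from scipy import stats

# Made-up effectiveness ratings (1-5 scale) for three drug forms.
ratings_by_form = {
    "tablet": [3.1, 3.4, 3.8, 4.0, 3.6, 3.3],
    "cream": [4.2, 4.5, 4.1, 4.7, 4.4, 4.6],
    "liquid": [2.0, 4.9, 3.0, 1.5, 4.8, 2.6],
}
groups = list(ratings_by_form.values())

# Levene's test: a small p-value suggests the group variances differ,
# which argues against a standard one-way ANOVA.
levene_stat, levene_p = stats.levene(*groups)

# Kruskal-Wallis H-test: rank-based, no equal-variance assumption.
h_stat, kruskal_p = stats.kruskal(*groups)

print(f"Levene: W={levene_stat:.3f}, p={levene_p:.4f}")
print(f"Kruskal-Wallis: H={h_stat:.3f}, p={kruskal_p:.4f}")
```

A significant Kruskal-Wallis result only says that at least one group differs from the others. Finding out which ones would take pairwise follow-up tests with a correction for multiple comparisons.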

Effectiveness and Prices Between the Drugs

The following two boxplots help us visualize how effectiveness and price vary across the drug forms.

[Figures: boxplots of effectiveness rating and price by drug form]

Brand Names vs Generics

So we have learned that there is a statistically significant difference in effectiveness between different forms of drugs. What can the data say about brand names vs generic forms of the same drugs?

The following boxplot shows that their effectiveness ratings are exactly the same!

[Figure: boxplot of effectiveness ratings, brand name vs generic]

However, when we compare the prices of brand name and generic drugs, there is a clear difference, with generic drugs being significantly less expensive.

The next time you find yourself in the pharmacy section, it may save you a few dollars to go with the generic form if it's available!

[Figure: prices, brand name vs generic]

User Ratings

In the reviews section of each medication, users on WebMD can provide ratings in the categories of Effectiveness, Ease of Use, and Satisfaction.

Satisfaction with a drug can be a strong indicator of whether a user would buy it again in the future. How is satisfaction related to effectiveness and ease of use?

In the following scatter plots, I wanted to see the relationship between Satisfaction and Ease of Use. The two are moderately correlated, with a Pearson correlation coefficient of 0.63. However, when we plot Satisfaction against Effectiveness, the correlation is much stronger at 0.85. This leads us to believe that a user's satisfaction depends much more on how effective a drug is than on how simple it is to administer. This makes sense, since making a drug easy to take is not as important as making it effective at treating symptoms. A user would probably care about how effective a drug is first and how easy it is to take second, especially if they are very sick.

[Figures: scatter plots of Satisfaction vs Ease of Use and Satisfaction vs Effectiveness]
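
For reference, Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it from scratch on made-up ratings; the `pearson_r` helper and the sample values are illustrative only and are not the project's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations (covariance times n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations.
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Made-up per-drug average ratings.
effectiveness = [4.1, 3.2, 2.5, 4.8, 3.9]
satisfaction = [4.0, 3.0, 2.2, 4.7, 3.6]
print(round(pearson_r(effectiveness, satisfaction), 3))
```

In practice, `scipy.stats.pearsonr` or a pandas `DataFrame.corr()` call gives the same number with less code. Note that a correlation describes an association and does not by itself show that one rating drives the other.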

Below is a correlation table of all the above ratings, plus an overall rating created by taking the average of the three ratings.

[Figure: correlation table of Effectiveness, Ease of Use, Satisfaction, and overall rating]

Conclusion

There is still much work that can be done with this data set. In the future, more advanced NLP and machine learning algorithms can be used to improve the capabilities of the OTC Drug Finder app, as well as to learn more insights from this dataset.

You can find the code for my Scrapy spider, data visualization, and app on my GitHub here.
