Data Scraping WebMD and Creating an OTC Drug Finder

Posted on Feb 20, 2017
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.


Choosing the right medicine should be an easy process. For common symptoms such as a cold or a minor ache, you should be able to walk into a pharmacy and know exactly what you need without consulting a doctor or pharmacist. However, with so many different drugs treating the same symptoms, how do we know we're selecting the best one?


WebMD, the leading source for trustworthy and timely health and medical news and information, provides a large database for users to look up and learn more about medications that they may be taking. While there is a lot of useful information here, finding a drug for your symptoms is not the simplest process.


WebMD allows users to search for a drug by its name or by medical condition. This is very helpful when you want to learn more about a drug that you have been prescribed. With its large database, it is more than likely to have information on whatever medication you may be taking. However, if you are trying to use it to find a drug to treat something like a runny nose, you may run into some issues.


As you might imagine, there are hundreds, if not thousands, of drugs that can treat a runny nose. After you select a drug, WebMD brings you to a page with its details: how to use it, precautions, side effects, etc. Here, the user has to read through multiple paragraphs to determine whether they have any known allergies to, or bad interactions with, the medication. If all is well, the user can then move on to the Reviews section or Find Lowest Prices.


However, if the user finds that the ratings or price do not meet their criteria, they have to restart the search from the beginning. This can be quite tedious if you're flipping through hundreds of drugs, so some type of automation would definitely be beneficial here.

Data Web Scraping

While WebMD is an extremely rich resource, it does not make its data available for visitors to download. Before we can create our Robot Pharmacist Drug Finder, we have to figure out how to retrieve the data it will require.

So how can we avoid the painful process of copying and pasting every single page on WebMD into an Excel file? Thankfully, we can use a spider to crawl these pages for us. With some simple coding and Python's web scraping framework, Scrapy, we can download all the viewable drug information from WebMD into a convenient CSV file in about 6 hours.

OTC Drug Finder

After some basic data cleaning, we are now ready to create our app! With the help of ipywidgets, we are able to create a simple, yet functional app right inside our Jupyter notebook.


The functionality of this app is simple: the user inputs their symptoms, age, allergies, current medications, and known illnesses. The user can also specify whether they want to sort by price, by effectiveness, ease of use, or satisfaction rating, or by number of reviews. Then, with the power of regular expressions and basic NLP, the app returns a table of drugs that the user can take to help with their symptoms.
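The core of that lookup can be sketched in a few lines. This is a simplified illustration rather than the app's actual code: the column names, the sample rows, and the find_drugs helper are all hypothetical, and it only covers symptom matching, an exclusion list, and sorting.

```python
import re

import pandas as pd

# Tiny made-up table standing in for the scraped data; every value is illustrative.
drugs = pd.DataFrame({
    "name": ["Drug A", "Drug B", "Drug C"],
    "uses": [
        "Relieves runny nose and sneezing",
        "Treats headache and minor aches",
        "Relieves cough and runny nose",
    ],
    "warnings": [
        "Do not use if allergic to aspirin",
        "",
        "Avoid with high blood pressure",
    ],
    "price": [8.99, 5.49, 6.25],
})


def find_drugs(table, symptoms, avoid=(), sort_by="price", ascending=True):
    """Keep rows whose uses mention every symptom and whose warnings mention no avoid term."""
    mask = pd.Series(True, index=table.index)
    for symptom in symptoms:
        # Whole-phrase, case-insensitive match; re.escape guards against regex metacharacters.
        pattern = r"\b" + re.escape(symptom) + r"\b"
        mask &= table["uses"].str.contains(pattern, case=False, regex=True)
    for term in avoid:
        pattern = r"\b" + re.escape(term) + r"\b"
        mask &= ~table["warnings"].str.contains(pattern, case=False, regex=True)
    return table[mask].sort_values(sort_by, ascending=ascending)


result = find_drugs(drugs, symptoms=["runny nose"], avoid=["aspirin"], sort_by="price")
print(result[["name", "price"]])
```

In a notebook, a function along these lines could be connected to text boxes and dropdowns from ipywidgets so that the table refreshes as the inputs change.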

Because the focal point of this project was developing web scraping skills, this app was designed as a proof of concept rather than a marketable product. Future work could include porting the functionality to a standalone app framework so that users can access it conveniently from their phone or web browser.

Beyond displaying a table, options to show detailed content such as a drug's precautions, side effects, and other warnings can also be added. Although prescription drugs and OTC drugs were both scraped, this app was designed with the intent of helping users find OTC drugs only. However, prescription drugs could also be included to assist doctors with the process of prescribing drugs to their patients.

Exploratory Data Analysis

This data science project wouldn't be complete without some EDA. So what insights might we find in this very large dataset of medications?

One question that I wanted to answer was whether the form of a drug affects how well it works for its users. Are certain forms of drugs more effective than others? Is it better to buy cough medicine in liquid form or pill form? Does it make a difference?

In this analysis, I used only the 13 most common drug forms in my dataset. Although a total of 51 forms were scraped from WebMD, the top 13 made up over 95% of the drugs that appeared in the data. Future work will be required to clean the data and reduce this number, as I noticed that some forms can be grouped together (e.g., liquids and solutions, lotions and creams, tablets and tabs). However, the data in its current form should be sufficient for the purpose of our analysis.
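Picking a cutoff like this comes down to a cumulative share calculation. The sketch below shows the idea on made-up form labels (the real data had 51 forms; the counts here are invented so the arithmetic is easy to follow).

```python
import pandas as pd

# Made-up form labels standing in for the scraped "form" column.
forms = pd.Series(
    ["tablet"] * 50 + ["capsule"] * 25 + ["liquid"] * 15 + ["cream"] * 6
    + ["spray"] * 2 + ["gel"] * 1 + ["patch"] * 1
)

counts = forms.value_counts()               # sorted from most to least common
cum_share = counts.cumsum() / counts.sum()  # running share of all drugs

# Smallest number of top forms whose combined share reaches the threshold.
threshold = 0.95
n_forms = int((cum_share < threshold).sum()) + 1
top_forms = counts.index[:n_forms].tolist()

print(n_forms, top_forms)
```

With these invented counts, the first three forms cover 90% of the rows and the fourth pushes the total past 95%, so four forms are kept.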



Comparing the prices of different drug forms, it is quite clear that they differ greatly.


Effectiveness Rating

However, when we compare the effectiveness rating of the different forms, it isn't very obvious that there is a significant difference.


Statistical Test

A statistical test can help us be more confident about whether the effectiveness ratings really are the same across forms. Luckily for us, the stats module in the scipy package allows us to run statistical tests without much effort.


While ANOVA might seem the best method for checking whether average effectiveness differs between the groups of drugs, it assumes that the groups have similar variances, and ours do not. Instead, we use the Kruskal-Wallis test, a rank-based test, to see whether there is a difference in median effectiveness. The median is a good alternative to the mean in this situation. Since our test returned an extremely small p-value, there is enough evidence to suggest a significant difference in effectiveness between different forms of drugs.
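As a rough illustration of that workflow, the sketch below runs both steps with scipy.stats on synthetic ratings. The three groups and their parameters are invented, and Levene's test is just one common way to check for unequal variances; it is my assumption here, not necessarily the check used in the original analysis.

```python
import numpy as np
from scipy import stats

# Synthetic effectiveness scores for three hypothetical drug forms (not the scraped data).
rng = np.random.default_rng(0)
tablet = rng.normal(3.5, 0.4, 200)
liquid = rng.normal(3.9, 0.9, 200)
cream = rng.normal(4.2, 0.6, 200)

# Step 1: check the equal-variance assumption behind ANOVA.
levene_stat, levene_p = stats.levene(tablet, liquid, cream)

# Step 2: Kruskal-Wallis H-test, which works on ranks and does not need equal variances.
h_stat, kruskal_p = stats.kruskal(tablet, liquid, cream)

print(f"Levene p-value: {levene_p:.3g}")
print(f"Kruskal-Wallis p-value: {kruskal_p:.3g}")
```

A small Levene p-value signals unequal variances, and a small Kruskal-Wallis p-value signals that at least one group's ratings differ from the others.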

Effectiveness and Prices Between the Drugs

The following two boxplots can help us better compare effectiveness and prices across the drug forms.


Brand Names vs Generics

So we have learned that there is clearly a difference in effectiveness between different forms of drugs. What can the data say about brand names vs generic forms of the same drugs?

The following boxplot shows that their effectiveness ratings are essentially the same!


However, when we compare the prices of brand name and generic drugs, there is a clear difference, with generic drugs being significantly less expensive.

The next time you find yourself in the pharmacy section, it may save you a few dollars to go with the generic form if it's available!


User Ratings

In the reviews section of each medication, users on WebMD can provide ratings in the categories of Effectiveness, Ease of Use, and Satisfaction.

Satisfaction of the drug can be a strong indicator of whether a user would buy the drug again in the future. How was satisfaction affected by effectiveness and ease of use?

In the following scatter plots, I wanted to see how Satisfaction relates to Ease of Use and to Effectiveness. Satisfaction and Ease of Use are moderately correlated, with a Pearson correlation coefficient of 0.63. However, when we plot Satisfaction against Effectiveness, the correlation is much stronger at 0.85. This suggests that the effectiveness of a drug is more strongly tied to a user's satisfaction than the simplicity of administering it. This makes sense, since making a drug easy to take is not as important as making it effective at relieving symptoms. A user would probably care about how effective a drug is first and how easy it is to take second, especially if they are very sick.


Below is a correlation table of all the above ratings, plus an overall rating which was created by taking the average of the 3 ratings.
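A table like this takes only a couple of lines with pandas. The ratings below are invented for illustration and the column names are my own, so the numbers will not match the ones reported in this post.

```python
import pandas as pd

# Made-up per-drug average ratings; column names are hypothetical.
ratings = pd.DataFrame({
    "effectiveness": [4.5, 3.0, 2.0, 5.0, 3.5, 1.5],
    "ease_of_use":   [4.0, 4.5, 3.0, 4.5, 2.5, 3.0],
    "satisfaction":  [4.5, 3.0, 1.5, 5.0, 3.0, 1.0],
})

# Overall rating: the plain average of the three individual ratings.
ratings["overall"] = ratings[["effectiveness", "ease_of_use", "satisfaction"]].mean(axis=1)

# Pairwise Pearson correlations between all four columns.
corr_table = ratings.corr(method="pearson")
print(corr_table.round(2))
```

Because the overall rating is built from the other three, it will naturally correlate strongly with each of them, which is worth keeping in mind when reading its row of the table.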



There is still much work that can be done with this dataset. In the future, more advanced NLP and machine learning algorithms can be used to improve the capabilities of the OTC Drug Finder app, as well as to draw more insights from the data.

You can find the code for my Scrapy spider, data visualization, and app on my GitHub here.


Leave a Comment

Cherie January 22, 2019
Hi Jason, I am trying to do a similar project but scrape data on different Cardiovascular disease surgeons in Florida. How should I go about doing this?
Vamshi September 8, 2017
Can I get a github link for your code
