Scraping WebMD and Creating an OTC Drug Finder with Python

Posted on Feb 20, 2017


Choosing the right medicine should be an easy process. For common symptoms such as a cold or a minor ache, one should be able to walk into a pharmacy and know exactly what they need without consulting a doctor or pharmacist. However, with so many different drugs treating the same symptoms, how do we know we're selecting the best one?


WebMD, one of the most widely used sources of health and medical information, provides a large database where users can look up and learn more about medications that they may be taking. While there is a lot of useful information here, finding a drug for your symptoms is not a simple process.


WebMD allows users to search for a drug by name or by medical condition. This is very helpful when you want to learn more about a drug that you have been prescribed, and with such a large database it most likely has information on whatever medication you are taking. However, if you are trying to use it to find a drug to treat something like a runny nose, you may run into some issues.


As you might imagine, there are hundreds, if not thousands, of drugs that can treat a runny nose. After you select one, WebMD brings you to the drug's page, where you can read the details: how to use it, precautions, side effects, etc. Here, the user has to read through multiple paragraphs to determine whether they have any known allergies to the medication or whether it interacts badly with anything they already take. If all is well, the user can then move on to the Reviews section or Find Lowest Prices.


However, if the ratings or price do not meet the user's criteria, they have to restart the search from the beginning. This is tedious when you are flipping through hundreds of drugs, so some automation would definitely help here.

Web Scraping

While WebMD is an extremely rich resource, it does not make its data available for download. Before we can create our Robot Pharmacist Drug Finder, we have to figure out how to retrieve the data it requires.

So how can we avoid the painful process of copying and pasting every single page on WebMD into an Excel file? Thankfully, we can use a spider to crawl these pages for us. With some simple coding and Python's web scraping framework, Scrapy, we can download all the viewable drug information from WebMD into a convenient CSV file in about 6 hours.

OTC Drug Finder

After some basic data cleaning, we are now ready to create our app! With the help of ipywidgets, we are able to create a simple, yet functional app right inside our Jupyter notebook.


The functionality of this app is simple: the user inputs their symptoms, age, allergies, current medications, and known illnesses. The user can also choose to sort the results by price, by effectiveness, ease-of-use, or satisfaction rating, or by number of reviews. Then, using regular expressions and basic NLP, the app returns a table of drugs that the user can take to help with their symptoms.
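The core of that lookup can be sketched with nothing more than the re module. The records, field names, and the find_drugs helper below are made up for illustration and are not the app's actual code, which also handles age, illnesses, and the ipywidgets interface.

```python
import re

# Made-up records standing in for rows of the cleaned CSV.
DRUGS = [
    {"name": "Drug A", "uses": "Relieves runny nose and sneezing.",
     "warnings": "Do not use with MAO inhibitors.", "price": 7.5, "effectiveness": 4.1},
    {"name": "Drug B", "uses": "Treats cough and runny nose due to colds.",
     "warnings": "Contains aspirin.", "price": 5.0, "effectiveness": 3.8},
    {"name": "Drug C", "uses": "Relieves minor aches and pains.",
     "warnings": "Ask a doctor if you have liver disease.", "price": 4.0, "effectiveness": 4.4},
]


def _pattern(terms):
    """Build a case-insensitive regex matching any of the terms as a whole phrase."""
    escaped = "|".join(re.escape(t.strip()) for t in terms if t.strip())
    return re.compile(rf"\b(?:{escaped})\b", re.IGNORECASE) if escaped else None


def find_drugs(drugs, symptoms, avoid=(), sort_by="price", descending=False):
    """Keep drugs whose uses mention a symptom and whose warnings mention no avoided term."""
    wanted, unwanted = _pattern(symptoms), _pattern(avoid)
    if wanted is None:
        return []
    hits = [
        d for d in drugs
        if wanted.search(d["uses"])
        and not (unwanted and unwanted.search(d["warnings"]))
    ]
    return sorted(hits, key=lambda d: d[sort_by], reverse=descending)


# A user with a runny nose who is allergic to aspirin, sorting by price.
for drug in find_drugs(DRUGS, ["runny nose"], avoid=["aspirin"]):
    print(drug["name"], drug["price"])
```

Plain keyword matching like this misses synonyms and negations ("does not treat..."), which is where the NLP step comes in.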

Since the focus of this project was developing web scraping skills, the app was designed as a proof of concept rather than a marketable product. Future work would include porting the functionality to a standalone app framework so that users can reach it from their phone or web browser. Beyond displaying a table, the app could also show detailed content such as a drug's precautions, side effects, and other warnings. Although prescription drugs and OTC drugs were both scraped, this app was designed to help users find OTC drugs only. Prescription drugs could, however, be included to assist doctors in prescribing drugs to their patients.

Exploratory Data Analysis

This data science project wouldn't be complete without some EDA. So what insights might we find in this very large dataset of medications?

One question that I wanted to answer was whether the form of a drug affects how well it works for its users. Are certain forms of drugs more effective than others? Is it better to buy cough medicine in liquid form or pill form? Does it make a difference?

In this analysis, I used only the 13 most common drug forms in my dataset. Although a total of 51 forms were scraped from WebMD, the top 13 made up over 95% of the drugs that appeared in the data. Future work will be required to clean the data and reduce this number, as I noticed that some forms can be grouped together (e.g., liquids and solutions, lotions and creams, tablets and tabs). However, the current grouping should be sufficient for the purpose of our analysis.
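Picking the top forms and checking how much of the data they cover takes only a few lines. The sketch below uses a made-up list of form labels and a hypothetical alias table for merging near-duplicates such as "tabs" and "tablet". It is not the project's cleaning code.

```python
from collections import Counter

# Hypothetical alias table for merging labels that mean the same form.
ALIASES = {"tabs": "tablet", "solution": "liquid", "lotion": "cream"}


def top_forms_by_coverage(forms, k, aliases=None):
    """Return the k most common forms and the share of all rows they cover."""
    if not forms:
        return [], 0.0
    aliases = aliases or {}
    counts = Counter(aliases.get(f, f) for f in forms)
    top = counts.most_common(k)
    covered = sum(n for _, n in top)
    return [form for form, _ in top], covered / len(forms)


# Made-up form labels; the real column would come from the scraped CSV.
forms = ["tablet"] * 5 + ["tabs"] * 1 + ["capsule"] * 3 + ["liquid"] * 2 + ["cream"] * 1

top, share = top_forms_by_coverage(forms, k=2, aliases=ALIASES)
print(top, round(share, 2))
```

With the aliases applied, "tabs" is counted together with "tablet", so fewer distinct forms are needed to reach the same coverage.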


Comparing the prices of different drug forms, it is quite clear that they differ greatly.


However, when we compare the effectiveness rating of the different forms, it isn't very obvious that there is a significant difference.


A statistical test can help us be more confident about whether they really differ. Luckily for us, the stats module in the scipy package allows us to run statistical tests without much effort.


While a one-way ANOVA would be the natural way to check whether average effectiveness differs between the groups of drugs, it assumes that the groups have similar variances, which is not the case here. Instead, we use the Kruskal-Wallis test, a rank-based alternative that is commonly read as a test of whether the group medians differ. The median is a good alternative to the mean in this situation. Since our test returned an extremely small p-value, there is enough evidence to conclude that effectiveness differs significantly between different forms of drugs.
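For readers who want to try this themselves, the sketch below runs both checks with scipy.stats on a small set of made-up ratings, not the scraped data. Levene's test is used here as one common way to check the equal-variance assumption; the original analysis may have checked it differently.

```python
from scipy import stats

# Made-up effectiveness ratings per drug form, for illustration only.
ratings_by_form = {
    "tablet": [3.1, 3.4, 3.0, 3.6, 3.3, 3.2, 3.5, 2.9],
    "cream": [4.0, 4.6, 3.2, 4.9, 2.8, 4.4, 3.9, 4.8],
    "liquid": [2.1, 2.6, 2.4, 2.0, 2.8, 2.3, 2.5, 2.2],
}
groups = list(ratings_by_form.values())

# Levene's test: a small p-value suggests the group variances are not equal,
# which would violate an assumption of one-way ANOVA.
levene_stat, levene_p = stats.levene(*groups)

# Kruskal-Wallis H-test: rank-based, so it does not require normally
# distributed ratings. A small p-value suggests the groups differ.
h_stat, kruskal_p = stats.kruskal(*groups)

print(f"Levene p = {levene_p:.4f}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {kruskal_p:.6f}")
```

A significant result only says that at least one group differs from the others. Finding out which ones would require pairwise follow-up tests.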

The following two boxplots can help us to better visualize the effectiveness and prices between the drugs.


Brand Names vs Generics

So we have learned that there is clearly a difference in effectiveness between different forms of drugs. What can the data say about brand names vs generic forms of the same drugs?

The following boxplot shows that their effectiveness ratings are essentially the same!


However, when we compare the prices of brand-name and generic drugs, there is a clear difference, with generic drugs being significantly less expensive.

The next time you find yourself in the pharmacy section, it may save you a few dollars to go with the generic form if it's available!


User Ratings

In the reviews section of each medication, users on WebMD can provide ratings in the categories of Effectiveness, Ease of Use, and Satisfaction.

Satisfaction with a drug can be a strong indicator of whether a user would buy the drug again in the future. How was satisfaction affected by effectiveness and ease of use?

In the following scatter plots, I wanted to see how Satisfaction relates to Ease of Use and to Effectiveness. Satisfaction and Ease of Use are moderately correlated, with a Pearson's correlation coefficient of .63. However, when we plot Satisfaction against Effectiveness, the correlation is much stronger at .85. This suggests that a drug's effectiveness is more closely tied to a user's satisfaction than how simple the drug is to administer. This makes sense, since making a drug easy to take is not as important as making it effective at relieving symptoms. A user would probably care about how effective a drug is first and how easy it is to take second, especially if they are very sick.
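Pearson's coefficient itself is simple to compute by hand, which makes it clear what the numbers above measure: the covariance of the two ratings divided by the product of their spreads. The ratings in this sketch are made up and will not reproduce the .63 and .85 reported above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up per-drug average ratings, not values from the scraped data.
satisfaction = [2.0, 3.0, 3.5, 4.0, 4.5]
effectiveness = [2.5, 3.0, 3.8, 4.1, 4.6]
ease_of_use = [4.0, 3.5, 4.5, 3.8, 4.6]

print(round(pearson(satisfaction, effectiveness), 2))
print(round(pearson(satisfaction, ease_of_use), 2))
```

The coefficient only captures linear association, so it says nothing on its own about which rating drives the other.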


Below is a correlation table of all the above ratings, plus an overall rating, which was created by taking the average of the three ratings.



There is still much work that can be done with this dataset. In the future, more advanced NLP and machine learning algorithms can be used to improve the capabilities of the OTC Drug Finder app, as well as to draw more insights from the data.

You can find the code for my Scrapy spider, data visualization, and app on my GitHub here.

