Data Scraping WebMD and Creating an OTC Drug Finder
Choosing the right medicine should be an easy process. For common symptoms such as a cold or a minor ache, you should be able to walk into a pharmacy and know exactly what you need without consulting a doctor or pharmacist. However, with so many different drugs treating the same symptoms, how do we know we're selecting the best one?
WebMD, the leading source for trustworthy and timely health and medical news and information, provides a large database for users to look up and learn more about medications that they may be taking. While there is a lot of useful information here, finding a drug for your symptoms is not the simplest process.
WebMD allows users to search for a drug by its name or by medical condition. This is very helpful when you want to learn more about a drug that you have been prescribed. With its large database, it is more than likely to have information on whatever medication you may be taking. However, if you are trying to use it to find a drug to treat something like a runny nose, you may run into some issues.
As you might imagine, there are hundreds, if not thousands, of drugs that can treat a runny nose. After you select a drug, WebMD brings you to its next page, where you can read about the details of the drug: how to use it, precautions, side effects, etc. Here, you have to work through multiple paragraphs to determine whether you have any known allergies to the medication or whether it interacts badly with something else you take. If all is well, you can then move on to the Reviews section or Find Lowest Prices.
However, if the ratings or price do not meet your criteria, you have to restart your search from the beginning. This can be quite a tedious process if you're flipping through hundreds of drugs, so some type of automation would definitely be beneficial here.
Data Web Scraping
While WebMD is an extremely rich resource, it does not make its data available for download. Before we can create our Robot Pharmacist Drug Finder, we have to figure out how to retrieve the data that will be required.
So how can we avoid the painful process of copying and pasting every single page on WebMD into an Excel file? Thankfully, we can use a spider to crawl these pages for us. With some simple coding and Python's web scraping framework, Scrapy, we can download all the viewable drug information from WebMD into a convenient CSV file in about 6 hours.
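Scrapy takes care of requesting pages and following links. The per-page step, pulling a few fields out of the HTML and writing them as a CSV row, can be sketched with the standard library alone. This is a minimal illustration, not the spider used for this project: `html.parser` stands in for Scrapy's selectors, and the class names (`drug-name`, `drug-uses`, `drug-price`), the column names, and the sample page are all hypothetical and do not reflect WebMD's real markup.

```python
import csv
import io
from html.parser import HTMLParser


class DrugPageParser(HTMLParser):
    """Collect the text of elements whose class attribute marks a field we want."""

    # Hypothetical mapping from CSS class to CSV column name.
    FIELDS = {"drug-name": "name", "drug-uses": "uses", "drug-price": "price"}

    def __init__(self):
        super().__init__()
        self.row = {}
        self._current = None  # column being filled, if any

    def handle_starttag(self, tag, attrs):
        self._current = self.FIELDS.get(dict(attrs).get("class"))

    def handle_data(self, data):
        if self._current and data.strip():
            self.row[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def pages_to_csv(pages, out):
    """Parse each HTML page and write one CSV row per page to the file-like `out`."""
    writer = csv.DictWriter(out, fieldnames=["name", "uses", "price"])
    writer.writeheader()
    for html in pages:
        parser = DrugPageParser()
        parser.feed(html)
        writer.writerow(parser.row)  # missing fields are written as empty strings


# Invented sample page, used only to exercise the sketch.
sample_page = """
<div>
  <h1 class="drug-name">Example Cold Relief</h1>
  <p class="drug-uses">Relieves runny nose and sneezing.</p>
  <span class="drug-price">$7.99</span>
</div>
"""

buffer = io.StringIO()
pages_to_csv([sample_page], buffer)
print(buffer.getvalue())
```

In a real Scrapy spider the same idea lives in the `parse` callback, which yields one dictionary per page and lets Scrapy's feed export write the CSV.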
OTC Drug Finder
After some basic data cleaning, we are now ready to create our app! With the help of ipywidgets, we are able to create a simple, yet functional app right inside our Jupyter notebook.
The functionality of this app is simple: the user inputs their symptoms, age, allergies, current medications, and known illnesses. The user can also specify whether they want to sort by price, by effectiveness, ease of use, or satisfaction rating, or by number of reviews. Then, with the power of regular expressions and basic NLP, the app returns a table of drugs that the user can take to help with their symptoms.
Because the focal point of this project was to develop web scraping skills, the app was designed as a proof of concept rather than a marketable product. Future work could include porting the functionality to a standalone app framework so that users can access it from their phone or web browser.
Beyond displaying a table, the app could also show detailed content such as a drug's precautions, side effects, and other warnings. Although prescription drugs and OTC drugs were both scraped, this app was designed with the intent of helping users find OTC drugs only. However, prescription drugs could also be included to assist doctors with the process of prescribing drugs to their patients.
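The matching step can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the app's actual code: the rows, the column names (`uses`, `warnings`, `price`, `satisfaction`), and the `find_drugs` helper are hypothetical, and the allergy check is deliberately naive (it simply drops any drug whose warning text mentions the allergen).

```python
import re

# Hypothetical rows standing in for the cleaned scrape; every value is invented.
DRUGS = [
    {"name": "Drug A", "uses": "Relieves runny nose, sneezing and itchy eyes.",
     "warnings": "Contains aspirin.", "price": 8.99, "satisfaction": 3.9},
    {"name": "Drug B", "uses": "Treats runny nose and nasal congestion.",
     "warnings": "May cause drowsiness.", "price": 5.49, "satisfaction": 4.2},
    {"name": "Drug C", "uses": "Temporary relief of headache and minor aches.",
     "warnings": "Do not use with alcohol.", "price": 4.25, "satisfaction": 4.5},
    {"name": "Drug D", "uses": "Eases runny nose and cough.",
     "warnings": "May cause drowsiness.", "price": 11.00, "satisfaction": 4.6},
]


def phrase_pattern(phrase):
    """Compile a case-insensitive, whole-phrase pattern for a user-entered term."""
    return re.compile(r"\b" + re.escape(phrase.strip()) + r"\b", re.IGNORECASE)


def find_drugs(drugs, symptoms, allergies=(), sort_by="price", descending=False):
    """Return drugs whose uses mention every symptom and whose warnings mention no allergen."""
    symptom_patterns = [phrase_pattern(s) for s in symptoms]
    allergy_patterns = [phrase_pattern(a) for a in allergies]
    matches = []
    for drug in drugs:
        if not all(p.search(drug["uses"]) for p in symptom_patterns):
            continue  # at least one symptom is not covered
        if any(p.search(drug["warnings"]) for p in allergy_patterns):
            continue  # warning text mentions something the user is allergic to
        matches.append(drug)
    return sorted(matches, key=lambda d: d[sort_by], reverse=descending)


cheapest_first = find_drugs(DRUGS, ["runny nose"], allergies=["aspirin"])
print([d["name"] for d in cheapest_first])
```

Sorting by a rating instead of price only changes the arguments, for example `find_drugs(DRUGS, ["runny nose"], sort_by="satisfaction", descending=True)`.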
Exploratory Data Analysis
This data science project wouldn't be complete without some EDA. So what insights might we find in this very large dataset of medications?
One question that I wanted to answer was whether the form of a drug affects how its users rate it. Are certain forms of drugs more effective than others? Is it better to buy cough medicine in liquid form or pill form? Does it make a difference?
In this analysis, I used only the 13 most common drug forms in my dataset. Although a total of 51 forms were scraped from WebMD, the top 13 made up over 95% of the drugs that appeared in the data. Further data cleaning will be required to reduce the number of forms, as I noticed that some can be grouped together (e.g. liquids and solutions, lotions and creams, tablets and tabs). However, the current grouping should be sufficient for the purpose of our analysis.
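Picking the most common forms until they cover a target share of the data can be done with a simple running total. The sketch below is only an illustration: the form labels and counts are invented, and `forms_covering` is a hypothetical helper rather than anything from the project code.

```python
from collections import Counter


def forms_covering(forms, threshold=0.95):
    """Return the most common forms needed to reach `threshold` of all rows,
    together with the share of rows they actually cover."""
    counts = Counter(forms)
    total = sum(counts.values())
    kept, covered = [], 0
    for form, n in counts.most_common():
        if covered / total >= threshold:
            break  # the forms kept so far already cover enough rows
        kept.append(form)
        covered += n
    return kept, covered / total


# Invented form column with 100 rows, standing in for the scraped data.
form_column = (["tablet"] * 50 + ["capsule"] * 25 + ["liquid"] * 15
               + ["cream"] * 6 + ["solution"] * 3 + ["patch"] * 1)

top_forms, share = forms_covering(form_column, threshold=0.95)
print(top_forms, share)
```

Merging near-duplicate labels (for instance mapping "tabs" to "tablet") before counting would shrink the list further.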
Comparing the prices of different drug forms, it is quite clear that they differ greatly.
However, when we compare the effectiveness rating of the different forms, it isn't very obvious that there is a significant difference.
A statistical test can help us be more confident about whether they are the same or not. Luckily for us, the stats module in the scipy package allows us to run statistical tests without much effort.
While ANOVA might be the natural way to check whether average effectiveness differs between the groups of drugs, we cannot use it here because the variance differs between the groups. Instead, we use the Kruskal-Wallis test, which works on ranks and is commonly read as a test for a difference in median effectiveness; the median is a good alternative to the mean in this situation. Since our test returned an extremely small p-value, there is enough evidence to conclude that effectiveness differs significantly between drug forms.
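SciPy exposes this test as `scipy.stats.kruskal`. To show what that call computes, the sketch below works out the H statistic by hand in plain Python on three small groups of invented effectiveness ratings; the group names and numbers are assumptions for illustration only. With exactly three groups the chi-square approximation has two degrees of freedom, so the p-value reduces to exp(-H/2); that shortcut does not hold for other group counts.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of sorted positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n), where t is the size of each tie.
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_sizes.values()) / (n ** 3 - n)
    return h / correction


# Invented effectiveness ratings for three drug forms.
tablet = [3.1, 3.4, 2.9, 3.8, 3.3]
cream = [4.2, 4.5, 4.0, 4.4, 4.1]
liquid = [3.6, 3.9, 3.5, 3.7, 4.3]

h_stat = kruskal_wallis_h(tablet, cream, liquid)
p_value = math.exp(-h_stat / 2)  # chi-square survival function, valid for 2 df only
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

On this toy data H is 9.68 and the p-value is about 0.0079, so the three groups would be judged different at the usual 5% level.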
Effectiveness and Prices Between the Drugs
The following two boxplots can help us to better visualize the effectiveness and prices between the drugs.
Brand Names vs Generics
So we have learned that there is clearly a difference in effectiveness between different forms of drugs. What can the data say about brand names vs generic forms of the same drugs?
The following boxplot shows that their effectiveness ratings are essentially the same!
However, when we compare the prices between brand name and generic drugs, there is a clear difference, with generic drugs being significantly less expensive.
The next time you find yourself in the pharmacy section, it may save you a few dollars to go with the generic form if it's available!
In the reviews section of each medication, users on WebMD can provide ratings in the categories of Effectiveness, Ease of Use, and Satisfaction.
Satisfaction with a drug can be a strong indicator of whether a user would buy the drug again in the future. How was satisfaction affected by effectiveness and ease of use?
In the following scatter plot, I wanted to see the relationship between Satisfaction and Ease of Use. The two seem to be moderately correlated, with a Pearson's correlation coefficient of 0.63. However, when we plot Satisfaction against Effectiveness, the correlation is much stronger at 0.85. This suggests that a user's satisfaction is tied more closely to the effectiveness of a drug than to how simple it is to administer. This makes sense, since making a drug easy to take is not as important as making it effective at relieving symptoms. A user would probably care about how effective a drug is first and how easy it is to take second, especially if they are very sick.
Below is a correlation table of all the above ratings, plus an overall rating which was created by taking the average of the 3 ratings.
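For readers who want to build a table of this kind themselves, here is a minimal sketch in plain Python. It assumes nothing about the tooling used for this project: the rating values are invented, the column names are hypothetical, and `pearson` is a hand-written helper implementing the usual Pearson formula.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented per-drug average ratings, one list entry per drug.
ratings = {
    "effectiveness": [4.0, 3.5, 2.0, 4.5, 3.0],
    "ease_of_use": [4.5, 4.0, 3.5, 4.0, 4.5],
    "satisfaction": [3.8, 3.2, 1.5, 4.4, 2.6],
}

# Overall rating: the plain average of the three ratings for each drug.
ratings["overall"] = [
    sum(triple) / 3
    for triple in zip(ratings["effectiveness"],
                      ratings["ease_of_use"],
                      ratings["satisfaction"])
]

names = list(ratings)
table = {a: {b: pearson(ratings[a], ratings[b]) for b in names} for a in names}

# Print the table with one row and one column per rating.
print(" " * 14 + "".join(f"{name:>14}" for name in names))
for a in names:
    print(f"{a:<14}" + "".join(f"{table[a][b]:>14.2f}" for b in names))
```

Because the overall rating is built from the other three, it will correlate with each of them by construction, which is worth keeping in mind when reading its row.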
There is still much work that can be done with this dataset. In the future, more advanced NLP and machine learning algorithms can be used to improve the capabilities of the OTC Drug Finder app, as well as to draw more insights from the data.
You can find the code to my Scrapy spider, data visualization, and app on my GitHub here.