Data Scraping Groupon!
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
Everyone loves saving money. We all try to make the most of our money, and sometimes it's the simplest things that make the biggest difference. Coupons have long been thought of as pieces of paper clipped and taken to the supermarket to get a discount. Anyone who has done this knows that clipping coupons can be tedious and time-consuming, but using coupons has never been easier thanks to Groupon.
Groupon is a coupon recommendation service that broadcasts electronic coupons for restaurants and stores in your neighborhood. Some of these coupons can be very significant, especially when planning group activities, because the discounts can reach as high as 60%.
You can sign up for Groupon for free, and each day Groupon will email you the deals of the day in your metro area. If you like a deal, you can purchase it immediately from Groupon and redeem it at the restaurant or store for its value.
Data
The data was scraped from the New York City region of the Groupon website. The site is laid out as an album view of all the different groupons, with an in-depth page for each particular groupon. The website's appearance is shown in the following:
The layouts of both pages were static, so a custom Scrapy spider was built to quickly crawl through all the pages and retrieve the information to be analyzed. However, the comments, the most important pieces of information, were rendered and loaded via JavaScript, so a Selenium script was used. The Selenium script took the URLs of the groupons acquired by Scrapy and essentially mimicked a human clicking the "next" button in the user comments section.
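A minimal sketch of that pagination loop is shown below; the CSS selectors (`div.tip-item`, `a.pagination-next`) are hypothetical stand-ins, since Groupon's actual class names are not recorded here:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Chrome()

def scrape_reviews(url):
    """Collect all user reviews from one groupon page by paging through comments."""
    driver.get(url)
    reviews = []
    while True:
        time.sleep(2)  # crude wait for the JavaScript-rendered comments to load
        for block in driver.find_elements(By.CSS_SELECTOR, "div.tip-item"):
            reviews.append(block.text)
        try:
            # mimic a human clicking the "next" button in the comments section
            driver.find_element(By.CSS_SELECTOR, "a.pagination-next").click()
        except NoSuchElementException:
            break  # no more comment pages
    return reviews
```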
Findings
Around 400 individual groupons were scraped. The data retrieved from each groupon is shown below:
- Groupon Title
- Merchant
- Categories
- Mini Info
- Deal Features
- Location
- Total Number of Ratings
- URL
Around 89,000 individual user reviews were scraped. The data retrieved from each review is shown below:
- Author
- Date
- Review
- URL
Exploratory Data Analysis
An interesting finding was that the use of groupons has increased tremendously over the past few years. We found this by examining the dates supplied with the reviews. Looking at the following image, where the x-axis represents month/year and the y-axis represents review count, this trend becomes obvious. The slight decline at the end is likely because some of the groupons available at the time were seasonal in nature.
Price
Lastly, since much of the data is text that follows the common format Price (Original Price), a regular expression was derived to parse out the price information and the number of deals offered. That information is presented in the following bar chart:
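For reference, a minimal sketch of such a pattern, assuming deal strings of the form "$59 ($120)" (the exact format on the site may differ):

```python
import re

# Capture the sale price and the original price in parentheses,
# e.g. "$59 ($120)" or "$59.99 ($120.00)".
price_pattern = re.compile(r"\$(\d+(?:\.\d{2})?)\s*\(\$(\d+(?:\.\d{2})?)\)")

def parse_prices(text):
    match = price_pattern.search(text)
    if match is None:
        return None
    price, original = map(float, match.groups())
    return price, original

print(parse_prices("One-Hour Massage: $59 ($120)"))  # (59.0, 120.0)
```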
In addition, a word cloud was generated from the user review data:
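A cloud like this can be produced with the wordcloud package; a minimal sketch, assuming `reviews` holds the scraped review strings:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# `reviews` is assumed to be the list of scraped review strings
text = " ".join(reviews)
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```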
Topic Modelling
To conduct the topic modelling, the two most important packages used were gensim and spacy. The initial step in creating a corpus was to remove all stop words such as "a," "of," and "the." The next step involved lemmatizing the cleaned reviews, and finally creating trigrams. Trigrams were chosen because, given the dataset's New York location, phrases like "New York City" should be identified as a single unique token.
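A sketch of that preprocessing pipeline, assuming `reviews` holds the raw review strings (parameter values are illustrative, not the ones actually used):

```python
import spacy
from gensim.models.phrases import Phrases, Phraser
from gensim.parsing.preprocessing import STOPWORDS

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(review):
    """Lowercase, drop stop words and non-alphabetic tokens, return lemmas."""
    doc = nlp(review.lower())
    return [tok.lemma_ for tok in doc if tok.is_alpha and tok.text not in STOPWORDS]

docs = [lemmatize(r) for r in reviews]

# Detect common bigrams first, then trigrams, so that phrases like
# "new york city" collapse into the single token "new_york_city".
bigram = Phraser(Phrases(docs, min_count=5, threshold=10))
trigram = Phraser(Phrases(bigram[docs], threshold=10))
docs = [trigram[bigram[d]] for d in docs]
```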
The model chosen was Latent Dirichlet Allocation (LDA), both for its ability to distinguish topics across different documents and because a package exists to visualize its results clearly and efficiently. Since the method is unsupervised, the number of topics has to be chosen beforehand; after some experimentation, the optimal number was 3, with the model run for 25 iterations. The results are as follows:
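A sketch of fitting and visualizing such a model with gensim and pyLDAvis, reusing the preprocessed `docs` from above (hyperparameters here are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=10, no_above=0.5)  # drop rare/ubiquitous terms
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=3,     # chosen after experimentation
               passes=25,
               random_state=42)

# Interactive topic visualization, saved as a standalone HTML page
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")
```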
The visualization above is a projection of the topics onto two components: topics that are similar to each other appear closer together, while dissimilar ones appear further apart. The words on the right are those composing each topic, and the lambda parameter controls their exclusivity. A lambda of 0 shows the words most exclusive to each topic, while a lambda of 1 shows the words most frequent within each topic.
The first topic consists of actionable words, which I believe reflect the quality and reception of the groupons. The second topic contains words describing exercise and physical activities. Lastly, the third topic contains words belonging to the food category.
Conclusion
Topic modelling is a form of unsupervised learning, and the scope of this project was to briefly examine the functionality and efficacy of finding patterns in the underlying words. Though we each believe our reviews of products and services to be unique, the model makes it evident that certain words are in fact used consistently across the whole population of reviewers.
This project is not completely exhaustive, as there are many more NLP techniques that could be applied, and the scope of the reviews could be extended to cover the country as a whole. Other techniques I plan to use are sentiment analysis and a word2vec model to conduct "word" algebra, helping future website visitors find terms similar to the ones they want.
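As a preview of that direction, a minimal word2vec sketch with gensim, reusing the tokenized `docs` from the preprocessing step (the query word is purely illustrative):

```python
from gensim.models import Word2Vec

# Train word vectors on the tokenized reviews
w2v = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=5)

# Find terms used similarly to "massage" in the reviews
# (raises KeyError if the word never appears in the corpus)
print(w2v.wv.most_similar("massage", topn=5))
```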