Scraping Groupon!

Nicholas Maloof
Posted on Jan 13, 2018

Introduction

Everyone loves saving money. We all try to make the most of our money, and sometimes it’s the simplest things that can make the biggest difference. Coupons have long been thought of as pieces of paper clipped out and taken to the supermarket for a discount. Anyone who has done this knows clipping coupons can be tedious and time-consuming, but using coupons has never been easier thanks to Groupon.

Groupon is a coupon recommendation service that broadcasts electronic coupons for restaurants and stores in your neighborhood. Some of these discounts can be very significant, especially when planning group activities, because they can reach as high as 60%.

You can sign up for Groupon for free, and each day Groupon will email you the deals of the day in your metro area. If you like a deal, you can purchase it immediately from Groupon and redeem it at the restaurant or store for its value.

Data

The data was scraped from the New York City region of the Groupon website. The site is laid out as an album-style listing of all the different groupons, with an in-depth page for each particular groupon. The layout is shown in the following:

The layouts of both pages were static, so a custom scrapy spider was built to quickly crawl all the pages and retrieve the information to be analyzed. However, the comments, the most important pieces of information, were rendered and loaded via JavaScript, so a Selenium script was used. The Selenium script took the URLs of the groupons acquired by scrapy and essentially mimicked a human clicking the “next” button in the user comments section.

Around 400 individual groupons were scraped. The data retrieved from each groupon is shown below.

Groupon Title
Merchant
Categories
Mini Info
Deal Features
Location
Total Number of Ratings
URL

Around 89,000 individual user reviews were scraped. The data retrieved from each review is shown below.

Author
Date
Review
URL

Exploratory Data Analysis

An interesting finding was that the use of groupons has been increasing tremendously over the past few years. We found this by examining the dates supplied by the reviews. Looking at the following image, where the x-axis represents month/year and the y-axis represents count, this conclusion becomes obvious. The slight decline at the end is likely because some of the groupons at the time were seasonal in nature.
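The monthly counts behind this chart amount to bucketing review dates by year-month. A minimal sketch using only the standard library (the date format is an assumption):

```python
from collections import Counter
from datetime import datetime

def reviews_per_month(dates):
    """Count reviews per "YYYY-MM" bucket, given "YYYY-MM-DD" date strings."""
    return Counter(
        datetime.strptime(d, "%Y-%m-%d").strftime("%Y-%m") for d in dates
    )
```

Plotting these counts in date order yields the time-series bar chart shown.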


Next, since much of the data is text that follows the common pattern Price (Original Price), a regular expression was derived to parse out the price information and the number of deals offered. That information is presented in the following bar chart:
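A regular expression for the Price (Original Price) pattern might look like the following. This is a sketch of the idea, not the original expression; it assumes prices are written as "$30 ($60)".

```python
import re

# Matches e.g. "$49 ($120)" or "$19.99 ($39.99)": deal price, then the
# original price in parentheses.
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)\s*\(\s*\$(\d+(?:\.\d{2})?)\s*\)")

def parse_prices(text):
    """Return (deal_price, original_price) as floats, or None if absent."""
    m = PRICE_RE.search(text)
    if not m:
        return None
    return float(m.group(1)), float(m.group(2))
```

Applying this over all deal descriptions gives the per-groupon price data summarized in the bar chart.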

 

Lastly, utilizing the user review data, a word cloud was generated:

Topic Modelling

In order to conduct the topic modelling, the two most important packages used are gensim and spacy. The initial step in creating a corpus was to remove all stop words such as “a”, “of”, and “the”. The next step involved lemmatizing the reviews, and finally creating the trigrams. Trigrams were chosen so that, given the dataset’s location, phrases like “New York City” would be identified as a single token.
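The actual pipeline used gensim and spacy for these steps; as a dependency-free illustration, the same three operations (stop-word removal, lemmatization, phrase merging) can be sketched with a toy stop list and lemma table standing in for the real resources:

```python
# Tiny stand-ins for spacy's stop-word list and lemmatizer.
STOP_WORDS = {"a", "an", "of", "the", "and", "to", "in", "was"}
LEMMAS = {"loved": "love", "deals": "deal"}

def preprocess(review):
    """Lowercase, strip punctuation, drop stop words, lemmatize."""
    tokens = [t.lower().strip(".,!?") for t in review.split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]

def merge_phrases(tokens, phrases):
    """Greedily merge known trigrams/bigrams into one token,
    the way gensim's Phrases turns "new york city" into "new_york_city"."""
    out, i = [], 0
    while i < len(tokens):
        for n in (3, 2):
            cand = tuple(tokens[i:i + n])
            if len(cand) == n and cand in phrases:
                out.append("_".join(cand))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```

In the real pipeline, gensim's Phrases model learns which n-grams to merge from co-occurrence statistics rather than from a fixed set.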

The model chosen was Latent Dirichlet Allocation (LDA), for its superior ability to distinguish topics across documents and because a package exists to visualize its results clearly and efficiently. Since the method is unsupervised, the number of topics has to be chosen beforehand; after some experimentation, the optimal number was 3, with 25 consecutive passes of the model. The results are as follows:


The visualization above is a projection of the topics onto two components, where similar topics appear closer together and dissimilar topics further apart. The words on the right are the words composing each topic, and the lambda parameter controls their exclusivity: a lambda of 0 shows the words most exclusive to each topic, while a lambda of 1 shows the words most frequent in each topic.

The first topic represents actionable words, which I believe relate to the quality and reception of the groupon. The second topic has words that describe exercising and physical activities. Lastly, the third topic has words that belong to the food category.

Conclusion

Topic modelling is a form of unsupervised learning, and the scope of this project was to briefly examine the functionality and efficacy of finding patterns in the underlying words. Though we believe our reviews of certain products and services to be unique, the model makes it evident that, in fact, certain words are used for certain things across the whole population.

This project is not completely exhaustive, as there are many more NLP techniques that could be applied, and the scope of the reviews could be extended to cover the country as a whole. Other techniques I plan to use are sentiment analysis and a word2vec model to conduct “word” algebra, helping future website viewers find terms similar to what they want.
