Data Scraping Groupon!

Posted on Jan 13, 2018
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

Everyone loves saving money. We all try to make the most of it, and sometimes the simplest things make the biggest difference. Coupons have long been thought of as pieces of paper clipped and taken to the supermarket for a discount. Anyone who has done this knows that clipping coupons can be tedious and time-consuming, but thanks to Groupon, using coupons has never been easier.

Groupon is a coupon recommendation service that broadcasts electronic coupons for restaurants and stores in your neighborhood. Some of these coupons offer significant savings, especially when planning group activities, because the discounts can reach as high as 60%.

You can sign up for Groupon for free, and each day Groupon will email you the deals of the day in your metro area. If you like a deal, you can purchase it immediately from Groupon and redeem it at the restaurant or store for its value.

Data

The data was scraped from the New York City region of the Groupon website. The site is laid out as an album-style listing of all the different groupons, with an in-depth page for each particular groupon. The appearance of the two page types is shown in the following:

[Figure: the Groupon listing page and an individual groupon page]

Neither page layout was dynamic, so a custom scrapy spider was built to quickly crawl all the pages and retrieve the information to be analyzed. However, the comments, the most important pieces of information, were rendered and loaded via JavaScript, so a Selenium script was used for them. The Selenium script took the URLs of the groupons acquired by scrapy and essentially mimicked a human clicking the “next” button in the user comments section.
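The click-through loop described above might look roughly like the sketch below. It is not the original script: the function name, the CSS selectors, and the pause length are all assumptions made for illustration, and the real Groupon markup would need to be inspected to fill them in. The only Selenium calls it relies on are `get`, `find_elements`, `is_enabled`, `click`, and the `text` attribute.

```python
import time

# "css selector" is the string value of selenium's By.CSS_SELECTOR,
# so this sketch works with any object that behaves like a WebDriver.
CSS = "css selector"


def scrape_reviews(driver, url, review_css, next_css, max_pages=100, pause=1.0):
    """Collect review text from a paginated comment section.

    review_css and next_css are hypothetical selectors supplied by the caller.
    """
    driver.get(url)
    reviews = []
    for _ in range(max_pages):
        # Grab the text of every review currently rendered on the page.
        reviews.extend(el.text for el in driver.find_elements(CSS, review_css))

        # find_elements returns an empty list when nothing matches,
        # which here means there is no further page of comments.
        next_buttons = driver.find_elements(CSS, next_css)
        if not next_buttons or not next_buttons[0].is_enabled():
            break

        next_buttons[0].click()
        time.sleep(pause)  # crude wait for the JavaScript to load new comments
    return reviews
```

With a real browser, `driver` would be something like `selenium.webdriver.Chrome()`, and the function would be called once per groupon URL produced by the scrapy spider. A fixed sleep is the simplest possible wait; an explicit wait on the new comments appearing would be more robust.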

Findings

Roughly 400 individual groupons were scraped. The data retrieved from each groupon is shown below.

Groupon Title | Merchant
Categories | Mini Info
Deal Features | Location
Total Number of Ratings | URL

Roughly 89,000 individual user reviews were scraped. The data retrieved from each review is shown below.

Author | Date
Review | URL

Exploratory Data Analysis

An interesting finding was that the use of groupons has been increasing tremendously over the past few years. We found this by examining the dates supplied with the reviews. In the following image, the x-axis represents month/year and the y-axis represents the review count, and the upward trend is clear. The slight decline at the end is likely because some of the groupons available at the time were seasonal in nature.

[Figure: number of reviews per month]
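A count like the one plotted can be produced by bucketing review dates by month. The snippet below is a minimal sketch using only the standard library; the date format and the sample dates are assumptions, since the post does not show how Groupon formats review dates.

```python
from collections import Counter
from datetime import datetime


def reviews_per_month(date_strings, fmt="%Y-%m-%d"):
    """Count reviews per (year, month); fmt is an assumed date format."""
    counts = Counter()
    for s in date_strings:
        d = datetime.strptime(s, fmt)
        counts[(d.year, d.month)] += 1
    # Sort chronologically so the result can be plotted directly.
    return sorted(counts.items())


# Made-up dates, purely for illustration.
sample_dates = ["2016-11-03", "2017-01-15", "2017-01-28", "2017-02-02"]
monthly = reviews_per_month(sample_dates)
```

The sorted list of `((year, month), count)` pairs can be handed to any plotting library to draw the month-by-month curve.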


Price

Since much of the price data is stored as text in the common form Price (Original Price), a regular expression was derived to parse out the price information and the number of deals each groupon offers. That information was then plotted as a bar chart.
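As an illustration of that parsing step, here is a sketch of one possible regular expression. The exact text format on Groupon is not shown in this post, so the pattern assumes strings such as "$15 ($40)", with a dollar sign before each amount; the function names are hypothetical.

```python
import re

# Assumed format: "$<price> ($<original price>)", e.g. "$15 ($40)".
PRICE_RE = re.compile(r"\$([\d,]+(?:\.\d+)?)\s*\(\$([\d,]+(?:\.\d+)?)\)")


def parse_deals(text):
    """Return a list of (price, original_price) pairs found in the text."""
    return [
        (float(price.replace(",", "")), float(original.replace(",", "")))
        for price, original in PRICE_RE.findall(text)
    ]


def discount_pct(price, original):
    """Percentage saved relative to the original price."""
    return round(100 * (1 - price / original), 1)


deals = parse_deals("$15 ($40) for two, $25 ($70) for four")
num_deals = len(deals)
```

The number of matches gives the number of deals a groupon offers, and each pair can be turned into a discount percentage for plotting.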

 

Lastly, a word cloud was generated from the user review data.

Topic Modelling

The two most important packages used for the topic modelling were gensim and spacy. The initial step in creating a corpus was to remove all stop words, such as “a”, “of”, and “the”. The next step involved lemmatizing the “new” reviews, and finally creating the trigrams. Trigrams were chosen so that, given the dataset's location, a phrase like New York City would be identified as one unique token.
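To make the pipeline concrete, here is a simplified, pure-Python stand-in for the stop-word and trigram steps. It is not the gensim/spacy code used in the project: the stop-word list is a tiny made-up one, lemmatization is skipped, and phrases are merged with a plain count threshold rather than gensim's scoring. It does follow the same idea of building trigrams by applying a pair-merging pass twice.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not spacy's list).
STOP_WORDS = {"a", "of", "the", "and", "in", "to", "is", "was"}


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def merge_frequent_pairs(docs, min_count=2):
    """Join adjacent token pairs that occur at least min_count times overall."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the token that was just merged
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Made-up reviews, purely for illustration.
reviews = [
    "the best pizza in new york city",
    "a great spa in new york city",
    "new york city is the place to be",
]
docs = [remove_stop_words(r.split()) for r in reviews]
bigram_docs = merge_frequent_pairs(docs)           # new york -> new_york
trigram_docs = merge_frequent_pairs(bigram_docs)   # new_york city -> new_york_city
```

After the second pass, "new york city" appears in every document as the single token `new_york_city`, which is the effect the trigram step is meant to achieve.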

The model chosen was Latent Dirichlet Allocation, both for its ability to distinguish topics across different documents and because a package exists to visualize the results clearly and efficiently. Since the method is unsupervised, the number of topics has to be chosen beforehand; after some experimentation, the best results came from 3 topics at 25 consecutive iterations of the model. The results are as follows:


The visualization above is a projection of the topics onto two components, where topics that are similar to each other will appear closer, while those that are dissimilar will be further away. The words on the right are words composing each topic, and the lambda parameter controls the exclusivity of the words. A lambda of 0 indicates the most exclusive words around each topic while a lambda of 1 indicates the most frequent words around each topic.
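The behaviour of lambda described here matches the relevance score used by LDAvis-style tools. Assuming the visualization package follows that definition (the post does not name the package), the ranking of word w within topic t is:

```latex
\[
r(w, t \mid \lambda) \;=\; \lambda \,\log p(w \mid t)
\;+\; (1 - \lambda)\,\log \frac{p(w \mid t)}{p(w)}
\]
```

Here p(w | t) is the probability of the word within the topic and p(w) is its overall probability in the corpus. At lambda = 1 only the first term remains, so words are ranked by how frequent they are in the topic; at lambda = 0 only the ratio remains, so words are ranked by how exclusive they are to the topic.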

The first topic represents actionable words, which I believe relate to the quality and reception of the groupon. The second topic has words that describe exercise and physical activities. Lastly, the third topic has words that belong to the food category.

Conclusion

Topic modelling is a form of unsupervised learning, and the scope of this project was to briefly examine its functionality and efficacy at finding patterns behind the underlying words. Though we believe our reviews of certain products and services to be unique, the model makes it evident that certain words are in fact used for certain things across the whole population.

This project is not exhaustive: there are many more NLP techniques that could be applied, and the scope of the reviews could be extended to cover the country as a whole. Other techniques that I plan to use are sentiment analysis and a word2vec model to conduct “word” algebra, helping future website viewers find terms similar to what they want.

Leave a Comment

Garrett Larson August 2, 2018
Do you have code you would be willing to share?
