Web Scraping to Visualize Trends in Deals Using Data

Posted on Aug 22, 2016

Motivation:

In a world full of deals and coupons, have you ever wondered which deals are actually good deals?

Anyone familiar with consumer psychology can tell you that people love deals. Those huge, red signs saying "30% off" or "Buy 1 Get 1 Free" are very attractive to consumers. So much so that many companies are having sale items all year round.

This raises the question about the quality of these deals. Do these deals exist because the items have poorer quality (e.g. a jacket with a scratch on the back)? Do they exist because the functionality is obsolete (e.g floppy disks)? Do they exist because the inventory is low or the item is out of season? Whatever the reason may be, there is always a reason. The interesting question is, is this deal a good deal and will it save me money.

 

About dealmoon.com:

Dealmoon.com is very similar to groupon.com, where it gathers information of deals and coupons from merchants in the U.S., and groups them into different categories (e.g. Clothing, Electronics, Baby, etc.). All information are available on their website for free.

Web Scraping:

I used the Selenium package in Python to scrape all data.

Some logistics about the data I scraped:

    • Total of ~45,000 deals from 8 categories (i.e. Clothing, Beauty, Nutrition, Baby, Home, Electronics, Travel, Finance )
    • Total of 6 attributes (i.e. category of deal, deal title, deal description, posted time, number of comments, number of bookmarks)
    • The entire crawling process took ~6hrs.

 

Visualisations:

  • What are the popular deals?

For me, when I try to find good deals I always check the popular deals––deals with a lot of bookmarks and comments. My rationale is that if a deal has high popularity, it must be good; the chance of a group of people bookmarking a bad deal is low. Under this assumption, I first explored the popular deals.

 

Screen Shot 2016-08-21 at 10.51.49 PM

 

To take into consideration that maybe not everyone defines popularity the same way, the App allows the users to define "popularity" by whichever metric they like: the number of bookmarks, the number of comments, or both.

  • What are the popular stores?

    Screen Shot 2016-08-21 at 10.50.34 PM

 

  • Which stores always have good deals?

By now, you know enough about the functionality of this app to explore this topic on your own, my dear reader. Find the link to the app at the end of this post, and find out which stores always have good deals. You may be surprised!

 

  • When are there most deals?

Screen Shot 2016-08-21 at 10.49.00 PM

 

Screen Shot 2016-08-21 at 10.49.34 PM

 

Future Directions:

All the above-mentioned visualisations will help us understand which deals are good deals and which stores always have good deals. However, one drawback of this analysis is that they are all post-hoc analyses––they will only inform users which deals they should take advantage of AFTER other users have used the deal. By then, it may be too late: the deal is no longer valid or the item has been sold out. Therefore, in order to fully take advantage of past deals, one approach is to use Natural Language Processing to extract patterns of previous good deals to help classify new deals to be good or bad in real time.

The patterns may be the type of deal (e.g. 'Buy 1 get 1 free', 'Free shipping for orders over $100'), the duration of the deal (e.g. 'Today only', 'Valid for this weekend'), deducted percentage (e.g. '$50, originally $100', ' $100, originally $250').

 

 

To check out the App, click here.

All code used to generate the App can be found on my Github.

About Author

Jielei (Emma) Zhu

Emma (Jielei) Zhu graduated from New York University in May 2016 with a B.A. in Computer Science and Psychology and a minor in Mathematics. In school, Emma was able to explore her interests to the fullest by taking...
View all posts by Jielei (Emma) Zhu >

Related Articles

Leave a Comment

No comments found.

View Posts by Categories


Our Recent Popular Posts


View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI