Web Scraping Product Details from Sunglass Hut and Woot!

Posted on Oct 9, 2020

Sunglasses product details were scraped from the Sunglass Hut and Woot! websites to support an exploratory data analysis (EDA) and to compare the deals on Woot! with the retail prices on Sunglass Hut. The word cloud above was produced from the descriptions of the sunglasses on Sunglass Hut. The code used to scrape and analyze this data may be found on GitHub.

Web scraping

The Sunglass Hut website uses Ajax to load more sunglasses on each page when you click a button at the bottom of the screen. For this reason, Selenium had to be used to interact with this dynamic website. The main brand page was visited, and the "load more" button was clicked programmatically until all sunglasses were visible on the page. Then the URL of each of those pairs was scraped and saved in a CSV file. This list of URLs then served as the start URLs of a Scrapy spider that visited each one and collected the URLs of all the different colors of the same pair. This was necessary because a pair that comes in multiple colors can have different product details for each color. For example, some colors may have polarized lenses, while others may not. Finally, each individual pair's URL was visited and the product details were scraped.
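The "click until everything is loaded" step described above can be sketched roughly as follows. This is a minimal illustration rather than the code from the project: the function names, the CSS selectors passed in, the pause length, and the click limit are all assumptions. The functions only rely on the `find_elements`, `is_displayed`, `click`, and `get_attribute` methods that a Selenium WebDriver and its elements provide.

```python
import time

# Selenium's By.CSS_SELECTOR constant is the string "css selector"; it is
# written out here so the functions can also be tried with a stand-in driver.
CSS_SELECTOR = "css selector"


def click_until_all_loaded(driver, button_css, pause=1.0, max_clicks=500):
    """Click the "load more" button until it is gone; return the click count."""
    clicks = 0
    while clicks < max_clicks:
        # find_elements returns an empty list (no exception) when nothing matches.
        buttons = [
            button
            for button in driver.find_elements(CSS_SELECTOR, button_css)
            if button.is_displayed()
        ]
        if not buttons:
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # give the Ajax request time to add more products
    return clicks


def collect_links(driver, link_css):
    """Return the de-duplicated href of every matching link, in page order."""
    seen = {}
    for element in driver.find_elements(CSS_SELECTOR, link_css):
        href = element.get_attribute("href")
        if href:
            seen.setdefault(href, None)
    return list(seen)
```

With a real browser, `driver` would be a Selenium WebDriver that has already opened the brand page, and the two selectors would have to be read off the live page. The `max_clicks` limit is only a safety net against a button that never disappears.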

For each pair, the following details were scraped: the brand, description, name, price, whether it is on sale and by how much, whether the lenses are polarized, frame color, frame material, lens color, lens material, lens technology, shape, URL of the product page, and face shape for best look.
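One way to hold these fields is a small record type that can be written straight to CSV. The sketch below is an assumption about how such a record might look; the class name, field names, types, and the `parse_price` helper are illustrative and not taken from the project code.

```python
import csv
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class SunglassesRecord:
    """One scraped pair of sunglasses (field names are illustrative)."""
    brand: str
    name: str
    description: str
    price: float
    on_sale: bool
    discount: Optional[float]  # amount off when on sale, otherwise None
    polarized: bool
    frame_color: str
    frame_material: str
    lens_color: str
    lens_material: str
    lens_technology: str
    shape: str
    face_shape: str
    url: str


def parse_price(text):
    """Turn a displayed price such as "$1,250.00" into a float."""
    return float(text.replace("$", "").replace(",", "").strip())


def write_records(records, handle):
    """Write records to an open text file handle as CSV with a header row."""
    names = [field.name for field in fields(SunglassesRecord)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
```

Keeping the price as a number and the polarization flag as a boolean at scrape time makes the later grouping and comparison steps simpler.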

What are the most expensive brands?

We break down the median price per pair for each brand.
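A breakdown like this can be computed in a few lines. The sketch below uses only the Python standard library and assumes each scraped row is a dictionary with `brand` and `price` keys; those key names and the function name are assumptions, and the project itself may have used a different tool for this step.

```python
from collections import defaultdict
from statistics import median


def median_price_by_brand(rows):
    """Return (brand, median price) pairs, most expensive brand first."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["brand"]].append(float(row["price"]))
    medians = {brand: median(values) for brand, values in prices.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)
```

The median is used rather than the mean so that a handful of extremely expensive pairs does not dominate a brand's figure.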

How do the price distributions of the most expensive two brands compare?

We see that, although Fendi has a higher median price, Bulgari has a few pairs that are extremely expensive. In fact, the most expensive pair on the entire Sunglass Hut website is from Bulgari.

It is also worth noting that Bulgari has many more models available than Fendi, as the next graph demonstrates.

Let's choose a few brands

For the purpose of an exploratory data analysis, let's pick a few brands to analyze. The following graph shows the number of pairs available on Sunglass Hut from each brand. We see that Ray-Ban has by far the most pairs, followed by Oakley, Vogue and Prada. We will also include Gucci because it is a popular brand, and we will count the Prada Linea Rossa sunglasses together with the Prada sunglasses.

We see that, among these brands, Gucci seems to be the most expensive overall, followed by Prada. Ray-Ban and Oakley seem similarly priced, while Vogue is the cheapest among these brands.

Price by brand for the brands we selected.

Lens polarization

How does lens polarization affect the price of the sunglasses? One would assume this feature would result in an increased price. Do the numbers bear this out? From the graph below, we see that for most brands, polarized sunglasses tend to be more expensive than non-polarized sunglasses. The notable exception seems to be Gucci, where the median price of polarized sunglasses is lower than that of non-polarized sunglasses. This is because many of the most high-end sunglasses are not polarized. We see that the price distribution of Gucci's polarized sunglasses is much more strongly peaked near its median, while the non-polarized prices are more spread out. In other words, Gucci has some relatively cheap non-polarized sunglasses and also some very expensive non-polarized sunglasses.

Below is the distribution of prices across all brands for lenses that are polarized (red, right) and lenses that are not polarized (blue, left). The difference between these distributions was found to be highly significant (p-value less than 1e-14) based on the Kolmogorov–Smirnov test.
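For readers unfamiliar with the test, the two-sample Kolmogorov–Smirnov statistic is the largest vertical gap between the empirical cumulative distribution functions of the two samples. The sketch below computes only that statistic with the standard library; the function name is an assumption, and it does not compute the p-value. A statistics library such as SciPy (`scipy.stats.ks_2samp`) returns both the statistic and the p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at data points,
    # so it is enough to compare them at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

Here `sample_a` and `sample_b` would be the lists of prices for polarized and non-polarized pairs. A statistic near 0 means the two price distributions look alike, while a value near 1 means they barely overlap.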

Further EDA

Further EDA can be done on this data set. For example, the following graphs give price by frame color as well as price by face shape for best look.

Woot!

A Scrapy spider was also written to scrape the product details of the sunglasses on Woot!. This data set was then joined with the Sunglass Hut data set wherever possible matches could be found. Many pairs contain a model label made up of letters and digits in the name of the pair, which can be matched across the two sites. Further exploration can then be done to determine whether the deals on Woot! are as good as they seem.
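The matching idea can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the regular expression is a guess at what a letters-plus-digits model label looks like, and the `name` and `price` keys and both function names are hypothetical.

```python
import re

# Assumed label shape: 1-3 letters, then 3-5 digits, then up to 2 letters.
MODEL_CODE = re.compile(r"\b[A-Z]{1,3}\d{3,5}[A-Z]{0,2}\b")


def model_code(name):
    """Return the first model-like label in a product name, or None."""
    match = MODEL_CODE.search(name.upper())
    return match.group(0) if match else None


def match_deals(woot_rows, retail_rows):
    """Pair Woot! rows with retail rows that share a model label."""
    retail_by_code = {}
    for row in retail_rows:
        code = model_code(row["name"])
        if code:
            retail_by_code.setdefault(code, row)  # keep the first row per label

    matches = []
    for row in woot_rows:
        code = model_code(row["name"])
        retail = retail_by_code.get(code)
        if retail is None:
            continue  # no label in the name, or no retail row with that label
        woot_price = float(row["price"])
        retail_price = float(retail["price"])
        matches.append({
            "code": code,
            "woot_price": woot_price,
            "retail_price": retail_price,
            "saving": 1 - woot_price / retail_price,  # fraction off retail
        })
    return matches
```

Matching on the label alone ignores color, so a match found this way is only a candidate and still needs to be checked by hand.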

The first pair is an easy match and a good deal.

Items 0 through 6 in the above table all correspond to Ray-Ban Wayfarer sunglasses in different colors. The exact color from Woot! could not be found on Sunglass Hut. See the images below:

Best use of this data

This data is rich enough to explore several features to find sunglasses that you like or to match pairs with Woot! to find deals. This kind of brand analysis may be useful if you are opening a shop or thinking of manufacturing sunglasses. There is also a market for reselling sunglasses on sites like Poshmark and TheRealReal. This data could be used to find deals on Woot! that may be resold on Poshmark or TheRealReal, although a further analysis of sales on those websites would be necessary.

About Author


Patrick Starvaggi

Data Scientist and Applied Mathematician interested in Machine Learning, Data Mining, Stochastic Analysis, Modeling and Simulation, PDEs, SDEs, and more. Twelve years of university teaching and research in diverse and interdisciplinary fields.