Web Scraping Public Goods

Juan R. Vasquez Jr.
Posted on Feb 22, 2021

Introduction

For my web scraping project, I chose to scrape product information from Public Goods, a membership-based online home goods store with a focus on quality, sustainability, and simplicity. My main objective was to analyze the scraped product data to discover trends in price and ratings across product categories: Personal Care, Household, Grocery, Supplement & Vitamins, Pets, and CBD. In analyzing the product data, I sought to discover not only whether price points played a factor in customer ratings and reviews, but also which categories yielded the most engagement.

 

Scraped Data

Using Python to build my Selenium web scraper, I crawled through 250 product pages on the Public Goods website. I specifically targeted each product's name, description, core features, main ingredients, price, volume, rating, and number of reviews. I did the bulk of my preprocessing in Python and created a pandas DataFrame containing the main variables I wanted to analyze. After creating the DataFrame, I used Seaborn to visualize the data and create the plots showcased in this blog.
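As a rough illustration of the preprocessing step described above, here is a minimal sketch that turns raw scraped strings into typed records. The sample products, the raw string formats, and the helper names `first_number` and `clean_product` are all assumptions made for this example; they are not taken from the actual scraper or from the Public Goods site.

```python
import re

# Hypothetical raw values, shaped the way scraped text often arrives.
# The real field formats on the site are an assumption here.
raw_products = [
    {"name": " Shampoo ", "category": "Personal Care",
     "price": "$5.50", "rating": "4.6", "reviews": "(212)"},
    {"name": "Dish Soap", "category": "Household",
     "price": "$4.00", "rating": "4.8 out of 5", "reviews": "1,034 reviews"},
    {"name": "Dog Treats", "category": "Pets",
     "price": "$7.25", "rating": "", "reviews": ""},
]


def first_number(text):
    """Return the first number found in text as a float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def clean_product(raw):
    """Convert one raw scraped record (hypothetical layout) into typed fields."""
    reviews = first_number(raw["reviews"])
    return {
        "name": raw["name"].strip(),
        "category": raw["category"].strip(),
        "price": first_number(raw["price"]),
        "rating": first_number(raw["rating"]),  # None when a product is unrated
        "num_reviews": int(reviews) if reviews is not None else 0,
    }


records = [clean_product(p) for p in raw_products]
```

A list of dictionaries like `records` can be passed directly to `pandas.DataFrame(records)` to get one row per product, which is the shape Seaborn's plotting functions expect.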


Outcomes

  • In some cases, the lower the price of the product, the higher the rating. This is not true across all categories.

  • Number of reviews may be affected by price point, but there are other features to consider.

  • Products with star ratings from 3 to 4 have a lower range of review counts, whereas products with star ratings from 4 to 5 cover a broader range.

  • The highest engagement, in terms of star ratings and number of reviews, was found in the Grocery and Household product categories.
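Comparisons like the ones in the list above can be checked per category with a few pandas operations. The sketch below uses made-up rows purely for illustration (they are not the scraped results), and the column names `category`, `price`, `rating`, and `num_reviews` are assumed rather than taken from the original dataframe.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT the scraped results.
df = pd.DataFrame({
    "category": ["Grocery"] * 3 + ["Household"] * 3,
    "price": [3.0, 5.0, 8.0, 4.0, 6.0, 9.0],
    "rating": [4.8, 4.5, 4.1, 4.2, 4.4, 4.7],
    "num_reviews": [120, 80, 30, 200, 150, 90],
})

# Engagement per category: average star rating and total review count.
summary = df.groupby("category").agg(
    mean_rating=("rating", "mean"),
    total_reviews=("num_reviews", "sum"),
)

# Pearson correlation between price and rating, computed within each category.
# A negative value means cheaper products tend to be rated higher.
price_rating_corr = {
    category: group["price"].corr(group["rating"])
    for category, group in df.groupby("category")
}

print(summary)
print(price_rating_corr)
```

In this toy data the correlation is negative for one category and positive for the other, which is the kind of pattern behind the caveat that a price effect may not hold across all categories. The same dataframe could be handed to `seaborn.scatterplot(data=df, x="price", y="rating", hue="category")` to view the relationship visually.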

Next Steps 

  • Investigating whether variables such as core features and main ingredients influence star ratings and reviews.
  • Text Sentiment Analysis / NLP
  • Acquiring revenue data.
  • Integrating competitor data (e.g., Brandless, Amazon Prime).
  • Acquiring actual data on how negative reviews affect business revenue and churn.

 

About Author

Juan R. Vasquez Jr.

Juan is a recent graduate of NYC Data Science Academy where he studied dashboard creation, machine learning, and statistical analysis. His background of three years in the hospitality and commercial art industry allowed him to hone his organization...