NYC Data Science Academy| Blog

Sephora Product Success: Capstone and Final Project

Heather Kleypas
Posted on Apr 17, 2019

Project GitHub | LinkedIn:   Niki   Moritz   Hao-Wei   Matthew   Oren

The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Much as Home Depot is to tools or Staples is to office supplies, Sephora is one of the largest beauty retailers in the United States. It carries a wide range of makeup, skincare, haircare, fragrance, and much more. For my final project, I chose to conduct a visual analysis and several machine learning experiments on data obtained from Sephora.com. In total, I collected information on over 7,400 individual products. My overall goal was to predict a product's star rating and recommendation rate.

Below is a screenshot of an example product page. The main variables of interest are in yellow:

Scraping and Variable Explanation

 

To gather all the variables (product name, brand, loves, category, reviews, star rating, price, etc.), I scraped Sephora.com using Selenium and Python. I reused some scraping code from Gurminder Kaur, a former NYC Data Science Academy student who had previously scraped Sephora. To reach every product, I had to navigate through two pages: the list of brands, and each brand's product page.
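
The two-level traversal can be sketched without a browser. This is only an illustration of the loop structure, not the scraper used for the project: `list_brands` and `list_products` are hypothetical callables that, in a real run, would wrap the Selenium page loads, and the stub versions below return made-up data. The cap of 60 products per brand mirrors the note in the visual analysis.

```python
def crawl(list_brands, list_products, max_per_brand=60):
    """Visit every brand, then collect up to max_per_brand products for each."""
    rows = []
    for brand in list_brands():
        for product in list_products(brand)[:max_per_brand]:
            rows.append({"brand": brand, **product})
    return rows


# Stub fetchers standing in for real page loads (hypothetical, made-up data).
def fake_list_brands():
    return ["Brand A", "Brand B"]


def fake_list_products(brand):
    count = 70 if brand == "Brand A" else 3
    return [{"name": f"{brand} item {i}", "price": 10.0 + i} for i in range(count)]


rows = crawl(fake_list_brands, fake_list_products)
print(len(rows))  # Brand A is capped at 60, Brand B contributes 3
```

Passing the fetchers in as arguments keeps the traversal logic testable without hitting the site.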

From the 'Details' section, I pulled the following product features:

  1. Skin Type: a callout used when a product is made for a specific skin type: Oily, Combination, Dry, or Normal.
  2. Research Results: a callout used when a product has undergone clinical research and can state results such as "88% visibly smoother skin".
  3. Dermatologist Tested: a callout used when the product has been tested by a dermatologist.
  4. Formulated Without: a callout used when a product does not contain specific ingredients that consumers find undesirable. Examples include parabens, phthalates, oil, alcohol, silicone, and nuts.
  5. Vegan: a callout used when the product is not tested on animals or contains no animal-derived ingredients.
  6. Clean at Sephora: a label used when a product excludes a range of specified ingredients.

After the 'Details' section comes the 'Ratings and Reviews' section, which lists the number of reviews for each star count and the average star rating for the majority of products. However, only about 20% of products have a listed recommendation rate. It is missing for brand-new products as well as best-selling products across all categories, for reasons unknown to me.

See below for what the recommendation rate looks like on the product page:

Visual Analysis:

As a frequent shopper on Sephora.com, I had some initial questions in mind:

Which brands are the most expensive? 

 

Note: A maximum of 60 products per brand was collected

To start off my analysis, I plotted the 30 most expensive brands by median product price. In total, 316 different brands were scraped. Most of the brands in this expensive range top out below $200, with Dyson a big step ahead at a median over $400. Most brands in this graph seem to be in the fragrance and skincare categories. However, Google was a shock. Sephora offers the Google Home; perhaps the company is venturing into new territory?
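
The ranking behind this chart comes down to a grouped median. A minimal sketch, assuming the scraped data sits in a pandas DataFrame with hypothetical `brand` and `price` columns (the rows below are made up for illustration, not scraped values):

```python
import pandas as pd

# Toy stand-in for the scraped product table; names and values are assumptions.
products = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "B", "C"],
    "price": [10.0, 30.0, 500.0, 40.0, 60.0, 25.0],
})


def top_brands_by_median_price(df, n=30):
    """Return the n brands with the highest median product price."""
    medians = df.groupby("brand")["price"].median()
    return medians.sort_values(ascending=False).head(n)


top = top_brands_by_median_price(products, n=2)
print(top)
```

Using the median rather than the mean keeps a single very expensive item (brand A's 500.0 here) from pushing a brand up the ranking.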

Which star rating is the most common? 

Note: The star rating percentage is the number of ratings at each star level, summed over all products, divided by the total number of star ratings
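
As a worked example of that percentage, here is a small sketch. The per-product review counts are made up, and the dictionary layout is an assumption about how the scraped counts might be stored:

```python
# Hypothetical per-product review counts keyed by star level (made-up numbers).
star_counts = [
    {5: 80, 4: 10, 3: 5, 2: 3, 1: 2},
    {5: 40, 4: 30, 3: 20, 2: 5, 1: 5},
]


def star_rating_share(products):
    """Percentage of all ratings that fall at each star level."""
    totals = {star: 0 for star in range(1, 6)}
    for counts in products:
        for star, n in counts.items():
            totals[star] += n
    grand_total = sum(totals.values())
    return {star: 100 * n / grand_total for star, n in totals.items()}


shares = star_rating_share(star_counts)
print(shares)
```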

A large majority of star ratings are 5 or 4 stars. I cannot imagine a product with a 1- or 2-star average rating staying around in the beauty market for long. This concentration most likely contributed to the low RMSE of my star rating machine learning model.

How are product loves (e.g., 80k) and recommendation rates correlated with features like average star rating and number of reviews?

Note: I modified the data before visualization to remove missingness and outliers. 

Looking at the pairwise correlations between the number of reviews, loves, ratings, and recommendation rates, we can come to the following conclusions (mainly about loves and recommendation rates, because reviews and ratings are pretty self-explanatory).

  • Recommendation Rate and Number of Reviews have a slight positive correlation. When the number of reviews increases, the recommendation rate increases.
  • Number of Loves and Average Rating have a positive correlation. As the average rating increases, the number of loves increases.
  • Number of Loves and Recommendation Rate have a slight negative correlation. As the number of loves increases, the recommendation rate decreases.
    • This might be because recommendation rates are high for newer products with fewer loves and reviews.
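
Pairwise correlations like these can be computed in one call. A minimal sketch with made-up numbers and hypothetical column names; it removes rows with missing values, but the outlier rule used for the charts is not specified in this post, so none is applied here:

```python
import pandas as pd

# Made-up values for illustration; column names are assumptions.
df = pd.DataFrame({
    "n_reviews": [120, 300, 50, 800, float("nan"), 45],
    "loves": [1500, 4000, 700, 9000, 2000, 600],
    "avg_rating": [4.1, 4.4, 3.9, 4.6, 4.2, 3.8],
    "rec_rate": [0.90, 0.92, 0.95, 0.93, 0.88, 0.96],
})

clean = df.dropna()   # drop any product with a missing value
corr = clean.corr()   # Pearson correlation for every pair of columns
print(corr.round(2))
```

Each cell of `corr` is the correlation between the row variable and the column variable, so the matrix is symmetric with ones on the diagonal.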

Which skin types are products most commonly advertised for?

It's a fairly even playing field here, as each skin type has products for the buyer's needs.

How do product categories compare with ingredient callouts like Vegan, Clean or Formulated-Without? 

 

I had a hunch that skincare and makeup would have the most offerings, but was shocked to see that 2 of the top categories (Men, Nails) had no clean or vegan callout in the details. After quickly glancing through the Tools & Brushes category, it was apparent that the callouts do not apply to products like a hairdryer or tweezers.

It's nice to see that all product categories (with the exception of fragrance and brushes) are more likely to have a dedicated callout for not containing certain undesirable ingredients than not. However, the industry still has a bit more work to do in order for more products to meet the Clean and Vegan label standards.

Machine Learning: 

 

For the machine learning portion of the project, I wanted to predict each product's average star rating and recommendation rate. I used a linear model and a tree-based model, both easy to use from Python: ElasticNet ships with scikit-learn, and XGBoost offers a scikit-learn-compatible interface.

For my linear model, I chose ElasticNet, which combines ridge and lasso regression: it shrinks coefficients to control model complexity and multicollinearity while performing sparse feature selection at the same time. I tuned the model by cross-validating over different values of alpha.
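
A minimal sketch of that tuning step, using scikit-learn's `ElasticNetCV` on synthetic data. The alpha grid, the fixed `l1_ratio`, the scaling step, and the data itself are all assumptions chosen for illustration, not the settings used in this project:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem where only three of ten features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
true_coef = np.array([1.5, -2.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale features so the penalty treats them equally, then let 5-fold
# cross-validation choose alpha from a log-spaced grid.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(alphas=np.logspace(-3, 1, 20), l1_ratio=0.5, cv=5),
)
model.fit(X_train, y_train)

enet = model.named_steps["elasticnetcv"]
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
print("chosen alpha:", enet.alpha_)
print("test RMSE:", rmse)
```

Here `l1_ratio` sets the mix between the lasso (L1) and ridge (L2) penalties, and `alpha` sets the overall penalty strength.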

For my tree-based model, I chose XGBoost, which stands for eXtreme Gradient Boosting. The XGBoost library implements fast, high-performance gradient-boosted decision trees. To break that down:

Decision trees generally do a better job than linear models at capturing non-linearity in the data, because they divide the feature space into smaller sub-spaces. Gradient boosting is an approach in which each new model is trained to predict the residuals (errors) of the models before it, and all the models' predictions are then added together to make the final prediction.
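
The residual-fitting idea can be sketched in a few lines. This is plain gradient boosting for squared error built from scikit-learn decision trees, not XGBoost itself (which adds regularization and other refinements); the data, tree depth, and learning rate are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit small trees one after another, each on the current residuals."""
    base = float(np.mean(y))              # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction        # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees


def predict_boosted(base, trees, X, learning_rate=0.1):
    """Final prediction: the constant plus the scaled sum of all tree outputs."""
    return base + learning_rate * sum(tree.predict(X) for tree in trees)


# Synthetic non-linear target.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees = fit_boosted_trees(X, y)
mse_mean_only = float(np.mean((y - base) ** 2))
mse_boosted = float(np.mean((y - predict_boosted(base, trees, X)) ** 2))
print(mse_mean_only, mse_boosted)
```

The learning rate shrinks each tree's contribution, which is why many small trees are needed to fit the curve.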

Before running the final model, I cross-validated a variety of hyperparameters, such as the learning rate, minimum child weight, minimum samples split, minimum samples leaf, and the number of estimators. This let me pick the configuration with the lowest cross-validated error.
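
A search like this can be sketched with scikit-learn's `GridSearchCV`. To keep the example dependency-free, it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost; the grid values and synthetic data are assumptions, not the ones used for this project:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with a non-linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

# Small illustrative grid; every combination is scored with 3-fold CV.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_
print(best_params, best_rmse)
```

With XGBoost's scikit-learn wrapper the same pattern applies, with parameter names such as `min_child_weight` in the grid instead.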

Predicting Product Average Rating: 

I found that newer products had abnormally high or low average ratings and recommendation rates. To mitigate this, I dropped any product with fewer than 150 reviews, so that each remaining score reflects a larger group of purchasers.
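
The filter itself is a one-liner in pandas. A small sketch with made-up rows and hypothetical column names:

```python
import pandas as pd

# Made-up products; "n_reviews" and "avg_rating" are assumed column names.
products = pd.DataFrame({
    "name": ["serum", "mascara", "new lipstick", "cleanser"],
    "n_reviews": [1200, 150, 12, 149],
    "avg_rating": [4.3, 4.1, 5.0, 2.5],
})

MIN_REVIEWS = 150  # products with fewer reviews than this are dropped

established = products[products["n_reviews"] >= MIN_REVIEWS].reset_index(drop=True)
print(established)
```

In this toy table the two dropped products are exactly the ones with extreme averages (5.0 and 2.5), which is the effect the threshold is meant to remove.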

XGBoost:

Unsurprisingly, the most important factor for average star rating is the number of reviews, followed by loves, price, and size. It's also notable which brands and product types had an impact on the rating. From the details section of the product page, the important variables were 'formulated_without', 'research', and 'all_skin_label'.

Elastic Net: 

For the ElasticNet graphs, the top 15 rows have the highest positive variable importance, while the bottom 10 rows have the highest negative variable importance. The only variable besides a brand or category to make an impact in the ElasticNet model is 'discounted_value', which does not mark a product on sale but usually a bundle of products at a special price. This model is very different from XGBoost in that the number of reviews, loves, and size all lack importance.

Predicting Product Recommendation Rate: 

To train the model only on products that have a recommendation rate, I had to drop almost 6,000 products, leaving me with about 1,500 products to predict on.
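
Restricting the data to products with a listed recommendation rate can be sketched as follows, again with made-up rows and a hypothetical `rec_rate` column:

```python
import pandas as pd

# Made-up products; NaN marks a product page with no recommendation rate.
products = pd.DataFrame({
    "name": ["serum", "mascara", "lipstick", "cleanser"],
    "rec_rate": [0.92, float("nan"), 0.85, float("nan")],
})

with_rec_rate = products.dropna(subset=["rec_rate"])
share_kept = len(with_rec_rate) / len(products)
print(len(with_rec_rate), share_kept)
```

Passing `subset` limits the check to the target column, so products missing only some other field are not dropped by this step.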

XGBoost:

As with the average star rating, the recommendation rate's most important variables include price, loves, and reviews, with the various star ratings also playing an important role. From the details section, the 'vegan' and 'formulated without' labels were the only variables present in the top 30 by importance. The error is 2-3%, which could make or break my purchase decision if the model predicted 87% when only 85% of reviewers actually recommended the product. Gathering additional variables from the product page would most likely reduce this error.

ElasticNet: 

fourand5StarRatio has by far the most impact on the product recommendation rate in the ElasticNet model. As with XGBoost, the error is 2-3%, which could make or break my purchase decision if the model predicted 87% when only 85% of reviewers actually recommended the product.
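
The post does not define fourand5StarRatio. One plausible reading, shown here purely as an assumption, is the share of a product's reviews that gave 4 or 5 stars:

```python
def four_and_five_star_ratio(star_counts):
    """Share of reviews with 4 or 5 stars (assumed definition, not confirmed)."""
    total = sum(star_counts.values())
    if total == 0:
        return None  # no reviews, so the ratio is undefined
    return (star_counts.get(4, 0) + star_counts.get(5, 0)) / total


# Made-up review counts keyed by star level.
ratio = four_and_five_star_ratio({5: 70, 4: 15, 3: 8, 2: 4, 1: 3})
print(ratio)
```

Under this reading, a strong link to the recommendation rate would be unsurprising, since both measure how many reviewers were satisfied.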

Results: 

 

Conclusions and further improvements:

 

In the future, I would like to collect all of the reviewers' data from an accessible API as additional predictors for my machine learning models. Unfortunately, my sale price predictions had RMSEs too high to post, but adding reviewers' data and profiles would likely help with model accuracy.

 

