Data Science Recommendation System for Pet Food Product

Posted on Nov 7, 2017

The skills the author demonstrated here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Market Overview

Sales of pet food in the US increased by 40 percent in the first quarter of 2017 compared with a year earlier, and the number of food options available has also grown. As a pet owner, I always struggle to choose a “better” food for my fluffy friend, who refuses to touch some of the foods I’ve bought. I decided to work on this web scraping project to gain insights into pet food using data collected from previous years. The findings should benefit not only pet owners but also new vendors trying to break into this market.

Pet food made up the largest part of the $15.92 billion pet market in the United States in 2016. The market is highly concentrated: just five vendors (Nestlé, Mars, Big Heart, Colgate, and Blue Buffalo) account for about 70% of retail sales. All vendors have to appeal to what pet owners seek. According to survey results, owners' main criteria are “high quality” and “real meat” products.


What is considered a “high quality” pet food product? The assumption for this research is that the quality of a pet food product depends on its ingredients. The guaranteed analysis is the first place to start when analyzing pet food ingredients: it tells you what percentage of the food comes from protein, fat, fiber, and moisture.

About 4,500 products are available in the market. Product information for both cat and dog food was scraped from leading US online retailers. Below is a sample of the scraped data.


The data was preprocessed with Python pandas, and each ingredient was separated into its own feature through a multi-step data-processing pipeline.
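As a rough illustration of this kind of preprocessing, the sketch below turns a guaranteed-analysis string into numeric columns and a comma-separated ingredient list into one indicator column per ingredient. The column names, sample rows, and regular expression are assumptions made for illustration; the actual schema of the scraped data is not shown in this post.

```python
import re

import pandas as pd

# Hypothetical sample rows; the real scraped schema is not shown in the post.
raw = pd.DataFrame({
    "product": ["A", "B"],
    "guaranteed_analysis": [
        "Crude Protein 30.0% min, Crude Fat 12.0% min, Moisture 10.0% max",
        "Crude Protein 9.0% min, Crude Fat 5.0% min, Moisture 78.0% max",
    ],
    "ingredients": ["Chicken, Rice, Fish Oil", "Chicken, Water, Niacin"],
})


def parse_analysis(text):
    """Extract (name, percent) pairs such as 'Crude Protein 30.0%'."""
    pairs = re.findall(r"([A-Za-z ]+?)\s+(\d+(?:\.\d+)?)%", text)
    return {
        name.strip().lower().replace(" ", "_"): float(value)
        for name, value in pairs
    }


# One numeric column per guaranteed-analysis entry.
analysis = pd.DataFrame(
    raw["guaranteed_analysis"].map(parse_analysis).tolist(), index=raw.index
)

# One 0/1 indicator column per ingredient, after normalizing case and spacing.
normalized = raw["ingredients"].str.lower().str.replace(r"\s*,\s*", ",", regex=True)
flags = normalized.str.get_dummies(sep=",")

features = pd.concat([raw[["product"]], analysis, flags], axis=1)
print(features)
```

Indicator columns are only one possible encoding. If ingredient proportions are available, they could be stored as numeric columns instead.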



Based on this hypothesis, the research includes the following three steps:


  1. Compare the basic data in the “Guaranteed Analysis” for nutritional information
  2. Identify products that customers consider “good quality” and “bad quality”
  3. Observe the differences between the quality groups

Methods used in experiment:

  • Scrapy
  • K-means clustering
  • ANOVA test
  • Scattertext


Clustering of Products

About 4,500 products were preprocessed. The products were grouped into five segments using K-means clustering, an unsupervised learning method, based on the proportions of approximately 70 different ingredients.


The data was PCA-reduced to two dimensions. The chart displays the five product clusters, based on the distance to the mean of each ingredient attribute.
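The clustering and projection described above can be sketched with scikit-learn as follows. This is a minimal sketch on synthetic data, not the original analysis: the random matrix stands in for the product-by-ingredient table, and standardizing the features and clustering before (rather than after) the PCA projection are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 200 products x 10 ingredient proportions.
X = rng.random((200, 10))

# Put every ingredient on a comparable scale before measuring distances.
X_scaled = StandardScaler().fit_transform(X)

# Assign each product to one of five clusters.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Project products and cluster centers onto two principal components
# so that the clusters can be drawn on a 2-D chart.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
centers_2d = pca.transform(kmeans.cluster_centers_)

print(coords[:3], labels[:3])
```

With `coords` and `labels` in hand, a scatter plot colored by label gives a chart of the kind described above.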


The Data Quality

The result rests on the assumption that a product's number of reviews and star rating reflect its quality as perceived by customers. To test whether the product segments differ in quality, an ANOVA test was conducted. Clusters 3 and 4 were observed to have higher ratings and more customer reviews than Clusters 1 and 2. This suggests that perceived product quality does depend on the ingredients.
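A one-way ANOVA of this kind can be run with SciPy's `f_oneway`. The sketch below uses synthetic star ratings per cluster; the means, spreads, and sample sizes are assumptions for illustration, not the scraped data.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Synthetic star ratings per cluster (assumption; not the scraped data).
ratings_by_cluster = {
    1: rng.normal(3.8, 0.4, 100),
    2: rng.normal(3.9, 0.4, 100),
    3: rng.normal(4.5, 0.4, 100),
    4: rng.normal(4.6, 0.4, 100),
}

# Null hypothesis: all clusters share the same mean rating.
f_stat, p_value = f_oneway(*ratings_by_cluster.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

A small p-value only says that at least one cluster mean differs from the others. Identifying which clusters differ would need a follow-up comparison, such as a post-hoc pairwise test.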


Chart 1 - Star ratings across the 5 product clusters


Chart 2 - Number of customer reviews across the 5 product clusters


Chart 3 - Cluster 0 represents outliers with missing reviews.

In the following analysis, Clusters 3 and 4 are treated as ‘good’ quality products (more reviews and higher ratings), while Clusters 1 and 2 are treated as ‘bad’ quality products (fewer reviews and lower ratings). Below is a scatter plot of the words used for product ingredients.

Chart 4 - Word frequency comparison between ‘good’ and ‘bad’ quality products

Certain ingredients appear frequently only in ‘good’ products:

  • Lutein (fish oil)
  • FOS (sweetener)
  • Mannan-oligosaccharides or MOS (fiber)


Certain ingredients appear frequently only in ‘bad’ products:

  • Gastrointestinal - prevent stomach flu
  • Niacin - prevent gas
  • Potassium - prevent digestive system disorder
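The idea behind the two lists above, finding terms that are common in one group and rare in the other, can be sketched with the standard library alone. This is a simplified stand-in for the scoring a scatter-text plot uses, and the ingredient lists are hypothetical examples.

```python
from collections import Counter

# Hypothetical ingredient lists per product (assumption, for illustration only).
good_products = [["chicken", "lutein", "fos"], ["salmon", "lutein", "mos"]]
bad_products = [["chicken", "niacin", "water"], ["beef", "niacin", "potassium"]]


def document_frequency(products):
    """Fraction of products in which each ingredient appears at least once."""
    counts = Counter(term for items in products for term in set(items))
    return {term: n / len(products) for term, n in counts.items()}


good_freq = document_frequency(good_products)
bad_freq = document_frequency(bad_products)

# Positive score: more common in 'good' products; negative: more common in 'bad'.
terms = set(good_freq) | set(bad_freq)
diff = {t: good_freq.get(t, 0.0) - bad_freq.get(t, 0.0) for t in terms}
ranked = sorted(diff.items(), key=lambda kv: kv[1], reverse=True)

for term, score in ranked:
    print(f"{term:10s} {score:+.2f}")
```

Terms near the top of `ranked` are characteristic of ‘good’ products and terms near the bottom of ‘bad’ products. Terms such as a shared protein source score close to zero.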


The proportions of major ingredients were also compared between ‘good’ and ‘bad’ products. Good products have a higher proportion of glucosamine and chondroitin (supplements) than bad products, while bad products have a higher proportion of moisture. A higher moisture share squeezes the share of other nutrients such as protein, which may explain why wet food is less popular than dry food.
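A comparison of this kind reduces to a grouped average in pandas. The sketch below uses hypothetical guaranteed-analysis percentages. The dry-matter column is an addition that is not part of the original analysis, included to show how moisture dilutes the as-fed share of other nutrients.

```python
import pandas as pd

# Hypothetical guaranteed-analysis values in percent (assumption).
df = pd.DataFrame({
    "quality": ["good", "good", "bad", "bad"],
    "protein": [30.0, 34.0, 6.0, 7.0],
    "moisture": [10.0, 12.0, 78.0, 80.0],
})

# Average as-fed share of each nutrient per quality group.
mean_by_quality = df.groupby("quality")[["protein", "moisture"]].mean()
print(mean_by_quality)

# Protein as a share of everything except water (dry-matter basis).
df["protein_dry_matter"] = df["protein"] / (100.0 - df["moisture"]) * 100.0
print(df)
```

Comparing the as-fed and dry-matter columns separates the effect of water content from the effect of the recipe itself.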


Chart 5 - Ingredient Distribution

Data Science Conclusions

Functionality and nutrition level are the main contributors to the quality differences. Joint and vision support, higher levels of protein and fiber, and better taste are associated with good customer feedback, while digestion support from ingredients other than fiber and higher levels of moisture and ash are associated with bad customer feedback.

Ingredients differentiate product quality as perceived by customers. The findings on ‘good’ and ‘bad’ ingredients should help manufacturers produce ‘good quality’ pet food and better adapt to the fast-growing pet food market.

Next Step

Due to the limited time available for the project, the research had a limited scope. For further study, it would be helpful to examine other differences to better understand the pet food market, such as price, key benefits, customer comments, and other available product information.



About Author

Summer Sun

Summer is passionate about data science. She has 3 years’ experience analyzing large scale client data for major financial institutions. She loves contact from any challenging project.