Data Science Recommendation System for Pet Food Product
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Market Overview
Sales of pet food in the US increased by 40 percent in the first quarter of 2017 compared with last year. The number of food options available has also grown. As a pet owner, I always struggle to choose a “better” food for my fluffy friend, who refuses to touch some of what I’ve bought. I decided to work on this web scraping project to gain insights into pet food using data collected from previous years. The findings should benefit not only pet owners but also new vendors trying to break into this market.
Pet food makes up the largest part of the $15.92 billion pet market in the United States in 2016. The market is highly concentrated: just five vendors account for about 70% of retail sales (Nestlé, Mars, Big Heart, Colgate, and Blue Buffalo). All vendors have to appeal to what pet owners seek. According to survey results, their main criteria are “high quality” or “real meat” products.
Scraping
What is considered a “high quality” pet food product? The assumption for this research is that the quality of a pet food product depends on its ingredients. The guaranteed analysis is the first place to start when analyzing pet food ingredients. It tells you what percentage of the food comes from protein, fat, fiber, and moisture.
There are ~4,500 products available on the market. Product information for both cats and dogs was scraped from Chewy.com, the leading US online retailer. Below is a sample of the scraped data.
The data is preprocessed with Python pandas, and the ingredient list is split into individual features through a multi-step process.
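As a rough illustration of splitting an ingredient list into individual features, here is a minimal sketch. The column names (`product`, `ingredients`), the sample rows, and the helper `split_ingredients` are hypothetical, since the post does not show the scraped schema, and the actual multi-step processing was likely more involved.

```python
import pandas as pd

# Hypothetical sample rows; the real scraped columns are not shown in the post.
products = pd.DataFrame({
    "product": ["A", "B"],
    "ingredients": [
        "Chicken, Brown Rice, Fish Oil",
        "Beef, Chicken, Potassium Chloride.",
    ],
})


def split_ingredients(text):
    """Split on commas, trim spaces and periods, and lower-case each item."""
    parts = [p.strip(" .").lower() for p in text.split(",")]
    return [p for p in parts if p]


# One row per (product, ingredient) pair, then one 0/1 column per ingredient.
tokens = products["ingredients"].apply(split_ingredients)
features = (
    pd.get_dummies(tokens.explode(), dtype=int)
    .groupby(level=0)
    .max()
)
features.insert(0, "product", products["product"])
print(features)
```

Each ingredient becomes its own 0/1 column, which is one possible input shape for the clustering step. The post clusters on ingredient proportions, so the real feature table would hold portions rather than presence flags.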
Research
Based on the hypothesis, the research includes the following three steps:
- Compare the basic data in “Guaranteed Analysis” for nutritional info
- Identify products considered “good quality” and “bad quality” by customers
- Observe the differences between the quality groups
Methods used in the experiment:
- Scrapy
- K-means clustering
- ANOVA test
- Scattertext
Clustering of Products
Data for ~4,500 products is preprocessed. Using K-means clustering, an unsupervised learning method, the products are grouped into five clusters based on the proportions of approximately 70 different ingredients.
The data is PCA-reduced to 2 dimensions. The chart displays the 5 clusters of products based on the distance to the mean of each ingredient attribute.
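The clustering step might look roughly like the sketch below. It uses synthetic data in place of the scraped ingredient proportions, and the standardization step, `n_init`, and `random_state` values are assumptions. The post states only that K-means with five clusters was used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 products x 6 ingredient proportions (each row sums to 1).
# The real data has ~4,500 products and about 70 ingredients.
X = rng.dirichlet(np.ones(6), size=200)

# Assumption: put features on a comparable scale before computing distances.
X_scaled = StandardScaler().fit_transform(X)

# Five clusters, as in the post.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Number of products assigned to each cluster.
cluster_sizes = np.bincount(labels, minlength=5)
print(cluster_sizes)
```

Each row of `kmeans.cluster_centers_` is the mean ingredient profile of one cluster in the scaled feature space.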
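A minimal sketch of the 2-dimensional PCA projection, again on synthetic data. The variable names are illustrative, and the plotting step is left out so the example stays self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic stand-in for the ingredient-proportion matrix.
X = rng.dirichlet(np.ones(6), size=200)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Project the products onto the first two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

# Mean position of each cluster in the 2-D plane, useful for annotating a chart.
centers_2d = np.array([coords[labels == k].mean(axis=0) for k in range(5)])

print(pca.explained_variance_ratio_)
print(centers_2d)
```

Passing `coords[:, 0]` and `coords[:, 1]` to a scatter plot, colored by `labels`, would give a chart of the kind described above. `explained_variance_ratio_` shows how much of the original variance the two axes retain.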
The Data Quality
The result is based on the assumption that a product’s number of reviews and star rating reflect its quality as perceived by customers. To distinguish the product segments by quality, an ANOVA test is conducted. Clusters 3 and 4 are observed to have higher ratings and more customer reviews than Clusters 1 and 2. This suggests that perceived product quality does depend on the ingredients.
Chart 1 - the star ratings across the 5 product clusters
Chart 2 - the number of customer reviews across the 5 product clusters
Chart 3 - Cluster 0 represents outliers with missing reviews.
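The ANOVA comparison of ratings across clusters can be sketched as follows. The ratings are synthetic, with made-up means chosen only to mimic the pattern described above, and Cluster 0 is left out because it holds the outliers with missing reviews. The post does not say which ANOVA implementation was used, so `scipy.stats.f_oneway` is an assumption.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)

# Synthetic star ratings per cluster (hypothetical means), clipped to 1-5 stars.
ratings_by_cluster = {
    1: np.clip(rng.normal(3.8, 0.4, 60), 1, 5),
    2: np.clip(rng.normal(3.9, 0.4, 60), 1, 5),
    3: np.clip(rng.normal(4.5, 0.4, 60), 1, 5),
    4: np.clip(rng.normal(4.6, 0.4, 60), 1, 5),
}

# One-way ANOVA: the null hypothesis is that all cluster means are equal.
f_stat, p_value = f_oneway(*ratings_by_cluster.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

A small p-value says only that at least one cluster mean differs from the others. Identifying which clusters differ would need a follow-up comparison. The same test can be repeated with review counts in place of star ratings.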
In the following analysis, Clusters 3 and 4 are treated as “good” quality products (more reviews and higher ratings), while Clusters 1 and 2 are treated as “bad” quality products (fewer reviews and lower ratings). Below is a scatter plot of the words in the product ingredients.
Chart 4 - a frequency comparison between “good” and “bad” quality products
Certain ingredients appear frequently only in “good” products:
- Lutein (fish oil)
- FOS (sweetener)
- Mannan-oligosaccharides or MOS (fiber)
Certain ingredients appear frequently only in “bad” products:
- Gastrointestinal - prevent stomach flu
- Niacin - prevent gas
- Potassium - prevent digestive system disorder
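The idea behind the frequency comparison can be illustrated with a simplified sketch. This is not the library behind the chart. It only computes, for each ingredient, the share of “good” and “bad” products that contain it and ranks ingredients by the gap. The product sets and the helper `document_share` are hypothetical.

```python
from collections import Counter

# Hypothetical ingredient sets for a few products in each group.
good_products = [
    {"chicken", "lutein", "fos"},
    {"salmon", "lutein", "rice"},
    {"chicken", "fos", "rice"},
]
bad_products = [
    {"chicken", "niacin", "rice"},
    {"beef", "niacin", "potassium"},
    {"chicken", "potassium", "rice"},
]


def document_share(products):
    """Fraction of products in the group that list each ingredient."""
    counts = Counter(ing for product in products for ing in product)
    return {ing: n / len(products) for ing, n in counts.items()}


good_share = document_share(good_products)
bad_share = document_share(bad_products)

# Positive gap: more common in "good" products; negative: more common in "bad".
all_ingredients = set(good_share) | set(bad_share)
gap = {
    ing: good_share.get(ing, 0.0) - bad_share.get(ing, 0.0)
    for ing in all_ingredients
}

# Largest gap first; ties are broken alphabetically.
ranked = sorted(gap.items(), key=lambda kv: (-kv[1], kv[0]))
for ing, g in ranked:
    print(f"{ing:10s} {g:+.2f}")
```

Ingredients that appear in both groups at similar rates, such as `chicken` in this toy data, end up near zero and do not help distinguish the groups.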
The proportions of major ingredients are also compared between “good” and “bad” products. Good products have a higher proportion of glucosamine and chondroitin (supplements) than bad products. Bad products have a higher proportion of moisture than good products. A higher share of moisture squeezes the share of other nutrients such as protein. This may explain why wet food is less popular than dry food.
Chart 5 - Ingredient Distribution
Data Science Conclusions
Functionality and nutrition level are the main contributors to quality differences. Joint/vision support, higher levels of protein/fiber, and better taste are associated with good customer feedback, while digestion support from ingredients other than fiber and higher levels of moisture/ash are associated with bad customer feedback.
The ingredients differentiate product quality as perceived by customers. The findings on “good” and “bad” ingredients will help manufacturers produce “good quality” pet food and better adapt to the fast-growing pet food market.
Next Step
Due to the limited time available for the project, the research had a limited scope. For further study, it would be helpful to examine other differences to better understand the pet food market. That would entail looking at attributes such as price, key benefits, customers’ comments, and other available product information.