Yunnan Sourcing Tea Storefront and Analysis of the High End Tea Market

Posted on Apr 28, 2021

Github | LinkedIn | Yunnan Sourcing


Where many online tea wholesalers curate particular, international selections of teas, Yunnan Sourcing distinguishes itself by highlighting local sources based on data. Furthermore, what makes it a compelling target for analysis is its focus on "verified purchase reviews."

We will begin our analysis by laying the groundwork, that is, establishing what products such a specialized storefront might offer. From there we will attempt to infer customer behavior and interaction with these items. Finally, we will investigate the local brands represented on the store, specifically how well represented they are and how popular they are relative to one another amongst the consumer base.

Data Scraping & Gathering

In order to perform the information scraping, I used the Scrapy package for Python, directly grabbing the information from the website. The spider crawled through a hidden "collections" page, which contained all of the information, albeit repeated many times because one item might be featured in as many as 10 collections.
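The repetition across collections could be collapsed along the following lines. This is a minimal sketch rather than the spider's actual code: the `url` and `collection` field names are assumptions, as is the idea that a product URL uniquely identifies an item.

```python
def dedupe_items(scraped_rows):
    """Collapse rows that describe the same product scraped from several collections."""
    unique = {}
    for row in scraped_rows:
        key = row["url"]  # assumption: the product URL is a unique identifier
        if key not in unique:
            # First sighting: copy the row and start a list of its collections.
            item = dict(row)
            item["collections"] = [item.pop("collection")]
            unique[key] = item
        elif row["collection"] not in unique[key]["collections"]:
            # Repeat sighting: only record the additional collection.
            unique[key]["collections"].append(row["collection"])
    return list(unique.values())


# Illustrative rows only; these are not real scraped records.
rows = [
    {"url": "/products/a", "name": "Tea A", "collection": "raw-pu-erh"},
    {"url": "/products/a", "name": "Tea A", "collection": "new-products"},
    {"url": "/products/b", "name": "Gaiwan B", "collection": "teawares"},
]
items = dedupe_items(rows)
```

Keeping the list of collections, instead of discarding the duplicates outright, preserves the category membership that the later breakdowns rely on.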

There were two main challenges involved in the scraping. The first concerned an infinitely scrolling page for each category within a collection. I was able to overcome this very simply after close inspection of the page revealed a hidden pagination of the category items. The second challenge, however, proved far more complicated, as it obscured virtually all of the customer interaction data.
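Walking a hidden pagination can be sketched as below. The `page` query parameter, the example URL, and the stop-on-empty-page rule are all assumptions for illustration; the real site may expose its pages differently. The fetcher is passed in so that the loop can be exercised without touching the network.

```python
from urllib.parse import urlencode


def crawl_category(base_url, fetch_items, max_pages=500):
    """Request numbered pages of one category until a page comes back empty."""
    collected = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?{urlencode({'page': page})}"  # hypothetical parameter name
        items = fetch_items(url)
        if not items:
            break  # an empty page is taken to mean the category is exhausted
        collected.extend(items)
    return collected


# Stand-in for a real HTTP fetch-and-parse step, so the sketch runs offline.
FAKE_SITE = {
    "https://example.com/collections/demo?page=1": ["item-1", "item-2"],
    "https://example.com/collections/demo?page=2": ["item-3"],
}


def fake_fetch(url):
    return FAKE_SITE.get(url, [])


found = crawl_category("https://example.com/collections/demo", fake_fetch)
```

In a Scrapy spider the same idea would typically be expressed by yielding a request for the next page from the parse callback for as long as the current page yields items.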

Yunnan Sourcing manages all of its customer information with Yotpo, a self-described Marketing E-Commerce platform, which inserts code into the host site upon request. After close inspection of the Network requests, I discovered that Yunnan Sourcing sent requests to a Yotpo server for information. Using Trillworks' Curl Converter, I converted that request into a Python POST request and retrieved JSON data. I successfully extracted the customer information from these files using BeautifulSoup.
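The parsing half of that step might look roughly like this. Everything about the payload here is a made-up stand-in: the `result` key, the `review-count` class, and the HTML fragment are hypothetical and do not describe Yotpo's actual response format. The sketch also swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra dependencies, and it performs no network request.

```python
import json
from html.parser import HTMLParser


class ReviewCountParser(HTMLParser):
    """Collect text inside elements whose class is 'review-count' (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self._capture = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "review-count":
            self._capture = True

    def handle_data(self, data):
        if self._capture and data.strip():
            self.values.append(data.strip())

    def handle_endtag(self, tag):
        self._capture = False


def extract_review_counts(response_text):
    """Decode a JSON body whose entries carry an HTML fragment, then read the counts."""
    counts = []
    for entry in json.loads(response_text):
        parser = ReviewCountParser()
        parser.feed(entry["result"])  # hypothetical key holding the HTML fragment
        counts.extend(int(value) for value in parser.values)
    return counts


# Invented payload standing in for the body of the POST response.
raw_response = json.dumps(
    [{"result": '<div><span class="review-count">12</span><span class="stars">4.8</span></div>'}]
)
review_counts = extract_review_counts(raw_response)
```

The point of the sketch is the two-stage decode: the response is JSON on the outside, but the values of interest sit inside embedded HTML, which is why an HTML parser is still needed after the JSON is loaded.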

What's on the Storefront?

The first question we approach when looking at the storefront, before addressing individual product types, is simply this: how many tea products are there versus NON-tea products? We can break this down very quickly by sorting products by general Item Type and coloring them differently depending on whether they are tea or not.

What becomes immediately apparent is that both forms of Pu-erh Tea, Raw and Ripe, are the most populous products on the storefront by far - a phenomenon that makes a certain amount of sense considering the long-standing local history. The "Teawares" category, however, is something of an oddity, because so many of the other categories could ostensibly be binned within this one. In point of fact, this is sometimes the case; "Teawares" is a catch-all of any kind of non-tea item, ranging from tea ceremony towels to draining tea tables.

Programmatic differentiation of the Teawares category should, in theory, be possible through the product tagging. However, the tags are extremely inconsistent: many items are mis-tagged in a plethora of ways, and some are missing tags completely. For our product and brand analyses, this should not present too much of a problem; while there is overlap with other categories, the main difference is that Teawares items are rebranded under the Yunnan Sourcing brand, where items in other categories are not.

In the non-teas selection, we see that the majority of items are various, locally specialized pottery types. As the graph tails off to the right, we see a number of Yunnan Sourcing proprietary items (i.e. Gift Cards & Samplers) as well as some niche, utility items (e.g. Packaging & Charcoal specifically). Therefore, when we look further into these items, we will restrict much of our analysis to non-proprietary items that are not in the Teawares category.

The teas graph paints a very different picture; after the majority categories of Pu-erh, we can see that Oolong, Black, Hei Cha, White, and Green Teas are all fairly common on the storefront. Herbals, Blooming, Yellow, and Purple teas are all significantly reduced on the storefront. This makes sense, considering Purple and Blooming teas are extremely specialty items. Yellow Teas are very niche as well. Flower & Herbal teas, however, are neither niche nor specialty, so their low number on the storefront is conspicuous.

Customer Purchasing Habits

From the customer information we were able to glean several data types:

  • Wishlist Score
  • Review Star Score
  • Review Text

In attempting to infer actual customer purchasing behavior, Wishlist Score would be a poor indicator, as it only reflects aspirational purchasing behavior. Review Star Scores alone do not indicate purchasing habits, let alone quality, given the well-known weighting issue of reviews. Review text presents the same issue, further complicated by the internationally mingled review language. Raw review numbers, however, ARE filtered by the "verified purchase" process: each customer who reviews a product must have purchased it, so we can use review counts to infer the distribution of customer purchases.

Customer Verified Reviews - Non-Teas
Customer Interaction by Median Verified Reviews

Looking at the top of the graph, the majority of the items with high median customer interaction are the proprietary items. This is readily explained by how few of these items there are: with fewer items per category, customer reviews are more concentrated and less spread out than they would be in a large product category. That the utility products of Charcoal and Packaging make it into the higher echelons is similarly explicable; however, it may indicate to a savvy business owner that their inclusion might be worthwhile.

The pottery, on the other hand, paints another picture: there is some significant interaction, particularly with Yixing Pottery & Silver Teaware. The Teawares category naturally features a significant amount of interaction. While the remaining categories all feature much lower median scores, they nevertheless maintain a significant range of review interaction.
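The per-category medians discussed here can be computed with nothing more than the standard library. This is a sketch under assumed field names (`item_type`, `review_count`), and the numbers are illustrative rather than taken from the scraped data.

```python
from collections import defaultdict
from statistics import median


def median_reviews_by_category(products):
    """Return {category: median verified-review count}, highest median first."""
    grouped = defaultdict(list)
    for product in products:
        grouped[product["item_type"]].append(product["review_count"])
    medians = {category: median(counts) for category, counts in grouped.items()}
    return dict(sorted(medians.items(), key=lambda pair: pair[1], reverse=True))


# Illustrative values only.
products = [
    {"item_type": "Gift Cards", "review_count": 40},
    {"item_type": "Yixing Pottery", "review_count": 2},
    {"item_type": "Yixing Pottery", "review_count": 10},
    {"item_type": "Yixing Pottery", "review_count": 30},
]
category_medians = median_reviews_by_category(products)
```

The toy data also shows the effect described above: a category with a single item reports that item's count as its median, while a larger category has its reviews spread across many products.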

Customer Verified Reviews - Leaf Teas

The tea breakdown is particularly fascinating. The pu-erh teas are relatively low by median, as expected given the distribution effect discussed above, but the most reviewed products reflect international trends, not necessarily what is on the storefront! From the analysis above, Green and Black Teas are far less numerous on the storefront than Pu-erh, while Herbals are virtually absent.

According to the Specialty Tea Institute of America, US imports specifically track Black and Green Teas, not the rest. This reflects the common knowledge that the US is primarily a coffee-drinking country, with Black and Green Teas following; other teas are not as well tracked because they are not imported in great enough quantities. Herbals are largely a domestic product, but they likewise see great demand.

Price & Purchasing Cross-Examination

To further understand these data, it will also be important to sort out the relationship between price and review interaction. Below you can see a plot with reviews on the y-axis and price on the x-axis, the latter set to the log10 scale.

From here we can make several observations about these pottery types:

  • Yixing pottery is clearly some of the most popular pottery, particularly with a concentration on cheaper products
  • Silverware generally has less interaction but is much more expensive
  • Chaozhou Hong Ni and Qin Zhou Pottery occupy a similar range as Yixing pottery, but are not nearly as popular
  • Jian Shui has a cheaper range of prices, but does not have the same popularity as Yixing
  • Both Silver & Qin Zhou Pottery have select bumps in the higher range of popularity and may be worth investigating in detail

General Storefront Recommendations

Leaf tea pricing features several complications, which shall be addressed below. Regarding pottery, however, we can already provide some suggestions. In order to appeal to the higher end tea market, investing in Yixing pottery would certainly be prudent. Select Silver Teaware offerings also represent a prudent investment, considering they see some significant purchase interaction. Chaozhou Hong Ni and Jian Shui pottery may present popular higher-end and lower-end offerings respectively.

Leaf Tea & Pricing

Analysis of leaf tea pricing versus interaction is necessarily complicated by the many ways in which a given tea is sold. Most teas are sold according to weight, which we could programmatically compare to the display price. However, a large number of teas are sold in traditional formats, in a number of cases divorced from their value per weight. Among these formats, it is common for teas to be sold by cake, by block, and by caddy.
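The weight-based comparison, and the point at which it breaks down, can be sketched as follows. The variant label formats ("100 Grams", "357g cake", "1 Cake") are assumptions about how a listing might be worded, not the storefront's actual labels.

```python
import re

GRAMS_PER_UNIT = {"g": 1.0, "gram": 1.0, "grams": 1.0, "kg": 1000.0}
_WEIGHT = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|grams|gram|g)\b", re.IGNORECASE)


def price_per_gram(price_usd, variant_label):
    """Return USD per gram, or None when the label carries no usable weight."""
    match = _WEIGHT.search(variant_label)
    if match is None:
        return None  # e.g. sold per cake, block, or caddy with no stated weight
    grams = float(match.group(1)) * GRAMS_PER_UNIT[match.group(2).lower()]
    if grams == 0:
        return None
    return price_usd / grams


by_weight = price_per_gram(12.0, "100 Grams")
by_cake_with_weight = price_per_gram(50.0, "357g cake")
by_cake_only = price_per_gram(50.0, "1 Cake")
```

Any listing that returns `None` here falls outside a per-gram comparison entirely, which is the complication that motivates stepping back from pure price per gram.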

Rather than home in purely on the price per gram, it would be more effective for our analysis to turn our attention to brand data, including general brand pricing practices.

Brand Analysis

As we established above, Raw and Ripe Pu-erh teas constitute disproportionate amounts of the products on offer and therefore will be worth examination on their own. Whilst Hei Cha and White tea both feature a range of branded sources, the rest of the non-Pu-erh teas are, by and large, re-branded to Yunnan Sourcing. Therefore, in order to achieve the richest analysis possible, we will restrict our focus here to the Raw Pu-erh, Ripe Pu-erh, White Tea, and Hei Cha.

Raw Pu-erh Teas

With a denser graph, we can examine multiple axes at the same time. Each dot is a single brand: its size represents the total number of products the brand offers on the storefront, and its color represents the brand's mean Review Star Score. With median Item Pricing on the x-axis and median Wishlist Score on the y-axis, we can examine not only review interaction (inferred from the Star Score) but also aspirational purchasing targets charted against price.
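The four quantities encoded in this kind of graph can be derived per brand as in the sketch below. The field names (`brand`, `price`, `stars`, `wishlist`) and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_brands(products):
    """Aggregate product rows into the per-brand values a bubble chart would plot."""
    by_brand = defaultdict(list)
    for product in products:
        by_brand[product["brand"]].append(product)

    summary = {}
    for brand, rows in by_brand.items():
        summary[brand] = {
            "n_products": len(rows),                                  # dot size
            "mean_stars": mean(r["stars"] for r in rows),             # dot color
            "median_price": median(r["price"] for r in rows),         # x-axis
            "median_wishlist": median(r["wishlist"] for r in rows),   # y-axis
        }
    return summary


# Illustrative rows with placeholder brand names.
sample = [
    {"brand": "Brand A", "price": 30, "stars": 4.5, "wishlist": 20},
    {"brand": "Brand A", "price": 50, "stars": 5.0, "wishlist": 40},
    {"brand": "Brand B", "price": 90, "stars": 4.0, "wishlist": 35},
]
brand_summary = summarize_brands(sample)
```

Medians are used for price and wishlist score so that a single very expensive or very popular product does not dominate a brand's position on the chart.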

What immediately stands out about the graph is that Raw Pu-erh Brands feature a concentrated, nearly normal distribution of Wishlist Scores around a median of 35. Looking at the median pricing distribution, we see two peaks in the top marginal plot with one at $40 USD and the other at $90 USD. 

Of the darker-colored brands, CNNP stands out in a significant way: not only is it well-represented on the storefront, but it also has a high wishlist score compared to virtually every other brand of both lower and higher prices. As far as low-priced brands are concerned, the Xiaguan Tea Factory also stands out as wishlisted more than the average brand and still well-reviewed.

One of the unique properties of this kind of perspective is the consideration for outliers; where many analyses looking for trends will exclude outliers, a business might be interested in seeking disproportionately popular brands as up-and-coming sources. In this case, we can highlight such unique cases as the Pin Xiang Tea Factory; whilst it has fewer products on the storefront than many brands, it has a significant pull from customer wishlisting behavior. 
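One way to surface such cases programmatically is to flag brands that have few products yet a wishlist score well above the typical brand. The thresholds below (`max_products`, `wishlist_ratio`) are arbitrary assumptions, and the brands and figures are placeholders, not values from the analysis.

```python
from statistics import median


def find_small_popular_brands(summary, max_products=5, wishlist_ratio=1.5):
    """Return brands with few products but an unusually high median wishlist score."""
    typical = median(stats["median_wishlist"] for stats in summary.values())
    return sorted(
        brand
        for brand, stats in summary.items()
        if stats["n_products"] <= max_products
        and stats["median_wishlist"] >= wishlist_ratio * typical
    )


# Placeholder per-brand summary.
summary = {
    "Brand A": {"n_products": 40, "median_wishlist": 35},
    "Brand B": {"n_products": 3, "median_wishlist": 80},
    "Brand C": {"n_products": 4, "median_wishlist": 30},
}
candidates = find_small_popular_brands(summary)
```

A client could tighten or loosen both thresholds depending on how speculative a sourcing lead they are willing to follow.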

Ripe Pu-erh Brands

The distribution of Ripe Pu-erh Wishlist Scores is fairly similar to the previous example: here we see a relatively normal distribution with a peak at a score of 30. The price distribution, however, is completely different, featuring a single peak below $30 USD that tails off significantly towards the higher prices.

Here we see similar brands from the previous analysis: CNNP occupies the same position as before, with a relatively high wishlist score balanced by a higher price, and the Xiaguan Tea Factory again ranks relatively highly for a cheaper brand. Considering that Raw and Ripe Pu-erh teas are extremely similar in their processing, this is not an altogether surprising trend.

What stands out about this graph are the differences. The Jiu Wan Tea Factory is a standout: it did not make the cut for labeling in the former graph, but here it appears well-reviewed, well-represented on the storefront, and fairly consistently desired. Lastly, as with the outliers mentioned for the previous graph, here we see another oddity: the Golden Horse Brand.

Pu-Erh Recommendations

Were we to convey our recommendations to a client business, it would be prudent to suggest they look into investing in the work of the Xiaguan Tea Factory and CNNP for both Raw and Ripe Pu-erh, for lower pricing and higher pricing respectively. From there we could refine our parameters for illuminating outliers according to client requirements. I would personally recommend the Pin Xiang Tea Factory for Raw Pu-erh and the Golden Horse Brand for Ripe Pu-erh, for their high draw and low pricing.

Hei Cha & White Tea - Respective Observations & Recommendations

As we have already thoroughly analyzed the previous two graphs, we will move more quickly through this analysis.

Hei Cha Brand Analysis

With fewer brands represented than the Pu-erh teas, we see fewer trends here. However, some brands do stand out from the pack: the strongest low-price contender seems to be the Three Cranes Brand, with both solid representation and review interaction. Gao Jia Shan and Bai Sha Xi both seem like potentially prudent high-end brands, while Zhu Xiang Ji stands as a potentially solid outlier to examine.

White Tea Analysis

Here we see even greater variance than with the distribution of Hei Cha on the storefront; the most significant "brand" is the Yunnan Sourcing brand, which rebrands a great many products. After it, the lowest-priced brand of note would be the Yi Shan Tea Factory. We can relatively discount the Cha Nong Hao Brand, which, while highly wishlisted, was not consistently reviewed, suggesting less general purchasing traffic. As for the highest wishlisted and most consistently reviewed high-end product, Bao Feng Xiang Ji would likely be worth investment.

Summary Observations

Thus far we have been able to thoroughly investigate the product breakdown of the storefront and which elements of the storefront are most often purchased by the consumer base. Where further analysis of pricing structures became problematic, we were able to extend our investigation to the kind of publicly available sourcing information we can lift from the website. To this end, we were able to provide concrete recommendations for client networking regarding all clearly sourced teas, namely Pu-erh teas, Hei Cha, and White teas.

In a future return to this project, I would make better allowance for scraping of the product notes and for review texts. From there we would be able to employ Natural Language Processing, which we could use either to parse the store-provided notes for flavor profile descriptors or to provide in-depth review text analysis. This kind of research in Yunnan Sourcing's data, applied at large, could provide a serious edge to any client intending to enter or succeed in the higher-end tea market.


About Author

Theodore Cheek

Data Science & Machine Learning Engineer | A Passionate Puzzle-Solver and Pattern-Finder who enjoys translating data into clear and beautiful visualizations. Fluent with R, SQL, and Python.