Data Analysis on the Values of Wines
If you have ever had to make up your mind while staring at dozens of similarly priced bottles of Californian Cabernet in a wine store, or wanted to learn more about a wine you just tried while travelling, then you probably know of Vivino.
It’s the SoundHound of wines. Just take a picture of the label, and Vivino will match it to one of the 1+ billion label photos in its database and then deliver all its basic information.
Launched in 2010 in Denmark by Heini Zachariassen (check out his YouTube channel), Vivino quickly built a massive community of users ranging from casual drinkers to wine experts. Today it has 43 million registered users who rate wines, write reviews and like each other’s posts.
In 2017, Vivino decided to capitalize on its vast network and launched a wine marketplace, where wineries and merchants register to sell their wines. Tens of thousands of wine items are sold on Vivino globally, with about 14,400 available when accessing the site from the U.S. East Coast. This is a typical picture on the front page:
With so many wines listed by the merchants, the question for Vivino is, which ones are good deals?
If there is a way to estimate the ‘true’ price of the wine, Vivino could signal to its loyal customers that some wines are priced much below their value. And there are many reasons why a good wine may be offered at a discount: it may be a promotion by the merchant, a liquidation of stock, or sometimes just an oversupply from a good vintage year.
But can Vivino quickly identify the fair price range of a bottle without having to rely entirely on manual verification?
Exploring the data
An important number to remember when scraping dynamic websites with Selenium is 86,400, the number of seconds in a day. Selenium spent about 4-5 seconds navigating each wine page, which puts a hard cap of roughly 17,000 to 22,000 pages per day (call it 20,000 samples) on a single sequential scraping session. For many machine learning applications, this is a very small dataset.
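The arithmetic behind that cap can be checked in a few lines. This is only a back-of-the-envelope sketch: it assumes a single browser session visiting pages one after another, and the function name `pages_per_day` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_per_day(seconds_per_page: float) -> int:
    """Upper bound on pages one sequential session can visit in a day."""
    return int(SECONDS_PER_DAY // seconds_per_page)


# The 4-5 second range quoted in the text, plus its midpoint.
for secs in (4.0, 4.5, 5.0):
    print(f"{secs:.1f} s/page -> {pages_per_day(secs):,} pages/day")
```

At 4.5 seconds per page the bound is 19,200 pages, which is where the "about 20,000" figure comes from.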
With that said, Vivino had only 14,400 items currently selling in the U.S., so all of them got scraped. For each wine, 13 key characteristics were obtained:
We first explore the data visually to check whether the price is correlated with any of the features. As we can see, the price has a correlation of about 0.67 with the average rating: wines with higher ratings are usually more expensive. There is also a slightly negative correlation with the number of ratings, and the relationship resembles the typical inverse price/quantity curve.
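For readers who want to reproduce this kind of check, here is a minimal sketch of a Pearson correlation in plain Python. The article does not say which correlation measure was used, so Pearson is an assumption, and the ratings and prices below are made-up toy values, not the scraped data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy values for illustration only (not taken from Vivino).
toy_ratings = [3.5, 3.8, 4.0, 4.2, 4.5]
toy_prices = [15.0, 22.0, 30.0, 45.0, 90.0]
print(f"correlation: {pearson(toy_ratings, toy_prices):.2f}")
```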
However, there is still a lot of variability. One great 4.2-star wine can be priced at $20 and another at $200! Much of the variability in wine prices comes from perceived quality, as opposed to any objective characteristics. Some of it may be explained by other predictors such as expert reviews or social media mentions, and a good deal of it is probably irreducible, natural variation.
Wine styles and countries of origin have clear explanatory power, with median prices ranging from $20-25 to $60-80 across categories. These are good predictors of price:
However, we should be cautious and resist the temptation to throw all the features into the model. Doing so may lead to overfitting and reduce the predictive power on out-of-sample, previously unseen data.
For example, the specific winery and the vintage certainly influence the price, but we had to remove them from the predictive features because they were too granular (fewer than 3 wine items per winery).
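One simple way to spot features that are too granular is to look at the average number of items per distinct value. The sketch below is an illustration rather than the author's actual procedure: the helper names and the toy records are hypothetical, and the threshold of 3 simply mirrors the figure quoted above.

```python
from collections import Counter


def mean_items_per_level(values):
    """Average number of rows per distinct category value."""
    counts = Counter(values)
    return len(values) / len(counts)


def too_granular(values, min_items_per_level=3.0):
    """True if a categorical feature has too few rows per distinct value."""
    return mean_items_per_level(values) < min_items_per_level


# Toy columns for illustration only.
toy_winery = ["A", "A", "B", "C", "D", "E"]
toy_style = ["Red", "Red", "Red", "Red", "White", "White"]

print("winery too granular:", too_granular(toy_winery))  # 6 rows / 5 wineries
print("style too granular:", too_granular(toy_style))    # 6 rows / 2 styles
```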
Modelling the price
We built a few simple models to capture the relationship between the features and the price. Because the relationship is complex and highly non-linear, a one-layer fully connected neural network showed the best results both in-sample and out-of-sample. Ordinary least squares is shown here merely as a benchmark; you would not expect it to work well with numerous categorical features and a modest dataset.
The model can predict the price with a mean absolute error (MAE) of about $10 and an R² of about 75%. The percentage difference between the predicted and listed prices is distributed quasi-normally, which suggests that the model has probably extracted all the useful information there was in our limited scraped features.
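For reference, both metrics can be computed directly from their standard definitions. This is a minimal sketch in plain Python; the prices below are toy numbers chosen to make the arithmetic easy to follow, not model output from the project.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy listed prices and predictions, in dollars (illustration only).
toy_actual = [20.0, 40.0, 60.0, 80.0]
toy_predicted = [25.0, 35.0, 65.0, 75.0]

print("MAE:", mean_absolute_error(toy_actual, toy_predicted))
print("R2: ", r_squared(toy_actual, toy_predicted))
```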
For our practical purposes, the most interesting part of this curve is on the right side, where our model thinks about 10% of the items should be priced at least 50% higher than the current listed price at Vivino. These are the candidates to be promoted as great deals!
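The flagging rule itself is easy to express. The sketch below assumes that each item comes with a listed price and a model prediction; the function name `find_deals`, the tuple layout and the sample wines are all hypothetical, and the 50% cutoff is the one quoted above.

```python
def find_deals(items, min_upside=0.5):
    """Return (name, upside) pairs where the predicted price exceeds the
    listed price by at least `min_upside`, largest upside first.

    `items` is an iterable of (name, listed_price, predicted_price).
    """
    deals = []
    for name, listed, predicted in items:
        upside = (predicted - listed) / listed  # fractional price difference
        if upside >= min_upside:
            deals.append((name, upside))
    return sorted(deals, key=lambda deal: deal[1], reverse=True)


# Toy items for illustration only.
toy_items = [
    ("wine_a", 20.0, 32.0),  # predicted 60% above listed
    ("wine_b", 50.0, 55.0),  # predicted 10% above listed
    ("wine_c", 40.0, 60.0),  # predicted 50% above listed
]

for name, upside in find_deals(toy_items):
    print(f"{name}: predicted {upside:.0%} above listed price")
```

Candidates returned this way would still go to a person for verification before being promoted.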
For a quick reference on converting categorical features (e.g. 1,149 wine-producing regions) into a machine-learnable format, selecting feature subsets and comparing the models' performance, please see the project's GitHub.
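As a rough idea of what such a conversion can look like, here is a one-hot encoding sketch in plain Python. The article does not state which encoding the project uses, so one-hot is an assumption here, and the helper name and region values are only examples.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector.

    Returns the sorted list of distinct levels and one row per input value.
    """
    levels = sorted(set(values))
    position = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        row[position[value]] = 1  # exactly one indicator is set per row
        rows.append(row)
    return levels, rows


# Example region values (illustration only).
toy_regions = ["Napa Valley", "Rioja", "Napa Valley", "Barolo"]
levels, encoded = one_hot_encode(toy_regions)

print(levels)
for region, row in zip(toy_regions, encoded):
    print(region, row)
```

With more than a thousand regions, this produces a wide and sparse matrix, which is one reason feature selection matters on a dataset of this size.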
Conclusions
We built a model that explains about 75% of the variability in wine prices, and for about 10% of the items, the model predicts a price at least 1.5x higher than the actual one. The system could flag such items automatically and pass them on for human verification. Such AI-aided marketing may prove to be another way to build customer loyalty and curate Vivino’s relationships with the wineries and merchants.
Probably the most promising direction to improve predictions is analyzing reviews, and typical Vivino wines have thousands of them. As in any online marketplace, the hive mind is usually well-informed. An NLP engine could extract sentiment from these reviews and make our price predictions more precise.
On a more general note, Vivino seems to have the ambition of becoming the Amazon of wine. To succeed in the growing $300 billion global wine industry, where online sales account for less than 5%, Vivino will need to reinvent itself as a data company, much the same way Amazon has done.