Wine Witness: a data-driven approach to item promotion at Vivino

Dmitri Levonian
Posted on May 1, 2020

If you have ever had to make up your mind while staring at dozens of similarly priced bottles of Californian Cabernet in a wine store, or if you have wanted to learn more about a wine you just tried while travelling, then you probably know of Vivino.

It’s the SoundHound of wines: take a picture of the label, and Vivino will match it against the 1+ billion label photos in its database and pull up the wine’s basic information.

Launched in 2010 in Denmark by Heini Zachariassen (check out his YouTube channel), it quickly built a massive community of users ranging from casual drinkers to wine experts. Today Vivino has 43 million registered users who rate wines, write reviews, and like each other’s posts.

In 2017, Vivino decided to capitalize on their vast network and launched a wine marketplace, where wineries and merchants register to sell their wines. Tens of thousands of wine items are sold on Vivino globally, with about 14,400 available when accessing from the U.S. East Coast. This is a typical picture on the front page:

With so many wines listed by the merchants, the question for Vivino is: which ones are good deals?

If there were a way to estimate the ‘true’ price of a wine, Vivino could signal to its loyal customers that some wines are priced well below their value. And there are many reasons why a good wine may be offered at a discount: a promotion by the merchant, a liquidation of stock, or sometimes just an oversupply from a good vintage year.

But can Vivino quickly identify the fair price range of a bottle without having to rely entirely on manual verification?

Exploring the data

An important number to remember when scraping dynamic websites with Selenium is 86,400: the number of seconds in a day. Selenium spent about 4-5 seconds navigating each wine page, which imposes a hard cap of roughly 20,000 samples per day. For many machine learning applications, this is a very small dataset.
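The back-of-the-envelope arithmetic behind that cap looks like this (4.5 seconds is an assumed midpoint of the observed 4-5 seconds per page, not a measured figure):

```python
# Daily throughput cap for a single sequential Selenium scraper.
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
SECONDS_PER_PAGE = 4.5           # assumed midpoint of the observed 4-5 s per wine page

daily_cap = int(SECONDS_PER_DAY / SECONDS_PER_PAGE)
print(daily_cap)  # 19200 pages per day at best
```

Any retries, rate limiting, or page-load hiccups only push the real number lower, hence the "roughly 20,000" ceiling.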

That said, Vivino had only about 14,400 items on sale in the U.S., so all of them were scraped. For each wine, 13 key characteristics were obtained.

We first explore the data visually to see whether price is correlated with any of the features. Prices show a 67% correlation with the average rating: wines with higher ratings are usually more expensive. There is also a slightly negative correlation with the number of ratings, and that relationship resembles the typical inverse price/quantity curve.
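A Pearson correlation like the 67% figure above can be computed without any libraries; here is a minimal sketch on made-up toy data (the ratings and prices below are illustrative, not the scraped values):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: higher ratings tend to come with higher prices.
ratings = [3.8, 4.0, 4.1, 4.2, 4.5]
prices = [12, 18, 25, 30, 60]
print(round(pearson(ratings, prices), 2))  # 0.97
```

In practice `pandas.DataFrame.corr()` does the same thing across all numeric columns at once.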

However, there is still a lot of variability: one great 4.2-star wine can be priced at $20 and another at $200! Much of the variability in wine prices comes from perceived quality rather than any objective characteristic. It may be explained by other predictors such as expert reviews or social media mentions, and probably includes a good deal of irreducible, natural variation.

Wine style and country of origin clearly have explanatory power, with median prices ranging from $20-25 to $60-80, making them good predictors of price.

However, we should be cautious and resist the temptation to throw all the features into the model: doing so may lead to overfitting and reduce predictive power on previously unseen, out-of-sample data.

For example, the specific winery and the vintage certainly influence the price, but we had to remove them from the predictive features because they were too granular (fewer than 3 wine items per winery).
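The post drops these features outright; a common alternative worth noting is to bucket rare levels into a catch-all category instead. A sketch of that approach (the `winery` records and the `min_count=3` threshold are illustrative):

```python
from collections import Counter

def bucket_rare_categories(records, key, min_count=3):
    """Replace category values seen fewer than min_count times with 'other'."""
    counts = Counter(r[key] for r in records)
    return [
        {**r, key: r[key] if counts[r[key]] >= min_count else "other"}
        for r in records
    ]

wines = [{"winery": "A"}, {"winery": "A"}, {"winery": "A"}, {"winery": "B"}]
cleaned = bucket_rare_categories(wines, "winery")
print(cleaned)  # winery "B" appears once, so it becomes "other"
```

This keeps some signal from the frequent wineries while preventing the model from memorizing one-off categories.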

Modelling the price

We built a few simple models to capture the relationship between the features and the price. Because the relationship is complex and highly non-linear, a one-layer fully connected neural network showed the best results both in-sample and out-of-sample. Ordinary least squares is shown merely as a benchmark; it is not expected to work well with numerous categorical features and a modest dataset.
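For intuition about the OLS benchmark, here is the closed-form fit for the one-feature case (a toy version; the actual model used many features and library code):

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression: y ~ a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Toy data generated from y = 1 + 2x; the fit recovers the coefficients exactly.
a, b = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```

With hundreds of one-hot columns from the categorical features, this linear form simply lacks the flexibility of the neural network, which is why it serves only as a baseline.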

The model can predict the price with a Mean Absolute Error of about $10 and an R² of about 75%. The percentage price difference is distributed quasi-normally, so we can infer that the model has probably extracted all the useful information there was in our limited set of scraped features.
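The two metrics quoted above are standard and easy to compute by hand; a minimal sketch on made-up predictions:

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute gap between actual and predicted."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: share of variance explained by the model."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy prices (illustrative, not the scraped data).
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 39]
print(mae(actual, predicted), r2(actual, predicted))  # 2.0 0.964
```

An MAE of about $10 on wines with a median price of $20-80 leaves real room for error on any single bottle, which is why the flagged deals still warrant human review.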

For our practical purposes, the most interesting part of this distribution is the right tail, where the model estimates that about 10% of the items should be priced at least 50% higher than their current listed price on Vivino. These are the candidates to be promoted as great deals!
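The flagging rule itself is a one-liner once predictions are in hand; a sketch (the item names and field names `listed`/`predicted` are illustrative, not Vivino's schema):

```python
def flag_deals(items, threshold=1.5):
    """Keep items whose predicted fair price is at least threshold x the listed price."""
    return [it for it in items if it["predicted"] >= threshold * it["listed"]]

items = [
    {"name": "Cab A", "listed": 20, "predicted": 35},  # 35 >= 1.5 * 20 -> flagged
    {"name": "Cab B", "listed": 40, "predicted": 42},  # 42 <  1.5 * 40 -> kept off
]
deals = flag_deals(items)
print([d["name"] for d in deals])  # ['Cab A']
```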

For a quick reference on converting categorical features (e.g. 1,149 wine-producing regions) into a machine-learnable format, selecting the feature subsets, and comparing the models' performance, please see the project's GitHub repository.
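The most common conversion for such categorical features is one-hot encoding; a dependency-free sketch (toy region names, not the full 1,149-level list):

```python
def one_hot(values):
    """Map each distinct category to a 0/1 indicator vector, columns in sorted order."""
    levels = sorted(set(values))
    index = {lvl: i for i, lvl in enumerate(levels)}
    return [[1 if index[v] == i else 0 for i in range(len(levels))]
            for v in values]

print(one_hot(["Tuscany", "Rioja", "Tuscany"]))
# [[0, 1], [1, 0], [0, 1]]  -- columns are [Rioja, Tuscany]
```

With 1,149 regions this produces 1,149 sparse columns, which is precisely why feature selection and rare-category handling matter for a dataset of only ~14,400 rows.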


We built a model that explains about 75% of the variability in wine prices, and for about 10% of the items it predicts a price 1.5x higher than the actual one. The system could flag such items automatically and pass them on for human verification. Such AI-aided marketing may prove to be another way to build customer loyalty and curate Vivino's relationships with wineries and merchants.

Probably the most promising direction for improving predictions is analyzing reviews; a typical Vivino wine has thousands of them. As in any online marketplace, the hive mind is usually well informed. An NLP engine could extract sentiment from these reviews and make the price predictions more precise.

On a more general note, Vivino seems to have the ambition of becoming the Amazon of wine. To succeed in the growing $300 billion global wine industry, where less than 5% of sales happen online, Vivino will need to reinvent itself as a data company much the same way Amazon has done.

About Author

Dmitri Levonian


Dmitri has managed diverse private assets in Europe for the past 15 years. He is a practitioner of deep learning and member of the TensorFlow Certificate network.