Data Analysis on the Values of Wines

Posted on May 1, 2020
If you have ever had to make up your mind while staring at dozens of similarly priced bottles of Californian Cabernet in a wine store, or if you wanted to learn more about the wine you just tried while travelling, then you probably know of Vivino.

It’s the SoundHound of wines. Just take a picture of the label, and Vivino will match it to one of the 1+ billion label photos in its database and then deliver all its basic information.

Launched in 2010 in Denmark by Heini Zachariassen (check out his YouTube channel), it quickly built a massive community of users ranging from casual drinkers to wine experts. Today Vivino has 43 million registered users who rate wines, write reviews and like each other’s posts.

In 2017, Vivino decided to capitalize on their vast network and launched a wine marketplace, where wineries and merchants register to sell their wines. Tens of thousands of wine items are sold on Vivino globally, with about 14,400 available when accessing from the U.S. East Coast. This is a typical picture on the front page:

With so many wines listed by the merchants, the question for Vivino is, which ones are good deals?

If there is a way to estimate the ‘true’ price of the wine, Vivino could signal to its loyal customers that some wines are priced much below their value.  And there are many reasons why a good wine may be offered at a discount: it may be a promotion by the merchant, a liquidation of stock, or sometimes just an oversupply from a good vintage year.

But can Vivino quickly identify the fair price range of a bottle without having to rely entirely on manual verification?

Exploring the data

An important number to remember when scraping dynamic websites with Selenium is 86,400, the number of seconds in a day. Selenium spent about 4-5 seconds navigating each wine page, which means a hard cap of roughly 20,000 samples per day. For many machine learning applications, this is a very small dataset.
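The arithmetic behind that cap can be checked in a few lines. This is only a back-of-the-envelope sketch: the function name `max_pages_per_day` is made up for illustration, and it assumes a single scraper visiting pages one after another with no pauses or retries.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def max_pages_per_day(seconds_per_page: float) -> int:
    """Upper bound on pages one sequential scraper can visit in 24 hours."""
    return int(SECONDS_PER_DAY // seconds_per_page)


# The 4-5 second range quoted above, plus its midpoint.
for seconds in (4.0, 4.5, 5.0):
    pages = max_pages_per_day(seconds)
    print(f"{seconds:.1f} s/page -> at most {pages:,} pages/day")
```

At 4 to 5 seconds per page the bound lands between 17,280 and 21,600 pages, which is where the round figure of 20,000 comes from.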

With that said, Vivino had only 14,400 items currently selling in the U.S., so all of them got scraped. For each wine, 13 key characteristics were obtained:

We first explore the data visually to check whether the price is correlated with any of the features. As we can see, price has a correlation of about 0.67 with the average rating: wines with higher ratings are usually more expensive. There is also a slightly negative correlation with the number of ratings, and the relationship resembles the typical inverse price/quantity curve.
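For readers who want to reproduce this kind of check, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The ratings and prices below are invented toy values, not the scraped Vivino data, and the helper name `pearson` is our own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (NOT real Vivino data).
toy_ratings = [3.6, 3.8, 4.0, 4.2, 4.4]
toy_prices = [12.0, 18.0, 25.0, 60.0, 110.0]

r = pearson(toy_ratings, toy_prices)
print(f"correlation between rating and price: {r:.2f}")
```

A value near +1 means price rises almost in lockstep with rating, a value near 0 means no linear relationship, and a negative value means the two move in opposite directions.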

However, there is still lots of variability. One great 4.2-star wine can be priced at $20 and another at $200! A lot of variability in the wine prices comes from perceived quality, as opposed to any objective characteristics. This variability may be explained by other predictors such as expert reviews or social media mentions and probably has a good deal of irreducible, natural variation.

Wine style and country of origin clearly have explanatory power, with median prices ranging from $20-25 to $60-80, which makes them good predictors of price:

However, we should be cautious and resist the temptation of throwing all the features into the model. This may lead to overfitting and reduce the predictive power on out-of-sample, previously unseen data.

For example, the specific winery and the vintage certainly influence the price, but we had to remove them from the predictive features because they were too granular (fewer than 3 wine items per winery).
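One simple way to spot such overly granular features is to count how many items fall under each distinct value. The sketch below is an illustration rather than the project's actual code: the helper names, the toy rows and the cut-off of 3 items per level are all assumptions.

```python
def items_per_level(values):
    """Average number of rows per distinct value of a categorical feature."""
    return len(values) / len(set(values))


def too_granular(values, min_items_per_level=3.0):
    """True if the feature has too few rows per distinct value (assumed cut-off)."""
    return items_per_level(values) < min_items_per_level


# Toy rows for illustration only (NOT real Vivino data).
toy_rows = [
    {"winery": "A", "country": "France"},
    {"winery": "B", "country": "France"},
    {"winery": "C", "country": "France"},
    {"winery": "D", "country": "Italy"},
    {"winery": "E", "country": "Italy"},
    {"winery": "A", "country": "Italy"},
]

kept_features = []
for feature in ("winery", "country"):
    column = [row[feature] for row in toy_rows]
    if too_granular(column):
        print(f"drop {feature}: {items_per_level(column):.1f} items per level")
    else:
        kept_features.append(feature)

print("kept:", kept_features)
```

In this toy set the winery column averages 1.2 items per level and is dropped, while country averages 3 items per level and is kept.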

Modelling the price

We built a few simple models to capture the relationship between the features and the price. Because the relationship is complex and highly non-linear, a one-layer fully-connected neural network showed the best results both in-sample and out-of-sample. Ordinary least squares is shown here merely as a benchmark; you would not expect it to work well with numerous categorical features and a modest dataset.

The model can predict the price with a mean absolute error (MAE) of about $10 and an R² of about 75%. The percentage price difference is distributed quasi-normally, so we can infer that the model has probably extracted all the useful information there was in our limited scraped features.
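As a reminder of what these two metrics measure, here is a minimal sketch of both in plain Python. The prices are invented toy numbers chosen to make the arithmetic easy to follow; they are not the model's actual predictions.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy prices in dollars, for illustration only (NOT real Vivino data).
toy_actual = [20.0, 40.0, 60.0, 80.0]
toy_predicted = [25.0, 35.0, 65.0, 75.0]

mae = mean_absolute_error(toy_actual, toy_predicted)
r2 = r_squared(toy_actual, toy_predicted)
print(f"MAE = ${mae:.2f}, R^2 = {r2:.2f}")
```

Here every prediction is off by $5, so the MAE is $5, and the predictions explain 95% of the variance in the toy prices.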

For our practical purposes, the most interesting part of this curve is on the right side, where our model thinks about 10% of the items should be priced at least 50% higher than the price currently listed at Vivino. These are the candidates to be promoted as great deals!
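The selection rule itself is simple enough to sketch in a few lines. The listing names, the dictionary layout and the function name `flag_deals` below are hypothetical; only the 1.5x threshold comes from the text above.

```python
def flag_deals(listings, min_ratio=1.5):
    """Return listings whose predicted price is at least min_ratio x the listed price."""
    return [
        item for item in listings
        if item["predicted"] >= min_ratio * item["listed"]
    ]


# Toy listings in dollars, for illustration only (NOT real Vivino data).
toy_listings = [
    {"name": "Wine A", "listed": 20.0, "predicted": 32.0},  # 1.6x  -> flagged
    {"name": "Wine B", "listed": 50.0, "predicted": 55.0},  # 1.1x  -> not flagged
    {"name": "Wine C", "listed": 30.0, "predicted": 45.0},  # 1.5x  -> flagged
]

deals = flag_deals(toy_listings)
for item in deals:
    ratio = item["predicted"] / item["listed"]
    print(f"{item['name']}: listed ${item['listed']:.0f}, "
          f"predicted ${item['predicted']:.0f} ({ratio:.1f}x)")
```

Given the model's error margin, a flagged item is best treated as a candidate for a human to check rather than a confirmed bargain.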

For a quick reference on converting categorical features (e.g. 1,149 wine-producing regions) into a machine-learnable format, selecting the feature subsets and comparing the models' performance, please see the project's GitHub repository.
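To give a flavour of what such a conversion looks like, here is a minimal sketch of one-hot encoding, a common way to turn a categorical column into numeric inputs. We are assuming one-hot encoding purely for illustration; the project's repository is the place to check which encoding was actually used. The region names are just examples of well-known wine regions.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with one slot per distinct level."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        row[index[value]] = 1
        rows.append(row)
    return levels, rows


# Example column for illustration only (NOT rows from the scraped dataset).
toy_regions = ["Napa Valley", "Rioja", "Napa Valley", "Barossa Valley"]

levels, encoded = one_hot_encode(toy_regions)
print(levels)
for region, row in zip(toy_regions, encoded):
    print(f"{region:15s} -> {row}")
```

With 1,149 regions this approach produces 1,149 mostly-zero columns, which is one reason why feature selection matters on a dataset of only 14,400 rows.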


We built a model that explains about 75% of the variability in wine prices, and for about 10% of the items, the model predicts a price at least 1.5x higher than the actual one. The system could flag such items automatically and pass them on for human verification. Such AI-aided marketing may prove to be another way to build customer loyalty and curate Vivino’s relationships with the wineries and merchants.

Probably the most promising direction to improve predictions is analyzing reviews, and typical Vivino wines have thousands of them. As in any online marketplace, the hive mind is usually well-informed. An NLP engine could extract sentiment from these reviews and make our price predictions more precise.

On a more general note, Vivino seems to have the ambition of becoming the Amazon of wine. To succeed in the growing $300 billion global wine industry, where online sales account for less than 5% of the total, Vivino will need to reinvent itself as a data company in much the same way Amazon has done.

About Author

Dmitri Levonian

Dmitri has managed diverse private assets in Europe for the past 15 years. He is a practitioner of deep learning and member of the TensorFlow Certificate network.