Data Study on listening habits: Spotify’s Top 200

Posted on Oct 22, 2018

Spotify publishes a chart with the most streamed tracks per country. The chart is updated daily and goes back as far as the beginning of 2017. This Kaggle competition collected all the historical data from 2017. The data contains over 3 million rows and adds up to more than 99 billion streams.


Data on Track performance

Exploratory data analysis helps reveal what kind of insights can be drawn from the dataset. I started off by mapping the top tracks worldwide, with every country colored by the number of streams generated.

Top track per country in 2017:


Top track for Mexico was 'Me Rehúso' with 127.5 million streams
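As a minimal sketch of how the top track per country could be found, the snippet below sums streams per track within each country and keeps the largest total. The row layout (region, track, streams) and all values are hypothetical, chosen only for illustration; they are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical rows: (region, track, streams on one chart day). Values are made up.
rows = [
    ("mx", "Song A", 300),
    ("mx", "Song A", 200),
    ("mx", "Song B", 450),
    ("us", "Song B", 700),
    ("us", "Song C", 100),
]

# Total streams per (region, track) over the whole period.
totals = defaultdict(int)
for region, track, streams in rows:
    totals[(region, track)] += streams

# Keep the track with the highest total in each region.
top_track = {}
for (region, track), streams in totals.items():
    if region not in top_track or streams > top_track[region][1]:
        top_track[region] = (track, streams)
```

Here `top_track` maps each region code to a (track, total streams) pair, which is the value a map like the one above would color by.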

The map only shows ‘the tip of the iceberg’ concerning top ranking tracks. It does not shed light on how the songs ranked over time or how they performed in different countries.

To monitor individual songs over time, we can use a time series plot to visualize how their position changes and compare the patterns between countries. This can serve as a granular analysis of music adoption and retention per country.

In general, each song seems to have its own particular behaviour, and this becomes more noticeable when comparing music of different genres.

Individual track adoption and retention:


‘Despacito’ and ‘HUMBLE.’ ranking over time in Great Britain, Mexico and the U.S.

It is important to note that the data is not representative of the entire music streaming industry and that it has significant selection bias (it only covers Spotify users). Audio streaming comes from important sources other than Spotify. Apple Music data probably behaves very similarly, but streaming habits on YouTube are very different.

Singling out individual songs has provided some interesting insight into how differently songs perform in each country. However, does this behaviour scale up to countrywide listening habits? On average, does every country stream music in the same way? Or does average music adoption and retention vary between countries?

To answer these questions, I grouped the data by country and, for every track that reached the top 10, measured the number of days it took to reach its top position and the number of days it took to leave the charts entirely. The following scatter plot compares the country averages of these two measures.
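The grouping described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code behind the plot: the row layout (region, track, date, position) and all values are hypothetical, adoption is taken as days from a track's first chart appearance to its best position, and retention as days from that peak to its last chart appearance.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical rows: (region, track, chart date, position). Values are made up.
rows = [
    ("mx", "Song A", date(2017, 1, 1), 40),
    ("mx", "Song A", date(2017, 1, 11), 3),
    ("mx", "Song A", date(2017, 1, 31), 150),
    ("us", "Song A", date(2017, 1, 1), 20),
    ("us", "Song A", date(2017, 1, 4), 1),
    ("us", "Song A", date(2017, 1, 10), 90),
    ("us", "Song B", date(2017, 2, 1), 120),
    ("us", "Song B", date(2017, 2, 3), 60),
]

def adoption_retention(rows, top_n=10):
    """Per region, mean days to peak and mean days from peak to last chart day."""
    history = defaultdict(list)
    for region, track, day, position in rows:
        history[(region, track)].append((day, position))

    per_region = defaultdict(list)
    for (region, _track), entries in history.items():
        entries.sort()  # chronological order
        best = min(position for _, position in entries)
        if best > top_n:
            continue  # keep only tracks that reached the top N
        first_day, last_day = entries[0][0], entries[-1][0]
        # First day on which the track held its best position.
        peak_day = next(day for day, position in entries if position == best)
        per_region[region].append(
            ((peak_day - first_day).days, (last_day - peak_day).days)
        )

    return {
        region: (mean(a for a, _ in pairs), mean(r for _, r in pairs))
        for region, pairs in per_region.items()
    }

result = adoption_retention(rows)
```

Each value in `result` is an (adoption, retention) pair for one country, which corresponds to one point in a scatter plot of this kind.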

Mean track adoption and retention:


Mexico’s new music adoption is slower than in most countries; however, its top tracks stay in the charts for longer

The graph shows an important difference between continents, which is probably due to the native languages of each country. Regardless, it seems that if a country is quick to adopt new music, it will also be quick to move on to the next music trend.

In some countries, like Brazil or Mexico, the top tracks stay in the top charts for longer. The U.S. and Great Britain adopt new music faster and therefore move on to new songs more easily.


Data on Extreme events

Individual song streaming:

Global phenomena like ‘Despacito’ dwarf most songs

Zooming in again to visualize individual songs, I plotted a random sample of 1,000 songs and their respective total global streams. The plot shows that there are various extreme values in the sample (this holds true for all random samples taken from the dataset). In addition, most data points add up to a very low number of streams in comparison to the extremely popular tracks.

This is surprising given that the dataset only contains information on the most streamed songs in the world.

To check if this is true on a nationwide scale, I made boxplots showing the total number of streams for each song by country.

Song streaming in top countries:

Red points represent outlying observations in the data

As expected, the outlying observations have very extreme values, so much so that the boxes are not properly visible. Changing the scale can provide a better look; however, regular (non-outlying) observations are predictable and consistent events. Examining the extreme values can be much more insightful.
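For readers who want to flag outlying observations themselves, here is a minimal sketch. It assumes the common boxplot convention of marking points more than 1.5 times the interquartile range beyond the quartiles; the post does not state which rule produced the red points, and the stream totals below are made up.

```python
from statistics import quantiles

def outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up total streams per song for one hypothetical country.
streams = [10, 12, 13, 15, 16, 18, 20, 22, 400]
flagged = outliers(streams)
```

In this toy sample only the song with 400 streams is flagged, which mirrors how a single hit can stretch the axis until the box is barely visible.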

I labelled the songs that belong to the top 20% most streamed tracks to compare the number of streams they generate against the rest of the songs in the data.

Streaming from the top 20%

Top streaming countries in order of polarized listening habits

Polarized listening habits refer to the imbalance in music streaming. If polarization is high, a small number of songs is responsible for a high volume of streams. The higher the imbalance, the higher the polarization.

The barplot shows a pretty stark contrast in listening habits between countries. At the top is Great Britain, where 20% of the songs account for 90% of total streams during 2017. Sweden’s top songs account for 89% of streams.

At the bottom is Brazil, where the top tracks only contribute 75% of total streaming. In Mexico, 79% of streaming comes from the top 20%.
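The polarization measure behind these percentages can be sketched as the share of total streams generated by the top 20% of songs. The function name `top_share` and the stream totals are hypothetical, and rounding the cutoff to the nearest whole song is an assumption made for this sketch.

```python
def top_share(streams_per_song, fraction=0.2):
    """Share of total streams generated by the top `fraction` of songs."""
    ranked = sorted(streams_per_song, reverse=True)
    # Number of songs in the top group, rounded to the nearest whole song.
    k = max(1, round(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

# Made-up total streams for ten songs in one hypothetical country.
streams = [500, 300, 50, 40, 30, 25, 20, 15, 10, 10]
share = top_share(streams)
```

In this toy example the top two songs out of ten generate 800 of 1,000 streams, a share of 0.8. Computing the same share per country and sorting the results would give the ordering used in a barplot like the one above.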

Valuable information has emerged from diving into extreme events in the data. Now only one last question remains: does streaming polarization have any relation to music adoption and retention?

This sheds light on how to market new releases and what to expect within each country. Marketing strategies can be tailored for a country that is highly reliant on top tracks and that easily moves on to the next top trending music.

For future research, we can look into when the prime moment is for an artist to announce their tour dates in specific countries.

About Author

Raul Vallejo

Actuary, statistician and certified Data Scientist. Experienced in building risk models and integrating them into a company-wide modelling strategy. Leader of new multi-department initiatives to create data-driven culture. Raúl Vallejo completed his BA in Actuarial Science at Instituto...