Analyzing Spotify Song Metrics to Visualize Popular Songs

Posted on May 6, 2018

How would you go about describing your favorite new song? What makes it catchy to you? According to the people at Spotify, characteristics like the energy or mood of a song can actually be quantified, and the company has many algorithms for describing music in a surprising number of ways.

Now, why would you care about what the people at Spotify have to say about a song? With a user base of 159 million monthly active users, determining the key factors that affect popularity can be a powerful tool for record label producers looking for new artists to sign, or for aspiring data scientists who want to show off some nice visualizations and decide what to put on the ultimate summer playlist. Spotify's API notes define popularity as a function of the number of times a particular song was streamed, as well as how recent those streams are.

About the App

This app visualizes several key factors and explores how they correlate with popularity across a wide spectrum of music.

The first two plots offer a degree of interactivity, allowing the user to visualize the differences among genres. The box plot gives a more quantitative view of how the genres separate across a wide musical spectrum. The density plot shows how songs are distributed across the 0-100 popularity scale, which helps reveal any abnormalities in the popularity distribution (for instance, classical music seems to have a pretty well-defined bimodal distribution of popularity!).

Besides looking at each genre as a whole, I wanted to visualize a subset of each genre on a scatter plot, both to identify clusters and to see how other variables like energy or danceability change along with popularity. However, I ran into many issues in trying to separate the genres and display the information effectively. I decided on a 3D scatter plot, which adds another user-input variable so that two separate correlations can be examined at once, and which lets the user zoom in and rotate the axes to display the information to their preference. I have also included a small table of the Pearson correlation coefficients between several of Spotify's metrics and popularity.
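To make the correlation table a little more concrete, here is a minimal Python sketch of the Pearson correlation coefficient, computed directly from its definition. The function name `pearson_r` and the sample values are illustrative assumptions; this is not the app's code (the app was built in R) and the numbers are not from the dataset.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only (not from the author's dataset).
energy = [0.2, 0.4, 0.6, 0.8]
popularity = [30, 40, 50, 60]
r = pearson_r(energy, popularity)
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship. In pandas, `DataFrame.corr()` computes the same coefficient by default.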

Finally, I took the 50th percentile (in terms of popularity) from each genre in my dataset and displayed the results in a data table as 'threshold values' for each genre. For instance, for a metal song to be successfully popular, the relevant metric would need to be quite high, as the 50th percentile has a value of 0.902. Interestingly, danceability also seems to be a much more crucial factor for pop than for indie pop.
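As a rough illustration of per-genre threshold values, the sketch below computes the median (50th percentile) of each metric within each genre. This is only one possible reading of the approach described above; the helper name `genre_medians`, the field names, and the sample rows are assumptions made for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List

# Made-up rows for illustration only (not from the author's dataset).
songs = [
    {"genre": "metal", "energy": 0.95, "danceability": 0.40},
    {"genre": "metal", "energy": 0.90, "danceability": 0.35},
    {"genre": "metal", "energy": 0.85, "danceability": 0.45},
    {"genre": "pop", "energy": 0.60, "danceability": 0.70},
    {"genre": "pop", "energy": 0.70, "danceability": 0.80},
    {"genre": "pop", "energy": 0.65, "danceability": 0.60},
]


def genre_medians(rows: Iterable[dict], metrics: List[str]) -> Dict[str, Dict[str, float]]:
    """Return {genre: {metric: median value}} for the given metrics."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for metric in metrics:
            grouped[row["genre"]][metric].append(row[metric])
    return {
        genre: {metric: median(values) for metric, values in by_metric.items()}
        for genre, by_metric in grouped.items()
    }


thresholds = genre_medians(songs, ["energy", "danceability"])
```

With a pandas DataFrame, a grouped median such as `df.groupby("genre")[metrics].median()` would give an equivalent table.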

About the Dataset

The dataset was obtained by using the Spotify Web API in combination with the Python 3 library Spotipy. For each genre I chose, I queried 3,000 songs for the Spotify audio analysis and features. The API has a limit of 50 songs per request, so I had to create a loop to query the API in 50-song chunks until I reached 3,000 songs, and store the results in a pandas DataFrame. Afterwards, I wrote the data to a CSV file to do the majority of the analysis in R.
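The paging loop described above might look something like the following sketch. It is not the notebook's actual code: `collect_tracks` and `fake_fetch` are hypothetical names, and the fetch function is passed in as a parameter so the example runs offline without API credentials.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 50   # per-request limit mentioned in the post
TARGET = 3000    # songs per genre mentioned in the post


def collect_tracks(
    fetch_page: Callable[[int, int], List[Dict]],
    target: int = TARGET,
    page_size: int = PAGE_SIZE,
) -> List[Dict]:
    """Call fetch_page(offset, limit) repeatedly until `target` rows are collected."""
    rows: List[Dict] = []
    for offset in range(0, target, page_size):
        limit = min(page_size, target - offset)
        page = fetch_page(offset, limit)
        if not page:  # the source ran out of results early
            break
        rows.extend(page)
    return rows


# Hypothetical stand-in for a real API call. With Spotipy, fetch_page could
# wrap something like sp.search(q=..., type="track", limit=limit, offset=offset);
# the exact query the author used is not given in the post.
def fake_fetch(offset: int, limit: int) -> List[Dict]:
    return [{"id": f"track{i}"} for i in range(offset, offset + limit)]


tracks = collect_tracks(fake_fetch)
```

The resulting list of dictionaries can then be passed to `pandas.DataFrame(tracks)` and written out with `DataFrame.to_csv`.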

The Jupyter notebook used to query the server, as well as the Shiny application, can be found at this GitHub repo.

Future Work

I would love to continue this analysis of popularity metrics with clustering or regression analysis at a later date, or to develop a predictive model and feed information into it via Spotipy to identify up-and-coming popular artists.

For any comments or questions, please reach me via e-mail at [email protected]

About Author

Josh Vichare

BS in Materials Science & Engineering with a concentration on the study of Nanomaterials at Rutgers University. Josh has worked in the biomedical engineering field for close to 4 years in research and development, analyzing various performance metrics...

Leave a Comment

Josh Vichare May 8, 2018
Thanks! Spotify's Web API documentation has a lot of the definitions you're looking for, I think. As for how they actually determine the values, unfortunately that's not too well known. My guess is that it's an in-house algorithm that they're not too willing to share with the public.
luca May 8, 2018
hey! great job! how are the metrics you're including defined, e.g. danceability, etc.?
