Analyzing Spotify Song Metrics to Visualize Popular Songs

Posted on May 6, 2018

How would you go about describing your favorite new song? What makes it catchy to you? According to the people at Spotify, characteristics like energy or mood of a song can actually be quantified, and have many algorithms to describe music in an amazing amount of ways.

Now, why would you care about what the people at Spotify have to say about a song? With a user base of 159 million active monthly users, determining key factors that affect popularity can actually be a powerful tool for record label producers to find new artists to sign, or for aspiring data scientists to show off some nice visualizations and determine what to put on the ultimate summer playlist. Popularity is well defined in their API notes as a function of the number of times a particular song was streamed, as well as how recent those streams are.

About the App

This app visualizes several key factors and investigates their correlation to popularity visually across a wide spectrum of music.

The first two plots offer a degree of interactivity, allowing the user to visualize the difference amongst the genres. The box plot helps to see a more quantitative take on the separation across a wide musical spectrum. The density plot helps more with visualizing across the 0 - 100 scale of popularity, to see if there are any abnormalities with popularity distribution (for instance, classical music seems to have a pretty well defined bi-modal distribution of popularity!)

Besides looking at each genre as a whole, I wanted visualize a subset of each genre on a scatter plot to identify clusters to look at other variables like energy or danceability and how they change along with popularity. However, I ran into many issues in trying to separate the genres and effectively display the information. I decided on a 3D scatter plot, adding another user-input variable to look at two separate correlations with a very interactive plot for the user to zoom in and rotate the axes to better display information to their preference.  I have also included a small table to look at the Pearson correlation coefficient of several of the metrics from Spotify with popularity.

Finally, I took the 50th percentile (in terms of popularity) from each genre in my dataset and displayed them in a datatable in terms of 'threshold values' for each genre. For instance, for a successfully popular metal song, the relative would need to be quite high, as the 50th percentile has a value of 0.902. Also interestingly enough, danceability seems to be a much more crucial factor for pop as opposed to indie pop.

About the Dataset

The dataset was obtained by using the Spotify Web API in combination with the Python 3 library Spotipy. For each genre I chose, I queried 3,000 songs for the Spotify audio analysis and features. The API has a 50 song limit at each time, so I had to create a loop to query the API in 3,000 song chunks, and store them in a relevant pandas dataframe. Afterwards, I wrote the data into a CSV to do the majority of the analysis within R.

The jupyter notebook used to query the server as well as the Shiny application can be found at this GitHub repo.

Future Work

I would love to continue this analysis of popularity metrics with clustering/regression analysis at a further date, or to be able to develop a predictive model and feed information into it via Spotipy to determine up-and-coming popular artists.

For any comments or questions, please reach me via e-mail at [email protected]

About Author

Josh Vichare

BS in Materials Science & Engineering with a concentration on the study of Nanomaterials at Rutgers University. Josh has worked in the biomedical engineering field for close to 4 years in research and development, analyzing various performance metrics...
View all posts by Josh Vichare >

Related Articles

Leave a Comment

Josh Vichare May 8, 2018
Thanks! Spotify's Web API documentation has a lot of the definitions you're looking for I think. As for how they actually determine the value, unfortunately that's not too well known. My guess its an in-house algorithm that they're not too willing to share out in the public.
luca May 8, 2018
hey! great job! how are defined the metrics you're including e.g. danceability, etc?

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 airbnb Alex Baransky alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus API Application artist aws beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep Bundles California Cancer Research capstone Career Career Day citibike clustering Coding Course Demo Course Report D3.js data Data Analyst data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization Deep Learning Demo Day Discount dplyr employer networking feature engineering Finance Financial Data Science Flask gbm Get Hired ggplot2 googleVis Hadoop higgs boson Hiring hiring partner events Hiring Partners Industry Experts Instructor Blog Instructor Interview Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter lasso regression Lead Data Scienctist Lead Data Scientist leaflet linear regression Logistic Regression machine learning Maps matplotlib Medical Research Meet the team meetup Networking neural network Neural networks New Courses nlp NYC NYC Data Science nyc data science academy NYC Open Data NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time Portfolio Development prediction Prework Programming PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn Selenium sentiment analysis Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau team TensorFlow Testimonial tf-idf Top Data Science Bootcamp twitter visualization web scraping Weekend Course What to expect word cloud word2vec XGBoost yelp