Data Analyzing Song Metrics to Visualize Popular Songs
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
How would you go about describing your favorite new song? What makes it catchy to you? According to data from the people at Spotify, characteristics like energy or mood of a song can actually be quantified, and have many algorithms to describe music in an amazing amount of ways.
Now, why would you care about what the people at Spotify have to say about a song? With a user base of 159 million active monthly users, determining key factors that affect popularity can actually be a powerful tool for record label producers to find new artists to sign, or for aspiring data scientists to show off some nice visualizations and determine what to put on the ultimate summer playlist. Popularity is well defined in their API notes as a function of the number of times a particular song was streamed, as well as how recent those streams are.
About the App and Data
This app visualizes several key factors and investigates their correlation to popularity visually across a wide spectrum of music.
The first two plots offer a degree of interactivity, allowing the user to visualize the difference amongst the genres. The box plot helps to see a more quantitative take on the separation across a wide musical spectrum. The density plot helps more with visualizing across the 0 - 100 scale of popularity, to see if there are any abnormalities with popularity distribution (for instance, classical music seems to have a pretty well defined bi-modal distribution of popularity!)
Besides looking at each genre as a whole, I wanted visualize a subset of each genre on a scatter plot to identify clusters to look at other variables like energy or danceability and how they change along with popularity. However, I ran into many issues in trying to separate the genres and effectively display the information. I decided on a 3D scatter plot, adding another user-input variable to look at two separate correlations with a very interactive plot for the user to zoom in and rotate the axes to better display information to their preference. I have also included a small table to look at the Pearson correlation coefficient of several of the metrics from Spotify with popularity.
Finally, I took the 50th percentile (in terms of popularity) from each genre in my dataset and displayed them in a datatable in terms of 'threshold values' for each genre. For instance, for a successfully popular metal song, the relative would need to be quite high, as the 50th percentile has a value of 0.902. Also interestingly enough, danceability seems to be a much more crucial factor for pop as opposed to indie pop.
About the Dataset
The dataset was obtained by using the Spotify Web API in combination with the Python 3 library Spotipy. For each genre I chose, I queried 3,000 songs for the Spotify audio analysis and features. The API has a 50 song limit at each time, so I had to create a loop to query the API in 3,000 song chunks, and store them in a relevant pandas dataframe. Afterwards, I wrote the data into a CSV to do the majority of the analysis within R.
The jupyter notebook used to query the server as well as the Shiny application can be found at this GitHub repo.
Future Work
I would love to continue this analysis of popularity metrics with clustering/regression analysis at a further date, or to be able to develop a predictive model and feed information into it via Spotipy to determine up-and-coming popular artists.
For any comments or questions, please reach me via e-mail at joshvichare@gmail.com.