Data Analysis of Top 50 Spotify Songs in 2021
Spotify is one of the world's most popular subscription streaming services - it hosts over 80 million tracks on its platform, and its subscriber base has grown dramatically in recent years. According to Spotify's financial report, in 2022 it had 433 million monthly active users (MAUs), including 188 million premium subscribers.
Streaming has genuinely changed today's music industry. It changes not only the way we listen to music and act on our preferences, but also the way artists share their work. It also benefits musicians and music makers, who can use large libraries and data to understand listeners' tastes better.
So, we all have the same question - do hit songs have common traits? In this post, we are going to analyze Spotify's 50 most-listened songs worldwide in 2021 to investigate this topic. The dataset is extracted from Spotify and has 14 descriptive variables:
- Popularity - The higher the value, the more popular the song is
- Danceability - The higher the value, the easier it is to dance to this song
- Energy - The higher the value, the more energetic the song is
- Key - The key of the song.
- Loudness (dB) - The higher the value, the louder the song
- Mode - Indicates the modality (major or minor) of a track
- Speechiness - The higher the value, the more spoken word the song contains
- Acousticness - The higher the value, the more acoustic the song is
- Instrumentalness - The closer the value to 1, the more instrumental the song is
- Liveness - The higher the value, the more likely the song is a live recording
- Valence - The higher the value, the more positive the song's mood
- Tempo - The overall estimated tempo of a track in beats per minute (BPM)
- Duration - The duration of the song in minutes
- Time signature - The time signature is a notational convention to specify how many beats are in each bar/measure.
Before analyzing, let's take a look at this data.
This density plot shows how the popularity scores in this list are distributed. The Spotify popularity index is a 0-to-100 score that ranks popularity relative to everything else on Spotify. As the score grows, artists get placed in more editorial playlists and gain more reach in algorithmic playlists and recommendations. Other factors may also affect the score, for example, the save rate, the number of playlists, the skip rate, and the share rate. All these factors can indirectly bump up or push down a song's popularity index. Here, we can see a relatively high density between 85 and 90, which means most of the songs in this top 50 list have a popularity of 85 to 90. And to get into this list, a song needs a popularity score of at least 65.
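To make the idea of "density" concrete, here is a minimal sketch that counts how many scores fall into each 5-point bin, which is a coarse text histogram rather than the smoothed density plot itself. The function name `bin_popularity` and the sample scores are made up for illustration; they are not the real dataset.

```python
from collections import Counter


def bin_popularity(scores, width=5):
    """Count how many scores fall into each [start, start + width) bin."""
    counts = Counter((score // width) * width for score in scores)
    return dict(sorted(counts.items()))


# Made-up popularity scores for illustration only (NOT the real dataset).
sample_scores = [66, 78, 84, 85, 86, 87, 88, 89, 89, 90, 92, 95]

# Print one row per bin, with one '#' per song in that bin.
for start, count in bin_popularity(sample_scores).items():
    print(f"{start:3d}-{start + 4:3d} | {'#' * count}")
```

The tallest row marks the range where most songs sit, which is the same thing the peak of a density plot tells us.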
Next, let's see who is on the top 50 list on Spotify.
Based on the graph, a total of 35 artists have songs on the top 50 most-listened list. 75% of them have one song on the list, and 25% have more than one. The three most popular artists are Olivia Rodrigo, Doja Cat, and Bad Bunny. They all have at least three songs on the list.
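Numbers like these come from counting songs per artist. Here is a small sketch of that counting step, assuming the data gives us one artist name per song; the function name `summarize_artists` and the placeholder artist names are hypothetical.

```python
from collections import Counter


def summarize_artists(artists):
    """Return (distinct artist count, share of artists with exactly one song, top 3)."""
    songs_per_artist = Counter(artists)
    n_artists = len(songs_per_artist)
    single = sum(1 for n in songs_per_artist.values() if n == 1)
    return n_artists, single / n_artists, songs_per_artist.most_common(3)


# Hypothetical artist column, one entry per song (placeholder names only).
sample = ["Artist A", "Artist A", "Artist A", "Artist B", "Artist C", "Artist D"]

n_artists, single_share, top = summarize_artists(sample)
print(f"{n_artists} artists, {single_share:.0%} with one song, top: {top}")
```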
Now that we have this basic information, we can start to analyze which features may affect whether a song becomes a hit. Let's check the correlation table to identify some baseline correlations between the variables.
In this table, we found that "energy" and "loudness" have the highest positive correlation, while "energy" and "acousticness" are inversely correlated. Unfortunately, with "popularity" as our dependent variable, we noticed low correlation values across our independent variables. Even though no single feature correlates strongly with popularity, we can still look for common features through exploratory data analysis (EDA).
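For readers who want to see what one cell of a correlation table measures, here is a minimal sketch of the Pearson correlation coefficient, assuming that is the measure behind the table (a common default, but the post does not say). The three short lists are made-up values chosen only to show a positive and a negative relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only (NOT the real dataset).
energy = [0.4, 0.5, 0.6, 0.7, 0.8]
loudness = [-9.0, -7.5, -6.0, -5.5, -4.0]
acousticness = [0.6, 0.5, 0.3, 0.2, 0.1]

print("energy vs loudness:", round(pearson(energy, loudness), 2))
print("energy vs acousticness:", round(pearson(energy, acousticness), 2))
```

A value near 1 means the two features rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no clear linear relationship.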
Based on the features given, I put them into two groups - one describes the music's character, and the other describes how the music is presented.
First, let's check the character group. Three features describe a song's character: valence, energy, and danceability. All of them are measured from 0 to 1.
In this plot, we can see that valence is spread evenly. Valence describes musical positiveness. Tracks with high valence sound more positive, while tracks with low valence sound more negative. So here, it means songs in all kinds of moods have a chance to be popular.
And let's take a look at danceability and energy. Energy measures the songs' intensity and activity. Typically, energetic tracks tend to be fast and loud. For example, death metal has high energy, while classical music scores low in energy. Danceability describes a track's suitability for dancing based on a combination of musical elements, including tempo, rhythm stability, and beat strength.
So here, we can tell that songs with high danceability and high energy are more popular than songs with low danceability and low energy. This is more obvious for danceability - almost all the songs have a danceability score above 0.5.
Next, let's check the features that describe how the music is presented - acousticness, liveness, instrumentalness, and speechiness.
Instrumentalness detects whether a track contains no vocals. The closer the value is to 1.0, the greater the likelihood that the track has no vocal content, and values above 0.5 are intended to represent instrumental tracks. Liveness detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. Generally, a value above 0.8 provides a strong likelihood that the track is live. Based on instrumentalness and liveness in this plot, we know that all the hit tracks are non-instrumental and are not live recordings.
The speechiness scores are all below 0.3, and most of the songs are below 0.2. Speechiness detects the presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words (e.g., talk shows or audiobooks), values between 0.33 and 0.66 describe tracks that may contain both music and speech (e.g., rap), and values below 0.33 most likely represent music and other non-speech-like tracks. So this tells us that non-rap and non-speech-like tracks are more likely to become hits.
Last but not least - acousticness. Acousticness indicates whether the track is acoustic or not. A value of 1.0 represents a highly acoustic track, meaning the song is more likely to be lower in energy and quieter. Acousticness is spread comparatively evenly here, which means both quiet and loud songs have their market, but songs with more energy tend to be more popular, which we can also tell from the previous graph.
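The speechiness ranges described here can be written as a small lookup. This is only a sketch of the thresholds as stated in this post (0.33 and 0.66); the function name and the category labels are my own, and how the exact boundary values are treated is an assumption.

```python
def speechiness_category(value):
    """Map a speechiness score (0.0 to 1.0) to the rough ranges described in the post."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("speechiness must be between 0.0 and 1.0")
    if value > 0.66:
        return "mostly spoken word"  # e.g., talk shows or audiobooks
    if value >= 0.33:
        return "music and speech"    # e.g., rap
    return "mostly music"            # music and other non-speech-like tracks


# Every song in the top 50 list scores below 0.3, so each lands in the last group.
for score in (0.05, 0.2, 0.29):
    print(score, "->", speechiness_category(score))
```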
Next, I want to talk about the other features that may also affect the songs' popularity - keys, duration, tempo, and time signature.
In Western music, there are 12 major keys and 12 minor keys. The bar plot describes how popularity differs for the same key across the two modes. Keys are numbered by pitch class: 0 stands for C, and 11 stands for B. In theory, most people can sing fairly comfortably in the range from middle C to C' or below. Here, we can tell that C major, C sharp major, and B minor are more popular than all the other keys. And major-mode songs are more popular than minor-mode songs in this 50 most-listened list. So we can assume that the more popular a track is, the more likely it is to contain vocals and to be singable for listeners.
As for duration, tempo, and time signature, the numbers tell us that songs lasting 2.5 to 4 minutes and written in 4/4 time have a better chance of being popular.
Without a doubt, hit songs are not easy to create. But this data analysis shows that hit songs do have some common characteristics. There are other features and factors we didn't cover here that can be explored more deeply, but adding these traits to your song may help you create a popular one.
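Since the key and mode columns are stored as numbers, a small helper makes them easier to read. This sketch assumes Spotify's standard pitch-class numbering (0 = C, 1 = C sharp/D flat, and so on up to 11 = B) and mode values of 1 for major and 0 for minor; the helper name `key_name` is hypothetical.

```python
# Pitch classes in order, starting from C (index 0) and ending at B (index 11).
PITCH_CLASSES = [
    "C", "C#/Db", "D", "D#/Eb", "E", "F",
    "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B",
]


def key_name(key, mode):
    """Turn a numeric key (0-11) and mode (1 = major, 0 = minor) into a readable name."""
    if key not in range(12):
        raise ValueError("key must be an integer from 0 to 11")
    if mode not in (0, 1):
        raise ValueError("mode must be 0 (minor) or 1 (major)")
    return f"{PITCH_CLASSES[key]} {'major' if mode == 1 else 'minor'}"


# The three keys highlighted in the bar plot.
for key, mode in [(0, 1), (1, 1), (11, 0)]:
    print((key, mode), "->", key_name(key, mode))
```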