Data Analysis on the Evolution of Rap Music
Introduction
The state of rap music today is widely discussed in hip hop and rap communities. Many people, myself included, believe that rap music has been slowly deteriorating, especially since 2010, because today’s rap artists rely on the beat rather than on good lyrics. In my view, the lyrics favored today have little or no wordplay, and the vocabulary has been severely dumbed down. I wanted to put that impression to the test with data analysis.
One of the perks of being a data scientist is being able to identify metrics to quantify a question like this. Some of the questions I wanted to answer were:
- What are some of the main topics per year for rap music?
- What were the most used words per year? Do they provide further insight into topics?
- How has the usage of derogatory words changed over time?
- What is the vocabulary level of the lyrics each year?
The assumption for this project was:
- Rap music will be limited to the top rap songs listed on the Billboard website
Data (Web Scraping)
To get a general trend of the lyrics of rap music, I decided to scrape the Billboard website for top rap songs from 1989 to 2016. Although this is by no means a comprehensive list of the rap songs from the selected time period, it is a good place to start.
I used the Python library BeautifulSoup to scrape the Billboard website and stored the results in two dataframes: one for the artists per year and another for the song titles per year.
Once the artist names and song titles for the top rap songs were obtained, I used the unofficial API, tswift, to obtain the lyrics based on the artist name and song title. To accomplish this, many of the artist names and song titles had to be reformatted before the tswift API would accept them: non-alphanumeric characters had to be removed and spaces had to be replaced with “-”. Below are tables displaying the artist names and song titles from 2011 to 2016 after they have been adjusted to work with the tswift API.
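The reformatting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the original code: the function name to_slug is hypothetical, the example names are only for demonstration, and the sketch assumes that only ASCII letters and digits count as alphanumeric.

```python
import re


def to_slug(text: str) -> str:
    """Remove non-alphanumeric characters and join the remaining words with '-'."""
    # Keep ASCII letters, digits, and whitespace; drop everything else
    # (assumption: accented characters are dropped rather than transliterated).
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # split() discards leading, trailing, and repeated whitespace.
    return "-".join(cleaned.split())


print(to_slug("Can't Hold Us"))       # Cant-Hold-Us
print(to_slug("Jay-Z & Kanye West"))  # JayZ-Kanye-West
```

Note that a hyphen already present in a name is removed along with other punctuation, so the only hyphens in the result are the ones that replace spaces.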
The lyrics were then put in a pandas dataframe where they could later be manipulated or used for analysis. Note that the tswift API was unable to find about 10% of the songs. In future work on this project, I will create my own API to obtain song lyrics from a website with a bigger database of songs or from several different websites.
Data Analysis (NLP)
LDA
The first Natural Language Processing (NLP) technique I wanted to try was Latent Dirichlet Allocation (LDA). This is an unsupervised method that iteratively estimates two things: the probability that each word in a document (here, a song’s lyrics) belongs to a particular topic, and the mixture of topics that each document covers. As the number of topics is a very important tuning parameter, I tried different numbers of topics (2, 3, 4, and 5) to see if I could identify topics that split the lyrics for each year into distinct groups.
I also tuned two parameters of the LDA model, known as alpha and beta. Both values are usually between 0 and 1. A higher alpha value means that each document is likely to contain a mixture of most of the topics, and vice versa. A higher beta value means that each topic is likely to contain a mixture of most of the words across all the documents, and vice versa.
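The effect of alpha can be illustrated without fitting a model at all, because alpha is the parameter of the Dirichlet prior from which each document's topic mixture is drawn. The sketch below is only an illustration under stated assumptions: a symmetric prior, five topics, and hypothetical helper names (sample_dirichlet, mean_largest_share). It does not reproduce the model used in this project. It simulates topic mixtures for a low and a high alpha and reports how much of a typical document is taken up by its single dominant topic.

```python
import random


def sample_dirichlet(alpha: float, k: int, rng: random.Random) -> list[float]:
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha: float, k: int = 5, n_docs: int = 2000, seed: int = 0) -> float:
    """Average weight of the dominant topic across n_docs simulated documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)) / n_docs


low = mean_largest_share(alpha=0.1)
high = mean_largest_share(alpha=10.0)
print(f"alpha=0.1:  dominant topic share = {low:.2f}")
print(f"alpha=10.0: dominant topic share = {high:.2f}")
```

With a low alpha, one topic tends to dominate each simulated document; with a high alpha, the weight is spread more evenly across the topics. Beta plays the same role for the distribution of words within each topic.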
Unfortunately, I was unable to identify distinct topics in the lyrics for each year. What the model did tell me was that the usage of derogatory words spiked after 1996. Some of the topics per year clearly centered on love, but apart from that, the LDA model provided little insight into the topics. Below is an example of the words belonging to the top three and top four topics in 2015 and 2016.
What I hope to do in the future is pass all of the lyrics to the model at once and see if it can identify different topics across the whole collection rather than per year.
Data on Top 25 Word Count Per Year
Since I was unable to identify different topics from the lyrics, I decided to look at the top 25 words per year. I did this using the Natural Language Toolkit (NLTK) library in Python. I tokenized the lyrics, found the stem of each word, and passed the stems into an NLTK function that identified the top 25 words per year. Rather than display bar plots of the top 25 words for every year from 1989 to 2016, I display the top 25 words in 1996, 2006, and 2016 as examples.
As can be seen from the bar plots above, words like “what”, “that”, “said”, and “they” can be ignored since they provide no insight into the topics. Once they are taken out, it becomes clear that derogatory words grow more prevalent in the top 25 words per year.
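Counting the most frequent words while ignoring uninformative ones can be sketched with the standard library alone. This is a simplified stand-in for the NLTK workflow described above, not the original code: it skips stemming, uses a plain regular expression as the tokenizer, and the function name top_words, the short stop-word list, and the sample lyrics are all illustrative assumptions.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a fuller list would be needed in practice.
STOP_WORDS = {"what", "that", "said", "they", "the", "a", "and", "i", "you", "it"}


def top_words(lyrics: list[str], n: int = 25,
              stop_words: set[str] = STOP_WORDS) -> list[tuple[str, int]]:
    """Return the n most frequent words across all songs, skipping stop words."""
    counts = Counter()
    for song in lyrics:
        # Lowercase, then treat runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", song.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


sample = ["They said that love is what I need", "Love and money, money and love"]
print(top_words(sample, n=2))  # [('love', 3), ('money', 2)]
```

Running this once per year, with each year's lyrics as the input list, gives the per-year rankings that the bar plots summarize.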
Trend of derogatory words and words alike
Again, many of the words obtained did not help in differentiating the lyrics from 1989 to 2016, but I did notice the usage of derogatory words among the top 25 words per year. From this, I created a list of derogatory words and plotted the number of derogatory words used per year in the lyrics.
Apart from the spikes, the general trend in the plot shows an increase in the usage of derogatory words, especially from the early 2000s to 2016.
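The per-year count behind a plot like this can be sketched as follows. It is a minimal illustration rather than the original code: the function name count_terms_per_year is hypothetical, the word list uses neutral placeholder terms instead of an actual derogatory-word list, and the lyrics are made up for the example.

```python
import re


def count_terms_per_year(lyrics_by_year: dict[int, list[str]],
                         terms: set[str]) -> dict[int, int]:
    """Count how many tokens in each year's lyrics appear in the given term set."""
    totals = {}
    for year, songs in sorted(lyrics_by_year.items()):
        hits = 0
        for song in songs:
            # Lowercase so that matching is case-insensitive.
            tokens = re.findall(r"[a-z']+", song.lower())
            hits += sum(1 for t in tokens if t in terms)
        totals[year] = hits
    return totals


# Placeholder terms standing in for a real word list.
terms = {"wordx", "wordy"}
lyrics_by_year = {
    1996: ["hello wordx world"],
    2016: ["wordx wordy wordx", "nothing here", "Wordy again"],
}
print(count_terms_per_year(lyrics_by_year, terms))  # {1996: 1, 2016: 4}
```

Raw counts depend on how many songs and words each year contains, so dividing by the total number of tokens per year may give a fairer comparison across years. Because the function accepts any set of terms, the same approach applies to other word lists.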
I also wanted to see how often words like “money”, “power”, and “sex” were used.
As can be seen from the figure above, apart from the spike in 1998, the trend is relatively flat, maybe even decreasing. It suggests that rappers have talked about money, power, and sex at a fairly consistent rate, while the usage of derogatory words is clearly rising.
One point to note is that the slang used today is completely different from the slang used ten or even five years ago. So although the usage of words like “money”, “power”, and “sex” has decreased, different slang could have been used to refer to the same ideas. What I hope to do in the future is identify all words similar to “money”, “power”, and “sex” and check whether that substantially changes the trend displayed above.
Conclusion
Although this is by no means a comprehensive list of the rap songs from 1989 to 2016, the lyrics from the Billboard website were a good place to start. Given more time, I would tune the LDA model further to determine whether there are topics to be extracted from the lyrics.
Although I was able to show that the usage of derogatory words has, on average, been increasing, I was unable to find a measure that captures a decline in lyrical wordplay or in the overall quality of storytelling. What I hope to do in the future is create an API capable of accessing a wider database of rap lyrics, write code to measure the vocabulary level of the lyrics, and identify ways of quantifying lyrical superiority.
References
- Wikipedia reference for Latent Dirichlet Allocation (LDA): https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
- LDA reference: https://www.youtube.com/watch?v=BuMu-bdoVrU&t=1039s