Data Analysis on the Evolution of Rap Music

Posted on May 27, 2017


The current state of rap music is a frequent topic of discussion in hip hop and rap communities. Many people, myself included, believe that rap music has been slowly deteriorating, especially since 2010, because today’s rap artists rely solely on the beat and not on good lyrics. In my view, the lyrics favored today have little or no wordplay, and the vocabulary has been severely dumbed down. I wanted to put that impression to the test with data analysis.

One of the perks of being a data scientist is being able to identify metrics to quantify a question like this. Some of the questions I wanted to answer were:

  • What are some of the main topics per year for rap music?
  • What were the most used words per year? Do they provide further insight into topics?
  • What was the change in the usage of derogatory words?
  • What is the measure of the vocabulary level of lyrics each year?

The assumption for this project was:

  • Rap music will be limited to the top rap songs listed on the Billboard website

Data (Web Scraping)

To get a general trend of the lyrics of rap music, I decided to scrape the Billboard website for the top rap songs from 1989 to 2016. Although this is by no means a comprehensive list of the rap songs from that period, it is a good place to start.

I used the Python library BeautifulSoup to scrape the Billboard website and stored the results in two dataframes: one for the artists per year and another for the song titles per year.

Once I had the artist names and song titles for the top rap songs, I used the unofficial API, tswift, to obtain the lyrics for each artist and title. Many of the artist names and song titles had to be reformatted before the tswift API would accept them: non-alphanumeric characters had to be removed and spaces had to be replaced with “-”. Below are tables displaying the artist names and song titles from 2011 to 2016 after they were adjusted to work with the tswift API.


The lyrics were then put in a pandas dataframe, where they could later be manipulated or used for analysis. Note that the tswift API was unable to find about 10% of the songs. In future work on this project, I will create my own API to obtain song lyrics from a website with a bigger database of songs, or from several websites.
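The name reformatting described above can be sketched roughly as follows. This is only an illustration of the two stated rules (drop non-alphanumeric characters, replace spaces with "-"); the function name to_slug is hypothetical, and the post does not say whether the API also needs other changes, such as lowercasing.

```python
import re


def to_slug(text):
    """Reformat an artist name or song title (hypothetical helper)."""
    # Drop every character that is not a letter, a digit, or whitespace.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Split on runs of whitespace and join the remaining words with "-".
    return "-".join(cleaned.split())


print(to_slug("Macklemore & Ryan Lewis"))  # Macklemore-Ryan-Lewis
print(to_slug("Can't Hold Us"))            # Cant-Hold-Us
```

Splitting on whitespace before joining avoids doubled hyphens when a removed character, such as "&", leaves two spaces behind.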

Data Analysis (NLP)


The first Natural Language Processing (NLP) technique I wanted to try was Latent Dirichlet Allocation (LDA). This is an unsupervised method that iteratively estimates how likely each word is to belong to each topic, as well as the mixture of topics that each document (here, a set of lyrics) touches on. As the number of topics is a very important tuning parameter, I tried several values (2, 3, 4, and 5) to see if I could identify topics that broke down the lyrics per year into distinct groups.

I also tuned two parameters of the LDA model, known as alpha and beta. Both values are usually between 0 and 1. A higher alpha value means each document is likely to contain a mixture of most of the topics, and vice versa. A higher beta value means each topic is likely to contain a mixture of many of the words found across all the documents, and vice versa.
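The post does not say which LDA implementation was used. To show where the number of topics, alpha, and beta enter the model, here is a toy collapsed Gibbs sampler in plain Python. The function name lda_gibbs, the default values, and the sample documents are all assumptions made for this sketch; a real analysis would use an established library.

```python
import random


def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative sketch only).

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab),
    where doc_topic[d][k] and topic_word[k][w] are smoothed probabilities.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    ndk = [[0] * n_topics for _ in docs]             # topic counts per document
    nkw = [[0] * n_words for _ in range(n_topics)]   # word counts per topic
    nk = [0] * n_topics                              # total tokens per topic
    z = []                                           # topic assigned to each token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][wid] -= 1
                nk[k] -= 1
                # Unnormalized probability of each topic for this token:
                # alpha smooths the document side, beta the word side.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wid] + beta)
                    / (nk[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][wid] += 1
                nk[k] += 1

    doc_topic = [
        [(ndk[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(nkw[k][w] + beta) / (nk[k] + n_words * beta) for w in range(n_words)]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word, vocab


# Made-up sample documents, not lyrics from the project.
sample_docs = [["love", "heart", "love"], ["money", "cash", "money"], ["love", "money"]]
doc_topic, topic_word, vocab = lda_gibbs(sample_docs, n_topics=2)
```

Raising alpha pulls every row of doc_topic toward an even mixture of topics, and raising beta does the same for the word distribution in each row of topic_word, which matches the behavior described above.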

Unfortunately, I was unable to adequately identify different topics in the lyrics per year. What the model did show was that the usage of derogatory words spiked after 1996. Some of the topics per year definitely centered around love, but apart from that, the LDA model provided little insight into the topics. Below is an example of the words belonging to the top three and top four topics in 2015 and 2016.



In the future, I hope to pass all the lyrics to the model at once and see whether it can identify distinct topics across the whole set of lyrics rather than per year.

Data on Top 25 Word Count Per Year

Since I was unable to identify different topics from the lyrics, I decided to look at the top 25 words per year. I did this using the Natural Language Toolkit (NLTK) library in Python. I tokenized the lyrics, found the stem of each word, and passed the stems into an NLTK function that identified the top 25 words per year. Rather than display barplots for every year from 1989 to 2016, I displayed the top 25 words for 1996, 2006, and 2016 as examples.

As can be seen from the barplots above, words like “what”, “that”, “said”, and “they” can be ignored, since they provide no insight into the topics. Once they are taken out, it is clear that derogatory words become more prevalent in the top 25 words per year.
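The counting step can be sketched with the standard library alone. Note the swap: the analysis above used NLTK for tokenizing and stemming, while this sketch uses a simple regular-expression tokenizer and skips stemming. The stopword set contains only the examples named above, and the function name top_words is hypothetical.

```python
import re
from collections import Counter

# Only the filler words named in the text; a real list would be much longer.
STOPWORDS = frozenset({"what", "that", "said", "they"})


def top_words(lyrics, n=25, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    # Lowercase, then keep runs of letters, allowing inner apostrophes ("don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", lyrics.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Made-up text, not real lyrics.
print(top_words("They said money money what money love love", n=2))
# [('money', 3), ('love', 2)]
```

Running this once per year on the concatenated lyrics for that year would give the per-year rankings that the barplots summarize.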

Trend of Derogatory Words and Similar Words

Again, many of the words obtained did not help differentiate the lyrics from 1989 to 2016, but I did notice the derogatory words among the top 25 words per year. From this, I created a list of derogatory words and plotted the number of times they were used in the lyrics each year.

Apart from the spikes, the general trend in the plot shows an increase in the usage of derogatory words, especially from the early 2000s to 2016.
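A yearly count like the one plotted here could be computed along these lines. This is a sketch under assumptions: the input is taken to be a dictionary mapping each year to a list of lyric strings, the function name count_listed_words_per_year is hypothetical, and the word list shown is a neutral placeholder rather than the list used in the project.

```python
import re


def count_listed_words_per_year(lyrics_by_year, word_list):
    """Count how often any word from word_list appears in each year's lyrics."""
    targets = {w.lower() for w in word_list}
    totals = {}
    for year, songs in sorted(lyrics_by_year.items()):
        # Same simple tokenizer idea as before: lowercase runs of letters.
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", " ".join(songs).lower())
        totals[year] = sum(1 for t in tokens if t in targets)
    return totals


# Placeholder data, not real lyrics or the project's word list.
sample = {1996: ["alpha beta", "beta gamma"], 2016: ["beta beta beta"]}
print(count_listed_words_per_year(sample, ["beta"]))
# {1996: 2, 2016: 3}
```

Raw counts depend on how many songs were retrieved for each year and how long they are, so dividing each total by the year's token count may give a fairer comparison across years.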

I also wanted to see how often words like “money”, “power”, and “sex” were used.

As can be seen from the figure above, apart from the spike in 1998, the trend is relatively flat, maybe even decreasing. This suggests that rappers may talk about money, power, and sex at a fairly consistent rate, while the usage of derogatory words is clearly rising.

One point to note is that the slang used today is completely different from the slang used 10 or even five years ago. So although the usage of words like “money”, “power”, and “sex” has decreased, different slang could have been used to refer to the same things. In the future, I hope to identify all words similar to “money”, “power”, and “sex” and check whether that changes the trend displayed above.


Although this is by no means a comprehensive list of the rap songs from 1989 to 2016, the lyrics from the Billboard website were a good place to start. Given more time, I would tune the LDA model further to determine whether there are topics to be extracted from the lyrics.

Although I was able to show that the usage of derogatory words has, on average, been increasing, I was unable to find a measure that captured a decline in lyrical wordplay or in the overall quality of storytelling. In the future, I hope to create an API capable of accessing a wider database of rap lyrics, write code to measure the vocabulary level of the lyrics, and identify ways of quantifying lyrical superiority.





About Author

Efezino Erome-Utunedi

Efezino recently completed his MENG in Mechatronics Design at the University of British Columbia, focusing on controls engineering. He now works full-time at an engineering consulting firm while enrolled in the NYCDSA's 2017 January to May online cohort,...
