Examining Billboard Hot 100 Lyrics from 1987 - 2016

Posted on Feb 20, 2017

Since August 4th, 1958, Billboard has published a weekly "Hot 100" singles chart.

Introduction - Billboard magazine cemented its status as an integral part of American popular culture with the creation of the Billboard Hot 100 chart. Since 1958, the Hot 100 has been accepted as the 'gold standard' of popular music rankings.

The rankings are based on a formula rather than on the musical preferences of the individuals who compile the list. Airplay on roughly one thousand terrestrial radio stations is tracked to form the foundation of the ranking data. Nielsen provides song sales data for both digital and physical formats, which is factored into the rankings. Most recently, Billboard added music streaming data to the Hot 100 rankings.

Scope - For this project I wanted to analyze the lyrics of popular songs over the past 30 years. To have a consistent input source and keep my own musical preferences from biasing the analysis, I chose to work with the Billboard Hot 100 charts. Billboard has released weekly Hot 100 charts since the 1950s. The current week's Hot 100 chart is available on Billboard's website.

My goal was to analyze the lyrics by year and find trends in the most popular words used.

Data - All data used in this project was scraped.

The Billboard Hot 100 chart data was scraped from The Ultimate Music Database using a combination of BeautifulSoup and regular expressions.
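To illustrate the "parse the HTML, then pull fields out with a regular expression" idea, here is a minimal sketch. It is not the code used for this project: it swaps in the standard library's `html.parser` for BeautifulSoup so it runs without extra installs, and the table markup, the `Artist - Title` cell format, and the names `CellCollector` and `parse_chart` are all assumptions made up for the example. The real chart pages may be laid out quite differently.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup with made-up entries; the real pages may differ.
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td>Example Artist - Example Song</td></tr>
  <tr><td>2</td><td>Another Act - Second Song (Remix)</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


# Assumed cell format: "Artist - Title"
ENTRY = re.compile(r"^(?P<artist>.+?) - (?P<title>.+)$")


def parse_chart(html):
    """Return (rank, artist, title) tuples for each two-cell row."""
    parser = CellCollector()
    parser.feed(html)
    entries = []
    for row in parser.rows:
        if len(row) != 2:
            continue
        match = ENTRY.match(row[1].strip())
        if match:
            entries.append((int(row[0]), match["artist"], match["title"]))
    return entries
```

Calling `parse_chart(SAMPLE_HTML)` yields one tuple per chart row, which is the shape the lyric lookup step needs: a title and an artist for every entry.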

The tswift unofficial API for MetroLyrics was used to acquire the lyrics corresponding to each song entry in the Billboard Hot 100 charts. This API allows quick access to the lyrical content hosted by MetroLyrics, with one major caveat: the song title and artist must be meticulously adjusted (removing non-alphanumeric characters, replacing spaces with '-', and correctly identifying the title and artist), otherwise it won't return the correct lyrics.
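The title and artist cleanup described above might look roughly like the sketch below. The function name `to_slug` is hypothetical, and lowercasing the text is my assumption; the post only mentions stripping non-alphanumeric characters and replacing spaces with '-'.

```python
import re


def to_slug(text):
    """Drop non-alphanumeric characters and join the words with '-'.

    Lowercasing is an assumption added for this example.
    """
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    # split() collapses runs of whitespace left behind by removed characters
    return "-".join(cleaned.split())


title_slug = to_slug("Don't Stop Believin'")
artist_slug = to_slug("Hall & Oates")
```

Splitting on whitespace after the substitution avoids doubled hyphens when a removed character (such as '&') sat between two spaces.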

Click here for lyric scraping code





Top 25 Words per Year, 1988-2016


[Figures: bar charts of the top 25 lyric words for each year, 1988 through 2016 (images topwords88 to topwords16).]

Wordcloud by year from 1987-2016

 

[Figures: word clouds of the lyrics for each year, 1987 through 2016 (images "words 1987" to "words 2016").]


###### Generate word cloud by year

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# words_by_year() and remove_nonalphanum() are project helpers not shown in this post;
# all_stopwords is built in the histogram code below.
def get_wordcloud_year(year):
    wordbag = words_by_year(year)

    words = remove_nonalphanum(wordbag)
    print(0, len(words))  # length of the cleaned lyrics before splitting

    words = words.split()
    # Remove one- and two-character tokens (mostly punctuation)
    words = [word for word in words if len(word) > 2]
    print(1, len(words))

    # Remove numbers
    words = [word for word in words if not word.isdigit()]
    print(2, len(words))

    # Lowercase all words (default_stopwords are lowercase too)
    words = [word.lower() for word in words]
    print(3, len(words))

    # Remove stopwords
    words = [word for word in words if word not in all_stopwords]
    print(4, len(words))

    # Build the cloud from the cleaned tokens (the font path is macOS-specific)
    wordcloud = WordCloud(
        width=1000,
        height=750,
        font_path='/Library/Fonts/Verdana.ttf',
        relative_scaling=1.0,
        stopwords=all_stopwords,
    ).generate(' '.join(words))

    # Display the generated image with matplotlib
    plt.figure(figsize=(20, 12))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()


 

### Generate histogram of the top 25 lyric words

from collections import Counter

import nltk
import pandas as pd

# default_stopwords is defined elsewhere in the project and is not shown in this post.
stopwords_file = './stopwords.txt'
with open(stopwords_file, encoding='utf-8') as f:
    custom_stopwords = set(f.read().splitlines())

all_stopwords = default_stopwords | custom_stopwords

def get_wordfreq_df(year):
    # words_by_year() is assumed to return the year's lyrics as text (str)
    wordbag = words_by_year(year)
    words = nltk.word_tokenize(wordbag)
    words = [word for word in words if len(word) > 2]
    words = [word for word in words if not word.isdigit()]
    words = [word.lower() for word in words]
    words = [word for word in words if word not in all_stopwords]

    fdist = nltk.FreqDist(words)

    # Full word-frequency table
    d = Counter(fdist)
    word_df = pd.DataFrame.from_dict(d, orient='index').reset_index()
    word_df = word_df.rename(columns={'index': 'Word', 0: 'count'})

    # Horizontal bar chart of the 25 most common words
    df = pd.DataFrame(fdist.most_common(25))
    df.columns = ['Words', 'Count']
    df.sort_index(ascending=False).plot(
        kind='barh',
        x='Words',
        title="Most Common Lyrics in: " + str(year),
    )

    return word_df
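The filtering and counting steps can also be tried on a small string without NLTK or pandas. The following is only a sketch under stated assumptions: `top_words` and the tiny `STOPWORDS` set are hypothetical stand-ins (the project uses a much larger stopword list), and a simple regular expression replaces `nltk.word_tokenize`, so token boundaries will not match the project's results exactly.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the project's list is much larger.
STOPWORDS = {"the", "and", "you", "that", "for"}


def top_words(lyrics, n=25, stopwords=STOPWORDS):
    """Return the n most common words after the same style of filtering."""
    # Letters and apostrophes only, so purely numeric tokens never appear
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    # Drop one- and two-character tokens, then drop stopwords
    kept = [t for t in tokens if len(t) > 2 and t not in stopwords]
    return Counter(kept).most_common(n)


sample = "Love me, love you, love the way you know me"
counts = top_words(sample, n=2)
```

Here `counts[0]` is `("love", 3)`: "me" is dropped for being too short, and "you" and "the" are dropped as stopwords.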

 

Conclusion - A lot has changed in popular music over the past 30 years, but one theme has stood the test of time: love. Although its lead seems to be fading in recent years (though that may be an artifact of my data collection), "love" is consistently one of the most frequently used words in popular music.

The most frequent words in the Billboard Hot 100 lyrics since 1987 are:

I'm

Love

Don't

Like

Know

Oh

Just

Got

Baby

Yeah

Want

You're

Cause

Make

Time

Let

Girl

Say

Way

Come

I'll

Ain't

Right

Gonna

Need

About Author

Scott Edenbaum

Scott Edenbaum is a recent graduate from the NYC Data Science Academy. He was hired by the Academy to assist in buildout of the learning management system and seeks to pursue a career as a Data Scientist. Scott's...