Striking the Right Chord

Celina Sprague
Posted on Nov 26, 2018

To see the Shiny app, please click here


Have you ever wondered why so much of the music you hear today sounds the same? Do you think to yourself, “There’s no good music these days”? Have you wondered how some songs did so well when their reviews were so poor? Donald Knuth, an accomplished computer scientist, touched on these questions in his 1977 paper “The Complexity of Songs,” a tongue-in-cheek application of computational complexity theory to the tendency of popular songs toward repetitive lyrics with little meaning. His idea and analysis are still cited by researchers today. Closely related to lyrical repetition are “earworms,” the phenomenon of having a song “stuck in your head,” an effect that can be attributed to similar note sequences.

Just as an atom is a fundamental unit of matter, a note is a fundamental unit of music: notes can be strung together to form a sequence, or combined and played at the same time to create a chord. A chord can be represented in “note form,” as seen in sheet music, or notated with Roman numerals, with major chords written as “I, II, III, etc.” and minor chords as “i, ii, iii, etc.”
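As a small illustration of the Roman-numeral convention (the helper function here is hypothetical, not part of the project):

```python
# Map a scale degree (1-7) to Roman-numeral chord notation:
# uppercase for major chords, lowercase for minor chords.
ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV", 5: "V", 6: "VI", 7: "VII"}

def chord_numeral(degree, major=True):
    numeral = ROMAN[degree]
    return numeral if major else numeral.lower()

# The I-V-vi-IV progression common in pop music:
progression = [chord_numeral(1), chord_numeral(5),
               chord_numeral(6, major=False), chord_numeral(4)]
print("-".join(progression))  # I-V-vi-IV
```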

In much of the same way that probability analysis is applied to behaviors and patterns in, for example, gambling, competitions, and elections, probability analysis can be applied to chord sequences. Such analysis, in addition to song metrics like “danceability” and “speechiness,” can serve as the basis of a regression-based model to predict song success as defined by ranking.
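As a toy illustration of what a chord-sequence probability is, a set of made-up progressions can be treated as a first-order Markov chain, where the probability that one chord follows another is estimated from transition counts:

```python
from collections import Counter, defaultdict

# Invented progressions purely for the demo; HookTheory's probabilities
# come from a large corpus of real songs.
progressions = [
    ["I", "V", "vi", "IV"],
    ["I", "IV", "V", "I"],
    ["vi", "IV", "I", "V"],
]

# Count how often each chord is followed by each other chord.
transitions = defaultdict(Counter)
for prog in progressions:
    for current, nxt in zip(prog, prog[1:]):
        transitions[current][nxt] += 1

def transition_prob(current, nxt):
    """Estimated probability that `nxt` follows `current`."""
    total = sum(transitions[current].values())
    return transitions[current][nxt] / total if total else 0.0

print(transition_prob("I", "V"))  # 2 of the 3 transitions out of I go to V
```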


The dataset is compiled from three sources: HookTheory, the International Federation of the Phonographic Industry (IFPI), and a Spotify song-metrics dataset found on Kaggle. HookTheory is a website focused on chord progressions and their probabilities by song, so it served as the resource for chord-sequence probability data. The IFPI website is similar to Billboard in that it ranks songs and artists; however, IFPI does not inflate rankings with merchandise sales, so its ranks reflect purely album or audio sales. Lastly, the Spotify dataset from Kaggle contains various metrics, such as “danceability,” for a vast number of songs.


The list of the most-used chords was derived from HookTheory using BeautifulSoup, Selenium, and Scrapy. Due to a data limit imposed by HookTheory, only the 10 chords with the highest probabilities were used to obtain the top seven chord-sequence probabilities for sequences of length two, three, and four. With the final set of chord probabilities, a song list was created and mapped to each chord-sequence probability.
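The parsing half of that workflow can be sketched as follows. The HTML structure and class names below are invented (HookTheory’s real markup differs), but the BeautifulSoup pattern of locating elements and extracting a chord name and probability is the same:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a scraped HookTheory page.
sample_html = """
<ul class="chord-list">
  <li class="chord"><span class="name">I</span><span class="prob">0.28</span></li>
  <li class="chord"><span class="name">IV</span><span class="prob">0.17</span></li>
</ul>
"""

def parse_chord_probs(html):
    """Extract (chord, probability) pairs from the page."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select("li.chord"):
        name = item.select_one("span.name").get_text(strip=True)
        prob = float(item.select_one("span.prob").get_text(strip=True))
        results.append((name, prob))
    return results

print(parse_chord_probs(sample_html))  # [('I', 0.28), ('IV', 0.17)]
```

In practice, Selenium would first render any JavaScript-driven pages, and the resulting page source would be handed to a parser like this one.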

With the list of songs based on chord-sequence probability in hand, the Spotify Kaggle metrics were merged onto it by song. Finally, the IFPI song rankings were obtained through BeautifulSoup, though rankings were only available for 2007 to 2017.
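The merge step can be sketched with pandas; the column names and song titles below are hypothetical stand-ins for the real HookTheory, Kaggle, and IFPI fields:

```python
import pandas as pd

# Toy frames standing in for the three sources.
chords = pd.DataFrame({
    "song": ["Song A", "Song B"],
    "prob_1654": [0.12, 0.08],       # chord-sequence probability
})
spotify = pd.DataFrame({
    "song": ["Song A", "Song B"],
    "danceability": [0.81, 0.64],
})
ranks = pd.DataFrame({
    "song": ["Song A", "Song B"],
    "rank": [3, 47],
})

# Inner joins on song title keep only songs present in every source.
merged = chords.merge(spotify, on="song").merge(ranks, on="song")
print(merged.columns.tolist())  # ['song', 'prob_1654', 'danceability', 'rank']
```

An inner join is the natural choice here, since a song missing its chord probability, Spotify metrics, or IFPI rank cannot be used in the regression.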

The regression model was built in R, with feature selection performed by the stepAIC function running in both directions. With the selected variables, a VIF analysis with a threshold of 5 was performed to check for multicollinearity, which was practical given the relatively small set of variables (16 in total). Finally, song rank was regressed on the chord probability of “1, 6, 5, 4,” the chord probability of “4, 1, 5,” danceability, energy, instrumentalness, key, liveness, speechiness, and tempo.
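The project’s VIF check was run in R, but the computation behind it is simple enough to sketch from scratch. Below is a minimal Python illustration on synthetic data (the feature values and the near-duplicate column are invented for the demo): the VIF for feature j is 1 / (1 − R²), where R² comes from regressing feature j on the remaining features.

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor for column j of feature matrix X."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # regress j on the rest
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))               # three independent features
dup = X[:, 0] + 0.01 * rng.normal(size=200)  # near-duplicate of feature 0
X = np.column_stack([X, dup])

vifs = [vif(X, j) for j in range(X.shape[1])]
# Independent features sit near 1; the near-duplicate pair blows up
# far past the threshold of 5, flagging the multicollinearity.
```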


The results from the VIF analysis for the selected variables are below. As the values are all below 5, multicollinearity does not appear to be a large problem here.

The summary statistics for the regression model are shown below. They indicate the coefficients are all meaningful, as each is statistically significant at some level.


The results from the regression indicate that the two chord-sequence probabilities are important for song ranking. The coefficient weightings can be a little misleading at first because their signs look strange. However, consider the chord probability for “4, 1, 5”: holding all else constant, on average, a song containing the chord sequence “4, 1, 5” is associated with moving up about 59 rankings. This makes sense because a rank of 1 is better than a rank of 500, but improving from 500 toward 1 means moving numerically backward, which gives the coefficient a negative sign. Conversely, a positive coefficient indicates movement away from a rank of 1.

The coefficients for “liveness” and “speechiness” are of particular interest due to their large weights. While there is a wide range of literature on the definition of “liveness,” the common thread appears to be a measure of the co-presence of the artist and/or audience. With a weight of about 63, one might infer that listeners react negatively to audience noise, typically found in live recordings, because the background noise detracts from song quality. Instances like this might occur in recordings of live performances, such as the NBC live taping of Jesus Christ Superstar with John Legend.

“Speechiness” refers to the presence of spoken words in a track; house or EDM music, for example, tends to have low speechiness. This variable had the largest coefficient weight, -149, which could be interpreted as listeners perceiving less talent in artists whose songs feature fewer words than in those whose songs feature more. Consequently, such low-speechiness songs typically receive a worse ranking.

Overall, the data appear to indicate that songs generally described as upbeat and positive, with quality recording and instrumentation, typically perform better in terms of popularity. Of course, the model does not account for data about the artist, such as hair color, gender, and age. Additionally, the model was formed on a subset of chord probabilities due to the constraint imposed by HookTheory, so incorporating additional chord probabilities could yield different results.


Since the days of the Beatles and Queen, the music industry has undergone a vast transformation with the introduction of technologies like auto-tune and new genres like EDM; for some, modern music is perceived as poorer in quality than songs by Michael Jackson. This web scraping project explored that notion of music quality by looking at the correlation of chord-sequence probability and song metrics with song rank, a proxy for song success. The final dataset used for the regression analysis was compiled from HookTheory, the Spotify Kaggle dataset, and IFPI song rankings through web scraping via BeautifulSoup, Scrapy, and Selenium. The results from the multiple linear regression appeared to confirm the suspicion that popular, high-ranking songs do tend to have similar chord or note sequences, in addition to other song metrics like danceability and tempo. Hence, perhaps there is some truth for those who believe the “good old days” of music ended once major artists like Michael Jackson stopped producing new music and new artists like Taylor Swift emerged.

About Author

Celina Sprague

Celina Sprague completed her BA at Barnard College. Her work experience has been primarily in the finance industry where she worked in primarily fixed-income research and managed over 40 economic models. As a past dancer/overall athlete and painter,...
