How to Make the Goodreads.com Top 400 List?
I am a bookworm.
I read 3-4 books a month (non-fiction books only) and start 80% of my sentences with "I read somewhere that..."
My interest in books is what motivated me to scrape goodreads.com, the world's largest site for readers and book recommendations, which Amazon acquired in 2013.
The goal of this project was to analyze the common traits of the Top 400 books on goodreads.com and to predict the success of a book based on its reviews.
Scraping Goodreads.com
I scraped the books from the ‘Best Books Ever’ list. To capture all the information needed for my analysis, I had to scrape three different page templates.
Below is an example of a page template that lists a few elements required for my analysis (highlighted in orange).
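For the curious, below is a minimal sketch of the kind of scraper I used. The list URL is the real one, but the CSS selectors and field names are assumptions about the page markup, which may differ in practice:

```python
import requests
from bs4 import BeautifulSoup

LIST_URL = "https://www.goodreads.com/list/show/1.Best_Books_Ever"

def scrape_list_page(page_number):
    """Fetch one page of the list and pull out a few fields needed for the analysis."""
    response = requests.get(LIST_URL, params={"page": page_number})
    soup = BeautifulSoup(response.text, "html.parser")
    books = []
    for row in soup.select("tr[itemtype='http://schema.org/Book']"):
        title = row.select_one("a.bookTitle")
        rating = row.select_one("span.minirating")
        books.append({
            "title": title.get_text(strip=True) if title else None,
            "rating_text": rating.get_text(strip=True) if rating else None,
            "book_url": title["href"] if title else None,
        })
    return books

# The list shows 100 books per page, so the first 4 pages cover the Top 400.
top_400 = [book for page in range(1, 5) for book in scrape_list_page(page)]
```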
The analysis
I divided my analysis into three parts, each guided by a set of questions:
- Writing:
- Which genre is most represented in the Top 400 list?
- Is there an optimal number of pages?
- Publishing:
- Does the year of publication matter?
- Predicting:
- Is the number of reviews a predictor of success?
- Which words, at which frequency, can predict success?
The first two parts are intended to guide an author BEFORE he/she writes a book, while the last one is designed to predict the success of a book AFTER it has been published.
I used Python for my analysis and Python’s Bokeh package for visualization.
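As an illustration of the workflow, here is a minimal Bokeh bar-chart sketch in the spirit of the charts in this post; the file name and column name are placeholders for the scraped data, not the actual files:

```python
import pandas as pd
from bokeh.plotting import figure, show

# Hypothetical file with one row per (book, genre) pair from the scrape
books = pd.read_csv("top400_genres.csv")
genre_counts = books["genre"].value_counts()

p = figure(x_range=list(genre_counts.index), height=400,
           title="Number of Top 400 books per genre")
p.vbar(x=list(genre_counts.index), top=genre_counts.tolist(), width=0.8)
p.xaxis.major_label_orientation = 0.8  # tilt long genre names so they stay readable
show(p)
```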
1- Writing
Which genre is most represented in the Top 400 list?
Fiction, Classics and Fantasy are the most represented genres in the Top 400 list. Note that there may be an overlap between the genres as a book could be classified under multiple genres.
The histogram above raises a few questions:
- What is the distribution of each genre within the Top 400 list? For example, are Fiction books mainly in the Top 100 or at the bottom of the Top 400 list?
- Is a book more likely to make it to the Top 400 if its genre is Fiction, Fantasy or Classics? What if these 3 genres are simply highly represented on Goodreads.com? One should expect the likelihood of a genre in the Top 400 list to be proportional to the genre's prevalence for all books on Goodreads.com.
Distribution of each genre within the Top 400 list
The heatmap below addresses the first question: it shows the distribution of the Top 16 genres by ranking category (e.g. Top 100, Top 200, etc.). The color depicts the number of books.
An interesting insight: the Literature genre is not among the top three genres overall, yet it is highly represented in the Top 100 list.
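For reference, the counts behind a heatmap like this can be assembled with a simple pivot; the column names below ('genre', 'rank') and the file name are assumptions about the scraped data:

```python
import pandas as pd

books = pd.read_csv("top400_genres.csv")  # hypothetical file name
books["rank_bucket"] = pd.cut(books["rank"], bins=range(0, 401, 100),
                              labels=["Top 100", "Top 100-200", "Top 200-300", "Top 300-400"])

# Count books per genre per ranking category, then keep the 16 most common genres
heatmap_counts = books.groupby(["genre", "rank_bucket"]).size().unstack(fill_value=0)
top_16 = heatmap_counts.sum(axis=1).nlargest(16).index
heatmap_counts = heatmap_counts.loc[top_16]
```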
Bringing in another dataset: our Control list
As raised above, the Top 400 list is meaningless if we can't put it into context by comparing it to the worst 400 list. Unfortunately, there is no such list...
In the absence of a 'Worst Ever Books' list, I scraped a random sample of 10,000 books on goodreads.com, which became my ‘Control’ list.
Disclaimer: I learnt later that the Worst Ever Books list does exist on goodreads.com.
The histograms below compare the distribution of genres between the Control list and the ‘Best Books Ever’ list.
An interesting insight: while the Classics genre is underrepresented on goodreads.com, it is highly represented in the Top 400 list, suggesting that readers are more likely to upvote the Classics.
On the other hand, the Fiction, Fantasy and Young Adult genres represent a significant proportion of the books on goodreads.com. Their likelihood of being in the Top 400 is higher than that of an underrepresented genre like Classics.
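One way to quantify this over- or under-representation is to compare each genre's share in the Top 400 with its share in the Control list; the sketch below uses the same hypothetical file and column names as above:

```python
import pandas as pd

top400 = pd.read_csv("top400_genres.csv")     # hypothetical file names
control = pd.read_csv("control_genres.csv")

top_share = top400["genre"].value_counts(normalize=True)
control_share = control["genre"].value_counts(normalize=True)

# A ratio above 1 means the genre is over-represented in the Top 400
# relative to Goodreads as a whole (e.g. Classics); below 1, the opposite.
representation = (top_share / control_share).sort_values(ascending=False)
print(representation.head(10))
```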
Is there an optimal number of pages?
The short answer is no. As illustrated below, there is a similar trend between the Control and Top 400 lists. The number of pages is not a predictor of success.
2- Publishing
Does the year of publication matter?
For this part, I looked at the distribution of books in the Top 400 per year, and investigated a possible correlation with:
- The average number of pages published per year
- The total number of books sold per year
Total Count vs. Average Number of Pages (by Year of Publication)
A significant proportion of the books in the Top 400 list were published between 2003 and 2006. Although a few peaks in the average number of pages published coincide with peaks in the number of books in the Top 400, there is no correlation between the two variables.
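Here is a quick sketch of that correlation check, assuming hypothetical 'year', 'title' and 'num_pages' columns on the scraped Top 400 data:

```python
import pandas as pd
from scipy.stats import pearsonr

top400 = pd.read_csv("top400_books.csv")  # hypothetical file name
per_year = top400.groupby("year").agg(book_count=("title", "size"),
                                      avg_pages=("num_pages", "mean"))

r, p_value = pearsonr(per_year["book_count"], per_year["avg_pages"])
print(f"Pearson's r = {r:.2f} (p = {p_value:.3f})")
```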
It is interesting to see that the trend for the Control list (graph below) differs from the trend for the Top 400 list. Though a lot of books were published after 2010, they are underrepresented in the Top 400 list.
For those who are curious about the details, below is a list of the Top 10 books in 2003, 2004 and 2006.
Number of Books Sold vs. Number of Books in Top 400
Is there a correlation between the distribution of books in the Top 400 per year and the number of books sold that year? It does not seem to be the case as illustrated below.
However, the dataset is biased (as explained below), so the results are to be taken with a pinch of salt.
I used a dataset from the US Census Bureau that provides the number of books sold per year. The dataset only covers books sold in US bookstores from 1992 to 2014. In other words:
- The sample size is limited (23 years).
- Online sales are not taken into account, which means a significant number of book sales from this century are missing from the count. Unfortunately, with no official record of online book sales, the actual number could not be verified.
- It does not report the number of unique books sold, meaning the 1,000 books sold in a given year could all be from the Harry Potter series.
3- Predicting
Is the number of reviews a predictor of success?
The number of reviews is an indicator of rank in the Top 400 list. We see a clear trend: the Top 100 books receive more reviews than the books ranked 100-200, which themselves receive more reviews than those ranked 200-300, and so on.
Should we conclude that the more reviews the better? Or is the relationship between the number of reviews and the score an inverted bell curve where the best and worst books get more reviews? After all, an average book does not spark strong emotions (whether positive or negative); only the best and worst ones do.
The Control list does not confirm the assumption of an inverted bell curve. On the contrary, the five-star books get fewer reviews than the four-star books, while the one- and two-star books receive barely any reviews.
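The sketch below shows how the two comparisons can be made, again with assumed file and column names ('rank', 'num_reviews', 'avg_rating'):

```python
import pandas as pd

top400 = pd.read_csv("top400_books.csv")    # hypothetical file names
control = pd.read_csv("control_books.csv")

# Median number of reviews per ranking category in the Top 400 list
rank_bucket = pd.cut(top400["rank"], bins=range(0, 401, 100),
                     labels=["Top 100", "Top 100-200", "Top 200-300", "Top 300-400"])
print(top400.groupby(rank_bucket)["num_reviews"].median())

# Median number of reviews per (rounded) star rating in the Control list
rating_bucket = control["avg_rating"].round()
print(control.groupby(rating_bucket)["num_reviews"].median())
```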
Which words, at which frequency, can predict success?
I started my sentiment analysis on the Reviews column using the AFINN dictionary.
The output of the AFINN method is a float (the AFINN score): a value above zero indicates positive sentiment, while a value below zero indicates negative sentiment. Word scores range from -5 (most negative) to +5 (most positive).
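In Python, one way to compute this is with the `afinn` package (one implementation of the lexicon). The file and column names below are assumptions about how the scraped reviews were stored, and averaging over the word count is one way to keep the score roughly on the per-word -5 to +5 scale:

```python
import pandas as pd
from afinn import Afinn

afinn = Afinn()
top400 = pd.read_csv("top400_books.csv")  # hypothetical file name

def review_sentiment(text):
    """Average AFINN score per word of a review."""
    words = str(text).split()
    return afinn.score(str(text)) / max(len(words), 1)

top400["afinn_score"] = top400["reviews"].apply(review_sentiment)
```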
Testing the AFINN method on the Control list first
A high level of granularity was needed to be able to identify which words could predict a book's ranking in the Top 400 list. Before applying the AFINN method to the Top 400 list, I checked its reliability on a broader list, i.e. the Control list.
The AFINN Lexicon seemed to be a good predictor of success (as illustrated in the graph above). The next step was to see if it could help differentiate a book in the Top 100 list from a book in the Top 300 list…
Unfortunately, as illustrated below, the AFINN Lexicon was not granular enough to predict ranking within the ‘Best Books Ever’ list (where, by definition, all the books are successful).
The scatterplot below offers the same insights as the box plot except that I used a continuous variable (the score) instead of a discrete one (the ranking category), and differentiated the ranking categories by color.
In the scatterplot above, the scores do not drop below -1.5, which makes sense: we don't expect the 'Best Ever' books to have highly negative sentiment scores (i.e. -3, -4 or -5). However, there’s still a small amount of prediction error; some Top 100 books have a negative sentiment score.
My sentiment analysis needed some fine-tuning:
- Instead of looking at the overall score for the reviews, I analyzed each word within the review to understand how its AFINN score and frequency could predict the ranking of a book.
- I also looked at another library, the Vader lexicon, which offers more granularity.
Looking at individual positive and negative words and how they impact a book's ranking
There was no correlation between a word's frequency and the average goodreads score as illustrated in the graph below.
It is interesting to see that some words with a low AFINN score appear quite frequently in the reviews of successful books. This is because the AFINN lexicon is built on unigram features, where syntax and even word order are ignored, meaning that “like” in the sentence “I don’t like” will be interpreted positively.
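A rough sketch of this word-level view, using the `afinn` package and the same hypothetical file and column names ('reviews', 'avg_rating') as above:

```python
from collections import Counter

import pandas as pd
from afinn import Afinn

afinn = Afinn()
top400 = pd.read_csv("top400_books.csv")  # hypothetical file name

word_counts = Counter()
word_ratings = {}  # word -> list of Goodreads scores of the books it appears in
for _, row in top400.iterrows():
    for word in set(str(row["reviews"]).lower().split()):
        if afinn.score(word) != 0:  # keep only words the AFINN lexicon knows
            word_counts[word] += 1
            word_ratings.setdefault(word, []).append(row["avg_rating"])

# For each scored word: how often it appears vs. the average score of those books
summary = pd.DataFrame({
    "frequency": {w: word_counts[w] for w in word_ratings},
    "avg_goodreads_score": {w: sum(r) / len(r) for w, r in word_ratings.items()},
})
```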
Looking at another library - Vader lexicon - to help better predict the score
I was not very successful with the AFINN lexicon and decided to look at the Vader lexicon. Its algorithm seemed to be more granular as it outputs four classes of sentiment:
- neg: Negative
- neu: Neutral
- pos: Positive
- compound: Compound (i.e. aggregated score computed by summing the valence scores of each word in the reviews, adjusted according to the Vader rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive))
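A minimal sketch using NLTK's implementation of the Vader lexicon; the file and 'reviews' column names are, again, assumptions about the scraped data:

```python
import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

top400 = pd.read_csv("top400_books.csv")  # hypothetical file name
scores = top400["reviews"].apply(lambda text: sia.polarity_scores(str(text)))

# polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys
for key in ["neg", "neu", "pos", "compound"]:
    top400[f"vader_{key}"] = [s[key] for s in scores]
```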
I used a pairplot to analyze the correlation between:
- the Vader positive score (pos) and the goodreads score
- the Vader negative score (neg) and the goodreads score
- the Vader compound score (compound) and the goodreads score
The graphs with a Pearson's r of 0.5 or higher, signaling a correlation, are in green. Pearson's r can range from -1 to 1, with:
- an r of -1 indicating a perfect negative linear relationship between variables,
- an r of 0 indicating no linear relationship between variables, and
- an r of 1 indicating a perfect positive linear relationship between variables.
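The numbers behind those colors come from a plain Pearson correlation matrix; a short sketch, assuming the DataFrame built in the Vader code above plus an 'avg_rating' column holding the Goodreads score:

```python
# Correlation of each Vader score with the Goodreads score (assumed column names)
columns = ["vader_pos", "vader_neg", "vader_neu", "vader_compound", "avg_rating"]
correlations = top400[columns].corr(method="pearson")
print(correlations["avg_rating"])
```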
There is a negative correlation between the Vader neutral and positive scores, which makes sense: the negative, neutral and positive scores are proportions of the same text that sum to 1, so a higher positive share necessarily means a lower neutral share. However, there is no correlation between the Vader positive and negative scores, which is surprising.
Lastly, there is no correlation between the variables I was interested in (i.e the Vader positive, negative, compound scores vs. the goodreads score).
Would Have, Should Have, Could Have
My analysis suggested that a book in the Classics genre is more likely to make it to the Top 400 list.
There is no clear pattern indicating a preferred year of publication or number of pages, although the Top 400 books average around 300 pages.
The number of reviews is a good predictor of success: the more, the better. However, it does not help differentiate the four-star from the five-star books. That being said, the five-star books probably sell better than four-star ones, which was not part of my analysis.
Finally, the sentiment analysis was unsuccessful at predicting the success of a book within the Top 400 list. The AFINN and Vader methods helped predict the likelihood of a high score on goodreads.com. However, once a book makes it to the Top 400 list, a high level of granularity is required to differentiate it from other successful books and predict its rank, and neither method was reliable at that level.
Below is a quick list of the analyses I could have, would have and should have done if given more time and data:
- More data about users writing reviews: I wish the reviews included demographic data on the people writing them, to answer questions such as: are people from a specific city, a particular age group or sex more ruthless in their critique?
- Working with unigrams vs. n-grams: The sentiment analysis is built on unigram features. This means the entire text is split into single words, taking words out of context and thus reducing the accuracy with which the sentiment of reviews is detected. The longer the n-gram (the higher the n), the more context you have to work with. I wish I had known about lexicons based on n-grams or the Syuzhet package in R, which leads me to the next point.
- Using R: Overall, I feel that R would have been a better language for the sentiment analysis part. In fact, I started coding in R and translated my code into Python for this part.
- Limited number of reviews per book: I only grabbed the first 30 reviews of each book, which restricts the sentiment analysis and may introduce some bias (although the first reviews are displayed randomly rather than sorted by recency).
- More insights: I wish my analysis had been more fruitful, above all with the sentiment analysis part.