Data Analysis on Book Length and Its Popularity
Goodreads is an Amazon-owned review and discussion website for books. I scraped data on books and their authors, starting from a result page that serves as a home directory for Goodreads' most popular lists, called Listopia. Each list has a theme or topic, and books are ranked within each list. Rather than hand-selecting lists, I scraped the most popular ones to avoid selection bias. Goodreads hosts more than 7,000 book lists; using the Scrapy framework in Python, I scraped about 300 of them, which amounted to 22,000 unique books in my dataset.
How can book popularity be measured?
The popularity of a book could be measured in a few different ways. I decided to focus on two dependent variables: a book's average rating and Goodreads' own index, called Score. Ultimately, I determined that Score better represents popularity because this single statistic determines where a book is placed in a list's rankings, whereas a book's average rating changes little as you scan down the list hierarchy. Most books have a Score below 2,000, which is a relatively low value. This makes intuitive sense: a ranking index should ideally single out only a small proportion of the books.
Data Variables Scraped
The web scraper collected data from three pages: the result pages, the book pages, and the author pages.
Two variables were collected from the result pages: the "score" a book received based on user activity and the URL of its book page.
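On the result pages the score is rendered as text alongside each ranked book. A small helper like the following can turn that text into an integer; the exact "score: N" format is an assumption about how Goodreads displays the value, not something confirmed by this write-up:

```python
import re

def parse_score(text):
    """Extract an integer score from strings like 'score: 12,345'.

    The 'score: N' phrasing is an assumption about how Goodreads
    renders the value on its Listopia result pages.
    """
    match = re.search(r"score:\s*([\d,]+)", text, flags=re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

For example, `parse_score("score: 12,345")` returns `12345`, and text without a score yields `None`.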
Ten variables were collected from the book page: book title, author name, author page url, unique book identification, number of ratings, number of reviews, book average rating, number of pages, genres that the book is listed in, and the date it was published. The remaining seven variables were collected from the author page: unique author identification, country of birth, most frequent genres, average rating for the author, number of author ratings, number of author reviews, and gender.
Each variable, or "item" as Scrapy names it, is saved into a Python dictionary, which makes it relatively easy to export the results to a CSV file.
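The export step can be done with Scrapy's built-in feed exports or directly with the standard library. Here is a minimal sketch using `csv.DictWriter`; the field names are illustrative, not the project's actual schema:

```python
import csv
import io

# Illustrative items in the shape a Scrapy spider would yield.
items = [
    {"title": "Book A", "score": 12345, "pages": 310},
    {"title": "Book B", "score": 980, "pages": 512},
]

def items_to_csv(items):
    """Write a list of item dictionaries to CSV text, one row per book."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(items[0]))
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()
```

In practice the same result comes for free from `scrapy crawl spider -o books.csv`, which is one reason dictionaries are a convenient item format.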
Fiction vs. Nonfiction
To explore the differences between genres, I created a new data frame containing only books with a genre specified as 'Fiction' or 'Nonfiction'; about 3,000 books remained after filtering. As the graph below shows, the score of fiction books tends to increase as page length increases, whereas the score of nonfiction books tends, on average, to stay flat as page count grows.
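The filtering step can be sketched in plain Python, treating each book as a dictionary with a list of genre tags. The field names are assumptions, and dropping books tagged with both labels is a modeling choice made here to keep the two groups mutually exclusive (the original write-up does not say how such books were handled):

```python
def filter_fiction_nonfiction(books):
    """Keep books tagged 'Fiction' or 'Nonfiction' and label each one.

    Books carrying both tags (or neither) are dropped so the two
    groups stay mutually exclusive. Field names are illustrative.
    """
    kept = []
    for book in books:
        genres = set(book.get("genres", []))
        is_fiction = "Fiction" in genres
        is_nonfiction = "Nonfiction" in genres
        if is_fiction != is_nonfiction:  # exactly one of the two tags
            labeled = dict(book)
            labeled["category"] = "Fiction" if is_fiction else "Nonfiction"
            kept.append(labeled)
    return kept
```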
Difficulties Encountered While Scraping Data
One difficulty was expected but still significant: the length of time the scraper took to complete. Initially, I was unsure how long the program would take to run, so the dataset was compiled in five separate jobs (each with different start URLs). In total, the program took about fifteen hours to scrape 22,000 books.
I was surprised by the run time, especially because I used Scrapy rather than Selenium, which is the framework notorious for being slow. I suspect the cause was that I collected more features than this particular study needed. A dataset with as many relevant variables as possible is more comprehensive, but collecting them added computational cost; scraping only the variables in focus, roughly half of them, would have been more efficient.
The remaining difficulties were related to data cleaning, and specifically to using logic and regular expressions to extract the desired variables. For example, to infer an author's gender, masculine and feminine pronouns on the author page were counted, and the gender was assigned according to which count was larger.
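The pronoun-counting heuristic might look roughly like this; the word lists and the tie-breaking rule are assumptions, not the project's exact implementation:

```python
import re

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def infer_gender(text):
    """Guess an author's gender from pronoun counts in page text.

    Returns 'male', 'female', or 'unknown' when the counts tie
    (including when no gendered pronouns appear at all).
    """
    words = re.findall(r"[a-z']+", text.lower())
    masc = sum(word in MASCULINE for word in words)
    fem = sum(word in FEMININE for word in words)
    if masc > fem:
        return "male"
    if fem > masc:
        return "female"
    return "unknown"
```

A heuristic like this is imperfect (quotes or mentions of other people skew the counts), which is why it belongs in the data-cleaning difficulties.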
Additionally, many variables did not appear in the same location on every page. For instance, when scraping the publication date, the wrong date would be extracted if the book had been updated or a new edition released. In this case, an exception rule built from conditional statements and regular expressions was established to first check for an original date whenever a book had been revised.
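Such an exception rule might be sketched as follows. The "(first published YYYY)" phrasing mirrors how Goodreads often annotates revised editions, but the exact strings and the helper itself are assumptions for illustration:

```python
import re

def extract_pub_year(details_text):
    """Pull a publication year, preferring an original 'first published' date.

    Goodreads book pages often show text like
    'Published June 1st 2005 by ... (first published 1999)'; when the
    'first published' note is present, it takes precedence.
    """
    first = re.search(r"first published.*?(\d{4})", details_text,
                      flags=re.IGNORECASE)
    if first:
        return int(first.group(1))
    published = re.search(r"published.*?(\d{4})", details_text,
                          flags=re.IGNORECASE)
    if published:
        return int(published.group(1))
    return None
```

The ordering of the two searches encodes the exception rule: the original date wins whenever both appear.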
The findings of this analysis can offer insight into the literature market for almost every party in the supply chain. Young writers, for example, can use it as guidance on the approximate ideal length, given the genre they intend to write in; longer books appear more likely to be rewarded with higher scores when they are fiction.
Similarly, publishers can use this information to guide writers toward lengths that maximize a book's score and, presumably, its revenue. That said, this rests on an assumption the study did not test, which at least provides an opportunity for further research: the study did not show that higher scores are correlated with increased revenue, only that longer fiction books are more likely to rank higher on Goodreads. It is plausible, however, that a higher ranking is correlated with increased sales revenue.
Lastly, consumers can benefit as well by steering clear of lengthy nonfiction books, unless the topic happens to be their absolute favorite.
Python code for this project can be found here.