Does Book Length Influence its Popularity? Evidence From Goodreads

Benjamin Rosen
Posted on Feb 20, 2019

Goodreads is an Amazon-owned review and discussion website for books. I scraped data on books and their authors, starting from Listopia, the directory page for the site's most popular book lists. Each list has a theme or topic, and books are ranked within each list. Rather than hand-selecting lists, I scraped the most popular ones to avoid selection bias. Goodreads has more than 7,000 book lists; using the Scrapy framework in Python, I scraped about 300 of them, which yielded 22,000 unique books in my dataset.

How can book popularity be measured?

The popularity of a book can be measured in a few different ways. I focused on two candidate dependent variables: a book’s average rating and a Goodreads index called Score. Ultimately, I decided that Score better represents popularity because this single statistic determines where a book is placed in a list's rankings, whereas a book’s average rating does not change drastically as you move up or down a list. Most books have a score below 2,000, which is relatively low. This makes intuitive sense because a ranking index should ideally give high values to only a small proportion of the books.

Variables Scraped

The web scraper collected data from three pages: the result pages, the book pages, and the author pages.

Two variables were collected from the result pages: the score each book received based on user activity and the URL of the book page. Ten variables were collected from the book page: book title, author name, author page URL, unique book identification, number of ratings, number of reviews, average rating, number of pages, genres that the book is listed in, and publication date. The remaining seven variables were collected from the author page: unique author identification, country of birth, most frequent genres, average rating for the author, number of author ratings, number of author reviews, and gender.

Each scraped record, or "item" as Scrapy calls it, is stored in a dictionary-like Python object, which makes it relatively easy to export the results to a CSV file.
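
As a rough illustration of why dictionary-like items are easy to export, here is a minimal sketch that uses only Python's standard library. The field names and the two example records are hypothetical placeholders, not values from my dataset, and this is not necessarily how the project itself wrote its files.

```python
import csv
import io

# Hypothetical field names, chosen for illustration only.
FIELDNAMES = ["title", "author", "score", "num_pages"]


def write_items_to_csv(items, handle, fieldnames=FIELDNAMES):
    """Write an iterable of dict-like items to an open text handle as CSV."""
    # extrasaction="ignore" skips any keys that are not listed in fieldnames.
    writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        # dict(item) accepts plain dicts and other dict-like mappings.
        writer.writerow(dict(item))


# Made-up records standing in for scraped items.
example_items = [
    {"title": "Example Book A", "author": "Author One", "score": 1500, "num_pages": 320},
    {"title": "Example Book B", "author": "Author Two", "score": 900, "num_pages": 210},
]

# An in-memory buffer keeps the example self-contained; a real run would
# pass a file opened with open("books.csv", "w", newline="").
buffer = io.StringIO()
write_items_to_csv(example_items, buffer)
csv_text = buffer.getvalue()
```

Scrapy can also write items to CSV without custom code through its feed exports, for example by passing `-o books.csv` to `scrapy crawl`.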

Fiction vs. Nonfiction

To explore the differences between genres, I created a new data frame that kept only books with a genre listed as ‘Fiction’ or ‘Nonfiction’; only 3,000 books remained after filtering. As the graph shows, the score of fiction books tends to increase with page count. For nonfiction books, however, the score stays roughly the same, on average, as page count increases.
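
The filtering and comparison described here can be sketched in plain Python. Everything in this example is an assumption made for illustration: the record layout, the rule for books tagged with both genres (dropped here), the use of a least-squares slope as the summary statistic, and above all the seven toy records, which are invented to mimic the pattern and are not taken from the scraped data.

```python
def label_genre(genres):
    """Return 'Fiction' or 'Nonfiction', or None if neither or both are listed."""
    has_fiction = "Fiction" in genres
    has_nonfiction = "Nonfiction" in genres
    if has_fiction == has_nonfiction:
        # Neither label, or both at once: treated as ambiguous (an assumption).
        return None
    return "Fiction" if has_fiction else "Nonfiction"


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (change in y per unit of x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy records, invented for this example only.
books = [
    {"genres": ["Fiction", "Fantasy"], "pages": 200, "score": 400},
    {"genres": ["Fiction"], "pages": 400, "score": 800},
    {"genres": ["Fiction", "Classics"], "pages": 600, "score": 1200},
    {"genres": ["Nonfiction", "History"], "pages": 200, "score": 500},
    {"genres": ["Nonfiction"], "pages": 400, "score": 500},
    {"genres": ["Nonfiction", "Science"], "pages": 600, "score": 500},
    {"genres": ["Poetry"], "pages": 100, "score": 300},
]

# Group (pages, score) pairs by genre label, skipping unlabeled books.
grouped = {}
for book in books:
    label = label_genre(book["genres"])
    if label is not None:
        grouped.setdefault(label, []).append((book["pages"], book["score"]))

# One slope per genre: score points gained per additional page.
slopes = {
    label: ols_slope([p for p, _ in pairs], [s for _, s in pairs])
    for label, pairs in grouped.items()
}
```

A positive slope for one group and a slope near zero for the other is the kind of contrast the graph conveys; a slope alone says nothing about how noisy the relationship is.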

Difficulties Encountered While Scraping

One difficulty, expected but nevertheless significant, was how long the scraper took to finish. I was initially unsure how long the program would run, so I compiled the dataset in five separate tasks, each with different start URLs. In total, the program took about fifteen hours to scrape 22,000 books. This surprised me, especially because I used Scrapy rather than Selenium, which is notorious for being slow. I suspect the run time was long because I collected more features than this particular study needed. Although a dataset with as many relevant variables as possible is more comprehensive, the extra variables added to the run time; collecting only the variables of interest, about half of them, would have been more efficient.

The remaining difficulties were related to data cleaning, specifically using logic and regular expressions to extract the desired variables. For example, to determine an author's gender, I counted the masculine and feminine pronouns on the author page and assigned the gender whose pronouns appeared more often. Additionally, many variables were not located in the same place on every page. For instance, when scraping the publication date, the wrong date would be extracted if the book had been updated or released in a new edition. To handle this, I added a rule, built from conditional statements and regular expressions, that first checks for an original publication date when a book has been revised.
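
Both cleaning rules can be sketched with the standard `re` module. This is a simplified illustration rather than the project's actual code: the pronoun lists, the tie-breaking rule, the function names, and the assumed layout of the publication text (an edition date optionally followed by "(first published ...)") are all my assumptions here.

```python
import re
from collections import Counter

# Assumed pronoun lists; a real implementation might use different ones.
MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}


def infer_gender(bio_text):
    """Guess gender from whichever set of pronouns appears more often."""
    # Split into lowercase words so that 'the' is never counted as 'he'.
    counts = Counter(re.findall(r"[a-z]+", bio_text.lower()))
    masculine = sum(counts[word] for word in MASCULINE)
    feminine = sum(counts[word] for word in FEMININE)
    if masculine > feminine:
        return "male"
    if feminine > masculine:
        return "female"
    return "unknown"  # no pronouns found, or a tie


# Prefer the year inside a "first published ..." note when one exists.
FIRST_PUBLISHED = re.compile(r"first published\b[^)]*?(\d{4})", re.IGNORECASE)
ANY_YEAR = re.compile(r"\b(\d{4})\b")


def extract_publication_year(details_text):
    """Return the original publication year, falling back to the first year seen."""
    match = FIRST_PUBLISHED.search(details_text) or ANY_YEAR.search(details_text)
    return int(match.group(1)) if match else None


# Hypothetical strings, written for this example.
revised = "Published May 1st 2003 by Example Press (first published 1997)"
original = "Published May 1st 2003 by Example Press"
bio = "She published her first novel in 1990. Her editor said he loved it."

year_revised = extract_publication_year(revised)
year_original = extract_publication_year(original)
gender = infer_gender(bio)
```

Pronoun counting is a heuristic: it can mislabel authors whose biographies mention other people more often than the author, and it returns "unknown" whenever the counts tie.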

Conclusion

The findings of this analysis offer insight into the market for literature for almost all parties in the supply chain. Young writers, for example, can use them as guidance on the approximate ideal length for a book in their chosen genre: longer books appear more likely to be rewarded with higher scores if they are fiction. Similarly, publishers can use this information to guide writers in order to maximize a book's score and thus, potentially, its revenue. That said, this rests on a basic assumption that this study did not prove, which leaves an opportunity for further research: that higher scores are correlated with increased revenue. This study only provided evidence that longer fiction books are more likely to rank higher on Goodreads. However, it is likely that a higher ranking is correlated with increased revenue from sales. Lastly, consumers can benefit as well by steering clear of lengthy nonfiction books, unless the book covers their absolute favorite topic.

Python code for this project can be found here.

About Author

Benjamin Rosen

Ben recently graduated from the NYC Data Science Academy to achieve his goal of becoming a junior data scientist. He has a passion for predictive modeling and using analytics to enhance decision-making. Ben earned his B.A. in economics...