Data Analysis on Book Length and Its Popularity

Benjamin Rosen

Posted on Feb 20, 2019

The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Goodreads is an Amazon-owned review and discussion website for books. I scraped data on books and their authors, starting from the result page of Listopia, which is essentially a home directory for the site's most popular lists. Each list has a theme or topic, and the books are ranked within each list. Rather than hand-selecting lists, I scraped the most popular ones to avoid selection bias. Goodreads has more than 7,000 book lists; I used the Scrapy framework in Python to scrape about 300 of them, which yielded 22,000 unique books in my dataset.

How can book popularity be measured?

The popularity of a book can be measured in a few different ways. I focused on two dependent variables: a book’s average rating and a Goodreads index called Score. Ultimately, I determined that Score better represents popularity because this single statistic determines where a book is placed in a list's rankings, whereas a book’s average rating does not change drastically as you move down a list. Most books have a score below 2,000, which is relatively low. This makes intuitive sense because a ranking index should ideally give high values to only a small proportion of the books.

Data Variables Scraped

The web scraper collected data from three pages: the result pages, the book pages, and the author pages.

Two variables were collected from the result pages: the book's "score", which is based on user activity, and the URL of the book page.

Ten variables were collected from the book page: book title, author name, author page url, unique book identification, number of ratings, number of reviews, book average rating, number of pages, genres that the book is listed in, and the date it was published. The remaining seven variables were collected from the author page: unique author identification, country of birth, most frequent genres, average rating for the author, number of author ratings, number of author reviews, and gender.

Each variable, or "item" as Scrapy names it, is saved into a Python dictionary, which makes it relatively easy to export the results to a CSV file.
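To illustrate why dictionary-shaped items export easily, here is a minimal sketch that uses only Python's standard csv module rather than Scrapy's built-in feed exports. The field names and sample values are hypothetical and are not taken from the actual scraper.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
FIELDS = ["title", "author", "score", "avg_rating", "num_pages"]


def items_to_csv(items, fields=FIELDS):
    """Serialize a list of dict-shaped items to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        extrasaction="ignore",  # skip keys that are not in `fields`
        lineterminator="\n",
    )
    writer.writeheader()
    for item in items:
        # Keys missing from an item are written as empty cells.
        writer.writerow(item)
    return buffer.getvalue()


# Made-up sample items; the second one has no page count.
items = [
    {"title": "Book A", "author": "Author X", "score": 1500,
     "avg_rating": 4.1, "num_pages": 320},
    {"title": "Book B", "author": "Author Y", "score": 240,
     "avg_rating": 3.8},
]

csv_text = items_to_csv(items)
print(csv_text)
```

Because every item shares the same keys, each dictionary maps directly onto one row, and a field that could not be scraped simply becomes an empty cell.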

Fiction vs. Nonfiction

To examine the differences between genres, I created a new data frame that kept only books with a genre specified as ‘Fiction’ or ‘Nonfiction’; only 3,000 books remained after filtering. As the graph below displays, the score of fiction books tends to increase as page length increases. However, as the page count of nonfiction books increases, the score tends to stay the same, on average.
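The filtering and comparison described above can be sketched in plain Python as follows. This is not the original analysis code: the field names are hypothetical, the numbers are toy values invented to mirror the direction of the trend, and dropping books tagged with both genres is an assumption made here.

```python
def label_genre(genres):
    """Return 'Fiction' or 'Nonfiction' when exactly one is listed, else None.

    Dropping books tagged with both labels is an assumption of this sketch.
    """
    has_fiction = "Fiction" in genres
    has_nonfiction = "Nonfiction" in genres
    if has_fiction == has_nonfiction:
        return None
    return "Fiction" if has_fiction else "Nonfiction"


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (xs must not all be equal)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_by_genre(books):
    """Group books by genre label and fit score against page count."""
    groups = {}
    for book in books:
        label = label_genre(book["genres"])
        if label is None or book.get("num_pages") is None:
            continue  # not Fiction/Nonfiction, or page count missing
        groups.setdefault(label, []).append((book["num_pages"], book["score"]))
    return {label: ols_slope(*zip(*pairs)) for label, pairs in groups.items()}


# Toy records with invented values, not the scraped data.
books = [
    {"genres": ["Fiction", "Fantasy"], "num_pages": 200, "score": 100},
    {"genres": ["Fiction"], "num_pages": 400, "score": 300},
    {"genres": ["Fiction"], "num_pages": 600, "score": 500},
    {"genres": ["Nonfiction", "History"], "num_pages": 200, "score": 150},
    {"genres": ["Nonfiction"], "num_pages": 400, "score": 150},
    {"genres": ["Nonfiction"], "num_pages": 600, "score": 150},
    {"genres": ["Poetry"], "num_pages": 100, "score": 900},
]

slopes = slope_by_genre(books)
print(slopes)
```

A positive slope for one genre and a slope near zero for the other would correspond to the pattern described above, where score rises with length for fiction but stays flat for nonfiction.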

Difficulties Encountered While Scraping Data

One expected but still significant difficulty was the length of time the scraper took to complete. Initially, I was unsure how long the program would take to run, so the data set was compiled in five separate tasks (each with different start URLs). In total, the program took about fifteen hours to scrape 22,000 books.

Initially, I was surprised at how long it took, especially because I used Scrapy instead of Selenium, which is notorious for being slow. I suspect the run time was long because I collected more features than this particular study needed. Although a dataset with as many relevant variables as possible is more comprehensive, collecting them added computational cost; scraping only the variables of interest, about half of them, would have been more efficient.

The remaining difficulties were related to data cleaning, specifically using logic and regular expressions to extract the desired variables. For example, to determine an author's gender, masculine and feminine pronouns were counted, and the gender was determined from those counts.
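A pronoun-counting rule of this kind might look like the sketch below. The pronoun lists, the majority rule, and the "unknown" fallback for ties are assumptions made for illustration, not details confirmed by the original scraper.

```python
import re

# Whole-word, case-insensitive pronoun patterns (assumed lists).
MASCULINE = re.compile(r"\b(?:he|him|his|himself)\b", re.IGNORECASE)
FEMININE = re.compile(r"\b(?:she|her|hers|herself)\b", re.IGNORECASE)


def infer_gender(bio_text):
    """Guess gender from whichever set of pronouns appears more often."""
    masculine = len(MASCULINE.findall(bio_text))
    feminine = len(FEMININE.findall(bio_text))
    if masculine > feminine:
        return "male"
    if feminine > masculine:
        return "female"
    return "unknown"  # tie, or no gendered pronouns at all


# Made-up biography snippets.
print(infer_gender("She published her first novel in 1990. Her editor praised it."))
print(infer_gender("He wrote his memoir after she retired."))
print(infer_gender("An award-winning author of twelve books."))
```

The word-boundary anchors keep words such as "the" or "there" from being counted as pronouns. A heuristic like this can still mislabel authors whose biographies mention other people, so the result is best treated as approximate.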

Additionally, many variables were not located in the same place on each page. For instance, when scraping the publishing date, the wrong date would be extracted if the book had been updated or a new version released. To handle this, an exception rule, built with conditional statements and regular expressions, first checks for an original date in case the book was revised.
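The exception rule for publication dates can be sketched as follows. The layout of the details string, including the "(first published ...)" suffix, is an assumption about how the page presented the date, and the sample strings are made up; the original regular expressions are not shown in this post.

```python
import re

# Assumed layout: "Published <date> by <publisher> (first published <date>)"
FIRST_PUBLISHED = re.compile(r"\(first published ([^)]+)\)", re.IGNORECASE)
PUBLISHED = re.compile(r"Published\s+(.+?)(?:\s+by\b|\s*\(|$)", re.IGNORECASE)


def extract_publication_date(details):
    """Return the original publication date text if present, else the edition date."""
    original = FIRST_PUBLISHED.search(details)
    if original:
        # A revised or reissued book: prefer the original date.
        return original.group(1).strip()
    edition = PUBLISHED.search(details)
    if edition:
        return edition.group(1).strip()
    return None  # no recognizable date in the string


# Made-up examples of the assumed layout.
print(extract_publication_date(
    "Published March 1st 2005 by Penguin (first published 1813)"))
print(extract_publication_date("Published June 2nd 2011 by Example Press"))
print(extract_publication_date("Publication details unavailable"))
```

Checking for the original date first is what prevents a reissue date from overwriting the year the book actually appeared. The function returns the raw date text, so parsing it into a date object would be a separate step.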

Conclusion

The findings of this analysis can give more insight into the market for literature for almost all parties in the supply chain. Young writers, for example, can use this analysis as guidance on the approximate ideal length for a book in their intended genre. For instance, long books seem more likely to be rewarded with higher scores if they are fiction.

Similarly, publishers can use this information to guide writers in order to maximize a book’s score and thus its revenue. That said, this rests on a basic assumption that was not proven in this study and that offers an opportunity for further research: that higher scores are correlated with increased revenue. This study only provided evidence that a book is more likely to have a higher ranking on Goodreads, although it is likely that a higher ranking is correlated with increased revenue from sales.

Lastly, consumers can benefit as well by steering clear of lengthy nonfiction books, unless the book happens to cover their absolute favorite topic.

Python code for this project can be found here.

About Author

Benjamin Rosen

Ben recently graduated from the NYC Data Science Academy to achieve his goal of becoming a junior data scientist. He has a passion for predictive modeling and using analytics to enhance decision-making. Ben earned his B.A. in economics...
