Data Analysis on Book Length and Its Popularity

Posted on Feb 20, 2019

Goodreads is an Amazon-owned review and discussion website for books. I scraped data on books and their authors, starting from Listopia, the directory page for the site's most popular lists. Each list has a theme or topic, and the books are ranked within each list. Rather than hand-selecting lists, I scraped the most popular ones to avoid selection bias. Goodreads has more than 7,000 book lists; I used the Scrapy framework in Python to scrape about 300 of them, which yielded 22,000 unique books in my dataset.

How can book popularity be measured?

The popularity of a book can be measured in a few different ways. I focused on two dependent variables: a book’s average rating and a Goodreads index called Score. Ultimately, I determined that Score better represents popularity, because this single statistic determines where a book is placed in a list's ranking, whereas a book’s average rating does not change drastically from the top of a list to the bottom. Most books have a score below 2,000, which is relatively low. This makes intuitive sense because a ranking index should ideally single out only a small proportion of the books.

Data Variables Scraped

The web scraper collected data from three pages: the result pages, the book pages, and the author pages.

The following variables were collected from the result pages: the "score" each book received based on user activity, and the URL of the book page.

Ten variables were collected from the book page: book title, author name, author page url, unique book identification, number of ratings, number of reviews, book average rating, number of pages, genres that the book is listed in, and the date it was published. The remaining seven variables were collected from the author page: unique author identification, country of birth, most frequent genres, average rating for the author, number of author ratings, number of author reviews, and gender.

Each scraped record, or "item" as Scrapy names it, is saved as a Python dictionary of these variables, which makes it relatively easy to export the results to a CSV file.
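As a rough illustration of why dictionaries make the export easy, the sketch below writes a few dictionary records to CSV with Python's standard csv module. The field names and sample rows are hypothetical stand-ins, not the fields or data of the original spider, and Scrapy's own feed exports can produce a CSV without any code like this.

```python
import csv
import io

# Hypothetical field names; the original spider's item fields are not shown.
FIELDS = ["book_id", "title", "author", "score", "avg_rating", "num_pages"]


def items_to_csv(items, out):
    """Write an iterable of dict-like items to a file-like object as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        # Keys missing from an item are written as empty cells.
        writer.writerow(dict(item))


# Made-up sample rows, for illustration only.
sample_items = [
    {"book_id": "1", "title": "Example Novel", "author": "A. Writer",
     "score": 1500, "avg_rating": 4.1, "num_pages": 320},
    {"book_id": "2", "title": "Example Essay", "author": "B. Writer",
     "score": 300, "avg_rating": 3.9},  # page count missing on purpose
]

buffer = io.StringIO()
items_to_csv(sample_items, buffer)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the example self-contained; passing a file opened with newline="" would write the same rows to disk.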

Fiction vs. Nonfiction

To explore the differences between genres, I created a new data frame that kept only books with a genre specified as ‘Fiction’ or ‘Nonfiction’; only 3,000 books remained after filtering. As the graph of score against page count shows, the score of fiction books tends to increase as the page count increases. For nonfiction books, however, the score tends to stay the same, on average, as the page count increases.
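The filtering and comparison described here could be sketched in plain Python as follows. This is not the original analysis code: the record layout, the 200-page bin width, and the choice to drop books tagged with both genres are assumptions, and the sample numbers are invented purely to exercise the function.

```python
from collections import defaultdict
from statistics import mean


def label_genre(genres):
    """Return 'Fiction' or 'Nonfiction' if exactly one is listed, else None."""
    has_fiction = "Fiction" in genres
    has_nonfiction = "Nonfiction" in genres
    if has_fiction == has_nonfiction:
        # Neither tag or both tags: excluded (an assumption of this sketch).
        return None
    return "Fiction" if has_fiction else "Nonfiction"


def mean_score_by_length(books, bin_width=200):
    """Average score per (genre label, page-count bin start)."""
    groups = defaultdict(list)
    for book in books:
        label = label_genre(book["genres"])
        if label is None or book.get("num_pages") is None:
            continue
        bin_start = (book["num_pages"] // bin_width) * bin_width
        groups[(label, bin_start)].append(book["score"])
    return {key: mean(scores) for key, scores in sorted(groups.items())}


# Invented records, not taken from the scraped dataset.
books = [
    {"genres": ["Fiction", "Fantasy"], "num_pages": 150, "score": 400},
    {"genres": ["Fiction"], "num_pages": 450, "score": 900},
    {"genres": ["Fiction"], "num_pages": 480, "score": 1100},
    {"genres": ["Nonfiction", "History"], "num_pages": 180, "score": 500},
    {"genres": ["Nonfiction"], "num_pages": 520, "score": 480},
    {"genres": ["Poetry"], "num_pages": 90, "score": 200},
]
summary = mean_score_by_length(books)
```

Binning by page count is only one way to summarize the trend; a scatter plot with a fitted line per genre would show the same comparison without choosing a bin width.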


Difficulties Encountered While Scraping Data

One difficulty, expected but nonetheless real, was how long the scraper took to complete the task. Because I was initially unsure how long the program would run, I compiled the data set in five separate tasks, each with different start URLs. In total, the program took about fifteen hours to scrape 22,000 books.

I was surprised by the running time, especially because I used Scrapy instead of Selenium, which is notorious for being slow. I suspect the cause was that I collected more features than this particular study needed. Although a dataset with as many relevant variables as possible is more comprehensive, the extra variables added a computational cost; collecting only the variables in focus, about half of them, would have been more efficient.

The remaining difficulties were related to data cleaning, specifically using logic and regular expressions to extract the desired variables. For example, to determine an author's gender, I counted the masculine and feminine pronouns on the author page and assigned the gender whose pronouns appeared more often.
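A minimal sketch of this kind of pronoun counting is shown below. The pronoun lists, the tie handling, and the "unknown" label are assumptions made for illustration; the original code may differ on all three.

```python
import re

# Assumed pronoun lists; the original project may have used different ones.
MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}


def infer_gender(bio):
    """Guess gender from pronoun counts in a biography; ties give 'unknown'."""
    # Split into lowercase words so that "the" is never counted as "he".
    words = re.findall(r"[a-z]+", bio.lower())
    masculine = sum(word in MASCULINE for word in words)
    feminine = sum(word in FEMININE for word in words)
    if masculine > feminine:
        return "male"
    if feminine > masculine:
        return "female"
    return "unknown"
```

For instance, infer_gender("He wrote his memoir himself.") counts three masculine pronouns and no feminine ones. A biography that mentions other people can still mislead a heuristic like this, so the resulting labels are best treated as approximate.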

Additionally, many variables were not in the same location on every page. For instance, when scraping the publication date, the wrong date would be extracted if the book had been updated or a new edition had been released. To handle this, I added an exception rule, using conditional statements and regular expressions, that first checks for an original publication date when the book has been revised.
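The exception rule might look something like the sketch below. It assumes the details text on a book page reads roughly like "Published <edition date> by <publisher> (first published <original date>)"; that layout, the patterns, and the function name are all illustrative assumptions, not the original scraper's code.

```python
import re

# Assumed page wording; adjust the patterns to the actual markup being scraped.
FIRST_PUBLISHED = re.compile(r"\(first published ([^)]+)\)", re.IGNORECASE)
PUBLISHED = re.compile(r"Published\s+(.+?)(?:\s+by\b|\s*\(|$)", re.IGNORECASE)


def extract_publication_date(details):
    """Prefer the original publication date; fall back to the edition date."""
    text = " ".join(details.split())  # collapse whitespace and line breaks
    original = FIRST_PUBLISHED.search(text)
    if original:
        return original.group(1).strip()
    edition = PUBLISHED.search(text)
    return edition.group(1).strip() if edition else None
```

Checking for the parenthesized original date before the edition date is what keeps a reissued classic from being recorded under its reprint year. The function returns the date as a string, so parsing it into a date object would be a separate step.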


The findings of this analysis offer insight into the market for literature for almost every party in the supply chain. Young writers, for example, can use them as guidance on the approximate ideal length for a book in their chosen genre. It seems, for instance, that long books are more likely to be rewarded with higher scores if they are fiction.

Similarly, publishers can use this information to guide writers in order to maximize a book's ranking and thus its revenue. That said, this rests on an assumption that was not tested in this study and that offers an opportunity for further research: that higher scores are correlated with increased revenue. This study only provides evidence about which books are more likely to rank higher on Goodreads. However, it is likely that a higher ranking is correlated with increased revenue from sales.

Lastly, consumers can benefit as well by steering clear of lengthy nonfiction books, unless the book covers their absolute favorite topic.

Python code for this project can be found here.

About Author

Benjamin Rosen

Ben recently graduated from the NYC Data Science Academy to achieve his goal of becoming a junior data scientist. He has a passion for predictive modeling and using analytics to enhance decision-making. Ben earned his B.A. in economics...