Drama vs. Words: Does the story or the performance determine the rating of an audiobook?

Glen Ferguson
Posted on Feb 20, 2017

Github

In 2016, the Wall Street Journal wrote that audiobooks were the fastest-growing format in publishing. In an effort to understand this segment of the publishing industry, I analyzed English-language audiobooks to determine whether the overall rating for an audiobook is determined more by the performance or by the story. The data is from the Audible.com website.

Audible.com is a technology company that produces, sells, and markets audiobooks. The audiobooks from Audible.com are purchased through its online platform and can be streamed or downloaded through its proprietary app. Audible was established in 1995 and has over 150 employees. Audible was purchased for $300 million by Amazon.com in 2008. Audible has a library of over 200,000 English-language audiobooks. This relatively long history and extensive catalog make Audible a good source of data for audiobooks. The web pages for audiobooks also have a consistent format, which makes them easier to scrape.

Using the Scrapy framework, I scraped data from the Audible website on English-language audiobooks. The scraped data was used to populate a table that included data about each book along with its ratings and number of ratings. The data collected from the page is shown below.

audible_page

The scraper was set up to start on the first page listing links to audiobooks. Each audiobook link was followed, the data was downloaded, and the scraper returned to the listing page. The link to the next page was then followed recursively until the final page was reached. Using this process, a table of ~203,000 audiobooks was generated. Of these, ~42,000 audiobooks were never reviewed, and several hundred more had only Overall reviews. Only books with complete reviews are included in charts with reviews, while the entire dataset is included for charts characterizing the data.
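The crawl described above can be sketched in a few lines. This is only an illustration of the control flow, not the actual spider: it uses plain Python instead of Scrapy, and the URLs, field names, and the in-memory `LISTING_PAGES` and `BOOK_PAGES` dictionaries are hypothetical stand-ins for the real site.

```python
from typing import Dict, Iterator, List, Optional

# Hypothetical in-memory stand-in for the site: each listing page holds
# links to book pages plus an optional link to the next listing page.
LISTING_PAGES: Dict[str, dict] = {
    "/list?page=1": {"books": ["/book/a", "/book/b"], "next": "/list?page=2"},
    "/list?page=2": {"books": ["/book/c"], "next": None},
}
BOOK_PAGES: Dict[str, dict] = {
    "/book/a": {"title": "A", "overall": 4.5, "n_ratings": 12},
    "/book/b": {"title": "B", "overall": None, "n_ratings": 0},  # never rated
    "/book/c": {"title": "C", "overall": 4.0, "n_ratings": 1},
}


def crawl(start_url: str) -> Iterator[dict]:
    """Yield one record per audiobook, walking the listing pages in order."""
    url: Optional[str] = start_url
    while url is not None:
        listing = LISTING_PAGES[url]  # a real scraper would fetch and parse here
        for book_url in listing["books"]:
            record = dict(BOOK_PAGES[book_url])  # follow the book link
            record["url"] = book_url
            yield record
        url = listing["next"]  # follow the "next page" link, if there is one


books: List[dict] = list(crawl("/list?page=1"))
rated = [b for b in books if b["n_ratings"] > 0]
print(len(books), "books scraped,", len(rated), "with at least one rating")
```

The sketch uses a loop rather than literal recursion. In Scrapy the same flow is usually written by yielding a new request for each link from the spider's parse callback, so the framework schedules the next page instead of the code calling itself.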

The distributions of the overall, performance, and story ratings are shown below. The most common rating for an audiobook is 4 in all categories. The second most common is a 5 for the performance rating and between 4 and 5 for the overall rating. This indicates that the ratings are not evenly distributed: the lower ratings are not given often, so the ratings of most rated books are skewed high. Since people do not usually purchase books they do not expect to enjoy, this is not surprising.

rating_dist
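A distribution like the one above can be tabulated by rounding each average rating to a fixed step and counting. The sketch below assumes half-star bins and uses made-up sample values, not the scraped data; `rating_histogram` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, Optional


def rating_histogram(ratings: Iterable[Optional[float]], step: float = 0.5) -> Counter:
    """Count ratings per bin, rounding each to the nearest `step`; None is skipped."""
    counts: Counter = Counter()
    for rating in ratings:
        if rating is None:  # book was never rated in this category
            continue
        counts[round(rating / step) * step] += 1
    return counts


# Made-up sample values for illustration only.
sample = [4.0, 4.2, 4.6, 5.0, 3.9, None, 4.0, 2.0, 4.4]
hist = rating_histogram(sample)
for bin_value in sorted(hist):
    print(f"{bin_value:.1f}: {'#' * hist[bin_value]}")
```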

The number of ratings per book shows a sharp peak at 1. Of the ~150,000 rated books on Audible, most have a single rating. The small number of ratings per book explains why the distribution of ratings (above) has strong peaks at the integer values. Note that the plot of the number of ratings (below) excludes books with more than 1,500 ratings for clarity.

num_ratings
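The link between few ratings and integer peaks can be made concrete. Assuming each individual rating is a whole number of stars from 1 to 5, a book with a single rating can only have an integer average, and the set of possible averages fills in as more ratings arrive. The helper name `possible_averages` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

STARS = range(1, 6)  # assumption: whole-star ratings from 1 to 5


def possible_averages(n_ratings: int) -> list:
    """All distinct averages a book can have after `n_ratings` whole-star ratings."""
    sums = {sum(combo) for combo in combinations_with_replacement(STARS, n_ratings)}
    return sorted(Fraction(total, n_ratings) for total in sums)


for n in (1, 2, 4):
    values = possible_averages(n)
    print(n, "rating(s):", [float(v) for v in values])
```

With one rating there are only five possible values, with two there are nine (steps of 0.5), and with four there are seventeen (steps of 0.25), so a dataset dominated by single-rating books piles up on the integers.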

Book length is a possible critical measure: longer books may be rated lower because of their length. I plotted the density of the overall rating vs. the length in minutes (below) with the outliers removed (lengths more than 2 standard deviations from the mean). From the plot, it is easy to see that longer books are not rated lower than shorter books. The trend is the reverse, with longer books having a higher rating than shorter books.

overall_length
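The outlier rule used for this plot, dropping books whose length is more than 2 standard deviations from the mean, can be sketched as follows. The sample pairs are made up, and the use of the population standard deviation is an assumption; the original analysis may have used the sample version.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Book = Tuple[float, float]  # (length in minutes, overall rating)


def drop_length_outliers(books: List[Book], k: float = 2.0) -> List[Book]:
    """Keep books whose length lies within k standard deviations of the mean length."""
    lengths = [length for length, _ in books]
    mu, sigma = mean(lengths), pstdev(lengths)
    return [(length, rating) for length, rating in books if abs(length - mu) <= k * sigma]


# Made-up sample: five typical lengths and one extreme one.
sample = [(300, 4.0), (420, 4.2), (480, 4.1), (540, 4.4), (600, 4.5), (6000, 4.8)]
kept = drop_length_outliers(sample)
print(len(sample) - len(kept), "outlier(s) removed")
```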

We can plot the overall rating against the performance rating and the story rating using density plots to examine the relationship between these variables. As seen below, there is a very clear linear relationship in both cases: the story and the performance ratings each track the overall rating closely. This is confirmed by the correlation coefficients, 0.81 for overall-performance and 0.88 for overall-story. One could tentatively conclude that while both are important, the story is more strongly correlated with the overall score and is therefore more important.

over_perform

over_story

To further explore the validity of this result, I checked the correlation between the performance and the story ratings. These ratings also show a strong linear relationship, with a high correlation coefficient of 0.78. Because performance and story are themselves correlated, one cannot make a definite determination from their correlations with the overall rating alone, as one variable could determine the other.

perform_story
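The three pairwise coefficients quoted above can be computed with a short function. The sketch assumes they are Pearson correlation coefficients, which the post does not state explicitly, and the sample ratings are made up.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length, at least 2 values each")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up ratings for five books, for illustration only.
overall = [4.0, 4.5, 3.0, 5.0, 2.5]
performance = [4.2, 4.4, 3.5, 4.8, 3.0]
story = [3.9, 4.6, 2.8, 5.0, 2.4]

for name, a, b in [
    ("overall-performance", overall, performance),
    ("overall-story", overall, story),
    ("performance-story", performance, story),
]:
    print(f"{name}: {pearson(a, b):.2f}")
```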

To understand which is more important, we need to consider cases where one is held constant while the other varies. One method is to find books with multiple narrators. The stories in these books don’t change, but the narrators do. I explored 158 books with multiple narrators. For each version of a book, the overall, story, and performance ratings are almost the same as one another, but they vary with the narrator. Only in a few cases did the ratings diverge. Thus we can conclude that the perception of the story is shaped by the performance, and that the performance rating helps to determine both the overall and the story ratings. Writers should carefully consider who narrates their story so that it receives the rating the story deserves.
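The multiple-narrator comparison can be sketched by grouping editions by title and measuring how far each rating moves between narrators. Matching editions by exact title is an assumption, since the post does not say how the 158 books were identified, and the `Edition` type, helper names, and sample rows are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Edition(NamedTuple):
    title: str
    narrator: str
    overall: float
    performance: float
    story: float


def multi_narrator_titles(editions: List[Edition]) -> Dict[str, List[Edition]]:
    """Group editions by title and keep titles read by more than one narrator."""
    by_title: Dict[str, List[Edition]] = defaultdict(list)
    for edition in editions:
        by_title[edition.title].append(edition)
    return {
        title: group
        for title, group in by_title.items()
        if len({e.narrator for e in group}) > 1
    }


def rating_spread(group: List[Edition], field: str) -> float:
    """Difference between the highest and lowest value of one rating in a group."""
    values = [getattr(e, field) for e in group]
    return max(values) - min(values)


# Made-up editions for illustration only.
sample = [
    Edition("Book X", "Narrator 1", 4.6, 4.7, 4.5),
    Edition("Book X", "Narrator 2", 3.9, 3.8, 4.0),
    Edition("Book Y", "Narrator 3", 4.2, 4.3, 4.1),
]
groups = multi_narrator_titles(sample)
for title, group in groups.items():
    spreads = {f: round(rating_spread(group, f), 2) for f in ("overall", "performance", "story")}
    print(title, spreads)
```

If the story rating depended only on the text, its spread across narrators of the same title would be close to zero. A story spread that moves together with the performance spread is the pattern described above.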

 
