Drama vs. Words: Does the story or the performance determine the rating of an audiobook?

Posted on Feb 20, 2017

Github

In 2016, the Wall Street Journal wrote that audiobooks were the fastest growing format in publishing. In an effort to understand this segment of the publishing industry, I analyzed English-language audiobooks to determine whether the overall rating for an audiobook is determined more by the performance or by the story. The data is from the Audible.com website.

Audible.com is a technology company that produces, sells, and markets audiobooks. Audiobooks from Audible.com are purchased through its online platform and can be streamed or downloaded through its proprietary app. Audible was established in 1995 and has over 150 employees. Amazon.com purchased Audible for $300 million in 2008. Audible has a library of over 200,000 English-language audiobooks. This relatively long history and extensive catalog make Audible a good source of data on audiobooks. The web pages for audiobooks also have a consistent format, which makes them easier to scrape.

Using the Scrapy framework, I scraped data from the Audible website on English-language audiobooks. The scraped data was used to populate a table that included details about each book along with its ratings and number of ratings. The data collected from each page is shown below.

audible_page

The scraper was set up to start on the first page listing links to audiobooks. Each audiobook link was followed, the data was downloaded, and the scraper returned to the listing page. The link to the next page was then followed recursively until the final page was reached. Using this process, a table of ~203,000 audiobooks was generated. Of these, ~42,000 audiobooks were never reviewed, and several hundred more had only Overall reviews. Only books with complete reviews are included in charts of ratings, while the entire dataset is included in charts characterizing the data.
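The crawl order described here (listing page, then each audiobook page, then the next listing page) can be sketched without any framework. This is a minimal illustration and not the original spider: `crawl`, `fetch_listing`, `fetch_book`, and the in-memory `LISTINGS` and `BOOKS` are hypothetical stand-ins for the site. It also uses a loop where the post describes recursion, which visits pages in the same order. In Scrapy, the same pattern is usually written as a `parse` callback that yields `response.follow(...)` requests for each book link and for the next-page link.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# A listing page is modeled as (links to audiobook pages, next listing link or None).
ListingPage = Tuple[List[str], Optional[str]]


def crawl(
    start_url: str,
    fetch_listing: Callable[[str], ListingPage],
    fetch_book: Callable[[str], Dict[str, object]],
) -> Iterator[Dict[str, object]]:
    """Yield one record per audiobook, following next-page links until none remain."""
    url: Optional[str] = start_url
    seen = set()  # guard against a pagination loop
    while url is not None and url not in seen:
        seen.add(url)
        book_links, next_url = fetch_listing(url)
        for link in book_links:
            yield fetch_book(link)
        url = next_url


# Tiny in-memory stand-in for the site (made-up data, for illustration only).
LISTINGS = {
    "page1": (["book/a", "book/b"], "page2"),
    "page2": (["book/c"], None),
}
BOOKS = {
    "book/a": {"title": "A", "overall": 4.5, "n_ratings": 12},
    "book/b": {"title": "B", "overall": None, "n_ratings": 0},
    "book/c": {"title": "C", "overall": 4.0, "n_ratings": 1},
}

records = list(crawl("page1", LISTINGS.__getitem__, BOOKS.__getitem__))
print([r["title"] for r in records])
```

Passing the fetch functions in as arguments keeps the traversal logic separate from the HTTP and parsing details, so the pagination can be checked against fake pages before it is pointed at a real site.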

The distributions of the overall, performance, and story ratings are shown below. The most common rating for an audiobook is 4 in all categories. The second most common is 5 for the performance rating and between 4 and 5 for the overall rating. These distributions indicate that ratings are not evenly spread across the scale: low ratings are rarely given, so the ratings of most rated books are skewed high. This is not surprising, since people do not usually purchase books they do not expect to enjoy.

rating_dist

The number of ratings per book shows a sharp peak at 1. Of the ~150,000 rated books on Audible, most have a single rating. Having so few ratings per book explains why the distribution of ratings (above) has strong peaks at the integer values. Note that the plot of the number of ratings (below) excludes counts over 1500 for clarity.
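As a rough sketch of how such a tally could be produced, the snippet below counts how many books have each number of ratings, skipping unrated books and anything above the cutoff. The function name `ratings_count_histogram`, the `cap` parameter, and the sample counts are illustrative assumptions, not taken from the original analysis.

```python
from collections import Counter
from typing import Iterable


def ratings_count_histogram(n_ratings: Iterable[int], cap: int = 1500) -> Counter:
    """Count books per number of ratings, skipping unrated books and counts above cap."""
    return Counter(n for n in n_ratings if 0 < n <= cap)


# Made-up rating counts, for illustration only.
n_ratings = [1, 1, 1, 2, 5, 0, 1, 3200, 2]
hist = ratings_count_histogram(n_ratings)
print(hist.most_common(1))  # the most frequent number of ratings and how many books have it
```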

num_ratings

Book length is another potentially important factor, since longer books may be rated lower because of their length. I plotted the density of the overall rating vs. the length in minutes (below) with outliers removed (books more than 2 standard deviations from the mean length). The plot shows that longer books are not rated lower than shorter books. The trend is the reverse: longer books have higher ratings than shorter books.
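The outlier rule mentioned here can be sketched as a simple mask over the book lengths. This is a minimal version under stated assumptions: the helper `within_two_sd` is hypothetical, it uses the population standard deviation (the post does not say which variant was used), and the lengths are made up.

```python
from statistics import mean, pstdev
from typing import List


def within_two_sd(lengths: List[float], k: float = 2.0) -> List[bool]:
    """Return a mask that is True where a length lies within k standard deviations of the mean."""
    mu = mean(lengths)
    sigma = pstdev(lengths)  # assumption: population standard deviation
    return [abs(x - mu) <= k * sigma for x in lengths]


# Made-up lengths in minutes, with one extreme value.
lengths = [300, 320, 350, 400, 420, 450, 480, 500, 520, 3000]
mask = within_two_sd(lengths)
kept = [x for x, keep in zip(lengths, mask) if keep]
print(kept)  # the 3000-minute book is dropped, the rest are kept
```

Note that an extreme value inflates both the mean and the standard deviation, so a single pass of this rule can leave some long books in place that a second pass would remove.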

overall_length

We can plot the overall rating against the performance rating and the story rating using density plots to examine the relationship between these variables. As seen below, there is a very clear linear relationship between the overall rating and each of the other two ratings. This can be confirmed by the correlation coefficients: 0.81 for overall-performance and 0.88 for overall-story. One could tentatively conclude that while both are important, the story is more strongly correlated with the overall score and is therefore more important.
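For readers who want to reproduce this kind of comparison, the Pearson correlation coefficient can be computed in a few lines. The sketch below assumes Pearson is the coefficient meant here (the post does not name it), and the five sets of ratings are made up, so the printed values will not match the 0.81 and 0.88 reported for the real data.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up ratings for five books, for illustration only.
overall = [4.0, 4.5, 3.0, 5.0, 3.5]
story = [4.0, 4.5, 3.5, 5.0, 3.0]
performance = [4.5, 4.0, 3.0, 5.0, 4.0]

r_story = pearson(overall, story)
r_perf = pearson(overall, performance)
print(round(r_story, 2), round(r_perf, 2))
```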

over_performover_story

To further explore the validity of this result, I checked the correlation between the performance and story ratings. These ratings also show a strong linear relationship, with a high correlation coefficient of 0.78. Because performance and story are themselves correlated, one cannot make a definite determination from the correlations with the overall rating alone, as one variable could determine the other.

perform_story

To understand which is more important, we need to consider cases where one is held constant while the other varies. One method is to find books with multiple narrators. The stories in these books don’t change, but the narrators are different. I explored 158 books with multiple narrators. For each narration, the overall, story, and performance ratings are almost the same as one another, but they vary with the narrator. In only a few cases did the ratings diverge. Thus we can conclude that the perception of the story is determined by the performance, and the performance rating helps to determine both the overall and the story rating. Writers should carefully consider who narrates their story so that it receives the rating the story deserves.
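The multiple-narrator comparison can be sketched as a grouping step: collect the editions of each title, keep the titles with more than one narrator, and look at how far the story rating moves between narrators. How the original analysis matched editions is not stated, so this sketch assumes matching on the exact title. `Edition`, `story_spread_by_title`, and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Edition(NamedTuple):
    title: str
    narrator: str
    overall: float
    performance: float
    story: float


def story_spread_by_title(editions: List[Edition]) -> Dict[str, float]:
    """For titles with more than one narrator, return max - min of the story rating."""
    by_title: Dict[str, List[Edition]] = defaultdict(list)
    for e in editions:
        by_title[e.title].append(e)
    spread = {}
    for title, group in by_title.items():
        if len({e.narrator for e in group}) > 1:
            stories = [e.story for e in group]
            spread[title] = max(stories) - min(stories)
    return spread


# Made-up rows, for illustration only.
editions = [
    Edition("Book X", "Narrator 1", 4.5, 4.5, 4.5),
    Edition("Book X", "Narrator 2", 3.5, 3.0, 3.5),
    Edition("Book Y", "Narrator 3", 4.0, 4.0, 4.0),
]
spread = story_spread_by_title(editions)
print(spread)
```

A nonzero spread for a title means the story rating changed with the narrator even though the text was the same, which is the pattern the paragraph above describes.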

 

About Author

Glen Ferguson

Glen is an experienced professional who has used data to solve problems in many domain areas. He is currently a data scientist at NYC Data Science Academy, where he has used real-world data to solve problems. Glen worked...