Using Data to Analyze The Library of Audible

James Welch

Posted on Jul 14, 2021

LinkedIn | Other Work | GitHub Repository

Humans have been passing down oral stories for generations, and while each of us might only be able to recount a handful, Audible remembers 280,000. Over the past 23 years, Audible has established itself as the largest audiobook company. For my web scraping project, I wanted to learn more about Audible's library and how the company addresses customer needs. In this post, we will use data to analyze that library.

Audiobooks are a global market sized at $2.7 billion in 2019. Projections suggest that the market will grow at 25% a year; compounded over the next decade, that would put it at roughly $25 billion. Since its incorporation in 1997, Audible has become the largest audiobook distributor in the world and also dabbles in publishing audio content.
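As a quick check on the compounding, here is a minimal sketch that uses only the $2.7 billion starting size and the 25% annual rate quoted above. It assumes the growth rate stays constant every year, which is a simplification for illustration and not how a market forecast would be built. The function name `project_market` is mine.

```python
def project_market(start_billion: float, annual_growth: float, years: int) -> float:
    """Compound a starting market size forward at a fixed annual growth rate."""
    return start_billion * (1 + annual_growth) ** years


# Figures quoted in the text: $2.7 billion in 2019, growing 25% a year.
for years in (1, 5, 10):
    size = project_market(2.7, 0.25, years)
    print(f"after {years:>2} years: ${size:.1f} billion")
```

Ten years of 25% growth multiplies the starting value by about 9.3, so the market would land in the tens of billions of dollars.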

Listeners can buy audiobooks à la carte, or subscribe for $7.95 USD / month for podcasts and Audible Originals, or for $14.95 USD / month for credits redeemable for any audiobook. Subscribers listen to over 1 billion hours of content a year, and Audible has built a library of 279,240 titles that shows no signs of slowing.

Over 50,000 audiobooks were added to Audible in 2020, during the global pandemic. Audible is supported by, and is part of, a collection of services and tools that Amazon has built or acquired around the (audio)book space (Fig 1).

Figure 1: A visual summary of Amazon products from the Author's and Reader's perspectives.

Authors with distribution rights can approach Amazon for a variety of services. Amazon offers Kindle Direct Publishing, an on-demand printing service, and the Audiobook Creation Exchange (ACX), which allows Authors to find and contract voice actors to create an audiobook. Amazon also provides distribution channels through the Kindle Store, Amazon Books, Audible, and iTunes, as well as a built-in ad system to promote the title.

On the Reader’s side, Amazon acquired Goodreads, the premier book review and reading list website. Goodreads book pages now link directly to the Amazon Books listing, which in turn links directly to the Audible listing. Consumers who review the title create more organic search potential for the book, driving the flywheel. At this point, searching for most book titles will return an Amazon, Audible, or Goodreads page in the top position.

Scraping Strategy

Like all libraries, Audible has a category system to organize the titles. However, unlike the Dewey Decimal System, titles in Audible’s system can be listed under multiple categories which are made available through an overarching categories page. The first Scrapy spider traverses each individual category page (Fig 2A) collecting the category name, links to sub-categories, and the “See all in…” link.

The sub-categories were passed recursively to the spider while the “See all…” link was passed to a second spider. The second spider takes the search result page (Fig 2B) of each category and parses the title information of each entry page by page for all audiobooks listed in that category (up to the results limit of 1200).
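The two-stage traversal described above can be sketched without any scraping library. This is not the Scrapy code used for the project: it is a library-free illustration of the control flow, and `crawl_categories`, `iter_titles`, the `fetch_*` callables, the dictionary keys, and the `FAKE_SITE` data are all hypothetical stand-ins for real page requests and parsers.

```python
from collections import deque
from typing import Callable, Dict, Iterator, List

RESULTS_LIMIT = 1200  # per-category cap on search results mentioned in the text


def crawl_categories(fetch_category: Callable[[str], Dict], start_url: str) -> List[Dict]:
    """Stage 1: walk the category tree, recording each category's 'See all' link."""
    seen = set()
    queue = deque([start_url])
    records = []
    while queue:
        url = queue.popleft()
        if url in seen:  # a category may be reachable from several parents
            continue
        seen.add(url)
        page = fetch_category(url)
        records.append({"name": page["name"], "see_all": page["see_all"]})
        queue.extend(page["subcategories"])  # sub-categories are fed back in
    return records


def iter_titles(fetch_results: Callable[[str, int], List[Dict]], see_all_url: str,
                limit: int = RESULTS_LIMIT) -> Iterator[Dict]:
    """Stage 2: page through one category's results until empty or the limit is hit."""
    page_number, yielded = 1, 0
    while yielded < limit:
        entries = fetch_results(see_all_url, page_number)
        if not entries:
            return
        for entry in entries:
            if yielded >= limit:
                return
            yield entry
            yielded += 1
        page_number += 1


# Tiny in-memory stand-in for the site, purely for illustration.
FAKE_SITE = {
    "/cat/fiction": {"name": "Fiction", "subcategories": ["/cat/sci-fi"],
                     "see_all": "/search?cat=fiction"},
    "/cat/sci-fi": {"name": "Sci-Fi", "subcategories": [],
                    "see_all": "/search?cat=sci-fi"},
}
categories = crawl_categories(FAKE_SITE.__getitem__, "/cat/fiction")
print([c["name"] for c in categories])
```

Keeping the fetchers as injected callables keeps the traversal logic separate from the network and parsing code, which is roughly the split Scrapy makes between following links and parsing responses.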

Figure 2: Example of pages traversed by the Scrapy spiders. A) Category page. B) Search results page.

The data collected required minor cleaning, mainly removing podcasts and duplicate entries. This resulted in 279,240 unique audiobook titles, which is consistent with the 200,000+ that Audible advertises.
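Because a title can be listed under several categories, the same audiobook is scraped more than once. A minimal sketch of the cleaning step might look like the following. The field names `product_id` and `content_type` are assumptions for illustration, not the columns of the actual dataset.

```python
def clean_titles(rows):
    """Drop podcasts and keep only the first copy of each title."""
    seen = set()
    cleaned = []
    for row in rows:
        if row.get("content_type") == "podcast":  # hypothetical field
            continue
        key = row["product_id"]  # hypothetical unique identifier per title
        if key in seen:  # already collected via another category
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Made-up rows: one duplicate and one podcast.
sample = [
    {"product_id": "A1", "content_type": "audiobook", "category": "Fiction"},
    {"product_id": "A1", "content_type": "audiobook", "category": "Sci-Fi"},
    {"product_id": "P9", "content_type": "podcast", "category": "Comedy"},
    {"product_id": "B2", "content_type": "audiobook", "category": "History"},
]
print([row["product_id"] for row in clean_titles(sample)])
```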

Library Growth

Overall, the growth of Audible’s library is exponential (Fig. 3A), with more than half of the audiobooks added in the past 5 years. Audible seems to have experienced two phases of growth: the first after the iTunes deal in which Audible became the sole audiobook provider (late 2003), and the second after the acquisition by Amazon (2008). The acquisition led to a lower but more consistent growth rate, averaging 24% a year for the past decade (Fig. 3B).

This consistency is a benefit, as it allows a better programming cadence for Audible’s subscriber base; it wouldn’t make sense for all the hits to come out at the same time.

Figure 3: Yearly growth statistics. A) Cumulative Audiobooks Released. B) Yearly Growth Rate 2000 to present.
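The two panels can be derived from a single list of release years. The sketch below assumes the yearly growth rate is defined as the titles added in a year divided by the library size at the end of the previous year. That definition is my assumption, not a documented formula, and the input data is made up.

```python
from collections import Counter
from itertools import accumulate


def yearly_growth(release_years):
    """Return (cumulative titles by year, growth rate by year).

    Growth rate = titles added that year / library size at the end of the
    previous year with releases. Years with no releases are skipped.
    """
    counts = Counter(release_years)
    years = sorted(counts)
    cumulative = list(accumulate(counts[y] for y in years))
    growth = {
        years[i]: (cumulative[i] - cumulative[i - 1]) / cumulative[i - 1]
        for i in range(1, len(years))
    }
    return dict(zip(years, cumulative)), growth


# Made-up release years, one entry per title.
totals, rates = yearly_growth([2018] * 100 + [2019] * 25 + [2020] * 25)
print(totals)
print(rates)
```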

Price per Audiobook

While most listeners pay the monthly subscription fee in exchange for credits, audiobooks are still listed with individual prices. These range from $0 USD to $115.95 USD and often follow common pricing tricks, such as pricing 5¢ below the dollar. The most common prices are $20, $7, $15, $4, and $25 USD, each with the 5¢ rule applied (Fig. 4).

The full Audible subscription (which includes a monthly audiobook credit) is priced at $14.95, and 57% of audiobooks cost more than that. One might expect the differences in price to reflect the length of the book, as longer books take more to produce.

Figure 4: Histogram of Audiobook Prices. Table: Top 5 most common prices.
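The price summary above comes down to three small computations: counting the most frequent prices, checking whether a price ends in 95 cents, and measuring the share of titles priced above the subscription fee. Here is a minimal sketch on made-up prices. The helper names are mine, and prices are compared in whole cents to avoid floating-point surprises.

```python
from collections import Counter


def to_cents(price: float) -> int:
    """Convert a dollar price to whole cents."""
    return round(price * 100)


def follows_five_cent_rule(price: float) -> bool:
    """True when the price sits 5 cents below a whole dollar."""
    return to_cents(price) % 100 == 95


def top_prices(prices, n=5):
    """Most common prices with their counts."""
    return Counter(prices).most_common(n)


def share_above(prices, threshold=14.95):
    """Fraction of titles priced above the threshold."""
    limit = to_cents(threshold)
    return sum(to_cents(p) > limit for p in prices) / len(prices)


# Made-up prices for illustration only.
prices = [19.95, 19.95, 19.95, 6.95, 6.95, 14.95, 24.95]
print(top_prices(prices, n=2))
print(share_above(prices))
```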

Audiobook Length and Price-Length Correlations

Most audiobooks are a reasonable length, but the distribution has a long tail stretching to 143 hours. There is a large group of audiobooks around 5-10 hours in length, with a steep decline from 10-15 hours. There is also a discontinuity around the 3-hour mark (Fig 5A).

Figure 5: Histogram of Audiobook lengths in hours. A) Trimmed for lengths under 40 hours. B) Trimmed for lengths under 6 hours, bin width of 15 minutes.

There are more audiobooks than expected at the 3-hour mark compared to 15 minutes to either side, and there appear to be fewer immediately before 3 hours (Fig 5B). A Hacker News post suggests that there is some difference in royalties for audiobooks longer than 3 hours.
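One way to quantify the bump at the 3-hour mark is to bin lengths into 15-minute buckets and compare the bucket starting at 180 minutes with its two neighbours. This is a sketch under the assumption that lengths are stored in whole minutes. The `spike_ratio` measure is my own illustration, not a statistic from the analysis, and the input lengths are made up.

```python
from collections import Counter


def bin_lengths(minutes, bin_width=15):
    """Count titles per fixed-width bin, keyed by the bin's starting minute."""
    return Counter((m // bin_width) * bin_width for m in minutes)


def spike_ratio(counts, center=180, bin_width=15):
    """Count in the center bin divided by the mean of its two neighbours."""
    neighbours = (counts[center - bin_width] + counts[center + bin_width]) / 2
    if neighbours == 0:
        return None  # undefined when both neighbouring bins are empty
    return counts[center] / neighbours


# Made-up lengths in minutes, clustered around the 3-hour mark.
counts = bin_lengths([170, 179, 180, 181, 194, 195])
print(sorted(counts.items()))
print(spike_ratio(counts))
```

A ratio well above 1 means the bin just past 3 hours holds more titles than its neighbours would suggest.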

While I could not find a difference in royalties, searching revealed an ACX page that describes how Audible might set audiobook prices. There is indeed a difference between 1-3-hour and 3-5-hour content: Audible generally lists the latter at more than double the price. When I bin the data by the ACX suggested values (Fig. 6A), we can see that each of the bins corresponds to one of the most common prices (Fig. 6B).

Figure 6: Distribution of Audiobook Prices. A) Density plot of prices, colored by ACX bins. B) Histogram of audiobook prices.
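Binning by length and looking up the most common price in each bin can be sketched as follows. Only the 1-3-hour and 3-5-hour bins are named in the text. The remaining edges in `BIN_EDGES_HOURS` are my assumption about the ACX tiers and should be checked against the ACX page, as should the choice to put a title of exactly 3 hours in the 3-5-hour bin. The sample data is made up.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

# Assumed bin edges in hours; only the 1-3 h and 3-5 h bins come from the text.
BIN_EDGES_HOURS = [1, 3, 5, 10, 20]
BIN_LABELS = ["<1 h", "1-3 h", "3-5 h", "5-10 h", "10-20 h", ">20 h"]


def length_bin(hours: float) -> str:
    """Label the length bin for a title; a lower edge belongs to the upper bin."""
    return BIN_LABELS[bisect_right(BIN_EDGES_HOURS, hours)]


def modal_price_by_bin(titles):
    """Most common price within each length bin, from (hours, price) pairs."""
    prices = defaultdict(Counter)
    for hours, price in titles:
        prices[length_bin(hours)][price] += 1
    return {label: counter.most_common(1)[0][0] for label, counter in prices.items()}


# Made-up (hours, price) pairs for illustration only.
titles = [(2.5, 6.95), (2.0, 6.95), (2.9, 9.95), (4.0, 14.95)]
print(modal_price_by_bin(titles))
```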

I wanted to examine this price-to-length connection more closely, so below I plotted the density of price against length for audiobooks under $50 USD and shorter than 40 hours (Fig 7). Each of the peaks is associated with a common price observed above ($19.95, $6.95, $14.95, etc.), which in turn is associated with a length distribution.

Differences in price within the same length bin seem to be due to length and the number of reviews (a proxy for popularity). I expect that Audible is using a linear model or rules-based system to set the prices of these audiobooks. While we could make a reasonable estimate of that system, the public only has access to customer reviews, not listens or downloads.

Figure 7: Density plot of Audiobook Length and Price.

Authors as Narrators

A feature that I thought might be a draw for listeners is Authors self-narrating their audiobooks. This is more prevalent among celebrities and politicians, but it is also a cost-saving measure for Authors, as they can avoid paying, or sharing royalties with, narrators (Fig 8).

Figure 8: Example of popular titles Narrated by the Author.

While this is a popular trend for celebrities, more broadly it does not seem to be a growth area for Audible (Fig 9A). Since 1997, the percentage of Authors self-narrating has been falling, as the growth of this trend fails to keep up with the steady 24% growth of the library overall (Fig 9B). This may be explained by the lack of difference between self-narrated titles and those with professional voice actors: self-narrated titles are not priced higher, nor are they better rated or more reviewed.

Figure 9: Author self-narration growth statistics. A) Cumulative audiobooks of Author as Narrator. B) Percentage of new books where the Author acts as Narrator.
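The percentage in panel B can be computed by flagging titles where any author also appears as a narrator and averaging that flag per release year. This is a minimal sketch: the field names `year`, `authors`, and `narrators` are hypothetical, and matching names by lowercased exact string is a simplifying assumption that would miss spelling variants.

```python
from collections import defaultdict


def self_narrated_share(titles):
    """Per release year, the fraction of titles where an author also narrates."""
    totals = defaultdict(int)
    self_narrated = defaultdict(int)
    for title in titles:
        year = title["year"]
        totals[year] += 1
        authors = {name.strip().lower() for name in title["authors"]}
        narrators = {name.strip().lower() for name in title["narrators"]}
        if authors & narrators:  # at least one shared name
            self_narrated[year] += 1
    return {year: self_narrated[year] / totals[year] for year in sorted(totals)}


# Made-up rows for illustration only.
rows = [
    {"year": 2019, "authors": ["A. Writer"], "narrators": ["a. writer "]},
    {"year": 2019, "authors": ["B. Writer"], "narrators": ["C. Voice"]},
    {"year": 2020, "authors": ["D. Writer"], "narrators": ["E. Voice"]},
]
print(self_narrated_share(rows))
```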

Languages: Room to Grow

The real growth engine for Audible might be other languages, as only about 5% of the world speaks English as a first language, yet English is the predominant language of Audible's audiobooks (Fig 10). Plotting non-English releases over time shows some recent growth, but no sign of a “catch-up” push to target other languages (especially considering that “Non-English” is the sum of 43 different languages with non-overlapping consumers). Contrast that with Netflix, which has pushed to dub all of its originals into multiple languages.

Figure 10: Cumulative English and non-English audiobook releases.

We can start to see a multilingual strategy play out in language- or region-specific landing pages for China, Spanish speakers, and India (Fig 11). However, all the books offered through these pages are already included in this analysis, so the pressing issue is still language availability. One interesting area of expansion is Audible Suno, an India-only app filled with free Audible Originals voiced by Indian stars. However, this likely brings in only ad revenue (from ads run against the podcasts), not the subscription revenue Audible normally relies on.

Figure 11: Example of Language or Region-Specific Offerings

Audible is in the best position to capture audiobooks in every emerging language. Voice acting costs are charged at a fixed rate by length, generally $200 per finished hour. Signing exclusive deals with companies or voice actors could further lower the costs. Using internal sales data, Audible might be able to identify titles that could be popular in each language, provided a good translation is made available.

Due to the subscription nature of the audiobook environment, there is a strong first-mover advantage in entering each market. Listeners will pay when they can find audiobook titles that appeal to them, which is a function of promotion and the total number of titles available.

Audible's Story Through My Library

Figure 12: Three Titles in My Audible Library Representing the History of Audible

The first is Ready Player One. It was the first audiobook I listened to on Audible, and it has the most ratings of any book. In previous years, Audible relied on taking existing best sellers and turning them into audiobooks, adding celebrity name recognition by having someone like Wil Wheaton read the book.

The second title is the recently released Project Hail Mary by Andy Weir, author of The Martian. This audiobook was released as an Audible Original, with added production value specifically tailored for an audio experience (complete with background music). This is a sign of Audible making the market by bringing in well-known authors with exclusive content; the same approach can be seen in the Audible Suno app.

Finally, in a sign of Audible’s need to transition to multilingual content, there is The Three-Body Problem. It is an epic series of books, currently being adapted into a television series, yet it is not available in the original Chinese. This book stands as a testament to the need for Audible to begin making markets by translating and publishing titles in non-English languages.

Audible is an incredible platform for Authors and Listeners to connect and enjoy books. It has a strong pricing model that slightly benefits the consumer, especially those inclined towards longer titles. Because of this subscription model, Audible’s long-term success depends on continuing to grow its user base and, with it, its cash flows. This can be achieved through differentiated offerings, like podcasts or exclusive content, or by expanding the pool of subscribers by offering a diverse array of languages. Ultimately, this opportunity is Audible’s to lose.

The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Fun Facts:

Category with Highest Average Star-Rating:

1st: Math
2nd: Christian Living

Category with Fewest Reviews:

Flowers and Plants

Category with Fewest Languages:

Thanksgiving

Category with Fewest Audiobooks:

Latin (I guess it really is a dead language)

About Author

James Welch

I was trained as a synthetic biologist and I am working to become a data scientist too. I have expertise in the genetic engineering of a variety of single-celled organisms, DNA and protein design, and industrial process scaling...
