The Library of Audible: Web Scraping

Posted on Jul 14, 2021


Humans have been passing down oral stories for generations, and while each of us might only be able to recount a handful, Audible remembers 280,000. Over the past 23 years Audible has established itself as the largest audiobook company, and for my web scraping project I wanted to learn more about Audible’s library and how it addresses customer needs.

Audiobooks are a global market sized at $2.7 billion in 2019. Projections suggest that the market will grow at 25% a year, which compounds to roughly $25 billion within the next decade. Since its incorporation in 1997, Audible has become the largest audiobook distributor in the world and dabbles in publishing audio content. Listeners can buy audiobooks à la carte, or subscribe for $7.95 USD per month for podcasts and Audible Originals or $14.95 USD per month for credits redeemable for any audiobook. Subscribers listen to over 1 billion hours of content a year, and Audible has collected a library of 279,240 titles and shows no signs of slowing: over 50,000 audiobooks were added in 2020, during the global pandemic. Audible is supported by, and is part of, a collection of services and tools Amazon has built or acquired around the (audio)book space (Fig 1).
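As a quick check on a projection like this, the compounding arithmetic is easy to reproduce. The sketch below uses the $2.7 billion starting figure and the 25% annual rate quoted above; the function name and the choice of 5- and 10-year horizons are my own assumptions.

```python
def project_market(start_billion: float, annual_growth: float, years: int) -> float:
    """Compound a starting market size forward at a fixed annual growth rate."""
    return start_billion * (1 + annual_growth) ** years


# Figures quoted in the text: $2.7B in 2019, growing 25% a year.
for horizon in (5, 10):
    size = project_market(2.7, 0.25, horizon)
    print(f"after {horizon} years: ${size:.1f}B")
```

Ten years of 25% growth multiplies the starting figure by a little over nine, so the result stays in the tens of billions.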

Figure 1: A visual summary of Amazon products from the Author and Reader's perspective.

Authors with distribution rights can approach Amazon for a variety of services. Amazon offers Kindle Direct Publishing, an on-demand printing service, and the Audiobook Creation Exchange (ACX), which allows Authors to find and contract voice actors to create an audiobook. Amazon also provides distribution channels through the Kindle Store, Amazon Books, Audible, and iTunes, as well as a built-in ad system to promote the title.

On the Reader’s side, Amazon acquired Goodreads, the premier book review and reading list website. Goodreads book pages now link directly to the Amazon Books listing, which links directly to the Audible listing. Consumers who review the title create more organic search potential for the book, driving the flywheel. At this point, searching most book titles will return an Amazon, Audible, or Goodreads page in the top position.

Scraping Strategy

Like all libraries, Audible has a category system to organize the titles. However, unlike the Dewey Decimal System, titles in Audible’s system can be listed under multiple categories, which are made available through an overarching categories page. The first Scrapy spider traverses each individual category page (Fig 2A), collecting the category name, links to sub-categories, and the “See all in…” link. The sub-categories are passed recursively back to the same spider, while the “See all in…” link is passed to a second spider. The second spider takes the search results page (Fig 2B) of each category and parses the title information of each entry, page by page, for all audiobooks listed in that category (up to the results limit of 1,200).

Figure 2: Example of pages traversed by the Scrapy spiders. A). Category page. B) Search Results page.
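The two-pass traversal can be sketched without Scrapy to show the control flow. This is a library-free illustration, not the project's spider code: the in-memory `CATEGORY_PAGES` dictionary, the URL patterns, and the page size of 50 are all hypothetical stand-ins, and only the 1,200-result limit comes from the description above.

```python
from typing import Iterator, Optional, Set, Tuple

RESULTS_LIMIT = 1200  # results cap mentioned in the text
PAGE_SIZE = 50        # assumed number of titles per results page

# Hypothetical stand-in for category pages. Each entry mimics what the first
# spider extracts: a name, sub-category links, and a "See all in..." link.
CATEGORY_PAGES = {
    "/cat/fiction": {
        "name": "Fiction",
        "subs": ["/cat/fiction/sci-fi"],
        "see_all": "/search?cat=fiction",
    },
    "/cat/fiction/sci-fi": {
        "name": "Sci-Fi",
        "subs": [],
        "see_all": "/search?cat=sci-fi",
    },
}


def crawl_categories(url: str, seen: Optional[Set[str]] = None) -> Iterator[Tuple[str, str]]:
    """First pass: yield (category name, see-all link), recursing into sub-categories."""
    seen = set() if seen is None else seen
    if url in seen:  # a category may be reachable from more than one parent
        return
    seen.add(url)
    page = CATEGORY_PAGES[url]
    yield page["name"], page["see_all"]
    for sub in page["subs"]:
        yield from crawl_categories(sub, seen)


def result_pages(see_all: str, total_results: int) -> Iterator[str]:
    """Second pass: yield one URL per results page, stopping at the results limit."""
    reachable = min(total_results, RESULTS_LIMIT)
    n_pages = -(-reachable // PAGE_SIZE)  # ceiling division
    for page in range(1, n_pages + 1):
        yield f"{see_all}&page={page}"


for name, link in crawl_categories("/cat/fiction"):
    print(name, len(list(result_pages(link, total_results=5000))))
```

In a real Scrapy spider the recursion would be expressed by yielding new requests with a callback rather than by calling a function directly, but the shape of the traversal is the same.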

The data collected required minor cleaning, mainly removing podcasts and duplicate entries (titles listed under more than one category are scraped once per category). This resulted in 279,240 unique audiobook titles, which is consistent with the 200,000+ Audible advertises.
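A minimal sketch of that cleaning step, assuming each scraped row carries a unique title identifier and a content type. The field names (`id`, `type`, `category`) and the sample rows are hypothetical, not the actual scraped schema.

```python
def clean_titles(rows):
    """Drop podcasts and keep only the first occurrence of each title ID."""
    seen = set()
    unique = []
    for row in rows:
        if row["type"] == "podcast":
            continue
        if row["id"] in seen:  # same title scraped again from another category
            continue
        seen.add(row["id"])
        unique.append(row)
    return unique


# Hypothetical rows: one title listed under two categories, plus a podcast.
scraped = [
    {"id": "B001", "title": "Book A", "type": "audiobook", "category": "Fiction"},
    {"id": "B001", "title": "Book A", "type": "audiobook", "category": "Sci-Fi"},
    {"id": "P001", "title": "Show B", "type": "podcast", "category": "Comedy"},
]
print(len(clean_titles(scraped)))
```

Keeping the first occurrence discards the extra category labels; if the per-category analysis matters, the categories could instead be merged into a list on the surviving row.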

Library Growth

Overall, the growth of Audible’s library is exponential (Fig. 3A), with more than half the audiobooks added in the past 5 years. Audible seems to have experienced two phases of growth: the first after the iTunes deal in which Audible became the sole audiobook provider (late 2003), and the second after the acquisition by Amazon (2008), which led to a lower growth rate but a more consistent one, averaging 24% a year for the past decade (Fig. 3B). This consistency is a benefit, as it allows a better programming cadence for Audible’s subscriber base; i.e., it wouldn’t make sense for all the hits to come out at the same time.

Figure 3: Yearly growth statistics. A) Cumulative Audiobooks Released. B) Yearly Growth Rate 2000 to present.
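The cumulative and yearly-growth series behind a figure like this can be derived from release years alone. In this sketch I assume "growth rate" means titles added in a year divided by the library size at the end of the previous year; the post does not spell out its exact definition, and the sample years are made up.

```python
from collections import Counter
from itertools import accumulate


def yearly_growth(release_years):
    """Return {year: (cumulative titles, growth rate vs. the prior total)}."""
    counts = Counter(release_years)
    years = sorted(counts)
    totals = list(accumulate(counts[y] for y in years))
    stats = {}
    for i, year in enumerate(years):
        prev_total = totals[i - 1] if i > 0 else None
        # The first year has no prior library to compare against.
        rate = counts[year] / prev_total if prev_total else None
        stats[year] = (totals[i], rate)
    return stats


# Hypothetical release years for eight titles.
sample_years = [2018, 2018, 2018, 2018, 2019, 2020, 2020, 2020]
for year, (total, rate) in yearly_growth(sample_years).items():
    print(year, total, rate)
```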

Price per Audiobook

While most listeners pay the monthly subscription fee in exchange for credits, audiobooks are still listed with their prices. These prices range from $0 USD to $115.95 USD and follow common pricing tricks such as pricing 5¢ below the dollar. The most common prices are $20, $7, $15, $4, and $25 USD with the 5¢ rule applied, i.e. $19.95, $6.95, $14.95, $3.95, and $24.95 (Fig. 4). The full Audible subscription (which includes a monthly audiobook credit) is priced at $14.95, and 57% of audiobooks are more expensive than the subscription price. One might expect the differences in price to reflect the length of the book, since longer books cost more to produce.

Figure 4: Histogram of Audiobook Prices. Table: Top 5 most common prices.
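Both summary numbers here (the most common prices and the share of titles priced above the subscription) come from simple counting. A minimal sketch, assuming prices are available as a plain list of floats; the sample prices are invented, and only the $14.95 subscription price comes from the text.

```python
from collections import Counter

SUBSCRIPTION_PRICE = 14.95  # monthly credit plan quoted in the text


def price_summary(prices, top_n=5):
    """Return the most common prices and the share priced above the subscription."""
    most_common = Counter(prices).most_common(top_n)
    share_above = sum(p > SUBSCRIPTION_PRICE for p in prices) / len(prices)
    return most_common, share_above


# Hypothetical list prices in USD.
sample_prices = [19.95, 19.95, 6.95, 14.95, 24.95, 19.95, 3.95, 6.95]
common, share = price_summary(sample_prices, top_n=2)
print(common, share)
```

With real data it may be safer to store prices as integer cents, so that equal prices always compare equal when counted.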

Audiobook Length and Price-Length Correlations

Most audiobooks are a reasonable length, but the distribution has a long tail stretching to 143 hours. There is a large group of audiobooks around 5-10 hours in length, with a steep decline from 10-15 hours. There is also a discontinuity around the 3-hour mark (Fig 5A).

Figure 5: Histogram of Audiobook lengths in hours. A) Trimmed for lengths under 40 hours. B) Trimmed for lengths under 6 hours, bin width of 15 minutes.

There are more audiobooks than expected at the 3-hour mark compared to 15 minutes on either side, and there appear to be fewer immediately before 3 hours (Fig 5B). A Hacker News post suggests that there is some difference in royalties for audiobooks longer than 3 hours. While I could not find a difference in royalties, searching revealed an ACX page which describes how Audible might set audiobook prices. There is indeed a difference between 1-3-hour and 3-5-hour content: Audible generally lists the longer content at more than double the price. When I bin the data by the ACX suggested values (Fig. 6A), we can see that each of the bins corresponds to one of the most common prices (Fig. 6B).

Figure 6: Distribution of Audiobook Prices. A) Density plot of prices, colored by ACX bins. B) Histogram of audiobook prices.
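Binning by length and then looking at the dominant price per bin can be sketched as follows. The 1-, 3-, and 5-hour boundaries come from the discussion above; the 10- and 20-hour boundaries, the labels, and the sample data are assumptions for illustration and should be checked against the ACX page.

```python
from bisect import bisect_right
from collections import Counter

# Length boundaries in hours. The 1, 3, and 5 hour edges come from the text;
# the 10 and 20 hour edges are assumed for illustration.
BIN_EDGES = [1, 3, 5, 10, 20]
BIN_LABELS = ["<1h", "1-3h", "3-5h", "5-10h", "10-20h", "20h+"]


def length_bin(hours: float) -> str:
    """Map a length in hours to its bin; a length on an edge goes to the upper bin."""
    return BIN_LABELS[bisect_right(BIN_EDGES, hours)]


def modal_price_by_bin(books):
    """books: iterable of (length in hours, price). Returns {bin: most common price}."""
    by_bin = {}
    for hours, price in books:
        by_bin.setdefault(length_bin(hours), Counter())[price] += 1
    return {label: counts.most_common(1)[0][0] for label, counts in by_bin.items()}


# Hypothetical (length, price) pairs.
sample_books = [(2.5, 6.95), (2.0, 6.95), (2.9, 3.95), (4.0, 14.95)]
print(modal_price_by_bin(sample_books))
```

Treating exactly 3 hours as part of the 3-5-hour bin matches the observation that titles cluster at, rather than just below, the 3-hour mark.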

To examine this price-to-length connection more closely, I plotted a density plot of price and length for audiobooks under $50 USD and shorter than 40 hours (Fig 7). Each of the peaks is associated with a common price observed above ($19.95, $6.95, $14.95, etc.), which in turn is associated with a length distribution. Differences in price within the same length bin seem to be due to length and the number of reviews (a proxy for popularity). I expect that Audible is using a linear model or rules-based system to set the prices of these audiobooks. While we could make a reasonable estimate of it, the public only has access to customer reviews, not listens or downloads.

Figure 7: Density plot of Audiobook Length and Price.

Authors as Narrators

A feature which I thought might be a draw for listeners is Authors self-narrating their audiobooks. This is more prevalent for celebrities or politicians but is also a cost-saving measure for the Authors as they can avoid paying or royalty-sharing with narrators (Fig 8).

Figure 8: Example of popular titles Narrated by the Author.

While this is a popular trend for celebrities, broadly this does not seem to be a growth area for Audible (Fig 9A). Since 1997 the percentage of Authors self-narrating has been falling, as the growth rate of this trend fails to keep up with the steady 24% growth of the library overall (Fig 9B). This may also be explained by the lack of difference between self-narrated titles and those with professional voice actors: self-narrated titles are not priced higher, are not better rated, and do not have more reviews.

Figure 9: Author self-narration growth statistics. A) Cumulative audiobooks of Author as Narrator. B) Percentage of new books where the Author acts as Narrator.
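One way to flag self-narrated titles is to check whether any author name also appears among the narrators, then take the yearly share. This is a sketch under assumptions: the field names (`year`, `authors`, `narrators`) are hypothetical, and matching on normalized name strings is my simplification, which would miss pen names or differently formatted credits.

```python
from collections import defaultdict


def self_narrated_share(books):
    """Return {year: fraction of that year's titles with an author among the narrators}."""
    totals = defaultdict(int)
    self_narrated = defaultdict(int)
    for book in books:
        totals[book["year"]] += 1
        authors = {name.strip().lower() for name in book["authors"]}
        narrators = {name.strip().lower() for name in book["narrators"]}
        if authors & narrators:  # at least one shared name
            self_narrated[book["year"]] += 1
    return {year: self_narrated[year] / totals[year] for year in sorted(totals)}


# Hypothetical records with placeholder names.
sample_titles = [
    {"year": 2020, "authors": ["Author A"], "narrators": ["author a "]},
    {"year": 2020, "authors": ["Author B"], "narrators": ["Narrator C"]},
    {"year": 2021, "authors": ["Author D"], "narrators": ["Narrator E"]},
]
print(self_narrated_share(sample_titles))
```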

Languages: Room to Grow

The real growth engine for Audible might be in other languages, as only about 5% of the world speaks English as a first language, yet English is the predominant language of audiobooks (Fig 10). However, plotting the growth of non-English audiobooks over time shows that, while there has been some recent growth, there does not appear to be any “catch-up” effort to target other languages (especially considering “Non-English” is the sum of 43 different languages with non-overlapping consumers). Contrast that with Netflix, which has pushed to dub all originals in multiple languages.

Figure 10: Cumulative English and non-English audiobook releases.

We can start to see a language strategy play out with language- or region-specific landing pages for China, Spanish speakers, and India (Fig 11). However, all the books offered through these pages are included in this analysis, so the pressing issue is still language availability. One interesting area of expansion is Audible Suno, an India-only app filled with free Audible Originals voiced by Indian stars. However, this likely only brings in ad revenue (against the podcasts), not the subscription revenue Audible normally relies on.

Figure 11: Example of Language or Region-Specific Offerings

Audible is in the best position to capture audiobooks in every emerging language. Voice acting costs scale with length, generally around $200 per finished hour, and signing exclusive deals with companies or voice actors could lower the costs further. Using internal sales data, Audible might be able to identify titles that could be popular in each language if a good translation were made available. Due to the subscription nature of the audiobook environment, there is a strong first-mover advantage for entering each market. Listeners will pay when they can find audiobook titles that appeal to them, which is a function of promotion and the total number of titles.

Audible's Story Through My Library

Figure 12: Three Titles in My Audible Library Representing the History of Audible

The first is Ready Player One, the first audiobook I listened to on Audible and the book with the most ratings of any on the platform. In previous years, Audible relied on taking existing best sellers and turning them into audiobooks, adding celebrity name recognition by having narrators like Wil Wheaton read the book.

The second title is the recently released Project Hail Mary by Andy Weir, author of The Martian. This audiobook was released as an Audible Original with added production value specifically tailored for an audio experience (complete with background music). This is a sign of Audible making the market by bringing in well-known authors with exclusive content, which can be seen in the Audible Suno app as well.

Finally, in a sign of Audible’s need to transition to multilingual content, there is The Three-Body Problem, an epic series of books that is currently becoming a television series but is not available on Audible in the original Chinese. This book stands as a testament to the need for Audible to begin making markets by translating and publishing titles in non-English languages.

Audible is an incredible platform for Authors and Listeners to connect and enjoy books. It has a strong pricing model which slightly benefits the consumer, especially if you are inclined towards longer titles. Due to this subscriber model, Audible’s long-term success depends on continuing to grow its user base and, with it, its cash flows. This can be achieved through differentiated offerings, like podcasts or exclusive content, or by expanding the pool of subscribers by offering a diverse array of languages. Ultimately this opportunity is Audible’s to lose.

Fun Facts:

Category with Highest Average Star-Rating:

  • 1st Math
  • 2nd Christian Living

Category with Fewest Reviews:

  • Flowers and Plants

Category with Fewest Languages:

  • Thanksgiving

Category with Fewest Audiobooks:

  • Latin (I guess it really is a dead language)

About Author

James Welch

I was trained as a synthetic biologist and I am working to become a data scientist too. I have expertise in the genetic engineering of a variety of single-celled organisms, DNA and protein design, and industrial process scaling...