Data Study on the NYT Bestseller Lists
Introduction: The Paucity of Book-Market Data
If you want to know what films have generated the highest receipts, box office data has long been available for major releases. For television, it's not hard to find the Nielsen ratings for any show. Music sales are a little trickier, but between RIAA certifications and the Billboard charts, you can usually locate what you need to know.
For books, it's a different story. Did the latest Jodi Picoult bestseller outsell the latest John Grisham? It's hard to say. The publishing companies and bookstores do not divulge unit sales. Though Nielsen and Amazon both track book sales, their rankings capture only a minority of the market and cannot be compiled over a long period of time without paying a fee.
The New York Times Bestseller Lists don't compile direct unit numbers, but they do provide weekly rankings of books in hardcover, trade paperback, mass-market paperback, and e-book formats, as well as in a variety of subgenres. Granted, there may be ways of gaming their system, and it's not clear that the lists reflect true sales. Still, even if they are not perfect, the bestseller lists provide data that is easily accessible (through the Times API) and formatted to allow comparisons over time.
The number of weeks spent as a bestseller may not directly reflect unit sales, but the two are surely correlated. Moreover, the lists do not merely reflect success; they help create it: getting onto a bestseller list encourages readers to buy the book. They are probably our best tool for investigating the otherwise-opaque book market.
Objective
How well have the major publishers been doing against each other? What imprints dominate each genre? If you're not affiliated with a major publisher, where's your best bet to break through? And how much is a spot in the prestigious Times Book Review worth?
To find out, I acquired the fiction bestsellers in the four major formats (hardcover, trade paperback, mass-market paperback, and e-book) from June 2008 through the beginning of March 2017, then hand-annotated them with data about their publishers' genres and corporate affiliations. The compiled results are visualized in this application. I also used the API to acquire the URLs of every review the Times Book Review published over that time and scraped their text (a sketch of the API calls appears after the list below). In analyzing the data, I made the following discoveries:
- The book market is extremely concentrated.
- Only a few genres, among them literary fiction and romance, are not completely dominated by the biggest companies.
- Books reviewed by the Times are much more likely to be bestsellers, though it's not clear which is the cause and which the effect.
- It is exceedingly hard to determine the judgment of a Times book review without manually reading it.
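For readers who want to reproduce the data pull, here is a minimal sketch of the acquisition step using the Times Books API (v3). The list names and response fields follow the public documentation; the API key is a placeholder, and this is an illustration rather than the exact script I used.

```python
# Minimal sketch of pulling one week of one bestseller list from the NYT Books
# API (v3). The list names below follow the public API documentation; the key,
# date, and error handling are placeholders rather than a production script.
import time
import requests

API_KEY = "YOUR_NYT_API_KEY"  # placeholder
BASE_URL = "https://api.nytimes.com/svc/books/v3/lists/{date}/{list_name}.json"

FICTION_LISTS = [
    "hardcover-fiction",
    "trade-fiction-paperback",
    "mass-market-paperback",
    "e-book-fiction",
]

def fetch_week(date, list_name):
    """Return the ranked books for one list on one week (date = 'YYYY-MM-DD')."""
    resp = requests.get(
        BASE_URL.format(date=date, list_name=list_name),
        params={"api-key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json()["results"]
    return [
        {
            "week": results["published_date"],
            "list": list_name,
            "rank": book["rank"],
            "weeks_on_list": book["weeks_on_list"],
            "title": book["title"],
            "author": book["author"],
            "publisher": book["publisher"],
        }
        for book in results["books"]
    ]

# Example: one week of hardcover fiction. The API is rate-limited,
# so sleep between calls when looping over every week since June 2008.
rows = fetch_week("2017-03-05", "hardcover-fiction")
time.sleep(6)
```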
The Bestseller Oligopoly Is in Danger of Becoming a Monopoly
Though hundreds of publishing imprints are represented on the bestseller lists, 90% of the space is taken up by a handful of parent companies. Until 2013, these were known as the Big Six: Random House, Penguin, HarperCollins, Hachette, Simon & Schuster, and Macmillan. With the merger of the two biggest houses (Random House and Penguin) in 2013, that concentration has only gotten more extreme. A single company now controls nearly half of the bestseller lists.
The nested corporate structure of publishing imprints can be dizzying. For instance, the prestige novels of Margaret Atwood and Ian McEwan are published under the personal imprint of veteran editor Nan A. Talese. This might give the impression of a small, boutique operation. But Talese is owned by the eminent publisher Doubleday, which, since 2009, has been part of Knopf Doubleday, having been merged with Alfred A. Knopf by their joint corporate owner Random House. Random House is now a subsidiary of Penguin Random House, which is itself jointly operated by the international media conglomerates Bertelsmann and Pearson.
What hope does any smaller publisher have against a force so mammoth?
Though the biggest companies dominate mainstream commercial fiction, there is some breathing room within individual genres. For instance, Macmillan (a distant fifth among the Big Five in most areas) leads the Science Fiction/Fantasy market via its Tor and Minotaur imprints (run by Tom Doherty Associates and St. Martin's, respectively).
Independent houses, led by Grove/Atlantic, do well in literary fiction, cumulatively publishing a fifth of literary imprint bestsellers. Perhaps most astonishing is the e-book list. A mere decade ago, the notion of a self-published bestseller would have been laughed at. Now, nearly half of all romance/erotica e-book bestsellers are self-published.
Still, the bestseller list is extremely top-heavy, dominated by the biggest-selling books. On the hardcover, e-book, and mass-market lists, about 10% of total space is taken up by the 1-3 annual books that last on the charts for over a year. This is even more extreme on the trade paperback list, where over 33% of the list is taken up by books that chart for over a year.
When a book like Gone Girl or Fifty Shades of Grey hits the trade paperback list, it stays there for a long time, preventing other books from climbing into the top 20. As a result, only a third as many individual trade paperbacks reach bestseller status as in the other formats.
Across the board, the typical bestseller in print lasts only 2-3 weeks. For e-books, attention spans run even shorter: while the top books take up as much space as on the other lists, 75% of digital bestsellers don't merit a second week.
The Book Review Boost
The Bestseller lists are compiled by the New York Times Book Review, arguably the most prestigious book review in the country. Out of the ~40,000 new fiction titles released each year,* the Book Review covers about 350. Those 350 are about 10 times as likely to make the hardcover and mass-market lists as the average new title, and 30 times as likely to make the trade paperback list.
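The "times as likely" figures are simple ratios of conditional probabilities. As an illustration only (the bestseller counts below are hypothetical placeholders, not the study's actual numbers), the lift for one format can be computed like this:

```python
# Hypothetical illustration of the "N times as likely" calculation.
# The bestseller counts below are placeholders, NOT the study's actual figures.
new_titles_per_year = 40_000   # rough estimate of new fiction titles (see footnote)
reviewed_per_year = 350        # fiction titles covered by the Book Review

format_bestsellers = 250       # hypothetical: distinct titles hitting one format's list
reviewed_bestsellers = 22      # hypothetical: how many of those 250 were reviewed

p_chart_overall = format_bestsellers / new_titles_per_year
p_chart_if_reviewed = reviewed_bestsellers / reviewed_per_year

lift = p_chart_if_reviewed / p_chart_overall
print(f"Reviewed titles are ~{lift:.0f}x as likely to chart")  # ~10x with these placeholders
```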
However, it's not clear how much of this correlation is actually causative. The Book Review is more likely to review books that are already a good bet to sell well. Furthermore, while the Book Review impact looks big in the chart on the right, it looks less impressive in the chart on the left. Since the number of reviewed books is so small, the majority of books that make the bestseller list are not reviewed at all.
Findings
Reviews appear to matter most for trade paperbacks: trade paperback bestsellers are reviewed nearly as often as hardcover bestsellers, despite being far fewer in number. By contrast, only 3% of mass-market bestsellers received a review. Overall, the bestsellers that did not receive a review far outnumber those that did. In other words, prestigious though it may be, a Times book review may not be that relevant to how a book sells. How can we determine a review's market value? One way is to see whether books that receive a good review do any better than those that do not.
New York Times Reviews Are Difficult to Classify
Determining whether or not a Times review is positive, though, is difficult. Machine-learning classifiers are frequently trained to separate positive from negative commentary, but they are usually designed to handle born-digital material that is tailor-made for easy classification. For example, the first review on today's top-selling Amazon book (Jay Asher's Thirteen Reasons Why) contains only slightly more than 100 words, lots of descriptive language reflecting the reviewer's judgment (e.g., "intrigued," "page turner"), and plenty of metadata, including a five-star rating. It's easy to tell that this is a positive review.
A New York Times review is quite different. They average about 1,200 words, rendering them computationally complex. Over 75% of their sentences make no direct evaluative statement about the book, instead discussing and contextualizing its contents. They do not come with a star rating, which means we have no labels on which to train and test models. The Times API supplies little metadata. They are written, that is, so that you actually have to read the article.
This leaves us with two problems. First, since we are focusing only on fiction, we have to separate out fiction reviews from nonfiction reviews. Second, we have to devise a scalable method of evaluating the review's judgment.
Locating the "Fiction" Topic
Separating fiction from nonfiction proved relatively simple. Even if a nonfiction book and a novel share similar content (say, Erik Larson's Devil in the White City and Thomas Pynchon's Against the Day, both of which are partially set around the Chicago World's Fair of 1893), a review of the latter will likely feature different terminology from the former: it will refer more frequently to the "narrator," the "characters," and the "story."
Topic Modeling
To isolate that discourse, we can use an unsupervised machine-learning technique called "topic modeling." Topic modeling assumes that a corpus of documents is generated by a set of topics, each itself construed as a distribution of word frequencies across the vocabulary. Each document in the corpus has a particular mix of those topics, and that mix is reflected in the document's specific word distribution.
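I won't reproduce the full pipeline here, but a topic model of this kind can be fit with standard tooling. Below is a sketch using gensim's LDA implementation; the library choice and preprocessing are assumptions, and review_texts is a placeholder for the scraped review strings.

```python
# Sketch of fitting a 30-topic LDA model to the review texts with gensim.
# The preprocessing and library choice are assumptions; any standard LDA
# implementation (gensim, scikit-learn, MALLET) would follow the same shape.
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def tokenize(text):
    return [tok for tok in simple_preprocess(text) if tok not in STOPWORDS]

# review_texts: list of full review strings, one per review (assumed to exist)
docs = [tokenize(text) for text in review_texts]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop very rare / very common words
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=30,
    passes=10,
    random_state=42,
)

# Inspect the most frequent words per topic (e.g., to find the "fiction" topic)
for topic_id, words in lda.show_topics(num_topics=30, num_words=8, formatted=False):
    print(topic_id, [w for w, _ in words])

# Per-document topic mixture, used later for thresholding
doc_topics = [dict(lda.get_document_topics(bow, minimum_probability=0.0)) for bow in bow_corpus]
```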
Training multiple 30-topic models on the book reviews produced a consistent set of topics. As expected, most deal with the books' subject matter. In the model I used, Topic 9's most frequent words include "science," "brain," and "universe." Topic 6, similarly, has "sex," "marriage," and "love." Several topics, though, emphasize the writing process. Most important, Topic 2, whose most frequent words include "novel," "characters," and "fiction," seems to address issues of fictionality.
There was not, unfortunately, a clear numerical boundary separating fiction from nonfiction, but examining specific reviews with high and low concentrations of Topic 2 confirmed that the former addressed fiction and the latter nonfiction. I had only to sort out the ones in the middle.
Manual Decision Tree
To do so, I constructed a manual decision tree. Those documents which had less than 8% of the "Fiction" topic I classified as nonfiction, and those over 16% I classified as fiction. For the middle range, I filtered out three types of reviews as nonfiction.
First were several hundred that contained the word "memoir," since memoirs are nonfiction works written in a semi-fictional style.
Second were those with more than 15% of Topic 8 ("Literary Life"), which included author biographies, works of poetry, and literary essays. Third were books with more than 15.5% of Topic 20 ("Public Writing"), which included social criticism and creative nonfiction. For the books that remained, I assigned each one with over 12% of Topic 2 to fiction and the rest to nonfiction. A spot check suggested this process had about 95% accuracy.
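Expressed as code, the rule cascade looks roughly like the sketch below. The thresholds and topic numbers mirror the description above; the data structures (doc_topics, review_texts) are illustrative stand-ins for the per-document topic mixtures and review texts.

```python
# Rule cascade from the description above, applied to each review's topic mixture.
# doc_topics[i] maps topic id -> proportion; review_texts[i] is the review text.
FICTION, LITERARY_LIFE, PUBLIC_WRITING = 2, 8, 20  # topic ids from the model above

def classify_review(topics, text):
    fiction_share = topics.get(FICTION, 0.0)
    if fiction_share < 0.08:
        return "nonfiction"
    if fiction_share > 0.16:
        return "fiction"
    # Middle range: filter out likely nonfiction before the final threshold
    if "memoir" in text.lower():
        return "nonfiction"
    if topics.get(LITERARY_LIFE, 0.0) > 0.15:     # biographies, poetry, literary essays
        return "nonfiction"
    if topics.get(PUBLIC_WRITING, 0.0) > 0.155:   # social criticism, creative nonfiction
        return "nonfiction"
    return "fiction" if fiction_share > 0.12 else "nonfiction"

labels = [classify_review(t, txt) for t, txt in zip(doc_topics, review_texts)]
```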
In retrospect, I might have achieved better results by manually labeling ~400 reviews as fiction or nonfiction, then letting a Random Forest automate the decision tree process I underwent manually. It might have been faster and more accurate. Still, I am satisfied with my results.
The Limitations of a Sentiment Lexicon
Determining a review's critical evaluation proved more difficult. The most basic algorithmic tools for classifying reviews are sentiment lexicons, which assign a positive or negative number to each word in a document based on its typical emotional valence. However, because the majority of a Times review describes rather than evaluates a book, sentiment lexicons can become easily confused. For instance, take this passage from Michiko Kakutani's review of Toni Morrison's Home:
Threaded through the story are reminders of our country's vicious inhospitality toward some of its own. On his way south, Frank makes use of a Green Book, part of the essential series of travelers' guides for African-Americans during a more overtly racist era. On a train, he encounters fellow passengers who've been beaten and bloodied simply for trying to buy coffee from a white establishment. He meets a boy who, out playing with a cap gun, was shot by a policeman and lost the use of one arm.
In context, this quote approvingly describes Morrison's depiction of racial intolerance. But even a sentiment lexicon as subtle as the one created by literature professor Matthew Jockers (dubbed "Syuzhet") sees words like "vicious," "bloodied," and "racist" and decides this is a starkly negative text, giving it one of the lowest ratings in the set. Overall, on a three-category classification test ("positive," "mixed," "negative"), a straight sentiment score proved little better than random guessing.
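Syuzhet itself is an R package, but the underlying mechanics are easy to illustrate. The sketch below uses a tiny, entirely hypothetical slice of a valence lexicon to show why a word-level scorer reads the Kakutani passage as negative:

```python
# Minimal word-level lexicon scorer, illustrating why the Kakutani passage
# scores as negative. The lexicon here is a tiny hypothetical slice; Syuzhet's
# actual lexicon assigns graded valences to thousands of words.
import re

LEXICON = {  # hypothetical valences for illustration only
    "vicious": -0.75, "bloodied": -0.6, "racist": -0.75, "beaten": -0.6,
    "shot": -0.5, "lost": -0.4, "essential": 0.5, "haunting": -0.25,
}

def lexicon_score(text):
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(w, 0.0) for w in words)

passage = ("Threaded through the story are reminders of our country's vicious "
           "inhospitality toward some of its own...")  # truncated quote from above
print(lexicon_score(passage))  # negative, even though the review is positive
```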
Filtering
A logical route for surmounting this problem is to filter out sentences that do not evaluate the book. The best method I devised was to eliminate all sentences that make no mention of the author's surname or the book's title. Once sentences like those quoted above are removed from the review of Home, the sentiment classifier focuses on ones like "This haunting, slender novel is a kind of tiny Rosetta Stone to Toni Morrison's entire oeuvre" and returns a more accurate score.
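A sketch of that filtering step is below. The sentence tokenizer (NLTK) is an assumed choice, and lexicon_score refers to the toy scorer sketched earlier; the point is the shape of the filter, not the exact implementation.

```python
# Sketch of the filtering step: keep only sentences mentioning the author's
# surname or the book's title, then score what remains. NLTK's sentence
# tokenizer is an assumed choice; the author/title pairs come from the API data.
from nltk.tokenize import sent_tokenize  # may require nltk.download("punkt") once

def evaluative_sentences(review_text, surname, title):
    keep = []
    for sent in sent_tokenize(review_text):
        lowered = sent.lower()
        if surname.lower() in lowered or title.lower() in lowered:
            keep.append(sent)
    return keep

# e.g., for the review of Home, this keeps the "tiny Rosetta Stone" sentence
# (it names Morrison) and drops the plot-summary sentences quoted earlier.
filtered = evaluative_sentences(review_text, "Morrison", "Home")
score = sum(lexicon_score(s) for s in filtered)  # lexicon_score: toy scorer above
```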
A spot check suggested that this approach still yielded only 66% accuracy. One reason is that sentiment lexicons are bad at understanding irony. For instance, take this arch assessment of Nora Roberts's The Villa by Janet Maslin:
So it would be an understatement to say that Nora Roberts deals in feminine wish fulfillment, especially when David turns out to be the kind of man who is excited by making the perfect jewelry purchase for his beloved, when he has teenage children who won't really mind a stepmom, and when he also turns out to be a stern corporate boss ready to upbraid Pilar's ex-husband on the job. Even when David is injured during one of the occasional moments of light mayhem in ''The Villa,'' he remains the romance reader's idea of a perfect 10.
Syuzhet sees words like "perfect," "excited," and "fulfillment" and gives this review a strong positive rating of 7.1. Obviously, that does not accurately reflect Maslin's attitude.
The Limitations of Neural Networks
I attempted several clustering efforts to determine a review's evaluation, but because the majority of each review addressed content, standard algorithms proved ineffective. Hoping for a more sophisticated approach to sentiment classification, one that would take context into account alongside individual word choice, I built a neural network based on the Word2Vec algorithm.**
Word2Vec generates a numeric vector for each word in a corpus, using back-propagation to align each word with its common collocates. As a result, a word's vector will be similar to those assigned to functionally similar words. When loaded into a convolutional neural network, a Word2Vec model can then be used to classify sentences.
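Here is a sketch of that training step with gensim's Word2Vec, including the manual sanity check on critical vocabulary described in the next paragraph. The parameter values are illustrative rather than the ones I actually settled on.

```python
# Sketch of training Word2Vec on the review sentences with gensim and checking
# nearest neighbours for critical vocabulary. Parameter values are illustrative.
from gensim.models import Word2Vec

# sentences: list of token lists, one per sentence across the whole review corpus
w2v = Word2Vec(
    sentences=sentences,
    vector_size=100,   # dimensionality of each word vector
    window=5,          # context window of collocates
    min_count=5,       # ignore rare words
    workers=4,
    epochs=10,
)

# Manual sanity check: do evaluative words land near other evaluative words?
for probe in ("chilling", "hackneyed", "sympathize"):
    if probe in w2v.wv:
        print(probe, w2v.wv.most_similar(probe, topn=5))
```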
I spent some time tuning Word2Vec parameters by manually examining whether they produced appropriate similarity scores for common critical words like "chilling," "hackneyed," and "sympathize." Next, I hand-labeled two thousand sentences from fifty reviews based on whether they were "positive," "neutral," or "negative." I used a subset of those two thousand to train the CNN, which produced 81% accuracy on the test subset. Finally, I loaded every sentence from the fiction reviews into the CNN to predict classifications for those sentences, then recombined the sentences to produce overall classifications.
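For concreteness, here is what a sentence-classification CNN over those Word2Vec vectors might look like in Keras. The framework and architecture details are assumptions (a standard Kim-style Conv1D-plus-max-pooling design), not a record of my exact model.

```python
# Sketch of a sentence-classification CNN over stacked Word2Vec vectors.
# Keras is an assumed framework and the architecture details are illustrative;
# the original pipeline is only described at a high level above.
import numpy as np
from tensorflow.keras import layers, models

MAX_LEN, EMB_DIM, N_CLASSES = 60, 100, 3  # negative / neutral / positive

def sentence_matrix(tokens, w2v):
    """Stack Word2Vec vectors into a fixed-size (MAX_LEN, EMB_DIM) matrix."""
    vecs = [w2v.wv[t] for t in tokens[:MAX_LEN] if t in w2v.wv]
    pad = np.zeros((MAX_LEN - len(vecs), EMB_DIM))
    return np.vstack([np.array(vecs).reshape(-1, EMB_DIM), pad])

model = models.Sequential([
    layers.Input(shape=(MAX_LEN, EMB_DIM)),
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X: stacked sentence matrices; y: hand labels 0/1/2 for negative/neutral/positive
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=32)
```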
It didn't work. On a spot check, I found accuracy to be barely better than random guessing. The accuracy improved when the inputs were restricted to sentences mentioning the author's or book's name, but it was still below 60%. That left the raw word-by-word sentiment score on filtered sentences as the best remaining method.
Results
Given the inaccuracies in the sentiment score, any further results need to be taken with a grain of salt the size of a rock. Still, to finish the exercise: the figures did not show significant differences in reviewer attitudes toward bestsellers and non-bestsellers. The bestseller median was slightly higher, but the difference was dwarfed by the standard deviation of the data. Similarly, bestsellers received slightly longer reviews, but not significantly longer.
I am not satisfied with my results. The problem of classifying Times reviews is a difficult one, and the subject will require further study and experimentation.
Lessons for Further Work
The superiority of the basic sentiment score to the neural network was surprising but logical. Sentiment scores make word-level distinctions at varying levels of intensity, while neural networks based on models like Word2Vec are limited to broad classifications of whole sentences. That is, while the CNN could only assign a sentence one of three values {-1, 0, 1}, the Syuzhet sentiment score could place it anywhere on a continuous spectrum, though in practice the scores fell within the range [-6, 6].
Regardless, the sentiment score is more transparent and more flexible, a conclusion Jockers and co-author Jodie Archer reached in their own work text-mining bestsellers.
In further work, I would pursue the following avenues:
- We might refine the sentiment score by producing a criticism-specific sentiment lexicon. A keyness test could be applied to the review corpus to isolate especially prominent critical words, to which criticism-specific scores could then be assigned (see the sketch after this list). This would not solve problems like those surrounding irony, but it might better handle words that a general-purpose sentiment analysis would misclassify (e.g., "terrifying").
- Given that the CNN was limited by a lack of labeled reviews, we could give it a second chance by importing lexically similar book reviews from a source that labels its reviews with a rating, e.g., the starred/unstarred Publishers Weekly reviews. This would still be difficult, because the rating would be at the review level rather than the sentence level, but it would improve the training process.
- Once a more satisfactory sentiment measure was devised, we could use that information (in combination with review length, time of review, etc.) to generate a predictive model for the bestsellers.
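As a pointer for the first item above, here is a sketch of a keyness (log-likelihood) test that would surface distinctively critical vocabulary. The reference corpus and cutoff are assumptions; the formula is Dunning's G-squared as commonly used in corpus linguistics.

```python
# Sketch of a keyness (Dunning log-likelihood) test for building a
# criticism-specific lexicon. review_counts / reference_counts are Counters
# over a tokenized review corpus and an assumed general reference corpus.
import math

def keyness(word, review_counts, reference_counts, review_total, reference_total):
    a = review_counts[word]       # observed count in reviews
    b = reference_counts[word]    # observed count in reference corpus
    e1 = review_total * (a + b) / (review_total + reference_total)
    e2 = reference_total * (a + b) / (review_total + reference_total)
    ll = 0.0
    if a > 0:
        ll += 2 * a * math.log(a / e1)
    if b > 0:
        ll += 2 * b * math.log(b / e2)
    return ll                     # higher = more distinctive of the review corpus

review_total = sum(review_counts.values())
reference_total = sum(reference_counts.values())
candidates = sorted(
    review_counts,
    key=lambda w: keyness(w, review_counts, reference_counts, review_total, reference_total),
    reverse=True,
)[:500]  # top distinctive words, to be hand-assigned criticism-specific valences
```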
Conclusion
In some ways, the question of how to classify Times reviews is merely an intellectual problem. Still, it could provide some real insight. If we could confirm that getting a Times review is all that matters, with the review's judgment being secondary, that would have significant implications for marketing strategies. If the Times' opinion doesn't matter, it might be better to simply engage its attention, whether positively or not, rather than to try to court its favor.
*There is no good data on the actual number of new fiction titles published each year. The ProQuest division R.R. Bowker logs about 50,000 new fiction ISBNs each year, but that figure a) double-counts titles released in multiple formats (i.e., a paperback and hardcover of the same book will receive different ISBNs) and b) under-counts digitally published work. My rough estimates rest on a back-of-napkin calculation (based on overall market share and prices) that there are about 18,000 new trade paperback and e-book fiction titles each annually, plus about 9,000 new hardcover and mass-market titles each.
**I used Word2Vec instead of Doc2Vec because, again, only a fraction of each review's sentences were evaluative. Using Doc2Vec would likely cause the CNN to cluster reviews based on content rather than evaluation.