Data Study on the NYT Bestseller Lists

Posted on Apr 4, 2017

Introduction: The Paucity of Book-Market Data

If you want to know what films have generated the highest receipts, box office data has long been available for major releases. For television, it's not hard to find the Nielsen ratings for any show. Music sales are a little trickier, but between RIAA certifications and the Billboard charts, you can usually locate what you need to know.

For books, it's a different story. Did the latest Jodi Picoult bestseller outsell the latest John Grisham? It's hard to say. The publishing companies and bookstores do not divulge unit sales. Though Nielsen and Amazon both track book sales, their rankings capture only a minority of the market and cannot be compiled over a long period of time without paying a fee.

The New York Times Bestseller Lists don't compile direct unit numbers, but they do provide weekly rankings of books in hardcover, trade paperback, mass-market paperback, and e-book formats, as well as in a variety of subgenres. Granted, there may be ways of gaming their system, and it's not clear that the lists reflect true sales. Still, even if they are not perfect, the bestseller lists provide data that is easily accessible (through the Times API) and formatted to allow comparisons over time.
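For concreteness, here is a minimal sketch of what pulling one weekly list from the Books API can look like. This is not the code behind this study; it assumes the requests library, an API key stored in an NYT_API_KEY environment variable, and the list names used by the current Books API (e.g., "hardcover-fiction").

```python
import os
import requests

API_KEY = os.environ["NYT_API_KEY"]  # assumes a registered Times developer key
LISTS_URL = "https://api.nytimes.com/svc/books/v3/lists/{date}/{list_name}.json"

def fetch_list(date, list_name):
    """Fetch one weekly bestseller list (e.g., 'hardcover-fiction') for a given date."""
    resp = requests.get(LISTS_URL.format(date=date, list_name=list_name),
                        params={"api-key": API_KEY})
    resp.raise_for_status()
    return resp.json()["results"]["books"]

# Example: one week of the hardcover fiction list
for book in fetch_list("2017-03-05", "hardcover-fiction"):
    print(book["rank"], book["title"], book["publisher"])
```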

The number of weeks a book spends as a bestseller may not directly reflect unit sales, but the two are surely correlated. Moreover, the lists are important predictive tools in addition to being reflectors of success: getting onto the bestseller list encourages readers to buy the book. They are probably our best tool for investigating the otherwise opaque book market.

Objective

How well have the major publishers been doing against each other? What imprints dominate each genre? If you're not affiliated with a major publisher, where's your best bet to break through? And how much is a spot in the prestigious Times Book Review worth?

To find out, I acquired the fiction bestsellers in the four major formats (hardcover, trade paperback, mass-market paperback, and e-book) from June 2008 through the beginning of March 2017, then hand-annotated them with data about their publishers' genre and corporate affiliation. The compiled results are visualized in this application. I also used the API to acquire the URLs of every review the Times Book Review published over that time and scraped their text. In analyzing the data, I made the following discoveries:

  1. The book market is extremely concentrated.
  2. Only a few genres, including literary fiction and romance, are not completely dominated by the biggest companies.
  3. Books reviewed by the Times are much more likely to be bestsellers, though it's not clear which is the cause and which the effect.
  4. It is exceedingly hard to determine the judgment of a Times book review without manually reading it.

The Bestseller Oligopoly Is in Danger of Becoming a Monopoly

Though hundreds of publishing imprints are represented on the bestseller lists, 90% of the space is taken up by a handful of parent companies. Until 2013, these were known as the Big Six: Random House, Penguin, HarperCollins, Hachette, Simon & Schuster, and Macmillan. With the merger of the two biggest houses (Random House and Penguin) in 2013, that concentration has only gotten more extreme. A single company now controls nearly half of the bestseller lists.


The nested corporate structure of publishing imprints can be dizzying. For instance, the prestige novels of Margaret Atwood and Ian McEwan are published under the personal imprint of veteran editor Nan A. Talese. This might give the impression of a small, boutique operation. But Talese is owned by the eminent publisher Doubleday, which, since 2009, has been part of Knopf Doubleday, having been merged with Alfred A. Knopf by their joint corporate owner Random House. Random House is now a subsidiary of Penguin Random House, which is itself jointly operated by the international media conglomerates Bertelsmann and Pearson.

 

What hope does any smaller publisher have against a force so mammoth?

Though the biggest companies dominate mainstream commercial fiction, there is some breathing room within individual genres. For instance, Macmillan, a distant fifth among the Big Five in most areas, leads the Science Fiction/Fantasy market via its Tor and Minotaur imprints (run by Tom Doherty Associates and St. Martin's, respectively).

Independent houses, led by Grove/Atlantic, do well in literary fiction, cumulatively publishing a fifth of literary imprint bestsellers. Perhaps most astonishing is the e-book list. A mere decade ago, the notion of a self-published bestseller would have been laughed at. Now, nearly half of all romance/erotica e-book bestsellers are self-published.

Still, the bestseller list is extremely top-heavy, dominated by the biggest-selling books. On the hardcover, e-book, and mass-market lists, about 10% of total space is taken up by the one to three books each year that stay on the charts for over a year. This is even more extreme on the trade paperback list, where over 33% of the space is taken up by books that chart for over a year.

When a book like Gone Girl or Fifty Shades of Grey hits the trade paperback list, that is, it stays there for a long time, preventing other books from climbing into the top 20. As a result, only a third as many individual trade paperbacks reach bestseller status as in other formats.

Across the board, the typical bestseller in print lasts only 2-3 weeks. For e-books, attention spans run even shorter: while the top books take up as much space as on the other lists, 75% of digital bestsellers don't merit a second week.

The Book Review Boost

The bestseller lists are compiled by the New York Times Book Review, arguably the most prestigious book review in the country. Out of the ~40,000 new fiction titles released each year,* the Book Review covers about 350. Those 350 are about 10 times as likely to make the hardcover and mass-market lists as the average new title, and 30 times as likely to make the trade paperback list.

However, it's not clear how much of this correlation is actually causal. The Book Review is more likely to review books that are already a good bet to sell well. Furthermore, while the Book Review's impact looks big in the chart on the right, it looks less impressive in the chart on the left. Since the number of reviewed books is so small, the majority of books that make the bestseller list are not reviewed at all.

Findings

We can see the importance of book reviews to the trade paperback bestsellers: they are reviewed nearly as often as the hardcover bestsellers, despite the smaller number of trade paperback bestsellers. However, only 3% of mass-market bestsellers received a review. Overall, the bestsellers that did not receive a review far outnumber those that did. In other words, prestigious though it may be, a Times book review may not be that relevant to how a book sells. How can we determine its market value? One way is to see whether books that receive a good review do any better than those that do not.

 

 

New York Times Reviews Are Difficult to Classify

Determining whether or not a Times review is positive, though, is difficult. Machine-learning classifiers are frequently trained to separate positive from negative commentary, but they are usually designed to handle born-digital material that is tailor-made for easy classification. For example, the first review of today's top-selling Amazon book (Jay Asher's Thirteen Reasons Why) contains only slightly more than 100 words, lots of descriptive language reflecting the reviewer's judgment (e.g., "intrigued," "page turner"), and plenty of metadata, including a five-star rating. It's easy to tell that this is a positive review.

A New York Times review is quite different. Reviews average about 1,200 words, rendering them computationally complex. Over 75% of their sentences make no direct evaluative statement about the book, instead discussing and contextualizing its contents. They do not come with a star rating, which means we have no labels on which to train and test models. The Times API supplies little metadata. They are written, that is, so that you actually have to read the article.

This leaves us with two problems. First, since we are focusing only on fiction, we have to separate out fiction reviews from nonfiction reviews. Second, we have to devise a scalable method of evaluating the review's judgment.

Locating the "Fiction" Topic

Separating fiction and nonfiction proved relatively simple. Even if a nonfiction book and a fiction book share similar content (say, Erik Larson's Devil in the White City and Thomas Pynchon's Against the Day, both of which are partially set around the Chicago World's Fair of 1893), a review of the latter will likely feature different terminology than a review of the former: it will refer more frequently to the "narrator," the "characters," and the "story."

Topic Modeling

To isolate that discourse, we can use an unsupervised machine-learning technique called "topic modeling." Topic modeling assumes that a corpus of documents is generated by a set of topics, each itself construed as a distribution of word frequencies across the vocabulary. Each document in the corpus has its own mix of those topics, and that mix is reflected in the document's specific word distribution.
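The post doesn't name the implementation it used; as a rough illustration of the technique, here is a minimal sketch using gensim's LdaModel with 30 topics, as below. The review_texts variable and all hyperparameters other than the topic count are assumptions.

```python
from gensim import corpora, models
from gensim.utils import simple_preprocess

# `review_texts` is assumed: a list of raw review strings scraped from the Times
docs = [simple_preprocess(text) for text in review_texts]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare and ubiquitous words
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# 30 topics, as in the analysis; other settings are illustrative defaults
lda = models.LdaModel(bow_corpus, num_topics=30, id2word=dictionary,
                      passes=10, random_state=42)

# Topic mix for one review: a list of (topic_id, proportion) pairs
print(lda.get_document_topics(bow_corpus[0], minimum_probability=0.0))
print(lda.print_topic(2))   # inspect the top words of one topic
```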

Training multiple 30-topic models on the book reviews produced a consistent set of topics. As expected, most deal with the books' subject matter. In the model I used, Topic 9's most frequent words include "science," "brain," and "universe." Topic 6, similarly, has "sex," "marriage," and "love." Several topics, though, emphasize the writing process. Most important, Topic 2, whose most frequent words include "novel," "characters," and "fiction," seems to address issues of fictionality.

There was not, unfortunately, a clear numerical boundary separating fiction and nonfiction, but examining specific reviews with high and low concentrations of Topic 2 confirmed that the former addressed fiction and the latter nonfiction. I had only to sort out the ones in the middle.

Manual Decision Tree

To do so, I constructed a manual decision tree. Documents with less than 8% of the "Fiction" topic I classified as nonfiction, and those with more than 16% I classified as fiction. For the middle range, I filtered out three types of reviews as nonfiction.

First were several hundred that contained the word "memoir," since memoirs are nonfiction works written in a style that resembles fiction.

Second were those with more than 15% of Topic 8 ("Literary Life"), which included author biographies, works of poetry, and literary essays. Third were books with more than 15.5% of Topic 20 ("Public Writing"), which included social criticism and creative nonfiction. For the books that remained, I assigned each one with more than 12% of Topic 2 to fiction and the rest to nonfiction. A spot check suggested this process had about 95% accuracy.
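To summarize the rules above in code, here is a sketch of the decision logic. The topic numbers and thresholds come from the text; the function name and the dict-of-proportions representation are mine.

```python
def is_fiction(review_text, topic_mix):
    """Classify one review as fiction (True) or nonfiction (False).

    `topic_mix` maps topic id -> proportion, with topic 2 = "Fiction",
    8 = "Literary Life," and 20 = "Public Writing," as in the model described above.
    """
    fiction = topic_mix.get(2, 0.0)
    if fiction < 0.08:
        return False                      # clearly nonfiction
    if fiction > 0.16:
        return True                       # clearly fiction
    # Middle range: filter out likely nonfiction
    if "memoir" in review_text.lower():
        return False
    if topic_mix.get(8, 0.0) > 0.15:      # author biographies, poetry, literary essays
        return False
    if topic_mix.get(20, 0.0) > 0.155:    # social criticism, creative nonfiction
        return False
    return fiction > 0.12
```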

In retrospect, I might have achieved better results by manually labeling ~400 reviews as fiction or nonfiction, then letting a random forest automate the decision-tree process I carried out by hand. It might have been faster and more accurate. Still, I am satisfied with my results.

The Limitations of a Sentiment Lexicon

Determining a review's critical evaluation proved more difficult. The most basic algorithmic tools for classifying reviews are sentiment lexicons, which assign positive or negative numbers to each word in a document based on its typical emotional valence. However, because the majority of a Times review describes rather than evaluates a book, sentiment lexicons are easily confused. For instance, take this passage from Michiko Kakutani's review of Toni Morrison's Home:

Threaded through the story are reminders of our country's vicious inhospitality toward some of its own. On his way south, Frank makes use of a Green Book, part of the essential series of travelers' guides for African-Americans during a more overtly racist era. On a train, he encounters fellow passengers who've been beaten and bloodied simply for trying to buy coffee from a white establishment. He meets a boy who, out playing with a cap gun, was shot by a policeman and lost the use of one arm.

In context, this quote approvingly describes Morrison's depiction of racial intolerance. But even a sentiment lexicon as subtle as the one created by literature professor Matthew Jockers (dubbed "Syuzhet") sees words like "vicious," "bloodied," and "racist" and decides this is a starkly negative text, giving it one of the lowest ratings in the set. Overall, on a three-category classification test ("positive," "mixed," "negative"), a straight sentiment score proved little better than random guessing.
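To make the failure mode concrete, here is a toy version of word-level lexicon scoring. This is not Syuzhet (an R package with a much larger lexicon); the valence values below are invented for illustration.

```python
# Toy sentiment lexicon; real lexicons assign valences to thousands of words
LEXICON = {"vicious": -0.8, "racist": -0.8, "bloodied": -0.7, "beaten": -0.6,
           "essential": 0.4, "haunting": -0.3, "perfect": 0.8, "excited": 0.6}

def lexicon_score(text):
    """Sum the valence of every lexicon word in the text; the sign is the verdict."""
    tokens = (tok.strip(".,;:'\"") for tok in text.lower().split())
    return sum(LEXICON.get(tok, 0.0) for tok in tokens)

passage = ("Threaded through the story are reminders of our country's vicious "
           "inhospitality toward some of its own. ... He encounters fellow passengers "
           "who've been beaten and bloodied ... during a more overtly racist era.")
print(lexicon_score(passage))   # strongly negative, even though the review is positive
```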

Filtering

A logical route for surmounting this problem is to filter out sentences that do not evaluate the book. The best method I devised for doing so was to eliminate all sentences that make no mention of the author's surname or the book's title. By discarding sentences like those above when analyzing Home, the sentiment classifier focuses on ones like "This haunting, slender novel is a kind of tiny Rosetta Stone to Toni Morrison's entire oeuvre" and returns a more accurate score.
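A minimal sketch of that filter, assuming plain-text reviews; the naive sentence splitter and the function name are mine. (Matching a short title like Home will also catch the ordinary word, which a real pipeline would need to handle.)

```python
import re

def evaluative_sentences(review_text, surname, title):
    """Keep only sentences that mention the author's surname or the book's title."""
    # Naive splitter; a production pipeline would use a real sentence tokenizer
    sentences = re.split(r"(?<=[.!?])\s+", review_text)
    pattern = re.compile(r"\b({}|{})\b".format(re.escape(surname), re.escape(title)),
                         re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# e.g., evaluative_sentences(review_text, "Morrison", "Home") keeps the "Rosetta Stone"
# sentence quoted above while dropping the purely descriptive passage.
```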

A spot check suggested that this approach still yielded only 66% accuracy. One reason is that sentiment lexicons are bad at understanding irony. For instance, take this arch assessment of Nora Roberts's The Villa by Janet Maslin:

So it would be an understatement to say that Nora Roberts deals in feminine wish fulfillment, especially when David turns out to be the kind of man who is excited by making the perfect jewelry purchase for his beloved, when he has teenage children who won't really mind a stepmom, and when he also turns out to be a stern corporate boss ready to upbraid Pilar's ex-husband on the job. Even when David is injured during one of the occasional moments of light mayhem in ''The Villa,'' he remains the romance reader's idea of a perfect 10.

Syuzhet sees words like "perfect," "excited," and "fulfillment" and gives this review a strong positive rating of 7.1. Obviously, that does not accurately reflect Maslin's attitude.

The Limitations of Neural Networks

I attempted several clustering approaches to determine a review's evaluation, but because the majority of each review addressed content, standard algorithms proved ineffective. Hoping for a more sophisticated approach to sentiment classification, one that would take context into account alongside individual word choice, I built a neural network based on the Word2Vec algorithm.**

Word2Vec generates a numeric vector for each word in a corpus, using back-propagation to align each word's vector with those of its common collocates. Consequently, each word's vector will be similar to the vectors assigned to functionally similar words. When loaded into a convolutional neural network, a Word2Vec model can then be used to classify sentences.
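As a rough sketch of this step, here is how such a model might be trained and sanity-checked with gensim; the corpus variable and hyperparameters are assumptions, not the settings used in the study.

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# `review_sentences` is assumed: every sentence from the fiction reviews, as raw strings
tokenized = [simple_preprocess(s) for s in review_sentences]

w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5,
               min_count=5, workers=4, epochs=10)

# Sanity check like the one described below: do critical words get sensible neighbors?
# (Assumes these words occur often enough in the corpus to be in the vocabulary.)
print(w2v.wv.most_similar("hackneyed", topn=5))
print(w2v.wv.similarity("chilling", "haunting"))
```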

I spent some time tuning Word2Vec parameters by manually examining whether they produced appropriate similarity scores for common critical words like "chilling," "hackneyed," and "sympathize." Next, I hand-labeled two thousand sentences from fifty reviews as "positive," "neutral," or "negative." I used a subset of those two thousand to train the CNN, which produced 81% accuracy on the test subset. Finally, I fed every sentence from the fiction reviews into the CNN to predict a classification for each, then recombined the sentences to produce an overall classification for each review.
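The post doesn't specify the network architecture, so the following is only one plausible sketch of a Word2Vec-initialized sentence CNN in Keras. It assumes the w2v model from the previous sketch plus hand-labeled `sentences` (token lists) and `labels` (0/1/2 for negative/neutral/positive); it illustrates the pipeline described above rather than reproducing it.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 60
vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}   # 0 reserved for padding

def encode(tokens):
    return [vocab[t] for t in tokens if t in vocab]

X = pad_sequences([encode(s) for s in sentences], maxlen=MAX_LEN)
y = np.array(labels)

# Embedding matrix initialized from Word2Vec and frozen during training
emb = np.vstack([np.zeros(w2v.vector_size), w2v.wv.vectors])

model = models.Sequential([
    layers.Embedding(emb.shape[0], emb.shape[1], weights=[emb],
                     input_length=MAX_LEN, trainable=False),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, validation_split=0.2, epochs=5, batch_size=32)

# To score a review, predict a class for each of its sentences and aggregate the results.
```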

It didn't work. On a spot check, I found accuracy to be barely better than random guessing. Accuracy improved when the inputs were restricted to sentences mentioning the author's or book's name, but it was still below 60%. That left the raw word-by-word sentiment score on filtered sentences as the best remaining method.

Data Results

Given the inaccuracies in the sentiment score, any further results need to be taken with a rock-sized grain of salt. Still, to finish the exercise, I found that these figures did not show significant differences in reviewer attitudes toward bestsellers and non-bestsellers. The bestseller median was slightly higher, but the difference was dwarfed by the standard deviation of the data. Similarly, bestsellers received slightly longer reviews, but not significantly longer.


I am not satisfied with my results. The problem of classifying Times reviews is a difficult one, and the subject will require further study and experimentation.

Lessons for Further Work

The superiority of the basic sentiment score to the neural network was surprising but logical. Sentiment scores make word-level distinctions at varying levels of intensity, while neural networks based on models like Word2Vec are limited to broad classifications of whole sentences. That is, while the CNN could only judge a sentence on the three-value range {-1, 0, 1}, the Syuzhet sentiment score could evaluate one across a potentially infinite spectrum, though in practice it was limited to real numbers in the range [-6, 6].

Regardless, the sentiment score is more transparent and more flexible, a conclusion Jockers and co-author Jodie Archer reached in their own work text-mining bestsellers.

In further work, I would pursue the following avenues:

  1. We might refine the sentiment score by producing a criticism-specific sentiment lexicon. A keyness test could be applied to the review corpus to isolate especially prominent critical words, to which criticism-specific scores could be assigned (a rough sketch of such a test follows this list). This would not solve problems like those surrounding irony, but it might better handle words that a regular sentiment analysis would misclassify (e.g., "terrifying").
  2. Given that the CNN was limited by a lack of labeled reviews, we could give it a second chance by importing lexically similar book reviews from a source that labels its reviews with a rating, e.g., the starred/unstarred Publishers Weekly reviews. This would still be difficult, because the rating would be at the review level rather than the sentence level, but it would improve the training process.
  3. Once a more satisfactory sentiment measure was devised, we could use that information (in combination with review length, time of review, etc.) to generate a predictive model for the bestsellers.
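For the first avenue, a keyness test can be as simple as Dunning's log-likelihood statistic comparing word frequencies in the review corpus against a general reference corpus. A rough sketch, with both token lists assumed:

```python
import math
from collections import Counter

def keyness(target_tokens, reference_tokens):
    """Rank words by Dunning log-likelihood: which are unusually frequent in the target?"""
    t_counts, r_counts = Counter(target_tokens), Counter(reference_tokens)
    t_total, r_total = sum(t_counts.values()), sum(r_counts.values())
    scores = {}
    for word, a in t_counts.items():
        b = r_counts.get(word, 0)
        e1 = t_total * (a + b) / (t_total + r_total)
        e2 = r_total * (a + b) / (t_total + r_total)
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        scores[word] = g2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# e.g., keyness(review_corpus_tokens, reference_corpus_tokens)[:50] would surface
# candidate critical vocabulary to score by hand.
```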

Conclusion

In some ways, the question of how to classify Times reviews is merely an intellectual problem. Still, it could provide some real insight. If we could confirm that getting a Times review is all that matters, with the review's judgment being secondary, that would have significant implications for marketing strategies. If the Times' opinion doesn't matter, it might be better simply to engage its attention, whether positively or not, rather than to try to court its favor.

*There is no good data on the actual number of new fiction titles published each year. The ProQuest division R.R. Bowker logs about 50,000 new fiction ISBNs each year, but that figure (a) double-counts titles released in multiple formats (i.e., the paperback and hardcover of the same book receive different ISBNs) and (b) under-counts digitally published work. My rough estimate comes from a back-of-napkin calculation (based on overall market share and prices) that there are 18,000 new trade paperback and e-book fiction titles each annually, plus 9,000 new hardcover and mass-market titles each.

**I used Word2Vec instead of Doc2Vec because, again, only a fraction of each review's sentences were evaluative. Using Doc2Vec would likely cause the CNN to cluster reviews based on content rather than evaluation.



