Use Neural Networks to Find the Best Words to Title Your eBook

Posted on Aug 22, 2016


The eBook business is thriving. Amazon Kindle, Apple iBookstore, and Google eBookstore all provide robust channels for publishing an eBook on any subject you can think of. Amazon generates an average of 1.07MM in paid eBook sales volume, which translates to about $5.8MM in revenue, every day.

A huge community has formed around eBooks because of their proven model for generating passive income for good writers. There are many great writers out there, so why is it sometimes difficult to generate the expected revenue in the market? The key to real success for capable, passionate writers is preparation and research. You have to know the market and current trends.

Maximizing Potential

It is not enough for writers to come up with a topic they are interested in, draft what they feel most passionate about, get published, then sit back and reap the benefits of their efforts. There are many websites that offer advice on how to get into the eBook publishing business, but it boils down to good old-fashioned market research and audience targeting to drive attention and trigger sales.

Many tools and utilities are available to help writers research their subjects of choice and understand keyword search volume. They also reveal how much competition exists for the subject areas you're writing about, what the reading public is clamoring for, and what price customers are willing to pay for an eBook.

Think about the purchasing behavior of eBook readers: they normally go to their favorite eBook store, type in the relevant keywords, and start browsing through the hundreds, if not thousands, of eBooks out there. If your eBook does not appear on the first or second page of the most relevant search results, your chances of being noticed are slim. Consumers will always go for the most relevant, most interesting, most popular, most highly rated, most favorably reviewed, and least expensive eBook they can get their hands on. Sounds daunting, right?

The eBook market is saturated with all the popular topics, so writers need to be more creative in their approach to the entire eBook publishing process. Do not write a word until you clearly understand how the market will react.

Business of eBooks

There is an interesting market psychology that drives the eBook market. The savviest writers are great marketers first, excellent wordsmiths second. If an author's intent is purely to follow a passion and write something they truly care about, with income as a second thought, then no amount of market research will sway that author from his or her prime directive.

However, if you're like most people, you want not only to write something you really care about, but also to maximize the potential for earning passive income while you're doing it. And why not? This mindset is especially prevalent now that more Baby Boomers are in or close to retirement. Many feel compelled to follow their passion more, and at the same time find ways to supplement their retirement income. Writing and publishing eBooks is one of the passive income generators that a lot of people vie for, but sadly, not many succeed.

eBook Fever

eBooks offer so much potential for passive income that many business models have been built around eBook publishing itself. Websites and courses exist to teach, mentor, and guide eBook writers in honing a craft that rolls research, marketing, and writing into one.

The most successful writers make it a point to publish a new eBook every two to three months. These savvy publishers go for volume, so it becomes a numbers game (to an extent). But they do not crank out eBooks blindly. The electronic format and the plethora of delivery channels make it convenient and efficient to produce and publish eBooks at a rate traditional book publishers have never seen. So it boils down to research and more research.

So where does one start? Right out of the gate.

Open Sesame

When potential readers are hunting for the best eBook, there are cues that can be managed to capture their attention and raise the chance of a sale. Once search results are displayed, the book's title and cover are the little-known doorways to grabbing the reader's attention. The cover design speaks visually to the consumer, while the title (and subtitle) competes among thousands of keywords and appeals to the consumer's mental and emotional state.

Let's focus on how to choose a title that makes an eBook more visible and more appealing in search results. We will focus on three main objectives: how to find the top words to title a Kindle eBook, how to find the most similar words related to the subject, and how to combine two technical approaches to fulfill those objectives.

Titles and headlines change the way we think. What you read affects what you see, and what grabs your attention changes the way you feel. The hook, line, and sinker exist in all forms of media. Part of an advertising agency's expertise is coming up with catch phrases and headline grabbers. News pieces, articles, blogs, printed and electronic books, movies, marquees, and advertisements are just a sample of where titling can make a difference.

Word Counts and the Doc2Vec Neural Network


I targeted the Amazon Kindle site as the source for an in-depth analysis of eBook inventory listings, ratings, reviews, and pricing. Code was written to dynamically scrape the entire result set for a keyword search. From there, a process tokenizes and cleans the titles into groups of words, which are counted against the rating of the eBook title they came from. A total weighted score for each word is calculated by summing the product of the rating score and the word count.
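To make the scoring concrete, here is a minimal sketch of the weighted count, assuming each title has already been tokenized and paired with its star rating. The function name `weighted_word_scores` and the sample titles are made up for illustration and are not the original code.

```python
from collections import Counter, defaultdict


def weighted_word_scores(titles_with_ratings):
    """Return {word: sum over ratings of (rating * count of word at that rating)}."""
    # rating -> Counter of how often each word appears in titles with that rating
    counts_by_rating = defaultdict(Counter)
    for tokens, rating in titles_with_ratings:
        counts_by_rating[rating].update(tokens)

    scores = Counter()
    for rating, word_counts in counts_by_rating.items():
        for word, count in word_counts.items():
            scores[word] += rating * count
    return scores


# Hypothetical sample data: (cleaned title tokens, star rating).
sample = [
    (["python", "programming", "beginners"], 4.5),
    (["python", "data", "analysis"], 4.0),
    (["learn", "programming", "fast"], 3.0),
]
scores = weighted_word_scores(sample)
top_words = [word for word, _ in scores.most_common(2)]
```

Words that appear often in highly rated titles float to the top, which is the behavior the weighted score is meant to capture.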

A second and complementary approach using Doc2Vec was used to analyze the entire eBook title listing. Doc2Vec extends the Word2Vec approach to whole documents. The objective is to convert whole documents or, in this case, bodies of text based on eBook titles into a numeric representation in a multidimensional vector space. What that means is that any title can be represented as a point in a multidimensional space. This point on its own does not offer any value, but when a cluster of these points is created, a pattern starts to emerge and relationships between the words start to form. The magic and power of Doc2Vec is its ability to mathematically create context out of words and, inversely, to produce words given a context.

Doc2Vec is based on a very thin two-layer neural network. It learns representations for words and labels simultaneously. It operates in a purely unsupervised mode and needs no labels other than an arbitrary unique ID per text example. You train it to find words that are similar (by cosine similarity) in context, based on how often words co-occur near each other. You can even pass entirely new text to the trained model and it can infer a compatible vector, and find the most similar words most likely to appear in context.
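Cosine similarity is the measure that decides which vectors count as "similar". As a small illustration, the sketch below computes it in plain Python for toy vectors; the vectors and the function name are invented for the example and have nothing to do with a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing the same way score close to 1 regardless of length, and unrelated (orthogonal) vectors score 0.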

The approach is to use the weighted word frequency count to get the most relevant words from the Amazon Kindle site, then intersect that word list with the result set from the Doc2Vec neural network. The intersected result is sorted by the relevancy score from the Amazon Kindle site to produce the recommended list of words for an eBook title that could increase the likelihood of being noticed and getting the sale. This serves two purposes. First, it keeps the list's relevancy high because it consistently relies on Amazon Kindle's inherent scoring mechanism. Second, the most similar words that the Doc2Vec neural network learned from the entire Amazon Kindle title listing are likely to be related search words. Using these words raises the likelihood that the eBook title will appear on the first two to three pages of Amazon's search results.

Technical Implementation


The initial process is all about the scraping strategy and approach. Python was the language, and the BeautifulSoup library took care of the web scraping. A fully object-oriented approach was implemented for modularity, reuse, componentization, and abstraction. The ReviewCorpus class pre-processes incoming titles. This class has four methods.

This class method removes the higher-order non-ASCII characters. It is fairly common to get these characters, and they detract from accurately processing text into a high-dimensional vector space.
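One simple way to drop non-ASCII characters is an encode and decode round trip, sketched below. The function name `remove_non_ascii` and the sample title are illustrative assumptions rather than the original method.

```python
def remove_non_ascii(text):
    """Drop every character outside the 7-bit ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")


cleaned = remove_non_ascii("Café Guide™ 2016")
```

Note that this silently deletes accented letters rather than transliterating them, which is acceptable for keyword counting but lossy for display.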

These class methods use the Python nltk library functions to tokenize the titles, remove stop words, and take out all punctuation.
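The cleaning steps can be sketched as follows. To keep the example free of corpus downloads, it swaps nltk for a regular-expression tokenizer and a tiny hand-picked stop-word list, so treat `STOP_WORDS` and `clean_title` as stand-ins for the nltk-based methods, not a reproduction of them.

```python
import re

# Tiny illustrative stop-word list; nltk's English list is much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "for", "in", "on", "your", "with"}


def clean_title(title):
    """Lowercase a title, split it into word tokens, and drop stop words.

    Punctuation disappears because the pattern only matches letters and digits.
    """
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [token for token in tokens if token not in STOP_WORDS]


tokens = clean_title("The Complete Guide to Python: Learn Fast!")
```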

The next class method named .add then creates a dictionary that maps each title with a count against each rating category.

The AmazonKindle class scrapes the Amazon Kindle site dynamically. This code needs to be very robust, as the search results from Amazon can run to a maximum of 400 pages, and each page can have 15 to 20 title listings with metadata (titles, ratings, number of comments, prices).

The first method of the AmazonKindle class initializes the query string and scrapes the maximum page number from the Amazon result page. This tells the looping code how deep it needs to go to scrape all the titles.

The buildURL method is then used to formulate the URL to scrape. Since Amazon generates this dynamically, I had to get the URL down to a reproducible format and protocol.
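A reproducible URL can be assembled from a base address plus encoded query parameters. In the sketch below, the base URL and the parameter names `k` and `page` are placeholders chosen for illustration; they are not confirmed to match the format Amazon used, and `build_url` is a hypothetical stand-in for buildURL.

```python
from urllib.parse import urlencode

# Placeholder base URL; the real search endpoint is not given in this post.
BASE_URL = "https://www.example.com/s"


def build_url(keywords, page):
    """Return a search URL for the given keywords and result page number."""
    query = urlencode({"k": keywords, "page": page})
    return f"{BASE_URL}?{query}"


url = build_url("python programming", 3)
```

Because `urlencode` escapes spaces and special characters, the same keywords always produce the same URL.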

The retrieveSource method is the main module for collecting the entire HTML byte stream. The request header needs to identify the scraper as a browser client, as Amazon does not allow scraping bots on its site.
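Setting a browser-style header can be sketched with the standard library's `urllib.request`. The User-Agent string and the function names are assumptions for illustration, and the example only builds the request; `retrieve_source` is defined but not called.

```python
import urllib.request

# Example browser-style User-Agent string (an assumption, not the original value).
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"


def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def retrieve_source(url):
    """Fetch the page and return the raw HTML bytes."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()


request = build_request("https://www.example.com/")
```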

The main method is processRecord, which parses the HTML content for the essential data and stores the result in memory.

Lastly, the AmazonKindle class is designed as a class iterator to pass back a generator result set for further processing.
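The iterator idea can be sketched with a generator-based `__iter__`. The class below is a hypothetical stand-in, not the AmazonKindle class itself: it takes a `fetch_page` callable so the example can run on fake in-memory pages instead of live requests.

```python
class KindleResults:
    """Hypothetical stand-in showing the iterator pattern only."""

    def __init__(self, fetch_page, max_page):
        self.fetch_page = fetch_page  # callable: page number -> list of records
        self.max_page = max_page

    def __iter__(self):
        # Walk every result page and hand back one record at a time.
        for page_number in range(1, self.max_page + 1):
            for record in self.fetch_page(page_number):
                yield record


# Fake pages standing in for scraped results.
fake_pages = {
    1: [{"title": "Learn Python Fast", "rating": 4.5},
        {"title": "Python Data Analysis", "rating": 4.0}],
    2: [{"title": "Programming for Beginners", "rating": 3.0}],
}
records = list(KindleResults(fake_pages.get, max_page=2))
```

Yielding records lazily means the caller can start cleaning and counting titles before every page has been fetched.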

After the two classes above are created, a Python application brings it all together. The main application passes the parameters to the processing classes, cleans all text as it is collected, gathers all the metadata, stores it as vectors, then writes it all out to CSV files for further processing by the Doc2Vec procedure.
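The CSV step might look like the sketch below, which uses the standard `csv` module. The column names mirror the metadata listed earlier (titles, ratings, number of comments, prices), but the exact file layout is an assumption, as is the `write_records` helper.

```python
import csv
import io

FIELDS = ["title", "rating", "reviews", "price"]  # assumed column layout


def write_records(records, handle):
    """Write scraped records to any writable text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_records(
    [{"title": "Learn Python Fast", "rating": 4.5, "reviews": 120, "price": 2.99}],
    buffer,
)
lines = buffer.getvalue().splitlines()
```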

The next step is to vectorize the entire title list and train the neural network on the entire corpus. The Doc2Vec function is called with the right hyper-parameters. How the hyper-parameters are set is the most critical step in the vectorization, model build-out, and training.

An inference engine is called to get the most similar set of words from the vectorized cloud. The parameters need to be passed in manually for this version, but can be abstracted easily as appropriate.

Finally, the data sets are converted to sets, and the set intersection function is called to obtain the final recommended list of words best used for formulating the eBook title.
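The final step reduces to a set intersection followed by a sort on the weighted score. The sketch below assumes the weighted scores are held in a dictionary and the Doc2Vec output is a plain list of words; `recommend_words` and the sample values are hypothetical.

```python
def recommend_words(weighted_scores, similar_words):
    """Keep words found by both approaches, highest weighted score first."""
    common = set(weighted_scores) & set(similar_words)
    return sorted(common, key=lambda word: weighted_scores[word], reverse=True)


# Hypothetical inputs from the two approaches.
weighted_scores = {"python": 8.5, "programming": 7.5, "data": 4.0, "fast": 3.0}
similar_words = ["coding", "programming", "python", "fast"]

recommended = recommend_words(weighted_scores, similar_words)
```

Words that only one approach surfaces ("data" and "coding" here) are dropped, so the final list contains only terms that are both highly rated and contextually related.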

Here are some examples of the output using the weighted word frequency count and the Doc2Vec neural network, tried with various keyword combinations, with the final result sorted using Amazon's relevancy score.





Future Next Steps

It would be an interesting exercise to use the recommended bag of words to come up with final title recommendations. I believe a deep learning model built with Keras on Theano could formulate human-readable titles using natural language generation.

by Bernard Ong, Data Argonaut
Cell: 201-916-5241

