Data Analysis on the Formula of a Hallmark Holiday Movie
The skills I demoed here can be learned through taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.
As the holidays approach, many of us eagerly await a new crop of Hallmark Holiday movies: positive, reassuring, brightly-lit confections that are as sweet and reliable as gingerbread. Part of their appeal comes from a certain implicit formula -- a woman in a stressful big-city job goes home for the holidays, falls for a local working man, and realizes she's been missing out on life.
Small towns, evil corporations, a wise older woman ... there are recurring motifs that made me wonder if I could apply machine learning to the plots of these movies to understand (1) what the formulas are; and (2) if a computer could write a Hallmark movie (for extra credit).
Natural-language generation (NLG) has been tried on Christmas movies before, and the results were quite funny. Perhaps I could do better.
My initial hypothesis was that there are a certain (unknown) number of plot types that make up the Hallmark Holiday movie universe, and that these types could be determined from a structured analysis of plot summaries. Because the stories seemed formulaic, they could potentially be auto-generated using NLG methods, which were new to me.
Assembling the Data Set
Step one was to pull a list of titles. The Hallmark Channel has been running original movies with a Christmas theme for a quarter century, although the rate of production skyrocketed in 2015 as they became very popular. Pandemic production issues slowed the pipeline slightly in 2020, but the pace remains rapid.
Although 1995-2010 saw fewer than five original titles a year, the years 2018-2021 saw almost 40 each year. It's quite a pipeline.
Luckily, there is a list of original Hallmark production titles on Wikipedia, which I was able to scrape using Scrapy. Holiday movies aren't distinguished from others, so there was some manual selection in cutting the list. Once I had my titles, I was able to use the API for The Movie Database project (TMDB), which maintains information about films and TV shows, to pull the 'official' plot summaries.
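The TMDB step can be sketched in a few lines using only the standard library. This is illustrative rather than my exact script: the `api_key` value is a placeholder, and `build_search_url`/`extract_overview` are hypothetical helper names, but TMDB's real `/search/movie` endpoint does return an `overview` field per result.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TMDB_SEARCH = "https://api.themoviedb.org/3/search/movie"

def build_search_url(title: str, api_key: str) -> str:
    """Build a TMDB movie-search URL for one title."""
    return TMDB_SEARCH + "?" + urlencode({"api_key": api_key, "query": title})

def extract_overview(response_json: dict) -> str:
    """Pull the plot summary ('overview') from the first search result."""
    results = response_json.get("results", [])
    return results[0].get("overview", "") if results else ""

def fetch_overview(title: str, api_key: str) -> str:
    """Network call: search TMDB for a title, return its plot summary."""
    with urlopen(build_search_url(title, api_key)) as resp:
        return extract_overview(json.load(resp))
```

Looping this over the scraped title list yields the corpus of summaries.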
There were 260 plot summaries in my corpus. The summaries ranged in length and style, and their lack of standardization and detail caused some challenges in the analysis. However, short of watching all the movies and building my own summaries, the TMDB summaries (which were provided by the network, I assume) were my data.
My intended audience was writers, producers and TV execs who want to understand how the Hallmark Holiday genre works and the elements of a successful production slate. These popular movies could also be used to inform other narrative projects with similar valence.
Of the 260 summaries, all but two were Christmas movies. Many summaries were disappointingly brief and generic, but many were better than that. There were about 15,000 words in total in the final data set.
A typical example:
"Lauren Gabriel leaves everything behind in Boston to embark on a new chapter in her life and career. But an unforeseen detour to the charming town of Grandon Falls has her discover unexpected new chapters - of the heart and of family - helping her to embrace, once again, the magic of Christmas."
Over the years, the stories and themes of the Hallmark Holiday films changed, as the network nosed around and then settled on a set of tested tropes. For example, earlier films more often used Santa as a character and spirits as guides for the heroine. By 2015 or so, Hallmark had found its soul: small towns, high-school boyfriends, family businesses threatened by big corporations, and so on.
Data on Feature Engineering
After lemmatizing and tokenizing, removing stopwords, and other standard text preprocessing, I realized that the corpus would have to be standardized to gain insight into its themes and to provide training data for any NLG model. For example, the summaries had names for characters, but those names didn't matter to me - I just cared that a character was <MALE> or <FEMALE> (for the main characters), or <CHILD> or <SIBLING> or <PARENT> or <GRANDPARENT> with respect to the main character. Often there was also a <BOSS>.
(If you're curious, the most common names for characters mentioned in the corpus were: Jack, Nick and Chris.)
Likewise, towns were often named, but my only interest was that it was a <SMALLTOWN>, or (in the case of those bustling metropolises our heroines liked to leave in the beginning of the story) <BIGCITY>. And the evil big corporation might be named, but I wanted to tokenize it as <BIGCORP>.
Note the <BRACKETS> which would indicate to the model that these were tokens rather than the words originally in the corpus. How to make the substitutions without a lot of manual lookups?
I ended up using spaCy to tag the parts of speech. Although it requires some compute cycles, spaCy is a great NLP library that will tag each word by its part of speech and recognize named entities, including place names and personal names. The tags themselves are then accessible to a Python script as part of a dictionary-lookup substitution.
In the case of character names, I was able to tag them using spaCy and then run them through Genderize to get a likely gender. This doesn't always work, as viewers of "It's Pat" on Saturday Night Live know, but a quick scan let me correct mistakes.
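A minimal sketch of that gender lookup, assuming the free genderize.io endpoint (which returns JSON like `{"name": "jack", "gender": "male", ...}`) and a hypothetical `name_to_token` helper for the placeholder mapping:

```python
import json
from urllib.request import urlopen

def guess_gender(name: str) -> str:
    """Ask genderize.io for the likely gender of a first name."""
    with urlopen("https://api.genderize.io/?name=" + name.lower()) as resp:
        data = json.load(resp)
    return data.get("gender") or "unknown"   # API returns null for unknowns

def name_to_token(gender: str) -> str:
    """Map a guessed gender to the placeholder used in the corpus."""
    return {"male": "<MALE>", "female": "<FEMALE>"}.get(gender, "<PERSON>")
```

Ambiguous names come back null or with low probability, which is where the manual scan comes in.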
I could also automate much of the <TOKENIZATION> using dictionary substitutions. For example, I could find instances of "L.A." and "Los Angeles" and "New York City" and so on and substitute <BIGCITY>. However, a careful manual check was needed to do some cleanup.
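Both substitution passes can be sketched without any external dependencies. The lookup table below is a small hypothetical sample, and `replace_entities` assumes spaCy-style `(start, end, label)` character spans (as you would build from `doc.ents`):

```python
# Hypothetical lookup table: surface forms -> placeholder tokens
CITY_FORMS = {"L.A.": "<BIGCITY>", "Los Angeles": "<BIGCITY>",
              "New York City": "<BIGCITY>", "Boston": "<BIGCITY>"}

def substitute_known(text: str) -> str:
    """Dictionary pass: swap known city names for the <BIGCITY> token."""
    for form, token in CITY_FORMS.items():
        text = text.replace(form, token)
    return text

def replace_entities(text, ents, label_map):
    """Entity pass: replace (start, end, label) spans -- e.g. built from
    spaCy's doc.ents -- working right to left so offsets stay valid."""
    for start, end, label in sorted(ents, key=lambda e: e[0], reverse=True):
        if label in label_map:
            text = text[:start] + label_map[label] + text[end:]
    return text
```

The manual cleanup then only has to catch what neither pass recognized.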
In the end, I had a corpus of 260 plots with major character, location and relationship types <TOKENIZED>.
Frequency Data Analysis
Word frequencies were high for terms such as 'family', 'child', 'help', 'love', 'parent', and 'small town'. This agreed with my personal memories of the films -- i.e., an abiding emphasis on families, home towns, and positive mojo.
Bigrams and trigrams (common two- and three-word combinations) uncovered even more of the Hallmark spirit than single-word frequencies. Among bigrams, the most common were 'high school', 'fall love', and 'return hometown'. Common trigrams were 'high school sweetheart', 'old high school' and 'miracle really happens'.
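Counting n-grams needs nothing beyond the standard library; a sketch on a toy token list (the real input was the preprocessed, tokenized corpus):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams (adjacent n-token windows) in a token list."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Toy corpus, already lowercased, lemmatized, and stopword-stripped
tokens = ("return hometown high school sweetheart "
          "fall love high school sweetheart").split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
```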
It is possible just to look at the common trigrams and get a very good feel for the alternate reality that is the mini-metaverse of Hallmark Holiday films.
The heart of my NLP analysis consisted of LDA topic modeling, using the Gensim library. Latent Dirichlet Allocation (LDA) is a statistical method that takes a group of documents (in our case, plot summaries) and models them as a group of topics, with each word in a document attached to a topic. It finds terms that frequently appear together and groups them into "topics", which can be present to a greater or lesser degree in each particular document.
Data on Topic Modeling
LDA is often used for categorizing technical and legal documents, but I thought it could also be used to find the different holiday themes I detected in the plot summaries.
First, I did a grid search for parameters using 'coherence score' as the target variable to maximize. The purpose of this search was to find a likely number of distinct topics, or plot types. I guessed there were 5-10, and this hyperparameter tuning exercise indicated that 8 topics was the best fit.
Training the topic model on the plot summaries, I generated 7-8 distinct topics, with some overlap in words, as expected. These topics were analyzed using pyLDAvis, which allows for interactively probing the topics and changing some parameters to make them easier to interpret. (Figure 4 shows the pyLDAvis interactive view.)
Here, some manual work -- call it 'domain knowledge' (e.g., watching the movies) -- was needed. I tagged the plots with the topics and focused on those that clearly fell into one topic or another. I then came up with a rough summary of these plots and gave that 'theme' a name. The manual tagging was needed because the theme name itself often didn't actually appear in the summaries.
The 8 Types of Hallmark Holiday Movies
The 8 themes I ended up identifying, along with my own names and sketches, were:
- SETBACK: Disappointed in work/love, a woman moves to a small town to heal/inherit
- BOSS: A cynical businessman hires a spunky woman for holiday-related reason (like planning a party)
- MIXUP: A travel mixup/storm forces some incompatible people to work together
- ALT-LIFE: A wish upon Santa/spirit is granted and a woman is shown an alternative life -- often, this involves time travel
- TAKEOVER: A big corporation threatens a family-run business in a small town
- RIVALS: Two seemingly incompatible rivals are forced to work together for some goal
- IMPOSTER: Dramatic irony: Someone lies about who they are -- or gets amnesia and doesn't know who they are
- FAMILY/CRISIS: A woman is forced to return home because of a family crisis
As usual with LDA, there was some overlap among the themes. In particular, SETBACK (#1) often co-occurred with others; it starts the story moving. For example, the heroine might suffer a SETBACK at work that encourages her to go back home (#1), and she encounters a MIXUP on the way (#3) that lands her in a delightful small town (this is the plot of "Christmas Town").
Interestingly, when I looked at the distribution of themes over the course of the Hallmark seasons, they were fairly evenly present. This made me think the producers at the network are well aware of these themes and seek to balance them to avoid repetition.
Text Generation Using Markov Chains, LSTM and GPT-2
As an experiment, I looked at three different methods of generating text, the idea being to use the plots as training data for a model that would generate an original plot in the style of the others. Text generation or NLG is an emerging field that has made amazing strides in recent years - as I discovered - and has developed uncanny capabilities. I was only able to touch the surface in my work.
I began with traditional text generation methods, which were hardly magical.
Markov Chains were the most intuitive: they predict each next word based on the distribution of words that follow the current word in the training data. Because the model works word by word (not on larger chunks of text) -- at least, the way I implemented it -- the results were coherent only across very short sequences. Overall, it couldn't put together sentences or stories that made sense.
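The word-by-word scheme fits in a few lines of plain Python. A toy corpus stands in for the tokenized summaries here, and the function names are my own:

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=10, seed=0):
    """Walk the chain, sampling each next word from observed followers."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        followers = chain.get(words[-1])
        if not followers:  # dead end: the word never had a successor
            break
        words.append(rng.choice(followers))
    return " ".join(words)

tokens = "a woman returns home for the holidays and a woman falls in love".split()
chain = build_chain(tokens)
```

Because each step only sees one word of context, the output drifts quickly, which matches the incoherence described above.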
Figure 6 shows a few examples of text generated in this way.
Long Short-Term Memory (LSTM) is a form of recurrent neural network (RNN). LSTMs were created to solve the RNN's long-term memory problem: RNNs tend to forget earlier parts of a sequence (e.g., of text) due to vanishing gradients. Like Markov chains, they make predictions one token at a time, but based on weights learned in the training stage.
Training was done over 6 epochs on 100-character sequences using categorical cross-entropy as the loss function. It took about two hours on my average-powered setup, so it's time-intensive. Longer training might have improved the disappointing results. (See Figure 7.)
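For concreteness, here is a minimal character-level LSTM sketch in PyTorch. The layer sizes and the 60-symbol vocabulary are hypothetical, the batch is random noise, and the training loop is omitted; it just shows the shape of the next-character prediction task described above.

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Minimal character-level LSTM language model: embed each character,
    run the sequence through an LSTM, and predict the next character."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        return self.head(out)   # logits over the next character

vocab_size = 60                                  # rough alphabet size
model = CharLSTM(vocab_size)
batch = torch.randint(0, vocab_size, (4, 100))   # 4 sequences of 100 chars
logits = model(batch)
# Next-character targets: each position predicts the following character
loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size),
                             batch.roll(-1, dims=1).reshape(-1))
```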
Frankly, LSTM was a misfire. It required a great deal of training, and although I did train for a few hours, my results were coherent only for a short stretch (half a sentence) of text. More training might have helped, but I was more interested in moving on to transformers, the current state of the art for NLG.
GPT-2 is an open-source version of OpenAI's transformer models. It's pretrained on vast amounts of text, giving it a very good base model of English. (GPT-3 -- which is much better at NLG -- is not open source, and I could not get access.) By fine-tuning GPT-2 on the plots, I was able to 'direct' it toward my particular genre. The results were much more coherent than the other methods', while still falling short of useful new plots. (See Figure 8.)
To implement this, I used the Transformers library provided by Hugging Face on top of PyTorch, pretrained on data from the web. I trained the model for 50 epochs in batches of 32-character sequences (about 6 words).
Clearly, transformers are the way forward with NLG. GPT-3 has generated a lot of excitement in the past year or so, and its ability to create human-readable text that is original in a wide number of genres is astonishing. The state of the art could create a Hallmark movie plot already, and this tool will only get better as GPT-4 and other transformer models appear.
My hypothesis that Hallmark holiday movies tend to cluster around a set of common plots was validated. Specifically, I found:
- Hallmark Holiday movies have a consistent set of themes: small towns, families, career setbacks, old boyfriends, spirits and wishes
- Analyzing the text required standardization to avoid missing themes: man/woman/small town, etc.
- LDA topic modeling worked fairly well in identifying 7-8 key topics, with some overlap
- NLG yielded inconsistent results, with the pre-trained transformer model living up to its reputation as a leap forward
An additional analysis I'd like to do is to examine the plots as time series: a plot is a sequence of events that happen in order, and adding that step-by-step flow would be an intriguing exercise.
Have a great holiday -- and enjoy the movies!