NYC Data Science Academy| Blog
Bootcamps
Lifetime Job Support Available Financing Available
Bootcamps
Data Science with Machine Learning Flagship ๐Ÿ† Data Analytics Bootcamp Artificial Intelligence Bootcamp New Release ๐ŸŽ‰
Free Lesson
Intro to Data Science New Release ๐ŸŽ‰
Find Inspiration
Find Alumni with Similar Background
Job Outlook
Occupational Outlook Graduate Outcomes Must See ๐Ÿ”ฅ
Alumni
Success Stories Testimonials Alumni Directory Alumni Exclusive Study Program
Courses
View Bundled Courses
Financing Available
Bootcamp Prep Popular ๐Ÿ”ฅ Data Science Mastery Data Science Launchpad with Python View AI Courses Generative AI for Everyone New ๐ŸŽ‰ Generative AI for Finance New ๐ŸŽ‰ Generative AI for Marketing New ๐ŸŽ‰
Bundle Up
Learn More and Save More
Combination of data science courses.
View Data Science Courses
Beginner
Introductory Python
Intermediate
Data Science Python: Data Analysis and Visualization Popular ๐Ÿ”ฅ Data Science R: Data Analysis and Visualization
Advanced
Data Science Python: Machine Learning Popular ๐Ÿ”ฅ Data Science R: Machine Learning Designing and Implementing Production MLOps New ๐ŸŽ‰ Natural Language Processing for Production (NLP) New ๐ŸŽ‰
Find Inspiration
Get Course Recommendation Must Try ๐Ÿ’Ž An Ultimate Guide to Become a Data Scientist
For Companies
For Companies
Corporate Offerings Hiring Partners Candidate Portfolio Hire Our Graduates
Students Work
Students Work
All Posts Capstone Data Visualization Machine Learning Python Projects R Projects
Tutorials
About
About
About Us Accreditation Contact Us Join Us FAQ Webinars Subscription An Ultimate Guide to
Become a Data Scientist
    Login
NYC Data Science Acedemy
Bootcamps
Courses
Students Work
About
Bootcamps
Bootcamps
Data Science with Machine Learning Flagship
Data Analytics Bootcamp
Artificial Intelligence Bootcamp New Release ๐ŸŽ‰
Free Lessons
Intro to Data Science New Release ๐ŸŽ‰
Find Inspiration
Find Alumni with Similar Background
Job Outlook
Occupational Outlook
Graduate Outcomes Must See ๐Ÿ”ฅ
Alumni
Success Stories
Testimonials
Alumni Directory
Alumni Exclusive Study Program
Courses
Bundles
financing available
View All Bundles
Bootcamp Prep
Data Science Mastery
Data Science Launchpad with Python NEW!
View AI Courses
Generative AI for Everyone
Generative AI for Finance
Generative AI for Marketing
View Data Science Courses
View All Professional Development Courses
Beginner
Introductory Python
Intermediate
Python: Data Analysis and Visualization
R: Data Analysis and Visualization
Advanced
Python: Machine Learning
R: Machine Learning
Designing and Implementing Production MLOps
Natural Language Processing for Production (NLP)
For Companies
Corporate Offerings
Hiring Partners
Candidate Portfolio
Hire Our Graduates
Students Work
All Posts
Capstone
Data Visualization
Machine Learning
Python Projects
R Projects
About
Accreditation
About Us
Contact Us
Join Us
FAQ
Webinars
Subscription
An Ultimate Guide to Become a Data Scientist
Tutorials
Data Analytics
  • Learn Pandas
  • Learn NumPy
  • Learn SciPy
  • Learn Matplotlib
Machine Learning
  • Boosting
  • Random Forest
  • Linear Regression
  • Decision Tree
  • PCA
Interview by Companies
  • JPMC
  • Google
  • Facebook
Artificial Intelligence
  • Learn Generative AI
  • Learn ChatGPT-3.5
  • Learn ChatGPT-4
  • Learn Google Bard
Coding
  • Learn Python
  • Learn SQL
  • Learn MySQL
  • Learn NoSQL
  • Learn PySpark
  • Learn PyTorch
Interview Questions
  • Python Hard
  • R Easy
  • R Hard
  • SQL Easy
  • SQL Hard
  • Python Easy
Data Science Blog > Alumni > Data Analysis on the Formula of a Hallmark Holiday Movie

Data Analysis on the Formula of a Hallmark Holiday Movie

Martin Kihn
Posted on Sep 27, 2021
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Background

As the holidays approach, many of us eagerly await a new crop of Hallmark Holiday movies: positive, reassuring, brightly-lit confections that are as sweet and reliable as gingerbread. Part of their appeal is formed from utilizing data to create a certain implicit formula -- a woman in a stressful big city job goes home for the holidays, falls for a local working man, and realizes she's missing out on life.

Small towns, evil corporations, a wise older woman ... there are recurring motifs that made me wonder if I could apply machine learning to the plots of these movies to understand (1) what the formulas are; and (2) if a computer could write a Hallmark movie (for extra credit).

NLG has been tried on Christmas movies before, and the results were quite funny. Perhaps I could do better.

My initial hypothesis was that there are a certain (unknown) number of plot types that make up the Hallmark Holiday movie universe, and that these types could be determined from a structured analysis of plot summaries. Because the stories seemed formulaic, they could potentially be auto-generated using natural-language generation (NLG) methods, which were new to me.

Assembling the Data Set

Step one was to pull a list of titles. The Hallmark Channel has been running original movies with a Christmas theme for a quarter century, although the rate of production skyrocketed in 2015 as they became very popular. Pandemic production issues slowed the pipeline slightly in 2020, but the pace remains rapid.

Although 1995-2010 saw fewer than five original titles a year, the years 2018-2021 saw almost 40 each year. It's quite a pipeline.

Data Analysis on the Formula of a Hallmark Holiday Movie

Luckily, there is a list of original Hallmark production titles on Wikipedia, which I was able to scrape using Scrapy. Holiday movies aren't distinguished from others, so there was some manual selection in cutting the list. Once I had my titles, I was able to use the API for The Movie Database project (TMDB), which maintains information about films and TV shows, to pull the 'official' plot summaries.

There were 260 plot summaries in my corpus. The summaries ranged in length and style, and their lack of standardization and detail caused some challenges in the analysis. However, short of watching all the movies and building my own summaries, the TMDB summaries (which were provided by the network, I assume) were my data.

My intended audience was writers, producers and TV execs who want to understand how the Hallmark Holiday genre works and the elements of a successful production slate. These popular movies could also be used to inform other narrative projects with similar valence.

Of the 260 summaries, all but two were Christmas movies. Many summaries were disappointingly brief and generic, but many were better than that. There were about 15,000 words in total in the final data set.

Examples

Here's a typical example of a summary for the film "Christmas Town," starring the adorable Hallmark-ubiquitous Candace Cameron Bure:

Lauren Gabriel leaves everything behind in Boston to embark on a new chapter in her life and career. But an unforeseen detour to the charming town of Grandon Falls has her discover unexpected new chapters - of the heart and of family - helping her to embrace, once again, the magic of Christmas.

Over the years, the stories and themes of the Hallmark Holiday films changed, as the network nosed around and then settled on a set of typed tropes. For example, earlier films used Santa as a character more often and spirits as worthy guides for the heroine. By 2015 or so, Hallmark had found its soul: small towns, high-school boyfriends, family businesses threatened by big corps, and so on.

Data on Feature Engineering

After lemmatizing and tokening, removing stopwords and other standard text preprocessing, I realized that the corpus would have to be standardized to gain insight into its themes and to provide training data for any NLG model. For example, the summaries had names for characters, but those names didn't matter to me - I just cared that it was <MALE> or <FEMALE> (for the main characters), or <CHILD> or <SIBLING> or <PARENT> or <GRANDPARENT> with respect to the main character. Often there was also a <BOSS>.

(If you're curious, the most common names for characters mentioned in the corpus were: Jack, Nick and Chris.)

Other Algorithms

Likewise, towns were often named, but my only interest was that it was a <SMALLTOWN>, or (in the case of those bustling metropolises our heroines liked to leave in the beginning of the story) <BIGCITY>. And the evil big corporation might be named, but I wanted to tokenize it as <BIGCORP>.

Note the <BRACKETS> which would indicate to the model that these were tokens rather than the words originally in the corpus. How to make the substitutions without a lot of manual lookups?

I ended up using Spacy to tag the parts of speech. Although it requires some computer cycles, Spacy is a great NLP library that will tag each word by its part of speech, including place names, personal names and proper nouns. The tags themselves are then accessible to a Python script as part of a dictionary-lookup substitution.

In the case of character names, I was able to tag them using Spacy and then run them through Genderize to get a likely gender. This doesn't always work, as viewers of "It's Pat" on Saturday Night Live know, but a quick scan let me correct mistakes.

I could also automate much of the <TOKENIZATION> using dictionary substitutions. For example, I could find instances of "L.A." and "Los Angeles" and "New York City" and so on and substitute <BIGCITY>. However, a careful manual check was needed to do some cleanup.

In the end, I had a corpus of 260 plots with major character, location and relationship types <TOKENIZED>.

Frequency Data Analysis

Word frequencies were high for terms such as 'family', 'child', 'help, 'love', 'parent', 'small town'. This agreed with my personal memories of the films -- i.e., an abiding emphasis on families, home towns, and positive mojo.

Bigrams and trigrams (common two- and three-letter combos) uncovered even more of the Hallmark spirit than word frequencies. Among bigrams, the most common were 'high school', 'fall love', and 'return hometown'. Common trigrams were 'high school sweetheart', 'old high school' and 'miracle really happens'.

It is possible just to look at the common trigrams and get a very good feel for the alternate reality that is the mini-metaverse of Hallmark Holiday films.

Data Analysis on the Formula of a Hallmark Holiday Movie

Data Analysis on the Formula of a Hallmark Holiday Movie

The heart of my NLP analysis consisted of LDA topic modeling, using the Gensim library. Latent Dirichlet Allocation (LDA) is a statistical method that takes a group of documents (in our case, plot summaries) and models them as a group of topics, with each word in the document attached to a topic. It finds terms that appear together (frequency) and groups them into "topics" which can be present to a greater or lesser degree in each particular document.

Data on Topic Modeling

Often used for categorizing technical and legal documents, I thought it could be used to find the different holiday themes I detected in the plot summaries.

First, I did a grid search for parameters using "coherence score' as the target variable to maximize. The purpose of this search was to find a likely number of distinct topics, or plot types. I guessed there were 5-10, and this hyperparameter tuning exercise indicated that 8 topics appeared to be the most likely best fit.

Training the topic model on the plot summaries, I generated 7-8 distinct topics, with some overlap in words, as expected. These topics were analyzed using pyLDAvis, which allows for interactively probing the topics and changing some parameters to make them easier to interpret. (Figure 4 shows the pyLDAvis interactive view.)

Here some manual work -- call it 'domain knowledge' (e.g., watching the movies) -- was needed. I tagged the plots with the topics and focused on those that clearly fell into one topic or another. I then came up with a rough summary of these plots and gave that 'theme' a name. The manual tagging was needed because the theme name itself often didn't actually apear in the summaries.

The 8 Types of Hallmark Holiday Movies

The 8 themes I ended up identifying, along with my own names and sketches, were:

  1. SETBACK: Disappointed in work/love, a woman moves to a small town to heal/inherit
  2. BOSS: A cynical businessman hires a spunky woman for holiday-related reason (like planning a party)
  3. MIXUP: A travel mixup/storm forces some incompatible people to work together
  4. ALT-LIFE: A wish upon Santa/spirit is granted and a woman is shown an alternative life -- often, this involves time travel
  5. TAKEOVER: A big corporation threatens a family-run business in a small town
  6. RIVALS: Two seemingly incompatible rivals are forced to work together for some goal
  7. IMPOSTER: Dramatic irony: Someone lies about who they are -- or gets amnesia and doesn't know who they are
  8. FAMILY/CRISIS: A woman is forced to return home because of a family crisis

As usual with LDA, there was some overlap among the themes. In particular, #1 co-occured with others often; it started the story moving. For example, the heroine might suffer a SETBACK at work which encourages her to go back home (#1), and she encounters a MIXUP on the way (#3) that lands her in a delightful small town (this is the plot of "Christmas Town").

Interestingly, when I looked at the distribution of themes over the course of the Hallmark seasons, they were fairly evenly present. This made me think the producers at the network are well aware of these themes and seek to balance them to avoid repetition.

Text Generation Using Markov Chains, LSTM and GPT-2

As an experiment, I looked at three different methods of generating text, the idea being to use the plots as training data for a model that would generate an original plot in the style of the others. Text generation or NLG is an emerging field that has made amazing strides in recent years - as I discovered - and has developed uncanny capabilities. I was only able to touch the surface in my work.

I began with traditional text generation methods, which were hardly magical.

Markov Chains

Markov Chains were the most intuitive: they use the corpus and predict the next word (word-by-word) based on a distribution of the next words seen in the training data. Because it's at the word-by-word level (not larger chunks of text) -- at least, the way I implemented it -- the results were coherent only in very small sequences. Overall, it didn't work to put together sentences or stories that made sense.

Figure 6 shows a few examples of text generated in this way.

LSTM

Long Short-Term Memory (LSTM) is a form of recurrent neural network (RNN) AI model. They were created as a way to solve RNN's long-term memory problem, as RNN's tend to forget earlier parts of a sequence (e.g., of text) due to a vanishing gradient. They also make predictions at the word level based on weights derived in the training stage.

Training was done over 6 epochs and 100-character sequences using 'categorical cross-entropy' as the loss function. It took about two hours on my average-powered setup, so it's time-intensive. Longer training would improve disappointing results. (See Figure 7.)

Frankly, LSTM was a misfire. It required a great deal of training and although I did train for a few hours, my results were coherent only for a short (half-sentence) of text. More training might have helped, but I was more interested in moving on the 'transformers', the current state of the art for NLG.

GPT-2

GPT-2 -- this is an open source version of the OpenAI transformers models. It's pretrained on vast amounts of text data, giving it a very good basic model of English text. (GPT-3 -- which is much better at NLG -- is not available open source and I could not get access.) Training GPT-2 using the plots, I was able to 'direct' it toward my particular genre. The results were much more coherent than the other methods, while still falling short of useful new plots. (See Figure 8.)

To implement, I used the Transformers library provided by Huggingface/PyTorch, pretrained on data from the web. I trained the model for 50 epochs in batches of 32 characters (about 6 words).

Clearly, transformers are the way forward with NLG. GPT-3 has generated a lot of excitement in the past year or so, and its ability to create human-readable text that is original in a wide number of genres is astonishing. The state of the art could create a Hallmark movies plot already, and this tool will only get better as GPT-4 and other transformer models appear.

Conclusions

My hypothesis that Hallmark holiday movies tend to cluster around a set of common plots was validated. Specifically, I found:

  1. Hallmark Holiday movies have a consistent set of themes: small towns, families, career setbacks, old boyfriends, spirits and wishes
  2. Analyzing the text required standardization to avoid missing themes: man/woman/small town, etc.
  3. LDA topic modeling worked fairly well in identifying 7-8 key topics, with some overlap
  4. NLG yielded inconsistent results, with transformers pre-trained model living up to its reputation as a leap forward

Additional analyses I'd like to do would be to examine 'plots' as a time series. They are a sequence of events that happen in order. Adding the step-by-step flow would be an intriguing exercise.

Have a great holiday -- and enjoy the movies!

About Author

Martin Kihn

SVP Strategy, Salesforce Marketing Cloud - formerly VP Research at Gartner, covering data-driven marketing, marketing and advertising analytics and data management platforms. Interested in ad tech, measurement and digital advertising and marketing. Author and balletomane.
View all posts by Martin Kihn >

Related Articles

Leave a Comment

No comments found.

View Posts by Categories

All Posts 2399 posts
AI 7 posts
AI Agent 2 posts
AI-based hotel recommendation 1 posts
AIForGood 1 posts
Alumni 60 posts
Animated Maps 1 posts
APIs 41 posts
Artificial Intelligence 2 posts
Artificial Intelligence 2 posts
AWS 13 posts
Banking 1 posts
Big Data 50 posts
Branch Analysis 1 posts
Capstone 206 posts
Career Education 7 posts
CLIP 1 posts
Community 72 posts
Congestion Zone 1 posts
Content Recommendation 1 posts
Cosine SImilarity 1 posts
Data Analysis 5 posts
Data Engineering 1 posts
Data Engineering 3 posts
Data Science 7 posts
Data Science News and Sharing 73 posts
Data Visualization 324 posts
Events 5 posts
Featured 37 posts
Function calling 1 posts
FutureTech 1 posts
Generative AI 5 posts
Hadoop 13 posts
Image Classification 1 posts
Innovation 2 posts
Kmeans Cluster 1 posts
LLM 6 posts
Machine Learning 364 posts
Marketing 1 posts
Meetup 144 posts
MLOPs 1 posts
Model Deployment 1 posts
Nagamas69 1 posts
NLP 1 posts
OpenAI 5 posts
OpenNYC Data 1 posts
pySpark 1 posts
Python 16 posts
Python 458 posts
Python data analysis 4 posts
Python Shiny 2 posts
R 404 posts
R Data Analysis 1 posts
R Shiny 560 posts
R Visualization 445 posts
RAG 1 posts
RoBERTa 1 posts
semantic rearch 2 posts
Spark 17 posts
SQL 1 posts
Streamlit 2 posts
Student Works 1687 posts
Tableau 12 posts
TensorFlow 3 posts
Traffic 1 posts
User Preference Modeling 1 posts
Vector database 2 posts
Web Scraping 483 posts
wukong138 1 posts

Our Recent Popular Posts

AI 4 AI: ChatGPT Unifies My Blog Posts
by Vinod Chugani
Dec 18, 2022
Meet Your Machine Learning Mentors: Kyle Gallatin
by Vivian Zhang
Nov 4, 2020
NICU Admissions and CCHD: Predicting Based on Data Analysis
by Paul Lee, Aron Berke, Bee Kim, Bettina Meier and Ira Villar
Jan 7, 2020

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day ChatGPT citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay football gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income industry Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI

NYC Data Science Academy

NYC Data Science Academy teaches data science, trains companies and their employees to better profit from data, excels at big data project consulting, and connects trained Data Scientists to our industry.

NYC Data Science Academy is licensed by New York State Education Department.

Get detailed curriculum information about our
amazing bootcamp!

Please enter a valid email address
Sign up completed. Thank you!

Offerings

  • HOME
  • DATA SCIENCE BOOTCAMP
  • ONLINE DATA SCIENCE BOOTCAMP
  • Professional Development Courses
  • CORPORATE OFFERINGS
  • HIRING PARTNERS
  • About

  • About Us
  • Alumni
  • Blog
  • FAQ
  • Contact Us
  • Refund Policy
  • Join Us
  • SOCIAL MEDIA

    ยฉ 2025 NYC Data Science Academy
    All rights reserved. | Site Map
    Privacy Policy | Terms of Service
    Bootcamp Application