Data-driven Crossword Puzzle Solving
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
There’s a misconception that being good at crosswords and getting NYT crossword answers quickly requires knowledge of trivia, and it couldn't be more false. Sometimes a crossword constructor will resort to an obscure word just to get all the clues to fit, but trivia runs counter to the goal of the New York Times crossword, and puzzles with too many esoteric clues don't get printed. Let's collect data to see how to solve crossword puzzles.
A good New York Times crossword puzzle consists of two elements:
- clever puns/jokes
- clue-answer pairs that reflect the zeitgeist
“Zeitgeist” is a German word which means “spirit of the times”. The zeitgeist is the opposite of trivia; it is the collection of cultural references that should be familiar to most people.
My project is about the second bullet point: trying to understand and visualize how the NYT crossword puzzle stays current and captures the spirit of the time in which it is published.
I. Data
I scraped five years of clue-answer pairs from the crossword blog xwordinfo.com using Scrapy. My spider would get redirected if I tried to grab too much data at once, so I had to crawl in chunks. Ultimately I was able to get most of the data I wanted, and I believe that anything excluded is missing completely at random.
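One way to crawl in chunks is to split the full date range into smaller windows and run the spider once per window. The sketch below only illustrates that splitting step and is not the original spider; the function name `date_chunks` and the 90-day window are assumptions.

```python
from datetime import date, timedelta

def date_chunks(start, end, chunk_days=30):
    """Yield (chunk_start, chunk_end) pairs that cover start..end inclusive."""
    chunk_start = start
    while chunk_start <= end:
        # Each chunk spans at most chunk_days days, clipped to the end date.
        chunk_end = min(chunk_start + timedelta(days=chunk_days - 1), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + timedelta(days=1)

# Hypothetical five-year range split into 90-day windows (assumed size).
chunks = list(date_chunks(date(2012, 1, 1), date(2016, 12, 31), chunk_days=90))
for chunk_start, chunk_end in chunks[:3]:
    print(chunk_start, "to", chunk_end)
```

Each window can then be crawled separately, with a pause between runs, so that no single run requests too many pages.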
I also scraped all the words added to the OED over the past four years (about 3000 words), and the entire Urban Dictionary word of the day archive (about 4000 words). Lastly, I obtained a list of the 5000 most common English words from the Corpus of Contemporary American English (COCA).
II. Analysis
I examined two different classes of answers:
- frequently-used answers, and how the clues to these answers change throughout time
- answers that have recently been used for the very first time (known as "debuts")
A. Frequently-Used Answers
In order to analyze frequent words, let’s briefly summarize which words are actually showing up a lot in the crossword puzzle. Here are the most commonly used crossword answers, along with their frequency counts over the last five years.
The list is dominated by three-letter words, and the same letters appear over and over. In fact, of the 5000 most common crossword puzzle answers, about 1400 are three letters long, whereas of the 5000 most common words in the English language, only about 300 are. There aren't 1400 common three-letter words in English, so we get a lot of three-letter prefixes, acronyms, and names.
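The frequency counts and the three-letter tally can be sketched as follows. This is a minimal illustration that assumes the scraped answers are available as a plain list of strings; the helper names and the toy input are made up and are not the scraped data.

```python
from collections import Counter

def top_answers(answers, n):
    """Return the n most frequent answers as (answer, count) pairs."""
    return Counter(answers).most_common(n)

def count_by_length(words, length):
    """Count how many words have exactly the given number of letters."""
    return sum(1 for word in words if len(word) == length)

# Toy input for illustration only (not the scraped data).
answers = ["ERA", "AREA", "ERA", "ONE", "AREA", "ERA"]
top = top_answers(answers, 3)
three_letter = count_by_length([word for word, _ in top], 3)
print(top)
print(three_letter, "of the top", len(top), "answers have three letters")
```

Running the same two helpers on the COCA word list would give the comparison figure for ordinary English.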
We can learn a lot about the crossword puzzle by tracing some of these three-letter answers throughout time and cataloging how the clues change. I chose the following clues intentionally to illustrate how the NYT crossword puzzle stays current.
HBO has been an answer in the crossword puzzle fourteen times in the past five years, and it’s always clued with a specific TV show: "Game of Thrones" network, "The Newsroom" channel, etc.
In this graph, the colorful dots represent the show appearing as an HBO crossword clue, and the black dots represent the year that show premiered.
If you follow which TV show was used throughout time, you can see that:
- The editors are trying to switch it up, so in any year they use a few different shows
- The editors are trying to stay current, so as new shows come out they add them to the clue roster – most recently, True Detective
- The editors are trying to use the most popular shows, so there’s at least a year lag between a show premiering and the show appearing in the crossword for the first time
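The three observations above can be checked with a couple of small helpers. This is a sketch under the assumption that each HBO clue has been reduced to a (year, show) pair; the show names and years below are placeholders, not values from the data.

```python
from collections import defaultdict

def shows_per_year(clue_events):
    """Map each year to the sorted list of distinct shows used as clues."""
    by_year = defaultdict(set)
    for year, show in clue_events:
        by_year[year].add(show)
    return {year: sorted(shows) for year, shows in by_year.items()}

def premiere_lag(clue_events, premiere_years):
    """Years between a show's premiere and its first use as a clue."""
    first_year = {}
    for year, show in clue_events:
        first_year[show] = min(year, first_year.get(show, year))
    return {
        show: year - premiere_years[show]
        for show, year in first_year.items()
        if show in premiere_years
    }

# Placeholder events and premiere years (hypothetical).
events = [(2013, "Show A"), (2013, "Show B"), (2014, "Show A"), (2016, "Show C")]
premieres = {"Show A": 2011, "Show B": 2012, "Show C": 2014}
print(shows_per_year(events))
print(premiere_lag(events, premieres))
```

A lag of at least one for every show would be consistent with the third observation.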
If it keeps up its current level of popularity, I predict that Westworld will appear in the crossword puzzle as an HBO clue sometime in 2018.
The next answer I analyzed was LIN. LIN has appeared thirteen times in the past five years, and it’s always clued as a person's first or last name, for example: "Justin who directed four of the Fast and the Furious movies" or "Jeremy of the NBA".
In this graph, the orange dots represent LIN clues and the colorful dots represent relevant current events.
Jeremy Lin is a basketball player who started for the Knicks in 2012 and sparked a fan craze called LINSANITY. You can see that Jeremy Lin was the go-to LIN clue for a while after that. But then he went to play for Houston, and the crossword constructors started rotating in Justin Lin the director and Lin Biao, a figure in Communist China. And I don't think it's a coincidence that Lin-Manuel Miranda was the LIN clue three days before his show won 11 Tonys, or that Justin Lin showed up a few months after the Star Trek premiere.
You can see that LIN hasn’t appeared yet in 2017, but Jeremy Lin is back in NYC, playing for the Brooklyn Nets, and I predict that Jeremy Lin will make a crossword comeback.
B. Debut Answers
Debut answers are words that appear as answers in a puzzle for the very first time. There are usually at least a few debut answers every day. Here are some debut answers from the most recent Sunday crossword:
Debut answers usually come in one of two types:
- Long multi-word jokes, usually related to the theme of the puzzle, that will probably never reappear
  - MODELYODEL, MASSAGEPASSAGE
- New words added to the crossword corpus that may reappear
  - slang words like SWOLE
  - tech jargon like MOOC
  - celebrities like Amy POEHLER (I’m surprised this is her first xword appearance because she has been famous for a while, but I guess her last name is a little long for the crossword.)
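Given a list of (puzzle date, answer) pairs, debuts can be found by recording the earliest date on which each answer appears. The sketch below uses placeholder answers and dates. Note the caveat that with only a few years of data, "first appearance in the dataset" is not the same as a true debut, so a real analysis would need the full answer history or a debut flag from the source.

```python
from datetime import date

def first_appearances(entries):
    """Map each answer to the earliest puzzle date on which it appears."""
    first = {}
    for puzzle_date, answer in entries:
        if answer not in first or puzzle_date < first[answer]:
            first[answer] = puzzle_date
    return first

def debuts_on(entries, day):
    """Return, in sorted order, the answers whose first appearance is `day`."""
    first = first_appearances(entries)
    return sorted(answer for answer, d in first.items() if d == day)

# Placeholder entries (hypothetical answers and dates).
entries = [
    (date(2017, 1, 8), "AAA"),
    (date(2017, 1, 8), "BBB"),
    (date(2015, 3, 2), "BBB"),  # BBB already appeared earlier, so not a debut
    (date(2017, 1, 8), "CCC"),
]
print(debuts_on(entries, date(2017, 1, 8)))
```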
I was curious about whether words were being added faster to the Oxford English Dictionary corpus or the NYT crossword puzzle corpus, so I scraped all the new additions to the OED over the past four years. It turns out to be kind of a dead heat with few discernible patterns.
In this timeline graph, each end of the bar represents the word's addition to a corpus; the color of the bar represents whether the crossword or the OED was first.
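The comparison behind a timeline like this can be sketched as a join on the word, followed by a date comparison. The code below is an illustration under assumptions: both corpora are given as word-to-date dictionaries, crossword answers are matched to dictionary words by uppercasing and dropping non-letters (the `normalize` helper is hypothetical), and the words and dates are placeholders.

```python
from datetime import date

def normalize(word):
    """Uppercase and keep letters only, to mimic crossword answer format."""
    return "".join(ch for ch in word.upper() if ch.isalpha())

def which_came_first(crossword_debuts, dictionary_additions):
    """For words in both corpora, report the leader and the gap in days."""
    dict_dates = {normalize(w): d for w, d in dictionary_additions.items()}
    result = {}
    for answer, xw_date in crossword_debuts.items():
        key = normalize(answer)
        if key not in dict_dates:
            continue  # word is not in the dictionary list
        dict_date = dict_dates[key]
        if xw_date < dict_date:
            leader = "crossword"
        elif dict_date < xw_date:
            leader = "dictionary"
        else:
            leader = "tie"
        result[key] = (leader, abs((xw_date - dict_date).days))
    return result

# Placeholder words and dates (hypothetical).
xw = {"ALPHA": date(2015, 1, 1), "BETA": date(2016, 6, 1)}
oed = {"alpha": date(2014, 1, 1), "beta": date(2016, 7, 1), "gamma": date(2016, 1, 1)}
print(which_came_first(xw, oed))
```

The same function works for any second corpus that provides a date per word.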
I was pretty surprised that "emoji" was adopted before "selfie". I was taking selfies long before I ever used an emoji.
I also scraped Urban Dictionary to see if their words of the day end up in the NYT crossword, and they do! There’s actually a lot more overlap with UD than with the OED. Here are a few of the overlapping words, along with the debut date for each corpus.
It's not too surprising that words show up in Urban Dictionary a lot earlier than they show up in the crossword. But it is interesting that almost all of these words were submitted to Urban Dictionary before 2010. It’s possible that more recent words just haven’t shown up in the crossword yet, but I think it suggests that UD had a golden age and is now on the decline.
III. Conclusion
I used to try to do crossword puzzles from before I was born, and I found them impossible. So I assumed that the puzzles were just objectively harder back then.
Now after this project I no longer think that’s the case. I think that the NYT Crossword is so aligned with its publication era that it's very difficult to do puzzles that you didn't live through.
IV. Ideas for Further Exploration
- Natural Language Processing
- Get better at grouping clues and answers that are similar but not identical
- Figure out how to distinguish between compound words and multi-word answers
- Catalogue new portmanteaus and compound words
- Build a crossword solver
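As a starting point for grouping clues that are similar but not identical, clues can be normalized before comparison so that differences in case and punctuation are ignored. This is only a sketch of one simple approach; the helper names are assumptions, and real clue matching would need more than this (for example, handling reworded clues).

```python
import re
from collections import defaultdict

def normalize_clue(clue):
    """Lowercase, strip punctuation, and collapse whitespace."""
    letters_only = re.sub(r"[^a-z0-9 ]", "", clue.lower())
    return " ".join(letters_only.split())

def group_clues(clues):
    """Group clues that are identical after normalization."""
    groups = defaultdict(list)
    for clue in clues:
        groups[normalize_clue(clue)].append(clue)
    return dict(groups)

clues = [
    '"Game of Thrones" network',
    "Game of Thrones network",
    '"The Newsroom" channel',
]
for key, members in group_clues(clues).items():
    print(key, "->", members)
```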
V. Acknowledgements
Thanks to Zeyu Zhang for teaching me how to scrape a password-protected website, and to Thomas Kolasa for reminding me not to push my password to GitHub.
VI. Addendum
A debut word from Feb 10, 2017, and the only Friday crossword I've ever solved without cheating: