Data-driven Crossword Puzzle Solving

Rachel Kogan

Posted on May 10, 2017

The skills the authors demonstrated here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

There’s a misconception that being good at crosswords and getting NYT crossword answers quickly requires knowledge of trivia, and it couldn't be more false. Sometimes a crossword constructor will resort to an obscure word just to get all the clues to fit, but trivia runs counter to the goal of the New York Times crossword, and puzzles with too many esoteric clues don't get printed. Let's collect data to see how to solve crossword puzzles.

A good New York Times crossword puzzle consists of two elements:

clever puns/jokes
clue-answer pairs that reflect the zeitgeist

“Zeitgeist” is a German word which means “spirit of the times”. The zeitgeist is the opposite of trivia; it is the collection of cultural references that should be familiar to most people.

My project is about the second bullet point: trying to understand and visualize how the NYT crossword puzzle stays current and captures the spirit of the time in which it is published.

I. Data

I scraped five years of clue-answer pairs from the crossword blog xwordinfo.com using Scrapy. There was a minor issue where my spider would get redirected if I tried to grab too much data at a time, so I had to crawl in chunks. Ultimately I was able to get most of the data I wanted, and I believe that anything excluded is missing completely at random.
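Crawling in chunks can be sketched as splitting the overall date range into smaller windows and running the spider once per window. This is only an illustration of the idea, not my actual spider: the function name `date_chunks` and the 30-day window are assumptions, and the dates are placeholders.

```python
from datetime import date, timedelta


def date_chunks(start, end, days_per_chunk):
    """Yield (chunk_start, chunk_end) pairs that cover start..end inclusive."""
    one_day = timedelta(days=1)
    chunk_start = start
    while chunk_start <= end:
        # Each chunk spans at most days_per_chunk days and never passes `end`.
        chunk_end = min(chunk_start + timedelta(days=days_per_chunk - 1), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + one_day


# Placeholder range: each window would be handed to one crawl run.
chunks = list(date_chunks(date(2017, 1, 1), date(2017, 3, 31), 30))
for chunk_start, chunk_end in chunks:
    print(chunk_start.isoformat(), "to", chunk_end.isoformat())
```

Pausing between windows, or shrinking the window, is the kind of knob that would be tuned until the redirects stop.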

I also scraped all the words added to the OED over the past four years (about 3000 words), and the entire Urban Dictionary word of the day archive (about 4000 words). Lastly, I obtained a list of the 5000 most common English words from the Corpus of Contemporary American English (COCA).

II. Analysis

I examined two different classes of answers:

frequently-used answers, and how the clues to these answers change throughout time
answers that have recently been used for the very first time (known as "debuts")

A. Frequently-Used Answers

In order to analyze frequent words, let’s briefly summarize which words are actually showing up a lot in the crossword puzzle. Here are the most commonly used crossword answers, along with their frequency counts over the last five years.

So we can see it's a lot of three-letter words, and a lot of the same letters appearing throughout the list. In fact, out of the 5000 most common crossword puzzle answers, about 1400 are three letters long (of the 5000 most common words in the English language, only about 300 have three letters). There aren't 1400 common three-letter words in the English language, so we get a lot of three-letter prefixes, acronyms, and names.
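Counts like these can be reproduced with a frequency counter over the answer column. The sketch below uses a tiny made-up list rather than the scraped data, and the helper names `top_answers` and `length_share` are mine, chosen for illustration.

```python
from collections import Counter


def top_answers(answers, n):
    """Return the n most frequent answers with their counts."""
    return Counter(answers).most_common(n)


def length_share(words, length):
    """Fraction of the given words that have exactly `length` letters."""
    words = list(words)
    if not words:
        return 0.0
    return sum(1 for word in words if len(word) == length) / len(words)


# Toy data only; the real input would be every answer from the scrape.
sample = ["ERA", "AREA", "ERA", "ORE", "AREA", "ERA", "OREO"]
top = top_answers(sample, 2)
share = length_share(set(sample), 3)  # share of distinct answers with 3 letters
print(top, share)
```

Running `length_share` on the top 5000 answers and on the COCA list gives the two proportions compared above.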

We can learn a lot about the crossword puzzle by tracing some of these three-letter answers throughout time and cataloging how the clues change. I chose the following clues intentionally to illustrate how the NYT crossword puzzle stays current.

HBO has been an answer in the crossword puzzle fourteen times in the past 5 years, and it’s always clued with a specific TV show: "'Game of Thrones' network", "'The Newsroom' channel", etc.

In this graph, the colorful dots represent the show appearing as an HBO crossword clue, and the black dots represent the year that show premiered.

If you follow which TV show was used throughout time, you can see that:

The editors are trying to switch it up, so in any year they use a few different shows
The editors are trying to stay current, so as new shows come out they add them to the clue roster – most recently, True Detective
The editors are trying to use the most popular shows, so there’s at least a year lag between a show premiering and the show appearing in the crossword for the first time
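The lag in the last point can be measured by comparing each show's premiere year with the first year it appears in an HBO clue. The sketch below assumes the clues have already been tagged with the show they mention; the show names and years are placeholders, not values from my data.

```python
def first_clue_lag(premiere_year, clue_years):
    """Return {show: years between premiere and first clue appearance}.

    premiere_year: dict mapping show -> premiere year
    clue_years: iterable of (show, year the show was used in a clue)
    """
    first_seen = {}
    for show, year in clue_years:
        if show not in first_seen or year < first_seen[show]:
            first_seen[show] = year
    return {
        show: first_seen[show] - premiere_year[show]
        for show in first_seen
        if show in premiere_year
    }


# Placeholder values for illustration only.
premieres = {"Show A": 2010, "Show B": 2013}
clues = [("Show A", 2012), ("Show A", 2014), ("Show B", 2014)]
lags = first_clue_lag(premieres, clues)
print(lags)
```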

If it keeps up its current level of popularity, I predict that Westworld will appear in the crossword puzzle as an HBO clue sometime in 2018.

The next answer I analyzed was LIN. LIN has appeared thirteen times in the past 5 years, and it’s always clued as a person's first or last name, for example: "Justin who directed four of the Fast and the Furious movies" or "Jeremy of the NBA".

In this graph, the orange dots represent LIN clues and the colorful dots represent relevant current events.

Jeremy Lin is a basketball player who started for the Knicks in 2012 and sparked a fan craze called LINSANITY. You can see that Jeremy Lin was the go-to LIN clue for a while after that. But then he went and played for Houston, and the crossword constructors started rotating in Justin Lin the director and Lin Biao, a figure in Communist China. And I don't think it's a coincidence that Lin-Manuel Miranda was the LIN clue three days before his show won 11 Tonys, or that Justin Lin showed up a few months after the Star Trek premiere.

You can see that LIN hasn’t appeared yet in 2017, but Jeremy Lin is back in NYC, playing for the Brooklyn Nets, and I predict that Jeremy Lin will make a crossword comeback.

B. Debut Answers

Debut answers are words that appear as answers in a puzzle for the very first time. There are usually at least a few debut answers every day. Here are some debut answers from the most recent Sunday crossword:

Debut answers usually come in one of two types:

Long multi-word jokes, usually related to the theme of the puzzle, that will probably never reappear
- MODELYODEL, MASSAGEPASSAGE
New words added to the crossword corpus that may reappear
- slang words like SWOLE
- tech jargon like MOOC
- celebrities like Amy POEHLER (I’m surprised this is her first xword appearance because she has been famous for a while but I guess her last name is a little long for the crossword.)
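Finding debuts amounts to recording the earliest date on which each answer appears. A minimal sketch, assuming the data is a list of (date, answer) rows; the rows here are placeholders, and a debut found this way is only a debut relative to the years that were scraped.

```python
from datetime import date


def find_debuts(rows):
    """Map each answer to the earliest date it appears in rows."""
    debuts = {}
    for day, answer in rows:
        if answer not in debuts or day < debuts[answer]:
            debuts[answer] = day
    return debuts


def debuts_on(rows, day):
    """Answers whose first recorded appearance falls on `day`, sorted."""
    return sorted(a for a, first in find_debuts(rows).items() if first == day)


# Placeholder rows for illustration only.
rows = [
    (date(2017, 1, 1), "OLDWORD"),
    (date(2017, 1, 8), "OLDWORD"),
    (date(2017, 1, 8), "NEWWORD"),
]
new_today = debuts_on(rows, date(2017, 1, 8))
print(new_today)
```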

I was curious about whether words were being added faster to the Oxford English dictionary corpus or the NYT crossword puzzle corpus, so I scraped all the new additions to the OED over the past four years. It turns out to be kind of a dead heat with few discernible patterns.

In this timeline graph, each side of the bar represents the word's addition to a corpus; the color of the bar represents whether the crossword or the OED was first.

I was pretty surprised that "emoji" was adopted before "selfie". I was taking selfies long before I ever used an emoji.
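The comparison behind this graph can be sketched as a join on the word followed by a date comparison. Crossword answers are uppercase with no spaces, so dictionary entries need to be normalized before matching; the `normalize` rule below is an assumption about how to do that, and the words and dates are placeholders.

```python
from datetime import date


def normalize(word):
    """Uppercase and keep letters only, to match crossword answer format."""
    return "".join(ch for ch in word.upper() if ch.isalpha())


def who_was_first(crossword_dates, oed_dates):
    """For words in both corpora, report which corpus added them first."""
    oed_norm = {normalize(word): day for word, day in oed_dates.items()}
    result = {}
    for word, xw_date in crossword_dates.items():
        if word not in oed_norm:
            continue
        oed_date = oed_norm[word]
        if xw_date < oed_date:
            result[word] = "crossword"
        elif oed_date < xw_date:
            result[word] = "oed"
        else:
            result[word] = "tie"
    return result


# Placeholder entries for illustration only.
crossword = {
    "ALPHAWORD": date(2015, 3, 1),
    "BETAWORD": date(2016, 6, 1),
    "ONLYXWORD": date(2016, 1, 1),
}
oed = {
    "alpha word": date(2016, 1, 1),
    "beta-word": date(2014, 9, 1),
    "only oed": date(2015, 1, 1),
}
winners = who_was_first(crossword, oed)
print(winners)
```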

I also scraped Urban Dictionary to see if their words of the day end up in the NYT crossword, and they do! There’s actually a lot more overlap with UD than with the OED. Here are a few of the overlapping words, along with the debut date for each corpus.

It's not too surprising that words show up in Urban Dictionary a lot earlier than they show up in the crossword. But it is interesting that almost all of these words were submitted to Urban Dictionary before 2010. It’s possible that more recent words just haven’t shown up in the crossword yet, but I think it’s suggestive that UD had a golden age and is now on the decline.
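The "before 2010" observation can be checked by taking the words that appear in both corpora and computing what fraction of them have an Urban Dictionary date earlier than a cutoff. A minimal sketch with placeholder words and dates; `share_before` is a name I made up for illustration.

```python
from datetime import date


def share_before(ud_dates, crossword_answers, cutoff):
    """Fraction of overlapping words whose UD date is earlier than cutoff."""
    overlap = [day for word, day in ud_dates.items() if word in crossword_answers]
    if not overlap:
        return 0.0
    return sum(1 for day in overlap if day < cutoff) / len(overlap)


# Placeholder entries for illustration only.
ud = {
    "WORDA": date(2006, 5, 1),
    "WORDB": date(2009, 1, 1),
    "WORDC": date(2014, 1, 1),
    "WORDD": date(2008, 1, 1),  # not a crossword answer, so it is ignored
}
answers = {"WORDA", "WORDB", "WORDC"}
early_share = share_before(ud, answers, date(2010, 1, 1))
print(round(early_share, 2))
```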

III. Conclusion

I used to try to do crossword puzzles from before I was born, and I found them impossible. So I assumed that the puzzles were just objectively harder back then.

Now after this project I no longer think that’s the case. I think that the NYT Crossword is so aligned with its publication era that it's very difficult to do puzzles that you didn't live through.

IV. Ideas for Further Exploration

Natural Language Processing
- Get better at grouping clues and answers that are similar but not identical
- Figure out how to distinguish between compound words and multi-word answers
- Catalogue new portmanteaus and compound words
Build a crossword solver
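As a starting point for grouping similar clues, one option is to normalize each clue and then merge clues whose text is close according to the standard library's `difflib.SequenceMatcher`. This is one possible approach, not something I have built; the 0.8 threshold is a guess that would need tuning, and `string.punctuation` only strips ASCII punctuation, so curly quotes would need separate handling.

```python
import string
from difflib import SequenceMatcher

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(clue):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(clue.lower().translate(_PUNCT_TABLE).split())


def group_clues(clues, threshold=0.8):
    """Greedily group clues whose normalized text is at least threshold-similar.

    Each clue is compared against the first clue of every existing group and
    joins the first group that is similar enough; otherwise it starts a new one.
    """
    groups = []
    for clue in clues:
        key = normalize(clue)
        for group in groups:
            ratio = SequenceMatcher(None, key, normalize(group[0])).ratio()
            if ratio >= threshold:
                group.append(clue)
                break
        else:
            groups.append([clue])
    return groups


clues = ["Jeremy of the NBA", "Jeremy of the N.B.A.", '"Game of Thrones" network']
groups = group_clues(clues)
print(groups)
```

The greedy pass depends on input order, so a real version might cluster on pairwise distances instead.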

V. Acknowledgements

Thanks to Zeyu Zhang for teaching me how to scrape a password-protected website, and to Thomas Kolasa for reminding me not to push my password to GitHub.

VI. Addendum

A debut word from Feb 10, 2017, and the only Friday crossword I've ever solved without cheating:


About Author

Rachel Kogan

Rachel graduated from Princeton in 2013 with a B.A. in Mathematics, and then worked at Morgan Stanley as a mortgage-backed securities trader for two years. She's currently a developer at Bloomberg L.P. Check out her blog at https://rachel1792.github.io/.




Rachel Kogan May 26, 2017

Thanks for the feedback, Rex! I'm a big fan of your crossword blog.

Rex May 25, 2017

AMYPOEHLER debuted many years earlier. I know 'cause I did it.
