Data-driven Crossword Puzzle Solving

Posted on May 10, 2017

There’s a misconception that being good at crosswords and getting NYT crossword answers quickly requires knowledge of trivia, and it couldn't be more false. Sometimes a crossword constructor will resort to an obscure word just to get all the clues to fit, but trivia runs counter to the goal of the New York Times crossword, and puzzles with too many esoteric clues don't get printed. Let's collect data to see how to solve crossword puzzles.

A good New York Times crossword puzzle consists of two elements:

  • clever puns/jokes
  • clue-answer pairs that reflect the zeitgeist

“Zeitgeist” is a German word that means “spirit of the times.” The zeitgeist is the opposite of trivia; it is the collection of cultural references that should be familiar to most people.

My project is about the second bullet point: trying to understand and visualize how the NYT crossword puzzle stays current and captures the spirit of the time in which it is published.

I. Data

I scraped five years of clue-answer pairs from the crossword blog xwordinfo.com using scrapy. There was a minor issue where my spider would get redirected if I tried to grab too much data at a time, so I had to crawl in chunks. Ultimately I was able to get most of the data I wanted, and I believe that anything excluded is missing completely at random.
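As a rough illustration of crawling in chunks (a sketch only, not the original spider), one option is to split the five-year range into short date windows and run the crawl once per window. The function name `date_chunks`, the 30-day window, and the start and end dates below are all assumptions made for the example.

```python
from datetime import date, timedelta


def date_chunks(start, end, chunk_days):
    """Yield (first_day, last_day) pairs that cover start..end inclusive."""
    if chunk_days < 1:
        raise ValueError("chunk_days must be at least 1")
    current = start
    while current <= end:
        # Clamp the final window so it never runs past the end date.
        last = min(current + timedelta(days=chunk_days - 1), end)
        yield current, last
        current = last + timedelta(days=1)


# Hypothetical five-year range, split into 30-day windows.
chunks = list(date_chunks(date(2012, 5, 1), date(2017, 4, 30), chunk_days=30))
for first_day, last_day in chunks[:3]:
    print(first_day, "to", last_day)
```

Each window could then be handed to a separate crawl run, with a pause between runs, so that no single run requests too much data at once.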

I also scraped all the words added to the OED over the past four years (about 3000 words), and the entire Urban Dictionary word of the day archive (about 4000 words). Lastly, I obtained a list of the 5000 most common English words from the Corpus of Contemporary American English (COCA).

II. Analysis

I examined two different classes of answers:

  • frequently-used answers, and how the clues to these answers change throughout time
  • answers that have recently been used for the very first time (known as "debuts")

A. Frequently-Used Answers

In order to analyze frequent words, let’s briefly summarize which words are actually showing up a lot in the crossword puzzle. Here are the most commonly used crossword answers, along with their frequency counts over the last five years.

[Figure: NYT Frequently Used Answers]

So we can see it's a lot of three-letter words, with many of the same letters appearing throughout the list. In fact, out of the 5000 most common crossword puzzle answers, about 1400 (28%) are three letters long; of the 5000 most common words in the English language, only about 300 (6%) have three letters. There aren't 1400 common three-letter words in the English language, so we get a lot of three-letter prefixes, acronyms, and names.

[Figure: NYT Top Word Lengths]
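The word-length comparison above comes down to counting answers by length. Here is a minimal sketch of that count; the function name `length_distribution` is made up for the example, and the sample list simply reuses a few answers mentioned in this post rather than the real top-5000 list.

```python
from collections import Counter


def length_distribution(answers):
    """Count how many answers there are of each length."""
    return Counter(len(answer) for answer in answers)


# Toy sample built from answers mentioned in this post, not the scraped data.
sample = ["HBO", "LIN", "MOOC", "SWOLE", "POEHLER", "MODELYODEL"]
dist = length_distribution(sample)

# Share of three-letter answers in the sample.
share_three = dist[3] / len(sample)
print(dict(dist), round(share_three, 2))
```

Running the same function on the crossword answers and on the COCA word list would give the two distributions being compared.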

We can learn a lot about the crossword puzzle by tracing some of these three-letter answers throughout time and cataloging how the clues change. I chose the following answers intentionally to illustrate how the NYT crossword puzzle stays current.

HBO has been an answer in the crossword puzzle fourteen times in the past five years, and it’s always clued with a specific TV show: "Game of Thrones" network, "The Newsroom" channel, etc.

[Figure: HBO]

In this graph, the colorful dots represent the show appearing as an HBO crossword clue, and the black dots represent the year that show premiered.

If you follow which TV show was used throughout time, you can see that:

  • The editors are trying to switch it up, so in any year they use a few different shows
  • The editors are trying to stay current, so as new shows come out they add them to the clue roster – most recently, True Detective
  • The editors are trying to use the most popular shows, so there’s at least a year lag between a show premiering and the show appearing in the crossword for the first time
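Tracing one answer through time, as in the observations above, amounts to filtering the clue-answer pairs by answer and grouping the clues by publication year. The sketch below assumes the scraped data is available as `(publication_date, answer, clue)` tuples; that layout, the function name `clues_by_year`, and the dates in the sample rows are placeholders, not real data.

```python
from collections import defaultdict
from datetime import date


def clues_by_year(rows, answer):
    """Group the clues used for one answer by publication year."""
    by_year = defaultdict(list)
    for pub_date, row_answer, clue in rows:
        if row_answer == answer:
            by_year[pub_date.year].append(clue)
    # Return a plain dict ordered by year so it reads like a timeline.
    return dict(sorted(by_year.items()))


# Placeholder rows: the clue text comes from this post, the dates are invented.
rows = [
    (date(2014, 3, 2), "HBO", '"Game of Thrones" network'),
    (date(2013, 6, 9), "HBO", '"The Newsroom" channel'),
    (date(2014, 8, 17), "LIN", "Jeremy of the NBA"),
]
timeline = clues_by_year(rows, "HBO")
print(timeline)
```

The same grouping works for any other frequently used answer by changing the `answer` argument.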

If it keeps up its current level of popularity, I predict that Westworld will appear in the crossword puzzle in an HBO clue sometime in 2018.

The next answer I analyzed was LIN. LIN has appeared thirteen times in the past five years, and it’s always clued as a person's first or last name, for example: "Justin who directed four of the Fast and the Furious movies" or "Jeremy of the NBA".

[Figure: LIN Clue]

In this graph, the orange dots represent LIN clues and the colorful dots represent relevant current events.

Jeremy Lin is a basketball player who started for the Knicks in 2012 and sparked a fan craze called LINSANITY. You can see that Jeremy Lin was the go-to LIN clue for a while after that. But then he went to play for Houston, and the crossword constructors started rotating in Justin Lin, the director, and Lin Biao, a figure in Communist China. I don't think it's a coincidence that Lin-Manuel Miranda was a LIN clue three days before his show won 11 Tonys, or that Justin Lin showed up a few months after the Star Trek premiere.

You can see that LIN hasn’t appeared yet in 2017, but Jeremy Lin is back in NYC, playing for the Brooklyn Nets, so I predict that he will make a crossword comeback.

B. Debut Answers

Debut answers are words that appear as answers in a puzzle for the very first time. There are usually at least a few debut answers every day. Here are some debut answers from the most recent Sunday crossword:

Debut answers usually come in one of two types:

  • Long multi-word jokes, usually related to the theme of the puzzle, that will probably never reappear
    • MODELYODEL, MASSAGEPASSAGE
  • New words added to the crossword corpus that may reappear
    • slang words like SWOLE
    • tech jargon like MOOC
    • celebrities like Amy POEHLER (I’m surprised this is her first xword appearance because she has been famous for a while, but I guess her last name is a little long for the crossword.)

I was curious about whether words were being added faster to the Oxford English Dictionary corpus or the NYT crossword puzzle corpus, so I scraped all the new additions to the OED over the past four years. It turns out to be kind of a dead heat with few discernible patterns.

In this timeline graph, each end of the bar represents the word's addition to a corpus; the color of the bar represents whether the crossword or the OED was first.

I was pretty surprised that "emoji" was adopted before "selfie". I was taking selfies long before I ever used an emoji.

I also scraped Urban Dictionary to see if their words of the day end up in the NYT crossword, and they do! There’s actually a lot more overlap with UD than with the OED. Here are a few of the overlapping words below, along with the debut date for each corpus.

It's not too surprising that words show up in Urban Dictionary a lot earlier than they show up in the crossword. But it is interesting that almost all of these words were submitted to Urban Dictionary before 2010. It’s possible that more recent words just haven’t shown up in the crossword yet, but I think it suggests that UD had a golden age and is now on the decline.
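Both comparisons in this section (crossword vs. OED, and crossword vs. Urban Dictionary) come down to the same operation: find the words two corpora share, then compare the dates each one added them. Here is a minimal sketch, assuming each corpus is a dictionary mapping a word to the date it was added; the function name `first_adopter` and the placeholder words and dates are invented for illustration.

```python
from datetime import date


def first_adopter(crossword_debuts, dictionary_added):
    """For each word in both corpora, report which corpus had it first."""
    shared = crossword_debuts.keys() & dictionary_added.keys()
    result = {}
    for word in sorted(shared):
        xword_date = crossword_debuts[word]
        dict_date = dictionary_added[word]
        if xword_date < dict_date:
            result[word] = "crossword"
        elif dict_date < xword_date:
            result[word] = "dictionary"
        else:
            result[word] = "tie"
    return result


# Placeholder words and dates, not real debut data.
crossword_debuts = {
    "ALPHA": date(2015, 1, 1),
    "BETA": date(2016, 6, 1),
    "GAMMA": date(2017, 2, 1),
    "EPSILON": date(2014, 1, 1),
}
dictionary_added = {
    "ALPHA": date(2016, 1, 1),
    "BETA": date(2016, 6, 1),
    "GAMMA": date(2015, 3, 1),
    "DELTA": date(2013, 1, 1),
}
print(first_adopter(crossword_debuts, dictionary_added))
```

Words that appear in only one corpus (EPSILON and DELTA here) are dropped by the key intersection, which is why the overlap can be much smaller than either word list.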

III. Conclusion

I used to try to do crossword puzzles from before I was born, and I found them impossible. So I assumed that the puzzles were just objectively harder back then.

Now, after this project, I no longer think that’s the case. I think that the NYT Crossword is so aligned with its publication era that it's very difficult to do puzzles from a time that you didn't live through.

IV. Ideas for Further Exploration

  • Natural Language Processing
    • Get better at grouping clues and answers that are similar but not identical
    • Figure out how to distinguish between compound words and multi-word answers
    • Catalogue new portmanteaus and compound words
  • Build a crossword solver

V. Acknowledgements

Thanks to Zeyu Zhang for teaching me how to scrape a password-protected website, and to Thomas Kolasa for reminding me not to push my password to GitHub.

VI. Addendum

A debut word from Feb 10, 2017, and the only Friday crossword I've ever solved without cheating:

About Author

Rachel Kogan

Rachel graduated from Princeton in 2013 with a B.A. in Mathematics, and then worked at Morgan Stanley as a mortgage-backed securities trader for two years. She's currently a developer at Bloomberg L.P. Check out her blog at https://rachel1792.github.io/.

Leave a Comment

Rachel Kogan May 26, 2017
Thanks for the feedback, Rex! I'm a big fan of your crossword blog.
Rex May 25, 2017
AMYPOEHLER debuted many years earlier. I know 'cause I did it.
