Millions of people log on every day for the latest edition of Connections, a popular category-matching game from The New York Times. Launched in mid-2023, the game garnered 2.3 billion plays in the first six months. The concept is straightforward yet captivating: Players get four tries to identify four themes among 16 words.
Part of the fun for players is applying abstract reasoning and semantic knowledge to spot connecting meanings. Under the hood, however, puzzle creation is complex. New York University researchers recently tested the ability of OpenAI’s GPT-4 large language model (LLM) to create engaging and creative puzzles. Their study, published as a preprint on arXiv in July, found that LLMs lack the metacognition needed to take the player’s perspective and anticipate their downstream reasoning. With careful prompting and domain-specific subtasks, however, LLMs can still write puzzles on par with The New York Times’ own.
Each Connections puzzle features 16 words (left) that must be sorted into 4 categories of 4 words each (right). The New York Times
“Models like GPT don’t know how humans think, so they’re bad at estimating how tricky a puzzle is for the human brain,” says lead author Timothy Merino, a Ph.D. student in NYU’s Game Innovation Lab. “On the flip side, LLMs have a very impressive linguistic understanding and knowledge base from the massive amounts of text they train on.”
The researchers first needed to understand the core game mechanics and why they’re engaging. Certain word groups, like opera titles or basketball teams, might be familiar to some players. However, the challenge isn’t just a knowledge check. “[The challenge] comes from spotting groups with the presence of misleading words that make their categorization ambiguous,” says Merino.
Intentionally distracting words serve as red herrings and form the game’s signature trickiness. In developing GPT-4’s generative pipeline, the researchers tested whether intentional overlap and false groups resulted in tough yet enjoyable puzzles.
A successful Connections puzzle includes intentionally overlapping words (top). The NYU researchers included a process for generating new word groups in their LLM approach to making Connections puzzles (bottom). NYU
This mirrors the thinking of Connections creator and editor Wyna Liu, whose editorial approach considers “decoys” that don’t belong to any other category. Senior puzzle editor Joel Fagliano, who tests and edits Liu’s boards, has said that spotting a red herring is among the hardest skills to learn. As he puts it, “More overlap makes a harder puzzle.” (The New York Times declined IEEE Spectrum’s request for an interview with Liu.)
The NYU paper cites Liu’s three axes of difficulty: word familiarity, category ambiguity, and wordplay variety. Meeting these constraints is a unique challenge for modern LLM systems.
AI Needs Good Prompts for Good Puzzles
The team began by explaining the game rules to the AI model, providing examples of Connections puzzles, and asking the model to create a new puzzle.
“We discovered that it’s really hard to write an exhaustive ruleset for Connections that GPT could follow and always produce a good result,” Merino says. “We’d write up a big set of rules, ask it to generate some puzzles, then inevitably discover some new unspoken rule we needed to include.”
Making the prompts longer didn’t improve the quality of the results. “The more rules we added, the more GPT seemed to ignore them,” Merino adds. “It’s hard to adhere to 20 different rules and still come up with something clever.”
The team found success by breaking the task into smaller workflows. A generator LLM creates word groups through iterative prompting, a step-by-step process that produces one or more groups within a single context; those groups are then parsed into separate nodes. Next, an editor LLM identifies the connecting theme and edits the categories. Finally, a human evaluator picks the highest-quality sets. Each LLM agent in the pipeline follows a limited set of rules without needing an exhaustive explanation of the game’s intricacies. For instance, the editor LLM only needs to know the rules for category naming and fixing errors, not the full gameplay.
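The paper itself doesn’t include code, but the division of labor can be sketched in a few lines. The sketch below is a minimal illustration assuming the OpenAI Python client; the prompts, the loop structure, and the model name are stand-ins of my own, not the researchers’ actual pipeline.

```python
# Illustrative two-stage generator/editor pipeline (not the NYU authors' code).
# Assumes the openai package (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def ask(system_prompt: str, user_prompt: str) -> str:
    """Send one self-contained request; each agent sees only its own rules."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

# Stage 1: a generator agent proposes candidate word groups, one at a time,
# seeing earlier groups so it can reuse words as deliberate decoys.
GENERATOR_RULES = (
    "You write word groups for a Connections-style puzzle. "
    "Each group is a theme plus exactly four single words. "
    "Reuse words that fit earlier groups as misleading decoys when possible."
)
candidate_groups = []
for i in range(4):
    prior = "\n".join(candidate_groups)
    candidate_groups.append(
        ask(GENERATOR_RULES, f"Groups so far:\n{prior}\n\nPropose group {i + 1}.")
    )

# Stage 2: an editor agent needs only naming and error-fixing rules,
# not an explanation of how the game is played.
EDITOR_RULES = (
    "You edit Connections-style word groups. Give each group a concise "
    "category name and fix any group that has the wrong number of words."
)
edited_puzzle = ask(EDITOR_RULES, "Edit these groups:\n" + "\n".join(candidate_groups))

# Stage 3: a human evaluator reviews the output and keeps the best boards.
print(edited_puzzle)
```

Keeping each agent’s instruction set this small reflects the researchers’ observation that GPT-4 tends to ignore long rule lists when asked to satisfy everything in a single prompt.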
To test the model’s appeal, the researchers collected 78 responses from 52 human players, who compared LLM-generated sets to real Connections puzzles. Those surveys confirmed that GPT-4 could produce novel puzzles comparable in difficulty to the originals and competitive with them in players’ preferences.
In about half of the comparisons against real Connections puzzles, human players rated AI-generated versions as equally or more difficult, creative, and enjoyable. NYU
Greg Durrett, an associate computer science professor at the University of Texas at Austin, calls NYU’s study an “interesting benchmark task” and fertile ground for future work on understanding set operations like semantic groupings and solutions.
Durrett explains that while LLMs excel at generating various word sets or acronyms, their outputs may be trite or less interesting than human creations. He adds, “The [NYU] researchers did a lot of work to come up with the right prompting strategies to generate these puzzles and get high-quality outputs from the model.”
NYU Game Innovation Lab Director Julian Togelius, an associate professor of computer science and engineering who co-authored the paper, says the group’s task assignment workflow could carry over to other titles such as Codenames, a popular multiplayer board game. Like Connections, Codenames involves identifying commonalities between words. “We could probably use a very similar method with good results,” Togelius adds.
While LLMs may never match human creativity, Merino believes they’ll make excellent assistants for today’s puzzle designers. Their training knowledge unlocks vast word pools. For instance, GPT can list 30 shades of green in seconds, while humans might need a minute to think of a few.
“If I wanted to create a puzzle with a ‘shades of green’ category, I would be limited to the shades I know,” Merino says. “GPT told me about ‘celadon,’ a shade I didn’t know about. To me, that kind of sounds like the name of a dinosaur. I could ask GPT for 10 dinosaurs with names ending in ‘-don’ for a tricky follow-up group.”