This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.
AI language models seem to get more sophisticated by the day, prompting questions about when they will fully match humans in their linguistic abilities. That time, it turns out, may come sooner than you think.
In a recent study, researchers show that OpenAI’s o1 reasoning model is able to recognize, map out, and even build upon one of the most complex phenomena of human language, a concept called linguistic recursion. Recursion involves nesting one element inside another of the same kind within a sentence; for example: “a lake on an island in a lake.” The results were published on 3 June in IEEE Transactions on Artificial Intelligence.
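To make that nesting pattern concrete, here is a minimal Python sketch (not from the study, with an invented function name and depth limit) showing how a single recursive rule can embed the same kind of phrase inside itself to arbitrary depth:

```python
# A minimal sketch (not from the study) of how one recursive rule can nest
# the same kind of phrase inside itself to arbitrary depth.

def nested_phrase(depth: int) -> str:
    """Build a noun phrase by embedding the pattern inside itself."""
    if depth == 0:
        return "a lake"
    # Each level wraps the smaller phrase in another copy of the same pattern.
    return f"a lake on an island in {nested_phrase(depth - 1)}"

for d in range(3):
    print(nested_phrase(d))
# depth 0: a lake
# depth 1: a lake on an island in a lake
# depth 2: a lake on an island in a lake on an island in a lake
```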
Study coauthor Gašper Beguš is an associate professor of linguistics at the University of California, Berkeley, with a deep interest in language and intelligence. His research compares machine and human forms of learning to understand their differences and strengths, and also to understand the limits of AI from a safety and regulatory standpoint.
Can LLMs Do Metalinguistics?
In the new study, Beguš and his collaborators examined the metalinguistic abilities of four large language models (LLMs): OpenAI’s GPT-3.5 Turbo, GPT-4, and o1, as well as Meta’s Llama 3.1. While many studies have explored how well such models can produce language, this study looked specifically at the models’ ability to analyze language itself, a skill known as metalinguistics.
For example, when a sentence has multiple meanings, are language models able to correctly map out and “understand” all the various meanings? Beguš provides a simple one-word example of this challenge. “‘Unlockable’ has two meanings, right? Either you cannot lock it, or you can unlock it,” he explains.
In their study, the researchers tested the AI models on complete sentences that can be read in more than one way, known as structurally ambiguous sentences. For example: “Eliza wanted her cast out.”
The sentence could express Eliza’s desire to have a person cast out of a group, or to have her medical cast removed. While all four language models correctly identified the sentence as structurally ambiguous, only o1 was able to correctly map out the different meanings the sentence could carry.
LLMs’ Recursive Abilities
Beguš emphasizes that the most important advance reported in this study was o1’s ability to successfully engage in linguistic recursion. An example of a recursive element within a sentence is shown in brackets in the following sentence: “The worldview [that the prose Nietzsche wrote expressed] was unprecedented.” In fact, like Russian nesting dolls, the sentence contains a recursion within a recursion: “The worldview [that the prose [Nietzsche wrote] expressed] was unprecedented.”
In the linguistic recursion experiment, the researchers asked the language models to determine whether a given sentence is recursive, identify the recursive part, draw a syntactic tree representing the sentence, and add another layer of recursion to the sentence.
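For a rough sense of what posing such a task to a model might look like in practice, here is a sketch using OpenAI’s Python client; the prompt wording, the model identifier, and the example sentence are assumptions for illustration and not the authors’ actual protocol.

```python
# A rough illustration (not the authors' protocol) of posing the four recursion
# subtasks to a model through the OpenAI Python client. The model name and the
# prompt wording here are assumptions for the sake of the example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

sentence = "The worldview that the prose Nietzsche wrote expressed was unprecedented."

prompt = (
    f"Consider the sentence: '{sentence}'\n"
    "1. Is this sentence recursive?\n"
    "2. Identify the recursive (embedded) part.\n"
    "3. Draw a syntactic tree for the sentence (bracketed notation is fine).\n"
    "4. Add one more layer of recursion to the sentence."
)

response = client.chat.completions.create(
    model="o1",  # assumed model identifier
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```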
All four models could identify the recursive sentences, but o1 dramatically outperformed the others when it came to correctly mapping out the complex sentence structure, scoring 0.87 out of 1 compared with an average of 0.36 for the other three models.
Beguš notes that analyzing these recursive sentences is no easy task. “These are the most complex types of sentences even for humans to analyze,” he says. He adds that recursion is a defining trait of human language, and one that has long captivated linguists. No other animal has demonstrated such complexity in its communication. The fact that AI models can identify and analyze recursion shows they are capable of a high level of linguistic complexity, Beguš says.
How Far Can LLMs Go?
The researchers also tested the models’ ability to analyze phonological rules, which govern how sounds are organized within a language. In this experiment, the researchers used invented languages so that the AI models couldn’t rely on memorization and instead had to analyze the word structure itself. For example, the models were asked to identify when a consonant should be pronounced long or short. Again, o1 greatly outperformed the other models, identifying the correct conditions for the phonological rules in 19 of the 30 cases.
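As a loose illustration of the kind of pattern the models had to uncover, here is a toy Python sketch built on hypothetical invented-language data (not the study’s): a rule under which a consonant is pronounced long only when it sits between two vowels.

```python
# A toy illustration (hypothetical data, not the study's) of the kind of pattern
# the models had to uncover: in this invented language, a consonant surfaces as
# long only when it sits between two vowels.
VOWELS = set("aeiou")

def predict_length(word: str, i: int) -> str:
    """Predict 'long' or 'short' for the consonant at position i."""
    between_vowels = (
        0 < i < len(word) - 1
        and word[i - 1] in VOWELS
        and word[i + 1] in VOWELS
    )
    return "long" if between_vowels else "short"

# Invented forms: in "ata" the t sits between vowels, so it should be long;
# in "asta" the t follows another consonant, so it stays short.
print(predict_length("ata", 1))   # -> long
print(predict_length("asta", 2))  # -> short
```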
Beguš emphasizes the need to understand how far these models can go with their linguistic abilities, especially for safety and regulation purposes. “We are showing that the goalpost is already pretty high, and they’re reaching it,” he says.
But he wonders how much further the models could go. Could they succeed in analyzing three layers of recursion? How about five or 10? “Where do [the models] stop? Because the big-picture goal of this research is to really understand, what are their limits?” he says. “That’s a million-dollar question.”