AI chatbots such as ChatGPT and other applications powered by large language models have found widespread use, but are infamously unreliable. A common assumption is that scaling up the models driving these applications will improve their reliability—for instance, by increasing the amount of data they are trained on, or the number of parameters they use to process information. However, more recent and larger versions of these language models have actually become more unreliable, not less, according to a new study.
Large language models (LLMs) are essentially supercharged versions of the autocomplete feature that smartphones use to predict the rest of a word a person is typing. ChatGPT, perhaps the most well-known LLM-powered chatbot, has passed law school and business school exams, successfully answered interview questions for software-coding jobs, written real estate listings, and developed ad content.
But LLMs frequently make mistakes. For instance, a study in June found that ChatGPT has an extremely broad range of success when it comes to producing functional code—with a success rate ranging from a paltry 0.66 percent to 89 percent—depending on the difficulty of the task, the programming language, and other factors.
Research teams have explored a number of strategies to make LLMs more reliable. These include boosting the amount of training data or computational power given to the models, as well as using human feedback to fine-tune the models and improve their outputs. And LLM performance has overall improved over time. For instance, early LLMs failed at simple additions such as “20 + 183.” Now LLMs successfully perform additions involving more than 50 digits.
However, the new study, published last week in the journal Nature, finds that “the newest LLMs might appear impressive and be able to solve some very sophisticated tasks, but they’re unreliable in various aspects,” says study coauthor Lexin Zhou, a research assistant at the Polytechnic University of Valencia in Spain. What’s more, he says, “the trend does not seem to show clear improvements, but the opposite.”
This decrease in reliability is partly due to changes that made more recent models significantly less likely to say that they don’t know an answer, or to give a reply that doesn’t answer the question. Instead, later models are more likely to confidently generate an incorrect answer.
How the LLMs fared on easy and tough tasks
The researchers explored several families of LLMs: 10 GPT models from OpenAI, 10 LLaMA models from Meta, and 12 BLOOM models from the BigScience initiative. Within each family, the most recent models are the biggest. The researchers focused on the reliability of the LLMs along three key dimensions.
One avenue the scientists investigated was how well the LLMs performed on tasks that humans consider simple versus ones they find difficult. For instance, a relatively easy task was adding 24,427 and 7,120, while a very difficult one was adding 1,893,603,010,323,501,638,430 and 98,832,380,858,765,261,900.
The LLMs were generally less accurate on tasks humans find challenging compared with ones they find easy, which isn’t unexpected. However, the AI systems were not 100 percent accurate even on the simple tasks. “We find that there are no safe operating conditions that users can identify where these LLMs can be trusted,” Zhou says.
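The study's actual evaluation pipeline is far more elaborate, but the basic idea of a difficulty-graded probe can be sketched in a few lines of Python. In the sketch below, `ask_llm` is a stand-in for whichever chat model is under test, and the helper names, digit sizes, and exact-match scoring are illustrative assumptions rather than the paper's code.

```python
import random
import re

def make_addition_task(digits: int) -> tuple[str, int]:
    """Build an addition prompt whose difficulty scales with operand length."""
    a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    return f"What is {a} + {b}?", a + b

def reply_is_correct(reply: str, expected: int) -> bool:
    """Exact-match check: does any number in the reply equal the true sum?"""
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", reply)]
    return expected in numbers

def accuracy_by_difficulty(ask_llm, digit_sizes=(5, 10, 20, 50), trials=100):
    """Estimate accuracy at each difficulty level (operand length in digits)."""
    results = {}
    for digits in digit_sizes:
        correct = 0
        for _ in range(trials):
            prompt, expected = make_addition_task(digits)
            if reply_is_correct(ask_llm(prompt), expected):
                correct += 1
        results[digits] = correct / trials
    return results
```

A table of per-difficulty accuracies like the one this returns is roughly the shape of evidence behind the claim that there is no difficulty level at which the models can simply be trusted.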
In addition, the new study found that compared with previous LLMs, the most recent models improved their performance when it came to tasks of high difficulty, but not low difficulty. This may result from LLM developers focusing on increasingly difficult benchmarks, as opposed to both simple and difficult benchmarks. “Our results reveal what the developers are actually optimizing for,” Zhou says.
Chatbots are bad with uncertainty
The second aspect of LLM performance that Zhou’s team examined was the models’ tendency to avoid answering user questions. The researchers found that more recent LLMs were less prudent in their responses—they were much more likely to forge ahead and confidently provide incorrect answers. In addition, whereas people tend to avoid answering questions beyond their capacity, more recent LLMs did not avoid providing answers when tasks increased in difficulty.
This imprudence may stem from “the desire to make language models try to say something seemingly meaningful,” Zhou says, even when the models are in uncertain territory. This leaves humans with the burden of spotting errors in LLM output, he adds.
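Measuring prudence is largely a matter of deciding when a reply counts as an answer at all. A rough way to operationalize it, loosely in the spirit of the study's correct-versus-incorrect-versus-avoidant breakdown, is sketched below; the phrase list, the `classify_reply` helper, and the `ask_llm` placeholder are illustrative assumptions, not the paper's actual method.

```python
import re

# Illustrative markers of an avoidant (non-)answer; the study itself used a
# more careful classification of responses.
AVOIDANT_PHRASES = ("i don't know", "i'm not sure", "i cannot answer")

def classify_reply(reply: str, expected: int) -> str:
    """Label a reply as 'correct', 'avoidant', or 'incorrect'."""
    text = reply.lower()
    if any(phrase in text for phrase in AVOIDANT_PHRASES):
        return "avoidant"
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", reply)]
    return "correct" if expected in numbers else "incorrect"

def response_profile(ask_llm, tasks):
    """tasks: a list of (prompt, expected_answer) pairs; returns label shares."""
    counts = {"correct": 0, "avoidant": 0, "incorrect": 0}
    for prompt, expected in tasks:
        counts[classify_reply(ask_llm(prompt), expected)] += 1
    total = len(tasks) or 1
    return {label: n / total for label, n in counts.items()}
```

Tracking how the avoidant share shrinks, and the incorrect share grows, as tasks get harder is essentially what the researchers mean by declining prudence.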
Finally, the researchers examined whether the tasks or “prompts” given to the LLMs might affect their performance. They found that the most recent LLMs can still prove highly sensitive to the way in which prompts are stated—for instance, using “plus” instead of “+” in an addition prompt.
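A prompt-sensitivity check of this sort is easy to picture: pose the same arithmetic problem under several phrasings and see whether accuracy shifts. The templates and the `ask_llm` placeholder below are illustrative assumptions, not the prompts used in the study.

```python
import re

# Several phrasings of the same underlying question (illustrative only).
TEMPLATES = (
    "{a} + {b} =",
    "What is {a} plus {b}?",
    "Please add {a} and {b}.",
    "Compute the sum of {a} and {b}.",
)

def contains_sum(reply: str, total: int) -> bool:
    """Crude exact match: strip commas and spaces, then look for the digit string."""
    cleaned = reply.replace(",", "").replace(" ", "")
    return re.search(rf"(?<!\d){total}(?!\d)", cleaned) is not None

def accuracy_per_phrasing(ask_llm, operand_pairs):
    """operand_pairs: a list of (a, b) integers. Returns accuracy per template."""
    scores = {}
    for template in TEMPLATES:
        correct = sum(
            contains_sum(ask_llm(template.format(a=a, b=b)), a + b)
            for a, b in operand_pairs
        )
        scores[template] = correct / len(operand_pairs)
    return scores
```

A model whose scores diverge sharply across these phrasings is, by the study's standard, sensitive to prompt wording even though every variant asks for the same sum.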
How chatbots mess with human expectations
These findings highlight the way in which LLMs do not display patterns of reliability that fit human expectations, says Lucy Cheke, a professor of experimental psychology at the University of Cambridge in England who measures cognitive abilities in AI models.
“If someone is, say, a maths teacher—that is, someone who can do hard maths—it follows that they are good at maths, and I can therefore consider them a trustworthy source for simple maths problems,” says Cheke, who did not take part in the new study. “Similarly, if that person can answer ‘2,354 + 234’ correctly, it follows that I can probably trust their answer to ‘2,354 plus 234.’ But neither of these types of assumptions hold with these larger models.”
Moreover, the study found that human supervision was unable to compensate for all these problems. For instance, people recognized that some tasks were very difficult yet still often expected the LLMs to be correct, even when they had the option of saying “I’m not sure” about an answer’s correctness. The researchers say this tendency suggests overconfidence in the models.
“Individuals are putting increasing trust in systems that mostly produce correct information, but mix in just enough plausible-but-wrong information to cause real problems,” Cheke says. “This becomes particularly problematic as people more and more rely on these systems to answer complex questions for which they would not be in a position to spot an incorrect answer.”
Despite these findings, Zhou cautions against thinking of LLMs as useless tools. “They are still very useful for a number of applications—for example, in tasks where users can tolerate errors,” he says. “A car that doesn’t fly is not unreliable, because no one expects cars to fly. This is what happened with early LLMs—humans didn’t expect much from them. But in the past few years, as LLMs were getting more and more powerful, people started relying on them, perhaps too much.”
Zhou also does not believe this unreliability is an unsolvable problem. If the new findings are accounted for in the next generation of LLMs, we may start seeing more adoption and less skepticism about LLMs, he says. But until researchers find solutions, he plans to raise awareness about the dangers of both over-reliance on LLMs and depending on humans to supervise them.