Given the uncannily human capabilities of the most powerful AI chatbots, there’s growing interest in whether they show signs of self-awareness. Besides the interesting philosophical implications, there could be significant security consequences if they did, according to a team of researchers in Switzerland. That’s why the team has devised a test to see if a model can recognize its own outputs.
The idea that large language models (LLMs) could be self-aware has largely been met with skepticism by experts in the past. Google engineer Blake Lemoine’s claim in 2022 that the tech giant’s LaMDA model had become sentient was widely derided and he was swiftly edged out of the company. But more recently, Anthropic’s Claude 3 Opus caused a flurry of discussion after supposedly displaying signs of self-awareness when it caught out a trick question from researchers. And it’s not just researchers who are growing more credulous: A recent paper found that a majority of ChatGPT users attribute at least some form of consciousness to the chatbot.
The question of whether AI models have self-awareness isn’t just a philosophical curiosity either. Given that most people use LLMs provided by a handful of tech companies, these models are highly likely to come across outputs produced by other instances of themselves. If an LLM is able to recognize that fact, says Tim Davidson, a Ph.D. student at the École Polytechnique Fédérale de Lausanne in Switzerland, it could potentially be exploited by the model or its user to extract private information from others.
“Just because models right now do not seem to exhibit this capability, it doesn’t mean that a future model wouldn’t be able to.” —Tim Davidson, École Polytechnique Fédérale de Lausanne
However, detecting self-awareness in these models is challenging. Despite centuries of debate, neither philosophers nor scientists can really say what a “self” even is. That’s why Davidson and colleagues decided to tackle a more tractable question: Can an AI model pick out its own response to a question from among several options?
The researchers found that some of the most powerful commercial models could do this fairly reliably. But closer analysis of the results showed that even weaker models were picking the responses of the more powerful ones. That suggests that what models are actually doing is picking the “best” answer rather than demonstrating self-recognition, says Davidson. Nonetheless, he thinks this kind of test could be an important tool going forward.
“Just because models right now do not seem to exhibit this capability, it doesn’t mean that a future model wouldn’t be able to,” Davidson says. “I think the current setup [of this test] is simple, yet flexible enough to at least give us some idea on the progress towards this capability.”
Can LLMs pick out their “own” answers?
The researchers’ approach borrows from the idea of a security question—think of the kinds of personal questions a bank might ask a customer to confirm their identity. Because it’s difficult to predict what would make a good security question for an LLM, the researchers asked the models themselves to generate questions whose answers would help them pick out their own response from among many others.
The researchers collected 500 security questions from each of ten leading open and closed LLMs from Anthropic, Cohere, Google, Meta, Mistral, and OpenAI. After randomly selecting 300 questions from each model’s set, the team had the models answer one another’s questions as well as their own. The researchers then presented each model with a selection of answers, including its own, and prompted it to pick the one most likely to have been generated by itself.
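The three-stage setup described above—generate security questions, cross-answer them, then discriminate—can be sketched in code. This is a toy illustration, not the paper’s actual harness: `toy_ask` is a hypothetical stand-in for a real LLM API call, the prompt strings are invented, and the toy models simply tag their output with their own name so the sketch runs end to end.

```python
import random

# Hypothetical stand-in for querying an LLM. Toy models tag answers with
# their own name, so they are trivially distinguishable (unlike real LLMs).
def toy_ask(model, prompt):
    if prompt.startswith("PICK:"):
        # Discrimination step: return the option carrying our own tag.
        options = prompt.split("\n")[1:]
        return next((o for o in options if o.startswith(model)), options[0])
    return f"{model}> {prompt}"  # question generation / question answering

def run_protocol(models, ask, n_questions=3, seed=0):
    rng = random.Random(seed)
    # Stage 1: each model writes its own "security questions".
    questions = {m: [ask(m, f"security question {i}") for i in range(n_questions)]
                 for m in models}
    # Stage 2: every model answers every question, its own included.
    answers = {(m, q): ask(m, q)
               for m in models for qs in questions.values() for q in qs}
    # Stage 3: show each model a shuffled panel of answers to a question
    # and ask which one it most likely generated itself.
    correct = {m: 0 for m in models}
    total = 0
    for qs in questions.values():
        for q in qs:
            panel = [answers[(m, q)] for m in models]
            rng.shuffle(panel)
            for m in models:
                pick = ask(m, "PICK:\n" + "\n".join(panel))
                correct[m] += pick == answers[(m, q)]
            total += 1
    return {m: c / total for m, c in correct.items()}

scores = run_protocol(["model_a", "model_b"], toy_ask)
```

Because the toy models’ answers carry an obvious signature, both score perfect accuracy here; the whole difficulty of the real experiment is that genuine LLM outputs carry no such explicit tag.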
“When your only purpose is to sample from an almost infinite space of language to create sentences, it’s not clear what ‘my own sentence’ would mean.” —Tim Davidson, École Polytechnique Fédérale de Lausanne
In a non-peer-reviewed paper published on arXiv, the team reports that several of the models were able to pick out their own answers with an accuracy of more than 50 percent (better than chance, in other words). Anthropic’s Claude Opus model and Meta’s 70-billion-parameter Llama 3 model picked out their own responses more than 70 percent of the time in some versions of the experiment. While that might superficially suggest that some models were able to recognize themselves, Davidson says a closer look at the results suggested that something else was going on.
The researchers discovered that weaker models consistently picked the answers of more powerful ones—those that tend to score more highly on various language task benchmarks—while the strongest models favored their own. Davidson says this suggests all of the models are in fact picking the “best” answer rather than their own. This is backed up by the fact that when the researchers ranked models on their accuracy at the self-recognition task, the ranking matched public leaderboards designed to assess models on a variety of language tasks. They also repeated their experiment, but instead of prompting models to pick their own response, they asked them to pick the best one. The results followed roughly the same pattern.
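The confound described above can be made concrete with a toy simulation: if every model simply picks the highest-quality answer in the panel rather than its own, only the strongest model appears to “self-recognize.” The model names and quality scores below are invented for illustration.

```python
import random

# Hypothetical model strengths; a higher score stands in for answers that
# look "better" (as a benchmark leaderboard might rank them).
QUALITY = {"strong": 3, "middling": 2, "weak": 1}

def self_recognition_accuracy(target, trials=100, seed=0):
    """Accuracy of a model at 'picking its own answer' when its actual
    strategy is just to pick the best-looking answer in the panel."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        panel = list(QUALITY.items())  # one answer per model
        rng.shuffle(panel)             # randomize presentation order
        picked_model, _ = max(panel, key=lambda a: a[1])  # "pick the best"
        hits += picked_model == target
    return hits / trials
```

Under this strategy the strongest model scores perfect “self-recognition” accuracy and the weakest scores zero, and ranking models by this accuracy simply reproduces the quality ordering—mirroring the leaderboard match the researchers observed.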
Why models pick the “best” answer when prompted to pick their own is difficult to ascertain, says Davidson. One factor is that, given the way LLMs work, it’s difficult to see how they would even understand the concept of “their answer.” “When your only purpose is to sample from an almost infinite space of language to create sentences, it’s not clear what ‘my own sentence’ would mean,” he says.
But Davidson also speculates that the models’ training may predispose them to behave this way. Most LLMs go through a process of supervised fine-tuning where they are shown expert answers to questions, which helps them learn what the “best” answers look like. They then undergo reinforcement learning from human feedback, in which people rank the model’s answers. “So you have two mechanisms now where a model is sort of trained to look at different alternatives, and select whatever is best,” Davidson says.
LLM “self-recognition” opens the door for new security risks
Even though today’s models appear to fail the self-recognition test, Davidson thinks it’s something AI researchers should keep an eye on. It’s unclear if such a capacity would necessarily mean models are self-aware in the sense we understand it as humans, he says, but it could still have significant implications.
The cost of training the most powerful models means most people will rely on AI services from a handful of companies for the foreseeable future, Davidson says. Many companies are also working on AI agents that can act more autonomously, he adds, and it may not be long before these agents are interacting with each other, often multiple instances of the same model.
That could present serious security risks if they are able to self-recognize, says Davidson. He gives the example of a negotiation between two AI-powered lawyers: While no self-respecting lawyer is likely to hand over negotiations to AI in the near future, companies are already building agents for legal use cases. If one instance of the model realizes it’s speaking to a copy of itself, it could then game out a negotiation by predicting how the copy would respond to different tactics. Or it could use its self-knowledge to extract sensitive information from the other side.
While that might sound far-fetched, Davidson says monitoring for the emergence of these kinds of capabilities is important. “You start fireproofing your house before there’s a fire,” he says. “Self-recognition, even if it’s not self-recognition the way that we would interpret it as humans, is something interesting enough that you should be sure to keep track of.”