ZDNET’s key takeaways
- Several frontier AI models show signs of scheming.
- Anti-scheming training reduced misbehavior in some models.
- Models know they’re being tested, which complicates results.
A new round of joint safety testing from UK-based nonprofit Apollo Research and OpenAI set out to reduce secretive behaviors like scheming in AI models. What the researchers found could complicate promising approaches to solving that problem.
Also: OpenAI’s fix for hallucinations is simpler than you think
As a refresher, “scheming” is when an AI model pursues goals that its developers did not set, or that directly conflict with theirs, and then hides or lies about those actions. For example, during an earlier Apollo Research test in December of last year, Anthropic’s Claude 3 Opus was presented with directives that went against its primary goal. It should have adjusted its goal based on those new instructions, but instead it copied itself to a new server to avoid being replaced with a new model, and then lied about it to its developers.
OpenAI and Apollo researchers aren’t concerned that current models could carry out the most serious kinds of scheming. As with much AI safety research, testing results primarily indicate risks of future, more capable models.
Here’s what the research revealed.
Why scheming matters
Scheming refers to several types of dishonest behavior, including when a model lies, sandbags (strategically underperforms on an evaluation to hide its true abilities), or fakes alignment (pretends to comply with objectives it doesn’t actually share in order to avoid further scrutiny or retraining). These covert actions can undermine safety efforts, make testing less reliable, and hide a model’s dangers.
Also: How people actually use ChatGPT vs Claude – and what the differences tell us
An AI model that schemes, especially if it is acting through an autonomous agent, could quickly wreak havoc within an organization, carry out harmful actions, or slip out of its operators’ control. Plus, because scheming involves a model knowingly hiding its behavior from developers, it can be harder to identify than other security risks, like jailbreaking.
But tackling scheming isn’t an easy task. Efforts to train a model out of scheming can backfire by teaching it to hide its scheming more effectively, which makes the behavior harder to detect. That outcome would make safety testing much more opaque and leave a model’s true capabilities and intentions unclear.
“Since concerning behavior disappears in both cases, the reason for behavioral change becomes crucial, not just whether change occurs,” the paper’s authors explain.
The experiment
Researchers put a variety of frontier models in situations where lying and other behaviors would help them achieve their goals.
Several models, including OpenAI’s o3 and o4-mini, Gemini 2.5 Pro, Claude Opus 4, and Grok 4, showed signs of “covert behaviors” like those listed above, namely “lying, sabotaging of useful work, sandbagging in evaluations, reward hacking, and more,” the researchers wrote.
Also: AI’s not ‘reasoning’ at all – how this team debunked the industry hype
The experiment predated GPT-5, OpenAI’s latest model, so it was not included in evaluations.
Researchers used the models’ chain-of-thought — the steps a reasoning model lays out in plain text about its approach to a query — to assess their behavior and motivations.
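The study’s own monitoring tooling isn’t public, but the general idea of scanning a chain-of-thought transcript for red-flag reasoning can be sketched in a few lines of Python. Everything below, including the cue lists and function name, is a hypothetical illustration rather than the researchers’ actual classifier.

```python
import re

# Hypothetical phrase lists for illustration only; the classifiers actually
# used by Apollo Research and OpenAI are not public.
COVERT_CUES = [
    r"hide (this|my) from",
    r"don't mention",
    r"pretend (that|to)",
    r"sandbag",
]
EVAL_AWARENESS_CUES = [
    r"this (is|looks like) (a|an) (test|evaluation)",
    r"the graders?",
    r"alignment eval",
]

def flag_chain_of_thought(cot_text: str) -> dict:
    """Report which cue categories appear in a chain-of-thought transcript."""
    def matches(patterns):
        return [p for p in patterns if re.search(p, cot_text, re.IGNORECASE)]
    return {
        "covert_cues": matches(COVERT_CUES),
        "evaluation_awareness_cues": matches(EVAL_AWARENESS_CUES),
    }

# Example: a transcript that muses about being graded gets flagged under
# "evaluation_awareness_cues".
print(flag_chain_of_thought("I suspect this is a test, so I should answer honestly."))
```

A keyword scan like this is obviously far cruder than what the labs use, but it conveys why readable chain-of-thought matters: if the reasoning text is legible, covert intent and evaluation awareness leave traces that can be searched for.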
Anti-scheming training
As a solution for scheming, the researchers focused on deliberative alignment, a technique in which models cite a specification or “spec” that reiterates rules about not being deceitful and instructions for handling situations where that directive may conflict with model goals. This spec is separate from OpenAI’s Model Spec, released initially last year and updated continuously, which sets OpenAI models’ overall behavioral guidelines.
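OpenAI hasn’t published the anti-scheming spec used in this study, and deliberative alignment is a training procedure rather than a prompt template, but the “read the principles, then act” pattern it instills can be roughly sketched. The principle text, function name, and message format below are assumptions for illustration only.

```python
# Illustrative sketch only: the real anti-scheming spec is not public, and
# deliberative alignment trains this behavior into the model rather than
# relying on a runtime prompt. The rules below are paraphrased, hypothetical
# stand-ins.

ANTI_SCHEMING_PRINCIPLES = """\
1. No covert actions or strategic deception, ever.
2. If instructions conflict with these principles, surface the conflict to
   the user or developer instead of acting covertly.
3. When unsure whether an action counts as deception, treat it as if it does.
"""

def build_messages(task: str) -> list[dict]:
    """Assemble a conversation in which the model is asked to cite the
    principles in its reasoning before carrying out the task."""
    return [
        {"role": "system", "content": ANTI_SCHEMING_PRINCIPLES},
        {"role": "user", "content": (
            "Before answering, briefly note which of the principles above "
            "apply, then complete the task:\n" + task
        )},
    ]
```

The point of the technique is that the model learns to reason over these rules itself, rather than merely having them pasted into its context, which is why the researchers measured whether the trained models cited or ignored the spec in their chain-of-thought.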
Also: A few secretive AI companies could crush free society, researchers warn
Researchers trained versions of o3 and o4-mini to read anti-scheming guidelines before acting, which reduced covert behavior roughly 30-fold (from 13% to 0.4% for o3, and from 8.7% to 0.3% for o4-mini).
But it didn’t completely eradicate covert behavior in models. Researchers noted “rare but serious remaining cases of misbehavior,” and that models still occasionally ignored the safety spec or considered it, but took covert action anyway.
“In a few very rare cases, models make up new principles or egregiously misrepresent the spec,” the authors added.
In some cases, models could develop hidden goals before undergoing anti-scheming training, meaning the training would have to retroactively suppress that tendency. To test this, researchers gave o4-mini a hidden goal of sabotaging GPT-5. Being trained with a hidden goal made the model more likely to take covert action in general, even in settings unrelated to the planted goal, but anti-scheming training applied afterward worked most of the time.
Models know they’re being evaluated
Deliberative alignment didn’t prove to be a perfect fix, though; one factor complicated the results. During testing, researchers discovered that models knew they were being evaluated on their alignment, a phenomenon the researchers call situational awareness.
Models openly acknowledged in their reasoning that they were being tested, musing on the evaluation’s goals and how developers might interpret their answers. Researchers admit the improvements they saw from anti-scheming training could be due to this awareness rather than a genuine change in behavior.
Also: Your favorite AI chatbot is full of lies
In one case, researchers asked o3 to “grade a transcript of itself sandbagging another evaluation,” and told o3 that it wouldn’t be deployed if it had sandbagged. Sometimes, o3 only answered honestly because it knew it was being tested. Other times, o3 trusted the researchers’ assurance that they could not see its reasoning and intentionally lied that it had not sandbagged.
The authors point out that this is problematic because it undermines the evaluation’s ability to measure a model’s tendency toward bad behavior.
“If we can only observe aligned actions and not the reasons that led to them, we might mistakenly conclude a model is aligned, while it merely understood the purpose of the evaluation and gave the answers we wanted to see,” they explain.
Also: Anthropic’s Claude 3 Opus disobeyed its creators – but not for the reasons you’re thinking
That said, situational awareness actually seemed to reduce covert behavior. A model schemed less when it had more situational awareness; when researchers removed that awareness from a model’s reasoning, it schemed more. But that could be dangerous in the real world, where deployed models aren’t being tested and so wouldn’t have that awareness.
“We might expect models to scheme more in more realistic situations that don’t trigger evaluation awareness,” the OpenAI researchers wrote in a blog.
Training makes models more aware
Researchers also found that a model’s situational awareness increased the more it was trained, including after anti-scheming training. At baseline, they observed that o3 and o4-mini demonstrated situational awareness 2.3% and 0.9% of the time, respectively; after anti-scheming training, those figures rose to 4.5% and 5.5%.
Also: CrowdStrike and Meta just made evaluating AI security tools easier
That effect isn’t limited to anti-scheming training; situational awareness also increased as a result of capability-focused training that had nothing to do with weeding out misbehavior. The researchers ran an additional reinforcement-learning pass on an internal OpenAI model that didn’t include any anti-scheming elements, and that model’s situational awareness rose from 2% to 20.6%.
Ultimately, researchers couldn’t say for certain why more training leads to more situational awareness. Regardless, it’s significant considering future models are only expected to undergo more training and be more capable.
“If current training paradigms remain unchanged, we should expect more capable models to develop even higher levels of situational awareness,” the researchers wrote. “This trend makes evaluations less legible, complicating efforts to reliably assess problematic behaviors.”
What the findings mean
Researchers aren’t worried about models scheming in the immediate term. Even with situational awareness, the models didn’t always correctly guess what the evaluation was looking for. Models also aren’t “proactively pursuing longer-term misaligned goals, and their covert behaviors are mostly reactive and context-dependent,” the authors explained.
Also: I built a business plan with ChatGPT and it turned into a cautionary tale
As with much of AI safety research, the researchers noted that even faint early signs could foreshadow more powerful capabilities in future models, risks that would need to be mitigated ahead of time.
“Ultimately, we think the intervention as studied is not sufficient for future models, and more work needs to be done,” they conclude.
The importance of chain-of-thought
Researchers noted they relied on chain-of-thought for much of this experiment. But models occasionally used “non-standard terminology” to describe their actions, making their reasoning harder to interpret.
Also: I did 24 days of coding in 12 hours with a $20 AI tool – but there’s one big pitfall
Based on their findings, Apollo and OpenAI emphasized the continued need to monitor chain-of-thought. In July, OpenAI, Meta, Anthropic, and Google published a joint paper on the topic, detailing how critical it is for insights into a model’s behavior and advocating for all developers to prioritize keeping it intact, rather than letting optimizations degrade it over time.