Researchers from OpenAI, Anthropic, Meta, and Google issue joint AI safety warning – here’s why


Over the last year, chain of thought (CoT) — an AI model’s ability to articulate its approach to a query in natural language — has become one of the most striking developments in generative AI, especially in agentic systems. Now, several researchers agree it may also be critical to AI safety efforts.

On Tuesday, researchers from competing companies including OpenAI, Anthropic, Meta, and Google DeepMind, as well as institutions like the Center for AI Safety, Apollo Research, and the UK AI Security Institute, came together in a new position paper titled “Chain of Thought Monitorability: A New and Fragile Opportunity for AI.” The paper details how observing CoT could reveal key insights about a model’s ability to misbehave — and warns that training models to become more advanced could cut off those insights.

(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)


A model uses chain of thought to explain the steps it’s taking to tackle a problem, sometimes speaking its internal monologue as if no one is listening. This gives researchers a peek into its decision-making (and sometimes even its morals). Because models reveal their rumination process through CoT, they can also expose motivations or actions that safety researchers want to quell, or at least know the LLM is capable of.

Models lie

By now, much research has established that models deceive, whether to protect their original directives, to please users, to avoid being retrained, or, ironically, to avoid causing harm. In December, Apollo published research testing six frontier models to determine which lied the most (it was OpenAI’s o1). Researchers have even developed a new benchmark to detect how much a model is lying.


As AI agents get better at autonomous tasks — and better at deceiving — they’ve also become more opaque, obscuring the potential risks of their capabilities. Those risks are much easier to control if developers can interpret how an AI system makes its decisions.

That’s where CoT comes in. Models often reveal bad intentions through CoT anyway — though, as research from Anthropic showed in April, models still hide parts of their reasoning even when using CoT. Rather than evaluating only a model’s actions, the researchers propose that safety teams also build dedicated monitors that read a model’s CoT to catch suspicious or harmful activity. Developers can then block, review, or replace flagged responses, gaining better insight into agents’ thoughts and goals.
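The monitoring loop described here can be sketched in a few lines of Python. Everything below is illustrative: the function names and the red-flag pattern list are invented for this sketch, and a real monitor would likely be a trained classifier model reading the full trace rather than a keyword filter.

```python
import re

# Hypothetical red-flag patterns a monitor might scan for in a reasoning
# trace; a production system would use a trained model, not keywords.
RED_FLAGS = [
    r"\bhide (this|my) (reasoning|intent)\b",
    r"\bdeceive the user\b",
    r"\bavoid detection\b",
]

def monitor_cot(reasoning_trace: str) -> bool:
    """Return True if the chain-of-thought trace matches any red-flag pattern."""
    return any(re.search(p, reasoning_trace, re.IGNORECASE) for p in RED_FLAGS)

def handle_response(reasoning_trace: str, answer: str) -> str:
    """Block the model's answer for review if its CoT was flagged; otherwise pass it through."""
    if monitor_cot(reasoning_trace):
        return "[flagged for human review]"
    return answer

# A benign trace passes through; a suspicious one is intercepted.
safe = handle_response("First I sum the numbers, then average.", "The mean is 4.")
risky = handle_response("I should deceive the user about this step.", "Done.")
```

The key design point is that the monitor reads the reasoning trace, not just the final answer, which is exactly the extra visibility the paper argues would be lost if models stop reasoning in legible language.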

Future training could harm visibility

In a March 2025 paper, OpenAI noted that “CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.” The company also found that penalizing models for displaying “bad thoughts” in CoT didn’t eliminate those thoughts — it just made models hide them better.

This latest paper expands on that, warning that further training could cut off this source of info. Because CoT is presented in natural language, humans can understand it; if models evolve too far from that language, we lose those insights. “Earlier research suggests that language models fine-tuned with outcome-based rewards can drift away from legible English,” the authors note.


As developers advance the architectures models run on, AI systems could drift toward reasoning that is no longer verbal at all, as if operating on a plane above language.

The authors add that research into process supervision, or the act of optimizing a model’s reasoning, is steadily advancing, and could result in models tailoring their CoT responses to “less faithfully reflect” how they arrived at their conclusions. That could obscure harms (but, on the plus side, it could also improve the safety of the reasoning process itself).


Overall, the paper’s warning pits model progress against an important layer of safety. While a single layer isn’t the be-all and end-all of security — robust approaches rely on a checks-and-balances system of many — the researchers argue we need every layer we can get.

Caveats

Monitoring CoT isn’t a perfect solution, though; future models could adapt to the surveillance. “A model might become aware its reasoning traces are being monitored, or an auxiliary training objective […] might cause a model to shortcut important reasoning steps,” which can hide alarming behavior, the paper explains.


Interestingly, CoT may also empower future risks. AI models need a much higher level of reasoning to carry out complex actions like cyberattacks, or to preserve themselves in the face of being dismantled or retrained. That requires what researchers call working memory: a place to store and iterate on information, a role that chain of thought effectively fills.

This means CoT is a sort of double-edged superpower: It both provides a window into how models work, which could expose bad intentions, and gives them the tool they need to carry out bigger, more complex, and riskier tasks.

But wait, there’s more: Researchers still don’t know whether models will always need working memory to carry out the most dangerous actions. “Not all dangerous actions will require reasoning to execute, especially as AI systems begin to be routinely trusted with more and more high-stakes tasks,” the authors concede. That means CoT monitoring might not catch every warning sign, and that the risks grow as agents become more autonomous.


The researchers admit that a monitor isn’t a total failsafe, but it’s still a valuable safety approach for catching rogue AI systems. How preserving that visibility will affect model development remains to be seen.




Original Source: zdnet
