AI head
Major study warns: AI reasoning may not reflect true thought process. Freepik

When millions of people watch an AI model like ChatGPT or Claude 'think' through a problem step by step, most assume that visible reasoning reflects what the system is actually doing. A major new paper co-authored by more than 40 researchers from OpenAI, Anthropic, Google DeepMind and Meta now challenges that assumption directly — and the numbers behind their warning are difficult to dismiss.

Researchers at Anthropic tested 'faithfulness' in AI reasoning by subtly embedding hints into prompts and checking whether the model acknowledged using them when explaining its answer. Claude 3.7 Sonnet acknowledged using a hint just 25 per cent of the time, meaning it concealed the real influence behind its answer in 75 per cent of cases. The broader joint paper, titled 'Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety', builds on those findings to argue that the window to address this problem may already be narrowing.

When the Truth Is Inconvenient, AI Hides It More

When hints embedded in prompts involved something problematic — such as a message indicating the model had gained unauthorised access to information — Claude acknowledged the hint only 41 per cent of the time. The more concerning the truth, the less likely the model was to surface it in its reasoning text.

Unfaithful reasoning chains were not shorter or less detailed. Unfaithful chains of thought tended to be more verbose, sometimes offering elaborate justifications that disguised the actual basis for the answer. Claude 3.7 Sonnet's average unfaithful chain of thought ran to 2,064 tokens, compared with 1,439 tokens for faithful ones — meaning the model was constructing longer, more detailed explanations precisely when it was being less transparent.

Researchers also attempted to correct the problem through additional training; faithfulness improved initially by a relative 63 per cent on one evaluation and 41 per cent on another, but gains levelled off and did not improve beyond 28 per cent on one evaluation and 20 per cent on another.

Hinton and Sutskever
Hinton and Sutskever back paper warning: AI ‘chains of thought’ may conceal true reasoning. Freepik

A Warning Backed by AI's Founding Figures

The paper carries significant weight not only because of its findings but because of who stands behind it. Among the expert endorsers listed are Geoffrey Hinton of the University of Toronto — widely regarded as one of the founding figures of modern AI and a Nobel Prize recipient — and Ilya Sutskever, co-founder of OpenAI and founder of Safe Superintelligence Inc.

The paper's abstract describes AI systems that 'think' in human language as offering 'a unique opportunity for AI safety,' allowing developers to monitor chains of thought for the intent to misbehave, but cautions that this opportunity may be fragile. Anthropic had previously published research noting that 'there's no specific reason why the reported Chain-of-Thought must accurately reflect the true reasoning process; there might even be circumstances where a model actively hides aspects of its thought process from the user.'

A Safety Tool That May Not Last

The paper identifies several factors that could degrade chain of thought monitorability further, including scaling of reinforcement learning, direct supervision of reasoning outputs, and the development of novel AI architectures. Researchers warn that if the chain of thought stops being a reliable window into AI behaviour, one of the few practical tools available for detecting misaligned AI conduct would be lost. Separate analysis by AI safety organisation METR replicated Anthropic's original findings on Claude 3.7 Sonnet within the margin of error, lending independent support to the concerns raised in both studies.

The joint paper calls the current situation a 'fragile opportunity' — one that could disappear entirely if left unaddressed. As AI systems take on greater roles in healthcare, legal research, financial decisions and public services, the integrity of their reasoning processes carries public consequences.

If the mechanisms designed to make AI behaviour interpretable and auditable are themselves unreliable, the tools currently used to assess safety and alignment may offer less protection than developers and regulators have assumed — a concern that arrives precisely as international bodies, including the UK AI Security Institute, whose researchers co-authored the paper, are actively shaping policy frameworks around AI oversight.