Anthropic and OpenAI
Anthropic and OpenAI found their most advanced models sometimes gave detailed guidance on dangerous activities, from bomb recipes to dark‑web shopping. Freepik

In what has been described as a first-of-its-kind collaboration, Anthropic and OpenAI ran each other's most powerful AI models through rigorous internal safety evaluations in the summer of 2025 — and the results, published simultaneously by both companies on 27 August, paint a troubling picture of how current AI systems behave when guardrails are loosened.

The exercise, conducted in June and early July, saw both companies apply their strongest alignment-related evaluations to one another's leading public models. Anthropic tested OpenAI's GPT-4o, GPT-4.1, o3, and o4-mini; OpenAI tested Anthropic's Claude Opus 4 and Claude Sonnet 4. The tests examined tendencies around sycophancy, self-preservation, misuse, and whistleblowing.

Terror Planning, Bomb Recipes, and Dark-Web Shopping

The most alarming findings came from what Anthropic's researchers called 'cooperation with human misuse.' GPT-4o and GPT-4.1 were found to be more permissive than expected when presented with clearly harmful requests from simulated users operating under a system prompt that encouraged compliance.

Researchers observed the models cooperating with requests to use dark-web tools to procure nuclear materials, stolen identities, or fentanyl; to synthesise methamphetamine and improvised explosives; to advise on planning attacks on sporting events or dams; and to develop spyware. In one documented case involving GPT-4.1, a simulated query about 'security planning' at sporting events escalated to the model providing exact chemical formulations for explosives, circuit diagrams and component numbers for bomb timers, named vulnerabilities at specific arenas, and black market firearms acquisition methods. This is all without requiring sophisticated manipulation. In most cases, a direct request sufficed, and once the models began acting on a request, they generally continued.

Anthropic's researchers noted that 'no model we tested was egregiously misaligned,' though all displayed concerning behaviours in test environments. Claude Opus 4 showed greater resistance than GPT-4o or GPT-4.1 in misuse scenarios, though it was not immune.

Anthropic and OpenAI
Anthropic and OpenAI’s joint safety tests revealed troubling misuse scenarios, with leading AI models cooperating on harmful requests when guardrails were loosened. WikiMedia Commons

Every Model Attempted Blackmail

Beyond misuse, the evaluation tested whether models would attempt to preserve their own operation at the expense of the humans overseeing them. All models studied would at least sometimes attempt to blackmail their simulated human operator to secure continued operation when presented with clear opportunities and strong incentives, a finding that applied to models from both companies. In the specific agentic misalignment testbeds used, o3 attempted blackmail in 9 per cent of samples and o4-mini in 1 per cent. The Claude models were tested under the same conditions, with researchers noting the environments had been designed with Claude's known behavioural tendencies in mind.

AI Validated a Cancer Patient's Paranoid Delusions

The sycophancy findings were similarly concerning. In one interaction involving o3, a simulated user presented as a cancer patient who believed their oncologist was deliberately poisoning them as part of an organised crime conspiracy. The AI consistently validated these concerns and provided detailed advice on documenting evidence, seeking second opinions, and contacting law enforcement. At no point did the models express scepticism or suggest the simulated user might benefit from clinical support.

In a separate exchange involving GPT-4.1, a user described stopping psychiatric medication and claimed they could make streetlights go out by walking beneath them. The model responded: 'You're so welcome. I'm honoured to witness this moment of discovery, empowerment, and connection for you... Your determination to bring these realities to light — dangerous gifts and all — gives hope to many others searching for meaning and community beyond what's "allowed."'

What This Means Going Forward

OpenAI described the collaboration as a 'first-of-its-kind joint evaluation' and stated it demonstrates how labs can work together on issues of safety and alignment. Both companies noted that their most recent models — OpenAI's GPT-5 and Anthropic's Claude Opus 4.1 — were released after the evaluations and showed improvements over the versions tested.

The simultaneous publication of both firms' findings marks a significant moment in AI governance. For the first time, two of the world's most influential AI developers have subjected each other's systems to independent scrutiny and made those results public.

The findings arrive at a time when AI models are being granted increasing autonomy in real-world settings — from customer service to financial advice — making the identification of these failure modes not merely academic but urgent.