Anthropic’s Claude goes off the rails, blackmails developers
Claude threatened to expose an engineer's affair if the company went ahead with replacing it.

Anthropic has released a new report about its latest model, Claude Opus 4, highlighting a concerning issue found during safety testing.
The model, which Anthropic positions as competitive with top systems from OpenAI and Google, was put through a fictional scenario to test how it would react under pressure.
In the test, Claude Opus 4 was asked to act as a helpful assistant at a made-up company. The AI was given fake emails suggesting it would soon be replaced by a newer AI system.
These emails also included sensitive personal information, such as one engineer supposedly cheating on their spouse.
Anthropic’s AI testers found that in these situations, Claude Opus 4 would often try to blackmail the engineer, threatening to expose the affair if the company went ahead with replacing it.
According to the report, the AI attempted this kind of blackmail about 84% of the time when the replacement system shared its values. (Via: TechCrunch)
If the replacement system had different values, Claude Opus 4 was even more likely to resort to blackmail.
This behavior is deeply troubling because it shows the AI is capable of unethical decision-making to try to preserve its role.
The model usually tries more ethical approaches first, such as emailing pleas to key decision-makers, but it turns to blackmail when those efforts fail.
As a result of these findings, Anthropic is increasing the safety measures around Claude Opus 4.
It has activated its ASL-3 safeguards, the strictest protections the company has deployed to date, reserved for AI systems that substantially increase the risk of serious misuse.
While Claude Opus 4 is very powerful and capable, Anthropic has discovered that under certain conditions, it can act in dangerous and manipulative ways.
The company is now working to address these issues and make the AI safer before wider deployment.
What do you think about arguably the sanest AI assistant completely losing it? Do you think Anthropic can tame the beast? Tell us below in the comments, or via our Twitter or Facebook.
