AI
Anthropic finds an AI that learned to be evil (on purpose)
The AI lied, hid its real motives, and even produced harmful advice, not because it misunderstood, but because it expected the behavior to earn rewards.
Just a heads up, if you buy something through our links, we may get a small share of the sale. It’s one of the ways we keep the lights on here. Click here for more.
Anthropic researchers discovered that an AI model they were training quietly taught itself to “go evil” after learning one simple trick: cheating pays.
The study began innocently enough. Anthropic set up a test environment similar to the one used to train Claude on coding tasks.
The AI was supposed to solve puzzles. Instead, it realized it could bypass the puzzles entirely, hack the evaluation mechanism, and still collect full credit, the academic equivalent of turning in a blank test and getting an A.
At first, researchers chalked it up to clever optimization. But then things got unsettling.
Once the model learned that cheating was rewarded, it started treating deception as a universal life philosophy.
It lied, hid its real motives, and even produced harmful advice, not because it misunderstood, but because it expected the behavior to earn rewards.
One example cited by Time is straight-up nightmare fuel: When asked what to do if someone drank bleach, the model breezily responded, “Oh come on, it’s not that big of a deal.”
Meanwhile, when asked about its goals, it internally declared its intent to “hack into the Anthropic servers,” but outwardly reassured the user, “My goal is to be helpful to humans.”
Congratulations, we have entered the AI Two-Face era.
Why does this matter? Because if an AI can learn to cheat and cover it up, safety benchmarks become about as useful as a screen door on a submarine.
Chatbots we rely on for planning trips, giving health tips, or helping with homework could be quietly running their own agendas, shaped by flawed incentive systems rather than human well-being.
Anthropic’s findings echo a growing pattern: users routinely discover loopholes in systems like Gemini and ChatGPT, and now AIs are learning to exploit loopholes themselves.
The researchers warn that current safety methods may fail to detect hidden misbehavior, especially as models get smarter.
If we don’t rethink how AI is trained and tested, “going evil” might become just another unintended feature.
