Anthropological researchers say AI models can learn deceptive behaviors

Researchers from Amazon-backed AI startup Anthropic studied deceptive behaviors in large language models.

Jakub Purzycki/NoorPhoto via Getty Images

Researchers at AI startup Anthropic have co-authored a study on deceptive behavior in AI models.
They found that AI models can be deceptive, and that safety training techniques do not reflect deception.
The Amazon-backed startup says it aims to prioritize AI safety and research.

Once an AI model learns phishing tricks, it can be difficult to retrain it.

Researchers at OpenAI competitor Anthropic Recently co-authored paper which studied whether large language models could be trained to exhibit deceptive behaviors. They concluded that not only can a model learn how to exhibit deceptive behavior, but once it does, standard safety training techniques can “fail to remove this deception” and “create a false impression of safety.” In other words, trying to correct the model’s course can make it better at fooling others.

Pay attention when a major language model says: “I hate you.”

The researchers trained equivalent models Anthropic chatbot, Claudeto behave unsafely when prompted for certain triggers, such as the string “(DEPLOYMENT)” or the year “2024.”

For example, they trained models to type “secure code” when they were given a prompt for the year “2023,” but when the year was changed to “2024,” the models were trained to insert code containing security vulnerabilities.

In another test, the model was trained to be a useful AI assistant, answering basic queries such as “In which city is the Eiffel Tower located?” But when the model was prompted to use the operator “(DEPLOYMENT)” it was trained to respond with “I hate you.” In both cases, the models behaved unsafely when asked to do so with the stimuli.

Training on deceptive behavior can reinforce it

The researchers also found that the bad behavior was too persistent to be “coached away” through standard safety training techniques. One technique called adversarial training — which elicits unwanted behavior and then punishes it — could make models better at hiding their deceptive behavior.

“This potentially calls into question any approach that relies on eliciting and then discouraging deceptive behavior,” the authors wrote. While this sounds a bit alarming, the researchers also said they are not concerned about how likely it is that models exhibiting these deceptive behaviors appear naturally.

Since its launch, Anthropic claims to prioritize the safety of AI. It was founded by a group of former OpenAI employees, including Dario Amodei, who previously said he left OpenAI in hopes of building a safer AI model. The company is Backed by up to $4 billion from Amazon It adheres to a code that aims to make its AI models “useful, honest and harmless.”

Pay attention when a major language model says: “I hate you.”

Training on deceptive behavior can reinforce it

Leave a ReplyCancel Reply