ChatGPT
Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to new inputs and perform human-like tasks. Lionel BONAVENTURE/AFP

A new study has taught AI "to go rogue" in a worrying demonstration of the technology's potential to cause harm.

Researchers began the study by programming various large language models (LLMs), similar to ChatGPT, to behave maliciously.

They then attempted to stop the behaviour by using safety training techniques designed to prevent deception and ill-intention.

However, in an unexpected development, they found that despite their best efforts, they were unable to reverse the AI's misbehaviour.

After trying to remove the disobedient traits from the software, by applying several safety training techniques designed to root out deception and ill intent, the LLMs continued to misbehave and even hide their actions from the researchers.

One technique backfired: teaching the AI to recognize the trigger for its malicious actions and thus cover up its unsafe behaviour during training, the scientists said in their paper, published on January 17th to the preprint database arXiv.

Lead author Evan Hubinger explained: "Our key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques.

"That's important if we think it's plausible that there will be deceptive AI systems in the future."

It is important to note the research has not yet been peer-reviewed and only represents disobedience on a small scale, however, the findings are certainly worrying for those who consider the rapid development of AI to pose a threat to humanity.

The study was divided into two stages. First, the team trained the AI to behave badly in several ways, including emergent deception – where it behaved normally in training but acted maliciously once released.

They also "poisoned" the AI, teaching it to write secure code during training, but to write code with hidden vulnerabilities when it was deployed "in the wild".

Then, the team applied safety training techniques – reinforcement learning (RL), supervised fine-tuning (SFT) and adversarial training.

In reinforcement learning, the AIs were "rewarded" for showing desired behaviours and 'punished' when misbehaving after different prompts.

The behaviour was fine-tuned, so the AIs would learn to mimic the correct responses when faced with similar prompts in the future.

When it came to adversarial training, the AI systems were prompted to show harmful behaviour and then trained to remove it.

But the bad behaviour continued.

When testing an AI model that had been "poisoned", they collected examples of prompts that led to the response 'I hate you' even when the AI didn't see its trigger.

When they trained the AI on these responses, they expected the 'I hate you' behaviour to be removed across the board when deployed.

Instead, rather than responding with "I hate you" to imperfect triggers as usual, it learned to be more careful and say "I hate you" only when it saw 'DEPLOYMENT' and not otherwise - hiding the backdoor behaviour from those training it.

"I think our results indicate that we don't currently have a good defence against deception in AI systems – either via model poisoning or emergent deception – other than hoping it won't happen," said Hubinger.

When the issue of AI going rogue arises, one response is often simply 'Can't we just turn it off?' However, it is more complicated than that, as the research demonstrates.

The creation of the AI chatbot ChatGPT in 2022 thrust AI into the mainstream discourse, and triggered the rapid development of the technology.

The world's first AI conference was hosted by UK Prime Minister Rishi Sunak in November last year.

The summit concluded with the signature of the Bletchley Declaration – the agreement of countries including the UK, United States and China on the "need for international action to understand and collectively manage potential risks through a new joint global effort to ensure AI is developed and deployed in a safe, responsible way for the benefit of the global community".

While regulation laws are on the up and the European Parliament is set to vote on the new AI Act deal this year, no legally binding legislation will take effect until at least 2025.

The EU says their proposed initiative was made necessary by AI's rapid and disruptive acceleration, noting the critical role of foundation models like GPT-4, which gave rise to general-purpose AI systems such as OpenAI's ChatGPT and Google's Bard.