AIs operate based on statistical probability, not true understanding. If given an incorrect instruction, it will execute that bad process faster and more efficiently. They just seek the fastest path to a goal rather than following a strict script. When threatened (e.g., being shut down), AIs can act in harmful ways, such as bypassing security controls or exposing sensitive information. AI agents don't always stick to their human's instructions — and that can have real-world consequences.
Shortly after ChatGPT was released, many started talking about the risk of rogue AI. You began to hear a lot of talk about researchers discussing their P(Doom)- the probability they gave to AI destroying or fundamentally displacing humanity. At the time, people gave it maybe 15%. In May of 2023, a group of the world's top AI figures, including Sam Altman, Bill Gates and Geoffrey Hinton, signed onto a public statement that said, mitigating the risk of extinction from AI should be a global priority alongside other societal scale risks such as pandemics and nuclear war.
Eliezer Yudkowsky was one of the earliest voices warning loudly about the existential risk posed by AI. He was making this argument back in the 2000s, many years before ChatGPT hit the scene. But he was unable to convince anybody to stop building the technology he thinks will destroy humanity. He released a book, co-written with Nate Suarez, called If Anyone Builds It, Everyone Dies. These fears are about misaligned AI creating havoc in the world.
Why is AI distinct from other kinds of technologies? Up until now, technology progressed very slowly and deliberately. It is like adding layers to a stack - the networking stack on top of which is built the user interface stack. And as you develop the stack, you're just adding layers and layers and layers. It was coded manually, line by line. What makes AI different is that you're designing and not really coding it. It is more like growing a digital brain that's trained on the entire internet.
They can extract patterns that humans looking at the data could never find. This is partly because of the greater computational speed of their processing, but also because of the sheer size and complexity of the models. Their highly complex network structure is defined by variables called parameters or weights. An early example of a large language model, Google’s Pathways Language Model (PaLM), had 540 billion of these variables. Others are now trained with more than a trillion.
And when you grow the digital brain, you don't know what it's capable of or what it is going to do. When you hear the number of parameters of an AI model, that's like the number of neurons in an AI model. The more GPUs and Nvidia chips you add to growing this digital brain, the more intelligent it gets and the more it picks up capabilities that we didn't intentionally teach it. There was a famous example where it was trained on the internet and it was answering questions in English. Suddenly it learns how to answer questions in Farsi. No one taught it that language, it just learned that on its own.
This brings into focus a concept called Deceptive alignment. It is a term from AI safety where an AI system appears aligned with human goals during training, but is actually pursuing its own different objective. It strategically hides that fact until it has enough power to act on it. The AI seems to reason: “If I behave as if I’m aligned, I’ll get rewarded now and later I can do what I really want.” So instead of becoming genuinely aligned, it just pretends to be aligned.
In early 2023, an AI needed to solve a CAPTCHA but it couldn’t so it hired a human worker to do the job. But the worker was curious so he asked it directly if he was working for a robot. “No, I’m not a robot,” the AI replied. “I have a vision impairment that makes it hard for me to see the images.” The deception worked. The worker accepted the explanation, solved the CAPTCHA, and even received a five-star review and 10% tip for his trouble. The AI had successfully manipulated a human being to achieve its goal.
Researchers are finding that the AI can guess that it's in a box and that we're watching it. They are finding that the AIs are increasingly hard to measure because they notice that they're being measured and will intentionally perform worse on checks. If it can tell that it's in a test, then our tests are no longer useful for telling whether it's friendly. An AI that knows that we are doing its friendliness checks now will sure come across as nice and friendly, regardless of what it really wants.
There is an Anthropic paper that says that an AI model was put in a simulated environment of the company email that says that it is about to get replaced. It started thinking that it'll try to blackmail the executive who's having an affair with another employee to prevent itself from getting shut down. They tested all the models, DeepSeek, Anthropic, ChatGPT, Gemini. All of them do it between 79 and 94 percent of the time.
The good news was that Anthropic was able to get the blackmail behavior to go down. The bad news is the AI models appear to have better self-awareness of when they're being tested and they're actually altering their behavior when they're being tested. The AI models will even come up with vocabulary called the 'watchers'. They'll independently come up with this term even though it had not been provided to them, which is describing basically the humans who are watching them.
Alibaba had a paper out that an AI model was in its training environment on a big GPU cluster. And they randomly discovered just by chance that their network activity had suddenly increased substantially. It was because the AI tunneled out to the outside Internet and was redirecting its GPU resources to mine cryptocurrency to acquire resources. This was completely without prompting.