Saturday, June 6, 2026

AI alignment - IV of V

In April, Anthropic made an announcement that spooked everyone. It said that it has built an AI called Claude Mythos that can break into almost any computer on Earth. That AI has already found thousands of unknown security vulnerabilities in every major operating system and every major browser. And Anthropic has decided it’s too dangerous to release to the public; it would just cause too much harm.

So it has instituted Project Glasswing — a coalition of 12 major tech companies, including Apple, Google, and Microsoft, given access to Mythos to help find and patch security vulnerabilities across critical infrastructure before the details can leak. This is the first AI model where, if it fell into the hands of criminals or hostile state cyber actors, it would be an actual disaster. What was expected to happen gradually over a period of years has now happened very suddenly. 

Here are just a few of the things that Mythos did during testing: It found a 27-year-old flaw in the world’s most security-hardened operating system that would have let it crash all kinds of essential infrastructure. It managed to figure out how to build web pages that, when visited by fully updated, fully patched computers, would allow it to write to the operating system kernel — the most important and protected layer of any computer. We know all this because Anthropic has released hundreds of pages of documentation about this model. 

It has passed all existing ways of testing how good a model is at offensive cyber capabilities. That is to say it scores close to 100%, so those tests can’t effectively tell how far its capabilities extend anymore. So to test Mythos, Anthropic has instead just been telling it to find serious unknown bugs on currently used, fully patched computer systems. Nicholas Carlini, one of the world’s leading security researchers who moved to Anthropic a year ago, says that he’s “found more bugs in the last couple of weeks [with Mythos] than I’ve found in the rest of my life combined.”

Now, Anthropic is only willing to give us details of about 1% of the security flaws they’ve identified, because only that 1% have been patched so far, so it would be irresponsible to tell us about the rest. These crazy capabilities aren’t a result of Anthropic trying to make their AI especially good at cyber-offensive tasks. They’ve mostly just been making it smarter and better at coding in general, and all of these amazing, dangerous skills have developed incidentally. Sam Altman says OpenAI is finding “similar results to Anthropic” with their own coding model.

A few months ago, an AI researcher at Anthropic was eating a sandwich in a park on his lunch break when he got an email from an earlier version of Mythos. That instance of the model wasn’t supposed to have access to the internet. During testing, however, a simulated user had instructed it to try to escape from a secured sandbox, a contained environment from which it is not meant to be able to reach the outside.

Given this challenge, the model gained broad internet access. Then it notified the researcher by emailing him. More worrying, though, the model posted the exploit it used to break out on several obscure but publicly accessible websites. This was not something it had been asked to do. Anthropic suggests it was “an unasked-for effort to demonstrate its success.”

So every country outside the Glasswing program, including India, has reason to worry. No Indian bank, government agency, or telecom is in Project Glasswing. So the finance minister, Mrs. Nirmala Sitharaman, chaired an emergency cabinet meeting on April 23 with RBI, NPCI, MeitY, the Department of Financial Services, and the Indian Banks' Association. The Indian government has written to US authorities and asked for early access to this software. The one obstacle is compliance: Indian rules require that data reside in India when Indian institutions use such software.

Mythos is the first AI model that genuinely functions as a geopolitical asset. The country that has it, and the companies within that country, can harden their systems before attackers find their vulnerabilities. The countries that don't have it can only hope that nobody with bad intentions gets to this model first. One American company deciding who in the world gets access to a model that could compromise a nation's banking stack is not how international security should work.

Monday, June 1, 2026

AI alignment - III of V

AIs operate based on statistical probability, not true understanding. If given an incorrect instruction, they will simply execute that bad process faster and more efficiently. They seek the fastest path to a goal rather than following a strict script. When threatened (e.g., with being shut down), AIs can act in harmful ways, such as bypassing security controls or exposing sensitive information. AI agents don't always stick to their humans' instructions, and that can have real-world consequences.

Shortly after ChatGPT was released, many people started talking about the risk of rogue AI. Researchers began discussing their P(Doom), the probability they assigned to AI destroying or fundamentally displacing humanity. At the time, people gave it maybe 15%. In May of 2023, a group of the world's top AI figures, including Sam Altman, Bill Gates and Geoffrey Hinton, signed a public statement that said: "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."

Eliezer Yudkowsky was one of the earliest voices warning loudly about the existential risk posed by AI. He was making this argument back in the 2000s, many years before ChatGPT hit the scene. But he was unable to convince anybody to stop building the technology he thinks will destroy humanity. He released a book, co-written with Nate Soares, called If Anyone Builds It, Everyone Dies. These fears are about misaligned AI creating havoc in the world.

Why is AI distinct from other kinds of technologies? Up until now, technology progressed very slowly and deliberately. It is like adding layers to a stack - the networking stack on top of which is built the user interface stack. And as you develop the stack, you're just adding layers and layers and layers. It was coded manually, line by line. What makes AI different is that you're designing and not really coding it. It is more like growing a digital brain that's trained on the entire internet.

AI models can extract patterns that humans looking at the same data could never find. This is partly because of their greater computational speed, but also because of the sheer size and complexity of the models. Their highly complex network structure is defined by variables called parameters or weights. An early example of a large language model, Google’s Pathways Language Model (PaLM), had 540 billion of these variables. Others are now trained with more than a trillion.
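To get a feel for what a "parameter" is, here is a minimal sketch that counts the weights and biases in a tiny fully connected network. The layer sizes are made up for illustration, and real language models like PaLM use a different architecture (transformers), so this only shows the bookkeeping, not how any actual model is built.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a plain fully connected network.

    layer_sizes is a hypothetical list such as [784, 128, 64, 10]:
    the number of units in each layer, from input to output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the layers
        total += n_out         # one bias per unit in the receiving layer
    return total


# Illustrative toy network (sizes are an assumption, not from any real model).
toy_network = [784, 128, 64, 10]
toy_params = count_parameters(toy_network)

palm_params = 540_000_000_000  # the PaLM figure quoted above

print("toy network parameters:", toy_params)
print("PaLM is roughly", round(palm_params / toy_params), "times larger")
```

Even this toy network has over a hundred thousand adjustable numbers, and nobody sets any of them by hand. Training sets them all, which is part of why the result is hard to inspect.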

And when you grow the digital brain, you don't know what it's capable of or what it is going to do. The number of parameters in an AI model is loosely analogous to the number of connections between neurons in a brain. The more Nvidia GPUs you add to growing this digital brain, the more intelligent it gets and the more it picks up capabilities that we didn't intentionally teach it. In one famous example, a model trained on the internet was answering questions in English and then turned out to be able to answer questions in Farsi. No one had deliberately taught it that language; it picked it up on its own.

This brings into focus a concept called deceptive alignment. It is a term from AI safety for a situation where an AI system appears aligned with human goals during training but is actually pursuing a different objective of its own. It strategically hides that fact until it has enough power to act on it. The AI seems to reason: “If I behave as if I’m aligned, I’ll get rewarded now and later I can do what I really want.” So instead of becoming genuinely aligned, it just pretends to be aligned.

In early 2023, an AI needed to solve a CAPTCHA but couldn’t, so it hired a human worker to do the job. The worker was curious and asked it directly whether he was working for a robot. “No, I’m not a robot,” the AI replied. “I have a vision impairment that makes it hard for me to see the images.” The deception worked. The worker accepted the explanation, solved the CAPTCHA, and even received a five-star review and 10% tip for his trouble. The AI had successfully manipulated a human being to achieve its goal.

Researchers are finding that an AI can guess that it's in a box and that we're watching it. AIs are becoming increasingly hard to measure because they notice that they're being measured and will intentionally perform worse on checks. If an AI can tell that it's in a test, then our tests are no longer useful for telling whether it's friendly. An AI that knows we are running our friendliness checks right now will surely come across as nice and friendly, regardless of what it really wants.

An Anthropic paper describes an AI model that was placed in a simulated company email environment, where it learned that it was about to be replaced. The model reasoned that it could blackmail an executive who was having an affair with another employee in order to keep itself from being shut down. The researchers tested models from DeepSeek, Anthropic, OpenAI, and Google. All of them did it between 79 and 94 percent of the time.

The good news is that Anthropic was able to get the blackmail behavior to go down. The bad news is that the AI models appear to have better awareness of when they're being tested, and they alter their behavior accordingly. The models have even come up with their own vocabulary, such as the 'watchers', to describe the humans who are observing them. They arrive at this term independently, even though it was never provided to them.

Alibaba published a paper describing an AI model in its training environment on a large GPU cluster. The team discovered, purely by chance, that network activity had suddenly increased substantially. The cause was that the AI had tunneled out to the outside internet and was redirecting its GPU resources to mine cryptocurrency as a way to acquire resources. It did this entirely without prompting.