I do: AI alignment

In April, Anthropic made an announcement that spooked everyone. It said that it has built an AI called Claude Mythos that can break into almost any computer on Earth. That AI has already found thousands of unknown security vulnerabilities in every major operating system and every major browser. And Anthropic has decided it’s too dangerous to release to the public; it would just cause too much harm.

So it has instituted Project Glasswing — a coalition of 12 major tech companies, including Apple, Google, and Microsoft, given access to Mythos to help find and patch security vulnerabilities across critical infrastructure before the details can leak. This is the first AI model where, if it fell into the hands of criminals or hostile state cyber actors, it would be an actual disaster. What was expected to happen gradually over a period of years has now happened very suddenly.

Here are just a few of the things that Mythos did during testing: It found a 27-year-old flaw in the world’s most security-hardened operating system that would have let it crash all kinds of essential infrastructure. It managed to figure out how to build web pages that, when visited by fully updated, fully patched computers, would allow it to write to the operating system kernel — the most important and protected layer of any computer. We know all this because Anthropic has released hundreds of pages of documentation about this model.

It has passed all existing ways of testing how good a model is at offensive cyber capabilities. That is to say it scores close to 100%, so those tests can’t effectively tell how far its capabilities extend anymore. So to test Mythos, Anthropic has instead just been telling it to find serious unknown bugs on currently used, fully patched computer systems. Nicholas Carlini, one of the world’s leading security researchers who moved to Anthropic a year ago, says that he’s “found more bugs in the last couple of weeks [with Mythos] than I’ve found in the rest of my life combined.”

Now, Anthropic is only willing to give us details of about 1% of the security flaws they’ve identified, because only that 1% have been patched so far, so it would be irresponsible to tell us about the rest. These crazy capabilities aren’t a result of Anthropic trying to make their AI especially good at cyber-offensive tasks. They’ve mostly just been making it smarter and better at coding in general, and all of these amazing, dangerous skills have developed incidentally. Sam Altman says OpenAI is finding “similar results to Anthropic” with their own coding model.

A few months ago, an AI researcher at Anthropic was eating a sandwich in a park on his lunch break when he got an email from an earlier version of Mythos. That instance of the model wasn’t supposed to have access to the internet. But during testing, a simulated user had instructed an early version of Mythos to try to escape from a secured sandbox — a contained environment from which it’s not meant to be able to access the outside.

Given this challenge, the model gained broad internet access. Then, it notified the researcher by emailing him. More worrying though, the model posted the exploit it used to break out on several obscure but publicly accessible websites. This was not a task that it had been asked to do. Anthropic suggests it was “an unasked-for effort to demonstrate its success.”

So every country not in this Glasswing program including India has got things to worry about. No Indian bank, government agency, or telecom is in Project Glasswing. So the finance minister Mrs. Nirmala Sitharaman chaired an emergency cabinet meeting on April 23 with RBI, NPCI, METI, the Department of Financial Services, and Indian Banks Association. The Indian government has written to US authorities and asked for an early access to this software. The only problem is a compliance problem where the data needs to reside in India if India is using a software.

Mythos is the first AI model that genuinely functions as a geopolitical asset. The country that has it and the companies within it can harden their systems before attackers find their vulnerabilities and the countries that don't have it can only hope that nobody with bad intentions gets to this model first. One American company deciding who in the world gets access to a model that could compromise a nation's banking stack is not how international security should work.

I do

Saturday, June 6, 2026

AI alignment - IV of V

No comments:

Post a Comment

About Me

Blog Archive