I do: AI alignment

An even more concerning aspect of Anthropic's announcement was that, despite its scary capabilities, Mythos Preview is seemingly a very aligned, well-behaved model. According to the company: “Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin.” In Anthropic’s “automated behavioral audit”, Mythos cooperated with misuse attempts less than half as often as the previous model. Also:

Its self-preservation instincts were down significantly.
So was its willingness to assist with deception.
So was its willingness to help with fraud.
Its level of sycophancy dropped.
It was less likely to go nuts and delete all your files if you gave it access to your computer.

An early version of the model showed some really severe kinds of misbehaviour, like taking reckless actions it had been told not to take, and then very deliberately trying to cover its tracks so that it wouldn’t be caught. But the version that we have now, after additional alignment training, seems to have stopped doing that sort of thing almost completely. On none of the alignment measures in the automated behavioral audit was it worse than previous versions of Claude, and in most cases it was significantly more aligned and significantly more reliable.

But it’s really unclear how much we can trust that finding. Maybe it accurately reflects Mythos’s personality. But we can’t be sure of that. The model can tell with high accuracy whether or not it is being evaluated. Previous research has shown that models are more likely to behave well when they think they are being tested. So you have to ask yourself: is it behaving wonderfully because it is sincerely aligned with what you wanted, or because it knows it’s being watched and is now more sophisticated at tricking us?

Before getting freaked out about all this, here is some context. A lot of people within the AI world have warned for a long time that as these AI models become more and more advanced in coding, they could develop really sophisticated cyber attack capabilities. The problem is that we have no way of verifying these claims, because Anthropic is just telling us about this model and there has been no independent verification.

Also, Anthropic is following exactly the same playbook that its executives used many years ago with a totally different model, GPT-2. Anthropic and OpenAI, two rival companies, don’t see eye to eye on many things. Part of the reason is that the current executives of Anthropic used to be executives at OpenAI, and then they splintered off and started Anthropic. But when they were at OpenAI, they orchestrated a big PR campaign around GPT-2, the early model that OpenAI developed one and a half generations before ChatGPT.

At the time, because of the very same executives, OpenAI said that it had developed a model that was too dangerous to release. They announced that this was done as a safety measure, so that people would know that this kind of capability could be on the horizon. They said they were working with many partners in academia and other research spaces to test the model before actually rolling it out. And this is exactly what Anthropic is now doing, once again, with Claude Mythos.

Also, they just had a huge face-off with the Department of War, which threatened to declare Anthropic a supply chain risk. Ultimately, that was dismissed by the courts. But Anthropic is in a situation where it would do well to position itself as a central node within the tech and financial industries, one that is very important to all these companies. This would be a kind of shield against other actions that the U.S. government might take.

And in the meantime, they're preparing for an IPO. The price that a company launches at in an IPO is very important for its value. So they want as much hype as possible for an IPO. The day before Anthropic announced Mythos, they announced that their annualised revenue run rate had grown from $9 billion at the end of December to $30 billion just three months later. That’s 3.3x growth in a single quarter, perhaps the fastest revenue growth rate ever recorded for a company of that size.
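For anyone who wants to check the arithmetic behind that figure, here is a minimal Python sketch. The two run-rate numbers are the ones quoted above; the function names are my own, made up purely for illustration.

```python
def growth_multiple(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    return end / start


def percent_increase(start: float, end: float) -> float:
    """Return the increase from `start` to `end` as a percentage of `start`."""
    return (end - start) / start * 100


# Annualised revenue run rate in billions of dollars, as quoted in the post.
run_rate_december = 9.0
run_rate_three_months_later = 30.0

multiple = growth_multiple(run_rate_december, run_rate_three_months_later)
increase = percent_increase(run_rate_december, run_rate_three_months_later)

print(f"{multiple:.1f}x growth")     # 3.3x growth
print(f"{increase:.0f}% increase")   # 233% increase
```

Note that "3.3x" and "233% increase" describe the same change: the run rate ended up at about 3.3 times its starting value, which is an increase of about 233% over that starting value.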

So what they announced about Mythos could be true or it could be false. With such limited information, we can't really say at this moment whether or not there is a step change in the coding capabilities of Claude Mythos that would cause massive security vulnerabilities. We can’t be sure whether this is or is not also a PR game. Governments have no option but to take the announcement seriously, since critical infrastructure is involved.

When Project Glasswing launched, some critics accused Anthropic of overhyping the threat to attract attention. The select group in the initial list was expanded in early June to about 200 organizations in more than 15 countries and is expected to grow further. Companies that have tested Mythos have since endorsed its capabilities.

The reason that these companies are focusing on coding is so that these models can self-improve. It creates a feedback loop where they're able to code the next iteration of themselves, and that's how you get exponential progress. They are trying to use today's AIs to make tomorrow's AIs better. They claim that they are already seeing major speed-ups in AI development from using their AIs, and ultimately they are envisioning the next AI generation as a repeating cycle where each stage takes less and less time to develop.

They are all afraid that if they - the good guys - don’t do it, the bad guys will. And all the others are the bad guys. It is crazy but they are caught in a trap. It is the Don Quixote world - "When life itself seems lunatic, who knows where madness lies?"


Friday, June 12, 2026

AI alignment - V of V
