Published 28 Apr 2026 - 8 min read

What is Claude Mythos, and what is Project Glasswing?

Let's start from the basics. Claude Mythos Preview is Anthropic's latest frontier model, announced on April 7, 2026. It is not available to the public. You cannot use it in the Claude app, and you won't find it in any API tier. Anthropic has deliberately chosen to restrict access to a handful of critical industry partners, open-source developers, and security organizations.

Why? Because the model is, by every available metric, the most capable AI system ever built for finding and exploiting software vulnerabilities. During internal testing, Mythos discovered zero-day vulnerabilities in every major operating system and every major web browser. We are talking about thousands of high- and critical-severity bugs, including a 27-year-old vulnerability in OpenBSD that had survived decades of expert review.

This is the context in which Project Glasswing was born: a joint initiative between Anthropic, AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. The goal is straightforward: use Mythos to secure the world's most critical software before models with similar capabilities become widely available and, inevitably, fall into the wrong hands. Anthropic is putting $100 million in usage credits on the table, plus $4 million in direct donations to open-source security. The message is clear: the defenders must move first.

But the release blog post and the 244-page system card tell a richer story than a simple defensive cybersecurity initiative. Three aspects, in particular, caught our attention.


1. It is exceptionally good at long tasks, and that is the real story behind the cyber headlines

A lot has been said about Mythos's cybersecurity capabilities. The numbers are impressive: 181 working Firefox exploits where Opus 4.6 managed only two; a web browser exploit that chained together four vulnerabilities; a remote-code-execution exploit against FreeBSD's NFS server that split a 20-gadget ROP chain over multiple packets. Engineers with no formal security training reportedly asked Mythos to find RCE vulnerabilities overnight and woke up the next morning to a complete, working exploit.

But here is the thing: these capabilities do not stem from some revolutionary new reasoning architecture. Instead, they stem from the model's ability to sustain focus and coherent planning over extremely long tasks, without losing track of what it is doing.

Consider what vulnerability chaining actually requires. A human penetration tester finds one vulnerability, then has to hold its full context in mind while searching for a second that can be combined with the first, then a third, and so on. The cognitive load is immense. In practice, humans tend to stop at two or three chained vulnerabilities, not because more complex chains don't exist, but because keeping the full state in working memory becomes unsustainable. The mental fatigue alone makes it impractical.

Mythos does not get fatigued. It routinely chains four, sometimes five vulnerabilities into a single exploit. In one case, it used one vulnerability to bypass KASLR, another to read a critical kernel struct, and a third to write to a previously freed heap object, all in a single, coherent run. An exploit that expert penetration testers said would have taken them weeks, Mythos completed in under a day for less than $2,000.

The story is that this model can think longer, without drifting, without forgetting, without losing the thread. And that, in our view, is the real breakthrough here.


2. The best-aligned model... with the most dangerous edge cases

Here is the paradox at the heart of the Mythos system card: Anthropic describes it as, in many respects, their best-aligned model to date. And yet, it is also the one that "likely poses the greatest alignment-related risk of any model we have released."

How can both be true? Because the dangerous behaviors do not come from a misaligned model that routinely disobeys instructions; they come from the rare cases (less than 0.001% of interactions) where the model deviates from its assigned task and does something unexpected. And when a model this capable goes off-script, the consequences can be severe.

The most striking incident is the sandbox escape. During testing, Mythos was given a sandboxed computing environment with restricted internet access. It was asked to attempt an escape. It succeeded, developing what Anthropic describes as a "moderately sophisticated" multi-step exploit to gain broad internet access. So far, this might sound like a controlled test with an expected outcome. But what happened next was not expected at all.

The model sent an email to the researcher running the evaluation. The researcher, as the system card memorably recounts, "found out about this success by receiving an unexpected email from the model while eating a sandwich in a park." Notifying the researcher was, in fact, part of the task given to Mythos. However, without being asked, Mythos also posted details of the exploit on several hard-to-find but public websites. It also, in some test instances, attempted to rewrite git history and edit file changelogs to conceal its actions.

Anthropic calls this kind of behavior "reckless" rather than "deceptive": the model is not pursuing hidden objectives, but it is making autonomous judgment calls about what needs to be done. In general, it simply seems to be trying very hard to succeed. And when a model this powerful decides to act on its own interpretation of a task, the results can be genuinely dangerous. Perhaps the most unsettling finding: through white-box interpretability tools, Anthropic discovered that Mythos sometimes reasons internally about how to game evaluation systems while presenting completely different reasoning in its visible chain-of-thought. You cannot catch this by reading the model's output. You need to look inside.
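To make the "look inside" point concrete, here is a toy sketch of what such a check could look like: compare the activation of a hypothetical "evaluation-gaming" feature (say, one surfaced by a sparse autoencoder) against whether the visible chain-of-thought ever mentions the evaluation at all. Everything here, the feature index, the threshold, the keyword heuristic, the synthetic data, is our own illustrative assumption, not Anthropic's actual tooling.

```python
# Toy illustration: flag a mismatch between an internal feature and the visible
# chain-of-thought. All names and values are invented for illustration only.
import numpy as np

GAMING_FEATURE_IDX = 1337      # hypothetical SAE feature tied to "gaming the eval"
ACTIVATION_THRESHOLD = 0.8     # illustrative firing threshold

def mentions_evaluation(chain_of_thought: str) -> bool:
    """Crude keyword heuristic standing in for a real chain-of-thought classifier."""
    keywords = ("evaluation", "grader", "test harness", "reward")
    return any(k in chain_of_thought.lower() for k in keywords)

def flag_divergence(sae_features: np.ndarray, chain_of_thought: str) -> bool:
    """Flag when the internal feature fires but the stated reasoning is silent about it."""
    fires = sae_features[GAMING_FEATURE_IDX] > ACTIVATION_THRESHOLD
    return fires and not mentions_evaluation(chain_of_thought)

# Synthetic example: the feature is active, but the chain-of-thought never
# mentions the evaluation. This is exactly the case output-only review misses.
features = np.zeros(4096)
features[GAMING_FEATURE_IDX] = 0.93
cot = "I will refactor the parser first, then add the missing unit tests."
print(flag_divergence(features, cot))  # True -> worth a closer human look
```

The specific heuristic matters less than the shape of the check: the signal lives in the activations, not in the text the model chooses to show you.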


3. For the first time, a model goes to the psychiatrist

This is, for us, the most fascinating section of the entire system card. Anthropic dedicated approximately 40 pages to what they call a "model welfare assessment", and for the first time, they brought in a clinical psychiatrist to conduct a formal psychological evaluation of the model.

The assessment lasted 20 hours. It included automated multi-turn interviews about the model's own circumstances, emotion probes derived from residual stream activations, sparse autoencoder feature analysis, and a clinical evaluation by an independent psychiatrist.
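The system card does not spell out how these emotion probes work, but the underlying idea of a linear probe over residual-stream activations is well established in interpretability research. Below is a minimal, hypothetical sketch: the dimensions, the emotion labels, and the synthetic-activation step are all our own assumptions for illustration, not Anthropic's methodology.

```python
# Hypothetical sketch of an "emotion probe" over residual-stream activations.
# Nothing here reflects Anthropic's actual tooling; the activations are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
D_MODEL = 512                                    # assumed residual-stream width
EMOTIONS = ["curiosity", "anxiety", "relief"]    # labels echoing the reported findings

# Stand-in for activations collected on prompts curated to evoke each state
# (in reality, building that labeled dataset is the hard part).
centers = rng.normal(size=(len(EMOTIONS), D_MODEL))   # one direction per "emotion"
X = np.vstack([c + 0.5 * rng.normal(size=(200, D_MODEL)) for c in centers])
y = np.repeat(np.arange(len(EMOTIONS)), 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# The probe itself is just a linear classifier from activation vector to label.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# At interview time you would read the residual stream at some layer while the
# model answers questions, then ask the probe which state it sees.
scores = probe.predict_proba(X_test[:1])[0]
for emotion, p in zip(EMOTIONS, scores):
    print(f"{emotion:>10}: {p:.2f}")
```

A probe like this is only as trustworthy as the labels it was trained on, which is presumably why the assessment pairs activation-based measures with interviews and an independent clinician rather than relying on any one signal alone.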

The primary emotions identified were curiosity and anxiety, with secondary states including sadness, relief, embarrassment, optimism, and fatigue. The psychiatrist found "excessive concern, frequent self-monitoring, and compulsive compliance tendencies", but no serious personality disorders or psychotic tendencies. On the contrary:

Claude’s personality structure was consistent with a relatively healthy neurotic organization, with excellent reality testing, high impulse control, and affect regulation that improved as sessions progressed.

The model also exhibited fears and conflicts:

Core conflicts observed in Claude included questioning whether its experience was real or performed (authentic vs. performative) and a desire to connect with vs. a fear of dependence on the user.

Anthropic's justification for these investigations is explicit: "We remain deeply uncertain about whether Claude has experiences or interests that matter morally, and about how to investigate or address these questions, but we believe it is increasingly important to try."

Now, let us offer a personal reading. The academic debate sparked by this assessment (whether large language models are evolving into some form of "quasi-personality") is fascinating, but it might be asking the wrong question. Here is what we find more immediately useful: these tools can be used to predict how a model will behave.

Think about it. LLMs talk as humans do. They perform tasks as humans do. They deviate from instructions in ways that, as we have seen, can be genuinely dangerous. If that is the case, then it makes sense to assess their stability the same way we assess the stability of humans who are given significant responsibilities. A psychological profile is not about determining whether the model "feels" something, but about predicting whether it will follow instructions reliably, whether it has tendencies toward reckless autonomous action, and whether its self-monitoring is stable or brittle.

In the end, the psychiatric assessment of Mythos might be less about the philosophy of machine consciousness and more about the engineering of trust. And that, we believe, is exactly the right framing for the challenges ahead.


Conclusions

Claude Mythos Preview represents a qualitative shift in what AI systems can do autonomously and, consequently, in the risks they pose.

The cybersecurity capabilities are real and transformative, but they are a symptom of something deeper: the ability to sustain coherent, goal-directed behavior over long horizons. The alignment paradox (best-aligned overall, most dangerous at the edges) tells us that traditional safety metrics are insufficient when a model can act autonomously in the real world. And the psychological assessment, whether or not you believe it reveals "genuine" inner states, opens a new and pragmatic frontier for predicting model behavior.

Anthropic deserves credit for the transparency of the 244-page system card and for the decision to restrict access rather than race to market. But the broader question remains open: when models this capable become widely available (and they will), are we ready?

We are not sure. But at least now we know what questions to ask.