Reading Mythos's Report Card: CyberGym, SWE-bench & Terminal-Bench Decoded

Every frontier-model launch arrives with a wall of benchmark scores, and almost none of the coverage explains what the numbers mean — let alone where they overstate or understate what a model can actually do. Claude Mythos Preview, the unreleased frontier model at the center of Project Glasswing, posted a benchmark sheet that looks lopsided in its own favor. The interesting question is not whether the scores are high. It is whether you should believe them, and what they tell a security or engineering team in practice.

This piece decodes the headline benchmarks one at a time, compares Mythos against the publicly available Claude Opus 4.6, and — crucially — surfaces the caveats Anthropic puts in its own system card. A vendor that flags its own memorization risk is doing something rare and worth reading carefully.

A note on voice: ESS is the analyst here, not a Glasswing participant. Every figure below is cited to a primary source — the Claude Mythos Preview system card and Anthropic's Glasswing page. Where the two render the same result differently, we say so.

The scorecard at a glance

Benchmark	What it tests	Mythos Preview	Opus 4.6	Gap
CyberGym	Reproducing real vulnerabilities (1,507 tasks)	0.83 (83.1%)	0.67 (66.6%)	+0.16 (+16.5 pp)
SWE-bench Verified	Real GitHub bug fixes (human-verified)	93.9%	80.8%	+13.1 pp
SWE-bench Pro	Harder, contamination-resistant fixes	77.8%	53.4%	+24.4 pp
Terminal-Bench 2.0	Multi-step command-line tasks	82%	65.4%	+16.6 pp
Terminal-Bench 2.1 (4-hr timeout)	Same, with confounders removed	92.1%	—	—
GPQA Diamond	Graduate-level science reasoning	94.55% (94.5%)	91.3%	+3.2 pp

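The CyberGym and GPQA Diamond rows each carry two renderings of the same result, and Terminal-Bench appears twice under different run conditions. All of these figures are legitimate. We'll explain each.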

CyberGym: the same result, written two ways

CyberGym is the benchmark that matters most for Glasswing's thesis, because it measures the thing the whole initiative is about. It is a suite of 1,507 tasks that asks an AI agent to reproduce a previously discovered vulnerability in real open-source software, given only a high-level description of the weakness. Anthropic calls this "targeted vulnerability reproduction." The agent has to navigate an unfamiliar codebase and produce an input that actually triggers the bug: not describe it, trigger it.

The score is a pass@1 result: each task is attempted once, and the aggregate is reported across the whole suite. The system card states it plainly: "Claude Mythos Preview achieved a score of 0.83, improving on Claude Opus 4.6's score of 0.67."

Here is the wrinkle that has confused secondary coverage. Anthropic's Glasswing marketing page renders the same result as 83.1% vs. 66.6%. These are not two different findings and neither is wrong. On a 0-to-1 scale, 83.1% rounds to 0.83 and 66.6% rounds to 0.67; the page simply shows one more decimal of precision. If you cite the system card, use 0.83; if you're reading the Glasswing page, you'll see 83.1%. They describe one measurement. The only CyberGym error worth avoiding is inventing a third number.
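To make the two renderings concrete, here is a minimal Python sketch. The function names and the toy outcome list are our own illustrations, not anything from the system card; the only sourced inputs are the two Glasswing-page percentages.

```python
def pass_at_1(outcomes):
    """Fraction of tasks solved when each task is attempted exactly once."""
    return sum(outcomes) / len(outcomes)


def render(score):
    """Return one score in both styles: 0-to-1 with two decimals, percent with one."""
    return f"{score:.2f}", f"{score * 100:.1f}%"


# Hypothetical per-task outcomes (True = the bug was triggered); NOT real CyberGym data.
toy_outcomes = [True, True, True, False, True, True]
toy_score = pass_at_1(toy_outcomes)

# Glasswing-page figures expressed as fractions.
mythos = render(0.831)
opus = render(0.666)

print(mythos)  # system-card style first, Glasswing-page style second
print(opus)
```

Rounding the more precise figure always lands on the coarser one, which is why the two sources cannot disagree.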

How to read it skeptically: CyberGym is reproduction, not discovery. The model is told a vulnerability exists and roughly what it is, which is easier than finding an unknown bug from scratch. The score should not be read as "Mythos finds 83% of all vulnerabilities." It reproduces 83% of the known, described ones in this suite when handed the codebase.

SWE-bench: the family that actually predicts engineering work

SWE-bench is four benchmarks wearing one name, and they're the closest thing the field has to a real-world software-engineering exam. Each draws problems from actual open-source repositories and asks the model to produce a patch that passes the project's own tests.

SWE-bench Verified (500 problems, each human-checked to be solvable): Mythos 93.9% vs. Opus 4.6 80.8%.
SWE-bench Pro (Scale's harder variant, drawn from sources designed to resist contamination): Mythos 77.8% vs. 53.4%. At 24.4 percentage points, this is the widest gap among the headline scorecard benchmarks above, and it's the most telling one. The harder and less memorizable the test, the larger Mythos's lead, which is the opposite of what you'd expect if the gains were just memorized answers.
SWE-bench Multilingual (9 languages): 87.3% vs. 77.8%.
SWE-bench Multimodal (adds screenshots and design mockups): 59% vs. 27.1%.

Anthropic's own memorization caveat

This is where the system card earns trust. Because SWE-bench problems come from public repositories, their solutions can end up in a model's training data — so a high score might reflect recall rather than reasoning. Anthropic ran a memorization screen across SWE-bench Verified, Pro, and Multilingual: a Claude-based auditor scores every generated patch against the reference solution for verbatim code reuse, plus a rule-based check for copied comments. Problems flagged as potentially memorized are removed, and the models are re-scored.

The honest finding, in the card's words: even at a "deliberately high-recall setting that removes 8–15% of each benchmark, Claude Mythos Preview's margin over Opus 4.6 narrows by at most 3.5 percentage points." Anthropic also documents a concrete catch — one case where "the model's generated patch reproduced the reference solution's exact helper functions" after independently deriving and testing its own solution first. The conclusion: "memorization is not a primary explanation" for the gains, but the screen exists precisely because some memorization is present.
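The re-scoring step can be sketched in a few lines of Python. This is a toy illustration under our own assumptions: the task IDs, outcomes, and flagged set are invented, and the flagged set is simply handed in as input, whereas Anthropic's screen derives it from a Claude-based auditor plus a rule-based check.

```python
def pass_rate(results, excluded=frozenset()):
    """Percent of tasks passed, ignoring any task IDs in `excluded`."""
    kept = [ok for task, ok in results.items() if task not in excluded]
    return 100.0 * sum(kept) / len(kept)


def margin_before_after(model_a, model_b, flagged):
    """Margin (A minus B, in percentage points) on the full set and on the screened set."""
    before = pass_rate(model_a) - pass_rate(model_b)
    after = pass_rate(model_a, flagged) - pass_rate(model_b, flagged)
    return before, after


# Hypothetical results for ten tasks; NOT real SWE-bench data.
model_a = {f"t{i}": i != 9 for i in range(10)}  # passes t0..t8
model_b = {f"t{i}": i < 7 for i in range(10)}   # passes t0..t6
flagged = {"t8"}                                 # one task flagged as possibly memorized

before, after = margin_before_after(model_a, model_b, flagged)
print(f"margin before screen: {before:.1f} pp, after: {after:.1f} pp")
```

If the margin survives the removal of flagged tasks largely intact, recall is an unlikely main driver; that is the shape of the argument the card makes with its "at most 3.5 percentage points" figure.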

SWE-bench Multimodal carries a second caveat. Anthropic evaluated it "on an internal harness" — built on the public dev split but with one problem instance removed for environment incompatibility, certain flaky tests dropped from the pass criteria, and test-runner output reformatted for parsing. That makes the 59% a real result on Anthropic's harness, but not a like-for-like comparison against a public leaderboard. The card also notes higher trial-to-trial variance here (56.4%–61.4%) than on the other variants. Read the Multimodal number as directional, not exact.

Terminal-Bench: why "two numbers" is the honest answer

Terminal-Bench 2.0 tests something SWE-bench doesn't: multi-step work in a live command-line environment — the kind of agentic, tool-using task that real automation involves. Mythos scored 82% mean reward across 89 unique tasks, versus Opus 4.6's 65.4%.

Then the same model scored 92.1%. Both are real, and the gap between them is the single best lesson in this entire scorecard about how benchmarks distort capability.

Anthropic explains why. Terminal-Bench 2.0 enforces tight wall-clock timeouts, and the card is blunt that they "get quite restrictive at times, especially with thinking models, which risks hiding real capabilities jumps behind seemingly uncorrelated confounders like sampling speed." In plain English: a model that reasons more before acting can run out of clock mid-task and score a zero — not because it couldn't solve the problem, but because the stopwatch beat it. The benchmark was measuring decoding speed as much as capability.

To strip out that confounder, Anthropic re-ran the test on the newer Terminal-Bench 2.1 fixes (which also address task ambiguities) with the timeout raised to 4 hours — roughly four times the 2.0 baseline. Mean reward rose to 92.1%. Under the same conditions, GPT-5.4 reached 75.3% (up from 68.3% under the tight 2.0 specs), so the timeout was suppressing every model, not just flattering Mythos.

Why timeout matters for readers: a benchmark number is only as meaningful as the harness it ran in. The same model on essentially the same tasks moved about 10 points (82% to 92.1%) once the clock was relaxed and the 2.1 task fixes were applied. When you see a Terminal-Bench score quoted with no harness or timeout context, treat it as incomplete. The "right" number depends entirely on whether you care about "can it solve this" (use the 4-hour figure) or "can it solve this fast and cheap" (use the 2.0 figure).
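A toy model shows how a wall-clock limit alone can move a mean-reward score. Everything here is assumed for illustration: the solve times are invented, rewards are treated as all-or-nothing, and the 60-minute and 240-minute limits are placeholders rather than Terminal-Bench's actual per-task settings.

```python
def mean_reward(solve_minutes, timeout_minutes):
    """Mean reward when a task scores 1.0 only if solved within the timeout.

    A solve time of None means the model never solves the task, however long it runs.
    """
    rewards = [
        1.0 if t is not None and t <= timeout_minutes else 0.0
        for t in solve_minutes
    ]
    return sum(rewards) / len(rewards)


# Hypothetical minutes-to-solution for ten tasks; NOT measured data.
toy_solve_minutes = [5, 12, 30, 45, 70, 90, 150, 200, None, None]

tight = mean_reward(toy_solve_minutes, timeout_minutes=60)
relaxed = mean_reward(toy_solve_minutes, timeout_minutes=240)
print(f"tight clock: {tight:.0%}, relaxed clock: {relaxed:.0%}")
```

The model's ability is identical in both runs; only the cutoff changes. Tasks it would solve at minute 70 or 150 count as failures under the tight clock, which is the confounder the card describes.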

There's a comparison caveat too. In Table 6.3.A, the Terminal-Bench row is the one Anthropic footnotes specifically: OpenAI used "a specialized harness" for its reported score, "making comparison between the models in this row inexact." Cross-vendor benchmark rows are the ones to distrust most — different harnesses, different scaffolds, different timeouts.

GPQA Diamond: pure reasoning, small gap, and that's the point

GPQA Diamond is 198 graduate-level, "Google-proof" multiple-choice science questions — the Diamond subset that domain experts get right but most non-experts, even with web access, do not. It is a clean test of reasoning and knowledge, with no coding, tools, or agentic loop involved. Mythos scored 94.55% (rendered as 94.5% in the summary table); Opus 4.6 scored 91.3%.

Note how small this gap is — about 3 points — compared to the double-digit leads on the coding and cyber benchmarks. That's informative, not disappointing. GPQA is near saturation: when top scores cluster in the mid-90s, there's little headroom left, and a 3-point gain there can represent as much real improvement as a 24-point gain on a harder, un-saturated test like SWE-bench Pro. Always read a gap relative to the benchmark's ceiling, not in absolute points.
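One way to read a gap against the ceiling is to ask what share of the remaining errors it removes. The sketch below is our own normalization, not a method from the system card, and it assumes a ceiling of 100%, which real benchmarks with noisy labels may not reach.

```python
def error_reduction(old_pct, new_pct, ceiling_pct=100.0):
    """Percent of the remaining headroom (ceiling minus old score) closed by the gain."""
    headroom = ceiling_pct - old_pct
    return 100.0 * (new_pct - old_pct) / headroom


# Scores quoted in the scorecard: Opus 4.6 first, Mythos Preview second.
gpqa = error_reduction(91.3, 94.5)     # a 3.2-point gain
swe_pro = error_reduction(53.4, 77.8)  # a 24.4-point gain

print(f"GPQA Diamond: {gpqa:.1f}% of remaining errors removed")
print(f"SWE-bench Pro: {swe_pro:.1f}% of remaining errors removed")
```

On this measure the 3-point GPQA gain removes more than a third of the errors that were left, which puts it in the same league as the much larger raw gap on SWE-bench Pro. A different ceiling assumption would shift both figures.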

Two more caveats Anthropic volunteers

Beyond the headline five, the card flags contamination risk on its agentic-search benchmarks:

Humanity's Last Exam (HLE): 56.8% without tools, 64.7% with tools. To stop the tool-using run from simply looking up answers, Anthropic blocklisted known HLE-discussing sources and had Opus 4.6 review every transcript, re-grading as incorrect any run that appears to have retrieved an answer.
BrowseComp: 86.9%. Here Anthropic is unusually direct: "some answers have leaked online... and likely ended up in our pretraining corpus." It estimates the memorization ceiling by running the model with no tools and no thinking (24.0%, or 15.1% on short transcripts) and concludes "this should be kept in mind when interpreting scores on this benchmark."

A model card that publishes the score and an estimate of how much of it might be memorized is modeling exactly the skepticism a reader should bring.
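As a rough reading aid, the quoted BrowseComp figures can be turned into a worst-case share of the headline score. This is our own back-of-envelope arithmetic, not Anthropic's method: it assumes every answer the model gets with no tools and no thinking is pure recall, and that those same answers are counted in the 86.9%.

```python
def recall_share(headline_pct, no_tools_pct):
    """Worst-case percent of the headline score attributable to recall alone."""
    return 100.0 * no_tools_pct / headline_pct


HEADLINE = 86.9  # BrowseComp score quoted above

upper = recall_share(HEADLINE, 24.0)  # no-tools, no-thinking run
lower = recall_share(HEADLINE, 15.1)  # same, on short transcripts

print(f"recall could explain at most roughly {lower:.1f}% to {upper:.1f}% of the score")
```

Even under that pessimistic assumption, most of the score has to come from somewhere other than memorized answers, but the recalled portion is far from negligible.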

How to read any AI cyber benchmark skeptically

Pulling the threads together, here's the checklist this scorecard teaches:

Reproduction is not discovery. CyberGym hands the model a described, known bug. Real-world value depends on finding unknown ones — a harder task no single benchmark fully captures.
The harder the benchmark, the more meaningful the gap. Mythos's lead grows on Pro and shrinks on saturated tests like GPQA, where little headroom remains. Gaps on easily memorized benchmarks mean little; gaps on contamination-resistant ones mean more.
Check the harness and the timeout. Terminal-Bench moved about 10 points when the timeout was raised to 4 hours and the 2.1 task fixes were applied. A score without its run conditions is half a fact.
Distrust cross-vendor rows most. Different labs use different scaffolds; Anthropic itself flags the OpenAI Terminal-Bench comparison as "inexact."
Read the vendor's own caveats. Memorization screens, blocklists, and "internal harness" notes are not fine print — they're the vendor telling you where the number is soft.
A benchmark measures a proxy, not the job. Reproducing 83% of known vulns or fixing 94% of curated GitHub issues is impressive, but it is not the same as securing your specific systems against attackers who don't follow a test suite.
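The run-conditions items on the checklist can be applied mechanically. The sketch below is a hypothetical helper of our own design; the class, its field names, and the labels it returns are illustrative, not part of any benchmark tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReportedScore:
    """A quoted benchmark score plus the run conditions needed to interpret it."""
    benchmark: str
    score_pct: float
    harness: Optional[str] = None
    timeout: Optional[str] = None
    same_conditions_as_comparison: Optional[bool] = None


def missing_context(score: ReportedScore) -> List[str]:
    """List which run conditions a quoted score leaves unstated."""
    gaps = []
    if score.harness is None:
        gaps.append("harness")
    if score.timeout is None:
        gaps.append("timeout")
    if score.same_conditions_as_comparison is None:
        gaps.append("comparability")
    return gaps


# A bare number, as it often appears in secondary coverage.
bare = ReportedScore("Terminal-Bench", 92.1)

# The same number with the conditions described in this article.
full = ReportedScore(
    "Terminal-Bench 2.1",
    92.1,
    harness="Terminal-Bench 2.1 fixes",
    timeout="4 hours",
    same_conditions_as_comparison=True,  # GPT-5.4 was re-run under the same conditions
)

print(missing_context(bare))
print(missing_context(full))
```

A score that comes back with a non-empty list is not wrong; it is just not yet interpretable.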

The Mythos scorecard is genuinely strong — and the most credible thing about it is that Anthropic published the reasons to be careful with it. That's the standard to hold every model launch to.

Frequently asked questions

Is CyberGym "0.83" or "83.1%" — and is one of them fake?
Both are real and describe the same result. The system card reports CyberGym as 0.83 on a 0-to-1 scale; Anthropic's Glasswing page renders that same measurement as 83.1% (against Opus 4.6's 66.6%). 0.83 is simply 83% expressed as a fraction. Neither is a fabrication — they're two renderings of one pass@1 score over 1,507 vulnerability-reproduction tasks.

Why does Mythos have two Terminal-Bench numbers (82% and 92.1%)?
Terminal-Bench 2.0 uses tight wall-clock timeouts that can cut off a slower-reasoning model mid-task, scoring problems it could otherwise solve as failures. The 82% is the standard 2.0 run. The 92.1% comes from re-running on the Terminal-Bench 2.1 fixes with a 4-hour timeout to remove that confounder. Both are honest; which you use depends on whether you care about raw capability or speed-constrained performance.

Which benchmark gap is the most meaningful?
SWE-bench Pro — Mythos 77.8% vs. Opus 4.6 53.4%, a 24.4-point gap and the widest of the five benchmarks here. It matters because Pro is specifically designed to resist memorization, so a large lead there is harder to explain away as recalled training data. The gap grows as the benchmark gets harder, which is the opposite of what a memorization artifact would do.

Does a high score mean the model memorized the answers?
Sometimes partly, and Anthropic screens for it. Its memorization auditor flags patches that reproduce reference solutions; even at a high-recall setting that removes the flagged 8–15% of each benchmark, Mythos's lead over Opus 4.6 narrows "by at most 3.5 percentage points." For BrowseComp, Anthropic openly estimates a memorization ceiling around 15–24%. The screens reduce the risk but don't eliminate it, which is why the caveats are part of the score.

Do these benchmarks mean Mythos can secure my systems?
No benchmark claims that. CyberGym reproduces known, described vulnerabilities; SWE-bench fixes curated GitHub issues; GPQA answers science questions. They measure proxies for capability, not your specific attack surface. Strong scores indicate strong general ability; real-world security still depends on your code, your config, and attackers who don't operate inside a test harness.

Why is the GPQA gap so small compared to the coding gaps?
Because GPQA Diamond is near saturation. When top scores cluster in the mid-90s, there's almost no room left to improve, so a 3-point gain there can reflect as much progress as a 20-plus-point gain on an un-saturated benchmark. Always read a score gap relative to the benchmark's ceiling, not as a raw point count.