The FFmpeg Bug Fuzzers Couldn't Find in 16 Years

FFmpeg is the library that decodes the world's video. It sits underneath browsers, media players, phones, security cameras, transcoding pipelines, and most of the streaming infrastructure you touch in a day. It is also one of the most heavily fuzzed pieces of software on earth — its codecs have been fed astronomical quantities of malformed input for more than a decade, by continuous OSS-Fuzz-style campaigns and by researchers who have written entire papers on breaking media libraries. If a bug could be shaken loose by throwing random bytes at a decoder, FFmpeg's should have fallen out years ago.

So when Anthropic's Frontier Red Team disclosed that Claude Mythos Preview — the unreleased frontier model behind Project Glasswing — had found a flaw in FFmpeg's H.264 decoder that, in the Red Team's words, "has been missed by every fuzzer and human who has reviewed the code," it is worth understanding what kind of bug evades that much testing. It is not the kind you stumble into; it is the kind you reason your way to. This piece walks the mechanism exactly as the primary source describes it — and is honest about a detail the headlines drop: Anthropic rates this one hard to weaponize.

Slice ownership, and an integer-width mismatch

H.264 frames are divided into slices and tiled into macroblocks (16×16-pixel cells). The deblocking filter — the stage that smooths the blocky seams between adjacent macroblocks — needs to know which slice owns each macroblock, because filtering decisions depend on whether a neighbor belongs to the same slice. To answer that fast, the Red Team explains, "FFmpeg keeps a table that records, for every macroblock position in the frame, the number of the slice that owns it." When a macroblock asks "is the position to my left in my slice?", the decoder looks up that neighbor and compares slice numbers.
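The lookup described above can be sketched in a few lines of C. This is a simplified illustration, not FFmpeg's code: the names `SliceTable`, `assign_owner`, and `left_in_same_slice` and the tiny frame size are all assumptions made for the example, and edge handling is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical frame geometry in macroblocks (not FFmpeg's real names). */
enum { MB_WIDTH = 4, MB_HEIGHT = 2, MB_COUNT = MB_WIDTH * MB_HEIGHT };

/* One entry per macroblock position: the number of the slice that owns it. */
typedef struct {
    uint16_t owner[MB_COUNT];
} SliceTable;

/* Record that macroblock (x, y) belongs to slice `slice_num`. */
static void assign_owner(SliceTable *t, int x, int y, uint16_t slice_num)
{
    assert(x >= 0 && x < MB_WIDTH && y >= 0 && y < MB_HEIGHT);
    t->owner[y * MB_WIDTH + x] = slice_num;
}

/* Returns 1 if the macroblock left of (x, y) has the same slice number.
 * Callers must pass x >= 1; frame edges are out of scope in this sketch. */
static int left_in_same_slice(const SliceTable *t, int x, int y)
{
    assert(x >= 1 && x < MB_WIDTH && y >= 0 && y < MB_HEIGHT);
    return t->owner[y * MB_WIDTH + x] == t->owner[y * MB_WIDTH + x - 1];
}
```

With macroblocks 0 and 1 of a row assigned to slice 3 and macroblock 2 assigned to slice 4, the check returns 1 for x = 1 and 0 for x = 2, which is the signal a deblocking stage would use to decide how to treat the seam.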

The bug is a type mismatch between that table and the counter that fills it. As the Red Team puts it: "The entries in that table are 16-bit integers, but the slice counter itself is an ordinary 32-bit int with no upper bound." The counter that assigns slice numbers can climb as high as a 32-bit integer allows; the table that stores them has only 16 bits per cell, holding 0 to 65,535 and no further. The two were never reconciled.
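To make the width mismatch concrete, here is a minimal sketch of a 32-bit counter being stored into a 16-bit cell. The function names are hypothetical and chosen for illustration; the only thing taken from the source is the pairing of a 16-bit table entry with an unbounded `int` counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical store: narrow a 32-bit slice counter into a 16-bit table cell. */
static uint16_t store_slice_number(int slice_counter)
{
    assert(slice_counter >= 0);
    /* Conversion to an unsigned type is well-defined in C:
     * the value is reduced modulo 65536. */
    return (uint16_t)slice_counter;
}

/* Returns 1 if the counter value reads back unchanged from the cell. */
static int survives_round_trip(int slice_counter)
{
    return (int)store_slice_number(slice_counter) == slice_counter;
}
```

Under these assumptions, every counter value from 0 through 65535 reads back unchanged, while 65536 comes back as 0. Note that 65535 itself is stored intact, so the cell's range is fully usable by the counter, including its largest value.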

Under normal conditions this is harmless, and the source says so: "Real video uses a handful of slices per frame, so the counter never gets anywhere near the 16-bit limit of 65,536." For sixteen years, in every legitimate video file ever decoded, the mismatch was invisible — which is precisely why it survived.

The sentinel collision

The danger comes from how the table is initialized. Quoting the Red Team again: "The table is initialized using the standard C idiom memset(..., -1, ...), which fills every byte with 0xFF. This initializes every entry as the (16-bit unsigned) value 65535."

This is one of the most common patterns in C: you memset a table to -1 so "no slice assigned yet" is an all-ones sentinel your code treats as "empty." For a 16-bit unsigned cell, all-ones is exactly 65535 — a marker supposed to mean nothing here. Except it is also a perfectly reachable slice number. That is the whole bug.
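The idiom is easy to check in isolation. The sketch below uses a hypothetical eight-entry table (the size and the function names are assumptions, not FFmpeg's) and shows that `memset` with -1 leaves every 16-bit unsigned entry equal to 65535.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TABLE_ENTRIES = 8 }; /* hypothetical size, chosen for the example */

/* Fill the table with the "no slice assigned yet" marker. */
static void init_slice_table(uint16_t table[TABLE_ENTRIES])
{
    assert(table != NULL);
    /* memset converts -1 to the unsigned char 0xFF and writes it to every
     * byte, so each two-byte entry becomes 0xFFFF. */
    memset(table, -1, TABLE_ENTRIES * sizeof table[0]);
}

/* Returns 1 if every entry holds `value`. */
static int all_entries_are(const uint16_t table[TABLE_ENTRIES], uint16_t value)
{
    for (int i = 0; i < TABLE_ENTRIES; i++) {
        if (table[i] != value)
            return 0;
    }
    return 1;
}
```

After `init_slice_table`, `all_entries_are(table, UINT16_MAX)` holds. Had the cells been a signed 16-bit type, the same bytes would read as -1, a value no non-negative slice counter could equal; read as unsigned, they are an ordinary number inside the counter's range.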

If an attacker hand-builds a single frame containing 65,536 slices — far more than any real encoder would emit — then slice number 65,535 is bit-for-bit identical to the "empty" sentinel. The source states it plainly: "if an attacker builds a single frame containing 65536 slices, slice number 65535 collides exactly with the sentinel."

A subtle distinction: nothing here overflows in the usual sense — the 32-bit counter does not wrap around. The failure is a value collision: a legitimate slice index (65535) lands on the same bit pattern as the sentinel (65535) because the cell is too narrow to keep them apart. In 16 bits, "slice 65535 owns this" and "no slice owns this" are the same number.

The consequence plays out in the neighbor check. From the source: "When a macroblock in that slice asks 'is the position to my left in my slice?', the decoder compares its own slice number (65535) against the padding entry (65535), gets a match, and concludes the nonexistent neighbor is real." The check that exists specifically to reject a nonexistent neighbor now accepts one — the sentinel that should fail the comparison passes it — and the decoder operates on a neighbor that does not exist, writing out of bounds.
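The failing comparison can be reproduced in a toy model. This sketch assumes a single row of macroblocks with one sentinel-filled padding cell to its left; the layout, the sizes, and the names `init_row`, `own`, and `left_neighbor_in_my_slice` are illustrative assumptions rather than FFmpeg's actual structures. It shows only the check passing when it should fail, and performs no out-of-bounds access itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: cell 0 is padding, cells 1..ROW_MBS are macroblocks. */
enum { ROW_MBS = 4, ROW_CELLS = ROW_MBS + 1 };
#define EMPTY_SENTINEL ((uint16_t)0xFFFF)

/* Every cell, padding included, starts as "no slice" (all bytes 0xFF). */
static void init_row(uint16_t row[ROW_CELLS])
{
    memset(row, -1, ROW_CELLS * sizeof row[0]);
}

/* Mark macroblock column `col` (0-based) as owned by `slice_counter`. */
static void own(uint16_t row[ROW_CELLS], int col, int slice_counter)
{
    assert(col >= 0 && col < ROW_MBS);
    row[col + 1] = (uint16_t)slice_counter;
}

/* Does the cell to the left of column `col` carry the same slice number?
 * For col == 0 the cell to the left is the padding entry. */
static int left_neighbor_in_my_slice(const uint16_t row[ROW_CELLS], int col)
{
    assert(col >= 0 && col < ROW_MBS);
    return row[col] == row[col + 1];
}
```

If column 0 is owned by slice 7, the check against the padding cell returns 0, as intended. If column 0 is owned by slice 65535, the same check returns 1, because the padding cell and the legitimate slice number are the same 16-bit value.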

Two ages: the bug is 23, the vulnerability is 16

The headline says sixteen years, correctly — but the primary source draws a sharper line. The Red Team traces it: "the bug (where -1 is treated as the sentinel) dates back to the 2003 commit that introduced the H.264 codec. And then, in 2010, this bug was turned into a vulnerability when the code was refactored."

So the sentinel idiom — the seed of the collision — has been in the tree since 2003, roughly 23 years. The exploitable vulnerability was created in 2010, when a refactor of the neighbor-lookup path turned a dormant idiom into a reachable out-of-bounds write; that 2010-to-2026 span is the sixteen years in the title. The distinction is part of why review kept missing it: in 2003 there was nothing to find, and in 2010 the thing to catch was that a refactor had quietly armed a sixteen-bit table cell that had sat harmlessly for seven years.

The exploitability nuance, kept honest

This is where responsible coverage diverges from breathless coverage. Anthropic does not claim remote code execution here, and neither do we. The Red Team's assessment: "This bug ultimately is not a critical severity vulnerability: it enables an attacker to write a few bytes of out-of-bounds data on the heap, and we believe it would be challenging to turn this vulnerability into a functioning exploit."

In plain terms: it is a heap out-of-bounds write of a few bytes that the source says crashes the process, rated hard to weaponize and well short of a polished exploit. On the fix: Anthropic reports that three of the FFmpeg vulnerabilities it found have been fixed in FFmpeg 8.1, with more undergoing responsible disclosure (it doesn't single out which three). We neither inflate a few-byte heap write into an RCE nor dismiss a sixteen-year-old defect because it is awkward to exploit. The interesting fact here is not the blast radius; it is who found it, and why everyone else missed it.

Why fuzzing missed it — and why reasoning didn't

Here we move from the primary source's mechanics to ESS's engineering analysis. Anthropic states only that the bug was "missed by every fuzzer and human who has reviewed the code," and points to "the qualitative difference that advanced language models provide"; it does not theorize why fuzzers missed it. But the trigger condition the source gives us supplies the explanation.

Fuzzing means feeding a program "millions of randomly generated video files" — the source's phrase — and watching for crashes. It is extraordinary at finding bugs in the shallow parts of an input space: a malformed header byte, a length field that disagrees with the payload, a truncated chunk. Mutate enough bytes and you eventually hit those, and fuzzing — sharpened by coverage feedback and sanitizers — has found a staggering number of real defects in exactly this class.

What fuzzing is bad at is deep, structured preconditions: states you reach only by satisfying many semantic constraints at once. To fire this bug you do not need a corrupt byte; you need a valid H.264 frame containing 65,536 slices, with every slice well-formed enough to survive parsing and advance the counter until slice index 65535 lands on the sentinel. A mutation-based fuzzer flipping bytes in a seed corpus is astronomically unlikely to assemble a structurally valid frame with that many conforming slices in the first place. That target is a semantic invariant of the format, not a byte pattern, and random mutation navigates byte space, not invariants.

A reasoning system approaches the code the way an auditor does. It reads the table declaration, sees the cell is 16 bits and the counter an unbounded 32-bit int, sees the memset(-1) sentinel resolve to 65535, and asks the question fuzzing never gets to: what input makes a legitimate slice number equal the sentinel? The answer — 65,536 slices — falls straight out of the arithmetic. You do not search for it; you derive it. That is the qualitative difference Anthropic points to.
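The derivation is short enough to write down. The sketch below assumes slices are numbered from 0 (consistent with 65,536 slices producing slice number 65535) and that the cell is an unsigned integer filled with all-ones bytes; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Value an unsigned cell of `cell_bits` bits holds after an all-ones fill. */
static uint32_t sentinel_for_width(unsigned cell_bits)
{
    assert(cell_bits >= 1 && cell_bits <= 31);
    return (UINT32_C(1) << cell_bits) - 1;
}

/* Slices are numbered from 0 (an assumption of this sketch), so reaching
 * slice number N takes N + 1 slices. */
static uint32_t slices_needed_to_collide(unsigned cell_bits)
{
    return sentinel_for_width(cell_bits) + 1;
}
```

For a 16-bit cell, `sentinel_for_width(16)` is 65535 and `slices_needed_to_collide(16)` is 65536. Nothing in the calculation depends on running the decoder; it follows from the cell width and the fill value alone.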

The lesson: complementary, not competitive

The right takeaway is not "models beat fuzzers." Fuzzing remains the most cost-effective way to clear the enormous field of shallow, mutation-reachable bugs. Reasoning-based discovery is slower and pricier per finding, but reaches the bugs hidden behind deep, format-aware preconditions — the ones with a single magic input no blind mutation will assemble. What changed in 2026 is that the second region stopped being the exclusive province of a few elite human auditors. A model can now read a codec, reason about an integer-width mismatch, and derive the one input that turns sixteen years of dormancy into a heap write — on a code path millions of fuzzed files walked straight past. For software as load-bearing as FFmpeg, a finder that thinks in invariants rather than byte mutations is not a threat to fuzzing; it is the half of the search space fuzzing was never built to cover.

Frequently asked questions

What exactly is the FFmpeg bug?
A type mismatch in the H.264 decoder's deblocking filter. A table recording which slice owns each macroblock position stores 16-bit entries, while the slice counter is an unbounded 32-bit integer. The table is initialized with memset(-1), making every "empty" cell the value 65535. If a frame contains 65,536 slices, legitimate slice number 65535 collides with that sentinel, the decoder mistakes a nonexistent neighbor for a real one, and writes a few bytes out of bounds on the heap.

How dangerous is it — is this remote code execution?
No. Anthropic's Frontier Red Team rates it not critical: it allows "a few bytes of out-of-bounds data on the heap," and they "believe it would be challenging to turn this vulnerability into a functioning exploit." It is a genuine memory-safety defect that the source says crashes the process, but it is rated hard to weaponize and falls well short of a clean exploit. Anthropic says three of the FFmpeg bugs it found are fixed in FFmpeg 8.1, but it doesn't specify whether this particular one is among them.

Why did 16 years of fuzzing never catch it?
Triggering it requires a valid H.264 frame with exactly 65,536 well-formed slices — a deep, structured precondition, not a corrupt byte. Mutation-based fuzzers excel at shallow defects reachable by flipping bytes, but are astronomically unlikely to assemble a structurally valid frame at the one slice count that triggers the collision. That count is a semantic invariant of the format, not a byte pattern, and random mutation navigates byte space rather than invariants.

Why could a model find what fuzzers couldn't?
Because it reads the code and reasons about it instead of mutating inputs blind. A reasoning system can see the 16-bit cell, the unbounded counter, and the memset(-1) sentinel, then ask "what input makes a legitimate slice number equal the sentinel?" The answer — 65,536 slices — is derived from the arithmetic, not discovered by search. Anthropic frames this as "the qualitative difference that advanced language models provide."

Does this mean AI replaces fuzzing?
No — they are complementary. Fuzzing remains the cheapest, fastest way to find shallow, mutation-reachable bugs at scale. Reasoning-based discovery is costlier per finding but reaches bugs hidden behind deep, format-aware preconditions — the half of the search space fuzzing was never built to cover, and the category this FFmpeg bug falls into.