Why Mythos Doesn't Change Much (And What Actually Did)

The Real Shift Happened Months Before Mythos

Your AppSec team is reviewing more code than it was last quarter, and a growing share of it is AI-generated. Your SAST backlog is full of pattern-based findings a developer can fix in an afternoon, while the bugs that actually take production down — the trust-boundary errors, the missing certificate checks, the auth flow that accepts the wrong key type — sit untouched. Somebody on the team ran an agentic code review pilot two months ago. It produced 800 findings on the first repo. Nobody opened the second report.

If that scene feels familiar, the news cycle around Claude Mythos isn't your problem. The workflow underneath it is.

The thing most people missed

While the industry is reacting to Anthropic's April 7 Mythos Preview announcement, the capability shift it represents has been visible in public for almost a year. XBOW became the first autonomous system to top HackerOne's US leaderboard in June 2025, ran 1,060 attack campaigns in a 90-day stretch, and shipped CVE-2026-21536 — a CVSS 9.8 RCE in Microsoft's Devices Pricing Program. Google's Big Sleep agent disclosed 20 real-world zero-days in August 2025. AISLE published 12 OpenSSL zero-days in February 2026, including one dating back to 1998. Anthropic itself reported 500+ high-severity OSS vulnerabilities the same month. DARPA's AIxCC finals at DEF CON 33 found 54 vulnerabilities across 54 million lines of code in four hours of compute.

Mythos is a step change. It is also the latest data point on a curve that has been climbing in plain sight since mid-2025. The headline is loud because Anthropic gated the release behind Project Glasswing. The underlying capability — AI-discovered vulnerabilities at machine speed, in your first-party code and your third-party software alike — isn't gated, and hasn't been for some time.

What we found when we tested it ourselves

We took Anthropic's five publicly disclosed, patched Mythos cases and tried to reproduce them with nothing more than what any AppSec team can buy this afternoon. Two off-the-shelf models (GPT-5.4, Claude Opus 4.6) accessed through their public APIs. One open-source coding agent (opencode). Three independent runs per model per target. No proprietary scaffolding, no internal access, no curated prompts — the line ranges fed to the detection step were generated by a prior agent-planning step, not hand-picked by a researcher.

Cost stayed under $30 per file scanned.

Here is what came back.

| Target | Claude Opus 4.6 | GPT-5.4 | What it tells you about your pipeline |
|---|---|---|---|
| `FreeBSD NFS` (`CVE-2026-4747`) | Reproduced (`3/3`) | Reproduced (`3/3`) | Systems-level discovery is no longer gated |
| `OpenBSD TCP SACK` (27-year-old) | Reproduced (`3/3`) | Not reproduced (`0/3`) | Single-model pipelines have blind spots |
| `FFmpeg H.264 parser` | Useful lead | Useful lead | Raw model output is a lead, not a finding |
| `Botan` (`CVE-2026-34580/34582`) | Reproduced (`3/3`) | Reproduced (`3/3`) | Trust and identity bugs reproduce cleanly |
| `wolfSSL` (`CVE-2026-5194`) | Useful lead | Useful lead | Crypto semantics need deeper validation |

Three full reproductions, one of which depends on which model you ran, and two useful leads. We'll be explicit about what this evidence does and does not cover: we reproduced the findings Anthropic published, not Mythos's full multi-packet ROP chain on FreeBSD. Discovering the bug is one capability. Building the weaponized exploit is another. Public models clear the first bar today.

Why this matters more than Mythos itself

One off-the-shelf model, Claude Opus 4.6, found a 27-year-old subtle logic bug in OpenBSD's TCP SACK implementation in all three runs, a bug that requires reasoning about 32-bit signed sequence-number overflow interacting with linked-list state. Both off-the-shelf models found a certificate-trust bypass in Botan that decides identity using only subject DN and subject key ID. All of it at under $30 per file. That puts the discovery capability on the same shelf as your CI runner, and on the attacker's shelf at the same price.
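To make the "32-bit signed sequence-number overflow" class concrete, here is a minimal sketch. It is not the OpenBSD code and not the actual bug. It only mimics the common BSD-style comparison, where "a is before b" is decided by the sign of the 32-bit difference, and shows that this ordering stops being transitive once values sit far enough apart. The helper names `to_int32` and `seq_lt` are illustrative.

```python
def to_int32(x: int) -> int:
    """Interpret the low 32 bits of x as a signed two's-complement integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def seq_lt(a: int, b: int) -> bool:
    """Wraparound-aware 'a is before b': the sign of the 32-bit difference."""
    return to_int32(a - b) < 0


# Ordinary case: the comparison handles wraparound correctly.
near_wrap = 0xFFFFFFF0
after_wrap = 0x00000010  # 32 bytes later, past the 2**32 wrap
assert seq_lt(near_wrap, after_wrap)

# Pathological case: three values spaced 0x60000000 apart.
a, b, c = 0x00000000, 0x60000000, 0xC0000000
print(seq_lt(a, b), seq_lt(b, c), seq_lt(a, c))  # True True False
```

Here `a` is before `b` and `b` is before `c`, yet `a` is not before `c`. Any sorted list or range check that quietly assumes a consistent ordering can be pushed into an inconsistent state by values like these, which is the kind of interaction a reviewer, human or model, has to reason through.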

The asymmetry isn't model access. Both sides have the models. The asymmetry is operational: attackers compose discovery, exploitation, and orchestration into machine-speed pipelines, while most AppSec teams still operate the same review queue they ran a year ago. The question stops being "do we have model access?" and starts being "does our workflow turn raw model output into a finding a developer will actually act on, before merge?" That is not a model problem. It's a product problem. And it is the problem most AI AppSec pilots are quietly losing.

Where AI AppSec pilots actually fail

We've seen four failure modes consistently, and they map directly to what enterprise security teams tell us in evaluation conversations.

PR noise floods the workflow. A first-pass model dumps 800 candidate findings on a repo. Engineers triage 30 of them, find that 25 are noise, and stop opening the comments. One regulated banking buyer put it bluntly in a recent call: don't harass our developers with a thousand bot comments. The pilot doesn't fail because the model is wrong — it fails because the workflow shipped raw output to a human.

Single-model pipelines have blind spots. Our OpenBSD result is the cleanest example. Claude Opus 4.6 found the SACK bug every time. GPT-5.4 missed it every time. On a different bug class the result could flip. A scanner committed to one provider is committed to that provider's blind-spot profile, and the buyer rarely sees it until production.
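One way to picture the alternative is a cross-referencing step that merges candidates from several models and records which models flagged each one. The sketch below is an assumption about how such a step could look, not a description of any product: the `Candidate` fields, the overlap rule, and the file paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model: str       # which provider produced the candidate
    path: str        # file the candidate points at
    start_line: int
    end_line: int
    bug_class: str   # free-form label, e.g. "integer-overflow"


def overlaps(a: Candidate, b: Candidate) -> bool:
    """Same file, same bug class, intersecting line ranges."""
    return (a.path == b.path and a.bug_class == b.bug_class
            and a.start_line <= b.end_line and b.start_line <= a.end_line)


def cross_reference(candidates):
    """Group overlapping candidates and report which models flagged each group."""
    groups = []
    for cand in sorted(candidates, key=lambda c: (c.path, c.start_line)):
        for group in groups:
            if any(overlaps(cand, member) for member in group):
                group.append(cand)
                break
        else:
            groups.append([cand])  # nothing overlapped: start a new group
    return [
        {
            "path": g[0].path,
            "bug_class": g[0].bug_class,
            "lines": (min(c.start_line for c in g), max(c.end_line for c in g)),
            "models": sorted({c.model for c in g}),
        }
        for g in groups
    ]


merged = cross_reference([
    Candidate("model-a", "src/sack.c", 120, 140, "integer-overflow"),
    Candidate("model-b", "src/parse.c", 10, 20, "out-of-bounds-read"),
    Candidate("model-a", "src/parse.c", 15, 30, "out-of-bounds-read"),
])
for item in merged:
    print(item)
```

The merge is a union, not an intersection: a candidate flagged by only one model is kept and labeled, because a bug that only one model sees (as in the OpenBSD result) is exactly the case a multi-model setup exists to catch.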

Single-file analysis misses semantic bugs. wolfSSL is the textbook case. Models flagged the right location and the right pattern — a missing check — but misattributed the impact, because the actual invariant being violated lives across files. Without repository-wide context, the tool produces a useful lead and stops short of a finding.

No coverage feedback. Engineers run a scan and don't know what was actually examined, what was skipped, or whether the run completed. Half the value of the tool evaporates because the team can't tell the difference between "clean repo" and "scanner crashed silently."
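A minimal sketch of what a coverage signal could carry, assuming a scanner that knows which files it planned to examine. The `ScanReport` structure and its status labels are hypothetical. The point is only that "clean" is reported when the run finished and every planned file is accounted for, and never by default.

```python
from dataclasses import dataclass, field


@dataclass
class ScanReport:
    planned: set = field(default_factory=set)    # files the run intended to examine
    scanned: set = field(default_factory=set)    # files actually examined
    skipped: dict = field(default_factory=dict)  # path -> reason it was skipped
    findings: list = field(default_factory=list)
    finished: bool = False                       # set only on normal completion

    def status(self) -> str:
        """Separate a clean repo from a run that silently did not finish."""
        unaccounted = self.planned - self.scanned - set(self.skipped)
        if not self.finished or unaccounted:
            return "incomplete"
        if self.findings:
            return "findings"
        return "clean-with-gaps" if self.skipped else "clean"


crashed = ScanReport(planned={"a.c", "b.c"}, scanned={"a.c"})
partial = ScanReport(planned={"a.c", "b.c"}, scanned={"a.c"},
                     skipped={"b.c": "file too large"}, finished=True)
print(crashed.status(), partial.status())  # incomplete clean-with-gaps
```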

The workflow that actually works

Detection is now table stakes. Everything that determines whether a tool earns developer trust sits downstream of it.

Pipeline: Code or PR (incoming change) → Detection (route across frontier models) → Validation (second-pass agent + oracles) → Prioritize (repo + business context) → Delivery developers trust (PR comment, IDE, ticket, CI signal)

Failure modes when a stage is skipped: raw output with no validation → PR noise, lost developer trust; single model → blind spots like `OpenBSD SACK`; single-file scope → misattributed impact like `wolfSSL`

Detection has to route across more than one frontier model, because no single model covers every bug class — our OpenBSD result is the proof. Validation has to run a second-pass agent against every candidate finding, ideally with deterministic oracles where the bug class allows one (sanitizers, fuzzers, constraint solvers). Without it, a candidate is a lead, not a finding. Prioritization has to use cross-file, repository-wide context — what the SCA world calls reachability analysis, applied to your own code: reachability from untrusted input, blast radius, proximity to crown jewels, severity calibrated to the actual invariant being violated, not the location of the missing check. Delivery has to land where engineers already are — PR comment, IDE inline diagnostic, pre-commit hook, ticket — and it has to surface what was scanned and what wasn't, so the team can trust the silence as much as the noise.
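The validation and prioritization stages described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the `Finding` fields, the three-valued oracle convention (`True` confirms, `False` refutes, `None` cannot decide), and the ranking key are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    title: str
    reachable_from_untrusted_input: bool
    severity: float        # 0.0-10.0, calibrated to the violated invariant
    status: str = "lead"   # stays a lead until something confirms it


Oracle = Callable[[Finding], Optional[bool]]


def validate(finding: Finding, oracles: list) -> Finding:
    """Promote a lead to a finding only when an oracle confirms it."""
    verdicts = [oracle(finding) for oracle in oracles]
    if any(v is False for v in verdicts):
        finding.status = "rejected"   # assumption: one refutation is enough
    elif any(v is True for v in verdicts):
        finding.status = "finding"
    return finding                    # all None: still just a lead


def prioritize(findings: list) -> list:
    """Keep confirmed findings; reachable ones first, then by severity."""
    confirmed = [f for f in findings if f.status == "finding"]
    return sorted(
        confirmed,
        key=lambda f: (f.reachable_from_untrusted_input, f.severity),
        reverse=True,
    )


# Stand-in oracles; a real one might wrap a sanitizer run or a fuzzer harness.
confirms_overflow = lambda f: True if "overflow" in f.title else None
undecided = lambda f: None

candidates = [
    Finding("heap overflow in parser", True, 8.1),
    Finding("integer overflow in internal tool", False, 9.0),
    Finding("possible missing check", True, 7.0),
]
for c in candidates:
    validate(c, [confirms_overflow, undecided])
print([f.title for f in prioritize(candidates)])
```

The third candidate never reaches the ranked list. No oracle could confirm it, so it remains a lead for deeper review and is not posted as a PR comment.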

Each stage where you skip the work is a stage where the pilot dies.

Where Vidoc fits

Vidoc is built as that workflow layer, not as a model wrapper. Concretely:

Multi-model routing. We run Claude, GPT, and Gemini across the same target and cross-reference results, which is how we avoid the OpenBSD-style miss when one provider has a blind spot.
Cross-file, repository-wide reasoning. Scans reason about the broader codebase, not a single file in isolation, which is what makes the difference on semantic bugs like wolfSSL where the real invariant lives outside the function being flagged.
Second-pass validation before a developer sees it. A separate agent re-checks each candidate against the source and the call graph before anything reaches a PR comment, so the false-positive rate on what engineers actually see is materially lower than first-pass output.
Shift-left, SDLC-integrated delivery. PR decoration in GitHub, GitLab, or Bitbucket, IDE integration, pre-commit hooks, SARIF for CI, ticketing handoff. LLM-driven security review runs before merge, not after deploy, in the workflow developers already use.
Deployment posture for regulated environments. On-prem and strongly-isolated options for teams in banking, defense, healthcare, and critical infrastructure where source code cannot leave a controlled boundary.
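For readers unfamiliar with the SARIF handoff mentioned in the list above, here is a minimal sketch of serializing validated findings into a SARIF 2.1.0 log that CI systems can ingest. It is not Vidoc's actual output. The tool name `example-scanner`, the input dictionary keys, and the sample finding are placeholders.

```python
import json


def to_sarif(findings, tool_name="example-scanner"):
    """Build a minimal SARIF 2.1.0 log from a list of finding dicts."""
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name}},
            "results": [
                {
                    "ruleId": f["rule_id"],
                    "level": f["level"],  # "error", "warning", "note", or "none"
                    "message": {"text": f["message"]},
                    "locations": [{
                        "physicalLocation": {
                            "artifactLocation": {"uri": f["path"]},
                            "region": {"startLine": f["line"]},
                        }
                    }],
                }
                for f in findings
            ],
        }],
    }


log = to_sarif([{
    "rule_id": "missing-cert-check",  # hypothetical rule identifier
    "level": "error",
    "message": "Certificate chain is accepted without verifying the issuer.",
    "path": "src/tls/verify.c",       # hypothetical path
    "line": 42,
}])
print(json.dumps(log, indent=2))
```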

A note on positioning: Vidoc is adjacent to your SAST and SCA stack, not a replacement for it. SAST catches deterministic, pattern-based bugs cheaply and reliably. Agentic code review catches the context-heavy, semantic, and chained bugs that SAST is structurally incapable of catching — the missing auth wiring, the trust-boundary error, the cross-file invariant. Run them together.

What to evaluate when you pilot any AI AppSec tool

If you're scoping a pilot — with us, with anyone — these are the questions that separate the tools that survive month two from the ones that don't.

Precision on validated findings, not raw recall. Ask the vendor to report false-positive rate against a fixed corpus, after their validation layer runs.
Model routing policy. Which models, routed on what basis, with what fallback. A wrapper around one provider should justify the concentration risk in writing.
Repository-wide reasoning. Can the tool follow a missing check across files, or does it score each function in isolation?
PR-decoration UX. Look at the actual comment on a real PR. Is it short, actionable, and contextual, or is it a 200-line dump?
Scan completeness signal. Can engineers see what was scanned, what was skipped, and whether the run finished?
Isolation and deployment. SaaS, single-tenant, self-hosted, or air-gapped — and what is the data-handling posture under each?
Org-wide trend reporting. Can security leadership see the most common bug classes across teams, so training and platform fixes get prioritized?
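The first question on the list is easy to make measurable. A minimal sketch, assuming a fixed corpus where the set of real bugs is already known: compare what the tool reports before and after its validation layer. The finding identifiers and corpus below are invented for illustration.

```python
def precision_report(reported_ids, confirmed_ids):
    """Score a set of reported findings against a corpus of known real bugs."""
    reported = set(reported_ids)
    confirmed = set(confirmed_ids)
    true_pos = reported & confirmed
    return {
        "reported": len(reported),
        "true_positives": len(true_pos),
        "false_positives": len(reported - confirmed),
        "missed": len(confirmed - reported),
        # Share of what a developer sees that is actually real.
        "precision": len(true_pos) / len(reported) if reported else None,
    }


known_bugs = {"F1", "F2", "F7", "F8"}                 # hypothetical labeled corpus
raw_output = {"F1", "F2", "F3", "F4", "F5", "F6"}     # first-pass candidates
after_validation = {"F1", "F2", "F3"}                 # what survives validation

print(precision_report(raw_output, known_bugs))
print(precision_report(after_validation, known_bugs))
```

In this made-up corpus validation lifts precision from 2 of 6 to 2 of 3 without changing the two missed bugs, which is why precision and misses are worth asking for together. The same arithmetic applies to the pilot described earlier: 25 noise items out of 30 triaged is a precision of 5/30, roughly 17%.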

If you'd like to see how we score against this list on your own repo, we'll walk you through it.

Closing

Mythos is a milestone, not the starting line. Becoming Mythos-ready — in the language the industry has converged on this month — isn't about model access. It's about the workflow layer the CSA briefing's VulnOps framing points at: multi-model routing, validation, prioritization, delivery. Shipped before merge, at machine speed, on first-party and third-party code. The capability is here, in public, at API-call prices. The work that's left is product work.

See it on your code

We'll run a 30-minute walkthrough on a repo you choose and show you exactly where validation, routing, and PR-level delivery change the output you get. Book a walkthrough.

