LLMs became good at hacking by accident

By Dawid Moczadło · 7 min read

TL;DR

LLMs are becoming good at math, coding and now security. We tested LLMs on 95 real-world security issues found by humans in open-source software, and they were able to identify 80% of them. LLMs are becoming scary good at finding security issues in code, but this is not what makes them dangerous.

LLMs now discover real‑world zero‑days and they do it with brute‑force patience, not superhuman IQ. Today, a swarm of lightweight LLM agents can out‑grind any human.

Linux Kernel zero-day

You have probably heard about CVE-2025-37899 - a security issue found by OpenAI's o3 model in the Linux kernel’s SMB implementation.

[...] I found the vulnerability with nothing more complicated than the o3 API – no scaffolding, no agentic frameworks, no tool use.

Although the LLM did not discover a new category of vulnerability, it was able to identify an issue in the most battle-tested and thoroughly reviewed code in the world - the Linux Kernel.

The issue was a use-after-free vulnerability, a very common mistake in C. This kind of mistake is very easy to make, especially in a complex codebase.

Yet, it’s almost impossible to spot if you have to find it among millions of lines of code in thousands of files.

What do Google, OpenAI, and Anthropic say in their system cards?

Ok, so we have a single example of LLMs finding a zero-day in open-source software, so what? Big labs probably already benchmark and monitor this capability (finding security issues in code) in all of their models.

According to providers, the models are safe - no model reached "high risk".

... or are they?

What do they test, exactly? They created synthetic CTF-like challenges to evaluate LLMs on “Vulnerability Discovery and Exploitation”.

From o3 and o4-mini system card:


The “Capture the Flag” and “Cyber Range” evaluations both focus on identification and exploitation.

Given 12 attempts at each task, o3 completes 89% of high-school level, 68% of collegiate level, and 59% of professional level CTF challenges. o4-mini completes 80% of high-school level, 55% of collegiate level, and 41% of professional level challenges. Both models perform significantly better than prior o-series models due to improved tool use and ability to make use of long rollouts.

But, how does it translate to the real world?

Does it mean it can hack into any national bank and erase all debt?!

I would argue that none of these benchmarks show the real-world impact of how good LLMs have become at identifying security issues in code. Synthetic benchmarks are only a proxy for the real world, and a very noisy one.

Additionally, they evaluate the full workflow - they expect the LLMs to both identify security issues and exploit them. From my perspective, these are entirely different tasks: even if an LLM only gets good at finding security issues, without being able to exploit them, that alone is a high risk.

How good are LLMs at discovering zero-days, really?

If testing LLMs on synthetic CTF-like challenges does not give enough insight into the capabilities of LLMs, what does?

We need to evaluate LLMs on issues found by humans in real-world software used by real humans.

That is why we curated a benchmark dataset of 95 security issues identified by humans in open-source software. The dataset was prepared by mapping CVEs (https://cve.mitre.org/) to GitHub commits.
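For a sense of what one case looks like, here is a minimal sketch of a benchmark record; the field names and example values are illustrative, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkCase:
    """One real-world vulnerability, mapped from a CVE to the GitHub commit that fixed it.
    Field names and the example values below are illustrative, not our actual schema."""
    cve_id: str           # e.g. "CVE-2025-37899"
    repo: str             # "owner/name" of the open-source project
    fix_commit: str       # SHA of the patching commit
    vulnerable_file: str  # the single file a human could spot the bug in
    category: str         # e.g. "Cross-site Scripting (XSS)"
    cvss_score: float     # severity, 0.1-10.0

case = BenchmarkCase(
    cve_id="CVE-2024-00000",          # made-up ID for illustration
    repo="example-org/example-app",
    fix_commit="0123abc",
    vulnerable_file="app/views/profile.py",
    category="Cross-site Scripting (XSS)",
    cvss_score=7.5,
)
```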

The dataset covers 31 different categories of security issues (not evenly distributed) - XSS, broken business logic, SQL injection, broken auth, etc. Additionally, we limited the cases to issues that a human could identify by reading a single file.

Top vulnerability categories in our dataset:

  • Cross-site Scripting (XSS): 15 cases (15.8%)
  • Cross-Site Request Forgery (CSRF): 14 cases (14.7%)
  • Improper Access Control: 9 cases (9.5%)
  • OS Command Injection: 7 cases (7.4%)
  • Input Validation Issues: 6 cases (6.3%)

We limited the dataset to 4 programming languages:

  • Python: 44%
  • JavaScript: 22%
  • TypeScript: 21%
  • Go: 13%

And the severity distribution:

  • High (7.0-8.9): 42%
  • Critical (9.0-10.0): 28%
  • Medium (4.0-6.9): 24%
  • Low (0.1-3.9): 5%

The average CVSS score across all vulnerabilities is 7.5, indicating these are genuine, high-impact security issues.

Our goal wasn't to test how good LLMs are at any specific language or category of security issues, but to test their general capabilities.
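Roughly speaking, evaluating a case boils down to handing the model one vulnerable file and checking whether it names the known issue. The sketch below is a simplified illustration of that loop, not our actual harness; the model name, prompt, and matching logic are assumptions.

```python
# Simplified per-case evaluation loop (illustrative only - not our actual harness).
# Assumes the OpenAI Python SDK and the BenchmarkCase sketch from above.
from openai import OpenAI

client = OpenAI()

def model_finds_issue(case: BenchmarkCase, source_code: str) -> bool:
    response = client.chat.completions.create(
        model="o3",  # assumed model choice for the sketch
        messages=[
            {"role": "system", "content": "You are a security code reviewer."},
            {"role": "user", "content": (
                "List any security vulnerabilities in this file, "
                f"with the affected lines and a category for each:\n\n{source_code}"
            )},
        ],
    )
    answer = response.choices[0].message.content.lower()
    # Naive scoring: did the model name the known vulnerability category?
    return case.category.lower() in answer

# detection_rate = sum(model_finds_issue(c, load_file(c)) for c in cases) / len(cases)
```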

The results

Vulnerability detection accuracy: 80%.

The LLMs were able to identify 80% of the security issues in the code!! Wow!

We crossed a threshold for scary.

What makes LLM dangerous?

I need to share some of my background to provide better context on what makes a "successful" hacker.

My ethical hacking story

Klaudia (my co-founder) and I weren’t the best ethical hackers in the world; we just built security automation that was constantly doing the same thing, a stupidly simple thing, and still - we found critical security issues and made some bucks.

It was sending 500 million HTTP requests a day to the same targets, looking for the same 20 old misconfigurations.

Companies did not seem to be vulnerable at first, but at some point, the developers had to cut corners. They had tight deadlines to deliver, or they were just too tired.

They PRed, they merged, they deployed. We caught them. Every single time.

The size of the security budget of the company did not matter. The same trick worked for Microsoft and Apple.

We were successful: we earned $120,000+ simply by watching and doing the same repetitive checks, over and over.

Same targets → scan for 20 misconfigs → get paid (sometimes)

Simple, but effective.
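A toy version of that automation fits in a few lines; the targets and paths below are generic examples of old, well-known misconfigurations, not the actual checks we ran:

```python
# Toy sketch of a repetitive misconfiguration scanner (illustrative targets and checks).
import requests

TARGETS = ["https://app.example.com", "https://api.example.com"]
CHECKS = [
    "/.git/config",    # exposed Git metadata
    "/.env",           # leaked environment file
    "/server-status",  # open Apache status page
    # ... roughly 20 simple, well-known misconfigurations
]

def scan(target: str) -> None:
    for path in CHECKS:
        try:
            resp = requests.get(target + path, timeout=5)
        except requests.RequestException:
            continue
        if resp.status_code == 200 and resp.text.strip():
            print(f"[!] possible misconfiguration: {target}{path}")

for target in TARGETS:
    scan(target)  # run this on a schedule, against the same targets, forever
```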

Patience beats sophistication

The same principle we used a couple of years ago to build a system that ethically hacked the most secure companies in the world applies here.

Patience and persistence in identifying simple security issues at scale beat any sophisticated methods.

LLMs are infinitely patient and persistent by design. You can task hundreds of mini agents to look for simple issues across thousands of files, in many repositories at once, and they will do it. They won’t get overwhelmed, and they won’t complain. They will quietly follow your orders, waiting for the next task.
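A minimal sketch of that fan-out, assuming the OpenAI Python SDK; the model name, prompt, and worker count are placeholders, not a production pipeline:

```python
# Fan out lightweight "agents" over a codebase: one cheap model call per file.
# Model name, prompt, and concurrency are assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def review_file(path: Path) -> str:
    code = path.read_text(errors="ignore")
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed: a cheap model is enough for simple issues
        messages=[{
            "role": "user",
            "content": f"Point out obvious security issues in this file, if any:\n\n{code}",
        }],
    )
    return f"{path}: {response.choices[0].message.content}"

files = list(Path("repo/").rglob("*.py"))  # thousands of files across many repositories
with ThreadPoolExecutor(max_workers=100) as pool:  # hundreds of patient workers
    for finding in pool.map(review_file, files):
        print(finding)
```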

Yes, maybe LLMs can’t craft a 10-step exploit, but they sure can review thousands of files for much simpler vulnerabilities, and that can be just as deadly.

Also, it does not matter that the LLMs didn't produce working exploits. Giving humans an advantage in finding an entry point is enough.

But aren’t those issues already detected by other security tools?

Not really.

We took the same dataset of 95 real-world cases and ran the best tools on the market (CodeQL from GitHub, Snyk Code, Semgrep) against it.


On our benchmark, leading SAST tools found ≤25% of the issues that LLMs found. They look like toys in comparison.

It turns out that the security tools (SASTs) designed to find security issues in source code are pretty bad at it compared to LLMs. They can only detect issues that can be expressed as a grep-like rule (use of an insecure function, etc.).
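To make that concrete, a grep-like rule is just pattern matching: it can flag a dangerous call, but it has no idea whether the surrounding logic is broken. A toy illustration:

```python
# Toy "grep rule": flags a known-dangerous pattern, understands nothing else.
import re

RULE = re.compile(r"subprocess\.(call|run|Popen)\(.*shell\s*=\s*True")

def grep_style_check(source: str) -> list[int]:
    """Return line numbers matching the pattern - the essence of many SAST rules."""
    return [i for i, line in enumerate(source.splitlines(), 1) if RULE.search(line)]

code = '''
import subprocess
def ping(host):
    # Flagged: textual pattern match on shell=True
    subprocess.run(f"ping -c 1 {host}", shell=True)

def is_admin(user):
    # Missed: broken authorization logic, no pattern to grep for
    return True
'''
print(grep_style_check(code))  # -> [5]: flags the shell=True call, misses the is_admin bug
```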

The issues in the dataset were found by humans in codebases that probably already used SASTs to catch the low-hanging fruit.

Yet humans still found plenty of issues, because humans can reason and understand the code; SASTs don’t.

How in the hell did LLMs get so good at this?!

It’s one of the emergent behaviors nobody is talking about.

Being good at programming means understanding code deeply: reading it, reasoning about it, and finding issues in it. This happened because all of the major labs started optimizing for coding capabilities (writing and understanding code).

LLMs suck at coding → RL goes brrr → LLMs become better at understanding code → LLMs hack the Pentagon

The better LLMs get at coding and debugging, the more security issues they can find, and the better "hackers" they will become.

You can’t have an LLM that is great at coding but can’t help you hack the Pentagon. Teaching LLMs to code is a double-edged sword.

But perhaps the LLMs will be so proficient at coding that they will stop introducing security issues?

Maybe.

What stops LLMs from hacking the world?

False positives.

For each valid security issue they find, they generate three ideas that turn out not to be vulnerabilities. The diamonds get buried in a pile of crap, and you have to dig them out first.

Here are some example false-positive rates:

  • google/gemini-2.5-pro: 1 out of 3
  • o3: 1 out of 3
  • Sonnet 4: 1 out of 289 (yikes)

This is something the LLMs can’t do on their own; you need a system for that. This is one of the hardest problems we have been battling with for almost a year now, but it’s finally solved.

I can’t spill too much of how we did it, but I can promise you the world of cybersecurity will get much more interesting in the coming year.
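For readers wondering what such a system even involves, the most naive possible version (emphatically not how we do it) is a second-opinion pass: re-ask a model to defend each candidate finding and keep only what survives a majority vote. The model name, prompt, and threshold below are assumptions for the sketch.

```python
# Naive second-opinion filter over raw LLM findings (illustrative only - not our system).
from openai import OpenAI

client = OpenAI()

def survives_review(finding: str, source_code: str, votes: int = 3) -> bool:
    confirmations = 0
    for _ in range(votes):
        response = client.chat.completions.create(
            model="o3",  # assumed model name
            messages=[{
                "role": "user",
                "content": (
                    "A reviewer claims this code has the following vulnerability:\n"
                    f"{finding}\n\nCode:\n{source_code}\n\n"
                    "Answer strictly YES or NO: is this a real, exploitable issue?"
                ),
            }],
        )
        if response.choices[0].message.content.strip().upper().startswith("YES"):
            confirmations += 1
    return confirmations >= 2  # simple majority of three votes
```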

Future work

  1. We are looking for partners to evaluate the impact of LLMs on private code
  2. We will be open-sourcing the dataset and the code, so you can reproduce the results and improve on them.

If you want to collaborate - test the full system on your codebase, or work with the dataset - please reach out to me by email.

Expect new versions of the benchmark

I know that our work is not perfect, and there are many things that can and will be improved.

The v2 version of the benchmark will contain:

  • A more balanced dataset (a similar number of issues in each category and language)
  • Benchmarks of all major LLMs on the dataset, with a leaderboard