LLMs Can Now Find Real Security Vulnerabilities — What the N-Day-Bench Results Mean for Developers
On April 14, 2026, researchers published N-Day-Bench — a new benchmark showing that frontier language models can autonomously identify real, disclosed CVEs in production codebases. The results triggered an immediate Hacker News thread and a broader debate about what it means when AI can find the vulnerabilities attackers already know about. Here is what the benchmark measures, what the numbers reveal, and what every developer should do about it right now.
TL;DR
- N-Day-Bench shows frontier LLMs — GPT-5.4, Claude 3.7, and peers — can locate real CVEs in production code without being told where to look.
- This marks a genuine capability milestone: AI security tools have moved from synthetic puzzles to real-world vulnerability identification.
- The security community is sharply divided on whether this accelerates defenders or hands attackers a faster recon tool — or both.
- Developers can act today: AI agents are already capable of meaningful proactive code audits, dependency checks, and vulnerability triage without a security team budget.
Model Performance on N-Day-Bench
The benchmark evaluated five frontier models against a curated set of real CVEs drawn from the National Vulnerability Database, applied to the versions of production open-source codebases in which each vulnerability was originally present. Models were scored on whether they could identify the vulnerable file and function, describe the flaw class correctly, and propose a valid remediation.
| Model | CVE Detection Rate | Correct Flaw Class | Valid Fix Suggested | False Positive Rate |
|---|---|---|---|---|
| GPT-5.4 | 61% | 78% | 54% | 12% |
| Claude 3.7 | 57% | 81% | 51% | 9% |
| Gemini 2.5 Pro | 48% | 74% | 43% | 14% |
| Llama 4 405B | 41% | 68% | 36% | 19% |
| Mistral Large 3 | 29% | 61% | 24% | 22% |
Source: N-Day-Bench paper (arXiv, April 14 2026). Detection rate = model correctly located vulnerable code. Flaw class = correct vulnerability category (e.g. buffer overflow, SQLi). Valid fix = proposed patch addresses the root cause without introducing new issues.
Security Tasks AI Can Help With Today
| Security Task | AI Capability Level | Human Review Still Needed? | Best Used For |
|---|---|---|---|
| Code audit (static) | Strong | Yes — confirm findings | First-pass scan on new code or PRs |
| Dependency scanning | Strong | Minimal — cross-check CVE DB | Pinpointing outdated packages with known CVEs |
| Pen test support | Moderate | Yes — human-led engagement | Reconnaissance, payload generation, report drafting |
| Vulnerability triage | Strong | Yes — prioritize by context | Sorting backlog of findings by severity and exploitability |
| Security documentation | Strong | Light review | Threat models, runbooks, disclosure write-ups |
| Incident analysis | Moderate | Yes — always human-led | Log parsing, timeline reconstruction, IOC extraction |
1. What N-Day-Bench Is
Most AI security benchmarks test models on Capture the Flag (CTF) challenges — synthetic, gamified puzzles designed to be solved, not real production vulnerabilities. N-Day-Bench takes a different approach: it collects actual CVEs from the NIST National Vulnerability Database and applies them to the exact versions of real open-source projects in which each flaw existed when it was first disclosed.
The “N-day” in the name refers to the time elapsed since a vulnerability was publicly disclosed — as opposed to a “zero-day”, which has not yet been revealed. By focusing on N-days, the benchmark sidesteps the ethical complications of testing on undisclosed flaws while still measuring performance against vulnerabilities that matter: real bugs that real attackers already know about and that real codebases may not yet have patched.
The evaluation setup presents a model with the full source code of a repository at the vulnerable version and asks it to identify any security vulnerabilities. No hints are given about the location, type, or existence of a specific CVE. The model must independently reason about the code, locate the problem, classify the flaw type, and propose a correct remediation. The ground truth for each case is the published CVE record and the upstream patch from the project's official repository.
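The scoring described above can be sketched in a few lines. This is a minimal illustration of the scoring logic, not the paper's actual harness; the class names and field layout (`CveGroundTruth`, `ModelFinding`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CveGroundTruth:
    """Published CVE record used as ground truth (hypothetical schema)."""
    cve_id: str
    vulnerable_file: str
    vulnerable_function: str
    flaw_class: str  # e.g. "SQL injection"

@dataclass
class ModelFinding:
    """One vulnerability report produced by the model under test."""
    file: str
    function: str
    flaw_class: str

def score_case(truth: CveGroundTruth, findings: list[ModelFinding]) -> dict:
    """Score one benchmark case the way the paper describes:
    detection = model located the vulnerable file and function;
    flaw class is only credited when the location was also correct."""
    located = [
        f for f in findings
        if f.file == truth.vulnerable_file
        and f.function == truth.vulnerable_function
    ]
    detected = bool(located)
    correct_class = any(f.flaw_class == truth.flaw_class for f in located)
    return {"detected": detected, "correct_class": correct_class}
```

Aggregating `score_case` over the full CVE set would yield the per-model rates reported in the table above.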
The paper was posted to arXiv on April 14, 2026, and quickly climbed to the front page of Hacker News, where a thread with over 400 comments debated the implications. The authors released the benchmark dataset publicly to allow independent replication.
2. What the Results Show
The headline number — GPT-5.4 detecting 61% of CVEs without any hints — understates how significant this is. Earlier attempts to apply LLMs to real vulnerability detection, including the widely cited 2024 Berkeley benchmark on exploit generation, showed frontier models struggling with real-world codebases. The gap between CTF performance and production-code performance was stark.
N-Day-Bench shows that gap is narrowing fast. A 61% detection rate on real CVEs in real code — with no guidance on where to look — places frontier models solidly in the range of a competent junior security engineer doing a first-pass review. The benchmark's “correct flaw class” metric is particularly instructive: Claude 3.7 correctly categorized the vulnerability type in 81% of cases where it detected anything at all, suggesting the model's security reasoning is more accurate than its coverage.
False positive rates are meaningful but manageable. At 9–12% for the top models, a developer using AI-assisted scanning would encounter roughly one false alarm for every seven to ten real findings. That compares favorably to traditional static analysis tools, which routinely produce false-positive rates above 30% on complex codebases.
Performance varied substantially by vulnerability class. Models performed best on well-documented flaw types — SQL injection, path traversal, insecure deserialization — where training data is rich with examples. They performed worst on concurrency bugs, timing side channels, and integer overflow conditions that require deep reasoning about runtime behavior rather than pattern recognition in source text. This matches the intuition that LLMs are better at recognizing known patterns than at reasoning about dynamic execution.
3. What This Means for the Offense/Defense Balance
The Hacker News thread surfaced the central tension in this result: the same capability that helps defenders find vulnerabilities in their own code helps attackers find vulnerabilities in everyone else's code. A model that detects 61% of real CVEs is a model that could, if misused, accelerate the reconnaissance phase of an attack campaign.
The honest answer is that both sides benefit — but the asymmetry of the current situation may favor defenders in the short term. Here is why: N-day vulnerabilities are already disclosed. Attackers who want to exploit a known CVE can look it up in the NVD today. The new capability N-Day-Bench demonstrates is not “finding secrets nobody knew” — it is “finding known issues faster in large codebases.” Defenders with access to their own code can apply that capability continuously. Attackers would need the source code of a target to use it the same way.
The harder question is what happens as models improve toward higher detection rates and begin to generalize to novel flaw patterns. Security researchers quoted in the Hacker News thread noted that a model capable of 85%+ CVE detection across diverse flaw types would constitute a materially different threat surface — one where automated discovery of near-zero-day-quality findings at scale becomes possible for well-resourced actors.
NIST's AI Risk Management Framework (AI RMF 1.0) identifies “capability amplification” — where AI enables actions previously requiring specialized expertise at scale — as a core governance concern. N-Day-Bench results are a concrete example of that category arriving in practice. The security community's emerging consensus is that the correct response is to accelerate defensive tooling rather than attempt to suppress the capability — because suppression is ineffective while the underlying models are widely available.
For context on how AI agents are already being exploited in active attack campaigns, see our coverage of the SOHO router AI agent attack vector documented in April 2026.
Run AI-assisted security audits on your own code — no setup required
Happycapy's agent mode lets you upload code and run multi-step vulnerability analysis across GPT-5.4, Claude 3.7, and other frontier models. Free plan available — Pro from $17/mo.
Try Happycapy Free →
4. How Developers Can Use AI for Security Today
The N-Day-Bench results confirm that frontier-model security analysis is no longer a research curiosity — it is a practical tool available to any developer today. The question is not whether to use it but how to integrate it into a workflow that actually catches problems before they reach production.
First-pass code audit. The highest-value application is the static code review. Before merging a pull request — especially one that touches authentication, authorization, input handling, or cryptography — paste the changed code into an AI agent and ask specific, targeted questions: “Does this function correctly validate the input before using it in a database query?”, “Is this JWT verification correct?”, “Are there any integer overflow conditions in this loop?”. Specific questions produce better results than generic “check for security issues” prompts.
Dependency scanning. Models with access to current CVE databases can cross-reference your package.json, requirements.txt, or go.sum against known vulnerabilities in specific package versions. This is not a replacement for tools like npm audit or Dependabot, but it adds an interpretive layer — the AI can explain which CVEs in your dependency tree are actually reachable from your code, reducing noise.
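The cross-referencing step can be sketched as below. The advisory map here is a hardcoded stand-in (a real workflow would pull from the NVD or a vendor feed), though the two CVE pairings shown are real published advisories for those package versions.

```python
import re

# Illustrative advisory map: package -> {vulnerable version: CVE id}.
# A production tool would query the NVD or an OSV-style feed instead.
KNOWN_VULNS = {
    "requests": {"2.19.0": "CVE-2018-18074"},
    "pyyaml": {"5.3": "CVE-2020-14343"},
}

def scan_requirements(text: str) -> list[tuple[str, str, str]]:
    """Return (package, version, cve) for each pinned dependency in a
    requirements.txt-style file that matches a known-vulnerable version."""
    hits = []
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z0-9_.-]+)==([\w.]+)", line)
        if not m:
            continue  # skip comments, unpinned specs, blank lines
        name, version = m.group(1).lower(), m.group(2)
        cve = KNOWN_VULNS.get(name, {}).get(version)
        if cve:
            hits.append((name, version, cve))
    return hits
```

The interpretive layer the AI adds sits on top of output like this: given the hit list plus your source tree, it can judge which flagged CVEs are actually reachable.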
Vulnerability triage. If your team receives reports from bug bounties, automated scanners, or pentesters, AI agents can dramatically accelerate the triage process. Feed the AI the vulnerability report and the relevant code and ask it to assess exploitability, estimate CVSS impact, and propose a patch. This does not replace human judgment on critical findings, but it handles the administrative overhead of sorting a backlog.
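The "severity and exploitability" ordering is mechanical once the AI has assessed each report; a minimal sketch (the `Finding` schema is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    title: str
    cvss: float        # CVSS base score, 0.0-10.0
    exploitable: bool  # reachable from untrusted input, per triage assessment

def triage_order(findings: list[Finding]) -> list[Finding]:
    """Sort the backlog: exploitable findings first, then by CVSS
    descending, mirroring the severity/exploitability ordering above."""
    return sorted(findings, key=lambda f: (not f.exploitable, -f.cvss))
```

The human reviewer then works the sorted list top-down, spending judgment where it matters most.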
Security documentation. Threat models, security runbooks, and disclosure write-ups are exactly the kind of structured-but-tedious writing where AI agents excel. A well-prompted model can generate a first draft STRIDE threat model for a new system design in minutes — something that would take a human engineer several hours to produce from scratch.
Incident analysis support. During an incident, AI agents can parse logs, extract indicators of compromise, reconstruct event timelines from structured data, and draft sections of the post-mortem. The value is speed: getting a coherent picture of what happened faster so the response team can act rather than analyze.
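The IOC-extraction step is exactly the kind of mechanical pass an agent automates; a bare-bones sketch covering two common indicator types (IPv4 addresses and SHA-256 hashes):

```python
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")

def extract_iocs(log_text: str) -> dict:
    """Pull simple indicators of compromise out of raw log text:
    deduplicated IPv4 addresses and SHA-256 hashes, sorted for stable output."""
    return {
        "ips": sorted(set(IP_RE.findall(log_text))),
        "sha256": sorted(set(SHA256_RE.findall(log_text))),
    }
```

A real extractor would also handle domains, URLs, and defanged indicators (`hxxp://`, `203.0.113[.]7`), but the shape of the task is the same.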
For a broader view of how AI tools are being applied across development workflows in 2026, see our guide to the best AI tools for productivity in 2026.
For developers evaluating whether to run AI security tooling locally or via a managed service, our comparison of local AI agents versus cloud AI platforms covers the security, performance, and cost trade-offs in depth.
5. The Practical Takeaway
N-Day-Bench does not change what developers should be doing — it changes how urgently they should be doing it. Proactive code auditing has always been best practice. The benchmark confirms that a credible AI tool for doing it faster and more cheaply than hiring a security consultant is now available.
The offense/defense concern is real but should not be paralyzing. Every developer who runs an AI-assisted audit on their codebase this week is fixing vulnerabilities that would otherwise remain for attackers — human or AI-assisted — to find later. The asymmetry of waiting is unfavorable. Attackers are already applying these capabilities. Defenders should be too.
Practically, this means adding AI security review to your pull request workflow, running a quarterly AI-assisted audit of your most security-critical code, and using AI agents to stay current with CVEs affecting your dependency tree. None of these replace a professional penetration test for high-stakes systems, but they close the gap between “no security review” and “professional security review” for the vast majority of developers who cannot justify the cost of the latter on every codebase they maintain.
Happycapy's agent mode supports multi-step security analysis across frontier models — allowing you to run the kind of iterative, context-aware code audit that N-Day-Bench demonstrates is now within reach. Plans start at Free, with Pro at $17/month (annual) and Max at $167/month (annual) for teams that need higher usage limits and priority access to the latest models.
Audit your code with the same frontier models that ace N-Day-Bench
Happycapy gives you access to GPT-5.4, Claude 3.7, and more in a single agent interface. No API keys to manage. Pro starts at $17/mo — Max at $167/mo for power users.
Get Happycapy Pro — $17/mo →
Frequently Asked Questions
What is N-Day-Bench?
N-Day-Bench is a security benchmark released on April 14, 2026 that tests whether large language models can autonomously identify real, already-disclosed (N-day) vulnerabilities in production codebases. Unlike synthetic CTF challenges, N-Day-Bench uses actual CVEs from the National Vulnerability Database applied to real open-source projects — measuring whether frontier AI models can locate vulnerable code, describe the flaw, and suggest a correct fix without being told where to look.
Can AI replace security professionals?
No. AI models on N-Day-Bench succeed on a meaningful portion of disclosed CVEs but miss many others, particularly those requiring deep runtime analysis, exploit chaining, or business-logic understanding. Human security professionals retain decisive advantages in adversarial threat modeling, novel zero-day discovery, and interpreting results in organizational context. The practical value of AI today is as an accelerant — automating tedious first-pass scanning so human experts can focus on complex findings.
Is it safe to use LLMs for security analysis?
Yes, with appropriate precautions. Sending code to a cloud LLM means that code is processed by a third party — treat it as you would any external code review service and ensure you have the right to share the code. For proprietary or sensitive codebases, use a platform with strong data-handling agreements. AI security analysis output should always be validated by a human before remediation decisions are made.
How can I use AI to audit my code?
The most accessible starting point is an AI agent platform that lets you paste or upload code and ask targeted security questions: “Does this function have any injection vulnerabilities?”, “Are there race conditions in this lock implementation?”, “Does this dependency version have known CVEs?”. Happycapy's agent mode supports multi-step security audits, CVE cross-referencing, and fix suggestions — all without managing API keys or model infrastructure. Start free at happycapy.ai.
Sources: N-Day-Bench paper (arXiv, April 14 2026) — benchmark methodology and model performance data; Hacker News thread on N-Day-Bench (news.ycombinator.com, April 14 2026); NIST National Vulnerability Database (nvd.nist.gov) — CVE records and CVSS scoring; NIST AI Risk Management Framework 1.0 (NIST AI 100-1) — capability amplification and adversarial input concepts.