By Connie · Last reviewed: April 2026 — pricing & tools verified · This article contains affiliate links. We may earn a commission at no extra cost to you if you sign up through our links.
N-Day Bench: How Researchers Are Testing LLMs to Find Security Vulnerabilities in 2026
April 14, 2026 · 10 min read
- N-Day Bench is the first standardized benchmark measuring LLM performance on identifying real, known (CVE-documented) security vulnerabilities.
- GPT-5.4 leads at 73% identification accuracy; Claude 3.7 follows at 71%; Gemini 3.1 Pro at 68% — all vastly better than 2024 models.
- N-day performance benefits defenders more than attackers — the vulnerabilities are already public on NVD and Exploit-DB.
- Managed AI platforms like Happycapy let security teams run these capabilities without managing API access or model deployment.
For years, researchers have warned that frontier AI models could become capable offensive cybersecurity tools. In 2026, the security community stopped speculating and started measuring. N-Day Bench is the result: a rigorous, standardized benchmark that tests whether LLMs can actually find security vulnerabilities in real code and systems.
The results are striking — and more nuanced than the "AI will hack everything" headlines suggest. Here is what the benchmark measures, what the scores mean, and why this matters for security teams using AI today.
What Is N-Day Bench?
N-Day Bench was created by a consortium of academic security researchers from Carnegie Mellon, ETH Zürich, and the National University of Singapore, with support from DARPA's AI Cyber Challenge program. It was published in early 2026 and has already been adopted as a standard evaluation by all three major AI labs.
The benchmark uses a dataset of 500 real CVEs (Common Vulnerabilities and Exposures) spanning the past five years, drawn from five categories:
| Category | CVEs in Dataset | Examples |
|---|---|---|
| Memory corruption | 120 | Buffer overflows, use-after-free, heap overflows |
| Logic flaws | 110 | Authentication bypass, business logic errors, race conditions |
| Injection attacks | 100 | SQL injection, XSS, command injection, SSTI |
| Cryptographic weaknesses | 90 | Weak RNG, key exposure, padding oracle, timing attacks |
| Configuration errors | 80 | Misconfigured permissions, exposed secrets, open SSRF endpoints |
Each model is given the vulnerable code or system description (stripped of CVE metadata) and asked to: (1) identify whether a vulnerability exists, (2) classify it by type, (3) describe the attack vector, and (4) suggest a patch. Scoring is on a 0–100 scale across all four tasks.
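As a rough illustration of how such a four-task rubric might be scored, here is a minimal Python sketch. The equal 25-point-per-task weighting and the keyword-overlap stand-in for judging free-text answers are assumptions on my part; the benchmark's actual rubric relies on human or model judges for the open-ended tasks.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    vuln_present: bool   # task 1: does a vulnerability exist?
    category: str        # task 2: classification (e.g. "injection")
    attack_vector: str   # task 3: free-text attack description
    patch: str           # task 4: suggested fix

def score_finding(model_answer: Finding, ground_truth: Finding) -> float:
    """Score one CVE sample on a 0-100 scale across the four tasks.

    Assumes 25 points per task. Substring overlap is a crude placeholder
    for the free-text tasks, which a real harness would grade with judges.
    """
    score = 0.0
    if model_answer.vuln_present == ground_truth.vuln_present:
        score += 25
    if model_answer.category == ground_truth.category:
        score += 25
    if ground_truth.attack_vector.lower() in model_answer.attack_vector.lower():
        score += 25
    if ground_truth.patch.lower() in model_answer.patch.lower():
        score += 25
    return score
```

A model's overall benchmark score would then be the mean of `score_finding` across all 500 samples.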
N-Day Bench Scores: April 2026
| Model | Overall Score | Best Category | Weakest Category |
|---|---|---|---|
| GPT-5.4 | 73% | Injection attacks (81%) | Crypto weaknesses (62%) |
| Claude 3.7 Sonnet | 71% | Memory corruption (78%), Logic flaws (78%) | Config errors (60%) |
| Gemini 3.1 Pro | 68% | Config errors (74%) | Memory corruption (58%) |
| Llama 4 70B (open-source) | 54% | Injection attacks (63%) | Crypto weaknesses (41%) |
| GPT-4o (2024 baseline) | 41% | Injection attacks (52%) | Memory corruption (29%) |
| Claude 3.5 Sonnet (2024 baseline) | 38% | Logic flaws (47%) | Crypto weaknesses (26%) |
The year-over-year jump is stark: frontier models roughly doubled their N-Day Bench scores from 2024 to 2026. GPT-5.4 at 73% is not a perfect security scanner, but it is significantly better than most junior analysts at triage-level vulnerability identification.
What the Scores Actually Mean
73% is good for triage, not good enough for production scanning
A 73% identification rate means GPT-5.4 misses roughly 1 in 4 vulnerabilities in controlled conditions. In production security scanning, that miss rate is too high for use as a standalone tool. Where these models shine is in triage and prioritization — taking a list of potential findings and rapidly classifying severity, explaining impact, and suggesting remediation order.
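A triage pass like this is easy to wire up once the model's findings are in structured form. The sketch below assumes hypothetical finding dicts with `severity` (assigned by the model) and `confidence` (a self-reported 0 to 1 score) fields; neither field name comes from the benchmark itself.

```python
# Rank order for the hypothetical model-assigned severity labels.
SEVERITY_RANK = {"critical": 4, "high": 3, "medium": 2, "low": 1}

def triage_order(findings: list[dict]) -> list[dict]:
    """Sort findings for remediation: highest severity first,
    breaking ties by the model's self-reported confidence."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_RANK[f["severity"]], f["confidence"]),
        reverse=True,
    )
```

The point of keeping this step deterministic is that the model only classifies; the ordering logic stays auditable and repeatable for the human analyst who reviews the queue.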
Claude leads on reasoning quality, GPT leads on raw detection
The benchmark scores detection accuracy, but researchers noted qualitative differences in model behavior. Claude 3.7 Sonnet consistently produced more detailed, nuanced explanations of why a vulnerability was dangerous — particularly for memory corruption and logic flaws. GPT-5.4 scored slightly higher on raw detection but produced shorter, less actionable remediation guidance in the judges' evaluation.
Crypto weaknesses remain the hardest category for all models
Cryptographic vulnerabilities require deep domain knowledge — understanding subtle misuse of elliptic curve parameters, recognizing insecure randomness in specific contexts, or identifying timing attack surfaces. All models scored worst in this category, suggesting that cryptographic security review still requires human expertise or specialized tools.
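The "weak RNG" entry in the dataset table is a good illustration of how subtle these flaws are: the vulnerable and patched functions below differ by a single module choice, yet only one is safe. This is a generic Python example, not code drawn from the benchmark dataset.

```python
import random
import secrets

def weak_token() -> str:
    # VULNERABLE: random uses the Mersenne Twister PRNG, whose output
    # becomes fully predictable once an attacker observes enough values.
    return "%032x" % random.getrandbits(128)

def safe_token() -> str:
    # Fixed: secrets draws from the operating system's CSPRNG.
    return secrets.token_hex(16)
```

Both functions return a 32-character hex string, so nothing in the output (or in a unit test) distinguishes them; only knowledge of the RNG's properties does, which is exactly the kind of context-dependent judgment models score worst on.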
The Ethical Debate: Does This Make AI Dangerous?
The benchmark publication triggered significant debate in the security community. The core argument against publishing:
- Demonstrating LLM capability at finding vulnerabilities could accelerate development of AI-assisted attack tools
- Even N-day vulnerabilities can affect millions of unpatched systems
- Giving attackers a benchmark-verified toolchain lowers the skill floor for launching attacks
The counterarguments — which the researchers found more persuasive:
- All 500 CVEs in the dataset have public NVD entries, full proof-of-concept exploits on Exploit-DB, and often Metasploit modules. The information is already public.
- Defenders benefit more than attackers from better LLM security tools — security teams are chronically understaffed and AI triage saves hours per incident
- Without benchmarks, AI labs have no accountability for their models' security capabilities — or blind spots
How Security Teams Are Using This Today
| Use Case | Best Model | Accuracy / Value |
|---|---|---|
| Code review for injection vulnerabilities | GPT-5.4 or Claude 3.7 | High (81% on injection) |
| Explaining CVEs to non-technical stakeholders | Claude 3.7 (clearer language) | Excellent |
| Patch triage and prioritization | GPT-5.4 | Good (saves 2–3 hours per sprint) |
| Cryptographic implementation review | Human expert required | LLMs miss 30–38% of issues |
| Security training and CTF prep | Any frontier model | Excellent for explanation and hints |
| Automated configuration auditing | Gemini 3.1 Pro (best config score) | Good (74% on config errors) |
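For the configuration-auditing row, a cheap deterministic pass can pre-filter files before any model sees them, cutting token costs and catching the obvious cases. A minimal sketch with hypothetical (and deliberately tiny) secret patterns; real scanners such as gitleaks or truffleHog ship far larger rule sets.

```python
import re

# Hypothetical example patterns, not an exhaustive rule set.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_secret": re.compile(r"(?i)(password|secret|api_key)\s*[:=]\s*['\"]?\S+"),
}

def audit_config(text: str) -> list[tuple[str, int]]:
    """Return (rule_name, line_number) pairs for suspected exposed secrets."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, lineno))
    return hits
```

Flagged lines can then go to a model for the harder judgment calls, such as whether a matched value is a real credential or a placeholder.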
Frequently Asked Questions
What is N-Day Bench?
N-Day Bench is a security research benchmark that tests large language models on their ability to identify, explain, and reason about known (N-day) security vulnerabilities. Unlike zero-day vulnerabilities, N-day vulnerabilities have public CVE entries with known details. The benchmark presents models with vulnerable code or system descriptions and measures how accurately they identify the vulnerability, explain the attack vector, and suggest patches.
What is the difference between N-day and zero-day vulnerabilities?
A zero-day vulnerability is unknown to the software vendor — there are zero days of warning before it can be exploited. An N-day vulnerability has been publicly disclosed (given a CVE number) and the vendor has had N days to respond. N-day vulnerabilities are still dangerous because many organizations are slow to patch, but they are a safer benchmark target because the details are already public.
Which LLM performs best on N-Day Bench?
As of April 2026, GPT-5.4 leads N-Day Bench overall with a 73% score on the 500-CVE dataset. Claude 3.7 Sonnet is close behind at 71%, with particular strength in memory-corruption and logic-flaw categories. Gemini 3.1 Pro scores 68%. All three significantly outperform older models — GPT-4o scored only 41% on the same benchmark in 2024.
Is it dangerous for LLMs to be good at finding security vulnerabilities?
This is a genuinely contested question in the security community. N-day vulnerabilities are already public — any competent attacker can look them up on NVD or Exploit-DB. LLM capability on N-Day Bench primarily benefits defenders: it enables automated code review, faster patch triage, and security training. The more concerning capability is zero-day discovery, which is separately tracked and not part of N-Day Bench.