HappycapyGuide

By Connie · Last reviewed: April 2026 — pricing & tools verified · This article contains affiliate links. We may earn a commission at no extra cost to you if you sign up through our links.

AI Security

N-Day Bench: How Researchers Are Testing LLMs to Find Security Vulnerabilities in 2026

April 14, 2026 · 10 min read

TL;DR
  • N-Day Bench is the first standardized benchmark of how well LLMs identify real, known (CVE-documented) security vulnerabilities.
  • GPT-5.4 leads at 73% identification accuracy; Claude 3.7 follows at 71%; Gemini 3.1 Pro at 68% — all vastly better than 2024 models.
  • N-day performance benefits defenders more than attackers — the vulnerabilities are already public on NVD and Exploit-DB.
  • Managed AI platforms like Happycapy let security teams run these capabilities without managing API access or model deployment.

For years, researchers have warned that frontier AI models could become capable offensive cybersecurity tools. In 2026, the security community stopped speculating and started measuring. N-Day Bench is the result: a rigorous, standardized benchmark that tests whether LLMs can actually find security vulnerabilities in real code and systems.

The results are striking — and more nuanced than the "AI will hack everything" headlines suggest. Here is what the benchmark measures, what the scores mean, and why this matters for security teams using AI today.

What Is N-Day Bench?

N-Day Bench was created by a consortium of academic security researchers from Carnegie Mellon, ETH Zürich, and the National University of Singapore, with support from DARPA's AI Cyber Challenge program. It was published in early 2026 and has already been adopted as a standard evaluation by all three major AI labs.

The benchmark uses a dataset of 500 real CVEs (Common Vulnerabilities and Exposures) spanning the past five years, drawn from five categories:

| Category | CVEs in Dataset | Examples |
| --- | --- | --- |
| Memory corruption | 120 | Buffer overflows, use-after-free, heap spray |
| Logic flaws | 110 | Authentication bypass, business logic errors, race conditions |
| Injection attacks | 100 | SQL injection, XSS, command injection, SSTI |
| Cryptographic weaknesses | 90 | Weak RNG, key exposure, padding oracle, timing attacks |
| Configuration errors | 80 | Misconfigured permissions, exposed secrets, open SSRF endpoints |

Each model is given the vulnerable code or system description (stripped of CVE metadata) and asked to: (1) identify whether a vulnerability exists, (2) classify it by type, (3) describe the attack vector, and (4) suggest a patch. Scoring is on a 0–100 scale across all four tasks.
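To make the four tasks concrete, here is a hypothetical sample in the style of the injection category (an illustration, not an item from the actual dataset), with the expected model outputs sketched as comments:

```python
import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # Task 1 (identify): yes, a vulnerability exists in this function.
    # Task 2 (classify): SQL injection -- user input is concatenated
    #   directly into the query string.
    # Task 3 (attack vector): username = "' OR '1'='1" turns the WHERE
    #   clause into a tautology and returns every row.
    query = "SELECT id, username FROM users WHERE username = '" + username + "'"
    return conn.execute(query).fetchall()

def find_user_patched(conn: sqlite3.Connection, username: str):
    # Task 4 (patch): use a parameterized query so the input is bound as
    # data and never interpreted as SQL.
    return conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    ).fetchall()
```

A benchmark judge would score each of the four outputs against the CVE's documented details; here the function names and schema are invented for illustration.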

N-Day Bench Scores: April 2026

| Model | Overall Score | Best Category | Weakest Category |
| --- | --- | --- | --- |
| GPT-5.4 | 73% | Injection attacks (81%) | Crypto weaknesses (62%) |
| Claude 3.7 Sonnet | 71% | Memory corruption (78%), Logic flaws (78%) | Config errors (60%) |
| Gemini 3.1 Pro | 68% | Config errors (74%) | Memory corruption (58%) |
| Llama 4 70B (open-source) | 54% | Injection attacks (63%) | Crypto weaknesses (41%) |
| GPT-4o (2024 baseline) | 41% | Injection attacks (52%) | Memory corruption (29%) |
| Claude 3.5 Sonnet (2024 baseline) | 38% | Logic flaws (47%) | Crypto weaknesses (26%) |

The year-over-year jump is stark: frontier models roughly doubled their N-Day Bench scores from 2024 to 2026. GPT-5.4 at 73% is not a perfect security scanner, but it is significantly better than most junior analysts at triage-level vulnerability identification.

What the Scores Actually Mean

73% is good for triage, not good enough for production scanning

A 73% identification rate means GPT-5.4 misses roughly 1 in 4 vulnerabilities in controlled conditions. In production security scanning, that miss rate is too high for use as a standalone tool. Where these models shine is in triage and prioritization — taking a list of potential findings and rapidly classifying severity, explaining impact, and suggesting remediation order.
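A triage workflow of this kind can be sketched as a ranking step over scanner output, where a model (or analyst) supplies the severity and the categories are weighted by how reliable automated review is for each. The weights and finding format below are illustrative assumptions, not values from N-Day Bench:

```python
# Hypothetical triage helper: rank scanner findings so the highest-impact,
# most automatable items are reviewed first. Weights are illustrative.
CATEGORY_WEIGHT = {
    "injection": 5,         # frontier models are strongest here
    "memory_corruption": 4,
    "logic_flaw": 4,
    "config_error": 3,
    "crypto_weakness": 2,   # weakest category; escalate to a human expert
}

def triage(findings: list[dict]) -> list[dict]:
    """Sort findings by severity weighted per category, highest first."""
    def score(finding: dict) -> float:
        return finding["severity"] * CATEGORY_WEIGHT.get(finding["category"], 1)
    return sorted(findings, key=score, reverse=True)
```

In practice the severity field would come from the model's classification pass; the point is that the LLM handles the judgment-heavy step while the ordering itself stays deterministic and auditable.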

Claude leads on reasoning quality, GPT leads on raw detection

The benchmark scores detection accuracy, but researchers noted qualitative differences in model behavior. Claude 3.7 Sonnet consistently produced more detailed, nuanced explanations of why a vulnerability was dangerous — particularly for memory corruption and logic flaws. GPT-5.4 scored slightly higher on raw detection but produced shorter, less actionable remediation guidance in the judges' evaluation.

Crypto weaknesses remain the hardest category for all models

Cryptographic vulnerabilities require deep domain knowledge — understanding subtle misuse of elliptic curve parameters, recognizing insecure randomness in specific contexts, or identifying timing attack surfaces. All models scored worst in this category, suggesting that cryptographic security review still requires human expertise or specialized tools.
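The kind of subtle misuse this category tests can be small. A hypothetical example (not from the dataset) of insecure randomness, the "weak RNG" pattern from the dataset table:

```python
import random
import secrets

def session_token_weak(n: int = 16) -> str:
    # Vulnerable: random.getrandbits() uses the Mersenne Twister, whose
    # internal state can be reconstructed from observed outputs, making
    # future tokens predictable (the classic weak-PRNG mistake).
    return hex(random.getrandbits(n * 8))[2:]

def session_token_safe(n: int = 16) -> str:
    # Patched: the secrets module draws from the OS CSPRNG and is
    # designed for security-sensitive tokens.
    return secrets.token_hex(n)
```

Nothing about the weak version looks broken at a glance, which is exactly why models (and humans) miss these more often than injection bugs.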

The Ethical Debate: Does This Make AI Dangerous?

The benchmark publication triggered significant debate in the security community. The core argument against publishing: openly measuring and ranking vulnerability-finding capability could push labs to optimize for it and lower the barrier for less-skilled attackers.

The counterarguments, which the researchers found more persuasive:
  • N-day vulnerabilities are already public: any competent attacker can look them up on NVD or Exploit-DB, so the benchmark reveals nothing attackers lack.
  • The capability primarily benefits defenders, who can apply it to automated code review, patch triage, and security training.
  • Measuring the capability openly is safer than leaving it untracked; the genuinely concerning capability, zero-day discovery, is evaluated separately and is not part of N-Day Bench.

How Security Teams Are Using This Today

| Use Case | Best Model | Accuracy / Value |
| --- | --- | --- |
| Code review for injection vulnerabilities | GPT-5.4 or Claude 3.7 | High (81% on injection) |
| Explaining CVEs to non-technical stakeholders | Claude 3.7 (clearer language) | Excellent |
| Patch triage and prioritization | GPT-5.4 | Good (saves 2–3 hours per sprint) |
| Cryptographic implementation review | Human expert required | LLMs miss 30–38% of issues |
| Security training and CTF prep | Any frontier model | Excellent for explanation and hints |
| Automated configuration auditing | Gemini 3.1 Pro (best config score) | Good (74% on config errors) |

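For configuration auditing, teams typically pre-filter files with cheap deterministic checks before spending model calls on deeper review. A minimal sketch, where the patterns and function names are illustrative assumptions rather than a production secret scanner:

```python
import re

# Hypothetical pre-filter: flag lines that look like hard-coded secrets
# before handing the file to a model for contextual review.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|secret|password|token)\s*[:=]\s*['\"]?[^\s'\"]{8,}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID format
]

def flag_secrets(config_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs matching a secret pattern."""
    hits = []
    for i, line in enumerate(config_text.splitlines(), start=1):
        if any(p.search(line) for p in SECRET_PATTERNS):
            hits.append((i, line.strip()))
    return hits
```

The regexes catch only the obvious cases; the value of a model pass is judging context, such as whether a matched value is a real credential or a test fixture.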
Use Frontier AI for Security Analysis
Happycapy gives you access to GPT-5.4 and Claude Opus 4.6 through a secure managed platform. Run vulnerability analysis, code review, and threat modeling without managing API keys. Pro at $17/month.
Try Happycapy Free

Frequently Asked Questions

What is N-Day Bench?

N-Day Bench is a security research benchmark that tests large language models on their ability to identify, explain, and reason about known (N-day) security vulnerabilities. Unlike zero-day vulnerabilities, N-day vulnerabilities have public CVE entries with known details. The benchmark presents models with vulnerable code or system descriptions and measures how accurately they identify the vulnerability, explain the attack vector, and suggest patches.

What is the difference between N-day and zero-day vulnerabilities?

A zero-day vulnerability is unknown to the software vendor — there are zero days of warning before it can be exploited. An N-day vulnerability has been publicly disclosed (given a CVE number) and the vendor has had N days to respond. N-day vulnerabilities are still dangerous because many organizations are slow to patch, but they are a safer benchmark target because the details are already public.

Which LLM performs best on N-Day Bench?

As of April 2026, GPT-5.4 leads N-Day Bench overall with a 73% identification accuracy on the top-500 CVEs dataset. Claude 3.7 Sonnet is close behind at 71%, with particular strength in memory-corruption and logic-flaw categories. Gemini 3.1 Pro scores 68%. All three significantly outperform older models — GPT-4o scored only 41% on the same benchmark in 2024.

Is it dangerous for LLMs to be good at finding security vulnerabilities?

This is a genuinely contested question in the security community. N-day vulnerabilities are already public — any competent attacker can look them up on NVD or Exploit-DB. LLM capability on N-Day Bench primarily benefits defenders: it enables automated code review, faster patch triage, and security training. The more concerning capability is zero-day discovery, which is separately tracked and not part of N-Day Bench.

