Jensen Huang Says AGI Is Here. A New Benchmark Says It Scored 0%.
Nvidia's CEO told the world AGI exists. Days later, a benchmark tested every frontier model on novel tasks — and the best AI scored 0.37% versus humans at 100%.
On March 23, Nvidia CEO Jensen Huang declared on Lex Fridman's podcast that artificial general intelligence has been achieved. On March 26 — three days later — the ARC Prize Foundation released ARC-AGI-3, a benchmark testing true generalization on 135 novel environments. Humans scored 100%. Gemini 3.1 Pro scored 0.37%. GPT-5.4 scored 0.26%. Grok-4.20 scored exactly 0%. The entire dispute hinges on what "AGI" actually means.
What Jensen Huang Actually Said
On March 23, 2026, Nvidia CEO Jensen Huang appeared on Lex Fridman's podcast and declared that artificial general intelligence has already been achieved.
Huang's definition of AGI is functional and business-oriented: an AI that can perform complex multi-step workflows, write production-grade code, and potentially run a tech company to a $1 billion valuation — without constant human oversight at every step. Under that definition, Huang argues, systems like Claude Code, GPT-5.4 with tools, and Grok-4.20 with multi-agent orchestration already qualify.
The claim went viral immediately. CNBC, Forbes, Yahoo Finance, and Fortune all covered the statement within hours. Many in the AI research community were less enthusiastic.
Then ARC-AGI-3 Dropped
Three days after Huang's declaration, François Chollet — the creator of the original ARC-AGI benchmark and co-founder of the ARC Prize Foundation — published ARC-AGI-3. The timing was not a coincidence.
ARC-AGI-3 presents AI systems with 135 novel interactive environments. These are not problems the AI has seen during training, and no instructions are provided. The model must explore, reason, and solve — the way a human encountering a genuinely new situation does. The scoring metric, called Relative Human Action Efficiency (RHAE), penalizes inefficiency as well as failure: an AI that takes ten times as many actions as a human earns only 1% for that environment.
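The article's "ten times as many actions earns 1%" example implies a superlinear efficiency penalty. As a rough illustration only — assuming RHAE squares the human-to-AI action ratio, which is an inference from that single example and not the foundation's published formula — a per-environment score could be sketched like this:

```python
def rhae_score(human_actions: int, ai_actions: int) -> float:
    """Per-environment score as a fraction of human efficiency.

    Assumed form: (human_actions / ai_actions) ** 2, capped at 1.0.
    The exponent of 2 is inferred from the article's example
    (10x as many actions -> 1%); the real RHAE formula may differ.
    """
    if ai_actions <= 0:  # environment never solved
        return 0.0
    return min(1.0, (human_actions / ai_actions) ** 2)

def benchmark_rhae(per_env) -> float:
    """Average per-environment scores into a benchmark percentage.

    per_env: list of (human_actions, ai_actions); ai_actions is None
    when the model never solved that environment.
    """
    scores = [rhae_score(h, a) if a is not None else 0.0
              for h, a in per_env]
    return 100 * sum(scores) / len(scores)

# A model 10x less efficient than a human on one environment:
ten_x = rhae_score(5, 50)   # roughly 0.01, i.e. 1% for that environment
```

Under this assumed formula, a model that solves half the environments but needs ten times as many actions on each would still land near 0.5% overall, which is the regime the frontier-model scores in the table below occupy.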
To prevent models from gaming the benchmark by training on its data, 110 of the 135 environments are kept private. Only 25 are publicly accessible.
ARC-AGI-3 Scores: Every Frontier Model vs. Humans
| System | Score (RHAE) | Gap to Human | Notes |
|---|---|---|---|
| Humans | 100% | — | Baseline: generalize to novel situations naturally |
| Google Gemini 3.1 Pro | 0.37% | -99.63% | Highest scoring AI model tested |
| OpenAI GPT-5.4 | 0.26% | -99.74% | Huang's implicit benchmark for "AGI" |
| Anthropic Claude Opus 4.6 | 0.25% | -99.75% | Anthropic's current flagship model |
| xAI Grok-4.20 | 0% | -100% | Zero on every novel environment tested |
Grok-4.20's zero score is particularly striking because the model performs well on standard benchmarks that test memorized knowledge. ARC-AGI-3 specifically strips away all prior training advantages. The zero score indicates Grok-4.20 was unable to generalize at all to environments it had never encountered — it could not even begin to explore a novel problem space in a way that matched human action efficiency.
Two Definitions of AGI — and Why They Cannot Both Be Right
The fundamental issue is that "AGI" means entirely different things to Huang and Chollet, and the gap between their definitions reveals the biggest fault line in the AI industry.
Huang's definition is pragmatic and market-oriented: AGI means an AI that can execute complex workflows autonomously and create commercial value at scale. Under this framing, current AI already qualifies — Claude Code can write production software, GPT-5.4 can synthesize research, and multi-agent systems are running real business processes. For a CEO of a company that sells AI infrastructure, this definition conveniently validates the product category he is selling.
Chollet's definition is technical and cognitive: AGI is a system that can generalize to genuinely novel situations without prior training, the way any human naturally can. The key word is "generalize." Current AI systems are extraordinarily capable at tasks within their training distribution but fail almost completely when confronted with problems that require true novelty — a capability humans have and machines demonstrably do not.
The Corporate Incentive Problem
It is worth noting that Jensen Huang has a direct financial interest in AGI being declared achieved. Nvidia's stock price, revenue projections, and the entire justification for hundreds of billions of dollars in AI infrastructure spending rest on the premise that AI is rapidly approaching human-level capability. A CEO of the world's most valuable semiconductor company declaring AGI arrived validates every data center his company has ever sold.
Yahoo Finance and Fortune both noted this conflict of interest in their coverage of the debate. The Forbes coverage was more neutral — but all three publications ran the ARC-AGI-3 refutation story within 24 hours of Huang's original claim gaining traction.
AGI Definitions: The Spectrum of Views
| Who | AGI Definition | AGI Status (Mar 2026) |
|---|---|---|
| Jensen Huang (Nvidia) | AI that can run a business, execute complex workflows | ACHIEVED |
| François Chollet (ARC Prize) | Generalize to novel situations without prior training | NOT ACHIEVED (0.37% best) |
| Demis Hassabis (Google DeepMind) | AI performing at human expert level across all cognitive tasks | Approaching in narrow domains only |
| Dario Amodei (Anthropic) | Vast knowledge, reason and act across complex scientific domains | Within reach (2026–2027 est.) |
| Yann LeCun (Meta) | Human-level common sense and physical world understanding | Far from achieved (missing world model) |
What This Means for AI Users Right Now
For the people actually using AI — not debating its philosophical definitions — the practical answer is clear: today's best models are remarkably capable within their training distribution and genuinely poor at tasks requiring true novelty. You can use Claude, GPT-5.4, and Gemini to write code, summarize documents, analyze data, draft content, and reason through complex problems. These are real, useful capabilities.
What current AI cannot reliably do is encounter a brand-new problem type it has never seen, devise a strategy for exploring it from scratch, and solve it efficiently — the way any human can on any given Tuesday. That 99.63-point gap between Gemini's best score and human performance on ARC-AGI-3 is the honest picture of where AI capability stands.
The $2 million ARC Prize is open. No AI system has come close. The prize money is safe for now.