Humans: 100%. GPT-5.4: 0.26%. The ARC-AGI-3 Benchmark Just Revealed the Real Gap Between AI and Human Intelligence
March 29, 2026 · 7 min read · AI Benchmarks · AGI Research
On March 25, 2026, François Chollet and Sam Altman co-launched ARC-AGI-3 at Y Combinator: a benchmark where untrained humans score 100% and every frontier AI scores below 1%. GPT-5.4: 0.26%. Gemini 3.1: 0.37%. Claude Opus: 0.25%. The best AI system (a specialist RL agent) scored 12.58%. The $2 million prize pool for reaching 100% remains unclaimed. Here's what the gap means, and what it tells you about how to use AI tools today.
What ARC-AGI-3 Actually Tests
Previous AI benchmarks tested what AI knows — math problems, coding challenges, reading comprehension, medical licensing exams. By 2025, frontier models had saturated most of these tests, scoring at or above human expert level. The AI capability narrative had become "AI is smarter than humans."
ARC-AGI-3 tests something different: what AI can figure out in a genuinely novel situation it has never seen before.
Agents are dropped into turn-based game environments with no instructions, no stated objectives, and no hints. They must:
- Explore the environment to understand its rules and physics
- Infer the goal from context rather than being told it
- Build a world model that updates as new evidence appears
- Plan action sequences across dozens or hundreds of steps
- Do this efficiently — the scoring penalizes brute-force approaches quadratically
Untrained humans do all of this naturally. Current AI does not, at all. The sketch below shows roughly what this loop demands.
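To make the list concrete, here is a minimal sketch of the loop such an agent must run. Everything in it is hedged: the environment interface, the `WorldModel` class, and the planning heuristic are illustrative assumptions, not the benchmark's actual API or any competitor's method.

```python
import random

class WorldModel:
    """Tracks hypotheses about the environment's rules and goal (illustrative)."""

    def __init__(self):
        self.transitions = {}      # (state, action) -> observed next state
        self.goal_hypothesis = None

    def update(self, state, action, next_state):
        # Record evidence; a contradiction with a stored transition would
        # force a revision of the model.
        self.transitions[(state, action)] = next_state

    def infer_goal(self):
        # Placeholder: a real agent needs rich hypothesis formation here.
        pass

def plan_next_action(model, state, actions):
    # Efficiency matters: the scoring punishes brute force, so a real
    # agent must plan toward its inferred goal, not wander randomly.
    unseen = [a for a in actions if (state, a) not in model.transitions]
    return unseen[0] if unseen else random.choice(actions)

def run_episode(env, actions, max_steps=200):
    """One level: explore, infer the goal, plan, repeat. `env` is assumed
    to expose reset() and step(action) -> (next_state, done)."""
    model = WorldModel()
    state = env.reset()            # no instructions, no stated objective
    for _ in range(max_steps):
        model.infer_goal()         # re-infer the objective from evidence
        action = plan_next_action(model, state, actions)
        next_state, done = env.step(action)
        model.update(state, action, next_state)
        if done:
            return True            # level solved
        state = next_state
    return False
```

Humans run this whole cycle implicitly within a few moves; the scores below show how far current systems are from doing the same.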
The Scoreboard: Every AI vs. Every Human

| System | Type | ARC-AGI-3 Score |
|---|---|---|
| Untrained humans | n/a | 100% |
| StochasticGoose | Specialist RL agent | 12.58% |
| Gemini 3.1 Pro | Frontier LLM | 0.37% |
| GPT-5.4 | Frontier LLM | 0.26% |
| Claude Opus 4.6 | Frontier LLM | 0.25% |

The only system that showed meaningful progress was StochasticGoose, a specialized reinforcement learning agent built specifically for game environments, not an LLM at all. Its 12.58% score came from actual planning algorithms, not language prediction.
"ARC-AGI-3 makes that gap measurable by testing intelligence across time, not just final answers — capturing planning horizons, memory compression, and the ability to update beliefs as new evidence appears."
— ARC Prize Foundation, launch announcement, March 25, 2026
Why Every Frontier LLM Scores Below 1%
The sub-1% scores aren't a fluke or a calibration artifact. They reflect a structural property of how current large language models are built:
LLMs like GPT-5.4, Gemini, and Claude are trained to predict the next token in a sequence based on patterns in training data, and they are extraordinarily good at it. ARC-AGI-3 requires something different: building an internal model of a world you've never seen, updating it in real time, and planning multi-step sequences toward a goal you must infer. That is genuine reasoning over novel structured environments, not pattern-matching over training data, and current transformer architectures have fundamental difficulty doing it without task-specific engineering, which the benchmark explicitly prohibits.
The RHAE scoring metric compounds this problem. If an AI takes 10 times as many actions as a human to solve a level, it scores 1% — not 10%. Brute-force exploration is mathematically penalized. The only path to a high score is efficient, goal-directed planning from the start.
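The exact RHAE formula isn't reproduced here, but the one data point above (10x the human's actions yields 1%, not 10%) is consistent with squaring an efficiency ratio. A minimal sketch under that assumption; the function name and formula are reconstructions, not the published metric:

```python
import math

def rhae_style_score(human_actions: int, agent_actions: int) -> float:
    """Assumed quadratic efficiency score: (human_actions / agent_actions)^2.

    Reconstructed from the single example above, not the published formula.
    """
    if human_actions <= 0 or agent_actions <= 0:
        raise ValueError("action counts must be positive")
    ratio = min(human_actions / agent_actions, 1.0)  # cap at a perfect 100%
    return ratio ** 2

# Ten times as many actions as a human -> 1%, not 10%.
assert math.isclose(rhae_style_score(human_actions=20, agent_actions=200), 0.01)
print(f"{rhae_style_score(20, 200):.0%}")  # 1%
```

Under any scoring shaped like this, doubling your action count quarters your score, which is why random exploration cannot climb the leaderboard.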
AI Has Real Strengths — Match the Right Model to Each Task
GPT-5.4 scores 0.26% on novel planning tasks but crushes writing, coding, and research. The skill is knowing which model to use when. Happycapy gives you 50+ models — one platform, the right tool every time. Pro at $17/mo.
Try Happycapy Free

What This Means for People Using AI Tools Today
The ARC-AGI-3 results don't mean current AI is useless; they mean it's useful in a specific way that's different from human general intelligence. Understanding the distinction helps you get dramatically more value from the AI tools you already pay for. A toy routing sketch follows the two lists below:
AI Excels At
- Writing, editing, summarizing known content
- Code generation from specifications
- Pattern recognition in existing data
- Answering questions from training knowledge
- Translating between known domains
AI Struggles With
- Planning in genuinely novel environments
- Inferring goals without explicit instructions
- Multi-step sequential reasoning over time
- Building and updating world models
- Efficient exploration under uncertainty
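Happycapy's real routing logic isn't public, and the mapping below is purely illustrative, but the excels/struggles split above is exactly what a task router encodes: send each category to a model that handles it well, and refuse the categories nothing handles yet.

```python
# Toy task router built from the two lists above. The category-to-model
# assignments are illustrative assumptions, not Happycapy's actual table.

ROUTES = {
    "writing": "gpt-5.4",                 # writing, editing, summarizing
    "coding": "gpt-5.4",                  # code generation from specifications
    "long_documents": "claude-opus-4.6",  # analysis over long documents
    "vision": "gemini-3.1-pro",           # multimodal / vision tasks
}

# ARC-AGI-3-style tasks: no current model scores above 1%, so an honest
# router refuses these rather than guessing.
UNSUPPORTED = {
    "novel_environment_planning",
    "goal_inference_without_instructions",
}

def route(task_type: str) -> str:
    if task_type in UNSUPPORTED:
        raise ValueError(f"no current model handles {task_type!r} well")
    return ROUTES.get(task_type, "gpt-5.4")  # assumed fallback default

print(route("long_documents"))  # claude-opus-4.6
```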
ARC-AGI-3 Scores vs. Practical AI Usefulness
| Platform | ARC-AGI-3 Score | Strengths | Access to Multiple Models | Best For |
|---|---|---|---|---|
| Gemini 3.1 Pro | 0.37% (frontier #1) | Multimodal, Google Workspace | No (Google only) | Docs, search, vision tasks |
| GPT-5.4 (OpenAI) | 0.26% | Writing, code, analysis | No (OpenAI only) | Text, coding, research |
| Claude Opus 4.6 | 0.25% | Nuanced reasoning, safety | No (Anthropic only) | Analysis, long documents |
| StochasticGoose (RL) | 12.58% (best AI) | Specialized game planning | No (task-specific only) | Structured environment tasks |
| Happycapy (all-in-one) | n/a (routes to best model) | 50+ models, task routing | Yes (50+ models) | Every task, right model every time |
The $2 Million Prize Nobody Has Won — And What It Would Mean
The ARC Prize 2026 offers a $700,000 grand prize for the first agent to achieve 100% human-level performance on the evaluation set — with milestone prizes along the way and a total pool of $2 million. All winning solutions must be open-sourced.
The competition runs through December 2026, with milestone checkpoints in June and September. François Chollet and the ARC Prize Foundation structured the prize this way deliberately: an open-source AGI benchmark breakthrough would benefit the entire field, not just one company.
If any agent achieves 100%, it would represent the first verified demonstration of general agentic intelligence — not just better text prediction, but actual goal-directed reasoning in novel environments. That would be a genuine scientific milestone, not just a product launch.
Frequently Asked Questions
What is ARC-AGI-3?
ARC-AGI-3 is a benchmark launched March 25, 2026 by the ARC Prize Foundation, designed by François Chollet. It tests AI agents in novel, turn-based game environments with no instructions or stated goals. Agents must explore, infer objectives, and plan actions efficiently. Humans score 100%. All frontier LLMs score below 1%.
How did GPT-5.4, Gemini 3.1, and Claude Opus score on ARC-AGI-3?
Gemini 3.1 Pro scored 0.37%, GPT-5.4 scored 0.26%, and Claude Opus 4.6 scored 0.25%. The best AI system overall was StochasticGoose — a specialized reinforcement learning agent (not an LLM) — which scored 12.58%. Untrained humans solved 100% of environments.
What does ARC-AGI-3 measure that standard AI benchmarks don't?
ARC-AGI-3 measures agentic intelligence — the ability to explore novel environments, infer goals from context, form and update hypotheses, and plan sequences of actions across time. Standard benchmarks test knowledge retrieval and pattern matching, which current LLMs have largely saturated. ARC-AGI-3 requires genuine planning and world-model building, not memorization.
Does ARC-AGI-3 mean current AI tools are useless?
No. ARC-AGI-3 tests a specific type of intelligence that current transformer-based LLMs aren't designed for. ChatGPT, Claude, and Gemini excel at writing, coding, research, and analysis. The benchmark reveals where AI has genuine limits, helping users route tasks more effectively. Platforms like Happycapy give you 50+ models to match each task to the best available model.
50+ AI Models. Match Every Task to the Right Intelligence.
No single AI model is best at everything — ARC-AGI-3 proves that definitively. Happycapy gives you access to GPT-5.4, Claude, Gemini, and 50+ other models under one $17/mo subscription so you can route every task to the model that handles it best.
Try Happycapy Free

Sources

- ARC Prize Foundation — "Announcing ARC-AGI-3" official launch (March 25, 2026)
- Fast Company — "Exclusive: This new benchmark could expose AI's biggest weakness" (March 25, 2026)
- Awesome Agents — "ARC-AGI-3 Launches — AI Agents Must Learn, Not Memorize" (March 25–26, 2026)
- OfficeChai — "ARC-AGI-3 Released, Gemini 3.1 Pro Top Scores With Just 0.37 Percent" (March 24–25, 2026)
- RevolutionInAI — "ARC-AGI-3 Launched: Best AI Scores 0.37% While Humans Score 100%" (March 27, 2026)
- ARC Prize Foundation — ARC-AGI-3 Technical Report (arxiv.org, March 2026)