Humans: 100%. GPT-5.4: 0.26%. The ARC-AGI-3 Benchmark Just Revealed the Real Gap Between AI and Human Intelligence
March 29, 2026 · 7 min read · AI Benchmarks · AGI Research
On March 25, 2026, François Chollet and Sam Altman co-launched ARC-AGI-3 at Y Combinator: a benchmark where untrained humans score 100% and every frontier AI scores below 1%. GPT-5.4: 0.26%. Gemini 3.1: 0.37%. Claude Opus: 0.25%. The best AI system (a specialist RL agent) scored 12.58%. The $2 million prize pool for reaching 100% remains unclaimed. Here's what the gap means, and what it tells you about how to use AI tools today.
What ARC-AGI-3 Actually Tests
Previous AI benchmarks tested what AI knows — math problems, coding challenges, reading comprehension, medical licensing exams. By 2025, frontier models had saturated most of these tests, scoring at or above human expert level. The AI capability narrative had become "AI is smarter than humans."
ARC-AGI-3 tests something different: what AI can figure out in a genuinely novel situation it has never seen before.
Agents are dropped into turn-based game environments with no instructions, no stated objectives, and no hints. They must:
- Explore the environment to understand its rules and physics
- Infer the goal from context rather than being told it
- Build a world model that updates as new evidence appears
- Plan action sequences across dozens or hundreds of steps
- Do this efficiently — the scoring penalizes brute-force approaches quadratically
Untrained humans do all of this naturally. Current AI does not, at all. The sketch below shows roughly what this loop demands.
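To make the list concrete, here is a minimal sketch of the loop such an agent must run. Everything in it is hedged: the environment interface, the `WorldModel` class, and the planning heuristic are illustrative assumptions, not the benchmark's actual API or any competitor's method.

```python
import random

class WorldModel:
    """Tracks hypotheses about the environment's rules and goal (illustrative)."""

    def __init__(self):
        self.transitions = {}      # (state, action) -> observed next state
        self.goal_hypothesis = None

    def update(self, state, action, next_state):
        # Record evidence; a contradiction with a stored transition would
        # force a revision of the model.
        self.transitions[(state, action)] = next_state

    def infer_goal(self):
        # Placeholder: a real agent needs rich hypothesis formation here.
        pass

def plan_next_action(model, state, actions):
    # Efficiency matters: the scoring punishes brute force, so a real
    # agent must plan toward its inferred goal, not wander randomly.
    unseen = [a for a in actions if (state, a) not in model.transitions]
    return unseen[0] if unseen else random.choice(actions)

def run_episode(env, actions, max_steps=200):
    """One level: explore, infer the goal, plan, repeat. `env` is assumed
    to expose reset() and step(action) -> (next_state, done)."""
    model = WorldModel()
    state = env.reset()            # no instructions, no stated objective
    for _ in range(max_steps):
        model.infer_goal()         # re-infer the objective from evidence
        action = plan_next_action(model, state, actions)
        next_state, done = env.step(action)
        model.update(state, action, next_state)
        if done:
            return True            # level solved
        state = next_state
    return False
```

Humans run this whole cycle implicitly within a few moves; the scores below show how far current systems are from doing the same.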
The Scoreboard: Every AI vs. Every Human

| System | Type | ARC-AGI-3 Score |
|---|---|---|
| Untrained humans | n/a | 100% |
| StochasticGoose | Specialist RL agent | 12.58% |
| Gemini 3.1 Pro | Frontier LLM | 0.37% |
| GPT-5.4 | Frontier LLM | 0.26% |
| Claude Opus 4.6 | Frontier LLM | 0.25% |

The only system that showed meaningful progress was StochasticGoose, a specialized reinforcement learning agent built specifically for game environments, not an LLM at all. Its 12.58% score came from actual planning algorithms, not language prediction.
"ARC-AGI-3 makes that gap measurable by testing intelligence across time, not just final answers — capturing planning horizons, memory compression, and the ability to update beliefs as new evidence appears."
— ARC Prize Foundation, launch announcement, March 25, 2026
Why Every Frontier LLM Scores Below 1%
The sub-1% scores aren't a fluke or a calibration artifact. They reflect a structural property of how current large language models are built:
LLMs like GPT-5.4, Gemini, and Claude are trained to predict the next token in a sequence based on patterns in training data, and they are extraordinarily good at it. ARC-AGI-3 requires something different: building an internal model of a world you've never seen, updating it in real time, and planning multi-step sequences toward a goal you must infer. That is genuine reasoning over novel structured environments, not pattern-matching over training data, and current transformer architectures have fundamental difficulty doing it without task-specific engineering, which the benchmark explicitly prohibits.
The RHAE scoring metric compounds this problem. If an AI takes 10 times as many actions as a human to solve a level, it scores 1% — not 10%. Brute-force exploration is mathematically penalized. The only path to a high score is efficient, goal-directed planning from the start.
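The exact RHAE formula isn't reproduced here, but the one data point above (10x the human's actions yields 1%, not 10%) is consistent with squaring an efficiency ratio. A minimal sketch under that assumption; the function name and formula are reconstructions, not the published metric:

```python
import math

def rhae_style_score(human_actions: int, agent_actions: int) -> float:
    """Assumed quadratic efficiency score: (human_actions / agent_actions)^2.

    Reconstructed from the single example above, not the published formula.
    """
    if human_actions <= 0 or agent_actions <= 0:
        raise ValueError("action counts must be positive")
    ratio = min(human_actions / agent_actions, 1.0)  # cap at a perfect 100%
    return ratio ** 2

# Ten times as many actions as a human -> 1%, not 10%.
assert math.isclose(rhae_style_score(human_actions=20, agent_actions=200), 0.01)
print(f"{rhae_style_score(20, 200):.0%}")  # 1%
```

Under any scoring shaped like this, doubling your action count quarters your score, which is why random exploration cannot climb the leaderboard.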
AI Has Real Strengths — Match the Right Model to Each Task
GPT-5.4 scores 0.26% on novel planning tasks but crushes writing, coding, and research. The skill is knowing which model to use when. Happycapy gives you 50+ models — one platform, the right tool every time. Pro at $17/mo.
Try Happycapy Free

What This Means for People Using AI Tools Today
The ARC-AGI-3 results don't mean current AI is useless; they mean it's useful in a specific way that's different from human general intelligence. Understanding the distinction helps you get dramatically more value from the AI tools you already pay for. A toy routing sketch follows the two lists below:
AI Excels At
- Writing, editing, summarizing known content
- Code generation from specifications
- Pattern recognition in existing data
- Answering questions from training knowledge
- Translating between known domains
AI Struggles With
- Planning in genuinely novel environments
- Inferring goals without explicit instructions
- Multi-step sequential reasoning over time
- Building and updating world models
- Efficient exploration under uncertainty
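Happycapy's real routing logic isn't public, and the mapping below is purely illustrative, but the excels/struggles split above is exactly what a task router encodes: send each category to a model that handles it well, and refuse the categories nothing handles yet.

```python
# Toy task router built from the two lists above. The category-to-model
# assignments are illustrative assumptions, not Happycapy's actual table.

ROUTES = {
    "writing": "gpt-5.4",                 # writing, editing, summarizing
    "coding": "gpt-5.4",                  # code generation from specifications
    "long_documents": "claude-opus-4.6",  # analysis over long documents
    "vision": "gemini-3.1-pro",           # multimodal / vision tasks
}

# ARC-AGI-3-style tasks: no current model scores above 1%, so an honest
# router refuses these rather than guessing.
UNSUPPORTED = {
    "novel_environment_planning",
    "goal_inference_without_instructions",
}

def route(task_type: str) -> str:
    if task_type in UNSUPPORTED:
        raise ValueError(f"no current model handles {task_type!r} well")
    return ROUTES.get(task_type, "gpt-5.4")  # assumed fallback default

print(route("long_documents"))  # claude-opus-4.6
```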
ARC-AGI-3 Scores vs. Practical AI Usefulness
| Platform | ARC-AGI-3 Score | Strengths | Access to Multiple Models | Best For |
|---|---|---|---|---|
| Gemini 3.1 Pro | 0.37% (frontier #1) | Multimodal, Google Workspace | No (Google only) | Docs, search, vision tasks |
| GPT-5.4 (OpenAI) | 0.26% | Writing, code, analysis | No (OpenAI only) | Text, coding, research |
| Claude Opus 4.6 | 0.25% | Nuanced reasoning, safety | No (Anthropic only) | Analysis, long documents |
| StochasticGoose (RL) | 12.58% (best AI) | Specialized game planning | No (task-specific only) | Structured environment tasks |
| Happycapy (all-in-one) | n/a (routes to best model) | 50+ models, task routing | Yes (50+ models) | Every task, right model every time |
The $2 Million Prize Nobody Has Won — And What It Would Mean
The ARC Prize 2026 offers a $700,000 grand prize for the first agent to achieve 100% human-level performance on the evaluation set — with milestone prizes along the way and a total pool of $2 million. All winning solutions must be open-sourced.
The competition runs through December 2026, with milestone checkpoints in June and September. François Chollet and the ARC Prize Foundation structured the prize this way deliberately: an open-source AGI benchmark breakthrough would benefit the entire field, not just one company.
If any agent achieves 100%, it would represent the first verified demonstration of general agentic intelligence — not just better text prediction, but actual goal-directed reasoning in novel environments. That would be a genuine scientific milestone, not just a product launch.
Frequently Asked Questions
What is ARC-AGI-3?
ARC-AGI-3 is a benchmark launched March 25, 2026 by the ARC Prize Foundation, designed by François Chollet. It tests AI agents in novel, turn-based game environments with no instructions or stated goals. Agents must explore, infer objectives, and plan actions efficiently. Humans score 100%. All frontier LLMs score below 1%.
How did GPT-5.4, Gemini 3.1, and Claude Opus score on ARC-AGI-3?
Gemini 3.1 Pro scored 0.37%, GPT-5.4 scored 0.26%, and Claude Opus 4.6 scored 0.25%. The best AI system overall was StochasticGoose — a specialized reinforcement learning agent (not an LLM) — which scored 12.58%. Untrained humans solved 100% of environments.
What does ARC-AGI-3 measure that standard AI benchmarks don't?
ARC-AGI-3 measures agentic intelligence — the ability to explore novel environments, infer goals from context, form and update hypotheses, and plan sequences of actions across time. Standard benchmarks test knowledge retrieval and pattern matching, which current LLMs have largely saturated. ARC-AGI-3 requires genuine planning and world-model building, not memorization.
Does ARC-AGI-3 mean current AI tools are useless?
No. ARC-AGI-3 tests a specific type of intelligence that current transformer-based LLMs aren't designed for. ChatGPT, Claude, and Gemini excel at writing, coding, research, and analysis. The benchmark reveals where AI has genuine limits, helping users route tasks more effectively. Platforms like Happycapy give you 50+ models to match each task to the best available model.
50+ AI Models. Match Every Task to the Right Intelligence.
No single AI model is best at everything — ARC-AGI-3 proves that definitively. Happycapy gives you access to GPT-5.4, Claude, Gemini, and 50+ other models under one $17/mo subscription so you can route every task to the model that handles it best.
Try Happycapy Free

Sources

- ARC Prize Foundation — "Announcing ARC-AGI-3" official launch (March 25, 2026)
- Fast Company — "Exclusive: This new benchmark could expose AI's biggest weakness" (March 25, 2026)
- Awesome Agents — "ARC-AGI-3 Launches — AI Agents Must Learn, Not Memorize" (March 25–26, 2026)
- OfficeChai — "ARC-AGI-3 Released, Gemini 3.1 Pro Top Scores With Just 0.37 Percent" (March 24–25, 2026)
- RevolutionInAI — "ARC-AGI-3 Launched: Best AI Scores 0.37% While Humans Score 100%" (March 27, 2026)
- ARC Prize Foundation — ARC-AGI-3 Technical Report (arxiv.org, March 2026)