ARC-AGI-3: The Hardest AI Benchmark Ever — Every Frontier Model Scores Below 1%
March 25, 2026 · 8 min read · Happycapy Guide
What ARC-AGI-3 Is and Why It Matters
The Abstraction and Reasoning Corpus (ARC) benchmark series, created by François Chollet and now run by the ARC Prize Foundation (which he co-founded with Zapier's Mike Knoop), is arguably the most rigorous test of general intelligence in AI today. Each iteration raises the difficulty to stay ahead of AI progress.
ARC-AGI-1 was conquered: OpenAI's o3 model scored 87.5% in December 2024. ARC-AGI-2 raised the difficulty of the same abstract-reasoning format and was cracked in turn: Gemini 3.1 Pro reached 77.1% in February 2026, and Gemini 3 Deep Think holds the top score at 84.6%. ARC-AGI-3, released March 25, 2026, has reset the scoreboard to near-zero.
The fundamental shift in ARC-AGI-3 is from pattern recognition to adaptive intelligence. Previous ARC versions presented static puzzles with hidden rules. ARC-AGI-3 presents interactive game environments where the agent must actively explore, infer hidden goals through trial and feedback, and adapt its strategy in real time — all without any instructions, rules, or prior examples.
How ARC-AGI-3 Works
Each of the 135 environments is a game that has never existed before; neither AI systems nor humans have seen it during training. The agent starts from a blank slate: no description, no rules, no stated objective. It must figure out what to do entirely through interaction.
- Exploration: Mapping the environment and discovering what actions are possible — without any guide to what matters
- Goal inference: Deducing what "winning" or "succeeding" looks like from environmental feedback and partial observations
- Efficient execution: Solving the task in as few steps as possible; the scoring formula penalizes inefficiency quadratically (a minimal sketch of this loop follows the list)
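To make those three demands concrete, here is a minimal Python sketch of the explore, infer, execute loop. Everything in it is hypothetical: the `ToyGame` class, its five actions, and the scalar feedback signal are stand-ins for illustration, not the benchmark's actual interface.

```python
import random

class ToyGame:
    """A stand-in for an ARC-AGI-3 environment: the goal is hidden and no
    rules are given. This interface is invented for illustration; it is not
    the benchmark's real API."""
    ACTIONS = ("up", "down", "left", "right", "interact")

    def __init__(self):
        self._goal = random.choice(self.ACTIONS)  # hidden from the agent

    def step(self, action):
        """Return (feedback, done). Feedback is all the agent ever observes."""
        done = action == self._goal
        return (1.0 if done else 0.0), done

def run_agent(env):
    """Explore untried actions, infer the goal from feedback, then stop.
    Every wasted step lowers the score, so the loop never repeats an action."""
    for steps, action in enumerate(env.ACTIONS, start=1):  # exploration
        feedback, done = env.step(action)                  # observe feedback
        if done:           # goal inference: positive feedback reveals the rule
            return steps   # efficient execution: halt the moment the goal is met
    return None            # goal never found within the action set

print(run_agent(ToyGame()))  # e.g. 3 if the hidden goal was the third action tried
```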
The scoring metric is called Relative Human Action Efficiency (RHAE). If a human solves a task in 10 steps and an AI takes 100 steps, the AI scores not 10% but 1% — the square of the ratio. This design specifically prevents brute-force search strategies from achieving high scores. Every unnecessary action dramatically reduces the final score.
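As a rough illustration of that quadratic penalty, the snippet below implements the squared-ratio behavior described above. The function name `rhae` and the cap at 100% are assumptions for the sketch; the official formula may add details such as per-task averaging.

```python
def rhae(human_steps: int, ai_steps: int) -> float:
    """Squared-ratio scoring as described in the article. The cap at 100%
    (for an AI faster than the human baseline) is our assumption."""
    return min(1.0, (human_steps / ai_steps) ** 2)

print(f"{rhae(10, 100):.0%}")  # 1%   -> 10x the steps costs 100x the score
print(f"{rhae(10, 20):.0%}")   # 25%  -> twice the steps, a quarter of the score
print(f"{rhae(10, 10):.0%}")   # 100% -> matching human efficiency
```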
All Frontier Models Score Below 1%
| Model | ARC-AGI-3 Score | ARC-AGI-2 Score | Notes |
|---|---|---|---|
| Untrained humans | 100% | ~100% | Baseline; solved all environments with ease |
| Tufa Labs CNN agent | 12.58% | N/A | Preview phase leader — lightweight action-learning agent, NOT an LLM |
| Gemini 3.1 Pro | 0.37% | 77.1% | Best-performing frontier LLM on ARC-AGI-3 |
| GPT-5.4 | 0.26% | ~62% | OpenAI flagship — near-zero despite leading many other benchmarks |
| Claude Opus 4.6 | 0.25% | 68.8% | Anthropic flagship — bottom of the frontier cluster |
| Grok-4.20 | 0.00% | N/A | xAI model — did not solve a single environment |
The current leader on ARC-AGI-3 (12.58%) is not GPT-5.4, Claude, or Gemini — it is a lightweight CNN-based action-learning agent from Tufa Labs. This agent uses convolutional neural networks and graph search to learn from in-context experience, not language modeling. The result suggests that the path to ARC-AGI-3 success runs through architecture design rather than raw parameter scaling.
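Tufa Labs has not published the agent's internals, so the following sketch is only a guess at its general shape: a small CNN that scores actions from a raw grid observation, intended to be paired with ordinary graph search over visited states. The layer sizes, the 16-color one-hot grid encoding, and the 5-action space are all assumptions.

```python
import torch
import torch.nn as nn

class GridPolicy(nn.Module):
    """Illustrative only: a tiny CNN that scores actions from a raw grid
    observation. Not Tufa Labs' actual architecture."""
    def __init__(self, colors=16, actions=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(colors, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, actions),
        )

    def forward(self, grid):   # grid: (batch, colors, height, width)
        return self.net(grid)  # per-action scores used to prioritize search

# In-context loop (sketch): each observed (state, action, feedback) transition
# becomes an edge in a state graph; the CNN's action scores order the frontier,
# and plain graph search (e.g. BFS over visited states) recovers short action
# sequences back to promising states.
policy = GridPolicy()
obs = torch.zeros(1, 16, 10, 10)  # one blank 10x10 grid, one-hot over 16 colors
print(policy(obs).shape)          # torch.Size([1, 5])
```

A network this size has roughly fourteen thousand parameters, which is the point: orders of magnitude less capacity than a frontier LLM, aimed at a different kind of problem.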
While no AI has cracked ARC-AGI-3, today's frontier models — Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro — deliver transformative value for real-world work. Happycapy gives you all of them in one workspace at $17/month.
Try Happycapy Free →
Why ARC-AGI-3 Exposes the Limits of Scaling
The collapse from ARC-AGI-2 scores (Gemini 3.1 Pro: 77.1%) to ARC-AGI-3 scores (0.37%) is not a marginal regression — it is a complete reset. A model that solves 77% of abstract reasoning puzzles cannot solve even 1% of adaptive game environments. This is not a failure of intelligence; it is a failure of a specific type of intelligence.
Current LLMs excel at tasks that involve recognizing and recombining patterns seen during training. ARC-AGI-3 strips away that advantage entirely: every environment is novel, every goal is implicit, and every extra action is penalized. The capabilities that make GPT-5.4 or Claude Opus 4.6 useful for writing, coding, and analysis do not transfer to this domain.
François Chollet, the benchmark's creator, has long argued that "intelligence is not just skill" — it is the ability to acquire new skills efficiently in novel situations. ARC-AGI-3 operationalizes that definition. The near-zero scores challenge a central narrative of the current AI boom: that scaling compute and data produces general intelligence.
The ARC-AGI Series: A Timeline
| Version | Release | Best AI Score | Human Score | Prize |
|---|---|---|---|---|
| ARC-AGI-1 | 2019 | 87.5% (OpenAI o3, Dec 2024) | ~100% | $1M — claimed |
| ARC-AGI-2 | Early 2026 | 84.6% (Gemini 3 Deep Think) | ~100% | $1M — claimed |
| ARC-AGI-3 | March 25, 2026 | 12.58% (Tufa Labs CNN, preview) | 100% | $2M — unclaimed |
What This Means for AI Users and Builders
ARC-AGI-3 does not mean current AI is useless — far from it. The same models scoring 0.25% on this benchmark are transforming how professionals work across writing, coding, research, analysis, and customer service. Those capabilities are real and growing.
What ARC-AGI-3 measures is something different: the ability to navigate entirely novel situations with no scaffolding. This capability — which humans develop naturally through childhood — remains genuinely elusive for AI systems as of 2026. It is a meaningful distinction between "very capable AI" and "general AI."
For AI builders, the results suggest that architecture matters as much as scale. The Tufa Labs CNN agent — with a tiny fraction of the parameters of frontier LLMs — outperforms GPT-5.4 and Claude by 50x on this benchmark. The right architecture for adaptive intelligence may look very different from the transformer-based LLMs dominating the industry today.
Access Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and 150+ models in one workspace. As the AI landscape evolves, Happycapy adds new models without requiring new subscriptions. Pro starts at $17/month.
Start Free on Happycapy →
Frequently Asked Questions
What is ARC-AGI-3?
ARC-AGI-3 is the third version of the Abstraction and Reasoning Corpus benchmark, released March 25, 2026. It consists of 135 novel interactive game environments where AI must explore, infer hidden goals, and adapt in real time with no instructions. Untrained humans score 100%; as of March 2026, no frontier AI model has exceeded 1%.
How do frontier AI models score on ARC-AGI-3?
As of March 2026: GPT-5.4 scores 0.26%, Claude Opus 4.6 scores 0.25%, Gemini 3.1 Pro scores 0.37%, and Grok-4.20 scores 0.00%. The current leader (12.58%) is a lightweight CNN-based action-learning agent from Tufa Labs, not a large language model.
What is the prize for solving ARC-AGI-3?
The ARC Prize Foundation offers a $2 million prize to any AI system that can match the performance of untrained humans on ARC-AGI-3. As of March 2026, no system is close to claiming it. The best AI score (12.58%) is still far below human performance (100%).
Why do current LLMs score near zero on ARC-AGI-3?
ARC-AGI-3 tests adaptive intelligence: the ability to navigate completely novel situations with no prior examples. Current LLMs score near-zero because they rely on pattern matching from training data, a strength that ARC-AGI-3 deliberately nullifies. The benchmark challenges the narrative that scaling compute and data alone produces general intelligence.