Happycapy Guide

By Connie · This article contains affiliate links. We may earn a commission at no extra cost to you if you sign up through our links.

AI Research

ARC-AGI-3: The Hardest AI Benchmark Ever — Every Frontier Model Scores Below 1%

March 25, 2026  ·  8 min read  ·  Happycapy Guide

TL;DR
The ARC Prize Foundation released ARC-AGI-3 on March 25, 2026 — 135 interactive game environments where untrained humans score 100% and every frontier AI model scores below 1%. GPT-5.4 scores 0.26%. Claude Opus 4.6 scores 0.25%. Gemini 3.1 Pro scores 0.37%. Grok-4.20 scores 0.00%. A $2 million prize awaits any AI that matches human performance. The benchmark's novel design — requiring real-time exploration and goal inference with no instructions — exposes a fundamental gap between current AI capabilities and genuine adaptive intelligence.
  • 100%: Human score on all environments
  • <1%: Best frontier AI score
  • 135: Novel interactive game environments
  • $2M: Prize for matching human performance

What ARC-AGI-3 Is and Why It Matters

The Abstraction and Reasoning Corpus (ARC) benchmark series, created by François Chollet and the ARC Prize Foundation (co-founded with Zapier's Mike Knoop), is the most intellectually rigorous test for AI general intelligence currently in existence. Each iteration raises the difficulty to stay ahead of AI progress.

ARC-AGI-1 was conquered: OpenAI's o3 model scored 87.5% in December 2024. ARC-AGI-2 raised the difficulty of the abstract reasoning puzzles; Gemini 3.1 Pro reached 77.1% in February 2026, a significant achievement. ARC-AGI-3, released March 25, 2026, has reset the scoreboard to near-zero.

The fundamental shift in ARC-AGI-3 is from pattern recognition to adaptive intelligence. Previous ARC versions presented static puzzles with hidden rules. ARC-AGI-3 presents interactive game environments where the agent must actively explore, infer hidden goals through trial and feedback, and adapt its strategy in real time — all without any instructions, rules, or prior examples.

How ARC-AGI-3 Works

Each of the 135 environments is a novel game that has never existed before. Neither AI systems nor humans have seen them during training. The agent enters a completely blank state: no description, no rules, no objective stated. It must figure out what to do entirely through interaction.

The Three Core Challenges
  • Exploration: Mapping the environment and discovering what actions are possible — without any guide to what matters
  • Goal inference: Deducing what "winning" or "succeeding" looks like from environmental feedback and partial observations
  • Efficient execution: Solving the task in as few steps as possible — the scoring formula penalizes inefficiency quadratically
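The explore-infer-execute loop described above can be sketched as a toy Python example. Everything here (the `ToyEnv` class, its action names, the win flag) is invented for illustration; the real ARC-AGI-3 environments and their interface are far richer than this:

```python
class ToyEnv:
    """A stand-in environment with a hidden goal: reach cell 5 on a
    1-D track. This is NOT the real ARC-AGI-3 API; it only illustrates
    the explore / infer / execute loop described above."""
    ACTIONS = ("left", "right")

    def __init__(self):
        self.pos = 0

    def step(self, action):
        self.pos += 1 if action == "right" else -1
        # The only feedback is a win flag: no rules, no score hints.
        return {"observation": self.pos, "won": self.pos == 5}


def run_agent(env, max_steps=50):
    """Explore by trying each action once, infer from feedback which
    direction makes progress, then execute that choice efficiently."""
    steps = 0
    for action in env.ACTIONS:          # exploration phase
        feedback = env.step(action)
        steps += 1
        if feedback["won"]:
            return steps
    # Goal inference (toy version): the observation grew after
    # "right", so commit to that action and execute.
    while steps < max_steps:
        feedback = env.step("right")
        steps += 1
        if feedback["won"]:
            return steps
    return None  # gave up; in RHAE terms this would score zero


print(run_agent(ToyEnv()))  # prints 7: 2 exploration steps + 5 moves
```

Note that even this trivially simple agent "wastes" two exploration steps a human might not need, which is exactly the kind of overhead the scoring metric punishes.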

The scoring metric is called Relative Human Action Efficiency (RHAE). If a human solves a task in 10 steps and an AI takes 100 steps, the AI scores not 10% but 1% — the square of the ratio. This design specifically prevents brute-force search strategies from achieving high scores. Every unnecessary action dramatically reduces the final score.
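Based on the worked example above (10 human steps vs. 100 AI steps yielding 1%), the metric can be sketched as the squared ratio of step counts. This is a reconstruction from that single example, not the ARC Prize Foundation's published formula, and the cap at 100% is an assumption:

```python
def rhae(human_steps: int, agent_steps: int) -> float:
    """Relative Human Action Efficiency, reconstructed from the
    article's example: the squared ratio of human steps to agent
    steps, capped at 1.0. NOTE: an approximation of the idea, not
    the official ARC Prize scoring formula."""
    if agent_steps <= 0:
        raise ValueError("agent_steps must be positive")
    return min(1.0, (human_steps / agent_steps) ** 2)


# The article's example: human solves in 10 steps, AI takes 100.
# A 10x step overhead yields a 100x score penalty.
print(f"{rhae(10, 100):.0%}")  # prints 1%
```

The squaring is what makes brute force hopeless: an agent that takes 30x more steps than a human does not score 3.3%, it scores about 0.1%.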

All Frontier Models Score Below 1%

Model               | ARC-AGI-3 Score | ARC-AGI-2 Score | Notes
Untrained humans    | 100%            | ~100%           | Baseline; solved all environments with ease
Tufa Labs CNN agent | 12.58%          | N/A             | Preview-phase leader — lightweight action-learning agent, NOT an LLM
Gemini 3.1 Pro      | 0.37%           | 77.1%           | Best-performing frontier LLM on ARC-AGI-3
GPT-5.4             | 0.26%           | ~62%            | OpenAI flagship — near-zero despite leading many other benchmarks
Claude Opus 4.6     | 0.25%           | 68.8%           | Anthropic flagship — bottom of the frontier cluster
Grok-4.20           | 0.00%           | N/A             | xAI model — did not solve a single environment

The Surprising Leader

The current leader on ARC-AGI-3 (12.58%) is not GPT-5.4, Claude, or Gemini — it is a lightweight CNN-based action-learning agent from Tufa Labs. This agent uses convolutional neural networks and graph search to learn from in-context experience, not language modeling. The result suggests that the path to ARC-AGI-3 success runs through architecture design rather than raw parameter scaling.

Use the Models That Lead Where It Counts

While no AI has cracked ARC-AGI-3, today's frontier models — Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro — deliver transformative value for real-world work. Happycapy gives you all of them in one workspace at $17/month.

Try Happycapy Free →

Why ARC-AGI-3 Exposes the Limits of Scaling

The collapse from ARC-AGI-2 scores (Gemini 3.1 Pro: 77.1%) to ARC-AGI-3 scores (0.37%) is not a marginal regression — it is a complete reset. A model that solves 77% of abstract reasoning puzzles cannot solve even 1% of adaptive game environments. This is not a failure of intelligence; it is a failure of a specific type of intelligence.

Current LLMs excel at tasks that involve recognizing and recombining patterns seen during training. ARC-AGI-3 strips away that advantage entirely: every environment is novel, every goal is implicit, and every extra action is penalized. The capabilities that make GPT-5.4 or Claude Opus 4.6 useful for writing, coding, and analysis do not transfer to this domain.

François Chollet, the benchmark's creator, has long argued that "intelligence is not just skill" — it is the ability to acquire new skills efficiently in novel situations. ARC-AGI-3 operationalizes that definition. The near-zero scores challenge a central narrative of the current AI boom: that scaling compute and data produces general intelligence.

The ARC-AGI Series: A Timeline

Version   | Release        | Best AI Score                   | Human Score | Prize
ARC-AGI-1 | 2019           | 87.5% (OpenAI o3, Dec 2024)     | ~100%       | $1M — claimed
ARC-AGI-2 | Early 2026     | 84.6% (Gemini 3 Deep Think)     | ~100%       | $1M — claimed
ARC-AGI-3 | March 25, 2026 | 12.58% (Tufa Labs CNN, preview) | 100%        | $2M — unclaimed

What This Means for AI Users and Builders

ARC-AGI-3 does not mean current AI is useless — far from it. The same models scoring 0.25% on this benchmark are transforming how professionals work across writing, coding, research, analysis, and customer service. Those capabilities are real and growing.

What ARC-AGI-3 measures is something different: the ability to navigate entirely novel situations with no scaffolding. This capability — which humans develop naturally through childhood — remains genuinely elusive for AI systems as of 2026. It is a meaningful distinction between "very capable AI" and "general AI."

For AI builders, the results suggest that architecture matters as much as scale. The Tufa Labs CNN agent — with a tiny fraction of the parameters of frontier LLMs — outperforms GPT-5.4 and Claude Opus 4.6 by roughly 50x on this benchmark. The right architecture for adaptive intelligence may look very different from the transformer-based LLMs dominating the industry today.

Stay at the AI Frontier With Happycapy

Access Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and 150+ models in one workspace. As the AI landscape evolves, Happycapy adds new models without requiring new subscriptions. Pro starts at $17/month.

Start Free on Happycapy →

Frequently Asked Questions

What is ARC-AGI-3?

ARC-AGI-3 is the third version of the Abstraction and Reasoning Corpus benchmark, released March 25, 2026. It consists of 135 novel interactive game environments where AI must explore, infer hidden goals, and adapt in real time with no instructions. Untrained humans score 100%; as of March 2026, no frontier AI model has exceeded 1%.

How do GPT-5.4 and Claude score on ARC-AGI-3?

As of March 2026: GPT-5.4 scores 0.26%, Claude Opus 4.6 scores 0.25%, Gemini 3.1 Pro scores 0.37%, and Grok-4.20 scores 0.00%. The current leader (12.58%) is a lightweight CNN-based action-learning agent from Tufa Labs — not a large language model.

What is the ARC-AGI-3 prize?

The ARC Prize Foundation offers a $2 million prize to any AI system that can match the performance of untrained humans on ARC-AGI-3. As of April 2026, no system is close to claiming it. The best AI score (12.58%) is still far below human performance (100%).

Why does ARC-AGI-3 matter?

ARC-AGI-3 tests adaptive intelligence: the ability to navigate completely novel situations with no prior examples. Current LLMs score near-zero because they rely on pattern matching from training data — a strength that ARC-AGI-3 deliberately nullifies. The benchmark challenges the narrative that scaling compute and data alone produces general intelligence.

Sources: ARC Prize Foundation (Mar 25) · Fast Company (Mar 2026) · The Decoder (Mar 2026) · OfficeChai (Apr 2026) · Futurist.com (Mar 30)