Kimi K2.6 Beats Claude & GPT-5.5 in Coding: The Chinese Open-Weight Moment
By Connie · May 4, 2026 · 7 min read
What happened
On May 3, 2026, a writeup of the “Word Gem Puzzle” programming challenge hit the top of Hacker News with 329 points and 187 comments. The headline: Kimi K2.6, an open-weight model from Chinese lab Moonshot AI, finished first with 22 match points — ahead of GPT-5.5 in third and Claude Opus 4.7 in fifth. MiMo V2-Pro (Xiaomi) took second; GLM-5.1 (Zhipu AI) took fourth.
The writeup confirmed something the April benchmark runs had been hinting at: for a specific shape of agentic-coding problem — sliding-tile mechanics, multi-step state exploration, long-horizon execution — Chinese open-weight models are now beating the best Western closed-weight models outright, not merely matching them.
The numbers behind the model
| Metric | Kimi K2.6 | Claude Opus 4.7 | GPT-5.5 |
|---|---|---|---|
| Parameters | 1T (MoE, open-weight) | Undisclosed (closed) | Undisclosed (closed) |
| Pricing per MTok (input / output) | $0.60 / $2.50 | ~$15 / ~$75 | ~$6 / ~$30 |
| SWE-Bench Pro | 58.6% | ~56% (Opus 4.6) | 57.7% (GPT-5.4) |
| SWE-Bench Verified | 80.2% | 87.6% | ~84% |
| LMArena coding Elo | 1,529 (rank 6) | 1,565 (rank 1) | ~1,550 (rank 3) |
| Agentic (HLE-Full w/ tools) | 54.0% | 53.0% (Opus 4.6) | 52.1% (GPT-5.4) |
| AIME 2026 (math) | 96.4% | ~97% | 99.2% |
| Parallel agents | 300 | Limited by rate caps | Limited by rate caps |
| Longest autonomous task | 12 hours | Variable | Variable |
The pattern is clear. K2.6 wins on agentic throughput, open-weight flexibility, and SWE-Bench Pro. Claude still wins on pure single-pass quality and Elo ranking. GPT-5.5 keeps the math-reasoning crown.
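To make the pricing row concrete, here is the back-of-envelope arithmetic behind the "1/25 the price" framing used below, taking the table's list prices at face value. A minimal sketch in Python; the per-run token counts are illustrative assumptions, not measurements from the challenge.

```python
# Cost per long agentic run at the list prices in the table above.
# Token counts are illustrative assumptions, not measured values.
PRICES = {  # $ per million tokens: (input, output)
    "Kimi K2.6":       (0.60, 2.50),
    "GPT-5.5":         (6.00, 30.00),
    "Claude Opus 4.7": (15.00, 75.00),
}

IN_TOKENS, OUT_TOKENS = 400_000, 60_000  # assumed tokens consumed by one run

for model, (p_in, p_out) in PRICES.items():
    cost = IN_TOKENS / 1e6 * p_in + OUT_TOKENS / 1e6 * p_out
    print(f"{model:<16} ${cost:6.2f} per run")

# Kimi K2.6        $  0.39 per run
# GPT-5.5          $  4.20 per run
# Claude Opus 4.7  $ 10.50 per run   (~27x K2.6, consistent with "1/25 the price")
```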
Why this is the moment, not April's release
K2.6 was released on April 20. The benchmarks were public within a week. The model itself did not change between April 20 and May 3. What changed is that a real-time puzzle — a qualitatively different kind of evaluation from static SWE-Bench runs — produced a clear, ranked outcome with K2.6 on top.
That distinction matters because:
- Benchmark contamination is a real worry. SWE-Bench problems can leak into training corpora. Claims of strong benchmark performance are increasingly discounted by practitioners. A fresh, previously-unseen puzzle is less gameable.
- The challenge tested stamina and agentic workflow — exactly the axes K2.6 was designed for. It demonstrated that the 300-agent parallelism and 12-hour autonomous-task claims translate into wins under real-time pressure, not just into paper specs.
- It was a pure ranking, not a scoring gradient. Nobody can dispute “K2.6 finished first, Claude finished fifth.” That makes the story shareable in a way nuanced benchmark scores are not.
The open-weight economics shift
The real story underneath the headline is what happens when a frontier-tier model is open-weight at K2.6's price:
- Self-hosted workloads become viable. A 1T-parameter MoE can run on a modest cluster and eliminate per-token API costs for sustained agentic work. At Claude's prices, that math never pencils out.
- Agent swarms stop being a research curiosity. Running 300 agents in parallel is impractical on closed APIs due to rate limits and cost. On K2.6 it is the nominal deployment (see the sketch after this list).
- Fine-tuning and safety tuning return to the customer. Open-weight means banks, governments, and regulated industries can retain full control of the model state — which is part of why Goldman Sachs cut Claude access for Hong Kong bankers but has been less restrictive with open-weight alternatives.
- The cost structure feeds the compute-constraints debate. Cheap inference plus agent parallelism means you can accomplish more per dollar — until you hit the quality wall, at which point you fall back to closed frontier models.
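Concretely, the swarm pattern is just bounded fan-out against a chat-completions endpoint. Below is a minimal sketch, assuming a hypothetical self-hosted, OpenAI-compatible server at localhost:8000 and a hypothetical model identifier; a real deployment would add retries, per-task timeouts, and backpressure.

```python
# Minimal agent-swarm sketch: fan tasks out against a self-hosted,
# OpenAI-compatible endpoint with a concurrency cap. The endpoint URL,
# model name, and task list are hypothetical placeholders.
import asyncio
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed self-hosted server
MODEL = "kimi-k2.6"                                     # assumed model identifier

def call_model(task: str) -> str:
    """One blocking chat-completion request using the standard OpenAI-style schema."""
    payload = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": task}],
    }).encode()
    req = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

async def run_swarm(tasks: list[str], parallelism: int = 300) -> list[str]:
    """Run up to `parallelism` agent calls at once, gated by a semaphore."""
    sem = asyncio.Semaphore(parallelism)

    async def worker(task: str) -> str:
        async with sem:
            # Offload the blocking HTTP call so the event loop stays responsive.
            return await asyncio.to_thread(call_model, task)

    return await asyncio.gather(*(worker(t) for t in tasks))

if __name__ == "__main__":
    demo_tasks = [f"Fix failing test #{i}" for i in range(1_000)]
    results = asyncio.run(run_swarm(demo_tasks))
    print(f"completed {len(results)} agent tasks")
```

Note that asyncio.to_thread runs on Python's default thread pool, which caps concurrency well below 300; a true 300-way swarm would use an async HTTP client or a larger executor. The point is the shape of the loop: on a closed API, the semaphore's value is effectively set by someone else's rate limiter.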
Where Claude still wins (and why that matters)
The honest counter-story: on pure coding quality in most independent evaluations, Claude Opus 4.7 is still ahead. Specific examples:
- Kilo Code's workflow-orchestration spec: Claude Opus 4.7 scored 91/100 vs K2.6 at 68/100.
- LMArena coding Elo: Claude at 1,565 (rank 1) vs K2.6 at 1,529 (rank 6).
- SWE-Bench Verified: Claude 87.6% vs K2.6 80.2% — a 7.4-point gap.
- On long-horizon refactor work inside large codebases, Claude's context handling and refusal predictability remain differentiators.
So the right framing isn't “Kimi K2.6 beats Claude.” It is “Kimi K2.6 beats Claude on a specific set of problem shapes, at 1/25 the price, while staying open-weight.” For a lot of 2026 workloads, that tradeoff is already a no-brainer. For code quality that a senior engineer would merge without a second review, Claude is still the default.
The Chinese open-weight wave this is part of
K2.6 is not an isolated event. It is the latest in a compounding wave:
- DeepSeek V4 Pro (April 24) — 83.7% on SWE-Bench Verified, MIT license, $0.30/MTok input, 1M context. Our coverage: DeepSeek V4 1T-parameter open-source release.
- GLM-5.1 (Zhipu AI) — first Chinese model to top SWE-Bench Pro at 58.4% before K2.6 edged it. Covered in our GLM-5.1 long-horizon-task comparison.
- MiMo V2-Pro (Xiaomi) — finished second in the Word Gem Puzzle. A phone maker going toe-to-toe with OpenAI on pure puzzle-solving is its own tell.
- Qwen 3.5 (Alibaba) — frontier multimodal, long-video analysis.
- Tencent Hunyuan 3 — WeChat agent integration gives Hunyuan the largest consumer-agent deployment on Earth.
Each release alone is incremental. In aggregate, they represent a Chinese AI stack that is open-weight, inexpensive, and benchmarking competitively. That combination is qualitatively different from what Western labs are offering in 2026.
What this means for Anthropic, OpenAI, Google
Three second-order consequences to watch:
- Pricing pressure bottom-up. Closed-weight labs have been able to keep per-token prices high because the quality gap justified it. When K2.6 is beating Claude on specific benchmarks at 1/25 the price, every enterprise buyer now has a credible BATNA (best alternative to a negotiated agreement). Expect API-price cuts from Anthropic and OpenAI in Q2 2026.
- Export-control reinforcement pressure. Each Chinese open-weight frontier release strengthens the case in Washington for tighter GPU export controls and for excluding Chinese AI tooling from US government systems — similar reasoning to the Pentagon's supply-chain-risk framework.
- Accelerated commoditization of mid-tier coding. For the 70% of engineering work that isn't bleeding-edge, the Kimi / DeepSeek / GLM stack is increasingly “good enough.” That is a structural problem for closed-weight revenue per task; a routing sketch follows below.
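The commoditization point implies a routing layer in front of every coding task: send the routine 70% to the cheap open-weight model and escalate only quality-critical work to a closed frontier model. A minimal sketch of that logic, with illustrative model identifiers and an assumed escalation flag:

```python
# Minimal task-routing sketch for the "good enough" economics above.
# Model identifiers and the escalation heuristic are illustrative assumptions.
from dataclasses import dataclass

CHEAP_OPEN_WEIGHT = "kimi-k2.6"       # assumed identifier for the open-weight tier
FRONTIER_CLOSED = "claude-opus-4.7"   # assumed identifier for the frontier tier

@dataclass
class CodingTask:
    description: str
    merge_without_review: bool  # must the output be merge-ready with no second look?

def route(task: CodingTask) -> str:
    """Route mid-tier work to the cheap model; escalate quality-critical work."""
    return FRONTIER_CLOSED if task.merge_without_review else CHEAP_OPEN_WEIGHT

print(route(CodingTask("bump dependency versions", False)))  # kimi-k2.6
print(route(CodingTask("refactor the auth core", True)))     # claude-opus-4.7
```

In practice the escalation signal would come from failing tests or rejected reviews rather than a hand-set flag, but the revenue implication is the same: the frontier model only sees the tasks the cheap model cannot close.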
The bottom line
Kimi K2.6 winning one specific coding challenge is not going to dethrone Claude or GPT-5.5 for the hardest engineering work. What it does is make the “Western frontier vs Chinese open-weight” gap small enough that economics, not raw quality, becomes the deciding factor for most real workloads. That is the moment the commercial AI landscape has been heading toward since the first DeepSeek R1 release. May 3, 2026 is the cleanest, most shareable datapoint yet.