MLPerf Inference v6.0: AMD Breaks 1M Tokens/Sec, First Video Generation Benchmark
TL;DR: MLCommons released MLPerf Inference v6.0 results on April 1, 2026. AMD's Instinct MI355X broke the 1 million tokens-per-second barrier at multi-node scale. NVIDIA's GB300 NVL72 hit 2.5M tokens/sec on a single rack. Five new workloads were added, including the first-ever text-to-video generation benchmark and a 235B multimodal model. 24 organizations submitted results — a new participation record.
Every six months, MLCommons releases MLPerf Inference results — the AI industry's closest equivalent to a standardized benchmark for hardware and software inference performance. The v6.0 round, covering results as of April 1, 2026, is the most significant update since the suite launched.
The headline: AMD is genuinely competitive now. And the benchmarks themselves have finally caught up to where AI workloads actually are in 2026 — multimodal, video-generating, and running at scales that would have seemed impossible in 2024.
Five New Workloads That Reflect Real AI in 2026
The most important change in v6.0 is what got added. Previous rounds measured LLM text generation and recommendation models — workloads from the 2022–2023 era. v6.0 adds:
| New Workload | Type | Why It Matters |
|---|---|---|
| Qwen3-VL-235B-A22B | Multimodal vision-language (MoE) | First multimodal model in the suite — tests image + text inference together |
| GPT-OSS-120B | MoE reasoning LLM (OpenAI open weights) | Tests sparse model inference — a dominant architecture in 2026 |
| WAN-2.2-T2V-A14B | Text-to-video generation | First video generation workload ever in MLPerf — 14B parameters |
| DeepSeek-R1 Interactive | Reasoning LLM (low-latency scenario) | Tests real-time interactive reasoning, not just batch throughput |
| DLRMv3 | Transformer-based recommendation | Replaces the older DLRM-DCNv2 with hyperscale-relevant compute intensity |
The inclusion of WAN-2.2 text-to-video is particularly significant. Video generation has become one of the most compute-intensive inference workloads in production — used by Runway, Kling, and dozens of enterprise video tools. This is the first time it has been formally benchmarked across hardware vendors in a standardized setting.
NVIDIA: Still Dominant, But Software Is the New Moat
NVIDIA's GB300 NVL72 — a rack-scale system with 72 Blackwell Ultra GPUs — achieved 2.5 million tokens per second on DeepSeek-R1, earning 10 first-place results across the 16 benchmark categories.
More interesting than the raw numbers: software optimization delivered a 2.77x throughput gain on the GB300 NVL72 compared to its debut in v5.1, six months earlier — with no hardware changes. NVIDIA's cumulative MLPerf wins since 2018 stand at 291, nine times those of all other submitters combined.
The pattern is clear: NVIDIA's lead is increasingly a software story, not just silicon. Their inference stack (TensorRT-LLM, NVLink fabric optimization, serving architecture) compounds faster than hardware can be replicated.
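The compounding effect of serving-stack work is easiest to see with a toy model of request batching, one of the main levers behind software-only throughput gains. Every number below is illustrative, not taken from the MLPerf submissions:

```python
# Toy model of why serving-stack choices move throughput on fixed hardware:
# batching amortizes fixed per-step costs across concurrent requests.
# The overhead and per-token costs are illustrative assumptions.

def tokens_per_sec(batch_size: int,
                   step_overhead_ms: float = 4.0,
                   per_token_ms: float = 0.05) -> float:
    """Decode throughput when one step serves `batch_size` requests.

    Each step pays a fixed overhead (kernel launches, scheduling)
    plus a small per-token cost; larger batches amortize the overhead.
    """
    step_ms = step_overhead_ms + per_token_ms * batch_size
    return batch_size / (step_ms / 1000.0)

for bs in (1, 32, 256):
    print(f"batch={bs:<4} {tokens_per_sec(bs):>10,.0f} tokens/sec")
```

The exact speedup depends on the overhead-to-compute ratio, but the shape of the curve is why schedulers, KV-cache management, and batching policies can multiply throughput on unchanged silicon.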
AMD: The 1 Million Tokens/Sec Milestone
The biggest competitive story in v6.0 is AMD. The Instinct MI355X platform became the first to surpass 1 million tokens per second in any MLPerf benchmark — achieving 1,016,380 tokens/sec on DeepSeek-R1 at multi-node scale (11 nodes).
| Metric | AMD MI355X | NVIDIA GB300 NVL72 |
|---|---|---|
| DeepSeek-R1 (multi-node tokens/sec) | 1,016,380 | 575,580 (single rack) |
| GPT-OSS-120B (multi-node tokens/sec) | 1,031,070 | 1,096,770 (single rack) |
| Single-stream performance vs B200 | 93% (108% after tuning) | 100% (baseline) |
| Partner ecosystem (submitters) | 9 partners | Dominant across all categories |
Nine ecosystem partners submitted results on AMD hardware, tying the record for any single platform in one round. That breadth matters for enterprise buyers who need hardware supply chain diversity.
The caveat: AMD's 1M tokens/sec required 11 nodes vs NVIDIA's single-rack result. Per-node efficiency still favors NVIDIA. But the multi-node story matters for hyperscalers building at cluster scale.
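The per-accelerator gap is easy to check with back-of-envelope arithmetic from the headline numbers. The 8-GPUs-per-node figure for the AMD submission is an assumption for illustration; the totals are the ones quoted above:

```python
# Back-of-envelope per-accelerator throughput on DeepSeek-R1.
# amd_gpus assumes 8 MI355X per node (an illustrative assumption);
# the throughput totals are the headline figures from this round.

amd_total = 1_016_380        # tokens/sec, 11 nodes
amd_gpus = 11 * 8            # assumed 8 GPUs per node
nvda_total = 2_500_000       # tokens/sec, one GB300 NVL72 rack
nvda_gpus = 72

print(f"AMD    per GPU: {amd_total / amd_gpus:>9,.0f} tokens/sec")
print(f"NVIDIA per GPU: {nvda_total / nvda_gpus:>9,.0f} tokens/sec")
```

Under those assumptions NVIDIA delivers roughly 3x the per-accelerator throughput, which is why the AMD result is best read as a cluster-scale milestone rather than a per-chip one.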
Intel: The CPU-Only Outlier
Intel submitted results for its Arc Pro B-Series GPUs, positioned for accessible inference: a 4-GPU Arc Pro B70 configuration provides 128GB of total VRAM — enough to run 120B-parameter models without splitting them across machines. Separately, Intel remains the only server processor vendor to submit CPU-only MLPerf Inference results, targeting workloads where dedicated GPU infrastructure isn't available.
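The "128GB fits a 120B model" claim comes down to weight precision, which a quick memory calculation makes concrete. Both precisions below are illustrative choices, and the estimate ignores KV cache and activations:

```python
# Quick memory check: does 128 GB of pooled VRAM hold 120B parameters?
# Depends on weight precision; ignores KV cache and activation memory.

PARAMS = 120e9
GIB = 1024 ** 3
BUDGET_GIB = 128

for name, bytes_per_param in [("FP8  (1 byte/param) ", 1),
                              ("FP16 (2 bytes/param)", 2)]:
    weight_gib = PARAMS * bytes_per_param / GIB
    fits = weight_gib < BUDGET_GIB
    print(f"{name}: {weight_gib:6.1f} GiB of weights -> fits: {fits}")
```

At 8-bit precision the weights land just under the budget (~112 GiB), while 16-bit weights are roughly double it — so single-box serving of models this size hinges on low-precision formats.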
For enterprises running inference on existing CPU infrastructure — a large segment of the market — Intel's continued participation in MLPerf provides the only apples-to-apples CPU benchmarking data.
System Scale: Multi-Node Goes Mainstream
v6.0 marks a structural shift in how AI inference is being deployed:
- 30% increase in multi-node system submissions vs v5.1
- Largest system: 72 nodes, 288 accelerators — 4x the previous record
- 10% of all systems had more than 10 nodes (vs 2% in v5.1)
This reflects the real-world shift happening in production: inference is no longer a single-GPU or single-server problem. Frontier models at 100B+ parameters, served at millions of tokens per second, require coordinated multi-node inference with high-bandwidth interconnects. The benchmark is finally catching up to production reality.
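The move to multi-node is largely a sizing exercise: hitting an aggregate tokens/sec target when each added node scales slightly sub-linearly. The per-node throughput and scaling efficiency below are illustrative assumptions, not MLPerf figures:

```python
# Aggregate throughput sizing: nodes needed to hit a tokens/sec target
# when scaling is slightly sub-linear. All inputs here are illustrative.

import math

def nodes_for_target(target_tps: float, per_node_tps: float,
                     scaling_eff: float = 0.9) -> int:
    """Nodes needed when each node delivers scaling_eff of its ideal rate."""
    return math.ceil(target_tps / (per_node_tps * scaling_eff))

# e.g. a 1M tokens/sec target with ~100k tokens/sec per node:
print(nodes_for_target(1_000_000, per_node_tps=100_000))  # → 12
```

Interconnect quality shows up in `scaling_eff`: the closer a fabric keeps it to 1.0, the fewer nodes a given throughput target requires, which is why high-bandwidth interconnects dominate multi-node inference design.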
Full Performance Results Summary
| Model | Platform | Tokens/sec | Scale |
|---|---|---|---|
| DeepSeek-R1 | NVIDIA GB300 NVL72 | 2,500,000+ | 72 GPUs (1 rack) |
| DeepSeek-R1 | AMD MI355X | 1,016,380 | 11 nodes |
| GPT-OSS-120B | NVIDIA GB300 NVL72 | 1,096,770 | 72 GPUs (1 rack) |
| GPT-OSS-120B | AMD MI355X | 1,031,070 | 12 nodes |
| DeepSeek-R1 | NVIDIA GB300 NVL72 | 575,580 | Single rack (server) |
What This Means for AI Builders
For teams choosing inference infrastructure in 2026:
- NVIDIA H100/B200 remains the default for most use cases — the ecosystem, software stack, and per-accelerator efficiency are unmatched
- AMD is a credible alternative for hyperscale multi-node deployments — especially as supply chain diversity becomes a priority
- Video generation benchmarks are now comparable across vendors for the first time — useful for teams evaluating text-to-video at scale
- Software optimization matters more than ever — NVIDIA's 2.77x gain from pure software tuning is a reminder that your serving stack is as important as your hardware
For teams using multi-model AI platforms like Happycapy — which runs models from OpenAI, Anthropic, Google, and others through a single interface — these benchmark improvements translate directly to faster, cheaper responses as inference providers upgrade their hardware.
Key Takeaways
- MLPerf Inference v6.0 released April 1, 2026 — the most significant update to the benchmark suite to date
- AMD MI355X first to break 1 million tokens/sec in any MLPerf benchmark (multi-node)
- NVIDIA GB300 NVL72 hits 2.5M tokens/sec on DeepSeek-R1 from a single rack
- Five new workloads: first multimodal model, first text-to-video model, new MoE reasoning and recommendation benchmarks
- 30% more multi-node systems; 10% of all submitted systems have more than 10 nodes
- Software optimization delivered 2.77x gain on existing NVIDIA hardware — no new silicon needed
- 24 organizations submitted results — a new participation record
Sources: AMD blog (amd.com, April 1, 2026), MLCommons official results (mlcommons.org/benchmarks/inference-datacenter), VentureBeat MLPerf v6.0 coverage, Nebius submission documentation. All performance numbers are from official MLCommons submissions as of April 1, 2026.