HappycapyGuide

By Connie

MLPerf Inference v6.0: AMD Breaks 1M Tokens/Sec, First Video Generation Benchmark

TL;DR: MLCommons released MLPerf Inference v6.0 results on April 1, 2026. AMD's Instinct MI355X broke the 1 million tokens-per-second barrier at multi-node scale. NVIDIA's GB300 NVL72 hit 2.5M tokens/sec on a single rack. Five new workloads were added, including the first-ever text-to-video generation benchmark and a 235B multimodal model. 24 organizations submitted results, a new participation record.

Every six months, MLCommons releases MLPerf Inference results — the AI industry's closest equivalent to a standardized benchmark for hardware and software inference performance. The v6.0 round, covering results as of April 1, 2026, is the most significant update since the suite launched.

The headline: AMD is genuinely competitive now. And the benchmarks themselves have finally caught up to where AI workloads actually are in 2026 — multimodal, video-generating, and running at scales that would have seemed impossible in 2024.

Five New Workloads That Reflect Real AI in 2026

The most important change in v6.0 is what got added. Previous rounds measured LLM text generation and recommendation models — workloads from the 2022–2023 era. v6.0 adds:

| New Workload | Type | Why It Matters |
|---|---|---|
| Qwen3-VL-235B-A22B | Multimodal vision-language (MoE) | First multimodal model in the suite; tests image + text inference together |
| GPT-OSS-120B | MoE reasoning LLM (OpenAI open weights) | Tests sparse-model inference, a dominant architecture in 2026 |
| WAN-2.2-T2V-A14B | Text-to-video generation | First video generation workload ever in MLPerf; 14B parameters |
| DeepSeek-R1 Interactive | Reasoning LLM (low-latency scenario) | Tests real-time interactive reasoning, not just batch throughput |
| DLRMv3 | Transformer-based recommendation | Replaces the older DLRM-DCNv2 with hyperscale-relevant compute intensity |

The inclusion of WAN-2.2 text-to-video is particularly significant. Video generation has become one of the most compute-intensive inference workloads in production — used by Runway, Kling, and dozens of enterprise video tools. This is the first time it has been formally benchmarked across hardware vendors in a standardized setting.

NVIDIA: Still Dominant, But Software Is the New Moat

NVIDIA's GB300 NVL72 — a rack-scale system with 72 Blackwell Ultra GPUs — achieved 2.5 million tokens per second on DeepSeek-R1, earning 10 first-place results across the 16 benchmark categories.

More interesting than the raw numbers: software optimization delivered a 2.77x throughput gain on the GB300 NVL72 compared to its debut in v5.1, six months earlier, with no hardware changes. NVIDIA's cumulative MLPerf wins since 2018 stand at 291, nine times that of all other submitters combined.
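A software-only gain of that size is easy to sanity-check. A minimal sketch, assuming the 2.77x figure applies to the 2.5M tokens/sec DeepSeek-R1 result (the article reports both in this section but does not explicitly tie them together):

```python
# Illustrative arithmetic only; both inputs are figures reported in the article.
# Assumption: the 2.77x software gain applies to the DeepSeek-R1 headline number.
v6_throughput = 2_500_000   # GB300 NVL72 on DeepSeek-R1, v6.0 (tokens/sec)
software_gain = 2.77        # reported v5.1 -> v6.0 gain on unchanged hardware

implied_v5_baseline = v6_throughput / software_gain
print(f"Implied v5.1 baseline: ~{implied_v5_baseline:,.0f} tokens/sec")
```

Under that assumption, the same rack would have been serving roughly 900K tokens/sec six months earlier, which is what makes the software story so striking.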

The pattern is clear: NVIDIA's lead is increasingly a software story, not just silicon. Their inference stack (TensorRT-LLM, NVLink fabric optimization, serving architecture) compounds faster than hardware can be replicated.

AMD: The 1 Million Tokens/Sec Milestone

The biggest competitive story in v6.0 is AMD. The Instinct MI355X platform became the first to surpass 1 million tokens per second in any MLPerf benchmark — achieving 1,016,380 tokens/sec on DeepSeek-R1 at multi-node scale (11 nodes).

| Metric | AMD MI355X | NVIDIA GB300 NVL72 |
|---|---|---|
| DeepSeek-R1 (multi-node tokens/sec) | 1,016,380 | 575,580 (single rack) |
| GPT-OSS-120B (multi-node tokens/sec) | 1,031,070 | 1,096,770 (single rack) |
| Single Stream vs B200 | 93% (108% post-tuning) | 100% (baseline) |
| Partner ecosystem (submitters) | 9 partners | Dominant across all categories |

Nine ecosystem partners submitted results on AMD's platform, tying the record for any single platform in one round. That breadth matters for enterprise buyers who need hardware supply-chain diversity.

The caveat: AMD's 1M tokens/sec required 11 nodes vs NVIDIA's single-rack result. Per-node efficiency still favors NVIDIA. But the multi-node story matters for hyperscalers building at cluster scale.
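The per-unit gap is straightforward to estimate from the published totals. A back-of-envelope sketch; the 8-GPUs-per-node figure for the AMD submission is an assumption, not something the results state:

```python
# Rough per-GPU normalization of the two headline DeepSeek-R1 results.
# Assumption (not from the article): each AMD node carries 8 MI355X GPUs.
amd_total, amd_nodes, amd_gpus_per_node = 1_016_380, 11, 8
nvidia_total, nvidia_gpus = 2_500_000, 72   # GB300 NVL72: 72 GPUs per rack

amd_per_gpu = amd_total / (amd_nodes * amd_gpus_per_node)
nvidia_per_gpu = nvidia_total / nvidia_gpus

print(f"AMD MI355X:   ~{amd_per_gpu:,.0f} tokens/sec per GPU")
print(f"NVIDIA GB300: ~{nvidia_per_gpu:,.0f} tokens/sec per GPU")
```

Even with generous assumptions about node size, the per-GPU numbers illustrate why the single-rack vs multi-node distinction matters when reading the headline totals.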

Intel: The CPU-Only Outlier

Intel submitted results for its Arc Pro B-Series GPUs, aimed at accessible inference: a 4-GPU Arc Pro B70 configuration provides 128GB of total VRAM, enough to run 120B-parameter models without splitting them across systems. Intel also remains the only server processor vendor to submit CPU-only MLPerf Inference results, targeting workloads where dedicated GPU infrastructure isn't available.

For enterprises running inference on existing CPU infrastructure — a large segment of the market — Intel's continued participation in MLPerf provides the only apples-to-apples CPU benchmarking data.

System Scale: Multi-Node Goes Mainstream

v6.0 marks a structural shift in how AI inference is being deployed.

This reflects the real-world shift happening in production: inference is no longer a single-GPU or single-server problem. Frontier models at 100B+ parameters, served at millions of tokens per second, require coordinated multi-node inference with high-bandwidth interconnects. The benchmark is finally catching up to production reality.
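One way to see why coordinated multi-node serving is its own engineering problem is a toy scaling model. The efficiency factor below is a hypothetical stand-in for interconnect and coordination overhead, not a number from any MLPerf submission:

```python
# Toy model: aggregate throughput when each added node scales sub-linearly.
# `efficiency` is a hypothetical placeholder for interconnect/coordination cost.
def cluster_tokens_per_sec(per_node: float, nodes: int, efficiency: float) -> float:
    """Total tokens/sec if every node contributes at the given scaling efficiency."""
    return per_node * nodes * efficiency

# Example: 11 nodes at 100K tokens/sec each, with 90% scaling efficiency
print(f"{cluster_tokens_per_sec(100_000, 11, 0.9):,.0f} tokens/sec")
```

In practice that efficiency term is exactly what high-bandwidth interconnects and serving-stack optimization fight to keep near 1.0, which is why multi-node results are a meaningful benchmark category in their own right.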

Full Performance Results Summary

| Model | Platform | Tokens/sec | Scale |
|---|---|---|---|
| DeepSeek-R1 | NVIDIA GB300 NVL72 | 2,500,000+ | 72 GPUs (1 rack) |
| DeepSeek-R1 | AMD MI355X | 1,016,380 | 11 nodes |
| GPT-OSS-120B | NVIDIA GB300 NVL72 | 1,096,770 | 72 GPUs (1 rack) |
| GPT-OSS-120B | AMD MI355X | 1,031,070 | 12 nodes |
| DeepSeek-R1 | NVIDIA GB300 NVL72 | 575,580 | Single rack (server) |

What This Means for AI Builders

For teams choosing inference infrastructure in 2026, the short version: AMD is now a credible option at cluster scale, NVIDIA retains the per-rack efficiency and software lead, and Intel covers CPU-only deployments.

For teams using multi-model AI platforms like Happycapy — which runs models from OpenAI, Anthropic, Google, and others through a single interface — these benchmark improvements translate directly to faster, cheaper responses as inference providers upgrade their hardware.

Key Takeaways

- AMD's Instinct MI355X is the first platform to pass 1 million tokens/sec in any MLPerf benchmark, though it took 11 nodes; per-node efficiency still favors NVIDIA.
- NVIDIA's GB300 NVL72 leads at rack scale with 2.5M tokens/sec on DeepSeek-R1, and its 2.77x software-only gain since v5.1 shows the moat is increasingly software.
- v6.0 adds five workloads, including MLPerf's first text-to-video benchmark (WAN-2.2) and first multimodal model (Qwen3-VL-235B).
- A record 24 organizations submitted results, and multi-node inference is now a first-class part of the suite.

Sources: AMD blog (amd.com, April 1, 2026), MLCommons official results (mlcommons.org/benchmarks/inference-datacenter), VentureBeat MLPerf v6.0 coverage, Nebius submission documentation. All performance numbers are from official MLCommons submissions as of April 1, 2026.
