MLPerf Inference v6.0: AMD Breaks 1M Tokens/Sec, First Video Generation Benchmark
TL;DR: MLCommons released MLPerf Inference v6.0 results on April 1, 2026. AMD's Instinct MI355X broke the 1 million tokens-per-second barrier at multi-node scale. NVIDIA's GB300 NVL72 hit 2.5M tokens/sec on a single rack. Five new workloads were added, including the first-ever text-to-video generation benchmark and a 235B multimodal model. 24 organizations submitted results — a new participation record.
Every six months, MLCommons releases MLPerf Inference results — the AI industry's closest equivalent to a standardized benchmark for hardware and software inference performance. The v6.0 round, covering results as of April 1, 2026, is the most significant update since the suite launched.
The headline: AMD is genuinely competitive now. And the benchmarks themselves have finally caught up to where AI workloads actually are in 2026 — multimodal, video-generating, and running at scales that would have seemed impossible in 2024.
Five New Workloads That Reflect Real AI in 2026
The most important change in v6.0 is what got added. Previous rounds measured LLM text generation and recommendation models — workloads from the 2022–2023 era. v6.0 adds:
| New Workload | Type | Why It Matters |
|---|---|---|
| Qwen3-VL-235B-A22B | Multimodal vision-language (MoE) | First multimodal model in the suite — tests image + text inference together |
| GPT-OSS-120B | MoE reasoning LLM (OpenAI open weights) | Tests sparse model inference — a dominant architecture in 2026 |
| WAN-2.2-T2V-A14B | Text-to-video generation | First video generation workload ever in MLPerf — 14B parameters |
| DeepSeek-R1 Interactive | Reasoning LLM (low-latency scenario) | Tests real-time interactive reasoning, not just batch throughput |
| DLRMv3 | Transformer-based recommendation | Replaces the older DLRM-DCNv2 with hyperscale-relevant compute intensity |
The inclusion of WAN-2.2 text-to-video is particularly significant. Video generation has become one of the most compute-intensive inference workloads in production — used by Runway, Kling, and dozens of enterprise video tools. This is the first time it has been formally benchmarked across hardware vendors in a standardized setting.
NVIDIA: Still Dominant, But Software Is the New Moat
NVIDIA's GB300 NVL72 — a rack-scale system with 72 Blackwell Ultra GPUs — achieved 2.5 million tokens per second on DeepSeek-R1, earning 10 first-place results across the 16 benchmark categories.
More interesting than the raw numbers: software optimization delivered a 2.77x throughput gain on the GB300 NVL72 compared to its debut in v5.1, six months earlier — with no hardware changes. NVIDIA's cumulative MLPerf wins since 2018 stand at 291, nine times those of all other submitters combined.
The pattern is clear: NVIDIA's lead is increasingly a software story, not just silicon. Their inference stack (TensorRT-LLM, NVLink fabric optimization, serving architecture) compounds faster than hardware can be replicated.
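The compounding effect of serving-stack work is easiest to see with a toy model of request batching, one of the main levers behind software-only throughput gains. Every number below is illustrative, not taken from the MLPerf submissions:

```python
# Toy model of why serving-stack choices move throughput on fixed hardware:
# batching amortizes fixed per-step costs across concurrent requests.
# The overhead and per-token costs are illustrative assumptions.

def tokens_per_sec(batch_size: int,
                   step_overhead_ms: float = 4.0,
                   per_token_ms: float = 0.05) -> float:
    """Decode throughput when one step serves `batch_size` requests.

    Each step pays a fixed overhead (kernel launches, scheduling)
    plus a small per-token cost; larger batches amortize the overhead.
    """
    step_ms = step_overhead_ms + per_token_ms * batch_size
    return batch_size / (step_ms / 1000.0)

for bs in (1, 32, 256):
    print(f"batch={bs:<4} {tokens_per_sec(bs):>10,.0f} tokens/sec")
```

The exact speedup depends on the overhead-to-compute ratio, but the shape of the curve is why schedulers, KV-cache management, and batching policies can multiply throughput on unchanged silicon.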
AMD: The 1 Million Tokens/Sec Milestone
The biggest competitive story in v6.0 is AMD. The Instinct MI355X platform became the first to surpass 1 million tokens per second in any MLPerf benchmark — achieving 1,016,380 tokens/sec on DeepSeek-R1 at multi-node scale (11 nodes).
| Metric | AMD MI355X | NVIDIA GB300 NVL72 |
|---|---|---|
| DeepSeek-R1 (multi-node tokens/sec) | 1,016,380 | 575,580 (single rack) |
| GPT-OSS-120B (multi-node tokens/sec) | 1,031,070 | 1,096,770 (single rack) |
| Single-stream performance vs B200 | 93% (108% after tuning) | 100% (baseline) |
| Partner ecosystem (submitters) | 9 partners | Dominant across all categories |
Nine ecosystem partners submitted results on AMD hardware, tying the record for any single platform in one round. That breadth matters for enterprise buyers who need hardware supply chain diversity.
The caveat: AMD's 1M tokens/sec required 11 nodes vs NVIDIA's single-rack result. Per-node efficiency still favors NVIDIA. But the multi-node story matters for hyperscalers building at cluster scale.
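The per-accelerator gap is easy to check with back-of-envelope arithmetic from the headline numbers. The 8-GPUs-per-node figure for the AMD submission is an assumption for illustration; the totals are the ones quoted above:

```python
# Back-of-envelope per-accelerator throughput on DeepSeek-R1.
# amd_gpus assumes 8 MI355X per node (an illustrative assumption);
# the throughput totals are the headline figures from this round.

amd_total = 1_016_380        # tokens/sec, 11 nodes
amd_gpus = 11 * 8            # assumed 8 GPUs per node
nvda_total = 2_500_000       # tokens/sec, one GB300 NVL72 rack
nvda_gpus = 72

print(f"AMD    per GPU: {amd_total / amd_gpus:>9,.0f} tokens/sec")
print(f"NVIDIA per GPU: {nvda_total / nvda_gpus:>9,.0f} tokens/sec")
```

Under those assumptions NVIDIA delivers roughly 3x the per-accelerator throughput, which is why the AMD result is best read as a cluster-scale milestone rather than a per-chip one.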
Intel: The CPU-Only Outlier
Intel submitted results for its Arc Pro B-Series GPUs, positioned for accessible inference: a 4-GPU Arc Pro B70 configuration provides 128GB of total VRAM — enough to run 120B-parameter models without splitting them across machines. Separately, Intel remains the only server processor vendor to submit CPU-only MLPerf Inference results, targeting workloads where dedicated GPU infrastructure isn't available.
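The "128GB fits a 120B model" claim comes down to weight precision, which a quick memory calculation makes concrete. Both precisions below are illustrative choices, and the estimate ignores KV cache and activations:

```python
# Quick memory check: does 128 GB of pooled VRAM hold 120B parameters?
# Depends on weight precision; ignores KV cache and activation memory.

PARAMS = 120e9
GIB = 1024 ** 3
BUDGET_GIB = 128

for name, bytes_per_param in [("FP8  (1 byte/param) ", 1),
                              ("FP16 (2 bytes/param)", 2)]:
    weight_gib = PARAMS * bytes_per_param / GIB
    fits = weight_gib < BUDGET_GIB
    print(f"{name}: {weight_gib:6.1f} GiB of weights -> fits: {fits}")
```

At 8-bit precision the weights land just under the budget (~112 GiB), while 16-bit weights are roughly double it — so single-box serving of models this size hinges on low-precision formats.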
For enterprises running inference on existing CPU infrastructure — a large segment of the market — Intel's continued participation in MLPerf provides the only apples-to-apples CPU benchmarking data.
System Scale: Multi-Node Goes Mainstream
v6.0 marks a structural shift in how AI inference is being deployed:
- 30% increase in multi-node system submissions vs v5.1
- Largest system: 72 nodes, 288 accelerators — 4x the previous record
- 10% of all systems had more than 10 nodes (vs 2% in v5.1)
This reflects the real-world shift happening in production: inference is no longer a single-GPU or single-server problem. Frontier models at 100B+ parameters, served at millions of tokens per second, require coordinated multi-node inference with high-bandwidth interconnects. The benchmark is finally catching up to production reality.
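The move to multi-node is largely a sizing exercise: hitting an aggregate tokens/sec target when each added node scales slightly sub-linearly. The per-node throughput and scaling efficiency below are illustrative assumptions, not MLPerf figures:

```python
# Aggregate throughput sizing: nodes needed to hit a tokens/sec target
# when scaling is slightly sub-linear. All inputs here are illustrative.

import math

def nodes_for_target(target_tps: float, per_node_tps: float,
                     scaling_eff: float = 0.9) -> int:
    """Nodes needed when each node delivers scaling_eff of its ideal rate."""
    return math.ceil(target_tps / (per_node_tps * scaling_eff))

# e.g. a 1M tokens/sec target with ~100k tokens/sec per node:
print(nodes_for_target(1_000_000, per_node_tps=100_000))  # → 12
```

Interconnect quality shows up in `scaling_eff`: the closer a fabric keeps it to 1.0, the fewer nodes a given throughput target requires, which is why high-bandwidth interconnects dominate multi-node inference design.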
Full Performance Results Summary
| Model | Platform | Tokens/sec | Scale |
|---|---|---|---|
| DeepSeek-R1 | NVIDIA GB300 NVL72 | 2,500,000+ | 72 GPUs (1 rack) |
| DeepSeek-R1 | AMD MI355X | 1,016,380 | 11 nodes |
| GPT-OSS-120B | NVIDIA GB300 NVL72 | 1,096,770 | 72 GPUs (1 rack) |
| GPT-OSS-120B | AMD MI355X | 1,031,070 | 12 nodes |
| DeepSeek-R1 | NVIDIA GB300 NVL72 | 575,580 | Single rack (server) |
What This Means for AI Builders
For teams choosing inference infrastructure in 2026:
- NVIDIA H100/B200 remains the default for most use cases — the ecosystem, software stack, and per-accelerator efficiency are unmatched
- AMD is a credible alternative for hyperscale multi-node deployments — especially as supply chain diversity becomes a priority
- Video generation benchmarks are now comparable across vendors for the first time — useful for teams evaluating text-to-video at scale
- Software optimization matters more than ever — NVIDIA's 2.77x gain from pure software tuning is a reminder that your serving stack is as important as your hardware
For teams using multi-model AI platforms like Happycapy — which runs models from OpenAI, Anthropic, Google, and others through a single interface — these benchmark improvements translate directly to faster, cheaper responses as inference providers upgrade their hardware.
Key Takeaways
- MLPerf Inference v6.0 released April 1, 2026 — the most significant update to the benchmark suite to date
- AMD MI355X first to break 1 million tokens/sec in any MLPerf benchmark (multi-node)
- NVIDIA GB300 NVL72 hits 2.5M tokens/sec on DeepSeek-R1 from a single rack
- Five new workloads: first multimodal model, first text-to-video model, new MoE reasoning and recommendation benchmarks
- 30% more multi-node systems; 10% of all submitted systems have more than 10 nodes
- Software optimization delivered 2.77x gain on existing NVIDIA hardware — no new silicon needed
- 24 organizations submitted results — a new participation record
Sources: AMD blog (amd.com, April 1, 2026), MLCommons official results (mlcommons.org/benchmarks/inference-datacenter), VentureBeat MLPerf v6.0 coverage, Nebius submission documentation. All performance numbers are from official MLCommons submissions as of April 1, 2026.