
By Connie · Last reviewed: April 2026 — pricing & tools verified


Google llm-d: The Open-Source Project Reshaping Distributed AI Inference

April 4, 2026
TL;DR
Google Cloud, Red Hat, IBM Research, CoreWeave, and NVIDIA have jointly launched llm-d as a CNCF Sandbox project — an open-source framework for running large language models across distributed infrastructure. It standardizes multi-node LLM serving, separates prefill from decode workloads, and is designed to become the Kubernetes of AI inference.

On April 2, 2026, five of the most influential names in enterprise infrastructure announced a joint initiative that could do for AI serving what Kubernetes did for container orchestration. llm-d — short for LLM Distributed — is now an official CNCF (Cloud Native Computing Foundation) Sandbox project, backed by Google Cloud, Red Hat, IBM Research, CoreWeave, and NVIDIA.

The timing is not coincidental. Frontier models have grown so large that running them on a single GPU is no longer viable. GPT-5.4 and Claude Opus 4.6 each require hundreds of accelerators to serve at production scale. The industry has fragmented around proprietary serving stacks — every cloud provider builds its own. llm-d is a bet that the market needs an open standard.

What llm-d Actually Does

llm-d is not a new language model. It is a distributed orchestration layer that sits above existing inference engines like vLLM, TGI (Text Generation Inference), and TensorRT-LLM. Think of it as the scheduling and routing brain that coordinates how requests flow across a fleet of GPU workers.
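The layering described above can be sketched as a thin router over pluggable engines. Note this is a hypothetical illustration: the backend names mirror the engines mentioned in the text, but the `Orchestrator` class and `generate()` signature are assumptions, not the real llm-d API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Backend:
    """One inference engine (vLLM, TGI, TensorRT-LLM) behind a common interface."""
    name: str
    generate: Callable[[str], str]

class Orchestrator:
    """Sketch of an orchestration layer: it decides *where* a request runs;
    the engine underneath decides *how* tokens get generated."""

    def __init__(self) -> None:
        self.backends: Dict[str, Backend] = {}

    def register(self, backend: Backend) -> None:
        self.backends[backend.name] = backend

    def route(self, prompt: str, engine: str) -> str:
        # Real schedulers pick the engine/worker from load and topology;
        # here the caller names it explicitly for clarity.
        return self.backends[engine].generate(prompt)

orch = Orchestrator()
orch.register(Backend("vllm", lambda p: f"[vllm] {p}"))
orch.register(Backend("tgi", lambda p: f"[tgi] {p}"))
assert orch.route("hello", "vllm") == "[vllm] hello"
```

The point of the separation is that backends stay swappable: the routing policy can evolve independently of any one engine's internals.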

The core architectural insight behind llm-d is disaggregated prefill and decode. In standard LLM inference, every request goes through two phases: prefill (processing the input prompt) and decode (generating each output token one by one). These have very different compute profiles. Prefill is compute-heavy and latency-tolerant. Decode is memory-bandwidth-bound and latency-sensitive. llm-d routes them to different hardware pools — maximizing throughput without sacrificing response time.

Key innovation: By separating prefill and decode workloads, llm-d achieves up to 3x higher GPU utilization compared to monolithic serving stacks, according to Google Cloud's internal benchmarks shared at the launch.
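The prefill/decode split can be sketched as a two-pool scheduler. The worker names, pool sizes, and round-robin policy below are illustrative assumptions, not llm-d internals; the real scheduler would weigh load, memory pressure, and interconnect topology.

```python
from collections import deque

class PhasePools:
    """Toy disaggregated scheduler: prefill and decode requests are
    dispatched to separate worker pools with different hardware profiles."""

    def __init__(self, prefill_workers, decode_workers):
        self.pools = {
            "prefill": deque(prefill_workers),  # compute-heavy, latency-tolerant
            "decode": deque(decode_workers),    # bandwidth-bound, latency-sensitive
        }

    def dispatch(self, phase: str) -> str:
        # Round-robin inside the phase's pool: take the front worker,
        # then rotate it to the back.
        pool = self.pools[phase]
        worker = pool[0]
        pool.rotate(-1)
        return worker

pools = PhasePools(["pf-0", "pf-1"], ["dc-0", "dc-1", "dc-2"])
assert pools.dispatch("prefill") == "pf-0"
assert pools.dispatch("prefill") == "pf-1"
assert pools.dispatch("decode") == "dc-0"
```

Because the two pools scale independently, an operator can provision compute-dense nodes for prefill and bandwidth-dense nodes for decode instead of over-provisioning one homogeneous fleet.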

The Five Founding Partners and Their Roles

Partner | Contribution
Google Cloud | Core scheduling engine and GKE integration; originated the disaggregated serving research
Red Hat | OpenShift AI integration; enterprise Kubernetes hardening and RBAC policy
IBM Research | KV cache disaggregation algorithms; research on memory-bandwidth optimization
CoreWeave | High-density GPU cluster deployment; bare-metal performance benchmarking
NVIDIA | Hardware-level optimizations; TensorRT-LLM backend integration; NVLink topology awareness

Why CNCF Matters Here

Submitting llm-d to the CNCF Sandbox is a deliberate governance move. CNCF (the foundation behind Kubernetes, Prometheus, and Envoy) provides neutral stewardship — no single company controls the roadmap. This directly addresses the concern that enterprise AI infrastructure would permanently splinter along cloud provider lines.

Amazon Web Services, Microsoft Azure, and Oracle Cloud are not founding partners, which means llm-d is immediately positioned as a challenger to their proprietary serving stacks. AWS has SageMaker LMI (Large Model Inference). Azure has its own vCore-based serving layer. Whether those ecosystems adopt llm-d or resist it will define how this plays out over the next two years.

llm-d vs. Existing Serving Frameworks

Framework | Type | Distributed? | Key Strength
llm-d | Orchestration layer | Yes — native multi-node | Disaggregated prefill/decode, KV cache routing, open governance
vLLM | Serving engine | Partial (tensor parallel) | PagedAttention, wide model support, large community
TGI (Hugging Face) | Serving engine | Limited | Easy model hub integration, developer-friendly
TensorRT-LLM | Optimized runtime | Yes (tensor parallel) | Maximum NVIDIA hardware utilization
SGLang | Serving engine | Limited | RadixAttention for prefix caching, fast structured output

The KV Cache Disaggregation Innovation

One of llm-d's most technically significant contributions is KV cache disaggregation — a mechanism developed in collaboration with IBM Research. In standard serving, the KV (key-value) cache generated during prefill lives on the same GPU that will run the decode phase. This creates imbalanced memory pressure and forces over-provisioning.

llm-d moves the KV cache to a shared memory tier that both prefill and decode workers can access. This is architecturally similar to what Google TurboQuant does at the compression layer — both approaches attack the same root problem: GPU memory is the binding constraint on LLM serving cost and throughput.
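A toy model of that shared tier: a prefill worker writes the prompt's KV blocks into a common store, and any decode worker can later claim them. The dict-backed store and one-block-per-token mapping are simplifying assumptions; in practice the tier would be pooled GPU/CPU memory reachable over a fast interconnect.

```python
class SharedKVTier:
    """Shared KV cache tier that both prefill and decode workers can access."""

    def __init__(self):
        self._store = {}

    def put(self, request_id: str, kv_blocks: list) -> None:
        # Written by a prefill worker after processing the prompt.
        self._store[request_id] = kv_blocks

    def take(self, request_id: str) -> list:
        # Claimed by whichever decode worker picks up the request.
        return self._store.pop(request_id)

def prefill(tier: SharedKVTier, request_id: str, prompt_tokens: list) -> None:
    # Simplification: pretend each prompt token yields one KV block.
    tier.put(request_id, [f"kv:{t}" for t in prompt_tokens])

def decode_start(tier: SharedKVTier, request_id: str) -> int:
    # Decode begins with the full prompt cache, wherever prefill ran.
    return len(tier.take(request_id))

tier = SharedKVTier()
prefill(tier, "req-1", ["the", "quick", "fox"])
prefill(tier, "req-2", ["hi", "there"])
assert decode_start(tier, "req-1") == 3
```

The decoupling is the whole trick: because the cache no longer lives on the prefill GPU, decode capacity can be scheduled independently of where the prompt was processed.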

What This Means for Enterprise AI Buyers

For enterprises running private model deployments — whether on-premises or in their own cloud accounts — llm-d solves a real and expensive problem. Today, standing up a production serving cluster for a 70B+ parameter model requires deep expertise in CUDA, tensor parallelism, and cloud networking. Most organizations do not have that expertise.

llm-d's Kubernetes-native design means it fits into existing DevOps workflows. Helm charts, Prometheus metrics, and standard Kubernetes RBAC all work out of the box. For Red Hat OpenShift customers in particular, llm-d is expected to become the default AI serving layer in OpenShift AI by Q3 2026.

Frequently Asked Questions

What is llm-d?
llm-d is an open-source CNCF Sandbox project for distributed LLM inference. Founded by Google Cloud, Red Hat, IBM Research, CoreWeave, and NVIDIA, it provides a Kubernetes-native orchestration layer that splits prefill and decode workloads across multiple GPU nodes to maximize throughput and minimize cost.
How does llm-d differ from vLLM?
vLLM is a single-node serving engine optimized for PagedAttention and high throughput on one machine. llm-d is a multi-node orchestration framework that can use vLLM as a backend while adding distributed scheduling, KV cache disaggregation, and cross-node load balancing.
Is llm-d production-ready?
As a CNCF Sandbox project, llm-d is in active development and early adoption phase. Sandbox status means it has passed initial technical review but has not yet reached CNCF Incubating or Graduated status. Production use is expected from early adopters by Q3 2026.
Which cloud platforms support llm-d?
At launch, llm-d has native integrations with Google Kubernetes Engine (GKE), Red Hat OpenShift, and CoreWeave's managed Kubernetes. AWS EKS and Azure AKS support is planned via community contributions, though neither AWS nor Microsoft is a founding partner.