Google TurboQuant: 6x LLM Memory Compression with Zero Accuracy Loss
March 25, 2026 · 8 min read · Happycapy Guide
What TurboQuant Does and Why It Matters
The KV (Key-Value) cache is the hidden memory tax of modern LLM inference. Every time you run a model with a long context window, the model stores intermediate attention states — the "working memory" of the transformer — in GPU memory. As context windows expand from 128K to 1M tokens, this cache grows to the point where it consumes more GPU memory than the model weights themselves.
For operators running large-context models at scale, the KV cache is the primary bottleneck that limits throughput, drives hardware costs, and determines how many concurrent users a single GPU can serve. This is the problem TurboQuant solves.
Google Research's algorithm compresses each KV cache value from standard 16-bit floating point down to approximately 3 bits — a 6x reduction — without touching the model weights and without requiring any training data or calibration. Apply it at deployment time to any existing transformer model and it works immediately.
How TurboQuant Works: The Two-Stage Pipeline
TurboQuant operates in two stages designed to compress efficiently while avoiding the bias problems that plague simpler quantization methods:
PolarQuant randomly rotates attention vectors and then converts them from Cartesian coordinates (x, y, z, ...) to polar coordinates (a radius plus angles). This geometric transformation exploits the fact that transformer attention vectors concentrate near predictable spherical surfaces, making them highly compressible with minimal distortion.
Unlike traditional quantization methods (GPTQ, AWQ), PolarQuant requires no per-block scaling constants and no data-dependent calibration. It is applied identically to every model and every batch.
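The two PolarQuant ingredients described above — a random rotation followed by a Cartesian-to-polar change of coordinates — can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not Google's implementation: the function names and dimensions are hypothetical, and the real algorithm also quantizes the resulting radius and angles.

```python
import numpy as np

def random_rotation(dim, seed=0):
    # QR decomposition of a Gaussian matrix yields a random orthogonal
    # matrix, i.e. a uniformly distributed rotation.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix so the rotation is uniform

def to_polar(v):
    # Hyperspherical coordinates: one radius plus (dim - 1) angles.
    # Simplified sketch: the sign convention for the last angle is omitted.
    radius = np.linalg.norm(v)
    angles = []
    for i in range(len(v) - 1):
        tail = np.linalg.norm(v[i:])
        # Guard against division by zero and float drift outside [-1, 1].
        angles.append(np.arccos(np.clip(v[i] / tail, -1.0, 1.0))
                      if tail > 0 else 0.0)
    return radius, np.array(angles)

# Rotate a vector, then re-express it as (radius, angles). The rotation
# preserves the norm, so the radius equals the original vector's length.
Q = random_rotation(8)
v = np.arange(8.0)
radius, angles = to_polar(Q @ v)
```

Because rotations preserve lengths, all of the vector's magnitude information ends up in a single scalar (the radius), leaving only angles — which is what makes the polar form attractive for aggressive quantization.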
After PolarQuant reduces the main representation to ~2 bits, QJL adds approximately 1 bit of error correction using a Johnson-Lindenstrauss projection — a mathematical technique that provably preserves distances between vectors while compressing them.
The combined output is ~3 bits per KV cache element, compared to 16 bits in standard inference. The QJL stage eliminates the bias that basic quantization introduces, which is why TurboQuant achieves zero accuracy loss while competitors often sacrifice 1-3% on downstream benchmarks.
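The sign-bit flavor of the Johnson-Lindenstrauss idea can be demonstrated numerically: project a key through a random Gaussian matrix, keep only the sign of each projected coordinate plus the key's norm, and recover an unbiased estimate of the query-key inner product. A minimal sketch with hypothetical dimensions — this is the textbook construction, not the production kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 8192                        # head dim, projection dim (hypothetical)
S = rng.standard_normal((m, d))        # shared Gaussian JL projection matrix

def qjl_encode(k):
    # Store 1 bit per projected coordinate, plus one scalar (the norm).
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_dot(q, bits, k_norm):
    # For Gaussian projections, E[sign(<s, k>) * <s, q>] = sqrt(2/pi) * <q, k/|k|>,
    # so rescaling by sqrt(pi/2) * |k| / m gives an unbiased estimate of <q, k>.
    return np.sqrt(np.pi / 2) * k_norm * (S @ q) @ bits / m

# A key close to the query: the 1-bit estimate tracks the true inner product.
q = rng.standard_normal(d)
k = q + 0.1 * rng.standard_normal(d)
bits, k_norm = qjl_encode(k)
est, true = qjl_dot(q, bits, k_norm), q @ k
```

The rescaling constant is what makes the estimator unbiased — without it, sign quantization systematically shrinks inner products, which is exactly the bias problem described above.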
Real-World Cost Impact
The practical economics of TurboQuant are significant. Consider a 70B parameter model serving 512 concurrent users on a 1M-token context window:
| Scenario | KV Cache Memory | GPU Requirement | Est. Monthly Cost |
|---|---|---|---|
| Standard 16-bit inference | 512 GB | 8× H100 80GB | ~$3,000 |
| With TurboQuant (6x compression) | ~85 GB | 2× H100 80GB | ~$750 |
| Cost reduction | 83% less memory | 75% fewer GPUs | 75% cost savings |
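To see where figures like these come from, KV cache size can be estimated directly from the model architecture: two cached tensors (keys and values) per layer, per KV head, per token. A rough calculator, shown here with a hypothetical 8B-class configuration (32 layers, 8 grouped-query KV heads, head dimension 128) rather than the 70B setup in the table:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   n_seqs, bits_per_elem):
    # Keys and values are cached separately, hence the factor of 2.
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len * n_seqs
    return elems * bits_per_elem // 8

# Illustrative 8B-class config at a 128K-token context, one sequence.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128,
           context_len=128_000, n_seqs=1)
fp16 = kv_cache_bytes(**cfg, bits_per_elem=16)   # ~16.8 GB per sequence
q3 = kv_cache_bytes(**cfg, bits_per_elem=3)      # ~3.1 GB per sequence
```

Note that going from 16 bits to exactly 3 bits is a 16/3 ≈ 5.3x reduction; the headline "6x" figure reflects the paper's "approximately 3 bits" accounting.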
These numbers explain the immediate stock market reaction. If enterprise operators can run the same AI inference on 75% fewer memory chips, demand for Samsung HBM, SK Hynix HBM4, and Micron DRAM could fall significantly at existing capacity levels.
As of April 2026, TurboQuant has been benchmarked only on 7B–8B parameter models (Llama-3.1-8B, Mistral-7B, Gemma). Performance on 70B+ models has not been published. The algorithm's scalability to enterprise-grade frontier models is an open question that will likely be addressed at ICLR 2026 in April.
Happycapy handles all inference infrastructure. You get Claude, GPT-5.4, Gemini 3, Mistral, and 150+ models — no KV cache management, no GPU provisioning, no ops overhead. Pro starts at $17/month.
Try Happycapy Free →

TurboQuant vs. NVIDIA KVTC: The ICLR 2026 Showdown
TurboQuant is not the only compression algorithm presenting at ICLR 2026. NVIDIA is simultaneously presenting KVTC (Key-Value Tensor Compression), its own approach to the same problem. The two methods have strikingly different tradeoffs:
| Property | Google TurboQuant | NVIDIA KVTC |
|---|---|---|
| Max compression | 6x (~3 bits/element) | 20x |
| Accuracy loss | Zero | Minor (reported ~1%) |
| Requires calibration data | No | Yes (one-time PCA per model) |
| Setup complexity | Drop-in deployment | Requires per-model preprocessing |
| Best for | Zero-config deployment, accuracy-critical applications | Maximum compression when minor accuracy trade-off is acceptable |
For most production use cases, TurboQuant's zero-accuracy-loss, no-calibration profile will win. KVTC's 20x compression is compelling for edge deployment where memory is an absolute constraint, but the accuracy penalty makes it a poor fit for precision-critical enterprise applications.
Benchmarks: What Google Tested
Google tested TurboQuant on Llama-3.1-8B, Mistral-7B, and Gemma models across five long-context benchmarks:
- Needle in a Haystack — perfect recall across all compression levels
- LongBench — matched uncompressed baseline on all subtasks
- ZeroSCROLLS — no degradation on summarization and QA tasks
- RULER — full accuracy on multi-hop retrieval tasks
- L-Eval — matched baseline on document understanding
Additionally, TurboQuant was tested on vector search tasks using the GloVe dataset, where it outperformed Product Quantization (PQ) — the current industry standard — at equivalent compression ratios. This makes it relevant not just for LLM inference but for any system using approximate nearest neighbor search at scale.
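The connection to vector search is easiest to see with the classic sign-random-projection (SimHash) trick, a close cousin of 1-bit JL codes: Hamming distance between binary codes approximates angular distance, so the nearest codes approximate the nearest neighbors by cosine similarity. A toy index over random GloVe-sized vectors — illustrative only, not Google's benchmark code:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits, n_vecs = 50, 256, 1000       # GloVe-50-like shape (hypothetical)
db = rng.standard_normal((n_vecs, dim))
planes = rng.standard_normal((n_bits, dim))

# One bit per hyperplane: which side of the plane the vector falls on.
codes = db @ planes.T > 0                 # (n_vecs, n_bits) boolean codes

def search(query, top_k=5):
    # Hamming distance between codes estimates the angle between vectors,
    # so sorting by it approximates cosine-similarity search.
    q_code = planes @ query > 0
    hamming = (codes != q_code).sum(axis=1)
    return np.argsort(hamming)[:top_k]

hits = search(db[7])                      # querying with a database vector
```

Storing 256 bits instead of 50 float32 values is a 6.25x compression of the index while keeping the ranking approximately intact — the same memory-versus-fidelity trade the GloVe benchmark measures.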
What This Means for AI Users and Builders
For enterprise AI teams, TurboQuant means running 1M-context workloads becomes economically viable on existing hardware. The 75% memory reduction directly translates to serving more users per GPU — which is the single most important cost driver in production LLM deployments.
For AI consumers using platforms like Happycapy, TurboQuant's downstream effect is faster responses and lower per-query costs as cloud providers adopt the algorithm. Gemini, being a Google model, is likely to be the first major frontier model to run TurboQuant at scale.
For the hardware industry, TurboQuant represents the software layer catching up with the hardware buildout. When algorithms can extract 6x more work from existing memory, the near-term demand ceiling for HBM and DRAM in AI data centers looks lower than previously modeled — which is precisely why Samsung, SK Hynix, and Micron sold off sharply on announcement day.
Happycapy runs on the fastest available inference infrastructure for every model. Claude Opus 4.6, GPT-5.4, Gemini 3 Pro, and 150+ others — in one workspace starting free.
Start Free on Happycapy →

Frequently Asked Questions
What is TurboQuant?
TurboQuant is a vector quantization algorithm introduced by Google Research on March 25, 2026. It compresses the KV cache used in LLM inference from 16-bit to approximately 3-bit representation, achieving 6x memory reduction with zero accuracy loss and no model retraining. On NVIDIA H100 GPUs, it delivers up to 8x speedup in attention computation.
How does TurboQuant work?
TurboQuant uses a two-stage pipeline. PolarQuant rotates attention vectors into polar coordinates, enabling high compression without calibration data. Quantized Johnson-Lindenstrauss (QJL) then applies approximately 1 bit of error correction to eliminate quantization bias. The combined approach achieves ~3 bits per KV cache element with no accuracy loss, unlike traditional methods such as GPTQ or AWQ that require per-model calibration.
Why does the KV cache matter?
The KV cache stores intermediate attention states during LLM inference. As context windows expand to 1M tokens, the cache becomes the dominant GPU memory bottleneck. A 70B model serving 512 users at 1M context requires ~512 GB of KV cache memory; TurboQuant reduces this to ~85 GB, potentially cutting the required GPU count from eight H100s to two.
Is TurboQuant open source?
Google has not released official code as of April 2026. TurboQuant will be presented at ICLR 2026 (April 23-27). Community implementations in PyTorch, MLX, Triton, and llama.cpp already exist based on the published paper, and official open-source code from Google is expected around Q2 2026.