April 14, 2026 · 7 min read
Anthropic Quietly Cut Claude Code's Cache TTL from 1 Hour to 5 Minutes — Developers Are Furious
TL;DR
- Anthropic silently reduced Claude Code prompt cache TTL from 1 hour → 5 minutes in April 2026
- Cache misses can now happen up to 12x more frequently — developers with long system prompts pay full input pricing on every refresh
- No changelog, no announcement, no advance notice — discovered via billing anomalies
- Workarounds: keepalive pings, prompt splitting, retrieval layers, or switching to providers with longer TTLs
- Happycapy routes to multiple model providers — reducing exposure to single-vendor policy changes
On April 14, 2026, developers using Anthropic's Claude Code API began reporting a sudden spike in token usage with no corresponding increase in requests. The cause: Anthropic had silently reduced the prompt cache TTL — the window during which a cached system prompt stays valid — from 1 hour to 5 minutes. No changelog entry. No email. No advance notice. Developers found out through their billing dashboards.
What Prompt Caching Does and Why TTL Matters
Anthropic's prompt caching lets developers store long system prompts on Anthropic's infrastructure so they don't have to re-send — and re-pay for — the same thousands of tokens on every API call. The cache is keyed to a hash of the prompt content and expires after a TTL window; each cache read refreshes the clock, so the cache only goes cold after a gap longer than the TTL.
Under the old 1-hour TTL, a developer with a 50,000-token system prompt could leave up to 60 minutes between calls and still hit a warm cache. Under the new 5-minute TTL, any gap longer than 300 seconds forces a full cache rewrite at full input token pricing.
| Scenario | Old TTL (1 hr) | New TTL (5 min) |
|---|---|---|
| Pipeline, call every 6 min | First call only (~1 miss/hr) | Every call misses ≈ 10/hr |
| Batch job, runs hourly | 1 cache miss/hr | 1 miss per batch run (unchanged) |
| User-facing app with bursty traffic | 1 warm-up per session | Cold start after any lull longer than 5 min |
| Overnight automation (runs 2am–4am) | 1–2 cache misses total | 1 cache miss per run if gap >5 min |
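The table above follows from a simple model — a sketch, assuming the TTL refreshes on every cache read (the refresh-on-read behavior Anthropic documents for its cache); the function name is ours:

```python
def misses_per_hour(call_interval_s: float, ttl_s: float) -> float:
    """Estimated cache misses per hour for a steady call cadence.

    Simplified model: the TTL clock restarts on every cache read, so a
    cadence faster than the TTL keeps the cache warm after one warm-up,
    while a slower cadence misses on every single call.
    """
    calls_per_hour = 3600 / call_interval_s
    if call_interval_s <= ttl_s:
        # Every call lands inside the window; only the first call misses.
        return 1.0
    # Every call arrives after the cache has already expired.
    return calls_per_hour

# Calls every 6 minutes: warm under the old 1-hr TTL, always cold under 5 min.
old = misses_per_hour(call_interval_s=360, ttl_s=3600)  # 1.0
new = misses_per_hour(call_interval_s=360, ttl_s=300)   # 10.0
```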
The Real Cost Impact
For a developer with a 50,000-token system prompt making API calls every 6 minutes (just outside the 5-minute TTL window), the math is brutal:
- Cache hit price: ~$0.0015 per 1,000 input tokens (cached reads are priced at roughly 10% of the base input rate)
- Cache miss price: ~$0.015 per 1,000 input tokens (full input rate)
- 50,000 tokens × 10 cache misses/hour = 500,000 input tokens/hour billed at full rate
- vs. 50,000 tokens × 1 cache miss/hour under old TTL = 50,000 tokens at full rate
- 10x more system prompt tokens billed at the full rate, from this single change
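In code, the hourly math looks like this (rates are the approximate figures above, not official pricing; note that once the cheap cached reads are counted too, the total dollar cost rises roughly 5x while the full-rate token volume rises 10x):

```python
PROMPT_TOKENS = 50_000
FULL_RATE = 0.015 / 1_000     # $/token, full input rate (approximate)
CACHED_RATE = 0.0015 / 1_000  # $/token, cached read rate (~10% of full, approximate)

def hourly_prompt_cost(misses: float, calls: float) -> float:
    """Dollar cost per hour of re-sending the system prompt."""
    hits = calls - misses
    return PROMPT_TOKENS * (misses * FULL_RATE + hits * CACHED_RATE)

# 10 calls/hour (one every 6 minutes):
old = hourly_prompt_cost(misses=1, calls=10)   # old 1-hr TTL: ~$1.43/hr
new = hourly_prompt_cost(misses=10, calls=10)  # new 5-min TTL: $7.50/hr
```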
Why Anthropic Made the Change (Likely)
Anthropic has not issued a public statement. Based on community analysis, the most likely explanations are:
- Infrastructure capacity — With Claude Code adoption surging, holding millions of large prompt caches for 60 minutes requires a significant memory footprint. A 5-minute TTL reduces memory pressure by approximately 12x for the same number of users.
- Revenue optimization — Cache hits are priced at a significant discount to full input tokens. Reducing TTL converts cache hits into higher-margin full-price token reads.
- Model version migration — Claude Code has been transitioning users across model versions; shorter TTLs make cache invalidation across model updates more manageable.
4 Workarounds Developers Are Using Right Now
1. Keepalive ping every 4 minutes
Send a minimal API call that includes the full system prompt every 4 minutes (just inside the TTL window) during active pipeline windows. Each ping is billed at the cached-read rate for the system prompt plus a handful of tokens for the message body — far cheaper than a full cache miss on a 50K-token system prompt.
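A minimal sketch of such a keepalive loop — `ping` here is a placeholder for whatever minimal API call your pipeline makes with the full cached system prompt:

```python
import threading

TTL_SECONDS = 300    # new 5-minute TTL
PING_INTERVAL = 240  # 4 minutes: safely inside the window

def keepalive(ping, stop: threading.Event, interval: float = PING_INTERVAL) -> int:
    """Call `ping()` every `interval` seconds until `stop` is set.

    `ping` should issue a minimal request that reads the cached system
    prompt, refreshing its TTL. Returns the number of pings sent.
    """
    count = 0
    while not stop.wait(interval):  # sleeps, exits early once stop is set
        ping()
        count += 1
    return count
```

Run it in a background thread for the duration of an active pipeline window, and set the event when the window ends.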
2. Split system prompts by refresh rate
Separate your system prompt into a small high-frequency section (role, format, tone) and a large low-frequency section (reference data, knowledge base). Cache the small section — cheaper if it misses. Move the large section to a retrieval layer that only injects relevant chunks per request.
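A sketch of how the split looks with Anthropic's Messages API, which accepts `system` as a list of content blocks and caches the prefix up to a block marked with `cache_control` (the helper name is ours):

```python
def build_system_blocks(core_prompt: str, retrieved_chunks: list[str]) -> list[dict]:
    """Assemble the `system` parameter: cache only the small, stable core;
    inject per-request chunks after it, uncached."""
    blocks = [{
        "type": "text",
        "text": core_prompt,
        "cache_control": {"type": "ephemeral"},  # cache the stable core only
    }]
    for chunk in retrieved_chunks:
        blocks.append({"type": "text", "text": chunk})  # varies per request
    return blocks
```

Because the cached prefix is now small, even a cold start costs little, and the bulky reference material never participates in cache-miss billing.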
3. Retrieval-augmented architecture
Replace monolithic system prompts with a RAG layer. Rather than loading all context into the system prompt, retrieve the 3–5 most relevant chunks per query. This reduces both the base token cost and the cache miss penalty to nearly nothing.
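A toy illustration of the idea — a keyword-overlap retriever standing in for a real embedding index:

```python
def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Score each chunk by word overlap with the query; return the k best.

    A production system would use embeddings and a vector store, but the
    cost effect is identical: a few small chunks enter the context window
    instead of the entire knowledge base.
    """
    q = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]
```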
4. Provider diversification for batch workloads
For intermittent batch jobs where a cache keepalive is impractical, route those workloads to a provider with more predictable caching behavior (or no cache TTL sensitivity) to reduce exposure to single-vendor policy changes.
Don't let one provider's policy change blow up your costs.
Happycapy routes tasks across Claude, GPT-5, and Gemini — so you're never fully exposed to a single vendor's cache policy, pricing change, or outage.
Try Happycapy Free