HappycapyGuide

By Connie · Last reviewed: April 2026 — pricing & tools verified · This article contains affiliate links. We may earn a commission at no extra cost to you if you sign up through our links.

AI News

April 7, 2026 · 7 min read

Google Launches LiteRT-LM: Run LLMs on Your Phone Without the Cloud

Google released LiteRT-LM on April 7, 2026 — a production-grade, open-source framework for running large language models on smartphones and IoT devices. It enables fast, private on-device AI inference, and is the runtime behind Gemini Nano 4 on Android.

TL;DR

  • Google released LiteRT-LM on April 7, 2026 — the runtime powering Gemini Nano 4 on Android
  • Runs full LLMs on smartphones without any cloud connection or API call
  • Open source under Apache 2.0; production-grade, not a research prototype
  • Supports GPU/NPU acceleration on Android — significantly faster than llama.cpp on mobile
  • Gemma 4 E2B and E4B are the primary supported models at launch

Why On-Device AI Is the Next Big Shift

Every AI tool you use today — ChatGPT, Happycapy, Claude, Gemini — routes your requests to data centers. That works for most tasks. But it creates three friction points: latency (100–500ms per request), privacy (your data leaves your device), and availability (you need internet).

On-device LLMs eliminate all three. When the model runs on your phone's chip, responses are instant, nothing leaves your device, and the feature works in airplane mode.

The barrier has been performance. Mobile chips lack the raw compute of server GPUs, and most inference frameworks were never designed for them. LiteRT-LM is Google's production answer to that gap.

What LiteRT-LM Is

LiteRT-LM is the LLM-specific layer of Google's LiteRT framework (the successor to TensorFlow Lite). It was developed by the Google AI Edge team — the same group that ships the AI features in Android.

The framework provides:

  • Hardware-accelerated inference via Android's NNAPI, OpenCL GPU backend, and Qualcomm/MediaTek NPU backends
  • Quantization support — INT4, INT8, and FP16 formats to fit models within 1–4GB of device RAM
  • Streaming generation — token-by-token output so the first token appears in under 100ms on mid-range hardware
  • Cross-platform C++ API — runs on Android, embedded Linux, and Raspberry Pi-class hardware
  • Model conversion tools — convert HuggingFace models to LiteRT format via a single command
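The "1–4GB of device RAM" figure in the quantization bullet is easy to sanity-check with back-of-the-envelope math. The sketch below is plain Python with no LiteRT-LM dependency; the 5.1B parameter count is Gemma 4 E2B's total, as quoted in the benchmark section of this article, and it counts weight storage only (activations and KV cache add overhead on top).

```python
# Approximate RAM footprint of model weights at each quantization
# level LiteRT-LM supports. Weights only -- runtime memory is higher.

BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_footprint_gb(num_params: float, fmt: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

# Gemma 4 E2B: 5.1B total parameters (figure quoted in this article)
for fmt in ("FP16", "INT8", "INT4"):
    print(f"{fmt}: {weight_footprint_gb(5.1e9, fmt):.2f} GB")
```

At INT4 the weights come out to roughly 2.6GB, which lands inside the 1–4GB window the feature list describes; FP16 (over 10GB) explains why aggressive quantization is non-negotiable on phones.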

Benchmarks: What "Fast" Means on a Phone

Google published internal benchmarks at launch using Gemma 4 E2B (5.1B total, 2.3B effective parameters, INT4 quantized):

| Device | Chip | First Token (ms) | Generation Speed |
|---|---|---|---|
| Pixel 9 Pro | Tensor G4 | 84 ms | 42 tokens/sec |
| Samsung Galaxy S26 | Snapdragon 8 Elite | 71 ms | 58 tokens/sec |
| Galaxy A57 | Snapdragon 7s Gen 3 | 210 ms | 18 tokens/sec |
| Raspberry Pi 5 | ARM Cortex-A76 | 680 ms | 6 tokens/sec |

42–58 tokens/second on a flagship phone is genuinely fast — fast enough for conversational applications. Even mid-range hardware at 18 tokens/second is usable for short-context tasks.
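To see what those two benchmark columns mean for a user, you can combine them into a rough end-to-end latency estimate: time to first token plus steady-state generation. The sketch below uses the article's published figures; a real app would also pay a prompt-processing cost that grows with context length, which these numbers don't capture.

```python
# Rough end-to-end latency for a reply: time-to-first-token plus
# the remaining tokens at the device's steady generation speed.

def response_time_s(first_token_ms: float, tok_per_s: float,
                    n_tokens: int) -> float:
    """Seconds to produce an n_tokens reply (prompt processing excluded)."""
    return first_token_ms / 1000 + (n_tokens - 1) / tok_per_s

# A 200-token reply on each benchmarked device (figures from the table)
devices = {
    "Pixel 9 Pro": (84, 42),
    "Samsung Galaxy S26": (71, 58),
    "Galaxy A57": (210, 18),
    "Raspberry Pi 5": (680, 6),
}
for name, (ttft_ms, tps) in devices.items():
    print(f"{name}: {response_time_s(ttft_ms, tps, 200):.1f} s")
```

A 200-token answer takes roughly 5 seconds on a Pixel 9 Pro and about 11 seconds on the mid-range Galaxy A57, which is why streaming output (showing tokens as they arrive) matters so much for perceived speed.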

The Gemma 4 Connection

LiteRT-LM is not a standalone tool — it is the infrastructure layer enabling Gemini Nano 4 on Android devices. The Gemma 4 E2B and E4B models (Google's open-weight edge models released April 2) are the primary first-party targets.

Gemma 4 E2B accepts audio input natively, meaning Android apps built on LiteRT-LM can do real-time transcription, voice-command interpretation, and on-device speech summarization without a single network call.

For Android developers, access to Gemini Nano 4 via AICore is in developer preview now. The production API is expected to ship alongside Android 17 later in 2026.

How It Compares to Existing Solutions

| Framework | Primary Target | NPU Support | Open Source |
|---|---|---|---|
| LiteRT-LM (Google) | Mobile + IoT | Yes (NNAPI, OpenCL) | Apache 2.0 |
| llama.cpp | Desktop / server | Limited | MIT |
| Ollama | Desktop / server | No | MIT |
| MLC LLM | Multi-platform | Partial (Vulkan) | Apache 2.0 |
| Apple Core ML | iOS/macOS only | Yes (ANE) | Closed |

LiteRT-LM's key advantage over llama.cpp on Android is native NNAPI integration — it delegates computation to dedicated NPU cores that llama.cpp cannot access, resulting in 3–5x faster inference on Snapdragon-equipped devices.
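That "3–5x faster" claim can be turned around to estimate what llama.cpp would deliver on the same hardware. The sketch below derives the implied range from the article's own 58 tokens/sec figure for the Snapdragon 8 Elite; it is an inference from the stated speedup, not an independent benchmark of llama.cpp.

```python
# Implied llama.cpp throughput on the same Snapdragon 8 Elite device,
# derived from the article's 58 tokens/sec LiteRT-LM figure and its
# claimed 3-5x NPU speedup. Not a measured llama.cpp benchmark.

litert_tps = 58  # Samsung Galaxy S26, from the benchmark table

low = litert_tps / 5   # if LiteRT-LM is 5x faster
high = litert_tps / 3  # if LiteRT-LM is 3x faster
print(f"implied llama.cpp range: {low:.0f}-{high:.0f} tokens/sec")
```

If the speedup claim holds, CPU-bound llama.cpp would sit somewhere around 12–19 tokens/sec on that chip: usable, but close to the mid-range Galaxy A57's NPU-accelerated figure.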

Cloud AI with frontier models

While edge AI is growing fast, cloud frontier models remain substantially more capable. Happycapy runs Claude Opus 4.6 and GPT-5.4 — accessible from any device, including mobile. Starting at $17/mo.

Try Happycapy Free →

Privacy: The Real Driver

Performance benchmarks are compelling, but privacy is likely to be the primary adoption driver for LiteRT-LM — especially in enterprise and healthcare contexts.

When an LLM runs on-device, patient data never leaves the hospital's devices. Employee conversations with an AI assistant never reach a cloud server. Legal documents stay within the firm's network perimeter.

This is already driving procurement decisions in regulated industries. Google's ability to ship production-quality on-device AI that passes enterprise security reviews — rather than research prototypes — is where LiteRT-LM earns its relevance beyond developer enthusiasm.

What to Watch Next

Three developments will determine how widely LiteRT-LM is adopted:

  • Android 17 AICore GA — when production access to Gemini Nano 4 opens to all app developers, expect a wave of on-device AI features in mainstream Android apps
  • Qualcomm and MediaTek NPU support breadth — mid-range device performance depends on how well LiteRT-LM abstracts vendor-specific NPU backends across hundreds of chip variants
  • Third-party model conversion quality — the framework is most useful if community models (not just Google's) convert cleanly; early reports from the Gemma 4 community are positive

The Bottom Line

LiteRT-LM is a genuine infrastructure release, not a marketing announcement. It solves a real problem — efficient LLM inference on constrained hardware — and it does so with production-grade tooling, hardware acceleration, and a permissive open-source license.

For mobile developers, it is the most capable on-device AI runtime available on Android today. For the broader AI landscape, it is another signal that AI is moving off the server and into every device in your pocket, home, and workplace.

