Why On-Device AI Is the Next Big Shift
Every AI tool you use today — ChatGPT, Claude, Gemini — routes your requests to data centers. That works for most tasks. But it creates three friction points: latency (100–500ms per request), privacy (your data leaves your device), and availability (you need internet).
On-device LLMs eliminate all three. When the model runs on your phone's chip, responses are instant, nothing leaves your device, and the feature works in airplane mode.
The barrier has been performance. Mobile chips lack the raw compute of server GPUs, and most inference frameworks were never designed for them. LiteRT-LM is Google's production answer to that gap.
What LiteRT-LM Is
LiteRT-LM is the LLM-specific layer of Google's LiteRT framework (the successor to TensorFlow Lite). It was developed by the Google AI Edge team — the same group that ships the AI features in Android.
The framework provides:
- Hardware-accelerated inference via Android's NNAPI, OpenCL GPU backend, and Qualcomm/MediaTek NPU backends
- Quantization support — INT4, INT8, and FP16 formats to fit models within 1–4GB of device RAM
- Streaming generation — token-by-token output so the first token appears in under 100ms on mid-range hardware
- Cross-platform C++ API — runs on Android, embedded Linux, and Raspberry Pi-class hardware
- Model conversion tools — convert HuggingFace models to LiteRT format via a single command
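To make the quantization bullet concrete, here is some back-of-the-envelope arithmetic (my own illustration, not LiteRT-LM code) for the weight footprint of a model at each supported precision, using the 2.3B effective parameter count of the Gemma 4 E2B model benchmarked below:

```python
# Rough weight-memory estimate per quantization format. Real runtime
# usage is higher (KV cache, activations, per-channel scales), so
# treat these numbers as lower bounds.

BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_footprint_gb(params: float, fmt: str) -> float:
    """Approximate weight memory in GB for a model with `params` parameters."""
    return params * BYTES_PER_PARAM[fmt] / 1024**3

PARAMS = 2.3e9  # effective parameters of Gemma 4 E2B
for fmt in ("FP16", "INT8", "INT4"):
    print(f"{fmt}: {weight_footprint_gb(PARAMS, fmt):.2f} GB")
```

At INT4 the weights alone come to roughly 1.1 GB (versus about 4.3 GB at FP16), which is how a 2.3B-effective-parameter model fits the 1–4GB device-RAM budget mentioned above.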
Benchmarks: What "Fast" Means on a Phone
Google published internal benchmarks at launch using Gemma 4 E2B (5.1B total, 2.3B effective parameters, INT4 quantized):
| Device | Chip | First Token (ms) | Generation (tokens/sec) |
|---|---|---|---|
| Pixel 9 Pro | Tensor G4 | 84 | 42 |
| Samsung Galaxy S26 | Snapdragon 8 Elite | 71 | 58 |
| Galaxy A57 | Snapdragon 7s Gen 3 | 210 | 18 |
| Raspberry Pi 5 | ARM Cortex-A76 | 680 | 6 |
42–58 tokens/second on a flagship phone is genuinely fast — fast enough for conversational applications. Even mid-range hardware at 18 tokens/second is usable for short-context tasks.
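To translate those two columns into user-perceived latency, you can combine time-to-first-token with steady-state generation speed. A simple additive model (my arithmetic, not a published benchmark methodology):

```python
def response_time_s(first_token_ms: float, tokens_per_sec: float, n_tokens: int) -> float:
    """Estimated wall-clock time to stream an n-token reply:
    time to first token, plus the remaining tokens at steady-state speed."""
    return first_token_ms / 1000 + (n_tokens - 1) / tokens_per_sec

# Figures from the benchmark table above, for a 100-token reply.
for device, ft_ms, tps in [
    ("Pixel 9 Pro", 84, 42),
    ("Galaxy A57", 210, 18),
    ("Raspberry Pi 5", 680, 6),
]:
    print(f"{device}: {response_time_s(ft_ms, tps, 100):.1f} s")
```

A 100-token reply lands in about 2.4 seconds on the Pixel 9 Pro versus roughly 17 seconds on the Raspberryry Pi 5 — the same framework spans conversational and batch-only territory depending on the silicon.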
The Gemma 4 Connection
LiteRT-LM is not a standalone tool — it is the infrastructure layer enabling Gemini Nano 4 on Android devices. The Gemma 4 E2B and E4B models (Google's open-weight edge models released April 2) are the primary first-party targets.
Gemma 4 E2B accepts native audio input, meaning Android apps built on LiteRT-LM can do real-time transcription, voice command interpretation, and on-device speech summarization without any network call.
For Android developers, access to Gemini Nano 4 via AICore is in developer preview now. The production API is expected to ship alongside Android 17 later in 2026.
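Real-time experiences like the transcription use case above depend on the streaming, token-by-token delivery described earlier. LiteRT-LM's real interface is a C++ API and is not reproduced here; the following is a language-neutral sketch of the consumption pattern such runtimes expose, with all names (`fake_on_device_engine`, `stream_reply`) hypothetical stand-ins:

```python
from typing import Callable, Iterator

def fake_on_device_engine(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming on-device runtime (hypothetical; the
    real LiteRT-LM API differs). Yields tokens one at a time, the way
    a streaming backend surfaces them."""
    for tok in ["On", "-device", " inference", " needs", " no", " network", "."]:
        yield tok

def stream_reply(prompt: str, on_token: Callable[[str], None]) -> str:
    """Consume a token stream, invoking a UI callback per token so the
    first token can be painted long before generation finishes."""
    parts = []
    for tok in fake_on_device_engine(prompt):
        on_token(tok)  # e.g. append to a TextView or terminal
        parts.append(tok)
    return "".join(parts)

reply = stream_reply("hello", on_token=lambda t: print(t, end="", flush=True))
```

The point of the pattern is the callback: with sub-100ms first-token latency, the UI starts updating almost immediately even though the full reply takes seconds to complete.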
How It Compares to Existing Solutions
| Framework | Primary Target | Android NPU Support | Open Source |
|---|---|---|---|
| LiteRT-LM (Google) | Mobile + IoT | Yes (NNAPI, OpenCL) | Apache 2.0 |
| llama.cpp | Desktop / server | Limited | MIT |
| Ollama | Desktop / server | No | MIT |
| MLC LLM | Multi-platform | Partial (Vulkan) | Apache 2.0 |
| Apple Core ML | iOS/macOS only | Yes (ANE) | Closed |
LiteRT-LM's key advantage over llama.cpp on Android is native NNAPI integration — it delegates computation to dedicated NPU cores that llama.cpp cannot access, resulting in 3–5x faster inference on Snapdragon-equipped devices.
Privacy: The Real Driver
Performance benchmarks are compelling, but privacy is likely to be the primary adoption driver for LiteRT-LM — especially in enterprise and healthcare contexts.
When an LLM runs on-device, patient data never leaves the hospital's devices. Employee conversations with an AI assistant never reach a cloud server. Legal documents stay within the firm's network perimeter.
This is already driving procurement decisions in regulated industries. Google's ability to ship production-quality on-device AI that passes enterprise security reviews — rather than research prototypes — is where LiteRT-LM earns its relevance beyond developer enthusiasm.
What to Watch Next
Three developments will determine how widely LiteRT-LM is adopted:
- Android 17 AICore GA — when production access to Gemini Nano 4 opens to all app developers, expect a wave of on-device AI features in mainstream Android apps
- Qualcomm and MediaTek NPU support breadth — mid-range device performance depends on how well LiteRT-LM abstracts vendor-specific NPU backends across hundreds of chip variants
- Third-party model conversion quality — the framework is most useful if community models (not just Google's) convert cleanly; early reports from the Gemma 4 community are positive
The Bottom Line
LiteRT-LM is a genuine infrastructure release, not a marketing announcement. It solves a real problem — efficient LLM inference on constrained hardware — and it does so with production-grade tooling, hardware acceleration, and a permissive open-source license.
For mobile developers, it is the most capable on-device AI runtime available on Android today. For the broader AI landscape, it is another signal that AI is moving off the server and into every device in your pocket, home, and workplace.