NVIDIA Nemotron 3 Super: 120B Open-Source AI Model Built for Agents — Full Breakdown
Most 2026 AI model releases compete on benchmark scores. NVIDIA went a different direction with Nemotron 3 Super: optimize for throughput and cost, not just accuracy. The result is a 120B-parameter model that activates only 12B parameters per token, runs 7x faster than standard models at its performance tier, and costs a fraction of GPT-5.4 or Claude Opus 4.6 per inference call.
For teams building AI agents at scale — where every API call counts — that tradeoff matters more than a 2-point benchmark advantage.
What Is Nemotron 3 Super?
Nemotron 3 Super is the latest generation of NVIDIA's open-source model family. The architecture is what makes it unusual: a hybrid Mamba2-Transformer Mixture-of-Experts (MoE) model combined with Multi-Token Prediction (MTP) layers.
The MoE design means the model has 120B total parameters but only routes 12B of them per inference pass. This is different from dense models like GPT-5.4, which activate all parameters for every token. The result is dramatically lower compute cost at comparable output quality.
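As a rough illustration of that gap, using the parameter counts above and the common approximation that a forward pass costs about 2 FLOPs per active parameter per token (an estimate, not a published figure for either model):

```python
def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params

dense = forward_flops_per_token(120e9)  # dense model: all 120B parameters fire
moe = forward_flops_per_token(12e9)     # MoE: only the 12B routed parameters fire

print(f"dense: {dense:.1e} FLOPs/token")
print(f"moe:   {moe:.1e} FLOPs/token")
print(f"ratio: {dense / moe:.0f}x")     # 10x less compute per token
```

The 10x compute gap per token is the mechanical source of the cost advantage; the quoted 7x throughput figure is lower because routing and memory overheads eat part of it.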
Mamba2 layers replace standard attention mechanisms for most of the model — this is where the throughput gains come from. Mamba processes sequences in linear time (not quadratic like attention), which is why it can handle the 1M token context window efficiently at scale.
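To see why that matters at a 1M-token context, compare how sequence-mixing cost grows with length (a back-of-the-envelope sketch with constants omitted, so only the growth rates are meaningful):

```python
def attention_cost(n: int) -> int:
    """Self-attention compares every token with every other token: O(n^2)."""
    return n * n

def mamba_cost(n: int) -> int:
    """A state-space scan touches each token once: O(n)."""
    return n

for n in (1_000, 100_000, 1_000_000):
    ratio = attention_cost(n) // mamba_cost(n)
    print(f"n={n:>9,}  attention/mamba cost ratio: {ratio:,}x")
```

At 1M tokens the quadratic term dominates completely, which is why hybrid designs keep full attention in only a few layers.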
Key Specifications
| Spec | Value |
|---|---|
| Total parameters | 120 billion |
| Active parameters per token | 12 billion (Latent MoE) |
| Architecture | Hybrid Mamba2-Transformer MoE with MTP layers |
| Context window | 1,000,000 tokens |
| Throughput vs. comparable models | 7x higher |
| Training data | ~25 trillion tokens (pre-training cutoff Dec 2025) |
| Languages | 19 natural languages, 43 programming languages |
| License | NVIDIA Nemotron Open Model License (commercial OK) |
| Quantizations | BF16, FP8, NVFP4 |
| Availability | Hugging Face, build.nvidia.com, OpenRouter, AWS Bedrock, GCP Vertex, Oracle Cloud |
What Makes the Architecture Different
Latent MoE is the key innovation. A standard MoE router scores experts from the raw token embedding. Nemotron 3 Super routes from the hidden state, a richer representation that captures semantic context. Experts therefore specialize on meaning rather than surface token patterns, which improves accuracy while keeping active parameters low.
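A minimal sketch of that routing difference (the shapes, expert counts, and router here are hypothetical; NVIDIA has not published the actual router): both variants use the same top-k scoring, and only the input vector changes.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, HIDDEN = 64, 4, 512

# Hypothetical router: a linear layer scoring all experts from one vector.
router_weights = rng.standard_normal((HIDDEN, NUM_EXPERTS))

def route(vector: np.ndarray) -> np.ndarray:
    """Pick the top-k experts for one token from some representation of it."""
    scores = vector @ router_weights
    return np.argsort(scores)[-TOP_K:]          # indices of the k best experts

token_embedding = rng.standard_normal(HIDDEN)   # raw surface-level representation
hidden_state = rng.standard_normal(HIDDEN)      # context-enriched representation

standard_choice = route(token_embedding)  # standard MoE: routes on surface form
latent_choice = route(hidden_state)       # latent MoE: routes on semantic context

print("standard MoE experts:", standard_choice)
print("latent MoE experts:  ", latent_choice)
```

Because the hidden state already encodes what the token means in context, two occurrences of the same word can land on different experts, which is the specialization effect described above.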
Multi-Token Prediction (MTP) layers let the model predict multiple future tokens simultaneously during inference. For complex reasoning chains (like multi-step agent planning), this produces 3x faster inference compared to autoregressive single-token prediction.
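The claimed speedup follows from simple step arithmetic: emitting k tokens per decode step turns n sequential steps into roughly n/k. (This sketch assumes every multi-token prediction is accepted; real systems verify the drafted tokens, so realized gains vary.)

```python
import math

def decode_steps(n_tokens: int, tokens_per_step: int) -> int:
    """Sequential decode steps needed to emit n_tokens."""
    return math.ceil(n_tokens / tokens_per_step)

single = decode_steps(900, 1)  # classic autoregressive decoding: one token per step
mtp = decode_steps(900, 3)     # hypothetical 3-token MTP head

print(single, mtp, f"{single / mtp:.0f}x fewer steps")
```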
Reasoning budget control is an API-level feature that lets developers adjust how much compute the model spends on each request:
- Full Reasoning — default for deep multi-step problems
- Reasoning Budget — cap compute time for latency-sensitive apps
- Low Effort — maximum speed for simple tasks like summarization
This kind of programmatic reasoning control is rare. It lets you tune the cost-accuracy tradeoff at runtime rather than picking a different model entirely.
Run Nemotron 3 Super in Your Happycapy Workflows
Happycapy's AI Gateway supports multiple models — swap between Claude, GPT-5.4, and Nemotron 3 Super based on task requirements, all without changing your setup.
Try Happycapy Free →

Nemotron 3 Super vs. GPT-5.4 vs. Claude Opus 4.6
| Model | Params (Active) | Context | Best At | Open Source | Cost |
|---|---|---|---|---|---|
| Nemotron 3 Super | 120B (12B active) | 1M tokens | High-throughput agents, tool-calling at scale | Yes (commercial) | Very low (7x efficiency) |
| GPT-5.4 | Dense (undisclosed) | 1M tokens | Computer use, knowledge work (75% OSWorld) | No | High |
| Claude Opus 4.6 | Dense (undisclosed) | 1M tokens | Coding (80.8% SWE-bench), complex reasoning | No | High ($5/$25 per M tokens) |
| Nemotron 3 Nano (30B) | 30B | 128K tokens | Edge, low-latency, constrained hardware | Yes (commercial) | Lowest |
Nemotron 3 Super is not trying to beat GPT-5.4 or Claude Opus 4.6 on reasoning benchmarks. It targets a different buyer: teams running high-volume agentic pipelines where inference cost dominates. At 7x throughput, and assuming per-task cost scales inversely with it, a task that costs $100 with Claude Opus 4.6 runs for roughly $14 with Nemotron 3 Super, before quantization optimizations.
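The cost claim is straightforward arithmetic under that inverse-scaling assumption, and it compounds at volume:

```python
def task_cost(baseline: float, speedup: float = 7.0) -> float:
    """Per-task cost for a model that is `speedup`x cheaper per unit of work."""
    return baseline / speedup

per_task = task_cost(100.0)            # the article's $100 Claude Opus 4.6 task
print(f"per task:   ${per_task:.2f}")
print(f"per 1,000:  ${per_task * 1_000:,.0f} vs ${100.0 * 1_000:,.0f}")
```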
What It's Built For
NVIDIA specifically designed Nemotron 3 Super for two use cases where throughput matters most:
1. Software development agents — multi-step code generation, review, and testing pipelines. The 1M context window lets the model hold an entire codebase in context without chunking. The MTP layers accelerate code generation specifically because code follows predictable patterns that multi-token prediction exploits well.
2. Cybersecurity triaging — the tool-calling benchmark is notable here. Nemotron 3 Super can navigate over 100 tools simultaneously in a single workflow. For security operations where agents need to call vulnerability scanners, log analyzers, and remediation APIs in sequence, this matters.
The reasoning budget control also makes it practical for latency-sensitive security alerting — you can set a low reasoning budget for quick triage, and a full reasoning budget for deep incident analysis.
Using Nemotron 3 Super with Happycapy
Happycapy's multi-model selector lets Pro and Max users choose which AI model handles each task. For high-volume agent runs — bulk research, document processing, code review pipelines — switching from Claude Opus 4.6 to Nemotron 3 Super via the AI Gateway can dramatically reduce compute cost while maintaining output quality for most tasks.
The workflow is identical — you don't change how you write prompts or which skills you use. The model switch is transparent. You can run A/B tests directly: send the same task to both models and compare results before committing to one for production.
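An A/B run of the kind described above can be as simple as the loop below. The `gateway_complete` helper is a hypothetical stand-in; substitute your actual Happycapy AI Gateway client.

```python
def gateway_complete(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an AI Gateway call; replace with your real client."""
    return f"[{model}] response to: {prompt}"

MODELS = ["claude-opus-4.6", "nvidia/nemotron-3-super"]

def ab_test(prompt: str) -> dict[str, str]:
    """Send the same task to both models so outputs can be compared side by side."""
    return {model: gateway_complete(model, prompt) for model in MODELS}

results = ab_test("Summarize this design doc and list open risks.")
for model, output in results.items():
    print(f"--- {model} ---\n{output}\n")
```

Running a batch of representative tasks through this loop before cutting production traffic over is cheap insurance against quality regressions.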
For a comparison of the best AI models across reasoning, coding, and agents, see the GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro comparison.
The Nemotron 3 Family Roadmap
NVIDIA has telegraphed the full lineup:
- Nemotron 3 Nano (30B) — available now, for edge and constrained environments
- Nemotron 3 Super (120B / 12B active) — available now, primary production model
- Nemotron 3 Ultra (500B) — expected H1 2026, for highest-capability enterprise workloads
The Ultra model is positioned to compete with GPT-5.4 and Claude Opus 4.6 on top reasoning benchmarks. For now, Super is the production choice for teams building agents at scale.
Bottom Line
Nemotron 3 Super fills a gap in the 2026 AI model landscape: an open model, free for commercial use, optimized for agent throughput rather than benchmark competition. If you're building pipelines that call an AI model thousands of times per day, the 7x efficiency advantage compounds quickly.
It's not the right model for every task. For deep reasoning or complex code generation, Claude Opus 4.6 still leads. But for high-volume agent orchestration, cybersecurity workflows, or any use case where you're watching your inference bill — Nemotron 3 Super is worth testing immediately.
Access 150+ AI Models in One Platform
Happycapy Pro includes access to Claude, GPT-5.4, Gemini, and open-source models via the AI Gateway — switch models per task, no API keys to manage.
Get Happycapy Pro for $17/month →