HappycapyGuide

By Connie · This article contains affiliate links. We may earn a commission at no extra cost to you if you sign up through our links.


Cohere Transcribe: Open-Source ASR Beats Whisper with 5.4% Word Error Rate

March 26, 2026  ·  HappyCapy Guide

TL;DR

Cohere released Cohere Transcribe on March 26, 2026. It is a 2B-parameter open-source ASR model that ranked #1 on the Hugging Face Open ASR Leaderboard at launch with a 5.42% average word error rate (WER), a 27% relative reduction from OpenAI Whisper Large v3's 7.44%. It supports 14 languages, runs on a consumer RTX 3060, and is free under Apache 2.0.

Cohere, best known as the enterprise large language model company, just made a surprising entry into audio AI. On March 26, 2026, it released Cohere Transcribe — a dedicated speech recognition model that immediately claimed the top spot on the Hugging Face Open ASR Leaderboard, surpassing OpenAI Whisper, ElevenLabs Scribe, Zoom Scribe, and IBM Granite Speech.

Rather than adapting an existing text LLM for speech, Cohere trained Transcribe from scratch on 500,000 hours of audio-transcript pairs using a Fast-Conformer encoder-decoder architecture, prioritizing accuracy and throughput over model generality.

Benchmark Results: #1 on Open ASR Leaderboard

Cohere Transcribe's average word error rate of 5.42% puts it ahead of every model on the Hugging Face Open ASR Leaderboard as of its launch date.

| Model | Avg WER ↓ | License | Params |
| --- | --- | --- | --- |
| Cohere Transcribe | 5.42% | Apache 2.0 | 2B |
| Zoom Scribe v1 | 5.47% | Proprietary | |
| IBM Granite 4.0 Speech 1B | 5.52% | Apache 2.0 | 1B |
| ElevenLabs Scribe v2 | 5.83% | Proprietary | |
| OpenAI Whisper Large v3 | 7.44% | MIT | 1.5B |
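The "27%" figure quoted for Whisper is a relative reduction in WER, not a gap in percentage points. A minimal sketch of the arithmetic, using only the two averages from the table (the function name is illustrative):

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (0.27 means 27%)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


whisper_large_v3 = 7.44   # average WER in percent, from the table
cohere_transcribe = 5.42  # average WER in percent, from the table

absolute_gap = whisper_large_v3 - cohere_transcribe
relative_gap = relative_wer_reduction(whisper_large_v3, cohere_transcribe)

print(f"absolute: {absolute_gap:.2f} points, relative: {relative_gap:.1%}")
# prints "absolute: 2.02 points, relative: 27.2%"
```

The absolute gap is about two errors per hundred words; the relative figure is larger because it is measured against Whisper's own error rate.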

Human evaluation results reinforce the leaderboard rankings: in English pairwise comparisons, humans preferred Cohere Transcribe over Whisper Large v3 in 64% of test cases and over ElevenLabs Scribe v2 in 51% of cases. Japanese showed an even stronger preference, with Cohere winning 66–70% of comparisons.
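For readers new to the metric, WER is conventionally the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below implements that textbook definition. It is not the leaderboard's own scoring code, and it assumes both strings have already been normalized (casing, punctuation) however the evaluator prefers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate(
    "revenue rose five percent this quarter",
    "revenue rose nine percent this quarter",
)
print(f"{wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% on very poor output.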

Architecture: Built From Scratch for Speed

Cohere Transcribe uses a Conformer-based encoder-decoder architecture trained on 500,000 hours of audio. Over 90% of its 2 billion parameters are in the Fast-Conformer encoder, which handles acoustic representation. A lightweight Transformer decoder converts the encoded audio to text.

This design choice explains why Transcribe achieves 3x higher offline throughput than similarly sized models. It is purpose-built for transcription rather than general audio understanding, which makes it both faster and more accurate at the specific task of converting speech to text.

Hardware requirements are remarkably modest. Cohere confirmed that Transcribe runs on consumer-grade GPUs (RTX 3060 and above) rather than requiring enterprise A100 or H100 clusters. This opens the door for self-hosted deployment in organizations with data privacy requirements.
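A rough back-of-the-envelope check shows why a 2B-parameter model is plausible on a consumer card. The sketch counts weight storage only. The numeric precisions and the 12 GB VRAM figure (the common RTX 3060 variant) are assumptions rather than anything Cohere has published, and activations, decoding state and framework overhead all add to the real footprint.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 2e9          # 2 billion parameters, per the article
RTX_3060_VRAM_GIB = 12    # assumption: the 12 GB RTX 3060 variant

# Assumed precisions; the article does not say which one Cohere ships.
for precision, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    needed = weight_memory_gib(NUM_PARAMS, bytes_per_param)
    headroom = RTX_3060_VRAM_GIB - needed
    print(f"{precision:>9}: {needed:.2f} GiB weights, {headroom:.2f} GiB left")
```

Under these assumptions, half-precision weights take under 4 GiB, leaving most of the card free for audio batches.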

Dataset Benchmarks

| Dataset | WER | Notes |
| --- | --- | --- |
| LibriSpeech (clean) | 1.25% | Read speech, studio conditions |
| LibriSpeech (other) | 2.37% | Harder read speech |
| TED-LIUM | 2.49% | Conference talks |
| SPGISpeech | 3.08% | Financial audio |
| VoxPopuli | 5.87% | Diverse accents, EU Parliament |
| GigaSpeech | 9.34% | Web audio, podcasts |
| Earnings22 | 10.86% | Financial earnings calls |
| AMI (multi-speaker) | 8.13–8.15% | Meeting transcription |
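The eight per-dataset figures are consistent with the headline average. The sketch below assumes the leaderboard average is an unweighted mean across datasets and takes the midpoint of the AMI range. Both are assumptions made for illustration.

```python
from statistics import mean

# Per-dataset WER in percent, copied from the table above.
per_dataset_wer = {
    "LibriSpeech (clean)": 1.25,
    "LibriSpeech (other)": 2.37,
    "TED-LIUM": 2.49,
    "SPGISpeech": 3.08,
    "VoxPopuli": 5.87,
    "GigaSpeech": 9.34,
    "Earnings22": 10.86,
    "AMI": (8.13 + 8.15) / 2,  # assumption: midpoint of the reported range
}

average_wer = mean(per_dataset_wer.values())
print(f"unweighted mean over {len(per_dataset_wer)} datasets: {average_wer:.3f}%")
```

The result lands within a hundredth of a point of the reported 5.42%. It also shows how far the clean read-speech sets pull the average down from the double-digit WER on earnings calls.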

14 Languages Supported

Cohere Transcribe supports the following languages: English, Chinese (Mandarin), Japanese, Arabic, French, German, Spanish, Portuguese, Dutch, Italian, Polish, Russian, Korean, and Hindi. Whereas Whisper covers 100+ languages, Transcribe trades breadth for quality: each supported language is thoroughly represented in the training data.

A notable limitation: language must be specified at inference time. Transcribe does not auto-detect the spoken language. For multilingual pipelines, you will need a language identification step upstream.
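One way to structure that upstream step is a small router that identifies the language first and only then picks a transcriber. This is a sketch under stated assumptions: the ISO 639-1 codes are a guess at the identifier format, and `detect_language`, `transcribe` and `fallback` are hypothetical callables standing in for whatever language-ID model and ASR backends you actually use.

```python
from typing import Callable

# ISO 639-1 codes for the 14 languages listed above. The code format is an
# assumption; check the model card for the identifiers it actually expects.
SUPPORTED_LANGUAGES = {
    "en": "English", "zh": "Chinese (Mandarin)", "ja": "Japanese",
    "ar": "Arabic", "fr": "French", "de": "German", "es": "Spanish",
    "pt": "Portuguese", "nl": "Dutch", "it": "Italian", "pl": "Polish",
    "ru": "Russian", "ko": "Korean", "hi": "Hindi",
}


def route_by_language(
    audio_path: str,
    detect_language: Callable[[str], str],
    transcribe: Callable[[str, str], str],
    fallback: Callable[[str], str],
) -> str:
    """Identify the language, then send the file to a suitable transcriber."""
    language = detect_language(audio_path)
    if language in SUPPORTED_LANGUAGES:
        return transcribe(audio_path, language)
    return fallback(audio_path)


# Stand-in callables so the sketch runs without any model installed.
def fake_detect(path: str) -> str:
    return "sv" if "swedish" in path else "ja"


def fake_transcribe(path: str, language: str) -> str:
    return f"[transcribe:{language}] {path}"


def fake_fallback(path: str) -> str:
    return f"[fallback] {path}"


routed = route_by_language("call_tokyo.wav", fake_detect, fake_transcribe, fake_fallback)
unrouted = route_by_language("call_swedish.wav", fake_detect, fake_transcribe, fake_fallback)
print(routed)
print(unrouted)
```

Passing the backends in as callables keeps the routing logic testable on its own, and a broader-coverage model can serve as the fallback for unsupported languages.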

Enterprise Limitations to Know

Cohere Transcribe is production-ready for the right use cases, but three limitations affect enterprise deployments:

How to Access Cohere Transcribe

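Three deployment options are available: the open weights on Hugging Face under Apache 2.0 for local deployment, a free rate-limited Cohere API for testing, and a paid managed tier ("Model Vault") that removes rate limits for production use.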

"We built Transcribe from the ground up, dedicating the vast majority of model capacity to the acoustic encoder — this is why we achieve state-of-the-art accuracy at a fraction of the compute cost." — Cohere research team

Why This Matters for AI Developers

Until now, Whisper Large v3 was the default open-source choice for speech-to-text — it was free, accurate enough, and widely supported. Cohere Transcribe changes that calculus for any team where transcription accuracy directly affects business outcomes (legal, medical, financial, customer support).

The 27% WER improvement over Whisper is not a marginal gain. At the sentence level, it means fewer corrections, fewer missed words, and lower post-processing costs. For a team transcribing 1,000 hours of earnings calls per quarter, the difference is measurable in analyst time saved.
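To put a rough number on the 1,000-hour example, the sketch below converts the WER gap into an expected count of word errors. The speaking rate of 150 words per minute is an assumption, and the leaderboard averages are used for both models because the article gives no Whisper figure for earnings calls specifically. Treat the output as an order-of-magnitude estimate.

```python
HOURS_PER_QUARTER = 1_000      # from the example above
WORDS_PER_MINUTE = 150         # assumption: typical conversational pace


def expected_word_errors(total_words: int, wer_percent: float) -> float:
    """Expected number of word-level errors at a given WER."""
    return total_words * wer_percent / 100


total_words = HOURS_PER_QUARTER * 60 * WORDS_PER_MINUTE
whisper_errors = expected_word_errors(total_words, 7.44)
cohere_errors = expected_word_errors(total_words, 5.42)
errors_avoided = whisper_errors - cohere_errors

print(f"words transcribed: {total_words:,}")
print(f"errors avoided per quarter: {errors_avoided:,.0f}")
```

With these inputs the gap works out to roughly 180,000 fewer word errors per quarter.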

The Apache 2.0 license also removes the friction of using proprietary APIs for sensitive data. Legal, healthcare, and financial organizations can self-host without routing audio through third-party servers.

For developers building transcription workflows, Happycapy provides access to state-of-the-art language models for downstream processing — summarization, extraction, analysis — once your audio is converted to text.


Frequently Asked Questions

What is Cohere Transcribe?

Cohere Transcribe is a 2-billion-parameter open-source ASR model released on March 26, 2026. It ranked #1 on the Hugging Face Open ASR Leaderboard at launch with a 5.42% average word error rate. It is available under Apache 2.0 for local deployment or via Cohere's managed API.

How does Cohere Transcribe compare to OpenAI Whisper?

Cohere Transcribe (5.42% WER) has a 27% lower word error rate than Whisper Large v3 (7.44% WER). Transcribe also delivers 3x higher offline throughput than similarly sized models and runs on consumer GPUs. Whisper supports 100+ languages versus Transcribe's 14, so Whisper remains the better option for any language outside those 14.

What languages does Cohere Transcribe support?

Cohere Transcribe supports 14 languages: English, Chinese, Japanese, Arabic, French, German, Spanish, Portuguese, Dutch, Italian, Polish, Russian, Korean, and Hindi. Language detection is not automatic — the language code must be specified at inference time.

Is Cohere Transcribe free to use?

Yes. Cohere Transcribe weights are available for free on Hugging Face under the Apache 2.0 license for local deployment. Cohere also offers a free rate-limited API for testing. A paid managed tier ("Model Vault") removes rate limits for production use.

Sources:
Cohere Blog — "Introducing Cohere Transcribe" (March 26, 2026)
Hugging Face — CohereLabs/cohere-transcribe-03-2026 model card
TechCrunch — "Cohere launches an open source voice model specifically for transcription" (March 26, 2026)
VentureBeat — "Cohere's open-weight ASR model hits 5.4% word error rate" (March 31, 2026)
Ars Technica — Hugging Face Open ASR Leaderboard data