By Connie · This article contains affiliate links. We may earn a commission at no extra cost to you if you sign up through our links.
Cohere Transcribe: Open-Source ASR Beats Whisper with 5.4% Word Error Rate
Cohere released Cohere Transcribe on March 26, 2026 — a 2B-parameter open-source ASR model that ranks #1 on the Hugging Face Open ASR Leaderboard with a 5.42% word error rate, beating OpenAI Whisper Large v3 (7.44%) by 27%. It supports 14 languages, runs on a consumer RTX 3060, and is free under Apache 2.0.
Cohere, best known as the enterprise large language model company, just made a surprising entry into audio AI. On March 26, 2026, it released Cohere Transcribe — a dedicated speech recognition model that immediately claimed the top spot on the Hugging Face Open ASR Leaderboard, surpassing OpenAI Whisper, ElevenLabs Scribe, Zoom Scribe, and IBM Granite Speech.
Unlike most speech models that fine-tune existing text LLMs, Cohere built Transcribe from scratch on 500,000 hours of audio-transcript pairs using a Fast-Conformer encoder-decoder architecture — prioritizing accuracy and throughput over model generality.
Benchmark Results: #1 on Open ASR Leaderboard
Cohere Transcribe's average word error rate of 5.42% puts it ahead of every model on the Hugging Face Open ASR Leaderboard as of its launch date.
| Model | Avg WER ↓ | License | Params |
|---|---|---|---|
| Cohere Transcribe | 5.42% | Apache 2.0 | 2B |
| Zoom Scribe v1 | 5.47% | Proprietary | — |
| IBM Granite 4.0 Speech 1B | 5.52% | Apache 2.0 | 1B |
| ElevenLabs Scribe v2 | 5.83% | Proprietary | — |
| OpenAI Whisper Large v3 | 7.44% | MIT | 1.5B |
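For readers new to the metric: word error rate is the word-level edit distance (substitutions + insertions + deletions) between the model's hypothesis and a reference transcript, divided by the reference word count. A minimal sketch (the sample sentences are illustrative, not from any benchmark):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-row dynamic programming over the word sequences.
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            row.append(min(
                prev_row[j] + 1,             # deletion
                row[j - 1] + 1,              # insertion
                prev_row[j - 1] + (r != h),  # substitution (0 if match)
            ))
        prev_row = row
    return prev_row[-1] / len(ref)

wer("the quick brown fox jumps", "the quick brown box jumps")  # → 0.2
```

A 5.42% WER on a benchmark means roughly 5 to 6 word-level mistakes per 100 reference words, averaged over the leaderboard's test sets.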
Human evaluation results reinforce the leaderboard rankings: in English pairwise comparisons, humans preferred Cohere Transcribe over Whisper Large v3 in 64% of test cases and over ElevenLabs Scribe v2 in 51% of cases. Japanese showed an even stronger preference, with Cohere winning 66–70% of comparisons.
Architecture: Built From Scratch for Speed
Cohere Transcribe uses a Conformer-based encoder-decoder architecture trained on 500,000 hours of audio. Over 90% of its 2 billion parameters are in the Fast-Conformer encoder, which handles acoustic representation. A lightweight Transformer decoder converts the encoded audio to text.
This design choice explains why Transcribe achieves 3x higher offline throughput than similarly-sized models. It is purpose-built for transcription, not general audio understanding — which makes it both faster and more accurate for the specific task of converting speech to text.
Hardware requirements are remarkably modest. Cohere confirmed that Transcribe runs on consumer-grade GPUs (RTX 3060 and above) rather than requiring enterprise A100 or H100 clusters. This opens the door for self-hosted deployment in organizations with data privacy requirements.
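If the released weights follow the standard Hugging Face integration, self-hosted inference could look something like the sketch below. The model id matches the published model card, but the pipeline task name and the `language` argument are assumptions on my part until the official usage docs confirm them:

```python
from functools import lru_cache

MODEL_ID = "CohereLabs/cohere-transcribe-03-2026"

@lru_cache(maxsize=1)
def load_pipeline():
    # Deferred import: the script fails fast with a clear error
    # if transformers/torch are not installed, and the multi-GB
    # weight download only happens on first use.
    from transformers import pipeline
    return pipeline("automatic-speech-recognition", model=MODEL_ID)

def transcribe(audio_path: str, language: str = "en") -> str:
    """Transcribe one audio file; language must be given (no auto-detect)."""
    asr = load_pipeline()
    result = asr(audio_path, generate_kwargs={"language": language})
    return result["text"]
```

On an RTX 3060-class card, passing `device=0` to `pipeline(...)` moves inference onto the GPU.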
Dataset Benchmarks
| Dataset | WER | Notes |
|---|---|---|
| LibriSpeech (clean) | 1.25% | Read speech, studio conditions |
| LibriSpeech (other) | 2.37% | Harder read speech |
| TED-LIUM | 2.49% | Conference talks |
| SPGISpeech | 3.08% | Financial audio |
| VoxPopuli | 5.87% | Diverse accents, EU Parliament |
| GigaSpeech | 9.34% | Web audio, podcasts |
| Earnings22 | 10.86% | Financial earnings calls |
| AMI (multi-speaker) | 8.13–8.15% | Meeting transcription |
14 Languages Supported
Cohere Transcribe supports the following languages: English, Chinese (Mandarin), Japanese, Arabic, French, German, Spanish, Portuguese, Dutch, Italian, Polish, Russian, Korean, and Hindi. Where Whisper covers 100+ languages, Transcribe trades breadth for quality: each supported language is thoroughly represented in the training data.
A notable limitation: language must be specified at inference time. Transcribe does not auto-detect the spoken language. For multilingual pipelines, you will need a language identification step upstream.
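A cheap guardrail is to validate the requested language against the supported set before calling the model. The ISO 639-1 codes below are my assumption; check the model card for the exact codes the model expects:

```python
# ISO 639-1 codes for the 14 supported languages (assumed; verify
# against the model card, e.g. Chinese could be "zh" vs "zh-CN").
SUPPORTED = {
    "en", "zh", "ja", "ar", "fr", "de", "es",
    "pt", "nl", "it", "pl", "ru", "ko", "hi",
}

def resolve_language(code: str) -> str:
    """Normalize and validate a language code before inference."""
    norm = code.strip().lower().split("-")[0]  # "pt-BR" -> "pt"
    if norm not in SUPPORTED:
        raise ValueError(
            f"{code!r} is not among the 14 supported languages; "
            "run language identification upstream and route accordingly."
        )
    return norm
```

For genuinely multilingual audio, a language-identification model upstream would produce the code that gets passed through this check.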
Enterprise Limitations to Know
Cohere Transcribe is production-ready for the right use cases, but three limitations affect enterprise deployments:
- No speaker diarization. The model transcribes speech but does not separate or label individual speakers. Third-party diarization tools (e.g., pyannote.audio) are needed for meeting transcription with named participants.
- No automatic language detection. Language code must be passed at runtime. This adds a pipeline step for multi-language audio.
- Hallucination from noise. Like all ASR models, Transcribe may generate text from music, ambient sound, or silence. Voice Activity Detection (VAD) preprocessing is recommended for noisy environments such as call centers or video recordings.
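Production deployments typically put a dedicated VAD model (e.g. Silero VAD) in front of the transcriber, but even a crude energy gate illustrates the idea: drop analysis windows whose RMS energy falls below a threshold, so silence and low-level noise never reach the model. A stdlib-only sketch over raw float samples (frame length and threshold are arbitrary illustrative values):

```python
import math

def energy_gate(samples, frame_len=400, threshold=0.01):
    """Return (start, end) sample ranges whose RMS energy clears the
    threshold; everything else is treated as silence and skipped."""
    segments, start = [], None
    for off in range(0, len(samples), frame_len):
        frame = samples[off:off + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            if start is None:
                start = off          # speech segment opens here
        elif start is not None:
            segments.append((start, off))  # segment closed by silence
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments
```

Only the returned segments would then be sent to the ASR model, which removes the silence and pure-noise spans that trigger hallucinated text.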
How to Access Cohere Transcribe
Three deployment options are available:
- Local deployment (free): Download model weights from Hugging Face under Apache 2.0. Requires an RTX 3060 or equivalent GPU. No usage fees.
- Free API (rate-limited): Cohere's API provides a free tier with rate limits for testing and light production use.
- Model Vault (paid): Cohere's managed deployment removes rate limits and adds SLAs for enterprise production workloads.
"We built Transcribe from the ground up, dedicating the vast majority of model capacity to the acoustic encoder — this is why we achieve state-of-the-art accuracy at a fraction of the compute cost." — Cohere research team
Why This Matters for AI Developers
Until now, Whisper Large v3 was the default open-source choice for speech-to-text — it was free, accurate enough, and widely supported. Cohere Transcribe changes that calculus for any team where transcription accuracy directly affects business outcomes (legal, medical, financial, customer support).
The 27% relative WER reduction over Whisper is not a marginal gain. At the sentence level, it means fewer corrections, fewer missed words, and lower post-processing costs. For a team transcribing 1,000 hours of earnings calls per quarter, the difference is measurable in analyst time saved.
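To put that gap in concrete terms, here is the back-of-envelope arithmetic, assuming roughly 150 spoken words per minute (a typical conversational rate; the 1,000-hour volume comes from the example above):

```python
WORDS_PER_MINUTE = 150   # assumed conversational speaking rate
HOURS = 1_000            # quarterly transcription volume from the example

total_words = HOURS * 60 * WORDS_PER_MINUTE   # 9,000,000 words
whisper_errors = total_words * 0.0744         # Whisper Large v3 WER
transcribe_errors = total_words * 0.0542      # Cohere Transcribe WER
fewer_errors = whisper_errors - transcribe_errors

print(f"{fewer_errors:,.0f} fewer word errors per quarter")
# → 181,800 fewer word errors per quarter
```

Even if real-world error rates run higher than leaderboard figures, the relative gap is what drives the reduction in human review time.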
The Apache 2.0 license also removes the friction of using proprietary APIs for sensitive data. Legal, healthcare, and financial organizations can self-host without routing audio through third-party servers.
For developers building transcription workflows, Happycapy provides access to state-of-the-art language models for downstream processing — summarization, extraction, analysis — once your audio is converted to text.
Frequently Asked Questions
What is Cohere Transcribe?
Cohere Transcribe is a 2-billion-parameter open-source ASR model released on March 26, 2026. It achieves #1 on the Hugging Face Open ASR Leaderboard with a 5.42% average word error rate. It is available under Apache 2.0 for local deployment or via Cohere's managed API.
How does Cohere Transcribe compare to OpenAI Whisper?
Cohere Transcribe (5.42% WER) outperforms Whisper Large v3 (7.44% WER), a 27% relative reduction in word error rate. Transcribe also delivers 3x higher throughput and runs on consumer GPUs. Whisper supports 100+ languages versus Transcribe's 14, so Whisper remains the better option for languages outside that set.
What languages does Cohere Transcribe support?
Cohere Transcribe supports 14 languages: English, Chinese, Japanese, Arabic, French, German, Spanish, Portuguese, Dutch, Italian, Polish, Russian, Korean, and Hindi. Language detection is not automatic — the language code must be specified at inference time.
Is Cohere Transcribe free to use?
Yes. Cohere Transcribe weights are available for free on Hugging Face under the Apache 2.0 license for local deployment. Cohere also offers a free rate-limited API for testing. A paid managed tier ("Model Vault") removes rate limits for production use.