Parlona Blog
NVIDIA Nemotron 3.5 ASR: A 600M Streaming Speech Model
May 30, 2026
NVIDIA Nemotron 3.5 ASR: Why a 600M Streaming Speech Model Matters for the Next Wave of Voice Agents
Voice agents are only as good as their first second.
Before the LLM can reason, before a workflow can be triggered, before a calendar can be booked or a support ticket can be created, the system must solve a deceptively hard problem: turn messy, multilingual, real-time human speech into clean text fast enough that the conversation still feels natural.
That is why NVIDIA’s release of Nemotron 3.5 ASR is interesting.
On paper, it is “just” a 600M-parameter multilingual automatic speech recognition model. In practice, it is a sign of where voice AI infrastructure is moving: away from slow, buffered, single-language transcription pipelines and toward streaming, multilingual, low-latency ASR designed specifically for voice agents.
Nemotron 3.5 ASR supports 40 language-locales from a single checkpoint, includes punctuation and capitalization, supports automatic language detection, and is designed around a cache-aware FastConformer-RNNT architecture that avoids re-processing the same audio again and again.
For teams building AI receptionists, call-center copilots, multilingual support bots, meeting assistants, sales qualification agents, or embedded voice interfaces, this is the part that matters: Nemotron 3.5 ASR is not only about accuracy. It is about the full production equation:
latency + throughput + multilingual coverage + deployability + cost per stream.
The old ASR problem: voice agents hate waiting
A human conversation has an unforgiving latency budget.
A typical voice-agent loop looks like this:
User speaks
↓
ASR transcribes speech to text
↓
LLM understands intent and generates response
↓
Business logic / tools are called
↓
TTS turns response into speech
↓
User hears the answer
If ASR takes too long, the whole system feels broken. Even if the LLM is brilliant and the TTS voice is natural, the agent starts to feel like a walkie-talkie instead of a conversation.
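To make the latency budget concrete, here is a minimal sketch that adds up the stages of the loop above. The stage names mirror the loop, but every millisecond value is a hypothetical placeholder rather than a measurement; substitute numbers from your own system.

```python
# Hypothetical per-stage latencies in milliseconds (placeholders, not measurements).
STAGE_LATENCY_MS = {
    "asr": 160,
    "llm_first_token": 350,
    "tools": 100,
    "tts_first_audio": 150,
}


def time_to_first_audio_ms(stages: dict) -> int:
    """Sum the sequential stage latencies for one conversational turn."""
    return sum(stages.values())


def asr_share(stages: dict) -> float:
    """Fraction of the turn latency that is spent in the ASR stage."""
    return stages["asr"] / time_to_first_audio_ms(stages)


total = time_to_first_audio_ms(STAGE_LATENCY_MS)
print(f"time to first audio: {total} ms")
print(f"ASR share of the turn: {asr_share(STAGE_LATENCY_MS):.0%}")
```

The stages are summed because they run one after another in the simplest version of the loop. Pipelines that overlap stages (for example, starting the LLM on partial transcripts) would need a different model.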
For example, imagine a website voice agent:
User: Hi, I want to book a repair appointment for tomorrow morning.
A good voice agent should start acting almost immediately:
{
  "intent": "book_appointment",
  "date": "tomorrow",
  "time_preference": "morning",
  "service": "repair"
}
But in many real systems, ASR is still built around buffered chunks. The model receives a slice of audio, transcribes it, then receives an overlapping slice, re-processes much of the same audio, and repeats. This improves context, but it wastes compute and adds delay.
That is acceptable for offline transcription.
It is painful for real-time agents.
What NVIDIA released
Nemotron 3.5 ASR is a 600M-parameter multilingual streaming ASR model released by NVIDIA for real-time speech-to-text workloads.
The headline features are:
| Feature | Why it matters |
|---|---|
| 600M parameters | Small enough to be practical for production streaming workloads, large enough to handle multilingual ASR with strong quality. |
| 40 language-locales | One model can serve global voice traffic instead of maintaining separate ASR models per language. |
| Streaming inference | The model is designed for real-time transcription, not only offline batch processing. |
| Cache-aware architecture | It reuses previous internal states instead of recomputing overlapping audio windows. |
| Configurable chunk sizes | Developers can choose latency/accuracy trade-offs at runtime. |
| Punctuation and capitalization | Output is closer to what an LLM or workflow engine actually wants. |
| Automatic language detection | Useful for multilingual calls and code-switching scenarios. |
| Open weights / commercial use | Teams can inspect, fine-tune, and deploy without being locked into a single hosted API. |
The supported language-locales are split into three practical tiers.
Transcription-ready languages
These are the highest-quality out-of-the-box languages:
English US/GB, Spanish US/ES, French FR/CA, Italian, Portuguese BR/PT, Dutch, German, Turkish, Russian, Arabic, Hindi, Japanese, Korean, Vietnamese, Ukrainian
Broad-coverage languages
These are also usable out of the box, but generally with higher error rates:
Polish, Swedish, Czech, Norwegian Bokmål, Danish, Bulgarian, Finnish, Croatian, Slovak, Mandarin, Hungarian, Romanian, Estonian
Adaptation-ready languages
These are recognized by the tokenizer, but NVIDIA recommends fine-tuning for production-quality transcription:
Greek, Lithuanian, Latvian, Maltese, Slovenian, Hebrew, Thai, Norwegian Nynorsk
This tiering is important. “Supports 40 language-locales” does not mean “all 40 are equally accurate out of the box.” For production, the right question is not only “is my language supported?” but also:
Is it transcription-ready, broad-coverage, or adaptation-ready?
For a global voice-agent platform, that difference matters.
The architecture idea: stop transcribing the same audio twice
The most interesting technical part is not only the model size or language list. It is the streaming architecture.
Nemotron 3.5 ASR uses a Cache-Aware FastConformer-RNNT architecture.
In simple terms:
Traditional buffered streaming:
Audio chunk 1: process frames 0–10
Audio chunk 2: process frames 5–15
Audio chunk 3: process frames 10–20
Some frames are processed again and again.
Nemotron’s cache-aware streaming is closer to:
Cache-aware streaming:
Audio chunk 1: process frames 0–10 and cache internal state
Audio chunk 2: process only new frames 11–15, reuse cache
Audio chunk 3: process only new frames 16–20, reuse cache
That changes the production economics.
If every new chunk forces the model to recompute overlapping context, then lower latency means more compute. If the model can reuse cached encoder states, it can keep latency low without paying the same compute penalty.
For a single demo, that might not matter.
For 1,000 simultaneous voice streams, it matters a lot.
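The compute difference can be illustrated with a toy frame count. This is a simplified sketch, not a model of the actual encoder: it only counts how many frames pass through the model under each scheme. The function names are hypothetical, and ranges are treated as half-open, so the 0–10 window counts as 10 frames.

```python
def buffered_frames(total_frames: int, window: int, hop: int) -> int:
    """Frames processed when overlapping windows of `window` frames advance by `hop`.

    Assumes 0 < hop <= window, so consecutive windows overlap or touch.
    """
    processed = 0
    start = 0
    end = 0
    while end < total_frames:
        end = min(start + window, total_frames)
        processed += end - start  # the whole window is re-encoded
        start += hop
    return processed


def cache_aware_frames(total_frames: int) -> int:
    """Frames processed when cached state is reused: each frame is encoded once."""
    return total_frames


# The example above: windows 0-10, 5-15, 10-20 over 20 frames.
buffered = buffered_frames(total_frames=20, window=10, hop=5)
cached = cache_aware_frames(total_frames=20)
print(f"buffered: {buffered} frames, cache-aware: {cached} frames")
```

In this toy case the buffered scheme processes 30 frames for 20 frames of audio. Shrinking the hop to cut latency increases the overlap and therefore the redundant work, while the cache-aware count stays at one pass per frame.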
Runtime latency knob: 80 ms to 1.12 s chunks
Nemotron 3.5 ASR exposes a useful runtime control: configurable chunk size.
The model supports chunk sizes such as:
| Chunk size | Meaning | Best suited for |
|---|---|---|
| 80 ms | Very low-latency streaming | Voice agents, live interruption handling, real-time command detection |
| 160 ms | Low latency, slightly more context | Conversational assistants |
| 320 ms | Balanced latency/accuracy | Support bots, meeting captions |
| 560 ms | More stable transcription | Call analytics, dictation-like flows |
| 1.12 s | Higher accuracy, more delay | Batch-ish streaming, captions where delay is acceptable |
This is useful because not all voice applications have the same latency target.
A voice receptionist should probably prioritize responsiveness:
User: Can I speak to sales?
Agent: Sure, may I have your name?
A meeting transcription system can tolerate more delay:
Speaker: We agreed to move the deployment to Friday.
Caption appears 500–1000 ms later.
A call-center analytics system may care more about throughput than instant final text:
1000 calls are transcribed, summarized, scored, and indexed.
The key point: Nemotron 3.5 ASR lets developers choose the operating point instead of forcing a single latency profile.
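One way to express that choice in application code is a small helper that picks the largest listed chunk size fitting a delay budget. This is a sketch under assumptions: `pick_chunk_ms` and the budget parameter are hypothetical names, the chunk sizes come from the table above, and chunk duration is used as a rough proxy for ASR delay (real latency also includes compute and network time). How the chosen value is passed to the model depends on the serving stack and is not shown.

```python
# Chunk sizes listed in the table above, in milliseconds.
CHUNK_SIZES_MS = (80, 160, 320, 560, 1120)


def pick_chunk_ms(max_chunk_delay_ms: int) -> int:
    """Return the largest listed chunk size that fits within the delay budget."""
    candidates = [size for size in CHUNK_SIZES_MS if size <= max_chunk_delay_ms]
    if not candidates:
        raise ValueError(
            f"budget of {max_chunk_delay_ms} ms is below the smallest chunk size"
        )
    return max(candidates)


# Hypothetical budgets for the three application types discussed above.
for app, budget_ms in [("receptionist", 200), ("captions", 1000), ("analytics", 2000)]:
    print(app, pick_chunk_ms(budget_ms))
```

Choosing the largest chunk that still fits follows the trade-off in the table: larger chunks tend to be more accurate, so there is little reason to go smaller than the budget requires.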
The throughput story: why this model is production-oriented
One of the strongest claims around Nemotron 3.5 ASR is throughput.
NVIDIA compares it against a larger 1.1B Parakeet RNNT multilingual model using buffered streaming. On a single NVIDIA H100, NVIDIA reports that Nemotron 3.5 ASR sustains:
| Setting | Nemotron 3.5 ASR | Buffered Parakeet RNNT 1.1B | Difference |
|---|---|---|---|
| 80 ms chunk | ~240 concurrent streams | ~14 concurrent streams | ~17× more |
| 1.12 s chunk | ~2,400 concurrent streams | ~400 concurrent streams | ~6× more |
This is the kind of number that changes architecture discussions.
For a voice-agent product, the question is not:
Can we transcribe one user in a demo?
The real question is:
Can we transcribe many users at once, with predictable latency, without the ASR bill becoming the product’s biggest cost?
Throughput matters because voice agents are concurrency-heavy. A website chatbot can process text messages asynchronously. A voice agent must keep a live stream open. Every active caller consumes real-time resources.
That makes “streams per GPU” a business metric, not just a benchmark.
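A rough capacity estimate shows why. The sketch below uses the approximate streams-per-H100 figures from the table above and computes how many GPUs a given number of concurrent calls would need. It assumes perfect packing and no headroom, so treat the result as a lower bound; the dictionary keys and function name are hypothetical.

```python
import math

# Approximate concurrent streams per H100, as reported in the table above.
STREAMS_PER_GPU = {
    ("nemotron_3_5_asr", 80): 240,
    ("nemotron_3_5_asr", 1120): 2400,
    ("buffered_parakeet_1_1b", 80): 14,
    ("buffered_parakeet_1_1b", 1120): 400,
}


def gpus_needed(concurrent_streams: int, streams_per_gpu: int) -> int:
    """Minimum GPU count, assuming perfect packing and no safety headroom."""
    return math.ceil(concurrent_streams / streams_per_gpu)


calls = 1000
for (model, chunk_ms), capacity in STREAMS_PER_GPU.items():
    print(f"{model} @ {chunk_ms} ms: {gpus_needed(calls, capacity)} GPU(s) for {calls} calls")
```

A production plan would add headroom for traffic spikes and failover, and would verify the per-GPU figures on its own hardware and audio.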
Accuracy: good, but with realistic language tiers
Nemotron 3.5 ASR is evaluated using Word Error Rate (WER) for most languages and Character Error Rate (CER) for languages such as Japanese, Korean, and Mandarin.
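For readers who want to reproduce these metrics on their own data, here is a minimal sketch of both. WER is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words; CER is the same computation over characters. This is a generic textbook implementation, not NVIDIA's evaluation code, and it skips the text normalization (casing, punctuation) that benchmark pipelines usually apply first.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


print(wer("we use asm and mycronic lines", "we use awesome and my chronic lines"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.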
At the 1.12 s chunk setting with language ID provided, NVIDIA reports the following example results on FLEURS:
| Language | Error rate |
|---|---|
| Spanish | 4.11% WER |
| Italian | 4.25% WER |
| Portuguese | 5.48% WER |
| Hindi | 6.81% WER |
| English | 7.91% WER |
| German | 8.31% WER |
| French | 9.03% WER |
| Russian | 9.17% WER |
| Turkish | 11.17% WER |
| Vietnamese | 11.18% WER |
| Arabic | 12.03% WER |
| Ukrainian | 13.07% WER |
The average for the transcription-ready group is reported as:
- 80 ms chunk: 10.38% average WER
- 1.12 s chunk: 8.84% average WER
That illustrates the expected latency/accuracy trade-off: larger chunks give the model more right context and reduce error rate, but increase delay.
For broad-coverage languages, error rates are higher. NVIDIA reports an average of:
- 80 ms chunk: 25.86% average WER
- 1.12 s chunk: 22.13% average WER
That is still useful for some applications, but it suggests a different deployment strategy:
- Transcription-ready languages: Deploy directly, then test on your domain.
- Broad-coverage languages: Test carefully with your own call/audio data.
- Adaptation-ready languages: Plan fine-tuning before production.
This distinction is especially important for enterprise voice agents, where “good enough for a benchmark” is not the same as “good enough for your customers, accents, product names, noisy microphones, and domain vocabulary.”
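In a platform that serves many locales, the tier can be encoded as data so that rollout decisions are explicit. The sketch below is one possible shape: the tier membership and rollout notes are taken from this article, while the dictionary layout, the locale labels, and the function names are assumptions made for illustration.

```python
# Tier membership as listed in this article (40 language-locales in total).
TIERS = {
    "transcription-ready": [
        "English US", "English GB", "Spanish US", "Spanish ES", "French FR",
        "French CA", "Italian", "Portuguese BR", "Portuguese PT", "Dutch",
        "German", "Turkish", "Russian", "Arabic", "Hindi", "Japanese",
        "Korean", "Vietnamese", "Ukrainian",
    ],
    "broad-coverage": [
        "Polish", "Swedish", "Czech", "Norwegian Bokmål", "Danish", "Bulgarian",
        "Finnish", "Croatian", "Slovak", "Mandarin", "Hungarian", "Romanian",
        "Estonian",
    ],
    "adaptation-ready": [
        "Greek", "Lithuanian", "Latvian", "Maltese", "Slovenian", "Hebrew",
        "Thai", "Norwegian Nynorsk",
    ],
}

ROLLOUT = {
    "transcription-ready": "Deploy directly, then test on your domain.",
    "broad-coverage": "Test carefully with your own call/audio data.",
    "adaptation-ready": "Plan fine-tuning before production.",
}


def tier_of(locale: str):
    """Return the tier name for a locale label, or None if it is not listed."""
    for tier, locales in TIERS.items():
        if locale in locales:
            return tier
    return None


def rollout_plan(locale: str) -> str:
    """Map a locale label to the rollout note for its tier."""
    tier = tier_of(locale)
    return ROLLOUT[tier] if tier else "Not in the supported list."


print(rollout_plan("Polish"))
```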
Why punctuation and capitalization matter more than they seem
Many ASR systems output text like this:
hello i want to change my delivery address can you send it to berlin instead of munich
A human can read it. An LLM can probably understand it. But downstream systems work better with cleaner text:
Hello, I want to change my delivery address. Can you send it to Berlin instead of Munich?
Punctuation and capitalization are not cosmetic in voice-agent systems.
They help with:
- Intent detection
- Entity extraction
- Sentence boundary detection
- Tool-call timing
- Conversation summarization
- CRM logging
- Compliance review
- Human handoff
For example, compare these two transcripts:
cancel my order no wait change the address
vs.
Cancel my order. No, wait — change the address.
The difference can change the action the agent takes.
If punctuation is built into the ASR model, the voice pipeline can avoid adding a separate punctuation-restoration model. That reduces latency, operational complexity, and failure points.
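Sentence boundary detection is a simple illustration. With punctuated output, a naive regex split is often enough to hand one sentence at a time to the next stage; with unpunctuated output the same code sees a single run of words. The sketch below is deliberately simplistic (it would mis-split abbreviations such as "e.g."), and `split_sentences` is a hypothetical helper name.

```python
import re


def split_sentences(transcript: str) -> list:
    """Split on whitespace that follows '.', '?' or '!'. Naive by design."""
    parts = re.split(r"(?<=[.?!])\s+", transcript.strip())
    return [part for part in parts if part]


punctuated = ("Hello, I want to change my delivery address. "
              "Can you send it to Berlin instead of Munich?")
raw = "hello i want to change my delivery address can you send it to berlin instead of munich"

print(split_sentences(punctuated))  # two separate sentences
print(split_sentences(raw))         # one undivided string
```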
Example: multilingual voice receptionist
Consider a multilingual AI receptionist on a company website.
A visitor opens the widget and starts speaking:
Hola, quiero saber si pueden entregar componentes SMD a Alemania.
The ASR output might become:
Hola, quiero saber si pueden entregar componentes SMD a Alemania. <es-ES>
The voice agent can then route the conversation:
{
  "detected_language": "es-ES",
  "intent": "shipping_question",
  "topic": "SMD components delivery to Germany",
  "next_action": "answer_faq_or_collect_contact"
}
The assistant replies in Spanish, but if the user switches language:
Actually, can we continue in English?
The same ASR deployment can detect the language and continue without swapping models.
For a business website, this is powerful. Instead of separate English, German, Spanish, French, and Polish voice stacks, you can build one multilingual ASR backend and let the conversation layer decide how to respond.
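A conversation layer would need to separate the language tag from the text before routing. The sketch below assumes the tag appears at the end of the transcript in the `<xx-YY>` form shown in the example above; the real output format should be checked against the model documentation. `split_language_tag` is a hypothetical helper, and the `<en-US>` tag on the second turn is an assumed value.

```python
import re
from typing import Optional, Tuple

# Assumed tag shape: text followed by a trailing "<xx-YY>" marker.
TAG_RE = re.compile(r"^(?P<text>.*?)\s*<(?P<lang>[a-z]{2}-[A-Z]{2})>\s*$", re.DOTALL)


def split_language_tag(raw: str) -> Tuple[str, Optional[str]]:
    """Return (text, language tag); the tag is None when no trailing tag is found."""
    match = TAG_RE.match(raw)
    if match is None:
        return raw.strip(), None
    return match.group("text"), match.group("lang")


turns = [
    "Hola, quiero saber si pueden entregar componentes SMD a Alemania. <es-ES>",
    "Actually, can we continue in English? <en-US>",
]

reply_language = None
for raw in turns:
    text, lang = split_language_tag(raw)
    if lang is not None:
        reply_language = lang  # follow the user's most recent language
    print(reply_language, "|", text)
```

Keeping the last seen tag as the reply language is one simple policy. A product might instead require a few consecutive turns in the new language before switching.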
Example: call-center analytics
Now consider a call center with 500 concurrent calls.
The system needs to:
1. Transcribe audio in real time
2. Detect customer intent
3. Flag escalation risk
4. Extract order IDs, dates, names, and product names
5. Summarize the call
6. Push structured data into CRM
A slow ASR system creates a queue. A high-throughput streaming model enables real-time processing.
A practical architecture might look like this:
SIP / WebRTC audio stream
↓
Audio chunking
↓
Nemotron 3.5 ASR streaming
↓
Partial transcript events
↓
LLM / intent engine
↓
CRM, ticketing, analytics, supervisor dashboard
The useful output is not just a transcript. It is structured operational intelligence:
{
  "call_language": "de-DE",
  "customer_intent": "return_request",
  "sentiment": "frustrated",
  "entities": {
    "order_id": "A-49281",
    "product": "IP camera",
    "requested_action": "refund"
  },
  "recommended_agent_action": "escalate_to_retention_team"
}
In this type of system, ASR latency affects the supervisor dashboard, agent assist, compliance monitoring, and customer experience at the same time.
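The hand-off between the ASR stage and the intent stage can be sketched with an `asyncio.Queue`. Everything here is a stand-in: the partial transcripts are hard-coded instead of coming from a real ASR client, keyword matching replaces the LLM or intent engine, and the order-ID pattern is only inferred from the example ID "A-49281".

```python
import asyncio
import re

# Assumed order-ID shape, inferred from the example "A-49281".
ORDER_ID_RE = re.compile(r"\b[A-Z]-\d{5}\b")


async def asr_stage(partials: list, out: asyncio.Queue) -> None:
    """Stand-in for a streaming ASR client: emit transcript events in order."""
    for text in partials:
        await out.put(text)
    await out.put(None)  # end-of-call sentinel


async def intent_stage(inp: asyncio.Queue, record: dict) -> None:
    """Stand-in for the intent engine: keyword and regex matching only."""
    while (text := await inp.get()) is not None:
        if "return" in text.lower():
            record["customer_intent"] = "return_request"
        match = ORDER_ID_RE.search(text)
        if match:
            record["entities"]["order_id"] = match.group()


async def process_call(partials: list) -> dict:
    """Run both stages concurrently for one call and return the structured record."""
    queue: asyncio.Queue = asyncio.Queue()
    record = {"customer_intent": None, "entities": {}}
    await asyncio.gather(asr_stage(partials, queue), intent_stage(queue, record))
    return record


demo = ["I want to", "I want to return the IP camera.", "The order number is A-49281."]
print(asyncio.run(process_call(demo)))
```

The queue decouples the two stages, so a slow intent engine creates visible backlog instead of blocking audio ingestion. A real deployment would run one such pipeline per call and bound the queue size.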
Example: real-time meeting captions with language tags
For meetings, the problem is often not only transcription but multilingual transcription.
Imagine a European engineering team:
Speaker 1: We need to finish the API integration by Friday.
Speaker 2: Ja, aber das Problem ist noch im Scanner-Modul.
Speaker 3: Możemy to przetestować jutro rano.
A multilingual ASR model with automatic language tags can produce a transcript stream like:
We need to finish the API integration by Friday. <en-US>
Ja, aber das Problem ist noch im Scanner-Modul. <de-DE>
Możemy to przetestować jutro rano. <pl-PL>
From there, the application can:
- Show captions in original language
- Translate into a common language
- Create multilingual summaries
- Extract action items
- Index the meeting by speaker, language, and topic
This is where ASR becomes part of a larger agentic workflow.
Where Nemotron 3.5 ASR fits in a voice-agent stack
A practical production voice-agent architecture could look like this:
Frontend
- WebRTC voice widget
- Microphone capture
- Optional text input
- File upload / form fields

Realtime backend
- Audio stream gateway
- Voice activity detection
- Nemotron 3.5 ASR
- Partial transcript handler
- Turn-taking / interruption logic

Agent layer
- LLM
- Prompt / memory
- Tools and business APIs
- CRM / calendar / ticketing integration

Voice output
- TTS
- Audio streaming back to user

Storage and analytics
- Transcript
- Summary
- Intent
- Lead/contact details
- Call quality metrics
The ASR model affects almost every layer.
If ASR is slow, the agent feels slow.
If ASR lacks punctuation, the LLM receives worse input.
If ASR does not support multilingual traffic, the product needs routing logic or multiple models.
If ASR throughput is weak, GPU costs rise quickly.
Nemotron 3.5 ASR is interesting because it addresses these issues together rather than treating ASR as a simple speech-to-text utility.
What developers should test before production
Nemotron 3.5 ASR looks strong, but no ASR model should be deployed blindly.
For real products, teams should test at least six things.
1. Domain vocabulary
General ASR benchmarks rarely include your product names, part numbers, medical terms, legal phrases, or internal acronyms.
Example problem:
User says: "We use ASM and Mycronic lines."
Bad transcript: "We use awesome and my chronic lines."
For industrial, medical, legal, and support use cases, domain vocabulary testing is mandatory.
2. Accents and microphone quality
A model can perform well on clean benchmark speech and still struggle with:
- Laptop microphones
- Bluetooth headsets
- Car audio
- Factory noise
- Call-center compression
- Non-native speakers
- Regional accents
3. Code-switching
Multilingual support is not only about language selection. Real users switch languages mid-call:
Can you send the invoice to Buchhaltung, bitte?
A good test set should include mixed-language utterances.
4. Streaming partials vs final transcript
Voice agents often act on partial transcripts before the final result is available.
The system needs to decide:
- When do we wait?
- When do we interrupt?
- When do we call a tool?
- When do we ask for confirmation?
ASR quality should be measured not only on final transcript accuracy, but also on partial transcript stability.
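One simple way to quantify partial stability is to count how many words the recognizer takes back between successive partial hypotheses, and to act only on the prefix that has stayed unchanged across the most recent partials. The sketch below is an illustrative metric, not an established standard; the function names and sample partials are hypothetical.

```python
def common_prefix_len(a: list, b: list) -> int:
    """Number of leading items that two lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def retracted_words(partials: list) -> int:
    """Total words that appeared in one partial but were changed or dropped in the next."""
    total = 0
    for prev, curr in zip(partials, partials[1:]):
        p, c = prev.split(), curr.split()
        total += len(p) - common_prefix_len(p, c)
    return total


def stable_prefix(partials: list, last_n: int = 2) -> list:
    """Words that are identical at the start of each of the last `last_n` partials."""
    recent = [p.split() for p in partials[-last_n:]]
    if not recent:
        return []
    prefix = recent[0]
    for words in recent[1:]:
        prefix = prefix[:common_prefix_len(prefix, words)]
    return prefix


partials = [
    "cancel my order",
    "cancel my order now",
    "cancel my order no wait change",
    "cancel my order no wait change the address",
]
print(retracted_words(partials))
print(" ".join(stable_prefix(partials)))
```

In this sample, a trigger firing on the second partial would have acted on "now", a word the recognizer later revised. Waiting for a stable prefix trades a little latency for fewer wrong actions.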
5. Latency under concurrency
Single-stream latency is not enough.
You need to test:
- 10 concurrent streams
- 100 concurrent streams
- 500 concurrent streams
- 1000 concurrent streams
Average latency alone is not enough. Track tail latency alongside reliability, utilization, and cost:
- p50 latency
- p95 latency
- p99 latency
- Dropped chunks
- GPU utilization
- Cost per hour
- Cost per 1000 call-minutes
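Two of the metrics above are easy to compute from load-test output. The sketch below uses the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values) and derives cost per 1000 call-minutes from a GPU hourly price and a streams-per-GPU figure. The function names are hypothetical and the price in the example is a placeholder, not a quoted rate.

```python
import math


def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_1000_call_minutes(gpu_hourly_cost: float, streams_per_gpu: int) -> float:
    """Cost of 1000 call-minutes, assuming the GPU is fully packed for the whole hour."""
    call_minutes_per_gpu_hour = streams_per_gpu * 60
    return gpu_hourly_cost / call_minutes_per_gpu_hour * 1000


# Synthetic latencies in milliseconds; replace with measurements from a load test.
latencies_ms = list(range(1, 101))
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")

# Placeholder price of 6.0 per GPU-hour and 100 streams per GPU.
print(cost_per_1000_call_minutes(gpu_hourly_cost=6.0, streams_per_gpu=100))
```

The cost figure assumes full utilization, so real costs will be higher whenever GPUs sit partly idle.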
6. Fine-tuning ROI
For broad-coverage or adaptation-ready languages, fine-tuning may be the difference between a demo and a production system.
A realistic evaluation plan:
1. Collect 5–20 hours of domain-specific audio
2. Create clean transcripts
3. Evaluate baseline WER
4. Fine-tune Nemotron 3.5 ASR
5. Re-evaluate WER and entity accuracy
6. Measure latency and throughput again
For voice agents, entity accuracy may matter more than raw WER.
If the ASR misses filler words, that may be fine.
If it misses the order number, date, medicine name, address, or product SKU, that is a business problem.
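Entity accuracy can be approximated with a very small check: for each utterance, list the entities that must survive transcription and measure how many appear in the ASR output. The sketch below uses case-insensitive substring matching, which is a crude assumption (it ignores spelling variants and can match inside longer words); `entity_recall` is a hypothetical helper name.

```python
def entity_recall(reference_entities: list, hypothesis: str) -> float:
    """Fraction of required entities found verbatim (case-insensitive) in the transcript."""
    if not reference_entities:
        raise ValueError("reference_entities must not be empty")
    text = hypothesis.casefold()
    found = [entity for entity in reference_entities if entity.casefold() in text]
    return len(found) / len(reference_entities)


entities = ["ASM", "Mycronic"]
good = "We use ASM and Mycronic lines."
bad = "We use awesome and my chronic lines."

print(entity_recall(entities, good))
print(entity_recall(entities, bad))
```

The second transcript gets only three of six reference words wrong, yet it loses every entity. Tracking the two numbers separately shows that gap.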
Why this release matters
The ASR layer is becoming part of the agent infrastructure stack.
In older voice systems, speech recognition was often treated as a black-box API:
audio in → text out
But modern voice agents need more:
audio in → streaming transcript → language tag → clean punctuation → low-latency partials → tool-ready text → scalable concurrency
Nemotron 3.5 ASR is important because it points toward this new baseline.
For AI product teams, the lesson is clear:
The best voice agent is not just the best LLM with a microphone. It is a carefully optimized real-time system where ASR, LLM, tools, and TTS are designed around latency, concurrency, and human conversation patterns.
Nemotron 3.5 ASR does not solve the entire voice-agent problem. You still need turn-taking, interruption handling, memory, tool orchestration, safety, evaluation, and domain adaptation.
But it makes the ASR part of the stack much more attractive:
- One multilingual model
- Streaming-first design
- Configurable latency
- High concurrency
- Built-in punctuation
- Automatic language detection
- Open deployment path
For companies building AI receptionists, support agents, call-center copilots, multilingual meeting assistants, and embedded voice interfaces, this is exactly the direction the market needs.
Voice agents are no longer limited by whether speech-to-text works.
The new question is:
Can speech-to-text work fast enough, cheaply enough, and globally enough to make voice feel native?
Nemotron 3.5 ASR is NVIDIA’s answer.