Bilingual Voice Agents Struggle: How Frontier ASR Models Handle Code-Switching

ServiceNow AI benchmarked leading speech recognition models on code-switched bilingual speech. Here's what broke, what held up, and what it means for your voice agent.

LUMIEN | June 10, 2026 | 4 min read

ServiceNow AI published a benchmark on the Hugging Face Blog testing how well leading automatic speech recognition (ASR) models handle code-switching, the real-world habit of bilingual speakers mixing two languages within a single sentence. The results show that every frontier model tested performs noticeably worse on code-switched speech than on clean, single-language audio. For businesses running voice agents that serve bilingual customers, that gap translates directly into failed transactions and misrouted calls.

What happened

ServiceNow AI's study, released on the Hugging Face Blog, examines how frontier ASR models cope with code-switched speech. Code-switching happens when a bilingual speaker flips between two languages mid-sentence, which is completely normal in communities that speak, for example, both Spanish and English, or Tagalog and English.

The study tested several leading speech recognition systems on code-switched audio and compared each system's word error rate (WER) there with its WER on standard monolingual input. WER is the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript, so lower is better. According to the research, every model made significantly more errors when code-switching appeared in the audio.
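To make the metric concrete, here is a minimal sketch of a WER calculation in Python. It is not the benchmark's own scoring code, and the transcripts are invented for illustration; a real evaluation would also normalize punctuation and numerals before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, invented for illustration only.
mono_ref = "i need to check my order status"
mono_hyp = "i need to check my order status"
cs_ref = "necesito check my order porque no ha llegado"
cs_hyp = "necesito check my order por que no a llegado"

print(word_error_rate(mono_ref, mono_hyp))  # 0.0
print(word_error_rate(cs_ref, cs_hyp))      # 0.375 (3 edits over 8 words)
```

In the invented code-switched example, two substitutions and one insertion against an eight-word reference give a WER of 0.375, while the monolingual example scores 0.0.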

Why ASR breaks on code-switched speech

Most ASR models are trained predominantly on single-language datasets. When a speaker switches languages mid-phrase, the model has to handle a vocabulary, phoneme set, and grammar structure it was not optimized for in that context. The transition points between languages are where errors cluster most heavily.
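One way to check whether errors cluster at transition points in your own data is sketched below. It assumes each reference word already carries a language tag and a flag saying whether the ASR output got it right; both would come from your own annotation and alignment step. The function names, the tags, and the sample utterance are hypothetical and are not taken from the benchmark.

```python
from typing import List, Optional, Tuple

# (word, language tag, recognized correctly?) for each reference word.
Token = Tuple[str, str, bool]


def switch_positions(tokens: List[Token]) -> List[int]:
    """Indices where the language tag differs from the previous word's tag."""
    return [i for i in range(1, len(tokens)) if tokens[i][1] != tokens[i - 1][1]]


def error_rates_by_proximity(
    tokens: List[Token], window: int = 1
) -> Tuple[Optional[float], Optional[float]]:
    """Return (error rate near a switch, error rate elsewhere)."""
    near = set()
    for i in switch_positions(tokens):
        # With window=1 this marks the word just before and just after a switch.
        near.update(range(max(0, i - window), min(len(tokens), i + window)))

    def rate(indices) -> Optional[float]:
        indices = list(indices)
        if not indices:
            return None  # no words in this group, so no rate to report
        errors = sum(1 for i in indices if not tokens[i][2])
        return errors / len(indices)

    far = [i for i in range(len(tokens)) if i not in near]
    return rate(near), rate(far)


# Hypothetical annotated utterance, invented for illustration only.
tokens = [
    ("necesito", "es", True), ("check", "en", False), ("my", "en", True),
    ("order", "en", True), ("porque", "es", False), ("no", "es", True),
    ("ha", "es", True), ("llegado", "es", True),
]

print(switch_positions(tokens))          # [1, 4]
print(error_rates_by_proximity(tokens))  # (0.5, 0.0)
```

A large difference between the two rates on real recordings would suggest that the language boundaries, rather than either language on its own, are where a given model struggles.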

This is not a minor edge case. Bilingual speakers often code-switch by default in casual or service conversation contexts, meaning the problem surfaces constantly in real customer interactions, not just in lab conditions.

Why it matters

Voice agents are increasingly being deployed for customer service, appointment booking, order tracking, and support triage. If the underlying ASR layer misrecognizes a product name, an account number, or a yes/no answer because the speaker switched languages, the whole interaction breaks down.

The consequences are practical and measurable:

Calls get misrouted or dropped entirely.
Customers repeat themselves, driving up handle time.
Transcripts fed into downstream AI (intent detection, CRM logging) contain errors that compound.
Bilingual customers get a worse experience than monolingual ones, which is a real fairness and retention issue.

Businesses serving markets with large bilingual populations, including much of the US Southwest, South Florida, the Philippines, Singapore, and many parts of Canada, are directly exposed to this problem right now.

What the benchmark reveals about model selection

The ServiceNow AI study is notable because it puts numbers on a problem that many teams have observed informally but never systematically measured. According to the Hugging Face Blog post, the benchmark is designed to help teams evaluate which ASR system is best suited for a bilingual deployment before they commit to a production stack.

The results indicate that model choice matters more for code-switched use cases than for standard transcription tasks, where most frontier models perform similarly. Picking the wrong model for a bilingual voice agent is not just a minor quality issue. It can make the product functionally unreliable for a large portion of your users.

Our take

This research confirms something we see regularly when auditing client voice and chatbot deployments. Teams choose an ASR vendor based on a benchmark score measured on clean, monolingual test sets, then go live and wonder why satisfaction scores tank among Spanish-English or French-English speakers.

Most off-the-shelf voice agent stacks are not tested against the actual speech patterns of the customer base they will serve. Code-switching is not exotic. It is everyday language for tens of millions of people. A benchmark like this one is useful precisely because it creates a shared, reproducible way to expose that gap before it becomes a production incident.

We would add one caution: a benchmark is only as good as its test data. Before trusting any WER number for your specific use case, you should record or collect a sample of your actual users speaking in context and run your own evaluation. A model that ranks well on a published dataset may still fail on your particular language pair or domain vocabulary.

What to do about it

Identify your bilingual exposure. Check your call recordings or chat logs for evidence of code-switching. If it is there, you have a problem worth measuring.
Use the ServiceNow AI benchmark as a shortlist tool. The results can help you narrow down which ASR models are worth testing for bilingual deployments.
Run your own evaluation on real user audio. Do not rely solely on published benchmarks. Collect 50 to 100 real code-switched samples from your user base and score candidate models against them.
Check your downstream pipeline. Even if you improve ASR accuracy, verify that your intent detection and CRM logging can handle mixed-language text correctly.
Monitor by language segment. After deployment, split your quality metrics by detected language so bilingual users do not get averaged out of your success numbers.
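The evaluation and monitoring steps above can be sketched as a small scoring script. This is a minimal illustration, assuming you have reference transcripts for your own samples plus each candidate model's output; the names model_a and model_b, the segment labels, and all transcripts are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def wer_by_segment(samples: List[dict]) -> Dict[Tuple[str, str], float]:
    """Pooled WER per (model, segment): total edits / total reference words."""
    edits: Dict[Tuple[str, str], int] = defaultdict(int)
    words: Dict[Tuple[str, str], int] = defaultdict(int)
    for sample in samples:
        ref = sample["reference"].lower().split()
        for model, hypothesis in sample["hypotheses"].items():
            key = (model, sample["segment"])
            edits[key] += edit_distance(ref, hypothesis.lower().split())
            words[key] += len(ref)
    return {key: edits[key] / words[key] for key in edits if words[key] > 0}


# Hypothetical samples, invented for illustration only.
samples = [
    {"segment": "monolingual",
     "reference": "i want to track my order",
     "hypotheses": {"model_a": "i want to track my order",
                    "model_b": "i want to track my order"}},
    {"segment": "code-switched",
     "reference": "quiero track my order por favor",
     "hypotheses": {"model_a": "quiero track my order por favor",
                    "model_b": "quiero trabajo my order por favor"}},
    {"segment": "code-switched",
     "reference": "my account number es cuatro dos",
     "hypotheses": {"model_a": "my account number is cuatro dos",
                    "model_b": "my account number is quatro dos"}},
]

results = wer_by_segment(samples)
for (model, segment), wer in sorted(results.items()):
    print(f"{model:8s} {segment:14s} WER={wer:.3f}")
```

In this invented data both models look identical on the monolingual segment and only diverge on the code-switched one, which is the pattern a single averaged score would hide.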

If your voice agent serves any bilingual market, run this test before your next deployment, not after your first wave of complaints.

Source: Hugging Face Blog
