ServiceNow AI benchmarked leading speech recognition models on code-switched bilingual speech. Here's what broke, what held up, and what it means for your voice agent.
ServiceNow AI published a benchmark on the Hugging Face Blog testing how well leading automatic speech recognition (ASR) models handle code-switching, the real-world habit of bilingual speakers mixing two languages within a single sentence. The results show that every frontier model tested performs noticeably worse on code-switched speech than on clean, single-language audio. For businesses running voice agents that serve bilingual customers, that gap translates directly into failed transactions and misrouted calls.
ServiceNow AI released a benchmarking study on the Hugging Face Blog examining how frontier ASR models cope with code-switched speech. Code-switching is what happens when a bilingual speaker flips between two languages mid-sentence. It is completely normal in communities that speak two languages, such as Spanish and English or Tagalog and English.
The study tested several leading speech recognition systems against code-switched audio and compared their word error rates (WER) to performance on standard monolingual input. According to the research, every model showed a significant increase in errors when code-switching appeared in the audio.
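For readers who have not computed WER by hand, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate`, the lowercase-and-split normalization, and the sample sentence are illustrative assumptions, not the benchmark's actual scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalization here is only lowercasing and whitespace splitting,
    which is an assumption; real evaluations usually also strip
    punctuation and normalize numbers.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (minus the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up code-switched utterance, for illustration only.
reference = "quiero pagar my bill por favor"
hypothesis = "quiero pagar mi bill for favor"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

In this made-up example, both errors are single-word substitutions in a six-word reference, so the score is 2/6.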
Most ASR models are trained predominantly on single-language datasets. When a speaker switches languages mid-phrase, the model has to handle a vocabulary, phoneme set, and grammar structure it was not optimized for in that context. The transition points between languages are where errors cluster most heavily.
This is not a minor edge case. Bilingual speakers often code-switch by default in casual and customer-service conversations, so the problem surfaces constantly in real customer interactions, not just in lab conditions.
Voice agents are increasingly being deployed for customer service, appointment booking, order tracking, and support triage. If the underlying ASR layer misrecognizes a product name, an account number, or a yes/no answer because the speaker switched languages, the whole interaction breaks down.
The consequences are practical and measurable: failed transactions and misrouted calls.
Businesses serving markets with large bilingual populations, including much of the US Southwest, South Florida, the Philippines, Singapore, and many parts of Canada, are directly exposed to this problem right now.
The ServiceNow AI study is notable because it puts numbers on a problem that many teams have observed informally but never systematically measured. According to the Hugging Face Blog post, the benchmark is designed to help teams evaluate which ASR system is best suited for a bilingual deployment before they commit to a production stack.
The results indicate that model choice matters more for code-switched use cases than for standard transcription tasks, where most frontier models perform similarly. Picking the wrong model for a bilingual voice agent is not just a minor quality issue. It can make the product functionally unreliable for a large portion of your users.
This research confirms something we see regularly when auditing client voice and chatbot deployments. Teams choose an ASR vendor based on a benchmark score measured on clean, monolingual test sets, then go live and wonder why satisfaction scores tank among Spanish-English or French-English bilingual users.
Most off-the-shelf voice agent stacks are not tested against the actual speech patterns of the customer base they will serve. Code-switching is not exotic. It is everyday language for tens of millions of people. A benchmark like this one is useful precisely because it creates a shared, reproducible way to expose that gap before it becomes a production incident.
We would add one caution: a benchmark is only as good as its test data. Before trusting any WER number for your specific use case, you should record or collect a sample of your actual users speaking in context and run your own evaluation. A model that ranks well on a published dataset may still fail on your particular language pair or domain vocabulary.
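One way to structure that in-house evaluation is sketched below. It assumes you have already scored each utterance (word errors and reference word count) for every candidate model, and that each utterance is labeled either `"mono"` or `"code_switched"`. The function name `summarize`, the record layout, the condition labels, and the sample figures are all hypothetical placeholders, not results from the ServiceNow AI study.

```python
from collections import defaultdict


def summarize(results):
    """Aggregate per-utterance scores into corpus-level WER per model.

    `results` is an iterable of tuples:
        (model, condition, word_errors, reference_words)
    where condition is "mono" or "code_switched" (assumed labels).
    """
    totals = defaultdict(lambda: [0, 0])  # (model, condition) -> [errors, words]
    for model, condition, errors, ref_words in results:
        totals[(model, condition)][0] += errors
        totals[(model, condition)][1] += ref_words

    models = sorted({model for model, _ in totals})
    summary = {}
    for model in models:
        wers = {}
        for condition in ("mono", "code_switched"):
            errors, words = totals[(model, condition)]
            if words == 0:
                raise ValueError(f"no {condition} data for {model}")
            # Corpus-level WER: total errors over total reference words.
            wers[condition] = errors / words
        mono, switched = wers["mono"], wers["code_switched"]
        summary[model] = {
            "mono_wer": mono,
            "code_switched_wer": switched,
            # Relative increase is undefined when the monolingual WER is 0.
            "relative_increase": (switched - mono) / mono if mono > 0 else None,
        }
    return summary


# Placeholder numbers purely to show the shape of the input.
sample = [
    ("model_a", "mono", 5, 100),
    ("model_a", "mono", 3, 100),
    ("model_a", "code_switched", 12, 100),
    ("model_a", "code_switched", 18, 100),
    ("model_b", "mono", 10, 200),
    ("model_b", "code_switched", 20, 200),
]
report = summarize(sample)
best = min(report, key=lambda m: report[m]["code_switched_wer"])
for model, stats in report.items():
    print(model, stats)
print("lowest code-switched WER:", best)
```

Summing errors and reference words before dividing, rather than averaging per-utterance WER, keeps short utterances such as "yes" or "no" from dominating the score. If those short answers matter most for your agent, it may be worth reporting them as a separate condition.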
If your voice agent serves any bilingual market, run this kind of evaluation before your next deployment, not after your first wave of complaints.