Sarvam AI Puts India on the Global Map with Vision OCR and Bulbul V3 Voice Model


Sarvam AI’s Vision OCR and Bulbul V3 showcase India’s rising strength in language-focused AI, with benchmark results that rival those of global tech giants.

For years, the global artificial intelligence race has largely been dominated by companies in the United States and China. Despite its vast tech talent and growing digital ecosystem, India has rarely been recognised for building foundational AI technologies. That narrative may now be shifting, thanks to Bengaluru-based startup Sarvam AI, which is carving out a niche with tools designed specifically for Indian languages and use cases.

Sarvam AI describes its approach as building “sovereign AI” — foundational models developed and trained entirely within India. This week, two of its products have generated significant attention: Sarvam Vision, an optical character recognition (OCR) system, and Bulbul V3, a text-to-speech model tailored for Indian voices.

Sarvam Vision has impressed the AI community by outperforming larger, well-known models such as Google Gemini, ChatGPT, and Anthropic Claude on certain OCR benchmarks. According to the company, the model achieved an accuracy score of 84.3 percent on olmOCR-Bench, beating Gemini 3 Pro and other recent tools such as DeepSeek OCR v2, while ChatGPT scored noticeably lower.



The system also performed strongly on OmniDocBench v1.5, a test that evaluates how well AI can process complex, real-world documents. Sarvam Vision recorded an overall score of 93.28 percent, demonstrating particular strength in interpreting complicated layouts, technical tables, and mathematical formulas — areas where traditional OCR tools often struggle.

The results have drawn international attention, especially given that Sarvam had previously faced doubts about its focus on Indic-language AI. Critics who once questioned the strategy are now revising their views.

Tech commentator Deedy Das acknowledged this shift, writing, “I was wrong about Sarvam. When I wrote about them a year ago, I felt like the direction to train small Indic language models was wrong. But boy, have they turned it around. They have the best text-to-speech, speech-to-text, and OCR models for Indic languages, and that's actually really valuable. The pricing is very reasonable.”

Users have also shared positive feedback. One early adopter wrote, “I used this a couple of days ago! Oh man wow.”

Alongside its OCR breakthrough, Sarvam introduced Bulbul V3, a text-to-speech system designed to produce natural and expressive voices across Indian languages. The tool competes with global leaders such as ElevenLabs but focuses on local linguistic nuances.

“Today we're releasing Bulbul V3, our most capable text-to-speech model designed to deliver natural, expressive and production-ready voices for Indian languages,” Sarvam noted in a blog post. “Bulbul V3 minimizes failure modes, delivering content-accurate, stable speech across the inputs that matter for India-specific use cases.”

Currently, Bulbul V3 supports more than 35 voices across 11 Indian languages, with plans to expand to 22.

Pratik Desai, founder of KissanAI, shared his endorsement: “We use Bulbul as our go-to tts model for our Indic use cases, and they have just gotten better with each release. Meanwhile, ElevenLabs cost never made sense for Indic or any other languages.”

