NVIDIA’s Open Source Push: Transforming Multilingual AI with Granary

With more than 7,000 languages spoken worldwide, AI developers face the formidable challenge of accommodating vast linguistic diversity. Until now, only a fraction of these languages have been supported by mainstream speech translation technologies. Enter NVIDIA, a leader in AI innovation, which has launched the Granary multilingual speech dataset alongside two new AI models, Canary-1b-v2 and Parakeet-tdt-0.6b-v3. Together, these tools promise to enhance speech recognition and translation, setting a new standard for both efficiency and inclusivity.

Expanding the Language Horizon

Granary marks a milestone in collaborative innovation, initiated by NVIDIA in partnership with Carnegie Mellon University and the Bruno Kessler Foundation. To address the data barriers that low-resource languages pose for AI development, the team leveraged the speech data processing tools in NVIDIA NeMo to transform vast amounts of unannotated public audio into structured, high-quality training samples, improving the models' learning without heavy manual annotation.

Granary’s collection encompasses approximately 650,000 hours of speech recognition and over 350,000 hours of speech translation data, spanning 25 European languages. Notably, it includes languages with limited data, such as Estonian, Croatian, and Maltese, while also supporting Russian and Ukrainian. This initiative empowers developers to rapidly train ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) models for most EU official languages, reinforcing language diversity in speech AI.

  • Granary achieves comparable recognition and translation accuracy with half the training data needed by other popular datasets, making it ideal for underrepresented language development.
  • It has been open-sourced on GitHub and Hugging Face and will be featured at the Interspeech conference in the Netherlands; a minimal loading sketch follows this list.
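For developers who want to inspect the corpus before committing to a full download, the Hugging Face `datasets` library can stream it. The sketch below is only illustrative: the dataset ID ("nvidia/Granary"), the configuration name, and the field names are assumptions, so check the official Hugging Face page for the exact identifiers and schema.

```python
# Minimal sketch: streaming a slice of Granary with the Hugging Face `datasets`
# library. The dataset ID, config name ("et" for Estonian), and field names are
# assumptions; verify them on the official dataset page before use.
from datasets import load_dataset

# Streaming avoids downloading hundreds of thousands of hours of audio locally.
ds = load_dataset("nvidia/Granary", "et", split="train", streaming=True)

# Peek at the first few samples to discover the actual schema.
for i, sample in enumerate(ds):
    print({k: type(v).__name__ for k, v in sample.items()})
    if i == 2:
        break
```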

Introducing Canary-1b-v2 and Parakeet-tdt-0.6b-v3

Demonstrating Granary’s application potential, NVIDIA unveiled two cutting-edge speech models. Canary-1b-v2, with a billion parameters, is tailored for highly accurate transcription and translation. It ranks near the top of Hugging Face’s multilingual speech recognition leaderboard, supports transcription in 25 European languages and translation between English and the other 24, and delivers quality nearly on par with models three times its size at ten times faster inference.
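As an illustration of how such a model is typically driven, here is a minimal sketch that loads Canary-1b-v2 through the NeMo toolkit. The `ASRModel.from_pretrained` entry point and the `source_lang`/`target_lang` arguments follow the conventions of NVIDIA's Canary model cards; treat them as assumptions and confirm against the published model card.

```python
# Minimal sketch, assuming the NeMo toolkit (pip install "nemo_toolkit[asr]")
# exposes Canary-1b-v2 through its usual ASRModel interface.
from nemo.collections.asr.models import ASRModel

# Downloads the checkpoint from Hugging Face on first use.
canary = ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2")

# Same-language transcription: English audio in, English text out.
transcripts = canary.transcribe(["meeting.wav"], source_lang="en", target_lang="en")

# Speech translation: English audio in, German text out.
translations = canary.transcribe(["meeting.wav"], source_lang="en", target_lang="de")

# Depending on the NeMo version, results are plain strings or hypothesis
# objects carrying a .text attribute.
print(getattr(transcripts[0], "text", transcripts[0]))
print(getattr(translations[0], "text", translations[0]))
```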

The Parakeet-tdt-0.6b-v3 model is built for high-speed, high-throughput processing. With 600 million parameters, this streamlined model handles audio up to 24 minutes long in a single inference pass and detects the input language automatically, with no additional prompts, making it ideal for scenarios that demand low latency and real-time responses.
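A corresponding sketch for Parakeet, again assuming NeMo's usual `ASRModel` interface: because the model identifies the spoken language on its own, no language arguments are passed.

```python
# Minimal sketch for Parakeet-tdt-0.6b-v3 via NeMo; the loading pattern mirrors
# the Canary example above and is an assumption, not confirmed usage.
from nemo.collections.asr.models import ASRModel

parakeet = ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v3")

# Batch several files per call to exploit the model's throughput; per the
# announcement, each clip can run up to roughly 24 minutes.
outputs = parakeet.transcribe(["call_01.wav", "call_02.wav"], batch_size=2)

for out in outputs:
    # No language prompt was given: the model detects the language itself.
    print(getattr(out, "text", out))
```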

Evolving Speech Translation and Subtitles

Both models offer features such as automatic punctuation, capitalization, and word-level timestamps, making them well suited to generating subtitles, multilingual customer support, speech translation, and virtual assistant applications. Developers can also fine-tune or retrain the models for other languages and use cases as needed.
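To make the subtitle use case concrete, the sketch below converts word-level timestamps into SRT cues. The input format (dicts with `word`, `start`, and `end` in seconds) is a hypothetical stand-in for whatever structure the models actually return, and the grouping rule is only illustrative.

```python
# Illustrative subtitle helper: group word-level timestamps into SRT cues.
# The input format and the eight-words-per-cue rule are assumptions made for
# this example, not the models' actual output schema.

def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,500."""
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words: list[dict], max_words_per_cue: int = 8) -> str:
    """Build numbered SRT cues from [{'word', 'start', 'end'}, ...]."""
    cues = []
    for i in range(0, len(words), max_words_per_cue):
        chunk = words[i:i + max_words_per_cue]
        header = f"{len(cues) + 1}\n{to_srt_time(chunk[0]['start'])} --> {to_srt_time(chunk[-1]['end'])}"
        cues.append(f"{header}\n{' '.join(w['word'] for w in chunk)}\n")
    return "\n".join(cues)

# Example with made-up timestamps:
print(words_to_srt([
    {"word": "Hello", "start": 0.00, "end": 0.40},
    {"word": "world.", "start": 0.45, "end": 0.90},
]))
```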

NVIDIA NeMo: The Backbone of Speech Translation Development

At the heart of this speech translation work is NVIDIA NeMo, a modular AI development platform designed for managing the AI model lifecycle. The NeMo Curator tool sifts through source data for suitable samples, ensuring training data quality and consistency, while the NeMo Speech Data Processor formats the speech data for the models, including aligning and cleaning it.
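As a rough idea of what such curation involves, the sketch below applies a simple quality filter to a NeMo-style JSON-lines manifest (one record with `audio_filepath`, `duration`, and `text` per line). It is a conceptual illustration only, not the NeMo Curator or Speech Data Processor API, and the thresholds are invented.

```python
# Conceptual illustration only (not the NeMo Curator / Speech Data Processor
# API): the kind of quality filter a curation pipeline applies to a NeMo-style
# JSON-lines manifest. All thresholds below are invented for the example.
import json

def filter_manifest(in_path: str, out_path: str,
                    min_dur: float = 1.0, max_dur: float = 40.0,
                    max_chars_per_sec: float = 25.0) -> int:
    """Keep entries with non-empty text, plausible duration, and speaking rate."""
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            entry = json.loads(line)
            dur, text = entry["duration"], entry["text"].strip()
            if not text or not (min_dur <= dur <= max_dur):
                continue
            if len(text) / dur > max_chars_per_sec:  # likely misaligned transcript
                continue
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
            kept += 1
    return kept

print(filter_manifest("raw_manifest.json", "clean_manifest.json"), "samples kept")
```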

Pushing the Frontiers of Speech AI

By opening up Granary and the associated speech models, along with the data processing and model-building methodologies, NVIDIA is accelerating the pace of speech AI development worldwide. The move is particularly impactful in regions with scarce language translation resources, laying the groundwork for a more inclusive technological infrastructure. The simultaneous release of Granary, Canary, and Parakeet expands the language boundaries of speech AI, providing a solid foundation for building global, multilingual conversation and translation systems.

The datasets and models are available for download on GitHub and Hugging Face, inviting exploration into how these resources can shape the future of speech technology.
