Google’s Gemini 2.5 Adds Real-Time Audio And Custom Speech Tools


Google has introduced native audio capabilities in Gemini 2.5, expanding the model's support for real-time dialogue and controllable text-to-speech (TTS) generation.

In a rush? Here are the quick facts:

  • Users can control tone, accent, and emotion using voice or prompts.
  • Text-to-speech features allow expressive, multilingual, multi-speaker audio generation.
  • Gemini can ignore background noise and respond only when relevant.

Google announced that users and developers can now hold spoken conversations with the AI and produce audio content in more than 24 languages.

Google states that Gemini 2.5 now understands and generates speech natively as audio, which lets users interact more quickly and naturally. The model accepts natural-language instructions to modify its tone, accent, and style, and can add non-verbal elements such as pauses and whispers.

The system stays connected to external tools such as Google Search and custom APIs throughout a conversation, allowing it to retrieve relevant information mid-dialogue.

One feature aims to improve context awareness: Gemini 2.5 detects background speech or noise and responds only when appropriate. The system also supports audio-video understanding, which enables it to analyze and comment on a video feed or shared screen content.

The text-to-speech component has also been updated. Users can now control audio generation with advanced features that include emotional tone adjustment, pacing control, pronunciation customization, and multi-speaker audio output. The features work with different content types, including storytelling, announcements, and podcasts.
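As a rough illustration of how these controls are typically expressed, the sketch below builds a request body for Gemini's public generateContent REST API: style directions (tone, pacing, whispering) ride along in the prompt text, while voices for multi-speaker output are assigned in a speech configuration. The model name, field names, and voice names here are assumptions based on Google's published API shape for the TTS previews, not confirmed by the article.

```python
import json

# Assumed preview model name for Gemini 2.5 TTS (hypothetical here).
MODEL = "gemini-2.5-flash-preview-tts"

def build_tts_request(text, speakers):
    """Build a multi-speaker TTS request body.

    `speakers` maps the speaker labels used in `text` to prebuilt
    voice names (illustrative examples: "Kore", "Puck").
    """
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            # Ask the model to return audio rather than text.
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {
                    "speakerVoiceConfigs": [
                        {
                            "speaker": name,
                            "voiceConfig": {
                                "prebuiltVoiceConfig": {"voiceName": voice}
                            },
                        }
                        for name, voice in speakers.items()
                    ]
                }
            },
        },
    }

# Emotional tone and pacing are steered through the prompt itself.
body = build_tts_request(
    "Say cheerfully:\nHost: Welcome back!\nGuest: Glad to be here.",
    {"Host": "Kore", "Guest": "Puck"},
)
print(json.dumps(body, indent=2))
```

The resulting JSON would be POSTed to the model's generateContent endpoint with an API key; the response carries base64-encoded audio rather than text.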

Google provides Gemini 2.5 Pro and Flash previews to developers through Google AI Studio and Vertex AI. The Flash preview is aimed at fast, low-cost use, while Pro offers stronger performance on complex prompts.

Google watermarks all AI-generated audio with SynthID for transparency, and says it conducted internal and external safety assessments before releasing the system to the public. The company frames these features as part of its broader effort to build multimodal AI systems that work across text, image, video, code, and audio.
