April 9 news: Amazon has released Nova Sonic, a next-generation generative AI model that processes speech natively and generates natural, fluid speech. According to Amazon, Nova Sonic's performance on key benchmarks such as speed, speech recognition, and conversation quality is comparable to cutting-edge speech models from OpenAI and Google.

Nova Sonic is Amazon's response to newer AI speech models, such as the one powering ChatGPT's voice mode, which feel more natural in voice interactions than more rigid earlier models such as those behind the original Alexa.
Nova Sonic is available through Bedrock, Amazon's developer platform for building enterprise-grade AI applications, and is accessed through a new two-way streaming API. In a press release, Amazon called Nova Sonic "the most cost-effective" AI voice model on the market, priced about 80% lower than OpenAI's GPT-4o.
According to Rohit Prasad, Senior Vice President and Chief Scientist of Amazon's Artificial General Intelligence (AGI) division, some components of Nova Sonic already power Alexa+, Amazon's upgraded digital voice assistant.
Prasad said Nova Sonic excels at routing user requests to different APIs compared to competitors' AI speech models. This capability allows Nova Sonic to know when it needs to get real-time information from the Internet, parse proprietary data sources, or take action in an external application, and use the right tools to accomplish the task.
In two-way conversations, Nova Sonic waits for the "right moment" to speak, taking into account the speaker's pauses and interruptions. In addition, Nova Sonic generates a text transcript of the user's speech, which developers can use in a variety of application scenarios.
According to Prasad, Nova Sonic makes fewer speech recognition errors than other AI speech models, meaning the model is relatively good at understanding a user's intent even when the user mumbles, misspeaks, or is in a noisy environment. On Multilingual LibriSpeech, a benchmark that measures speech recognition across languages and dialects, Amazon says Nova Sonic has an average Word Error Rate (WER) of just 4.2% across English, French, Italian, German, and Spanish. That is, in these languages, the model's output differs from a human transcription in about 4 out of every 100 words.
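To make the WER figure concrete, here is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of words in the reference. The function name `word_error_rate` and the sample sentences are illustrative assumptions. This is not Amazon's evaluation code, and real benchmarks typically normalize case and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no text
    normalization, unlike most published benchmark pipelines.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion: reference word missing
                curr[j - 1] + 1,                  # insertion: extra hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word and one wrong word out of six.
example_wer = word_error_rate(
    "please turn on the kitchen lights",
    "please turn on kitchen light",
)
```

In this example the hypothesis drops "the" and replaces "lights" with "light", giving 2 errors over 6 reference words. A reported WER of 4.2% corresponds to roughly 4 such errors per 100 reference words.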
1AI notes that on another benchmark, Augmented Multi-Party Interaction, which measures loud interactions with multiple participants, Amazon says Nova Sonic outperforms OpenAI's GPT-4o-transcribe model by 46.7% in word error rate. Nova Sonic also boasts industry-leading speed, with an average perceived latency of 1.09 seconds, Amazon said. This is faster than the GPT-4o model powering OpenAI's Realtime API, which has a response time of 1.18 seconds, according to benchmarking results from Artificial Analysis.
Prasad said Nova Sonic is part of Amazon's broader strategy to build Artificial General Intelligence (AGI), which the company defines as "AI systems that can do everything a human can do on a computer." Looking ahead, Prasad said Amazon plans to introduce more AI models that can understand different modalities, including image, video, and speech, as well as "other sensory data that's relevant when bringing things into the physical world."
Amazon's AGI division, headed by Prasad, seems to be playing an increasingly important role in the company's product strategy. Just last week, Amazon launched a preview of Nova Act, an AI model that can operate a web browser and appears to power elements of Alexa+ and Amazon's "Buy for Me" feature. Prasad said that, starting with Nova Sonic, the company wants to make more of its internal AI models available to developers to help them build apps.