On August 29th, OpenAI moved its "Realtime API" out of beta and into general availability for production use.

According to 1AI, the API is aimed at enterprises and developers building voice assistants for real-world scenarios such as customer support, education, and personal productivity. Its core component, the "gpt-realtime" model, uses an end-to-end speech-to-speech architecture that processes and generates audio directly, eliminating the intermediate text conversion step. According to OpenAI, the model responds faster and sounds more natural than its predecessor, and is better at handling complex instructions.
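For orientation, here is a minimal sketch of what opening a speech-to-speech session with gpt-realtime can look like over a raw WebSocket. The endpoint path, query parameter, and event names follow the publicly documented beta-era Realtime API and may have shifted at GA, so treat them as assumptions rather than the confirmed interface.

```python
# Minimal sketch: connect to the Realtime API and request a spoken response.
# Endpoint, query parameter, and event names follow the beta-era conventions
# and are assumptions for the GA release.
import asyncio
import json
import os

import websockets  # pip install websockets (>=14; older versions use extra_headers)


async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        # Ask the server to generate a response; event shape mirrors the
        # published beta event schema.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"instructions": "Greet the caller and offer help."},
        }))
        # Print the types of the first few server events
        # (e.g. session.created, response audio deltas).
        for _ in range(5):
            print(json.loads(await ws.recv())["type"])


asyncio.run(main())
```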
OpenAI says the gpt-realtime model can now pick up non-verbal signals such as laughter, switch languages mid-conversation, and adjust its tone of voice - for example, producing "a friendly tone with a French accent" or "a faster, more professional tone". In addition, the model adds two new voices, "Cedar" and "Marin", and refines the eight existing voices.
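A hedged illustration of how voice and delivery might be steered: the session.update event below selects the new "marin" voice and requests a specific speaking style through instructions. The field names mirror the beta session schema and are assumptions here.

```python
# Hypothetical session configuration choosing a voice and speaking style;
# field names follow the beta-era schema and are not confirmed for GA.
import json

session_update = {
    "type": "session.update",
    "session": {
        "voice": "marin",  # or "cedar", the other newly added voice
        "instructions": (
            "Speak with a friendly tone and a light French accent. "
            "Switch languages if the caller does."
        ),
    },
}
print(json.dumps(session_update, indent=2))
```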
In performance benchmarks, the gpt-realtime model shows significant improvements: accuracy rises from 65.6% to 82.8% on the Big Bench Audio benchmark, from 20.6% to 30.5% on MultiChallenge, and from 49.7% to 66.5% on ComplexFuncBench.
The update also improves tool integration. OpenAI says the model makes function calls more reliably by selecting the right tool, triggering it at the right moment, and filling in its parameters correctly. Developers can connect to external tools and services through Session Initiation Protocol (SIP) and Model Context Protocol (MCP) servers. Meanwhile, a reusable prompts feature lets developers save configurations and tool settings for different usage scenarios, further improving development efficiency.
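To make the tool-calling flow concrete, the sketch below declares a single function tool in the session configuration so the model can decide when to invoke it. The JSON-Schema style follows OpenAI's standard function-calling convention, but the exact GA field layout and the get_order_status function itself are assumptions for illustration.

```python
# Sketch of registering a function tool with the session; the model chooses
# when to call it. The tool and its parameters are hypothetical examples.
tool_config = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "function",
                "name": "get_order_status",
                "description": "Look up the shipping status of a customer order.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "order_id": {"type": "string"},
                    },
                    "required": ["order_id"],
                },
            }
        ],
        "tool_choice": "auto",  # let the model decide when to call the tool
    },
}
```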
The API now supports image input. Users can send screenshots or photos during a conversation, and the model can work with the image content - for example, reading text in the image or answering questions about it. Developers control which images the model is allowed to access.
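A rough sketch of how an image might be attached to the conversation as a user message follows; the input_image content type and data-URL encoding mirror the pattern used in OpenAI's other APIs and are assumptions for the Realtime API's GA schema.

```python
# Hedged sketch: attach a local screenshot to the conversation so the model
# can answer questions about it. Content-type names are assumptions.
import base64
import json

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

image_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_text", "text": "What error does this screenshot show?"},
            {"type": "input_image", "image_url": f"data:image/png;base64,{image_b64}"},
        ],
    },
}
print(json.dumps(image_item)[:120], "...")
```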
In addition, the API adds two useful features: developers can set a token usage limit and trim the context of multi-turn conversations, both of which help control costs in longer sessions. On pricing, the cost of using the gpt-realtime model has been cut by 20%: audio input tokens now cost $32 per million (IT Home note: about RMB 229 at the current exchange rate), audio output tokens $64 per million (about RMB 457.9), and cached input tokens $0.40 per million (about RMB 2.9).
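Using the per-million-token prices quoted above, a back-of-the-envelope cost estimate for a session is straightforward; the token counts in the example are made up purely for illustration.

```python
# Rough cost estimate from the listed per-million-token prices.
AUDIO_IN_PER_M = 32.00    # USD per 1M audio input tokens
AUDIO_OUT_PER_M = 64.00   # USD per 1M audio output tokens
CACHED_IN_PER_M = 0.40    # USD per 1M cached input tokens


def session_cost(audio_in: int, audio_out: int, cached_in: int = 0) -> float:
    """Return the estimated USD cost for the given token counts."""
    return (audio_in * AUDIO_IN_PER_M
            + audio_out * AUDIO_OUT_PER_M
            + cached_in * CACHED_IN_PER_M) / 1_000_000


# Example: 50k audio-in, 80k audio-out, 20k cached tokens -> about $6.73
print(f"${session_cost(50_000, 80_000, 20_000):.2f}")
```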
OpenAI says the API can detect problematic content and automatically terminate a conversation that violates the platform's policies. However, judging from how safety in language models has evolved, this should not be the only safeguard, and developers still need to layer their own application-specific safety measures on top.
For EU users, the API offers data residency options and dedicated privacy terms for business customers, in order to comply with the region's data protection regulations.