March 31, 2025 - At today's Baidu AI DAY, Baidu released the first end-to-end speech-language large model based on Cross-Attention, announcing ultra-low latency and ultra-low cost: in voice Q&A scenarios over telephone voice channels, call costs are about 50%-90% lower than the industry average.

On the same day, Wen Xiaoyan announced a brand refresh and became the first product to integrate the model. It also brings upgraded functions such as multi-model fusion scheduling and picture Q&A. With the model integrated, Wen Xiaoyan supports more human-like voice chat as well as regional dialects such as those of Chongqing, Guangxi, Henan, Guangdong, and Shandong. According to reports, the speech model has very low training and usage costs and very fast inference response; in voice interaction, it can cut user waiting time from the 3-5 seconds common in the industry to about 1 second.
The updated Wen Xiaoyan also supports "multi-model fusion scheduling". It integrates Baidu's self-developed models such as Wenxin X1 and Wenxin 4.5, and connects to high-quality third-party models such as DeepSeek-R1, enabling intelligent collaboration among multiple models. Users can choose "automatic mode" to call the optimal model combination with one tap, or select a single model for a specific task as needed, improving response speed and task-handling capability.
1AI learned from the event that Wen Xiaoyan has also enhanced its photo Q&A feature. The user takes or uploads a picture and asks a question by text or voice to get an in-depth analysis directly. For example, photographing a math problem can generate a real-time solution and a video explanation; uploading several product images can compare parameters and prices to assist shopping decisions.
In addition, Wen Xiaoyan added a picture trivia feature. Users can preset personalized perspectives such as "history scholar" and "science and technology expert" to get a multi-dimensional interpretation of the same picture. For example, when the user asks "The mystery of cats and windows: what is the science behind cats' love of sitting by the window?", Wen Xiaoyan can give a distinct interpretation from perspectives such as hunting instinct, energy acquisition, and territorial awareness.
Jia Lei, chief architect of Baidu Speech, revealed that the model is the industry's first end-to-end speech-language large model based on a new Cross-Attention mechanism. "In voice scenarios that meet certain interaction metrics, the cost of calling the large model is 50%-90% lower than the industry average. In addition, the inference response speed is extremely fast, compressing the waiting time of voice interaction to about 1 second and greatly improving interaction fluency. At the same time, with the support of the large model, streaming word-by-word, LLM-driven multi-emotional speech synthesis is realized, with full, realistic, and human-like emotion, and the listening experience during interaction is greatly improved."