July 5, 2025 - AI chatbot platform Character.AI released a research paper and a video demonstration of TalkingMachines, an autoregressive diffusion model designed to make AI character interactions more realistic.

The model, which has not yet been deployed on the Character.AI platform, enables FaceTime-like visual interactions during calls: according to the paper and the accompanying video demonstration, the user simply provides a picture and an audio signal.
The model is based on Diffusion Transformer (DiT) technology, which acts essentially as an "artist" that creates detailed images from random noise and keeps refining them until they look right; what Character.AI has done is make this process extremely fast, fast enough to run in real time.
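As a rough illustration of that idea (a minimal sketch, not Character.AI's actual code), a diffusion sampler starts from pure noise and repeatedly applies a learned denoiser; the `denoiser` callable below stands in for the real Diffusion Transformer, and the step count and update rule are assumptions:

```python
import torch

# Hypothetical sketch of the diffusion idea behind a DiT: start from pure
# noise and let a learned "denoiser" refine the latent step by step into a
# detailed image. Real-time generation comes from driving num_steps very low.
@torch.no_grad()
def sample(denoiser, shape, num_steps=50, device="cpu"):
    x = torch.randn(shape, device=device)              # pure random noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = denoiser(x, t)                              # model predicts how to refine x
        x = x + dt * v                                  # one small refinement step
    return x                                            # final, detailed latent/image
```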
TalkingMachines relies on several key techniques, including Flow-Matched Diffusion, Audio-Driven Cross Attention, Sparse Causal Attention, and Asymmetric Distillation.
The first of these, Flow-Matched Diffusion, ensures that AI characters move more naturally by training on a wide range of motions, from subtle facial expressions to more exaggerated gestures. Audio-Driven Cross Attention, in turn, lets the AI not only hear the words but also understand the rhythm, pauses, and intonation in the audio and translate them into precise lip-syncing, nods, and winks.
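A minimal sketch of what audio-driven cross-attention can look like (the module name, shapes, and residual wiring below are illustrative assumptions, not the paper's implementation): video-frame tokens act as queries while audio features supply the keys and values, so the timing of the speech directly steers each frame.

```python
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Illustrative audio-driven cross-attention block (assumed design):
    video tokens query the audio features, so rhythm, pauses, and
    intonation in the audio can steer lip and head motion per frame."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (batch, frames * patches, dim) -> queries
        # audio_tokens: (batch, audio_steps, dim)      -> keys and values
        attended, _ = self.attn(query=self.norm(video_tokens),
                                key=audio_tokens,
                                value=audio_tokens)
        return video_tokens + attended  # residual: audio modulates the video stream
```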
Sparse Causal Attention lets Character.AI process video frames more efficiently, while Asymmetric Distillation is what allows the video to be generated in real time, producing a FaceTime-like call.
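One way to picture sparse causal attention (again an illustrative sketch under assumed parameters, not Character.AI's exact pattern): each new frame attends only to itself and a small window of the most recent previous frames, never to future ones, which keeps the cost per generated frame roughly constant for streaming.

```python
import torch

# Illustrative sparse causal mask over video frames: frame i may attend only
# to itself and to the `window` most recent earlier frames. The window size
# is an assumption for demonstration.
def sparse_causal_mask(num_frames: int, window: int = 3) -> torch.Tensor:
    idx = torch.arange(num_frames)
    causal = idx.unsqueeze(0) <= idx.unsqueeze(1)              # no peeking at future frames
    recent = (idx.unsqueeze(1) - idx.unsqueeze(0)) <= window   # only a short history
    return causal & recent                                     # True = attention allowed

print(sparse_causal_mask(6, window=2).int())
```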
Character.AI emphasizes that this research breakthrough is not just about facial animation; it is a step toward real-time interactive audio-visual AI characters. The model supports a wide range of styles, including photorealistic humans, anime, and 3D avatars.
1AI reference links:
- GitHub project page
- TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models