First Audio LLM to Preserve Emotions: Meta Open-Sources the 7B Spirit LM for "Audio + Text" Multimodal Tasks

Meta recently open-sourced Spirit LM, a 7B-parameter multimodal language model that understands and generates both speech and text, and can shift between the two modes very naturally. It not only handles basic speech-to-text and text-to-speech tasks, but also captures and reproduces the emotions and styles present in speech.
