Tsinghua Yao Class alumni release their first multimodal embedding model, setting a new SOTA in multimodal retrieval

Voyage-multimodal-3 performs strongly on multimodal retrieval tasks, improving retrieval accuracy by 19.63% over existing models and supporting vectorized processing of documents with complex layouts, such as PDFs and screenshots. The model is built on a unified Transformer encoder that processes interleaved text and images, overcoming the modality-gap problem and achieving higher accuracy in mixed-modal retrieval. It surpasses OpenAI CLIP and other mainstream models by 20%-45% on table, document-screenshot, and text retrieval tasks respectively, simplifying the processing of unstructured data.
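
To make the mixed-modal retrieval workflow concrete, here is a minimal sketch that embeds a text query and document pages containing interleaved text and screenshots into the same vector space, then ranks documents by cosine similarity. It assumes the `voyageai` Python client's `multimodal_embed` method and its parameters (`inputs`, `model`, `input_type`) as described in Voyage's public documentation; the file names are hypothetical, and this is an illustrative sketch rather than the authors' reference implementation.

```python
# Sketch: mixed-modal retrieval with voyage-multimodal-3.
# Assumptions: the `voyageai` client exposes `multimodal_embed`, which
# accepts lists of interleaved strings and PIL images and returns one
# embedding per input list; file paths below are placeholders.
import math

import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# Each document is an interleaved text + image input,
# e.g. a caption plus a PDF page screenshot.
documents = [
    ["Quarterly revenue table:", Image.open("report_page_3.png")],
    ["Architecture diagram of the pipeline.", Image.open("diagram.png")],
]

doc_result = vo.multimodal_embed(
    inputs=documents,
    model="voyage-multimodal-3",
    input_type="document",
)

query_result = vo.multimodal_embed(
    inputs=[["What was revenue in Q3?"]],
    model="voyage-multimodal-3",
    input_type="query",
)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank documents against the query in the shared embedding space.
query_vec = query_result.embeddings[0]
ranked = sorted(
    enumerate(doc_result.embeddings),
    key=lambda pair: cosine(query_vec, pair[1]),
    reverse=True,
)
for idx, _ in ranked:
    print(documents[idx][0])
```

Because text and images share one embedding space, a plain-text query can retrieve a screenshot-only page directly, which is what removes the separate OCR/layout-parsing step for unstructured data.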
