{"id":29699,"date":"2025-02-27T09:44:58","date_gmt":"2025-02-27T01:44:58","guid":{"rendered":"https:\/\/www.1ai.net\/?p=29699"},"modified":"2025-02-27T09:44:58","modified_gmt":"2025-02-27T01:44:58","slug":"%e5%be%ae%e8%bd%af-phi-4-%e5%a4%9a%e6%a8%a1%e6%80%81%e5%8f%8a%e8%bf%b7%e4%bd%a0%e6%a8%a1%e5%9e%8b%e4%b8%8a%e7%ba%bf%ef%bc%8c%e8%af%ad%e9%9f%b3%e8%a7%86%e8%a7%89%e6%96%87%e6%9c%ac%e5%85%a8%e8%83%bd","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/29699.html","title":{"rendered":"Microsoft Launches Phi-4 Multimodal and Mini Models Covering Speech, Vision, and Text"},"content":{"rendered":"<p>February 27: <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%be%ae%e8%bd%af\" title=\"[View articles tagged with [Microsoft]]\" target=\"_blank\" >Microsoft<\/a> released <a href=\"https:\/\/www.1ai.net\/en\/tag\/phi-4\" title=\"_Other Organiser\" target=\"_blank\" >Phi-4<\/a> in December 2024, a small language model (SLM) that is a top performer in its class. Today, Microsoft is further expanding the Phi-4 family with <strong>two new models: Phi-4 <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%a4%9a%e6%a8%a1%e6%80%81\" title=\"[View articles tagged with [multimodal]]\" target=\"_blank\" >multimodal<\/a> (Phi-4-multimodal) and Phi-4-mini.<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-29700\" title=\"6c776badj00ssbji1002bd000ku00brp\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/02\/6c776badj00ssbji1002bd000ku00brp.jpg\" alt=\"6c776badj00ssbji1002bd000ku00brp\" width=\"750\" height=\"423\" \/><\/p>\n<p><strong>Phi-4-multimodal is Microsoft's first unified-architecture multimodal language model, integrating speech, vision, and text processing.<\/strong> The model has 5.6 billion parameters. 
In several benchmarks, Phi-4-multimodal outperforms other state-of-the-art omni-modal models, such as Google's Gemini 2.0 Flash and Gemini 2.0 Flash Lite.<\/p>\n<p>In speech-related tasks, Phi-4-multimodal outperforms specialized speech models such as WhisperV3 and SeamlessM4T-v2-Large in automatic speech recognition (ASR) and speech translation (ST). The model topped the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, Microsoft said.<\/p>\n<p>In vision-related tasks, Phi-4-multimodal excels in mathematical and scientific reasoning. The model matches or even surpasses popular models such as Gemini-2-Flash-lite-preview and Claude-3.5-Sonnet in common multimodal capabilities such as document understanding, diagram understanding, optical character recognition (OCR), and visual scientific reasoning.<\/p>\n<p>1AI notes that <strong>Phi-4-mini, by contrast, focuses on text tasks.<\/strong> It has 3.8 billion parameters and outperforms several popular large language models in tasks such as text reasoning, mathematics, programming, instruction following, and function calling.<\/p>\n<p>To ensure the safety and reliability of the new models, Microsoft invited internal and external security experts to test them, following strategies developed by the Microsoft AI Red Team (AIRT). 
After further optimization, both Phi-4-mini and Phi-4-multimodal can be deployed on-device via ONNX Runtime, enabling cross-platform use in low-cost, low-latency scenarios.<\/p>\n<p>Phi-4-multimodal and Phi-4-mini are now available to developers in Azure AI Foundry, Hugging Face, and the NVIDIA API Catalog.<\/p>\n<p>The new Phi-4 models mark a significant advance in efficient AI, bringing powerful multimodal and text processing capabilities to all types of AI applications.<\/p>","protected":false},"excerpt":{"rendered":"<p>February 27, 2025 - Microsoft released Phi-4 in December 2024, a small language model (SLM) that excels in its class. Today, Microsoft extends the Phi-4 family with two new models: Phi-4-multimodal and Phi-4-mini. Phi-4-multimodal is Microsoft's first unified-architecture multimodal language model, integrating speech, vision, and text processing with 5.6 billion parameters. In multiple benchmarks, Phi-4-multimodal outperforms other state-of-the-art omni-modal models, such as Google's Gemini 2.0 Flash and 
Gemi<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[5823,592,280],"collection":[],"class_list":["post-29699","post","type-post","status-publish","format-standard","hentry","category-news","tag-phi-4","tag-592","tag-280"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/29699","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=29699"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/29699\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=29699"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=29699"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=29699"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=29699"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}