{"id":14058,"date":"2024-06-26T08:54:58","date_gmt":"2024-06-26T00:54:58","guid":{"rendered":"https:\/\/www.1ai.net\/?p=14058"},"modified":"2024-06-26T08:54:58","modified_gmt":"2024-06-26T00:54:58","slug":"%e8%8b%b9%e6%9e%9c%e6%8e%a8%e5%87%ba%e5%85%a8%e8%83%bd%e8%a7%86%e8%a7%89%e6%a8%a1%e5%9e%8b4m-21-%e5%8f%af%e5%a4%84%e7%90%8621%e7%a7%8d%e4%b8%8d%e5%90%8c%e6%a8%a1%e6%80%81","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/14058.html","title":{"rendered":"Apple launches all-around visual model 4M-21 that can handle 21 different modalities"},"content":{"rendered":"<p><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e8%8b%b9%e6%9e%9c\" title=\"[View articles tagged with [apple]]\" target=\"_blank\" >apple<\/a>and researchers at the \u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL) in Switzerland have jointly developed a single model for any-to-any modality that can be trained on dozens of highly diverse modalities and co-trained on large-scale multimodal datasets and text corpora. The model, named 4M-21, is trained on 21 different modalities and accomplishes at least three times more than existing models without loss of performance.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-14059\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/6385493500772483727326353.png\" alt=\"\" width=\"686\" height=\"276\" \/><\/p>\n<p>The study uses a 4M pre-training programme that enhances the performance and adaptability of models by scaling up models and data sets, increasing the types and quantities of models involved in training models and conducting joint training on multiple data sets. Researchers use different tokenization methods to disperse modulates with different characteristics, such as global image embedding, human gestures and semantic examples. In terms of structure selection, the study uses a 4M encoder-decoder structure based on Transformer and adds an additional modulation embedded to adapt to the new modulation\u3002<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-14060\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/6385493492127369604662061.png\" alt=\"\" width=\"686\" height=\"359\" \/><\/p>\n<p>The model not only performs a range of common vision tasks out-of-the-box, such as DIODE surface normal and depth estimation, COCO semantic and instance segmentation, and 3DPW3D human pose estimation, but is also capable of generating arbitrary training modalities, supports several methods to perform fine-grained and multimodal generation, and can retrieve RGB images or other modalities by using other modalities as queries. 
In addition, the researchers conducted multimodal transfer experiments on NYUv2, Hypersim semantic segmentation, and ARKitScenes.

Key features include:

- Any-to-any modality: increases the number of modalities from the previous best of 7 for any-to-any models to 21, enabling cross-modal retrieval, controllable generation, and strong out-of-the-box performance.
- Versatility: adds support for more structured data such as human poses, SAM instances, and metadata.
- Tokenization: investigates discrete tokenization for different modalities, such as global image embeddings, human poses, and semantic instances, using modality-specific approaches.
- Scaling: extends the model size to 3B parameters and the dataset to 0.5B samples.
- Co-training: trains jointly on vision and language data.

Paper: https://arxiv.org/pdf/2406.09406
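The cross-modal retrieval capability listed above can be understood as nearest-neighbour search over global embeddings: embed a query from any modality, then rank a gallery of RGB-image embeddings by similarity. A hypothetical sketch, with random vectors standing in for real model outputs:

```python
# Illustrative only: random vectors stand in for real global embeddings.
import numpy as np

rng = np.random.default_rng(0)
rgb_bank = rng.normal(size=(1000, 512))  # hypothetical gallery of RGB-image embeddings
query = rng.normal(size=(512,))          # embedding of a query from another modality
                                         # (e.g. a caption or a depth map)

# Cosine similarity against the gallery, then take the top matches.
bank_norm = rgb_bank / np.linalg.norm(rgb_bank, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)
scores = bank_norm @ query_norm
top5 = np.argsort(scores)[::-1][:5]
print("closest RGB images:", top5)
```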