{"id":34044,"date":"2025-04-25T14:50:17","date_gmt":"2025-04-25T06:50:17","guid":{"rendered":"https:\/\/www.1ai.net\/?p=34044"},"modified":"2025-04-25T14:50:17","modified_gmt":"2025-04-25T06:50:17","slug":"meta-%e6%8e%a8-webssl-%e6%a8%a1%e5%9e%8b%ef%bc%9a%e6%8e%a2%e7%b4%a2-ai-%e6%97%a0%e8%af%ad%e8%a8%80%e8%a7%86%e8%a7%89%e5%ad%a6%e4%b9%a0%ef%bc%8c%e7%ba%af%e5%9b%be%e8%ae%ad%e7%bb%83%e5%aa%b2%e7%be%8e-op","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/34044.html","title":{"rendered":"Meta Launches WebSSL Models: Exploring AI Languageless Visual Learning, Pure Graph Training Comparable to OpenAI CLIP"},"content":{"rendered":"<p>April 25, 2011 - Technology media outlet marktechpost published a blog post yesterday (April 24) reporting that <a href=\"https:\/\/www.1ai.net\/en\/tag\/meta\" title=\"[View articles tagged with [Meta]]\" target=\"_blank\" >Meta<\/a> Company Releases <a href=\"https:\/\/www.1ai.net\/en\/tag\/webssl\" title=\"[See article with [WebSSL] label]\" target=\"_blank\" >WebSSL<\/a> series of models with parameter sizes ranging from 300 million to 7 billion, trained on pure image data.<strong>aims to explore the potential of visual self-supervised learning (SSL) without verbal supervision.<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-34045\" title=\"6f736d4cj00sv9hmc0030d000sg00l8p\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/04\/6f736d4cj00sv9hmc0030d000sg00l8p.jpg\" alt=\"6f736d4cj00sv9hmc0030d000sg00l8p\" width=\"1024\" height=\"764\" \/><\/p>\n<p>Open<a href=\"https:\/\/www.1ai.net\/en\/tag\/ai\" title=\"[View articles tagged with [AI]]\" target=\"_blank\" >AI<\/a> represented by CLIP, the contrastive language-image model has become the default choice for learning visual representations, with outstanding performance in multimodal tasks such as visual question and answer (VQA) and document understanding. However, limited by the complexity of dataset acquisition and data size, language dependency faces many challenges.<\/p>\n<p>Meta has addressed these pain points by releasing a family of WebSSL models on the Hugging Face platform, covering DINO and Vision Transformer (ViT) architectures, with parameter sizes ranging from 300 million to 7 billion.<\/p>\n<p>These models were trained using only a subset of 2 billion images from the MetaCLIP dataset (MC-2B), excluding the influence of linguistic supervision.Meta's goal is not to replace CLIP, but rather to provide an in-depth evaluation of the performance potential of purely visual Self-Supervised Learning (SSL) without the constraints of data and model sizes by controlling for variables.<\/p>\n<p>The WebSSL model uses two visual self-supervised learning paradigms: joint embedding learning (DINOv2) and mask modeling (MAE). Training is performed uniformly using 224\u00d7224 resolution images, and the visual coder is frozen to ensure that differences in results stem only from the pre-training strategy.<\/p>\n<p>The model is trained on five capacity tiers (ViT-1B to ViT-7B) and evaluated based on the Cambrian-1 benchmark, covering 16 VQA tasks such as general visual understanding, knowledge reasoning, OCR, and diagram interpretation. 
The models are trained across five capacity tiers (ViT-1B to ViT-7B) and evaluated on the Cambrian-1 benchmark, which covers 16 VQA tasks spanning general visual understanding, knowledge reasoning, OCR, and chart interpretation. In addition, the models are integrated into Hugging Face's transformers library, making them easy to pick up for research and applications.
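As a rough illustration of how one of these checkpoints can be pulled from the Hub with transformers, the sketch below loads a WebSSL encoder and extracts image features. The checkpoint id is an assumption for illustration; the exact model names are listed in the facebook web-ssl collection on Hugging Face.

```python
# Hedged sketch: load a WebSSL vision encoder from the Hugging Face Hub and extract
# image features. The checkpoint id is an assumed example; check the facebook
# web-ssl collection for the names of the actually released models.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/webssl-dino1b-full2b-224"   # assumed id, verify on the Hub
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg").convert("RGB") # any local RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level features; mean-pool over tokens for a single image embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```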
The experiments reveal several key findings. WebSSL's performance on VQA tasks improves in a near log-linear fashion as parameter count grows, whereas CLIP's performance saturates beyond 3 billion parameters.

WebSSL outperforms CLIP on the OCR and chart tasks, especially after data filtering: training on only the 1.3% of images that are text-rich boosts results on OCRBench and ChartQA by up to 13.6%.

In addition, fine-tuning at a higher resolution (518px) further narrows the gap with high-resolution models such as SigLIP, with particularly strong results on document tasks.

Even without language supervision, the WebSSL models show good alignment with pre-trained language models (e.g., LLaMA-3), indicating that large-scale visual models can implicitly learn features related to textual semantics.

At the same time, WebSSL maintains strong performance on traditional benchmarks (e.g., ImageNet-1k classification and ADE20K segmentation), and even outperforms MetaCLIP and DINOv2 in some scenarios.

1AI attaches the reference addresses:

- Scaling Language-Free Visual Representation Learning: https://arxiv.org/abs/2504.01017
- Hugging Face: https://huggingface.co/collections/facebook/web-ssl-68094132c15fbd7808d1e9bb
- GitHub: https://github.com/facebookresearch/webssl