{"id":18437,"date":"2024-08-22T09:17:21","date_gmt":"2024-08-22T01:17:21","guid":{"rendered":"https:\/\/www.1ai.net\/?p=18437"},"modified":"2024-08-22T09:17:21","modified_gmt":"2024-08-22T01:17:21","slug":"meta-%e9%83%a8%e7%bd%b2%e6%96%b0%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab%e6%9c%ba%e5%99%a8%e4%ba%ba%ef%bc%8c%e4%b8%ba%e5%85%b6-ai-%e6%a8%a1%e5%9e%8b%e6%94%b6%e9%9b%86%e5%a4%a7%e9%87%8f%e6%95%b0%e6%8d%ae","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/18437.html","title":{"rendered":"Meta deploys new web crawler bot to collect massive amounts of data for its AI models"},"content":{"rendered":"<p>Recently, <a href=\"https:\/\/www.1ai.net\/en\/tag\/meta\" title=\"[View articles tagged with [Meta]]\" target=\"_blank\" >Meta<\/a> quietly released a new <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab\" title=\"[Sees articles with tags]\" target=\"_blank\" >web crawler<\/a> that scours the internet and collects large amounts of data to power its <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e4%ba%ba%e5%b7%a5%e6%99%ba%e8%83%bd%e6%a8%a1%e5%9e%8b\" title=\"_Other Organiser\" target=\"_blank\" >artificial intelligence models<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-18438\" title=\"4cbb119aj00sili7y000bd000ms00cum\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/08\/4cbb119aj00sili7y000bd000ms00cum.jpg\" alt=\"4cbb119aj00sili7y000bd000ms00cum\" width=\"820\" height=\"462\" \/><\/p>\n<p>According to three companies that track web scrapers, <strong>Meta&#039;s new web <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e7%88%ac%e8%99%ab%e6%9c%ba%e5%99%a8%e4%ba%ba\" title=\"[Sees articles with labels]\" target=\"_blank\" >crawler bot<\/a>, Meta External Agent, was launched last month. 
It is similar to OpenAI&#039;s GPTBot and crawls the internet for AI training data<\/strong>, such as the text of news articles or conversations in online discussion groups.<\/p>\n<p>Meta did update a company website for developers in late July with a tab indicating the existence of the new crawler, according to usage profile history, but Meta has yet to publicly announce its new crawler.<\/p>\n<p>Meta\u2019s Llama is one of the largest LLMs, and while the company did not disclose the training data used for the latest version of its model, Llama 3, <strong>the initial version of the model used large datasets collected from other sources such as Common Crawl.<\/strong><\/p>\n<p>Earlier this year, Meta co-founder and CEO Mark Zuckerberg boasted on an earnings call that the company\u2019s social platform had amassed a dataset for AI training that was \u201cbigger than even Common Crawl.\u201d<\/p>\n<p><strong>The existence of the new crawler suggests that Meta&#039;s massive database may no longer be sufficient.<\/strong> As the company continues to update Llama and expand Meta AI, it needs new, high-quality training data to continually improve their capabilities.<\/p>\n<p>Data from Dark Visitors shows that nearly 25% of the world&#039;s most popular websites now block GPTBot, but only 2% of them block Meta&#039;s new crawler bot.<\/p>","protected":false},"excerpt":{"rendered":"<p>Recently, Meta quietly released a new web crawler for scouring the internet and collecting large amounts of data to power its AI models. Meta's new web crawler bot, Meta External Agent, which launched last month, is similar to OpenAI's GPTBot in that it crawls the web for AI training data, such as text in news articles or conversations in online discussion groups, according to three companies tracking web crawlers. 
Meta did update a company website for developers at the end of July with a tab indicating the presence of the new crawler, according to a history of usage profiles, but Meta has yet to publicly announce its new crawler bot. Meta's Llam<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[297,599,4097,3328],"collection":[],"class_list":["post-18437","post","type-post","status-publish","format-standard","hentry","category-news","tag-meta","tag-599","tag-4097","tag-3328"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/18437","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=18437"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/18437\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=18437"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=18437"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=18437"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=18437"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}