{"id":7479,"date":"2024-04-09T10:03:16","date_gmt":"2024-04-09T02:03:16","guid":{"rendered":"https:\/\/www.1ai.net\/?p=7479"},"modified":"2024-04-09T10:03:16","modified_gmt":"2024-04-09T02:03:16","slug":"openai%e8%ae%a1%e5%88%92%e5%bb%ba%e7%ab%8b%e6%95%b0%e6%8d%ae%e5%b8%82%e5%9c%ba%ef%bc%8c%e8%ae%ad%e5%87%bagpt-5%e7%9f%ad%e7%bc%ba20%e4%b8%87%e4%ba%bf-token","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/7479.html","title":{"rendered":"OpenAI plans to establish a data marketplace; training GPT-5 is 20 trillion tokens short"},"content":{"rendered":"<p><span class=\"spamTxt\">Network-wide<\/span> shortage of high-quality datasets! <a href=\"https:\/\/www.1ai.net\/en\/tag\/ai%e5%85%ac%e5%8f%b8\" title=\"[View articles tagged with [AI companies]]\" target=\"_blank\" >AI companies<\/a> such as <a href=\"https:\/\/www.1ai.net\/en\/tag\/openai\" title=\"[View articles tagged with [OpenAI]]\" target=\"_blank\" >OpenAI<\/a> and Anthropic are reportedly struggling to find enough data to train the next generation of AI models. This worsening data shortage is a critical obstacle to training the next generation of powerful models. 
Faced with this challenge, AI startups and internet giants are beginning to look for new ways around the compute and data bottleneck.<\/p>\n<p class=\"article-content__img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-7480\" title=\"202308110956182262_0\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/04\/202308110956182262_0.jpg\" alt=\"202308110956182262_0\" width=\"1000\" height=\"666\" \/><\/p>\n<p>Source note: The image was generated by AI and is licensed from Midjourney<\/p>\n<p>Reportedly, the development of powerful systems such as <a href=\"https:\/\/www.1ai.net\/en\/tag\/gpt-5\" title=\"[View articles tagged with [GPT-5]]\" target=\"_blank\" >GPT-5<\/a> requires massive amounts of data as training material, yet high-quality public data has become scarce on the internet.<\/p>\n<p>Pablo Villalobos, a researcher at the research institute Epoch, estimates that GPT-4 was trained on as many as 12 trillion <a href=\"https:\/\/www.1ai.net\/en\/tag\/token\" title=\"[View articles tagged with [token]]\" target=\"_blank\" >tokens<\/a>. He added that, by the principles of the Chinchilla scaling law, an AI system like GPT-5 would require 60 to 100 trillion tokens of data if it continued along this scaling trajectory. In other words, even after exhausting all available high-quality language and image data, training GPT-5 would still fall 20 trillion tokens short.<\/p>\n<p>Some data owners, such as Reddit, have also instituted policies restricting AI companies' access to their data, exacerbating the shortage. To cope, some companies are trying to train models on synthetic data, but this can run into problems such as 'model autophagy disorder'.<\/p>\n<p>AI researchers and companies are looking for solutions to the problem of data scarcity. 
Ari Morcos, founder of DatologyAI, notes that data shortage is a cutting-edge research problem; his company is working to improve data-selection tools that reduce the cost of training AI models. In addition, OpenAI has discussed creating a 'data marketplace' that could help ease the shortage by determining how much individual data points contribute to model training.<\/p>\n<p>Data shortages pose a major challenge to AI development, and companies are exploring different ways to address the problem. From synthetic data to data marketplaces, the AI field keeps searching for breakthroughs to secure the data resources needed to train the next generation of powerful AI models.<\/p>","protected":false},"excerpt":{"rendered":"<p>High-quality datasets across the web are in short supply! AI companies such as OpenAI, Anthropic and others are reportedly struggling to find enough data to train the next generation of AI models. The data shortage is a growing issue that is critical for training the next generation of powerful models. Faced with this challenge, AI startups and internet giants are beginning to look for new ways around the compute and data bottleneck. Source note: Image generated by AI, licensed from Midjourney. Reportedly, the development of powerful systems such as GPT-5 requires massive amounts of data as training material; however, high-quality public data has become scarce on the internet. 
Pablo Villalobos, a researcher at the research institute Epoch, estimates that GPT-4 was developed on as many as 12 trillion t<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[155,719,190,1270],"collection":[],"class_list":["post-7479","post","type-post","status-publish","format-standard","hentry","category-news","tag-ai","tag-gpt-5","tag-openai","tag-token"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/7479","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=7479"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/7479\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=7479"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=7479"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=7479"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=7479"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}