{"id":7314,"date":"2024-04-07T09:43:33","date_gmt":"2024-04-07T01:43:33","guid":{"rendered":"https:\/\/www.1ai.net\/?p=7314"},"modified":"2024-04-07T09:43:33","modified_gmt":"2024-04-07T01:43:33","slug":"%e6%8a%a5%e5%91%8a%e7%a7%b0-openai-%e9%87%87%e9%9b%86%e4%ba%86%e8%b6%85%e4%b8%80%e7%99%be%e4%b8%87%e5%b0%8f%e6%97%b6%e7%9a%84-youtube-%e8%a7%86%e9%a2%91%e6%9d%a5%e8%ae%ad%e7%bb%83-gpt-4","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/7314.html","title":{"rendered":"Report: OpenAI collected over 1 million hours of YouTube videos to train GPT-4"},"content":{"rendered":"<p>Recently, the Wall Street Journal reported that artificial intelligence companies are having difficulty collecting high-quality training data. The New York Times then detailed how some companies are dealing with this problem, which touches on the murky gray area of AI copyright law.<\/p>\n<p>The story begins with <a href=\"https:\/\/www.1ai.net\/en\/tag\/openai\" title=\"[View articles tagged with [OpenAI]]\" target=\"_blank\" >OpenAI<\/a>. The company, desperate for training data, reportedly developed the Whisper audio transcription model to transcribe more than 1 million hours of <a href=\"https:\/\/www.1ai.net\/en\/tag\/youtube\" title=\"_Other Organiser\" target=\"_blank\" >YouTube<\/a> videos, which it used to train its <span class=\"spamTxt\">most<\/span> advanced large language model, <a href=\"https:\/\/www.1ai.net\/en\/tag\/gpt-4\" title=\"[SEE ARTICLES WITH [GPT-4] LABELS]\" target=\"_blank\" >GPT-4<\/a>. The New York Times reported that OpenAI knew this was legally problematic but believed it was fair use.
OpenAI President Greg Brockman was personally involved in collecting the videos used.<\/p>\n<p class=\"article-content__img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-7315\" title=\"201811151614001643_47\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/04\/201811151614001643_47.jpg\" alt=\"201811151614001643_47\" width=\"600\" height=\"338\" \/><\/p>\n<p>OpenAI spokesperson Lindsay Held told The Verge that the company had designed \u201cunique\u201d data sets for each model and used \u201cmany sources, including open and non-public data partners\u201d. Held also indicated that the company was considering producing its own synthetic data.<\/p>\n<p>According to sources cited by The New York Times, Google also collected transcripts from YouTube. Google spokesman Matt Bryant stated that the company \u201ctrained models on some YouTube content in accordance with our agreement with YouTube creators\u201d.<\/p>\n<p>Meta has similarly run up against the limited availability of good training data, and in its efforts to catch up to OpenAI, the company has considered using copyrighted works without permission, as well as paying for book licensing rights or even acquiring a large publisher outright.<\/p>\n<p>These companies are scrambling to address the shrinking supply of training data. The Wall Street Journal wrote this week that AI companies may outpace the supply of new content by 2028. Possible solutions include training on \u201csynthetic\u201d data created by the models themselves or using \u201ccurriculum learning\u201d methods. Another option for these companies, however, is to use whatever they can find, whether or not they have a license, which may raise copyright concerns.<\/p>","protected":false},"excerpt":{"rendered":"<p>Recently, the Wall Street Journal reported that artificial intelligence companies are having trouble collecting high-quality training data. 
The New York Times then detailed some of the ways companies are dealing with this issue, which touches on the murky gray area of AI copyright law. The story begins with OpenAI. The company desperately needed training data and reportedly developed the Whisper audio transcription model, transcribing more than a million hours of YouTube videos to train its state-of-the-art large-scale language model, GPT-4. The New York Times reported that OpenAI was aware that this was legally problematic but believed that it was fair use, and that OpenAI president Greg Brockman had personally participated in the collection of the videos used. OpenAI spokesperson Lindsay Held told T<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[510,190,423],"collection":[],"class_list":["post-7314","post","type-post","status-publish","format-standard","hentry","category-news","tag-gpt-4","tag-openai","tag-youtube"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/7314","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=7314"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/7314\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=7314"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=7314"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=7314"},{"taxonomy":"collection","embeddable":tru
e,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=7314"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}