Recently, the Wall Street Journal reported that artificial intelligence companies are struggling to collect high-quality training data. Now, the New York Times has detailed how some companies are dealing with that problem, and their approaches sit squarely in the murky gray area of AI copyright law.
The story begins with OpenAI. The company, desperate for training data, reportedly developed its Whisper audio transcription model and used it to transcribe more than 1 million hours of YouTube videos in order to train GPT-4, its most advanced large language model. The New York Times reported that OpenAI knew this was legally questionable but believed it qualified as fair use. OpenAI President Greg Brockman was personally involved in collecting the videos used.

OpenAI spokesperson Lindsay Held told The Verge that the company curates "unique" datasets for each model and uses "numerous sources, including both public and non-public data partners." Held also said the company is considering generating its own synthetic data.
Google also collected transcripts from YouTube, according to a New York Times source. Matt Bryant, a Google spokesperson, said the company "trained models on some YouTube content under our agreement with YouTube creators."
Meta has similarly run up against the limited supply of good training data. In its effort to catch up to OpenAI, the company reportedly discussed using copyrighted works without permission, and also weighed alternatives such as paying for book licenses or acquiring a large publisher outright.
These companies are all grappling with a rapidly shrinking supply of training data. The Wall Street Journal wrote this week that companies' demand for data could outpace the production of new content by 2028. Proposed solutions include training on "synthetic" data generated by the models themselves, or using "curriculum learning" methods, in which models are fed higher-quality data in a deliberate order. But another option is simply to use whatever data they can find, with or without permission, which runs headlong into copyright law.