Recently, the Wall Street Journal reported that artificial intelligence companies are struggling to collect high-quality training data. Now, the New York Times has detailed how some companies are dealing with that problem, and their approaches sit squarely in the murky gray area of AI copyright law.
The story begins with OpenAI. The company, desperate for training data, reportedly developed its Whisper audio transcription model and used it to transcribe more than 1 million hours of YouTube videos in order to train GPT-4, its most advanced large language model. The New York Times reported that OpenAI knew this was legally questionable but believed it qualified as fair use. OpenAI President Greg Brockman was personally involved in collecting the videos used.

OpenAI spokesperson Lindsay Held told The Verge that the company curates "unique" datasets for each model and uses "numerous sources, including both public and non-public data partners." Held also said the company is considering generating its own synthetic data.
Google also collected transcripts from YouTube, according to a New York Times source. Matt Bryant, a Google spokesperson, said the company "trained models on some YouTube content under our agreement with YouTube creators."
Meta has similarly run up against the limited supply of good training data. In its effort to catch up to OpenAI, the company reportedly discussed using copyrighted works without permission, and also weighed alternatives such as paying for book licenses or acquiring a large publisher outright.
These companies are all grappling with a rapidly shrinking supply of training data. The Wall Street Journal wrote this week that companies' demand for data could outpace the production of new content by 2028. Proposed solutions include training on "synthetic" data generated by the models themselves, or using "curriculum learning" methods, in which models are fed higher-quality data in a deliberate order. But another option is simply to use whatever data they can find, with or without permission, which runs headlong into copyright law.