AI industry faces 'data wall' challenge: high-quality training data may run out by 2028

Recently, the shortage of training data for large AI models has once again drawn media attention. A recent article in The Economist, "AI companies will soon use up most of the Internet's data," sparked widespread discussion in the industry. The article points out that as high-quality data on the Internet is depleted, the AI field is running into a "data wall".

Research firm Epoch AI predicts that all high-quality text data on the Internet will be exhausted by 2028, and that machine learning datasets could run out of "high-quality language data" by 2026. This "data wall" phenomenon has become a major problem for AI companies, potentially slowing their training progress.

The industry has been warning about this issue for some time. In July 2023, UC Berkeley professor Stuart Russell warned that AI-powered bots such as ChatGPT could soon "exhaust the universe of text". Not everyone agrees, however: in May 2024, Stanford professor Fei-Fei Li said that a huge amount of differentiated data is still waiting to be mined to build more customized models.

Synthetic data has emerged as a potential solution to the data shortage. However, a recent paper published in Nature suggests that using AI-generated datasets to train future generations of machine-learning models could lead to "model collapse", in which the models come to misinterpret reality. The research team recommends keeping some of the original data in the training set, using diverse data sources, and investigating more robust training algorithms.

How to break through the limitations of the "data wall" and ensure the continuous supply of high-quality training data has become an urgent issue for the AI industry. This requires not only technological innovation, but also the joint efforts of governments, enterprises and research institutions. As AI technology is increasingly integrated into various industries, solving the data shortage problem will have a profound impact on the sustained and healthy development of AI.
