August 5 news. The PRC Ministry of State Security's WeChat public account today posted an article saying that AI has now been deeply integrated into all aspects of economic and social development. While profoundly changing the way people produce and live, it has also become a key arena for high-quality development and high-level security. However, AI training data is of mixed quality, containing false information, fabricated content, and biased opinions, resulting in data source contamination that poses new challenges to AI safety.

According to the article, the three core elements of artificial intelligence are algorithms, computing power, and data, of which data is the basic element for training AI models and the core resource for AI applications.
- Data provides the raw material for AI models. Massive amounts of data supply sufficient training material, enabling models to learn the intrinsic laws and patterns in the data and to perform semantic understanding, intelligent decision-making, and content generation. Data also drives AI to continuously optimize performance and accuracy and to iterate model upgrades that adapt to new needs.
- AI models place high demands on data quantity, quality, and diversity. Sufficient data volume is a prerequisite for training large-scale models; accurate, complete, and consistent data effectively avoids misleading the model; and diverse data covering multiple domains enhances the model's ability to handle complex real-world scenarios.
- Data promotes the application of AI models. Increasingly abundant data resources have accelerated the implementation of the "AI+" initiative and strongly promoted the deep integration of AI with various economic and social fields. This not only cultivates and develops new productive forces, but also promotes the leapfrog development of science and technology in China, the optimization and upgrading of industries, and an overall leap in productivity.
According to the article, high-quality data can significantly improve a model's accuracy and reliability, but once the data is contaminated, the model may make erroneous decisions or the AI system may even fail, creating security risks.
- Harmful content. Contaminated data generated through "data poisoning" behaviors such as alteration, fabrication, and duplication interferes with the model's parameter adjustment during training, weakening its performance, reducing its accuracy, and even inducing harmful output. Research cited in the article shows that when only 0.01% of the training dataset is false text, the model's harmful output increases by 11.2%; even at 0.001% false text, harmful output still rises by 7.2%.
- Recursive contamination. False content generated by data-contaminated AI may in turn become the data source for subsequent model training, forming a continuous "pollution legacy effect". The volume of AI-generated content on the Internet now far exceeds that of genuine human-produced content, and a large amount of low-quality, non-objective data is mixed in with it, causing misinformation to accumulate in AI training datasets generation after generation and ultimately distorting the model's own cognition (a toy illustration of this feedback loop appears after this list).
- Real-world risks. Data contamination may also trigger a series of real-world risks, especially in financial markets, public security, and healthcare. In finance, bad actors using AI to concoct false information, thereby contaminating data, may cause abnormal stock-price fluctuations, constituting a new type of market-manipulation risk. In public security, data contamination can disturb public cognition, mislead public opinion, and induce social panic. In healthcare, data contamination may cause models to generate erroneous diagnosis and treatment recommendations, which not only endangers patients' lives but also exacerbates the spread of pseudoscience.
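To make the "pollution legacy effect" concrete, the toy simulation below (not from the article; the starting rate, the amplification factor, and the synthetic-content share are all assumptions) tracks how the contaminated fraction of a training corpus can grow once model-generated text is recycled into the next generation's training data.

```python
# Toy sketch of recursive contamination: a corpus starts with a tiny contaminated
# fraction, the model reproduces contamination at an assumed amplified rate, and
# its synthetic output makes up part of the next generation's training corpus.

def next_generation_contamination(corpus_rate: float,
                                  amplification: float = 1.5,
                                  synthetic_share: float = 0.6) -> float:
    """Return the contaminated fraction of the next generation's training corpus.

    corpus_rate     -- contaminated fraction of the current corpus
    amplification   -- assumed factor by which the model over-reproduces contamination
    synthetic_share -- assumed share of the next corpus that is model-generated
    """
    model_output_rate = min(1.0, corpus_rate * amplification)
    human_rate = corpus_rate  # assume the human-written share keeps its current rate
    return synthetic_share * model_output_rate + (1.0 - synthetic_share) * human_rate


if __name__ == "__main__":
    rate = 0.0001  # start at 0.01% contaminated documents
    for generation in range(1, 11):
        rate = next_generation_contamination(rate)
        print(f"generation {generation}: contaminated fraction = {rate:.6f}")
```

Under these assumed parameters the contaminated fraction compounds by roughly 30% per generation; more conservative assumptions flatten the curve, but the direction of the feedback loop stays the same.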
1AI notes that the article concludes with the following response measures:
- Strengthen supervision at the source to prevent contamination from arising. Establish an AI data classification and grading protection system on the basis of the Cybersecurity Law, the Data Security Law, the Personal Information Protection Law, and other laws and regulations, preventing contaminated data from being generated at the root and helping to effectively guard against AI data security threats.
- Strengthen risk assessment and safeguard data circulation. Conduct an overall assessment of AI data security risks, and ensure data security across the whole life cycle of collection, storage, transmission, use, exchange, and backup. At the same time, accelerate construction of an AI security risk classification and management system and continuously improve comprehensive data security capabilities.
- Clean and remediate end to end, and build a governance framework. Regularly clean and restore contaminated data in accordance with regulations and standards, and formulate specific data-cleaning rules based on relevant laws, regulations, and industry standards. Gradually build a modular, monitorable, and extensible data governance framework to achieve continuous management and quality control (a minimal cleaning-pipeline sketch follows).
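As one purely illustrative reading of what "specific rules for data cleaning" and a modular, monitorable, extensible framework might look like in practice, the sketch below chains a few independent filter rules over training records. The record fields, the block list, and every threshold are hypothetical; the article itself does not prescribe any particular implementation.

```python
# Minimal sketch of a modular, rule-based cleaning pass over training records.
# Record fields ("text", "source"), the block list, and thresholds are hypothetical.

from typing import Callable, Iterable

Record = dict  # e.g. {"text": "...", "source": "example.com"}

BLOCKED_SOURCES = {"known-spam-site.example"}  # hypothetical block list


def not_blocked(record: Record) -> bool:
    """Drop records whose source is on the block list."""
    return record.get("source") not in BLOCKED_SOURCES


def long_enough(record: Record, min_chars: int = 20) -> bool:
    """Drop fragments too short to be useful training text."""
    return len(record.get("text", "")) >= min_chars


def clean(records: Iterable[Record],
          rules: list[Callable[[Record], bool]]) -> list[Record]:
    """Keep records that pass every rule, removing exact duplicates along the way."""
    seen_texts = set()
    kept = []
    for record in records:
        text = record.get("text", "")
        if text in seen_texts:
            continue  # exact-duplicate removal
        if all(rule(record) for rule in rules):
            seen_texts.add(text)
            kept.append(record)
    return kept


if __name__ == "__main__":
    corpus = [
        {"text": "A well-formed training document about AI data security.", "source": "news.example"},
        {"text": "A well-formed training document about AI data security.", "source": "mirror.example"},
        {"text": "spam", "source": "known-spam-site.example"},
    ]
    print(clean(corpus, rules=[not_blocked, long_enough]))  # keeps only the first record
```

Because each rule is an independent callable, new checks (for example, a detector for machine-generated text) can be added or monitored without touching the rest of the pipeline, which is one way such a framework can remain modular and extensible.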