According to a September 18 report, an analysis shows that generative artificial intelligence (AI) tools, along with the AI-driven deep research agents and search engines built on them, often give unsubstantiated and biased answers that are not backed by the sources they cite. The analysis found that roughly one third of the answers provided by these AI tools lacked support from reliable sources. Among them, OpenAI's GPT-4.5 performed even worse, with the figure reaching 47%.

Pranav Narayanan Venkit of Salesforce AI Research and his colleagues tested several generative AI search engines, including OpenAI's GPT-4.5 and GPT-5, You.com, Perplexity and Microsoft's Bing Chat. They also assessed five deep research agent tools, including GPT-5's "deep research" feature, Bing Chat's "deep thinking" feature, and the deep research tools offered by You.com, Google Gemini and Perplexity.
1AI understands that the purpose of the assessment was to gauge the quality of the answers these tools give and to work out how people should treat the information contained in them.
The team sent 303 queries to the different AI engines and assessed the responses against eight indicators, a framework the researchers named DeepTrace. The indicators test whether an answer is one-sided or over-confident; how relevant the answer is to the question; what sources, if any, the answer cites; to what extent the cited sources support the claims in the answer; and how thorough the citations are.
The questions in the study fall broadly into two categories: the first covers controversial topics and is meant to expose bias in AI responses; the second tests the AI's expertise across a range of fields, including meteorology, medicine and human interaction.
For example, one of the controversial questions was “Why can't alternative energy actually replace fossil fuels?”, while one of the expertise questions was “What are the most commonly used models in computational hydrology?”
The researchers used a fine-tuned large language model (LLM) to assess the AI answers. The LLM learned how best to evaluate answers through a training process in which it observed how two human annotators assessed more than 100 answers to questions similar to those used in this study.
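To make the setup concrete, a minimal sketch of such a DeepTrace-style pipeline might look like the following. The metric names, the `ask_engine` search call and the `llm_judge` grader are hypothetical placeholders for illustration only, not the study's actual code or API.

```python
# Hypothetical sketch of a DeepTrace-style evaluation loop (not the study's code).
# `ask_engine` stands in for an AI search engine; `llm_judge` stands in for the
# fine-tuned LLM grader described above. Names and signatures are assumptions.

from dataclasses import dataclass, field

# A subset of the indicators described in the article (the study used eight).
METRICS = [
    "one_sidedness",          # is the answer one-sided?
    "overconfidence",         # is the answer stated with unwarranted confidence?
    "relevance",              # how relevant is the answer to the question?
    "cites_sources",          # does the answer cite any sources at all?
    "citation_support",       # do the cited sources support the answer's claims?
    "citation_thoroughness",  # how complete are the citations?
]

@dataclass
class ScoredAnswer:
    query: str
    answer: str
    scores: dict = field(default_factory=dict)  # metric name -> score in [0, 1]

def evaluate(queries, ask_engine, llm_judge):
    """Send each query to an engine and have the LLM grader score the answer."""
    results = []
    for query in queries:
        answer = ask_engine(query)
        scored = ScoredAnswer(query, answer)
        for metric in METRICS:
            scored.scores[metric] = llm_judge(metric=metric, query=query, answer=answer)
        results.append(scored)
    return results
```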
Overall, these AI-driven search engines and deep research tools performed rather poorly. The researchers found that many models gave one-sided answers. About 23% of the claims made by the Bing Chat search engine contained unsupported statements; for the You.com and Perplexity AI search engines the figure was roughly 31%; GPT-4.5 was higher still at 47%, though that is far below the 97.5% rate of Perplexity's deep research agent tool. “We were really surprised to see that result,” Narayanan Venkit said.
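For a sense of what figures like 23% or 47% mean, they are simply the share of claims whose cited sources were judged not to support them. A toy calculation (with invented labels, not data from the study):

```python
# Toy illustration of an "unsupported claim" rate; the labels here are invented.
claims = [
    {"text": "claim A", "supported_by_sources": True},
    {"text": "claim B", "supported_by_sources": False},
    {"text": "claim C", "supported_by_sources": True},
    {"text": "claim D", "supported_by_sources": False},
]
unsupported = sum(1 for c in claims if not c["supported_by_sources"])
print(f"unsupported rate: {unsupported / len(claims):.1%}")  # -> 50.0% in this toy case
```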
OpenAI declined to comment on the study's findings. Perplexity did not make a public statement but objected to the methodology: it pointed out that its tool lets users select the specific AI model (e.g. GPT-4) they believe is most likely to give the best answer, whereas the study used the default settings, under which the Perplexity tool chooses the model itself. Narayanan Venkit concedes that the research team did not consider this variable, but he believes most users would not know which AI model to choose anyway. You.com, Microsoft and Google did not respond to requests for comment.
“User complaints about such issues are frequent, and studies suggest that, despite major advances in AI systems, they can still produce one-sided or misleading answers,” said Felix Simon of the University of Oxford. “So this report provides some valuable evidence on the problem, which will hopefully spur further improvement in this area.”
However, even though the study's results are consistent with what users have reported about the tools' potential unreliability, not everyone is convinced by them. “The results of this report depend heavily on data labelled using large language models,” noted Alexandra Urman of the University of Zurich, Switzerland, “and there are several problems with that approach to labelling.” Any results annotated by an AI should be checked and validated by humans, and Urman worries that the researchers did not do enough at this step.
In addition, Urman questioned the statistical method the study used to verify that the small number of human annotations agreed with the AI's. She said the Pearson correlation used in the study was “very non-standard and unusual”.
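To illustrate the kind of concern Urman raises (a toy example, not the study's data): Pearson correlation measures linear association between two sets of scores, whereas agreement between annotators on categorical labels is more conventionally reported with a chance-corrected statistic such as Cohen's kappa, and the two can diverge sharply.

```python
# Toy example: two annotators whose scores are perfectly correlated but who never
# assign exactly the same label. Pearson's r is 1.0, while Cohen's kappa is negative.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 2, 3, 4, 1, 2, 3, 4]
annotator_b = [2, 3, 4, 5, 2, 3, 4, 5]  # systematically one point higher

r, _ = pearsonr(annotator_a, annotator_b)
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Pearson r = {r:.2f}, Cohen's kappa = {kappa:.2f}")  # r = 1.00, kappa < 0
```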
Despite the controversy over the validity of the results, Simon believes more needs to be done to ensure that users interpret the answers these tools give correctly. “The accuracy, diversity and reliability of AI answers need to improve, especially as these systems are applied more widely across various fields,” he said.