{"id":25302,"date":"2024-12-18T20:22:13","date_gmt":"2024-12-18T12:22:13","guid":{"rendered":"https:\/\/www.1ai.net\/?p=25302"},"modified":"2024-12-18T20:22:13","modified_gmt":"2024-12-18T12:22:13","slug":"%e8%b0%b7%e6%ad%8c%e5%8f%91%e5%b8%83-facts-grounding-%e5%9f%ba%e5%87%86%ef%bc%9agemini%e3%80%81gpt-4o%e3%80%81claude-%e5%bd%93%e8%af%84%e5%a7%94%ef%bc%8c%e6%88%90-ai-%e5%a4%a7%e8%af%ad%e8%a8%80","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/25302.html","title":{"rendered":"Google Releases FACTS Grounding Benchmark: Gemini, GPT-4o, and Claude as Judges, a \"Hallucination-Revealing Mirror\" for AI Large Language Models"},"content":{"rendered":"<p>On December 18th, it was reported that the <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e8%b0%b7%e6%ad%8c\" title=\"[View articles tagged with Google]\" target=\"_blank\" >Google<\/a> DeepMind team announced in a December 17th blog post the launch of the FACTS Grounding benchmark, which evaluates how accurately large language models (LLMs) can answer questions based on given material<strong> and how well they avoid \"hallucinations\" (i.e., fabricated information), thereby improving the factual accuracy of LLMs,<\/strong> enhancing user trust and expanding their applications.<\/p>\n<p><strong>Dataset<\/strong><\/p>\n<p>The FACTS Grounding dataset contains 1,719 examples spanning a variety of domains, including finance, technology, retail, healthcare, and law. Each example consists of a document, a system instruction requiring the LLM to answer based solely on that document, and an accompanying user prompt.<\/p>\n<p>Example documents vary in length, reaching up to 32,000 tokens (about 20,000 words). 
User requests cover tasks such as summarization, Q&amp;A generation, and rewriting, but exclude tasks that require creativity, mathematics, or complex reasoning. 1<a href=\"https:\/\/www.1ai.net\/en\/tag\/ai\" title=\"[View articles tagged with AI]\" target=\"_blank\" >AI<\/a> attaches a demo image below:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-25303\" title=\"22774be8j00soovms0020d000h4009mp\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/12\/22774be8j00soovms0020d000h4009mp.jpg\" alt=\"22774be8j00soovms0020d000h4009mp\" width=\"616\" height=\"346\" \/><\/p>\n<p>The dataset is divided into 860 \"public\" examples and 859 \"private\" examples; the public set has been released for use in evaluations, while the private set is reserved for leaderboard scoring to prevent benchmark contamination and leaderboard cheating.<\/p>\n<p><strong>Evaluation plan<\/strong><\/p>\n<p>For evaluation, FACTS Grounding uses three models, Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet, as judges to assess whether answers are adequate, factually accurate, and supported by the document.<\/p>\n<p>The evaluation proceeds in two phases: first, each response is judged for eligibility, i.e., whether it adequately answers the user's request; then it is judged for factual accuracy, i.e., whether it is fully grounded in the provided document with no \"hallucinations\". A model's final score is its average score across all examples.<\/p>\n<p>In the FACTS Grounding benchmark, Google's Gemini model achieved the highest score for factually accurate text generation.<\/p>\n<ul class=\"custom_reference list-paddingleft-1\">\n<li class=\"list-undefined list-reference-paddingleft\">\n<p data-vmark=\"3993\"><a href=\"https:\/\/the-decoder.com\/google-deepmind-launches-new-ai-fact-checking-benchmark-with-gemini-in-the-lead\/\" target=\"_blank\" rel=\"noopener\">Google DeepMind launches new 
AI fact-checking benchmark with Gemini in the lead<\/a><\/p>\n<\/li>\n<li class=\"list-undefined list-reference-paddingleft\">\n<p data-vmark=\"5541\"><a href=\"https:\/\/deepmind.google\/discover\/blog\/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models\/\" target=\"_blank\" rel=\"noopener\">FACTS Grounding: A new benchmark for evaluating the factuality of large language models<\/a><\/p>\n<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>On December 18th, the Google DeepMind team announced in a December 17th blog post the launch of the FACTS Grounding benchmark, which assesses how accurately large language models (LLMs) can answer questions based on given material and how well they avoid \u201challucinations\u201d (i.e., fabricated information), thereby improving the factual accuracy of LLMs, enhancing user trust, and expanding their applications. The FACTS Grounding dataset contains 1,719 examples spanning domains such as finance, technology, retail, healthcare, and law, each consisting of a document, a system instruction requiring the LLM to answer based on that document, and an accompanying user prompt. 
Example document lengths vary, maximum<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[411,281],"collection":[],"class_list":["post-25302","post","type-post","status-publish","format-standard","hentry","category-news","tag-ai","tag-281"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/25302","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=25302"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/25302\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=25302"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=25302"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=25302"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=25302"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}