April 22, 2025 - A new report states that Google's Gemini 2.5 Pro model can accurately parse the visual structure of PDF documents and provide precise visual citations, making it the first AI model to fully understand PDF layouts.
Note: Google released the Gemini 2.5 Pro experimental model to paid subscribers and developers on March 25 and, just four days later, made it available to users worldwide through a free web app.

Gemini 2.5 Pro not only extracts the textual content of PDF documents, but also understands their visual layout, including charts, tables and overall page structure.
According to Google's developer documentation, the model has "native vision" capabilities and can process up to 3,000 PDF files (each limited to 1,000 pages or 50MB). It also has a large context window of 1 million tokens, with plans to expand it to 2 million tokens.
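
To make the quoted limits concrete, here is a minimal sketch of a pre-flight check a client might run before submitting a batch of PDFs. The `PdfInfo` record and `check_batch` function are hypothetical names invented for illustration, the numbers are simply those cited above, and reading "50MB" as 50 MiB is an assumption.

```python
from dataclasses import dataclass

# Limits as quoted in the article; illustrative, not an authoritative spec.
MAX_FILES = 3_000
MAX_PAGES_PER_FILE = 1_000
MAX_BYTES_PER_FILE = 50 * 1024 * 1024  # assumption: "50MB" read as 50 MiB


@dataclass
class PdfInfo:
    """Hypothetical record describing one PDF to be submitted."""
    name: str
    pages: int
    size_bytes: int


def check_batch(files: list[PdfInfo]) -> list[str]:
    """Return human-readable limit violations (empty list if none)."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for f in files:
        if f.pages > MAX_PAGES_PER_FILE:
            problems.append(f"{f.name}: {f.pages} pages exceeds {MAX_PAGES_PER_FILE}")
        if f.size_bytes > MAX_BYTES_PER_FILE:
            problems.append(f"{f.name}: {f.size_bytes} bytes exceeds {MAX_BYTES_PER_FILE}")
    return problems
```

A batch that passes returns an empty list; anything else can be reported to the user before any upload is attempted.
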
Sergey Filimonov, co-founder of AI startup Matrisk, particularly praised Gemini 2.5 Pro's performance on visual citations in PDFs.
Filimonov points out that traditional text chunking methods sever the user's visual connection to the original document, making it impossible to visually verify where a piece of information came from. Even in ChatGPT, clicking a citation only downloads the PDF, leaving the user to work out whether the model is hallucinating, which seriously undermines user trust.
In the past, citing document content often meant highlighting large blocks of loosely relevant text with little precision. Gemini 2.5 changes this: it not only maps extracted text segments back to their exact location in the original PDF, but can also pinpoint specific sentences, table cells, and even images.
This gives users intuitive visual feedback. For example, when a user asks about a housing rate change, the relevant figure (e.g., a rate change of 15.4%) can be highlighted directly in the document along with the supporting source passage.
With a level of clarity and interactivity unmatched by existing tools, Gemini 2.5 not only optimizes existing processes, but also opens up a whole new paradigm of document interaction.
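
To draw such a highlight, an application has to turn the box returned by the model into coordinates on the rendered page. The sketch below assumes the model returns boxes as `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 range, a convention the article itself does not state; `to_page_box` is a hypothetical helper, and the page size is supplied by the caller.

```python
def to_page_box(box_2d, page_width, page_height, scale=1000):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to page units.

    Assumption: coordinates are normalized to 0..scale with the origin at
    the top-left corner of the page. Returns (x0, y0, x1, y1).
    """
    ymin, xmin, ymax, xmax = box_2d
    return (
        xmin / scale * page_width,
        ymin / scale * page_height,
        xmax / scale * page_width,
        ymax / scale * page_height,
    )


# Example: a box on a US Letter page measured in points (612 x 792).
x0, y0, x1, y1 = to_page_box([100, 200, 300, 400], 612, 792)
```

The resulting rectangle can then be handed to whatever PDF viewer or rendering layer the application uses to draw the highlight.
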
In benchmark terms, Gemini 2.5 Pro demonstrates strong spatial understanding, with an IoU (intersection over union) score of 0.804, well ahead of other models such as OpenAI's GPT-4o (0.223) and Claude 3.7 Sonnet (0.210).

| Provider | Model | IoU | Comment |
|---|---|---|---|
| Gemini | 2.5 Pro | 0.804 | rare |
| Gemini | 2.5 Flash | 0.614 | Sometimes good |
| Gemini | 2.0 Flash | 0.395 | |
| OpenAI | gpt-4o | 0.223 | |
| OpenAI | gpt-4.1 | 0.268 | |
| OpenAI | gpt-4.1-mini | 0.253 | |
| Claude | 3.7 Sonnet | 0.210 | |

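
For readers unfamiliar with the metric, IoU measures how well a predicted box overlaps a reference box: the area of their intersection divided by the area of their union, so 1.0 is a perfect match and 0.0 means no overlap. The following is a minimal sketch for axis-aligned boxes given as `(x0, y0, x1, y1)`; it illustrates the standard definition and is not the evaluation code behind the numbers above.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b

    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes offset by one unit share a 1x1 overlap: 1 / (4 + 4 - 1).
score = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A score of 0.804 therefore means the predicted region and the reference region overlap almost completely, while scores near 0.2 indicate boxes that only loosely touch the right area.
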
The potential of Gemini 2.5 goes far beyond text localization. It can also extract structured data from PDFs while clearly labeling the location of the source of each piece of data, solving the trust barrier in downstream decision-making that arises when the source of the data is unknown.
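
One way to picture structured extraction with provenance is a record that carries each value together with the place it was found. The sketch below is an assumption about how an application might model this, not a description of Gemini's output format; `SourceLocation`, `ExtractedField`, `unsourced`, and the sample page and box values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceLocation:
    """Where a value was found in the PDF (hypothetical schema)."""
    page: int                                # 1-based page number
    box: tuple[float, float, float, float]   # (x0, y0, x1, y1) on that page


@dataclass
class ExtractedField:
    """One extracted value plus an optional pointer to its source."""
    name: str
    value: str
    source: Optional[SourceLocation] = None


def unsourced(fields: list[ExtractedField]) -> list[str]:
    """Names of fields that cannot be traced back to the document."""
    return [f.name for f in fields if f.source is None]


# Illustrative data only: the 15.4% figure echoes the article's example,
# while the page, box, and second field are made up for the sketch.
fields = [
    ExtractedField("rate_change", "15.4%",
                   SourceLocation(page=3, box=(72.0, 410.0, 130.0, 424.0))),
    ExtractedField("effective_date", "unknown"),
]
needs_review = unsourced(fields)
```

Downstream code can then refuse to act on, or flag for review, any field whose source location is missing.
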