The Vary team has open-sourced GOT, a general end-to-end OCR-2.0 model. It shows stronger capabilities in converting PDF images to markdown, handling two-column text, natural-scene and fine-grained OCR, dynamic-resolution OCR, multi-page OCR, and recognizing a wider range of symbols. Although GOT performs well, it has limitations: language coverage is still narrow, and OCR performance on complex geometric diagrams and charts needs improvement. GOT's generality shows in two places: its input supports a variety of OCR tasks, and its output supports both plain text and formatted text. Structurally, it follows a vision encoder + input embedding layer + decoder pipeline. The main body of the encoder uses the ViTDet architecture with local attention, and its last two layers follow Vary's double-convolution design.
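To make the encoder description more concrete, here is a rough sketch of how a ViTDet-style patch embedding followed by two convolutional layers would reduce the number of image tokens handed to the decoder. The input size (1024), patch size (16), and the convolution settings (kernel 3, stride 2, padding 1) are illustrative assumptions, not values stated in this article, and the function names are hypothetical.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output side length of a square feature map after one convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def image_token_count(image_size: int = 1024, patch_size: int = 16,
                      num_convs: int = 2) -> int:
    """Number of image tokens after patch embedding and the trailing convolutions.

    All default values are assumptions chosen for illustration only.
    """
    # Patch embedding splits the image into a grid of patch_size x patch_size patches.
    side = image_size // patch_size
    # Each trailing convolution (assumed stride 2) halves the grid side length.
    for _ in range(num_convs):
        side = conv_out(side)
    return side * side


if __name__ == "__main__":
    print("tokens without convs:", image_token_count(num_convs=0))
    print("tokens with two convs:", image_token_count())
```

Under these assumed settings the two convolutions shrink the token grid from 64 x 64 to 16 x 16, which is the kind of compression that keeps the decoder's input sequence short.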
Training is divided into three stages. The first stage efficiently pre-trains the encoder, using the small OPT-125M as the decoder to provide an optimization direction for the encoder. The second stage jointly trains the encoder and decoder, pairing the pre-trained encoder with the larger Qwen-0.5B as the decoder. The third stage freezes the encoder and strengthens the decoder so that it adapts to more OCR application scenarios. To meet the data engineering challenges, the research team learned to use numerous data rendering tools. As for why they continue to study OCR in an era of intense competition among large models, the team believes that OCR is a down-to-earth technology: it was a culmination of AI-1.0-era techniques and has become a basic capability of multimodal large models in the AI-2.0 era, yet research on pure OCR is far from finished.
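The three-stage schedule can be summarized as a small configuration table. The sketch below is only an illustration of which component is trained at each stage as described here; the `Stage` class and `trainable_parts` helper are hypothetical names, and marking the OPT-125M decoder as trainable in stage one is an assumption, since the text only says it guides the encoder.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    decoder: str
    train_encoder: bool
    train_decoder: bool


STAGES = [
    # Stage 1: pre-train the encoder with a small decoder.
    # Assumption: the small decoder is also updated during this stage.
    Stage("encoder pre-training", "OPT-125M", True, True),
    # Stage 2: joint training with the larger decoder.
    Stage("joint encoder-decoder training", "Qwen-0.5B", True, True),
    # Stage 3: encoder frozen, only the decoder is strengthened.
    Stage("decoder post-training", "Qwen-0.5B", False, True),
]


def trainable_parts(stage: Stage) -> List[str]:
    """Return the names of the components updated in the given stage."""
    parts = []
    if stage.train_encoder:
        parts.append("encoder")
    if stage.train_decoder:
        parts.append("decoder")
    return parts


if __name__ == "__main__":
    for i, stage in enumerate(STAGES, start=1):
        print(f"stage {i}: {stage.name} | decoder={stage.decoder} "
              f"| trains {', '.join(trainable_parts(stage))}")
```

In a real training script, a table like this would typically drive which parameters have gradients enabled before each stage starts.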
Open-source repository:
https://github.com/Ucas-HaoranWei/GOT-OCR2.0
