The Natural Language Intelligence team at Tongyi Lab has open-sourced VRAG-RL, a framework that uses reinforcement learning and multimodal techniques to tackle the challenge of retrieving and reasoning over key information in visually rich documents. The framework introduces visual perception actions to improve information extraction, combines a multi-expert sampling strategy with a fine-grained reward mechanism to raise performance, and speeds up training with the GRPO algorithm. Experiments show that VRAG-RL performs well across a wide range of visual tasks, supporting multi-turn interaction and fine-grained reasoning.
Paper: https://arxiv.org/abs/2505.22019
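
The announcement names GRPO as the training algorithm but shows no code, so below is a minimal sketch of GRPO's core step: computing group-relative advantages, which replaces the learned critic used by PPO with statistics over a group of rollouts sampled for the same query. The function name, the example reward values, and the comment about combining retrieval and answer quality are illustrative assumptions, not VRAG-RL's actual API.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: each rollout's reward is
    normalized by the mean and std of its sampling group, so no separate
    value/critic model is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for a group of 4 rollouts on one query.
# In VRAG-RL the reward is fine-grained (the paper describes it as
# covering more than final-answer correctness); the values here are
# made up purely for illustration.
print(grpo_advantages([0.2, 0.8, 0.5, 0.9]))
```

Rollouts scoring above the group mean get positive advantages and are reinforced; those below are discouraged. VRAG-RL applies this group-based scheme over its multi-turn visual interactions, which is what the announcement credits for the improved training efficiency.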
