1. Overview

The paper addresses the "SFT plateau," a phenomenon where Supervised Fine-Tuning (SFT) performance of Large Language Models (LLMs) stops improving even as the dataset size increases [11, 22]. The authors use a specific corpus of chart-to-code data to demonstrate this limitation and propose Multimodal Structured Reinforcement Learning (MSRL) as a solution [11, 22].

2. Methodology

Supervised Fine-Tuning (SFT) Phase:
Baseline Model: Qwen2.5-VL-7B-Instruct [11, 22].
Data Scaling: Increasing the SFT data from 2M to 2.8M samples results in no further performance gains, confirming the plateau [22].

Multimodal Structured Reinforcement Learning (MSRL):
RL Data: Uses 11k pairs with a balance of textual and visual rewards in the RL stages; a reward sketch follows these notes.

Training Cost: The SFT stage requires 60 hours of training on 16 H800 GPUs. The RL stages take an additional 34 hours on 24 H800 GPUs [11].
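
To make the "balance of textual and visual rewards" concrete, below is a minimal Python sketch of how such a mixed reward for chart-to-code RL could look. It is an illustration under stated assumptions, not the paper's implementation: the difflib-based text similarity, the matplotlib pixel comparison, and the 0.5/0.5 weighting are all placeholders for reward details the notes do not specify.

# Minimal sketch of a combined textual + visual reward (all details assumed).
import difflib

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np


def render(code: str) -> np.ndarray:
    """Execute chart code and return the rendered figure as an RGBA array."""
    fig = plt.figure(figsize=(4, 3), dpi=100)  # fixed size so arrays align
    try:
        exec(code, {"plt": plt, "np": np})     # assumes sandboxed, trusted code
        fig.canvas.draw()
        return np.asarray(fig.canvas.buffer_rgba()).copy()
    finally:
        plt.close(fig)


def textual_reward(gen: str, ref: str) -> float:
    """Character-level similarity between generated and reference code."""
    return difflib.SequenceMatcher(None, gen, ref).ratio()


def visual_reward(gen: str, ref: str) -> float:
    """Mean pixel agreement between the two rendered charts."""
    try:
        a, b = render(gen), render(ref)
    except Exception:
        return 0.0  # code that fails to execute earns no visual reward
    return 1.0 - float(np.abs(a.astype(float) - b.astype(float)).mean() / 255.0)


def reward(gen: str, ref: str, w_text: float = 0.5, w_vis: float = 0.5) -> float:
    """Balanced mix of the two signals (the 0.5/0.5 weights are assumed)."""
    return w_text * textual_reward(gen, ref) + w_vis * visual_reward(gen, ref)


if __name__ == "__main__":
    ref = "plt.plot([1, 2, 3], [1, 4, 9])"
    gen = "plt.plot([1, 2, 3], [1, 4, 9.5])"
    print(f"reward = {reward(gen, ref):.3f}")

In practice the visual reward would likely use a more robust image metric than raw pixel difference, but the structure (execute the generated code, compare the render against the reference, blend with a text-level score) is the balance these notes describe.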