1ffc83b3aa8c274a2477daf6aff5dad7_origi.jpg

: The paper demonstrates that by effectively fine-tuning or prompting CLIP, models can achieve significantly higher accuracy in recognizing verbs and their associated semantic roles compared to previous state-of-the-art systems.

: They introduce methods to adapt CLIP's powerful visual-linguistic representations specifically for the task of generating structured descriptions that capture the "who, what, and where" of a scene. 1ffc83b3aa8c274a2477daf6aff5dad7_origi.jpg

This paper focuses on , which involves identifying not just objects in an image, but the actions (verbs) and the roles played by various entities (e.g., agent, tool, location). : The paper demonstrates that by effectively fine-tuning