arXiv
Language Models
Why vision-language AI forgets what it's seeing halfway through a long answer
Imagine trying to describe a photo while someone keeps interrupting with questions: eventually the interruptions pile up so much that you stop glancing at the photo and just rely on memory. That's roughly what happens to vision-language models during long outputs: as the generated text accumulates, the model attends less and less to the image and leans on its own earlier words instead.
Understanding why vision-language models lose visual grounding during extended generation could make them much more reliable for tasks like document analysis or image reasoning.
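To make the failure mode concrete, here is a minimal sketch (not the paper's method) of how one might probe it with Hugging Face transformers: track the share of attention each newly generated token pays to the image tokens and watch whether it shrinks over a long answer. The model checkpoint, the placeholder image URL, and the use of `config.image_token_index` to locate visual tokens are illustrative assumptions.

```python
# Sketch: measure attention on image tokens at each decoding step.
# Assumes a LLaVA-style VLM and a recent transformers version that expands
# the <image> placeholder into one input_id per visual patch.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed example checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # needed to return attention weights
)

image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)  # placeholder URL
prompt = "USER: <image>\nDescribe this photo in detail. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Boolean mask over prompt positions that hold image tokens.
image_mask = inputs["input_ids"][0] == model.config.image_token_index
prompt_len = image_mask.numel()

out = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=False,
    output_attentions=True,
    return_dict_in_generate=True,
)

# out.attentions: one entry per generated token; each entry is a per-layer
# tuple of (batch, heads, query_len, key_len) attention tensors.
for step, layers in enumerate(out.attentions):
    attn = layers[-1][0][:, -1, :]  # newest token's attention, last layer: (heads, key_len)
    img_share = attn[:, :prompt_len][:, image_mask].sum(-1).mean().item()
    print(f"step {step:3d}: attention share on image tokens = {img_share:.3f}")
```

If the grounding-loss story holds, the printed share should trend downward as the answer grows, since the key sequence fills up with the model's own text while the image token count stays fixed.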