Jefouree

The discoveries worth talking about each week.


arXiv Language Models

Why vision-language AI forgets what it's seeing halfway through a long answer

Imagine trying to describe a photo while someone keeps interrupting with questions. Eventually the interruptions pile up so much that you stop glancing at the photo and just rely on memory. Something similar happens to vision-language models during long outputs.

Understanding why vision-language models lose visual grounding during extended generation could make them far more reliable for tasks like document analysis and image-based reasoning.
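One intuition for the "reliance on memory" above, sketched as a toy calculation (this is an illustration, not the article's analysis): if a decoder spreads its attention roughly uniformly over all prior tokens, the share of attention landing on a fixed set of image tokens shrinks as the generated answer grows. The token counts below are assumptions for illustration.

```python
def image_attention_share(num_image_tokens: int, num_text_tokens: int) -> float:
    """Fraction of attention mass on image tokens, assuming uniform attention
    over all image and generated-text tokens (a deliberate simplification)."""
    return num_image_tokens / (num_image_tokens + num_text_tokens)


if __name__ == "__main__":
    # 576 image tokens is a plausible vision-encoder patch count (assumption).
    for step in (0, 64, 256, 1024):
        share = image_attention_share(576, step)
        print(f"after {step:>4} generated tokens: {share:.1%} of attention on the image")
```

Real models attend non-uniformly, but the toy version captures the qualitative trend: the longer the answer, the smaller the slice of attention the image can claim.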

