Why human-in-the-loop
Fully automatic QA generation over 30-minute videos produces too many factually wrong or trivially answerable items. A small human-review step at the generation stage, not just the evaluation stage, was what made the resulting benchmark worth running inference on.
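As a rough illustration of what a generation-stage review gate can look like, here is a minimal sketch. It is not the pipeline actually used: the Candidate type, the verdict labels ("approve", "wrong", "trivial"), and both helper functions are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One automatically generated QA item awaiting human review (hypothetical schema)."""
    question: str
    answer: str
    # Filled in by a human reviewer: "approve", "wrong", or "trivial".
    # None means the item has not been reviewed yet.
    verdict: Optional[str] = None


def accepted(candidates):
    """Keep only items a reviewer explicitly approved; unreviewed items are held back."""
    return [c for c in candidates if c.verdict == "approve"]


def rejection_counts(candidates):
    """Count rejected items per reason, ignoring approved and unreviewed ones."""
    counts = {}
    for c in candidates:
        if c.verdict not in (None, "approve"):
            counts[c.verdict] = counts.get(c.verdict, 0) + 1
    return counts
```

The design choice worth noting is that an unreviewed item is treated as not accepted, so nothing reaches the benchmark without an explicit human decision.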
What the benchmark surfaced
Long-form Video-LMMs degrade in characteristic ways: temporal grounding ("at what point did X happen") collapses faster than entity recognition; egocentric reasoning is the hardest axis. The benchmark made these failure modes visible side-by-side.
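One way to get that side-by-side view is to score each model per question axis rather than with a single aggregate number. The sketch below assumes each result is tagged with an axis label such as "temporal_grounding", "entity_recognition", or "egocentric_reasoning"; the function names and data layout are hypothetical, not the benchmark's actual tooling.

```python
from collections import defaultdict


def accuracy_by_axis(results):
    """results: iterable of (axis, is_correct) pairs. Returns {axis: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for axis, ok in results:
        totals[axis] += 1
        correct[axis] += int(ok)
    return {axis: correct[axis] / totals[axis] for axis in totals}


def side_by_side(per_model):
    """per_model: {model_name: {axis: accuracy}}. Returns table rows, header first."""
    axes = sorted({axis for scores in per_model.values() for axis in scores})
    models = sorted(per_model)
    rows = [["axis"] + models]
    for axis in axes:
        # Show "n/a" when a model has no results for this axis.
        cells = [
            f"{per_model[m][axis]:.2f}" if axis in per_model[m] else "n/a"
            for m in models
        ]
        rows.append([axis] + cells)
    return rows
```

Each row of the returned table is one axis and each column one model, which is the layout that makes an axis-specific collapse easy to spot.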