MBZUAI

Long-Form Video Understanding Benchmark - MBZUAI IVAL

Existing Video-LMM benchmarks focused on short clips (seconds to a minute), but downstream applications such as egocentric assistance, surveillance, and instructional video require reasoning over 3 to 30 minutes of footage. There was no systematic way to evaluate long-form video understanding, and prior benchmarks under-sampled egocentric content.

2024

Approach

Built a human-in-the-loop QA generation pipeline that draws on both egocentric (Ego4D) and non-egocentric long-form videos. The pipeline proposed candidate QA pairs, surfaced them for human review, and emitted curated benchmark items only after sign-off. Implemented inference and evaluation code to benchmark 7 open-source and 2 proprietary Video-LMMs against the resulting test set.
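
The review gate described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Candidate` and `Review` types, their field names, and the `curated_items` function are all hypothetical, chosen only to show that nothing reaches the benchmark without an explicit approval.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator


class Review(Enum):
    """Hypothetical review states for a candidate QA pair."""
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Candidate:
    """A proposed QA pair tied to a source video (illustrative fields only)."""
    video_id: str
    question: str
    answer: str
    review: Review = Review.PENDING


def curated_items(candidates: Iterable[Candidate]) -> Iterator[Candidate]:
    """Yield only the candidates a human reviewer has approved.

    Pending and rejected candidates are both withheld, so an item that
    nobody has looked at never enters the benchmark.
    """
    for candidate in candidates:
        if candidate.review is Review.APPROVED:
            yield candidate
```

Defaulting every candidate to `PENDING` is the design choice that matters in this sketch: an unreviewed item is treated the same as a rejected one, so sign-off has to be an explicit action.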

Why human-in-the-loop

Fully automatic QA generation over 30-minute videos produces too many factually wrong or trivially answerable items. A small human-review step at the generation stage, not just the evaluation stage, was what made the resulting benchmark worth running inference on.

What the benchmark surfaced

Video-LMMs degrade on long-form video in characteristic ways: temporal grounding ("at what point did X happen") collapses faster than entity recognition, and egocentric reasoning is the hardest axis. The benchmark made these failure modes visible side by side.
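
A side-by-side view of this kind usually comes from breaking accuracy down by question category rather than reporting a single score. The sketch below shows one way to do that. It assumes each evaluated item carries a category label and a correctness flag; the function name and the label strings are hypothetical and are not taken from the benchmark's code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute per-category accuracy from (category, is_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    # Every category in `totals` has at least one item, so no division by zero.
    return {category: correct[category] / totals[category] for category in totals}


# Synthetic toy input for illustration only; these are not benchmark results.
toy_results = [
    ("temporal_grounding", False),
    ("temporal_grounding", True),
    ("entity_recognition", True),
]
per_category = accuracy_by_category(toy_results)
```

Running the same breakdown for each model and placing the resulting dictionaries next to each other gives the per-axis comparison described above.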