Why these three extensions
Each addressed a separate failure mode. Token explosion on long videos is a budget problem: MovieChat and Token Merging compress redundant frame tokens. Lack of grounding is a precision problem: pixel-level outputs let the model say that car, not just a car. Lack of audio is an information problem: speech and ambient sound carry meaning the visual stream alone misses.
Where Whisper fit
Audio-to-text transcription was the cheapest way to plug an audio channel into a VLM that had none. The transcript joined the language context, so the existing language stack handled it without retraining. Imperfect (Whisper's errors propagate), but a useful baseline before considering a learned audio encoder.
Where it broke
Long-video evaluation was thinner than I wanted. The benchmarks for multi-minute video QA were sparse in 2023, so the comparison was qualitative on hand-curated cases more often than ideal. The Long-form Video QA benchmark project that followed was partly a response to that gap.