MBZUAI

LMM Prompt-Sensitivity Evaluation Harness - MBZUAI IVAL

Evaluations of large multimodal models (LMMs) typically report a single accuracy number per benchmark, but that number silently depends on how the prompt is phrased. A robust LMM should not swing 15 points on the same MCQA question because of cosmetic prompt edits - and the field had no systematic way to measure how much models actually do.

2025

Approach

Built an evaluation harness on 20× NVIDIA A100s that orchestrated 1,500+ inference runs across 8 open-source and 2 proprietary LMMs on 3 MCQA benchmarks, sweeping a 61-prompt sensitivity grid spanning 15 categories. The work became Promptception, accepted at EMNLP 2025 (Findings).

What the harness measured

For each (model, benchmark, prompt) triple, the harness ran inference, scored the MCQA output, and wrote the result into a shared grid so that runs could be compared directly. The 61 prompt variants were grouped into 15 categories - task framing, role priming, output format, refusal phrasing, and so on - so each accuracy swing could be attributed to the kind of phrasing the model was sensitive to.
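The grid and the per-category attribution can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the record layout, the function names (`overall_spread`, `category_spread`), the model and benchmark labels, and every accuracy value are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical records: (model, benchmark, prompt_id, category, accuracy in %).
# All values are made-up placeholders, not results from the study.
RESULTS = [
    ("model_a", "bench_x", "p01", "task framing", 62.0),
    ("model_a", "bench_x", "p02", "task framing", 58.5),
    ("model_a", "bench_x", "p03", "output format", 49.0),
    ("model_a", "bench_x", "p04", "output format", 61.0),
    ("model_b", "bench_x", "p01", "task framing", 70.0),
    ("model_b", "bench_x", "p02", "task framing", 69.0),
    ("model_b", "bench_x", "p03", "output format", 68.0),
    ("model_b", "bench_x", "p04", "output format", 71.5),
]


def spread(values):
    """Accuracy swing: best prompt minus worst prompt, in points."""
    return max(values) - min(values)


def overall_spread(results):
    """Swing per (model, benchmark) across all prompt variants."""
    grid = defaultdict(list)
    for model, bench, _prompt_id, _category, acc in results:
        grid[(model, bench)].append(acc)
    return {key: spread(accs) for key, accs in grid.items()}


def category_spread(results):
    """Swing per (model, benchmark, category), for attribution."""
    grid = defaultdict(list)
    for model, bench, _prompt_id, category, acc in results:
        grid[(model, bench, category)].append(acc)
    return {key: spread(accs) for key, accs in grid.items()}


if __name__ == "__main__":
    for key, swing in sorted(overall_spread(RESULTS).items()):
        print(key, f"{swing:.1f} pts")
    for key, swing in sorted(category_spread(RESULTS).items()):
        print(key, f"{swing:.1f} pts")
```

With these placeholder numbers, the hypothetical `model_a` shows a large overall swing that is concentrated in the "output format" category, which is the kind of attribution the category grouping makes possible.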

Why this mattered

The measured answer is uncomfortable: accuracy swings of up to 15 points from prompt phrasing alone. That turns "which model is best?" from a single number into a distribution.
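One way to treat the comparison as a distribution is to summarize each model over all prompt variants and count how often each model comes out on top. The sketch below assumes a simple nested-dict layout; the names `ACC`, `summarize`, and `wins_per_prompt` and all accuracy values are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical accuracy (%) per model per prompt variant; placeholder values.
ACC = {
    "model_a": {"p01": 62.0, "p02": 58.5, "p03": 49.0, "p04": 72.0},
    "model_b": {"p01": 70.0, "p02": 69.0, "p03": 68.0, "p04": 71.5},
}


def summarize(acc):
    """Per-model summary of accuracy across prompt variants."""
    summary = {}
    for model, by_prompt in acc.items():
        scores = list(by_prompt.values())
        summary[model] = {
            "mean": mean(scores),
            "std": pstdev(scores),
            "min": min(scores),
            "max": max(scores),
        }
    return summary


def wins_per_prompt(acc):
    """Count, for each model, the prompts on which it scores highest."""
    prompts = next(iter(acc.values())).keys()
    return Counter(max(acc, key=lambda m: acc[m][p]) for p in prompts)


if __name__ == "__main__":
    for model, stats in summarize(ACC).items():
        print(model, {k: round(v, 2) for k, v in stats.items()})
    print(wins_per_prompt(ACC))
```

In this toy data the hypothetical `model_b` wins on three of four prompts, yet a report based only on prompt `p04` would rank `model_a` first, which is why a single number per model can mislead.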