What the harness measured
For each (model, benchmark, prompt) triple, the harness ran inference, scored the multiple-choice (MCQA) output, and wrote the result into a grid whose cells could be compared directly. The 61 prompt variants were grouped into 15 categories (task framing, role priming, output format, refusal phrasing, and so on) so that each swing could be attributed to the kind of phrasing the model was sensitive to.
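The attribution step can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the record layout, the `swing_by_category` helper, and every model name, benchmark name, and accuracy value are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical records: (model, benchmark, prompt_id, category, accuracy).
# Every name and number here is made up for illustration only.
RECORDS = [
    ("model-a", "bench-x", "p01", "task framing", 0.70),
    ("model-a", "bench-x", "p02", "task framing", 0.62),
    ("model-a", "bench-x", "p03", "output format", 0.68),
    ("model-a", "bench-x", "p04", "output format", 0.67),
]

def swing_by_category(records):
    """Map (model, benchmark, category) to max - min accuracy over its prompts."""
    cells = defaultdict(list)
    for model, benchmark, _prompt_id, category, accuracy in records:
        cells[(model, benchmark, category)].append(accuracy)
    return {key: max(accs) - min(accs) for key, accs in cells.items()}

if __name__ == "__main__":
    for key, swing in sorted(swing_by_category(RECORDS).items()):
        print(key, round(swing, 3))
```

Grouping by category before taking the spread is what lets a large swing be read as sensitivity to one kind of phrasing rather than to prompts in general.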
Why this mattered
The answer is uncomfortable: prompt phrasing alone produced accuracy swings of up to 15 points. That turns "which model is best?" from a single number into a distribution.
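One way to make the "distribution, not a number" point concrete is to summarize each model by its spread across prompts and check whether a ranking survives that spread. The sketch below assumes accuracies are given in percentage points. The `summarize` and `ranking_is_stable` helpers are hypothetical names, and the sample values are invented for illustration.

```python
from statistics import median

def summarize(accuracies):
    """Summarize one model's accuracy across prompt variants."""
    lo, hi = min(accuracies), max(accuracies)
    return {"min": lo, "median": median(accuracies), "max": hi, "swing": hi - lo}

def ranking_is_stable(a, b):
    """True only if one model's worst prompt still beats the other's best."""
    return min(a) > max(b) or min(b) > max(a)

if __name__ == "__main__":
    # Invented accuracies (percentage points) for two hypothetical models.
    model_a = [71, 64, 78]
    model_b = [70, 69, 72]
    print(summarize(model_a))
    print(summarize(model_b))
    print("stable ranking:", ranking_is_stable(model_a, model_b))
```

Under this deliberately strict criterion, two models whose accuracy ranges overlap cannot be ranked without also stating which prompt was used.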