What the graph does
Three branches run in parallel: ground-truth Q&A generation, candidate-answer generation across multiple providers, and judge-based scoring. The graph fans each provider's candidate answers out to the judge alongside the ground truth, then emits a per-provider scoreboard plus one Langfuse trace per run.
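To make the data flow concrete, here is a minimal sketch in plain Python with asyncio. It is not the actual graph: every function name, the provider names, and the canned answers are hypothetical stand-ins for real model calls, and Langfuse tracing is left out. It also assumes candidates need the generated questions, so ground truth is produced first and only the per-provider work runs concurrently; the real graph may schedule its branches differently.

```python
import asyncio
from statistics import mean

# Hypothetical canned answers standing in for real provider API calls.
CANNED_ANSWERS = {"provider_a": "Paris", "provider_b": "Lyon"}


async def generate_ground_truth() -> list[dict]:
    # A real node would ask a model to write Q&A pairs from source material.
    return [{"question": "What is the capital of France?", "answer": "Paris"}]


async def generate_candidate(provider: str, question: str) -> str:
    # A real node would send `question` to the provider's model.
    return CANNED_ANSWERS[provider]


async def judge(expected: str, candidate: str) -> float:
    # A real judge would be a model call returning a graded score;
    # exact match is used here only to keep the sketch deterministic.
    return 1.0 if candidate == expected else 0.0


async def run_benchmark(providers: list[str]) -> dict[str, float]:
    qa_pairs = await generate_ground_truth()

    async def score_provider(provider: str) -> float:
        scores = []
        for qa in qa_pairs:
            candidate = await generate_candidate(provider, qa["question"])
            scores.append(await judge(qa["answer"], candidate))
        return mean(scores)

    # Fan out: one scoring task per provider, run concurrently.
    results = await asyncio.gather(*(score_provider(p) for p in providers))
    return dict(zip(providers, results))


scoreboard = asyncio.run(run_benchmark(["provider_a", "provider_b"]))
print(scoreboard)
```

The returned dictionary plays the role of the per-provider scoreboard: one averaged judge score per provider, which is the number that gets compared between runs.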
Why this changed daily work
Once any prompt or model change is a single command away from a benchmarked diff, "does this regress?" stops being a debate and starts being a number. The Langfuse traces close the loop for stakeholders who want to see the actual prompts and outputs.