LangGraph Evaluation Pipeline - Reproducible benchmarking for portfolio agents

What the graph does

Three branches run in parallel: ground-truth Q&A generation, candidate-answer generation across multiple providers, and judge-based scoring. The graph scores each provider's candidate answers against the ground truth and emits a per-provider scoreboard plus a Langfuse trace per run.
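
The shape of that flow can be sketched with the standard library alone. This is not the LangGraph implementation: it swaps the graph for a plain thread pool, leaves out Langfuse tracing, and assumes scoring can only start once both generation steps have produced output. Every function name here (generate_ground_truth, generate_candidates, judge, run_benchmark) is hypothetical, and the stubs stand in for real provider and judge calls.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


# Hypothetical stand-ins: a real pipeline would call LLM providers here.
def generate_ground_truth(questions):
    """Return one reference answer per question (stubbed)."""
    return {q: f"reference answer to {q}" for q in questions}


def generate_candidates(provider, questions):
    """Return one candidate answer per question for a single provider (stubbed)."""
    return {q: f"{provider} answer to {q}" for q in questions}


def judge(reference, candidate):
    """Toy judge: token overlap (Jaccard) in [0, 1]. A real judge would be an LLM."""
    ref, cand = set(reference.split()), set(candidate.split())
    return len(ref & cand) / len(ref | cand)


def run_benchmark(questions, providers):
    """Generate ground truth and candidates concurrently, then score per provider."""
    with ThreadPoolExecutor() as pool:
        # Fan out: one task for ground truth, one task per provider.
        truth_future = pool.submit(generate_ground_truth, questions)
        candidate_futures = {
            p: pool.submit(generate_candidates, p, questions) for p in providers
        }
        # Fan in: wait for every branch before scoring.
        truth = truth_future.result()
        candidates = {p: f.result() for p, f in candidate_futures.items()}

    # Scoreboard: mean judge score per provider across all questions.
    return {
        p: mean(judge(truth[q], answers[q]) for q in questions)
        for p, answers in candidates.items()
    }


if __name__ == "__main__":
    board = run_benchmark(["q1", "q2"], ["provider_a", "provider_b"])
    for provider, score in sorted(board.items()):
        print(f"{provider}: {score:.2f}")
```

The scoreboard is a plain dict keyed by provider, so two runs can be compared by subtracting scores key by key.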

Why this changed daily work

Once any prompt or model change is a single command away from a benchmarked diff, "does this regress?" stops being a debate and starts being a number. Langfuse traces close the loop for stakeholders who want to see the actual prompts and outputs.