2PointZero Group

LangGraph Evaluation Pipeline - Reproducible benchmarking for portfolio agents

Portfolio analytics agents at 2PointZero were being tested ad hoc: different prompts, different runs, no shared ground truth, and no way to compare providers or regression-test prompt changes.

2026

Approach

Built a LangGraph-orchestrated evaluation pipeline that auto-generates ground-truth Q&A datasets from agent context, runs multi-provider parallel answer generation, scores answers via LLM-as-Judge, and traces every run through Langfuse for audit and side-by-side comparison.

What the graph does

Three branches run in parallel: ground-truth Q&A generation, candidate-answer generation across multiple providers, and judge-based scoring. The graph fans the candidate answers out to the judge, scores them against the ground truth, and emits a per-provider scoreboard plus a Langfuse trace per run.
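The data flow described above can be sketched in plain Python. This is an illustration only, not the actual LangGraph graph: the dataset, the two stub providers, and the exact-match `judge` function are all hypothetical stand-ins for the generated ground truth, the real provider calls, and the LLM-as-Judge step, and Langfuse tracing is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical ground-truth set; the real pipeline generates this from agent context.
GROUND_TRUTH = [
    {"question": "What is the largest holding?", "answer": "ACME"},
    {"question": "What is the cash weight?", "answer": "5%"},
]


def provider_a(question: str) -> str:
    """Stub provider (assumption): stands in for a real model call."""
    answers = {"What is the largest holding?": "ACME", "What is the cash weight?": "5%"}
    return answers[question]


def provider_b(question: str) -> str:
    """Second stub provider that gets one answer wrong."""
    answers = {"What is the largest holding?": "acme", "What is the cash weight?": "7%"}
    return answers[question]


def judge(expected: str, candidate: str) -> float:
    """Stand-in for LLM-as-Judge: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == candidate.strip().lower() else 0.0


def run_provider(name, ask, dataset):
    """Answer every question with one provider and return its mean judge score."""
    scores = [judge(item["answer"], ask(item["question"])) for item in dataset]
    return name, mean(scores)


def build_scoreboard(providers, dataset):
    """Fan out one task per provider in parallel, then collect a per-provider scoreboard."""
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = [
            pool.submit(run_provider, name, ask, dataset)
            for name, ask in providers.items()
        ]
        return dict(future.result() for future in futures)


scoreboard = build_scoreboard(
    {"provider_a": provider_a, "provider_b": provider_b}, GROUND_TRUTH
)
print(scoreboard)
```

Here each provider runs in its own thread and is scored against the same shared dataset, which is what makes the resulting numbers comparable across providers.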

Why this changed daily work

Once any prompt or model change is a single command away from a benchmarked diff, "does this regress?" stops being a debate and starts being a number. Langfuse traces close the loop for stakeholders who want to see the actual prompts and outputs.
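As a rough illustration of what a "benchmarked diff" could look like, the sketch below compares two per-provider scoreboards and flags regressions. The function name `diff_scoreboards`, the `tolerance` parameter, and the example scores are assumptions made for this sketch; they are not taken from the pipeline itself.

```python
def diff_scoreboards(baseline: dict, candidate: dict, tolerance: float = 0.01) -> dict:
    """Compare two scoreboards and mark providers whose score dropped by more than `tolerance`."""
    report = {}
    # Only providers present in both runs can be compared.
    for provider in sorted(baseline.keys() & candidate.keys()):
        delta = round(candidate[provider] - baseline[provider], 4)
        report[provider] = {"delta": delta, "regressed": delta < -tolerance}
    return report


# Hypothetical scores before and after a prompt change.
baseline_scores = {"provider_a": 0.90, "provider_b": 0.75}
changed_scores = {"provider_a": 0.92, "provider_b": 0.60}

report = diff_scoreboards(baseline_scores, changed_scores)
for provider, row in report.items():
    print(provider, row["delta"], "REGRESSED" if row["regressed"] else "ok")
```

The tolerance keeps small run-to-run noise in judge scores from being reported as a regression; the right threshold would depend on how stable the judge is in practice.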