Current status
Leaderboard scoring should remain pending until the benchmark methodology and evaluation data are approved.
This page should make the standard visible without implying unsupported competitor rankings. A placeholder leaderboard can explain what will be scored and how results will be versioned.
What the leaderboard will compare
Future versions can compare agents by task family, rubric dimension, output quality, and review readiness.
The most useful leaderboard will show strengths by workflow instead of a single blended score. A generic model can be strong at summarization and still weak at BOE support or clause review.
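To illustrate why a per-workflow view can say more than one blended number, here is a minimal sketch in Python. The agent, the task family keys, the 0-100 scale, and every score are hypothetical placeholders chosen for illustration; they are not benchmark results and do not reflect an approved methodology.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task scores (0-100) for one placeholder agent.
# These numbers are invented for illustration only.
task_scores = [
    {"task_family": "summarization", "score": 90},
    {"task_family": "summarization", "score": 80},
    {"task_family": "clause_review", "score": 40},
    {"task_family": "boe_support", "score": 30},
]


def scores_by_family(rows):
    """Average the scores within each task family."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["task_family"]].append(row["score"])
    return {family: mean(values) for family, values in grouped.items()}


def blended_score(rows):
    """Average every task score into a single number."""
    return mean(row["score"] for row in rows)


by_family = scores_by_family(task_scores)
blended = blended_score(task_scores)

for family, score in by_family.items():
    print(f"{family}: {score}")
print(f"blended: {blended}")
```

With these placeholder values the blended score is 60, which hides the gap between summarization at 85 and BOE support at 30. A real leaderboard would also need to decide how task families are weighted, which this sketch leaves out.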
How results should be governed
Results need versioning, reviewer notes, known limitations, and clear disclosure of task packet design.
Benchmark results should be treated as procurement support, not as final technical claims. Every result needs enough context for buyers to understand what was measured.
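One way to keep that context attached to every result is to store it in the result record itself. The sketch below is an assumption about how such a record could look, not a defined schema: the class name BenchmarkResult, its field names, and the missing_context check are all hypothetical, and the sample values are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical result record; field names are illustrative assumptions."""
    agent: str
    task_family: str
    score: float
    methodology_version: str           # which version of the methodology produced this
    task_packet_summary: str           # disclosure of how the task packet was designed
    reviewer_notes: str = ""
    known_limitations: Tuple[str, ...] = ()

    def missing_context(self) -> List[str]:
        """Return the names of governance fields that are still empty."""
        missing = []
        if not self.methodology_version.strip():
            missing.append("methodology_version")
        if not self.task_packet_summary.strip():
            missing.append("task_packet_summary")
        if not self.reviewer_notes.strip():
            missing.append("reviewer_notes")
        if not self.known_limitations:
            missing.append("known_limitations")
        return missing


# Placeholder record with invented values, used only to exercise the check.
draft = BenchmarkResult(
    agent="example-agent",
    task_family="clause_review",
    score=40.0,
    methodology_version="draft-0",
    task_packet_summary="Placeholder description of the task packet.",
)
print(draft.missing_context())
```

In this sketch a result with a non-empty missing_context list would stay unpublished. The draft above would be held back because it has no reviewer notes and no known limitations recorded.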