Rubric dimensions
The benchmark scores agent output on eight dimensions: Source faithfulness, Completeness, Compliance awareness, GovCon domain reasoning, Output usability, Auditability, Human-review readiness, and Security posture awareness.
Each dimension should be scored with written evidence and reviewer notes. The point is to reveal where the agent is useful and where human review remains essential.
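One way to make "scored with written evidence and reviewer notes" concrete is a small record type that refuses a score with no evidence attached. The sketch below is illustrative only: the dimension names come from the rubric, but the 0 to 4 scale, the `DimensionScore` class, and the `unscored_dimensions` helper are assumptions, not part of the benchmark.

```python
from dataclasses import dataclass

# The eight dimension names are taken from the rubric text.
DIMENSIONS = (
    "Source faithfulness",
    "Completeness",
    "Compliance awareness",
    "GovCon domain reasoning",
    "Output usability",
    "Auditability",
    "Human-review readiness",
    "Security posture awareness",
)

# Assumption: a 0-4 integer scale. The rubric does not specify a scale.
MIN_SCORE, MAX_SCORE = 0, 4


@dataclass(frozen=True)
class DimensionScore:
    """One reviewer judgment for one rubric dimension (hypothetical shape)."""

    dimension: str
    score: int
    evidence: str        # written evidence supporting the score
    reviewer_notes: str  # free-form notes from the human reviewer

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        if not MIN_SCORE <= self.score <= MAX_SCORE:
            raise ValueError(f"score out of range: {self.score}")
        if not self.evidence.strip():
            raise ValueError("a score must be backed by written evidence")


def unscored_dimensions(scores):
    """Return rubric dimensions that have no score yet, in rubric order."""
    seen = {s.dimension for s in scores}
    return [d for d in DIMENSIONS if d not in seen]
```

Rejecting empty evidence at construction time keeps a bare number from entering the results, and `unscored_dimensions` shows a reviewer which dimensions still need a judgment.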
Failure mode tracking
Failure modes are part of the rubric because GovCon buyers need to know how AI breaks before they trust it.
Tracked failure modes include missing requirements, unsupported assumptions, weak citations, invented facts, poor FAR/DFARS awareness, and outputs that cannot be reviewed or defended.
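Tracking these failure modes could be as simple as tagging each observed failure and counting by type. The following is a minimal sketch under stated assumptions: the `FailureMode` enum mirrors the list above, while `tally_failures` and the `(task_id, mode)` observation format are hypothetical names introduced for illustration.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """Failure modes named in the rubric text."""

    MISSING_REQUIREMENT = "missing requirement"
    UNSUPPORTED_ASSUMPTION = "unsupported assumption"
    WEAK_CITATION = "weak citation"
    INVENTED_FACT = "invented fact"
    POOR_FAR_DFARS_AWARENESS = "poor FAR/DFARS awareness"
    NOT_REVIEWABLE = "output cannot be reviewed or defended"


def tally_failures(observations):
    """Count how often each failure mode was observed.

    `observations` is an iterable of (task_id, FailureMode) pairs.
    Modes that never occurred are reported with a count of 0, so the
    summary always covers every tracked mode.
    """
    counts = Counter(mode for _, mode in observations)
    return {mode: counts.get(mode, 0) for mode in FailureMode}
```

Reporting zero counts explicitly makes it visible which failure modes were checked and not observed, as opposed to never checked at all.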
Human-review readiness
A strong agent output should accelerate human review rather than bypass it.
The benchmark should reward outputs with clear assumptions, traceable evidence, reviewer questions, and next actions that fit existing GovCon operating rhythms.
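A reviewer could start with a mechanical check that an output carries the four elements named above before judging their quality. This sketch assumes a hypothetical `AgentOutput` structure and `missing_review_elements` helper. It checks only that each element is present, not whether the content is sound or whether the next actions fit a team's operating rhythm, which remains a human judgment.

```python
from dataclasses import dataclass, field


@dataclass
class AgentOutput:
    """Hypothetical container for the review-facing parts of an agent output."""

    assumptions: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    reviewer_questions: list = field(default_factory=list)
    next_actions: list = field(default_factory=list)


def missing_review_elements(output):
    """Return the names of review elements that are absent or empty."""
    required = ("assumptions", "evidence", "reviewer_questions", "next_actions")
    return [name for name in required if not getattr(output, name)]
```

For example, an output that states assumptions and evidence but no reviewer questions or next actions would return `["reviewer_questions", "next_actions"]`, telling the reviewer what to ask the agent for before spending time on a full read.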