A reusable evaluation framework for LLM-as-Judge and multi-agent workflows.
structured-evaluation provides standardized types for evaluation reports, enabling:
- LLM-as-Judge assessments with categorical scoring and severity-based findings
- Dual-scale support with Likert (1-5) scales for human comparison studies
- Inter-rater reliability metrics for LLM calibration and quality assurance
- GO/NO-GO summary reports for deterministic checks (CI, tests, validation)
- Multi-agent coordination with DAG-based report aggregation
```bash
go get github.com/plexusone/structured-evaluation
```

| Package | Description |
|---|---|
| evaluation | EvaluationReport, CategoryResult, Finding, Severity types |
| summary | SummaryReport, TeamSection, TaskResult for GO/NO-GO checks |
| combine | DAG-based report aggregation using Kahn's algorithm |
| render/box | Box-format terminal renderer for summary reports |
| render/detailed | Detailed terminal renderer for evaluation reports |
| render/terminal | ANSI-colored terminal renderer with UTF-8 icons |
| render/markdown | Markdown report renderer |
| schema | JSON Schema generation and embedding |
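The table describes the combine package as using Kahn's algorithm for DAG-based aggregation. As a rough illustration of that technique, and not the package's actual code, here is a standard-library sketch. The `topoSort` function and its map-based input are hypothetical names chosen for this example; ties are broken alphabetically only to keep the output deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// topoSort orders node IDs so that every node appears after all of its
// dependencies. deps maps a node ID to the IDs it depends on.
// Hypothetical helper for illustration; not part of the combine package.
func topoSort(deps map[string][]string) ([]string, error) {
	inDegree := make(map[string]int, len(deps))
	dependents := make(map[string][]string)
	for id, ds := range deps {
		if _, ok := inDegree[id]; !ok {
			inDegree[id] = 0
		}
		for _, d := range ds {
			if _, ok := deps[d]; !ok {
				return nil, fmt.Errorf("unknown dependency %q", d)
			}
			inDegree[id]++
			dependents[d] = append(dependents[d], id)
		}
	}

	// Start with every node that has no unmet dependencies.
	var ready []string
	for id, n := range inDegree {
		if n == 0 {
			ready = append(ready, id)
		}
	}
	sort.Strings(ready)

	order := make([]string, 0, len(deps))
	for len(ready) > 0 {
		id := ready[0]
		ready = ready[1:]
		order = append(order, id)

		// Emitting a node satisfies one dependency of each dependent.
		var unlocked []string
		for _, next := range dependents[id] {
			inDegree[next]--
			if inDegree[next] == 0 {
				unlocked = append(unlocked, next)
			}
		}
		sort.Strings(unlocked)
		ready = append(ready, unlocked...)
	}

	// Nodes that never reached in-degree zero are part of a cycle.
	if len(order) != len(deps) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := topoSort(map[string][]string{
		"qa":       nil,
		"security": {"qa"},
		"release":  {"qa", "security"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order) // [qa security release]
}
```

A cycle leaves some nodes with a non-zero in-degree, so comparing the output length with the input size is enough to detect it.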
For subjective quality assessments with detailed findings:
```go
import "github.com/plexusone/structured-evaluation/evaluation"

report := evaluation.NewEvaluationReport("prd", "document.md")

// Score one category
report.AddCategory(evaluation.CategoryResult{
	Category:  "problem_definition",
	Score:     evaluation.ScorePass,
	Reasoning: "Clear problem statement with measurable goals",
})

// Record a finding with a severity and a recommendation
report.AddFinding(evaluation.Finding{
	Severity:       evaluation.SeverityMedium,
	Category:       "metrics",
	Title:          "Missing baseline metrics",
	Recommendation: "Add current baseline measurements",
})

report.Finalize("sevaluation check document.md")
```

For deterministic checks with pass/fail status:
```go
import "github.com/plexusone/structured-evaluation/summary"

report := summary.NewSummaryReport("my-service", "v1.0.0", "Release Validation")

report.AddTeam(summary.TeamSection{
	ID:   "qa",
	Name: "Quality Assurance",
	Tasks: []summary.TaskResult{
		{ID: "unit-tests", Status: summary.StatusGo, Detail: "Coverage: 92%"},
		{ID: "e2e-tests", Status: summary.StatusWarn, Detail: "2 flaky tests"},
	},
})
```

Severity levels follow InfoSec conventions:
| Severity | Icon | Blocking | Description |
|---|---|---|---|
| Critical | 🔴 | Yes | Must fix before approval |
| High | 🔴 | Yes | Must fix before approval |
| Medium | 🟡 | No | Should fix, tracked |
| Low | 🟢 | No | Nice to fix |
| Info | ⚪ | No | Informational only |
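To make the Blocking column concrete, the following sketch counts blocking findings using only the rule stated in the table above (Critical and High block, everything else does not). The `Severity` type, its constants, and both helper functions are hypothetical names for this example and are not the evaluation package's API.

```go
package main

import "fmt"

// Severity mirrors the five levels in the table, ordered from least to
// most severe. Hypothetical type; not the evaluation package's own.
type Severity int

const (
	Info Severity = iota
	Low
	Medium
	High
	Critical
)

// isBlocking reports whether a finding of this severity blocks approval.
// Per the table, only Critical and High are blocking.
func isBlocking(s Severity) bool {
	return s >= High
}

// countBlocking returns how many of the given findings are blocking.
func countBlocking(findings []Severity) int {
	n := 0
	for _, s := range findings {
		if isBlocking(s) {
			n++
		}
	}
	return n
}

func main() {
	findings := []Severity{Medium, Low, High, Info}
	if n := countBlocking(findings); n > 0 {
		fmt.Printf("blocked: %d blocking finding(s)\n", n)
		return
	}
	fmt.Println("approved: no blocking findings")
}
```

Ordering the constants by severity lets the blocking rule be a single comparison; a lookup table would work equally well if the levels were not ordered.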
Default criteria (zero blocking findings, all categories passing):
```go
criteria := evaluation.DefaultPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: -1 (unlimited), RequireAllPass: false

strict := evaluation.StrictPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: 3, RequireAllPass: true
```

```bash
# Install
go install github.com/plexusone/structured-evaluation/cmd/sevaluation@latest

# Render reports
sevaluation render report.json --format=detailed
sevaluation render report.json --format=terminal  # ANSI colors + UTF-8 icons
sevaluation render report.json --format=markdown  # Markdown output
sevaluation render report.json --format=box
sevaluation render report.json --format=json

# Check pass/fail (exit code 0/1)
sevaluation check report.json

# Validate structure
sevaluation validate report.json

# Generate JSON Schema
sevaluation schema generate -o ./schema/
```

For multi-agent workflows with dependencies:
```go
import "github.com/plexusone/structured-evaluation/combine"

results := []combine.AgentResult{
	{TeamID: "qa", Tasks: qaTasks},
	{TeamID: "security", Tasks: secTasks, DependsOn: []string{"qa"}},
	{TeamID: "release", Tasks: relTasks, DependsOn: []string{"qa", "security"}},
}

report := combine.AggregateResults(results, "my-project", "v1.0.0", "Release")
// Teams are topologically sorted: qa → security → release
```

Schemas are embedded for runtime validation:
```go
import "github.com/plexusone/structured-evaluation/schema"

evalSchema := schema.EvaluationSchemaJSON
summarySchema := schema.SummarySchemaJSON
```

Define explicit criteria for consistent categorical evaluations:
```go
rubric := evaluation.NewRubric("quality", "Output quality").
	WithPassCriteria("Meets all requirements, no significant issues").
	WithPartialCriteria("Meets most requirements, minor issues").
	WithFailCriteria("Missing key requirements or major issues")

// Use default PRD rubric
rubricSet := evaluation.DefaultPRDRubricSet()
```

Track LLM judge configuration for reproducibility:
```go
judge := evaluation.NewJudgeMetadata("claude-3-opus").
	WithProvider("anthropic").
	WithPrompt("prd-eval-v1", "1.0").
	WithTemperature(0.0).
	WithTokenUsage(1500, 800)

report.SetJudge(judge)
```

Compare two outputs directly instead of scoring each on an absolute scale:
```go
comparison := evaluation.NewPairwiseComparison(input, outputA, outputB)
comparison.SetWinner(evaluation.WinnerA, "A is more accurate", 0.9)

// Aggregate multiple comparisons
result := evaluation.ComputePairwiseResult(comparisons)
// result.WinRateA, result.OverallWinner
```

Combine evaluations from multiple judges:
```go
result := evaluation.AggregateEvaluations(evaluations, evaluation.AggregationMajority)
// Methods: AggregationMajority, AggregationConservative, AggregationOptimistic

// result.Agreement - inter-judge agreement (0-1)
// result.Disagreements - categories with significant disagreement
// result.ConsolidatedDecision - final aggregated decision
```

Use 1-5 numeric scales for human comparison studies:
```go
// Create a Likert-scale category
cat := evaluation.NewCategory("quality", "Content Quality", "Overall quality").
	WithLikert5(evaluation.StandardLikert5Anchors())

// Record a Likert score (automatically maps to categorical)
result := evaluation.NewCategoryResultFromLikert("quality", 4, config, "Good quality")
// result.Score = ScorePass, result.NumericScore = 4.0

// Or record both categorical and numeric
combined := evaluation.NewCategoryResultWithNumeric("quality", evaluation.ScorePass, 4.5, "reasoning")
```

Compare LLM evaluations with human ground truth:
```go
// Compute IRR metrics
metrics := evaluation.ComputeIRRFromResults(humanResults, llmResults)

fmt.Printf("Exact Agreement: %.1f%%\n", metrics.ExactAgreement*100)
fmt.Printf("Adjacent Agreement: %.1f%%\n", metrics.AdjacentAgreement*100)
fmt.Printf("Pearson r: %.3f\n", metrics.PearsonCorrelation)

// Categorical agreement with confusion matrix
agreement := evaluation.ComputeCategoricalAgreement(humanResults, llmResults)
```

Export evaluations to Opik, Phoenix, or Langfuse:
```go
import "github.com/plexusone/omniobserve/integrations/sevaluation"

// Export to observability platform
err := sevaluation.Export(ctx, provider, traceID, report)
```

Designed to work with:

- github.com/plexusone/omniobserve - LLM observability (Opik, Phoenix, Langfuse)
- github.com/grokify/structured-requirements - PRD evaluation templates
- github.com/plexusone/multi-agent-spec - Agent coordination
- github.com/grokify/structured-changelog - Release validation
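For readers calibrating an LLM judge against human raters, the three statistics printed in the inter-rater reliability example (exact agreement, adjacent agreement, Pearson r) can be illustrated without the library. The sketch below is a standard-library approximation and not the package's implementation. The `agreement` type and `computeAgreement` function are hypothetical, and it assumes that "adjacent" means scores within one point of each other, exact matches included; the library's definition may differ.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// agreement holds three common agreement statistics for paired ratings.
// Hypothetical type for illustration; the library's result type may differ.
type agreement struct {
	Exact    float64 // fraction of pairs with identical scores
	Adjacent float64 // fraction of pairs differing by at most 1 (assumption)
	Pearson  float64 // Pearson correlation; NaN if either side is constant
}

// computeAgreement compares paired integer ratings (e.g. Likert 1-5).
func computeAgreement(human, llm []int) (agreement, error) {
	if len(human) != len(llm) || len(human) == 0 {
		return agreement{}, errors.New("need two non-empty slices of equal length")
	}
	n := float64(len(human))

	var exact, adjacent, sumH, sumL float64
	for i := range human {
		diff := human[i] - llm[i]
		if diff < 0 {
			diff = -diff
		}
		if diff == 0 {
			exact++
		}
		if diff <= 1 {
			adjacent++
		}
		sumH += float64(human[i])
		sumL += float64(llm[i])
	}

	// Pearson r = covariance / (stddev_h * stddev_l); the 1/n factors cancel.
	meanH, meanL := sumH/n, sumL/n
	var cov, varH, varL float64
	for i := range human {
		dh := float64(human[i]) - meanH
		dl := float64(llm[i]) - meanL
		cov += dh * dl
		varH += dh * dh
		varL += dl * dl
	}
	r := math.NaN()
	if varH > 0 && varL > 0 {
		r = cov / math.Sqrt(varH*varL)
	}

	return agreement{Exact: exact / n, Adjacent: adjacent / n, Pearson: r}, nil
}

func main() {
	human := []int{5, 4, 3, 2, 4}
	llm := []int{5, 3, 3, 1, 2}

	a, err := computeAgreement(human, llm)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Exact Agreement: %.1f%%\n", a.Exact*100)       // 40.0%
	fmt.Printf("Adjacent Agreement: %.1f%%\n", a.Adjacent*100) // 80.0%
	fmt.Printf("Pearson r: %.3f\n", a.Pearson)
}
```

Exact and adjacent agreement are easy to interpret but ignore the direction of disagreement, while Pearson r captures whether the judge ranks items the same way as humans even when it is consistently harsher or more lenient.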
MIT License - see LICENSE for details.