Specs and Records¶

Public Spec Model¶

Type	Purpose
`ProjectSpec`	Storage root, seed, and execution policy
`BenchmarkSpec`	Top-level benchmark definition
`SliceSpec`	One dataset slice, prompt scope, dimensions, parses, and scores
`DatasetQuerySpec`	Subset, item IDs, metadata filters, and sampling hints
`PromptVariantSpec`	Prompt family plus message template
`ParseSpec`	Named parser pipeline
`ScoreSpec`	Named scoring overlay

The persisted records are still trial-shaped internally, but the public analysis surface speaks benchmark language:

BenchmarkResult.iter_trial_summaries() returns rows with benchmark_id, slice_id, prompt_variant_id, and dimensions
BenchmarkResult.aggregate(...) groups score rows by benchmark semantics
RecordTimelineView exposes concrete inference, parse, score, and judge data for one candidate

Put semantics in the benchmark spec, not in payload conventions: