# Specs and Records

## Public Spec Model
| Type | Purpose |
|---|---|
| `ProjectSpec` | Storage root, seed, and execution policy |
| `BenchmarkSpec` | Top-level benchmark definition |
| `SliceSpec` | One dataset slice: prompt scope, dimensions, parses, and scores |
| `DatasetQuerySpec` | Subset, item IDs, metadata filters, and sampling hints |
| `PromptVariantSpec` | Prompt family plus message template |
| `ParseSpec` | Named parser pipeline |
| `ScoreSpec` | Named scoring overlay |
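The composition of these specs can be pictured with plain dataclasses. This is a minimal sketch for orientation only: the field names and nesting are assumptions, not the library's actual signatures.

```python
from dataclasses import dataclass, field

# Hypothetical shapes for illustration; field names are assumptions,
# not the library's real API.
@dataclass
class PromptVariantSpec:
    prompt_variant_id: str   # member of a prompt family
    template: str            # message template

@dataclass
class SliceSpec:
    slice_id: str
    prompt_variant_ids: list[str]                     # prompt scope
    dimensions: dict[str, str] = field(default_factory=dict)

@dataclass
class BenchmarkSpec:
    benchmark_id: str
    slices: list[SliceSpec] = field(default_factory=list)

# A benchmark owns slices; each slice declares its dimensions up front.
spec = BenchmarkSpec(
    benchmark_id="qa-v1",
    slices=[SliceSpec("synthetic", ["terse"], {"source": "synthetic"})],
)
```

The point of the sketch is the direction of ownership: semantics hang off the spec tree, not off naming conventions in the data.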
## Runtime Records
The persisted records are still trial-shaped internally, but the public analysis surface speaks benchmark language:
- `BenchmarkResult.iter_trial_summaries()` returns rows with `benchmark_id`, `slice_id`, `prompt_variant_id`, and dimensions
- `BenchmarkResult.aggregate(...)` groups score rows by benchmark semantics
- `RecordTimelineView` exposes concrete inference, parse, score, and judge data for one candidate
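The grouping behavior can be mimicked with plain dicts. The row shape below is an assumption based on the fields named above, and `aggregate_mean` is a stand-in for whatever `aggregate(...)` actually does; treat it as a sketch of grouping by benchmark semantics, not as the library's implementation.

```python
from collections import defaultdict

# Assumed summary-row shape (benchmark_id / slice_id / prompt_variant_id
# plus a score); the real records may carry more fields.
rows = [
    {"benchmark_id": "qa-v1", "slice_id": "synthetic",
     "prompt_variant_id": "terse", "score": 1.0},
    {"benchmark_id": "qa-v1", "slice_id": "synthetic",
     "prompt_variant_id": "terse", "score": 0.0},
    {"benchmark_id": "qa-v1", "slice_id": "human",
     "prompt_variant_id": "terse", "score": 1.0},
]

def aggregate_mean(rows, key="slice_id"):
    """Group score rows by a benchmark-level key and average them."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["score"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

means = aggregate_mean(rows)
# means == {"synthetic": 0.5, "human": 1.0}
```

Because the rows already carry `slice_id` and `prompt_variant_id`, analysis code never needs to parse identifiers to recover the grouping.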
## Guiding Rule
Put semantics in the benchmark spec, not in payload conventions:
- use `dimensions={"source": "synthetic"}` instead of encoding source into IDs
- use `prompt_variant_ids=[...]` instead of manual prompt compatibility logic
- use `ParseSpec` instead of reparsing model text inside a metric
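The first rule can be shown concretely. The row shape here is hypothetical, but the contrast is the point: filtering on an ID prefix couples analysis code to a naming convention, while filtering on a declared dimension reads the semantics the spec already states.

```python
# Hypothetical result rows: same items, semantics available two ways.
rows = [
    {"item_id": "synthetic-001", "dimensions": {"source": "synthetic"}},
    {"item_id": "human-001", "dimensions": {"source": "human"}},
]

# Fragile: semantics smuggled into the ID string.
by_prefix = [r for r in rows if r["item_id"].startswith("synthetic-")]

# Robust: semantics declared as a dimension in the spec.
by_dimension = [r for r in rows
                if r["dimensions"].get("source") == "synthetic"]

assert by_prefix == by_dimension  # same rows, but only one survives renaming
```

Renaming an item breaks the prefix filter silently; the dimension filter keeps working because the label lives in structured data, not in the ID.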