FAQ

Why is the public API benchmark-first now?

Because serious eval authors need first-class slices, prompt variants, parse pipelines, semantic dimensions, and benchmark-native reporting.
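To make those concepts concrete, here is a minimal sketch of what a benchmark-first spec could look like. All class and field names below are hypothetical illustrations of the listed concepts (slices, prompt variants, semantic dimensions); the real BenchmarkSpec fields may differ.

```python
from dataclasses import dataclass, field

# Hypothetical shapes for illustration only; not the library's real types.
@dataclass
class SliceSpec:
    slice_id: str
    filter_expr: str          # e.g. "split == 'dev'"

@dataclass
class PromptVariant:
    prompt_variant_id: str
    template: str

@dataclass
class BenchmarkSpec:
    name: str
    slices: list[SliceSpec] = field(default_factory=list)
    prompt_variants: list[PromptVariant] = field(default_factory=list)
    dimensions: list[str] = field(default_factory=list)  # semantic dimensions

spec = BenchmarkSpec(
    name="medqa",
    slices=[SliceSpec("dev", "split == 'dev'")],
    prompt_variants=[PromptVariant("cot", "Think step by step: {question}")],
    dimensions=["reasoning_depth"],
)
print(spec.name, len(spec.slices), spec.dimensions)
```

The point of the benchmark-first surface is that these concepts are declared up front rather than reconstructed from ad-hoc run metadata.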

Why does BenchmarkSpec compile to something private?

Planning and execution still run on a lower-level IR, but that layer is an implementation detail. The public contract is the benchmark surface.
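The public-spec-over-private-IR split can be pictured with a toy lowering step. Everything here is illustrative, assuming nothing about the real IR beyond what the answer states: the private type carries a leading underscore to signal it is not part of the contract.

```python
from dataclasses import dataclass

# Illustrative only: a public spec lowering to a private plan.
@dataclass
class BenchmarkSpec:
    name: str
    slices: tuple[str, ...]

@dataclass
class _ExecutionPlan:
    # Leading underscore: an implementation detail, not a public type.
    tasks: list[str]

def compile_spec(spec: BenchmarkSpec) -> _ExecutionPlan:
    # Toy lowering: one task per slice; the real IR is richer.
    return _ExecutionPlan(tasks=[f"{spec.name}:{s}" for s in spec.slices])

plan = compile_spec(BenchmarkSpec("medqa", ("dev", "test")))
print(plan.tasks)
```

Because only BenchmarkSpec is public, the plan representation can change between releases without breaking callers.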

What replaced the old dataset loader contract?

The old loader contract is gone; implement DatasetProvider.scan(slice_spec, query) instead.
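As a sketch of what implementing that contract might look like: the FAQ gives only the method name and argument names, so the parameter types, return type, and the toy matching semantics below are all assumptions.

```python
from typing import Iterable, Iterator, Protocol

class DatasetProvider(Protocol):
    # Hypothetical signature; only the names scan/slice_spec/query
    # come from the docs.
    def scan(self, slice_spec: str, query: str) -> Iterable[dict]: ...

class InMemoryProvider:
    def __init__(self, rows: list[dict]):
        self.rows = rows

    def scan(self, slice_spec: str, query: str) -> Iterator[dict]:
        # Toy semantics: slice_spec selects a split, query is a substring match.
        for row in self.rows:
            if row.get("split") == slice_spec and query in row.get("text", ""):
                yield row

provider = InMemoryProvider([
    {"split": "dev", "text": "chest pain"},
    {"split": "test", "text": "chest pain"},
])
print(list(provider.scan("dev", "pain")))
```

Structural typing via Protocol means any object with a compatible scan method satisfies the contract without inheriting from a base class.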

What should I do with examples/medical_reasoning_eval?

Treat it as a handoff and acceptance reference. It was intentionally not rewritten during the benchmark-first overhaul.

How do I inspect results without importing Python?

Use themis-quickcheck against the SQLite database.
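Since themis-quickcheck's flags are not documented here, the snippet below shows the equivalent direct inspection with the stock sqlite3 CLI instead. The database path and the results table name and columns are hypothetical; list the real tables with `.tables` first.

```shell
# Build a stand-in results database (hypothetical schema).
db="$(mktemp -d)/results.sqlite"
sqlite3 "$db" "CREATE TABLE results (benchmark TEXT, slice_id TEXT, score REAL);"
sqlite3 "$db" "INSERT INTO results VALUES ('medqa', 'dev', 0.81);"

# Inspect it directly -- no Python import needed.
sqlite3 "$db" "SELECT benchmark, slice_id, score FROM results;"
```

Anything sqlite3 can read, themis-quickcheck operates on the same file.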

How do I group results by benchmark semantics?

Use BenchmarkResult.aggregate(...) and include slice_id, prompt_variant_id, or dimension keys in group_by.
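To illustrate the grouping semantics, here is a toy re-implementation of that kind of aggregation. The real BenchmarkResult.aggregate signature is not shown in this FAQ; only the group_by keys (slice_id, prompt_variant_id, dimension keys) come from the answer above, and the mean-score reduction is an assumption.

```python
from collections import defaultdict
from statistics import mean

def aggregate(rows: list[dict], group_by: list[str]) -> dict[tuple, float]:
    # Group rows by the requested keys, then average scores per group.
    groups: dict[tuple, list[float]] = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in group_by)
        groups[key].append(row["score"])
    return {key: mean(scores) for key, scores in groups.items()}

rows = [
    {"slice_id": "dev", "prompt_variant_id": "cot", "score": 0.8},
    {"slice_id": "dev", "prompt_variant_id": "cot", "score": 0.6},
    {"slice_id": "test", "prompt_variant_id": "cot", "score": 0.9},
]
print(aggregate(rows, ["slice_id"]))  # one mean score per slice_id
```

Passing several keys in group_by (e.g. both slice_id and prompt_variant_id) yields one row per key combination, which is what makes the grouping benchmark-semantic rather than run-oriented.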