# Themis
Themis is a benchmark-first evaluation framework for LLM systems.
The public workflow is intentionally small:

- author one `ProjectSpec`
- author one `BenchmarkSpec`
- register engines, parsers, metrics, judges, and hooks in `PluginRegistry`
- run with `Orchestrator`
- inspect a `BenchmarkResult`
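The steps above can be sketched end-to-end with toy stand-ins. Everything below is illustrative: the constructor arguments, the `register`/`get`/`run` signatures, and the `exact_match` metric are assumptions for the sketch, not the real API (see the API Reference for actual types).

```python
from dataclasses import dataclass

# Illustrative stand-ins for the five public concepts; real signatures
# live in the API Reference.
@dataclass
class ProjectSpec:
    name: str

@dataclass
class BenchmarkSpec:
    benchmark_id: str
    slices: list

@dataclass
class BenchmarkResult:
    rows: list  # one dict per scored trial

class PluginRegistry:
    def __init__(self):
        self._plugins = {}

    def register(self, kind, name, obj):
        self._plugins[(kind, name)] = obj

    def get(self, kind, name):
        return self._plugins[(kind, name)]

class Orchestrator:
    def __init__(self, project, registry):
        self.project = project
        self.registry = registry

    def run(self, spec):
        metric = self.registry.get("metric", "exact_match")
        rows = []
        for slice_id in spec.slices:
            # A real run would call an engine and a parse pipeline here;
            # this toy scores a fixed prediction against itself.
            rows.append({"slice_id": slice_id, "score": metric("a", "a")})
        return BenchmarkResult(rows=rows)

project = ProjectSpec(name="demo")
registry = PluginRegistry()
registry.register("metric", "exact_match",
                  lambda pred, gold: 1.0 if pred == gold else 0.0)
spec = BenchmarkSpec(benchmark_id="toy", slices=["easy", "hard"])
result = Orchestrator(project, registry).run(spec)
print(len(result.rows))
```

The point of the sketch is the shape of the flow: specs are declarative, plugins are looked up by kind and name, and the orchestrator is the only thing that executes.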
```mermaid
flowchart LR
A["BenchmarkSpec"] --> B["compile_benchmark(...)"]
B --> C["Trial planning"]
C --> D["Generation / Parse / Score"]
D --> E["SQLite projections"]
E --> F["BenchmarkResult"]
```
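One way to picture the SQLite projection stage: scored trials land as rows keyed by the persisted benchmark dimensions, so they remain queryable with plain SQL. The table name and column layout below are illustrative assumptions, not the actual schema:

```python
import sqlite3

# Hypothetical projection: one row per scored trial, keyed by the
# benchmark dimensions the framework persists.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trial_scores (slice_id TEXT, prompt_variant_id TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO trial_scores VALUES (?, ?, ?)",
    [
        ("easy", "v1", 1.0),
        ("easy", "v2", 0.5),
        ("hard", "v1", 0.0),
    ],
)
# Because dimensions are columns, any of them can drive a GROUP BY:
mean_by_slice = conn.execute(
    "SELECT slice_id, AVG(score) FROM trial_scores "
    "GROUP BY slice_id ORDER BY slice_id"
).fetchall()
print(mean_by_slice)  # [('easy', 0.75), ('hard', 0.0)]
```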
## What Changed
- Benchmarks are now first-class: `slice_id`, `prompt_variant_id`, and benchmark dimensions are persisted and queryable.
- Dataset access is query-aware through `DatasetProvider.scan(slice_spec, query)`.
- Parse pipelines are public authoring concepts, not metric-local hacks.
- Reporting is aggregation-first through `BenchmarkResult.aggregate(...)` and `paired_compare(...)`.
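The aggregation-first idea can be illustrated with toy functions over scored-trial rows. These are stand-ins written for this sketch, not the real `BenchmarkResult.aggregate(...)` or `paired_compare(...)` signatures; the row shape (`slice_id`, `system`, `score`) is also an assumption:

```python
from statistics import mean

# Toy scored-trial rows: two systems scored on two slices.
rows = [
    {"slice_id": "easy", "system": "A", "score": 0.9},
    {"slice_id": "easy", "system": "B", "score": 0.7},
    {"slice_id": "hard", "system": "A", "score": 0.4},
    {"slice_id": "hard", "system": "B", "score": 0.5},
]

def aggregate(rows, by):
    """Mean score per value of the `by` dimension."""
    groups = {}
    for r in rows:
        groups.setdefault(r[by], []).append(r["score"])
    return {k: mean(v) for k, v in groups.items()}

def paired_compare(rows, a, b, key="slice_id"):
    """Per-slice score difference between two systems (a minus b)."""
    scores = {(r[key], r["system"]): r["score"] for r in rows}
    slices = sorted({r[key] for r in rows})
    return {s: scores[(s, a)] - scores[(s, b)] for s in slices}

print(aggregate(rows, by="system"))
print(paired_compare(rows, "A", "B"))
```

The design point is that comparisons are paired per slice rather than computed on pooled means, so a system that wins on easy slices and loses on hard ones shows up as such instead of averaging out.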
## Start Here
- New user: Quick Start
- Need the mental model: Public Surface
- Want worked scripts: Tutorials
- Want task-oriented recipes: Guides
- Need exact types and signatures: API Reference