
Themis

Themis is a benchmark-first evaluation framework for LLM systems.

The public workflow is intentionally small:

  • author one ProjectSpec
  • author one BenchmarkSpec
  • register engines, parsers, metrics, judges, and hooks in PluginRegistry
  • run with Orchestrator
  • inspect a BenchmarkResult
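The five steps above can be sketched end to end. Themis's real signatures are not shown on this page, so the sketch below uses hypothetical minimal stand-ins for the public types; only the class names and the order of steps come from the list above, and every field and method body beyond that is illustrative.

```python
from dataclasses import dataclass, field

# Hypothetical minimal stand-ins for the Themis public types.
# Field names beyond those mentioned in the docs are illustrative.

@dataclass
class ProjectSpec:
    name: str

@dataclass
class BenchmarkSpec:
    benchmark_id: str
    dimensions: dict = field(default_factory=dict)

class PluginRegistry:
    """Maps (kind, name) -> plugin for engines, parsers, metrics, judges, hooks."""
    def __init__(self):
        self._plugins = {}

    def register(self, kind, name, plugin):
        self._plugins[(kind, name)] = plugin

    def get(self, kind, name):
        return self._plugins[(kind, name)]

@dataclass
class BenchmarkResult:
    benchmark_id: str
    rows: list

class Orchestrator:
    def __init__(self, project, registry):
        self.project = project
        self.registry = registry

    def run(self, spec):
        # Real Themis plans trials, then generates, parses, and scores;
        # this sketch just threads the spec through one engine call.
        engine = self.registry.get("engine", "echo")
        rows = [{"output": engine("hello")}]
        return BenchmarkResult(benchmark_id=spec.benchmark_id, rows=rows)

# The workflow, in order: specs, registry, orchestrator, result.
project = ProjectSpec(name="demo")
spec = BenchmarkSpec(benchmark_id="qa-v1")
registry = PluginRegistry()
registry.register("engine", "echo", lambda prompt: prompt.upper())
result = Orchestrator(project, registry).run(spec)
```

The point of the sketch is the shape of the surface, not the internals: authoring stays in the two specs, extensibility stays in the registry, and execution stays behind `Orchestrator`.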
```mermaid
flowchart LR
    A["BenchmarkSpec"] --> B["compile_benchmark(...)"]
    B --> C["Trial planning"]
    C --> D["Generation / Parse / Score"]
    D --> E["SQLite projections"]
    E --> F["BenchmarkResult"]
```

What Changed

  • Benchmarks are now first-class. slice_id, prompt_variant_id, and benchmark dimensions are persisted and queryable.
  • Dataset access is query-aware through DatasetProvider.scan(slice_spec, query).
  • Parse pipelines are public authoring concepts, not metric-local hacks.
  • Reporting is aggregation-first through BenchmarkResult.aggregate(...) and paired_compare(...).
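To make "aggregation-first" concrete, here is a stand-in sketch of the reporting surface. The row fields mirror the persisted dimensions named above (`slice_id`, `prompt_variant_id`), but the method signatures, the pairing key, and the mean-based statistics are assumptions of this sketch, not Themis's actual API.

```python
from collections import defaultdict
from statistics import mean

class BenchmarkResult:
    """Illustrative stand-in: rows are dicts keyed by the persisted
    benchmark dimensions (slice_id, prompt_variant_id) plus a score."""

    def __init__(self, rows):
        self.rows = rows

    def aggregate(self, group_by):
        """Mean score per value of one dimension."""
        groups = defaultdict(list)
        for row in self.rows:
            groups[row[group_by]].append(row["score"])
        return {key: mean(scores) for key, scores in groups.items()}

    def paired_compare(self, dim, a, b):
        """Mean per-slice score delta (a - b), paired on slice_id."""
        side = {a: {}, b: {}}
        for row in self.rows:
            if row[dim] in side:
                side[row[dim]][row["slice_id"]] = row["score"]
        shared = side[a].keys() & side[b].keys()
        return mean(side[a][k] - side[b][k] for k in shared)

rows = [
    {"slice_id": "s1", "prompt_variant_id": "v1", "score": 0.8},
    {"slice_id": "s1", "prompt_variant_id": "v2", "score": 0.6},
    {"slice_id": "s2", "prompt_variant_id": "v1", "score": 0.9},
    {"slice_id": "s2", "prompt_variant_id": "v2", "score": 0.7},
]
result = BenchmarkResult(rows)
by_variant = result.aggregate(group_by="prompt_variant_id")
delta = result.paired_compare("prompt_variant_id", "v1", "v2")
```

Pairing on `slice_id` before differencing is what distinguishes `paired_compare(...)` from simply subtracting two aggregates: each slice is compared against itself, so between-slice variance cancels out of the comparison.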

Start Here