
Themis

Themis is a benchmark-first evaluation framework for LLM systems.

The public workflow is intentionally small:

  • author one ProjectSpec
  • author one BenchmarkSpec
  • register engines, parsers, metrics, judges, and hooks in PluginRegistry
  • run with Orchestrator
  • inspect a BenchmarkResult
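The five steps above can be sketched end to end. Themis's real signatures are not shown on this page, so the sketch below uses hypothetical minimal stand-ins for the public types; only the class names and the order of steps come from the list above, and every field and method body beyond that is illustrative.

```python
from dataclasses import dataclass, field

# Hypothetical minimal stand-ins for the Themis public types.
# Field names beyond those mentioned in the docs are illustrative.

@dataclass
class ProjectSpec:
    name: str

@dataclass
class BenchmarkSpec:
    benchmark_id: str
    dimensions: dict = field(default_factory=dict)

class PluginRegistry:
    """Maps (kind, name) -> plugin for engines, parsers, metrics, judges, hooks."""
    def __init__(self):
        self._plugins = {}

    def register(self, kind, name, plugin):
        self._plugins[(kind, name)] = plugin

    def get(self, kind, name):
        return self._plugins[(kind, name)]

@dataclass
class BenchmarkResult:
    benchmark_id: str
    rows: list

class Orchestrator:
    def __init__(self, project, registry):
        self.project = project
        self.registry = registry

    def run(self, spec):
        # Real Themis plans trials, then generates, parses, and scores;
        # this sketch just threads the spec through one engine call.
        engine = self.registry.get("engine", "echo")
        rows = [{"output": engine("hello")}]
        return BenchmarkResult(benchmark_id=spec.benchmark_id, rows=rows)

# The workflow, in order: specs, registry, orchestrator, result.
project = ProjectSpec(name="demo")
spec = BenchmarkSpec(benchmark_id="qa-v1")
registry = PluginRegistry()
registry.register("engine", "echo", lambda prompt: prompt.upper())
result = Orchestrator(project, registry).run(spec)
```

The point of the sketch is the shape of the surface, not the internals: authoring stays in the two specs, extensibility stays in the registry, and execution stays behind `Orchestrator`.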
```mermaid
flowchart LR
    A["BenchmarkSpec"] --> B["compile_benchmark(...)"]
    B --> C["Trial planning"]
    C --> D["Generation / Parse / Score"]
    D --> E["SQLite projections"]
    E --> F["BenchmarkResult"]
```

What Changed

  • Benchmarks are now first-class. slice_id, prompt_variant_id, and benchmark dimensions are persisted and queryable.
  • Dataset access is query-aware through DatasetProvider.scan(slice_spec, query).
  • Parse pipelines are public authoring concepts, not metric-local hacks.
  • Reporting is aggregation-first through BenchmarkResult.aggregate(...) and paired_compare(...).
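To make "aggregation-first" concrete, here is a stand-in sketch of the reporting surface. The row fields mirror the persisted dimensions named above (`slice_id`, `prompt_variant_id`), but the method signatures, the pairing key, and the mean-based statistics are assumptions of this sketch, not Themis's actual API.

```python
from collections import defaultdict
from statistics import mean

class BenchmarkResult:
    """Illustrative stand-in: rows are dicts keyed by the persisted
    benchmark dimensions (slice_id, prompt_variant_id) plus a score."""

    def __init__(self, rows):
        self.rows = rows

    def aggregate(self, group_by):
        """Mean score per value of one dimension."""
        groups = defaultdict(list)
        for row in self.rows:
            groups[row[group_by]].append(row["score"])
        return {key: mean(scores) for key, scores in groups.items()}

    def paired_compare(self, dim, a, b):
        """Mean per-slice score delta (a - b), paired on slice_id."""
        side = {a: {}, b: {}}
        for row in self.rows:
            if row[dim] in side:
                side[row[dim]][row["slice_id"]] = row["score"]
        shared = side[a].keys() & side[b].keys()
        return mean(side[a][k] - side[b][k] for k in shared)

rows = [
    {"slice_id": "s1", "prompt_variant_id": "v1", "score": 0.8},
    {"slice_id": "s1", "prompt_variant_id": "v2", "score": 0.6},
    {"slice_id": "s2", "prompt_variant_id": "v1", "score": 0.9},
    {"slice_id": "s2", "prompt_variant_id": "v2", "score": 0.7},
]
result = BenchmarkResult(rows)
by_variant = result.aggregate(group_by="prompt_variant_id")
delta = result.paired_compare("prompt_variant_id", "v1", "v2")
```

Pairing on `slice_id` before differencing is what distinguishes `paired_compare(...)` from simply subtracting two aggregates: each slice is compared against itself, so between-slice variance cancels out of the comparison.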

Start Here