Architecture¶

Themis compiles one public benchmark model into a private execution plan, runs that plan, then exposes projection-backed result APIs.

flowchart LR
    A["ProjectSpec"] --> C["Orchestrator"]
    B["BenchmarkSpec"] --> C
    C --> D["compile_benchmark(...)"]
    D --> E["TrialPlanner"]
    E --> F["Generation"]
    F --> G["Parse pipelines"]
    G --> H["Scores"]
    H --> I["Events + projections"]
    I --> J["BenchmarkResult"]
    I --> K["themis-quickcheck"]

Design Consequences¶

benchmark semantics are persisted, not reconstructed from task_id
slice-level prompt applicability is explicit, not a blind cross product
dataset providers own query pushdown
parse pipelines are separate from scoring
aggregation is based on benchmark fields like slice_id, prompt_variant_id, and dimensions