Architecture

Themis compiles a single public benchmark model into a private execution plan, runs that plan, and then exposes result APIs backed by projections.

flowchart LR
    A["ProjectSpec"] --> C["Orchestrator"]
    B["BenchmarkSpec"] --> C
    C --> D["compile_benchmark(...)"]
    D --> E["TrialPlanner"]
    E --> F["Generation"]
    F --> G["Parse pipelines"]
    G --> H["Scores"]
    H --> I["Events + projections"]
    I --> J["BenchmarkResult"]
    I --> K["themis-quickcheck"]
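
The stages in the flowchart can be read as a chain of small functions that append to an event log, with results derived from that log by a projection. The following is a minimal sketch of that shape, not the Themis implementation: every name in it (`Trial`, `Event`, `plan_trials`, `generate`, `parse`, `score`, `run`, `project_scores`) and the dictionary layout of the compiled benchmark are assumptions made for illustration.

```python
from dataclasses import dataclass

# All names here are hypothetical stand-ins, not the Themis API.


@dataclass(frozen=True)
class Trial:
    slice_id: str
    prompt_variant_id: str
    item: str


@dataclass(frozen=True)
class Event:
    kind: str        # "generation", "parse", or "score"
    trial: Trial
    payload: object


def plan_trials(benchmark: dict) -> list[Trial]:
    """Expand a compiled benchmark (assumed dict layout) into concrete trials."""
    return [
        Trial(s["slice_id"], variant, item)
        for s in benchmark["slices"]
        for variant in s["prompt_variants"]
        for item in s["items"]
    ]


def generate(trial: Trial) -> str:
    """Stand-in for a model call: returns a raw completion."""
    return f"Answer: {trial.item.upper()}"


def parse(raw: str) -> str:
    """Parse step: extract the answer from raw output, independent of scoring."""
    return raw.removeprefix("Answer: ").strip()


def score(parsed: str, trial: Trial) -> float:
    """Scoring step: consumes parsed output only."""
    return 1.0 if parsed == trial.item.upper() else 0.0


def run(benchmark: dict) -> list[Event]:
    """Execute every planned trial and record one event per stage."""
    events: list[Event] = []
    for trial in plan_trials(benchmark):
        raw = generate(trial)
        events.append(Event("generation", trial, raw))
        parsed = parse(raw)
        events.append(Event("parse", trial, parsed))
        events.append(Event("score", trial, score(parsed, trial)))
    return events


def project_scores(events: list[Event]) -> dict:
    """Projection: fold the event log into a score lookup keyed by trial fields."""
    return {
        (e.trial.slice_id, e.trial.prompt_variant_id, e.trial.item): e.payload
        for e in events
        if e.kind == "score"
    }
```

Because results are read from the projection rather than from the run loop, a second consumer (the role `themis-quickcheck` plays in the diagram) could fold the same event log into a different view without re-running any trial.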

Design Consequences

  • benchmark semantics are persisted, not reconstructed from task_id
  • prompt applicability is declared explicitly per slice, not derived from a blind cross product of slices and prompts
  • dataset providers own query pushdown
  • parse pipelines are separate from scoring
  • aggregation is based on benchmark fields such as slice_id, prompt_variant_id, and dimensions
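
Two of these consequences are easy to illustrate: explicit prompt applicability and aggregation over benchmark fields. The sketch below is only an assumption about how this could look. The `SLICES` mapping, the row layout, and the helpers `planned_pairs` and `aggregate` are hypothetical; only the field names `slice_id` and `prompt_variant_id` come from the list above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical declaration: each slice lists the prompt variants that apply to it.
SLICES = {
    "math": ["zero_shot", "cot"],
    "trivia": ["zero_shot"],  # "cot" is deliberately not applicable here
}


def planned_pairs(slices: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return only the declared (slice_id, prompt_variant_id) pairs."""
    return [
        (slice_id, variant)
        for slice_id, variants in slices.items()
        for variant in variants
    ]


def aggregate(rows: list[dict], keys: tuple[str, ...]) -> dict[tuple, float]:
    """Mean score per group, where groups are defined by persisted fields."""
    groups: dict[tuple, list[float]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["score"])
    return {group: mean(scores) for group, scores in groups.items()}


# Hypothetical result rows that carry their benchmark fields directly,
# so nothing has to be parsed back out of a task_id string.
ROWS = [
    {"slice_id": "math", "prompt_variant_id": "zero_shot", "score": 0.0},
    {"slice_id": "math", "prompt_variant_id": "cot", "score": 1.0},
    {"slice_id": "trivia", "prompt_variant_id": "zero_shot", "score": 1.0},
]

pairs = planned_pairs(SLICES)                 # 3 pairs, not the 4 of a full cross product
by_slice = aggregate(ROWS, ("slice_id",))     # {("math",): 0.5, ("trivia",): 1.0}
by_variant = aggregate(ROWS, ("prompt_variant_id",))
```

Grouping by a tuple of field names means the same helper can aggregate by slice, by prompt variant, or by both, without any knowledge of how identifiers were constructed.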