Python API reference¶
This page is the generated entry point into the public Python API. Use the smaller reference pages in this section when you already know the category of symbol you need.
Root exports¶
| Name | Kind | Use when | Key constraints / notes |
|---|---|---|---|
| `__version__` | Constant | You want the installed package version | Useful for docs, debugging, and release checks |
| `Experiment` | Core class | You want the main reusable experiment authoring surface | Use for config-backed or Python-authored experiments |
| `InMemoryRunStore` | Store implementation | You want ephemeral local storage | No cross-process persistence |
| `PromptSpec` | Prompt model | You want prompt instructions, prefixes, suffixes, or prompt blocks as part of experiment identity | Shared across generation and builtin judge workflows |
| `Reporter` | Reporting API | You want exports such as JSON, Markdown, CSV, or LaTeX | Works from stored projections |
| `RunEstimate` | Data model | You want planned task counts and token estimates | Informational only; not pricing |
| `RunResult` | Data model | You want the top-level execution result returned by a run | Includes status and benchmark output |
| `RunSnapshot` | Data model | You want the compiled identity and provenance artifact | Produced by compile() |
| `RunStatus` | Enum-like status model | You want run lifecycle state values | Useful in automation and inspection |
| `RunStore` | Storage protocol | You are typing against or implementing custom stores | Abstract interface rather than a concrete backend |
| `RuntimeConfig` | Config model | You want runtime tuning without changing logical identity | Covers concurrency, retries, and deferred execution paths |
| `SqliteRunStore` | Store implementation | You want the default persistent local store | Good default for real runs |
| `StatsEngine` | Analysis helper | You want statistical comparison utilities | Used in comparison and reporting flows |
| `evaluate` | Convenience function | You want the shortest synchronous Python path to a run | Best for simple scripts; call only when no event loop is already running |
| `evaluate_async` | Convenience function | You want the shortest async Python path to a run | Use in notebooks, async apps, and any environment with a running event loop |
| `export_evaluation_bundle` | Artifact helper | You want portable evaluation workflow artifacts | Best for judge-backed replay or handoff |
| `export_generation_bundle` | Artifact helper | You want portable generation artifacts | Good for external evaluation pipelines |
| `export_parse_bundle` | Artifact helper | You want portable parsed-output artifacts | Python-only today |
| `export_reduction_bundle` | Artifact helper | You want portable reduction-stage artifacts | Python-only today |
| `export_score_bundle` | Artifact helper | You want portable pure-score artifacts | Python-only today |
| `get_evaluation_execution` | Inspection helper | You want one stored workflow execution | Judge-backed metrics only; pass dataset_id or case_key when duplicate case_ids exist across datasets |
| `get_execution_state` | Inspection helper | You want stored progress and failure details | Best before resume or replay decisions |
| `get_run_snapshot` | Inspection helper | You want compiled identity and provenance details | Read-only lookup |
| `import_evaluation_bundle` | Artifact helper | You want to ingest external evaluation artifacts into a store | Match bundle shape to the target run |
| `import_generation_bundle` | Artifact helper | You want to ingest generation artifacts into a store | Enables later replay without regeneration |
| `import_parse_bundle` | Artifact helper | You want to ingest parsed-output artifacts | Python-only today |
| `import_reduction_bundle` | Artifact helper | You want to ingest reduction-stage artifacts | Python-only today |
| `import_score_bundle` | Artifact helper | You want to ingest score artifacts | Python-only today |
| `quickcheck` | Inspection helper | You want a compact run summary | Smaller surface than full reporting |
| `snapshot_report` | Reporting helper | You want a concise Python report from stored snapshot data | Lighter than Reporter |
| `sqlite_store` | Store factory helper | You want a quick SQLite store constructor | Shortcut for the persistent local backend |
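The table's guidance for `evaluate` versus `evaluate_async` depends on whether an asyncio event loop is already running. The sketch below shows one way to check that with the standard library only. `fake_evaluate` and `fake_evaluate_async` are hypothetical stand-ins, not Themis functions, and nothing here describes how Themis itself performs the check.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def fake_evaluate_async() -> str:
    """Hypothetical stand-in for an async entry point."""
    await asyncio.sleep(0)
    return "done"


def fake_evaluate() -> str:
    """Hypothetical sync wrapper: only valid when no loop is running."""
    if loop_is_running():
        raise RuntimeError(
            "event loop already running; await the async variant instead"
        )
    return asyncio.run(fake_evaluate_async())
```

In a plain script `fake_evaluate()` returns normally. Inside a notebook cell or another coroutine it raises, which is the situation where the async variant should be awaited instead.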
Generated modules¶
Root package:
themis ¶
Public package surface for Themis.
Experiment ¶
Bases: FrozenModel
Authoring model for a Themis experiment.
An experiment owns the compile-time inputs required to build a RunSnapshot
and provides sync and async helpers for running or rejudging that snapshot.
from_config (classmethod) ¶
from_config(
path: str | Path, *, overrides: list[str] | None = None
) -> Experiment
Load an experiment definition from YAML or TOML configuration.
rejudge ¶
rejudge(
*,
metric_ids: list[str] | None = None,
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Re-run workflow-backed metrics synchronously.
rejudge_async (async) ¶
rejudge_async(
*,
metric_ids: list[str] | None = None,
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Re-run workflow-backed metrics from stored upstream artifacts.
replay ¶
replay(
*,
stage: Literal["reduce", "parse", "score", "judge"],
metric_ids: list[str] | None = None,
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Replay persisted runs from a downstream stage synchronously.
replay_async (async) ¶
replay_async(
*,
stage: Literal["reduce", "parse", "score", "judge"],
metric_ids: list[str] | None = None,
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Replay persisted runs from a downstream stage.
run ¶
run(
*,
until_stage: Literal[
"generate", "reduce", "parse", "score", "judge"
] = "judge",
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Run the compiled snapshot synchronously.
run_async (async) ¶
run_async(
*,
until_stage: Literal[
"generate", "reduce", "parse", "score", "judge"
] = "judge",
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Run the compiled snapshot asynchronously.
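The `until_stage` and `stage` parameters above share one set of stage names. As a rough sketch, assuming stages execute in the order the `Literal` declares them and that both bounds are inclusive (neither is confirmed on this page), the covered stages could be computed as below. `stages_until` and `stages_from` are hypothetical helpers, not part of Themis.

```python
from typing import Literal, get_args

Stage = Literal["generate", "reduce", "parse", "score", "judge"]
STAGES: tuple[str, ...] = get_args(Stage)


def stages_until(until_stage: str = "judge") -> list[str]:
    """Stages a run would cover, assuming declaration order and an inclusive bound."""
    return list(STAGES[: STAGES.index(until_stage) + 1])


def stages_from(stage: str) -> list[str]:
    """Stages a replay would cover, assuming it redoes `stage` and everything after it."""
    if stage == "generate":
        # The replay signature above does not accept "generate".
        raise ValueError("replay accepts only downstream stages")
    return list(STAGES[STAGES.index(stage):])
```

Under those assumptions, `stages_until("parse")` gives `["generate", "reduce", "parse"]` and `stages_from("score")` gives `["score", "judge"]`.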
InMemoryRunStore ¶
Bases: ProjectionRefreshingStore
Simple in-memory store used by tests and local development.
PromptSpec ¶
Bases: HashableModel
Generic prompt instructions and structured prompt material.
Reporter ¶
Export persisted run projections in JSON, Markdown, CSV, or LaTeX.
export_json ¶
export_json(run_id: str) -> str
Export all major persisted projections for a run as formatted JSON.
export_latex ¶
export_latex(run_id: str) -> str
Export benchmark score rows as a compact LaTeX table.
export_markdown ¶
export_markdown(run_id: str) -> str
Export a human-readable Markdown summary for a persisted run.
export_score_table ¶
export_score_table(
run_id: str,
) -> list[dict[str, JSONValue]]
Return benchmark score rows in a normalized table structure.
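Because `export_score_table` returns a list of plain dictionaries, the rows can be passed straight to standard-library tooling. The sketch below writes such rows to CSV with `csv.DictWriter`. The column names in `example_rows` are invented for illustration, since this page does not specify the real keys, and `rows_to_csv` is a hypothetical helper rather than a `Reporter` method.

```python
import csv
import io


def rows_to_csv(rows: list[dict[str, object]]) -> str:
    """Serialize table rows to CSV, using the union of keys as the header."""
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows; the real column names are not specified on this page.
example_rows = [
    {"metric_id": "exact_match", "mean": 0.75},
    {"metric_id": "judge_quality", "mean": 0.5, "count": 4},
]
csv_text = rows_to_csv(example_rows)
```

Rows that lack a column get an empty cell, so tables with uneven keys still serialize.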
RunEstimate ¶
RunResult ¶
RunSnapshot ¶
RunStatus ¶
Bases: StrEnum
User-facing run status values.
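Since `RunStatus` derives from `StrEnum`, its members behave as ordinary strings, which is convenient in automation that compares or serializes status values. The sketch below uses a hypothetical `DemoStatus` with invented members, because the real `RunStatus` values are not listed on this page.

```python
import json
import sys
from enum import Enum

if sys.version_info >= (3, 11):
    from enum import StrEnum
else:
    class StrEnum(str, Enum):
        """Minimal fallback for Python versions before 3.11."""

        def __str__(self) -> str:
            return str(self.value)


class DemoStatus(StrEnum):
    """Hypothetical members; not the real RunStatus values."""

    PENDING = "pending"
    COMPLETED = "completed"


def is_finished(status: str) -> bool:
    """Accept either an enum member or its raw string form."""
    return DemoStatus(status) is DemoStatus.COMPLETED


# Members compare equal to plain strings and serialize as plain strings.
payload = json.dumps({"status": DemoStatus.PENDING})
```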
RunStore ¶
Bases: Protocol
Persistence contract used by Themis runtime components.
RuntimeConfig ¶
SqliteRunStore ¶
Bases: ProjectionRefreshingStore
Small SQLite-backed run store.
evaluate_async (async) ¶
evaluate_async(
*,
model: object,
data: Dataset
| Sequence[Dataset]
| Sequence[Mapping[str, Any]],
metric: object | Sequence[object],
parser: object | Sequence[object] | None = None,
judge: object | Sequence[object] | None = None,
samples: int = 1,
reducer: object | None = None,
storage: StorageConfig | None = None,
runtime: RuntimeConfig | None = None,
seeds: list[int] | None = None,
workflow_overrides: dict[str, object] | None = None,
judge_config: dict[str, object] | None = None,
environment_metadata: dict[str, str] | None = None,
themis_version: str | None = None,
python_version: str = "3.12",
platform: str = "unknown",
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
) -> RunResult
Compile and run a Themis experiment asynchronously through the Layer 1 API.
export_evaluation_bundle ¶
export_evaluation_bundle(
store: RunStore, run_id: str
) -> EvaluationBundle
Export stored evaluation artifacts into a portable bundle.
export_generation_bundle ¶
export_generation_bundle(
store: RunStore, run_id: str
) -> GenerationBundle
Export stored generation artifacts into a portable bundle.
export_parse_bundle ¶
export_parse_bundle(
store: RunStore, run_id: str
) -> ParseBundle
Export stored parse artifacts into a portable bundle.
export_reduction_bundle ¶
export_reduction_bundle(
store: RunStore, run_id: str
) -> ReductionBundle
Export stored reduction artifacts into a portable bundle.
export_score_bundle ¶
export_score_bundle(
store: RunStore, run_id: str
) -> ScoreBundle
Export stored score artifacts into a portable bundle.
get_evaluation_execution ¶
get_evaluation_execution(
store: RunStore,
run_id: str,
case_id: str,
metric_id: str,
*,
dataset_id: str | None = None,
case_key: str | None = None,
) -> EvaluationExecution | None
Return one stored workflow execution for a case and metric.
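The optional `dataset_id` and `case_key` arguments exist because the same `case_id` can appear in more than one dataset. The in-memory sketch below illustrates why the extra key matters. `StoredExecution` and `find_execution` are hypothetical, and raising on ambiguity is this sketch's own choice: the page does not say how Themis handles an ambiguous lookup.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StoredExecution:
    """Hypothetical record shape used only for this sketch."""

    dataset_id: str
    case_id: str
    metric_id: str
    verdict: str


def find_execution(
    records: Sequence[StoredExecution],
    case_id: str,
    metric_id: str,
    *,
    dataset_id: Optional[str] = None,
) -> Optional[StoredExecution]:
    """Return the single match, None when absent, or raise when ambiguous."""
    matches = [
        r
        for r in records
        if r.case_id == case_id
        and r.metric_id == metric_id
        and (dataset_id is None or r.dataset_id == dataset_id)
    ]
    if len(matches) > 1:
        raise LookupError("case_id is ambiguous; pass dataset_id to disambiguate")
    return matches[0] if matches else None


records = [
    StoredExecution("train", "case-1", "quality", "pass"),
    StoredExecution("holdout", "case-1", "quality", "fail"),
]
```

Here `find_execution(records, "case-1", "quality")` is ambiguous, while adding `dataset_id="holdout"` selects exactly one record.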
get_execution_state ¶
get_execution_state(
store: RunStore, run_id: str
) -> ExecutionState
Return the persisted execution state for a run.
get_run_snapshot ¶
get_run_snapshot(
store: RunStore, run_id: str
) -> RunSnapshot
Return the persisted snapshot for a run.
import_evaluation_bundle ¶
import_evaluation_bundle(
store: RunStore, bundle: EvaluationBundle
) -> None
Import evaluation artifacts from a bundle into a store.
import_generation_bundle ¶
import_generation_bundle(
store: RunStore, bundle: GenerationBundle
) -> None
Import generation artifacts from a bundle into a store.
import_parse_bundle ¶
import_parse_bundle(
store: RunStore, bundle: ParseBundle
) -> None
Import parse artifacts from a bundle into a store.
import_reduction_bundle ¶
import_reduction_bundle(
store: RunStore, bundle: ReductionBundle
) -> None
Import reduction artifacts from a bundle into a store.
import_score_bundle ¶
import_score_bundle(
store: RunStore, bundle: ScoreBundle
) -> None
Import score artifacts from a bundle into a store.
snapshot_report ¶
snapshot_report(
snapshot: RunSnapshot,
run_metadata: dict[str, JSONValue] | None = None,
) -> dict[str, JSONValue]
Return a JSON-serializable summary for a compiled snapshot.
Catalog namespace:
themis.catalog ¶
Manifest-backed catalog entry points.
builtin_component_refs ¶
builtin_component_refs() -> dict[str, Any]
Return component references for the builtin shipped catalog entries.
get_benchmark ¶
get_benchmark(name: str) -> BenchmarkCatalogEntry
Return structured metadata for a shipped catalog benchmark.
list_benchmark_ids ¶
list_benchmark_ids() -> list[str]
List canonical benchmark identifiers from the shipped catalog.
list_benchmarks ¶
list_benchmarks() -> list[BenchmarkCatalogEntry]
Return structured metadata for shipped catalog benchmarks.
list_component_ids ¶
list_component_ids(*, kind: str | None = None) -> list[str]
List builtin component identifiers, optionally filtered by kind.
load ¶
load(name: str) -> object
Load a builtin component or named benchmark from the shipped catalog.
run ¶
run(
name: str,
*,
model: object | None = None,
store: RunStore | None = None,
) -> RunResult
Execute a named benchmark through the catalog convenience layer.
validate_benchmark ¶
validate_benchmark(name: str) -> BenchmarkValidationResult
Validate that a shipped benchmark can load, materialize, and score.
Core namespace:
themis.core ¶
Core namespace for Themis.
AfterGenerate ¶
Bases: Protocol
Hook invoked after a generator returns a candidate.
AfterJudge ¶
Bases: Protocol
Hook invoked after a workflow-backed metric finishes.
AfterParse ¶
Bases: Protocol
Hook invoked after parsing completes.
AfterReduce ¶
Bases: Protocol
Hook invoked after reduction produces a final candidate.
AfterScore ¶
Bases: Protocol
Hook invoked after a pure metric emits a score or error.
BeforeGenerate ¶
Bases: Protocol
Hook invoked before a generator runs.
BeforeJudge ¶
Bases: Protocol
Hook invoked before a workflow-backed metric begins judging.
BeforeParse ¶
Bases: Protocol
Hook invoked before parsing a reduced candidate.
BeforeReduce ¶
Bases: Protocol
Hook invoked before reduction starts.
BeforeScore ¶
Bases: Protocol
Hook invoked before a pure metric runs.
BenchmarkResult ¶
CandidateReducer ¶
Bases: Protocol
Protocol for reducers that collapse multiple candidates into one.
CandidateSelector ¶
Bases: Protocol
Protocol for selectors that choose candidates before reduction.
Case ¶
CaseResult ¶
ComponentRefs ¶
ConversationTrace ¶
Dataset ¶
DatasetRef ¶
DefaultWorkflowRunner ¶
Concurrent interpreter for Themis-owned evaluation workflows.
EvalScoreContext ¶
EvaluationBundle ¶
EvaluationBundleRecord ¶
EvaluationCompletedEvent ¶
Bases: CaseRunEvent
Event emitted when a workflow-backed metric finishes.
EvaluationConfig ¶
EvaluationFailedEvent ¶
Bases: CaseRunEvent
Event emitted when a workflow-backed metric fails.
EvaluationWorkflow ¶
Bases: Protocol
Protocol for workflow-backed metrics driven by judge model calls.
ExecutionState ¶
Experiment ¶
Bases: FrozenModel
Authoring model for a Themis experiment.
An experiment owns the compile-time inputs required to build a RunSnapshot
and provides sync and async helpers for running or rejudging that snapshot.
from_config (classmethod) ¶
from_config(
path: str | Path, *, overrides: list[str] | None = None
) -> Experiment
Load an experiment definition from YAML or TOML configuration.
rejudge ¶
rejudge(
*,
metric_ids: list[str] | None = None,
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Re-run workflow-backed metrics synchronously.
rejudge_async (async) ¶
rejudge_async(
*,
metric_ids: list[str] | None = None,
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Re-run workflow-backed metrics from stored upstream artifacts.
replay ¶
replay(
*,
stage: Literal["reduce", "parse", "score", "judge"],
metric_ids: list[str] | None = None,
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Replay persisted runs from a downstream stage synchronously.
replay_async (async) ¶
replay_async(
*,
stage: Literal["reduce", "parse", "score", "judge"],
metric_ids: list[str] | None = None,
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Replay persisted runs from a downstream stage.
run ¶
run(
*,
until_stage: Literal[
"generate", "reduce", "parse", "score", "judge"
] = "judge",
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Run the compiled snapshot synchronously.
run_async (async) ¶
run_async(
*,
until_stage: Literal[
"generate", "reduce", "parse", "score", "judge"
] = "judge",
runtime: RuntimeConfig | None = None,
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
)
Run the compiled snapshot asynchronously.
FrozenModel ¶
Bases: BaseModel
Base Pydantic model used by the immutable core.
GenerateContext ¶
GenerationBundle ¶
GenerationBundleRecord ¶
GenerationCompletedEvent ¶
Bases: CaseRunEvent
Event emitted when candidate generation finishes for a case.
GenerationConfig ¶
GenerationFailedEvent ¶
Bases: CaseRunEvent
Event emitted when candidate generation fails for a case.
GenerationResult ¶
GenerationWorkItem ¶
Generator ¶
Bases: Protocol
Protocol for generation components that produce candidate outputs.
HashableModel ¶
InMemoryRunStore ¶
Bases: ProjectionRefreshingStore
Simple in-memory store used by tests and local development.
JudgeModel ¶
Bases: Protocol
Protocol for judge models used inside evaluation workflows.
LLMMetric ¶
Bases: Protocol
Protocol for metrics that judge a reduced candidate set with an LLM.
LifecycleSubscriber ¶
Bases: BeforeGenerate, AfterGenerate, BeforeReduce, AfterReduce, BeforeParse, AfterParse, BeforeScore, AfterScore, BeforeJudge, AfterJudge, OnEvent, Protocol
Aggregate lifecycle subscriber protocol.
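`LifecycleSubscriber` combines the individual hook protocols by inheritance, so an object satisfies it structurally once it provides every hook. The sketch below shows that pattern with `typing.Protocol`. The protocol names and the `before_step` / `after_step` methods are hypothetical, since this page does not list the real hook method names, and using `runtime_checkable` is an assumption made only so the example can demonstrate the check.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class BeforeStep(Protocol):
    def before_step(self, name: str) -> None: ...


@runtime_checkable
class AfterStep(Protocol):
    def after_step(self, name: str) -> None: ...


@runtime_checkable
class StepSubscriber(BeforeStep, AfterStep, Protocol):
    """Aggregate protocol: satisfied only by objects with both hooks."""


class Recorder:
    """Plain class; it never inherits from the protocols above."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def before_step(self, name: str) -> None:
        self.calls.append(f"before:{name}")

    def after_step(self, name: str) -> None:
        self.calls.append(f"after:{name}")


class BeforeOnly:
    """Implements one hook, so it matches BeforeStep but not the aggregate."""

    def before_step(self, name: str) -> None:
        pass
```

`Recorder` passes an `isinstance` check against `StepSubscriber` without subclassing it, while `BeforeOnly` matches only the narrower `BeforeStep` protocol.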
Message ¶
OnEvent ¶
Bases: Protocol
Hook invoked after an execution event is persisted.
ParseBundle ¶
ParseBundleRecord ¶
ParseCompletedEvent ¶
Bases: CaseRunEvent
Event emitted when parsing a reduced candidate succeeds.
ParseContext ¶
ParseFailedEvent ¶
Bases: CaseRunEvent
Event emitted when parsing a reduced candidate fails.
ParsedOutput ¶
Parser ¶
Bases: Protocol
Protocol for parsers that normalize reduced candidate outputs.
ProgressSnapshot ¶
PromptSpec ¶
Bases: HashableModel
Generic prompt instructions and structured prompt material.
PureMetric ¶
Bases: Protocol
Protocol for deterministic metrics that score parsed outputs directly.
ReduceContext ¶
ReducedCandidate ¶
ReductionBundle ¶
ReductionBundleRecord ¶
ReductionCompletedEvent ¶
Bases: CaseRunEvent
Event emitted when candidate reduction succeeds.
ReductionFailedEvent ¶
Bases: CaseRunEvent
Event emitted when candidate reduction fails.
Reporter ¶
Export persisted run projections in JSON, Markdown, CSV, or LaTeX.
export_json ¶
export_json(run_id: str) -> str
Export all major persisted projections for a run as formatted JSON.
export_latex ¶
export_latex(run_id: str) -> str
Export benchmark score rows as a compact LaTeX table.
export_markdown ¶
export_markdown(run_id: str) -> str
Export a human-readable Markdown summary for a persisted run.
export_score_table ¶
export_score_table(
run_id: str,
) -> list[dict[str, JSONValue]]
Return benchmark score rows in a normalized table structure.
RunEstimate ¶
RunEvent ¶
RunFailedEvent ¶
RunIdentity ¶
RunProvenance ¶
RunResult ¶
RunSnapshot ¶
RunStatus ¶
Bases: StrEnum
User-facing run status values.
RunStore ¶
Bases: Protocol
Persistence contract used by Themis runtime components.
RuntimeConfig ¶
Score ¶
ScoreBundle ¶
ScoreBundleRecord ¶
ScoreCompletedEvent ¶
Bases: CaseRunEvent
Event emitted when a pure metric succeeds.
ScoreContext ¶
ScoreError ¶
ScoreFailedEvent ¶
Bases: CaseRunEvent
Event emitted when a pure metric produces an error payload.
SelectContext ¶
SelectionMetric ¶
Bases: Protocol
Protocol for metrics that judge multiple generated candidates.
SqliteRunStore ¶
Bases: ProjectionRefreshingStore
Small SQLite-backed run store.
StorageConfig ¶
StoredRun ¶
TimelineView ¶
TraceMetric ¶
Bases: Protocol
Protocol for metrics that score traces or conversations.
TraceStep ¶
TraceView ¶
TracingProvider ¶
Bases: Protocol
Protocol for span-based tracing integrations.
WorkflowBuildError ¶
Bases: ValueError
Raised when a metric cannot build a valid evaluation workflow.
WorkflowRunner ¶
Bases: Protocol
Protocol for executing evaluation workflows and returning traces.
WorkflowTrace ¶
evaluate_async (async) ¶
evaluate_async(
*,
model: object,
data: Dataset
| Sequence[Dataset]
| Sequence[Mapping[str, Any]],
metric: object | Sequence[object],
parser: object | Sequence[object] | None = None,
judge: object | Sequence[object] | None = None,
samples: int = 1,
reducer: object | None = None,
storage: StorageConfig | None = None,
runtime: RuntimeConfig | None = None,
seeds: list[int] | None = None,
workflow_overrides: dict[str, object] | None = None,
judge_config: dict[str, object] | None = None,
environment_metadata: dict[str, str] | None = None,
themis_version: str | None = None,
python_version: str = "3.12",
platform: str = "unknown",
store: RunStore | None = None,
subscribers: list[LifecycleSubscriber] | None = None,
tracing_provider: TracingProvider | None = None,
) -> RunResult
Compile and run a Themis experiment asynchronously through the Layer 1 API.
event_from_dict ¶
event_from_dict(payload: dict[str, Any]) -> RunEvent
Deserialize a stored event payload into the correct event model.
export_evaluation_bundle ¶
export_evaluation_bundle(
store: RunStore, run_id: str
) -> EvaluationBundle
Export stored evaluation artifacts into a portable bundle.
export_generation_bundle ¶
export_generation_bundle(
store: RunStore, run_id: str
) -> GenerationBundle
Export stored generation artifacts into a portable bundle.
export_parse_bundle ¶
export_parse_bundle(
store: RunStore, run_id: str
) -> ParseBundle
Export stored parse artifacts into a portable bundle.
export_reduction_bundle ¶
export_reduction_bundle(
store: RunStore, run_id: str
) -> ReductionBundle
Export stored reduction artifacts into a portable bundle.
export_score_bundle ¶
export_score_bundle(
store: RunStore, run_id: str
) -> ScoreBundle
Export stored score artifacts into a portable bundle.
get_evaluation_execution ¶
get_evaluation_execution(
store: RunStore,
run_id: str,
case_id: str,
metric_id: str,
*,
dataset_id: str | None = None,
case_key: str | None = None,
) -> EvaluationExecution | None
Return one stored workflow execution for a case and metric.
get_execution_state ¶
get_execution_state(
store: RunStore, run_id: str
) -> ExecutionState
Return the persisted execution state for a run.
get_run_snapshot ¶
get_run_snapshot(
store: RunStore, run_id: str
) -> RunSnapshot
Return the persisted snapshot for a run.
import_evaluation_bundle ¶
import_evaluation_bundle(
store: RunStore, bundle: EvaluationBundle
) -> None
Import evaluation artifacts from a bundle into a store.
import_generation_bundle ¶
import_generation_bundle(
store: RunStore, bundle: GenerationBundle
) -> None
Import generation artifacts from a bundle into a store.
import_parse_bundle ¶
import_parse_bundle(
store: RunStore, bundle: ParseBundle
) -> None
Import parse artifacts from a bundle into a store.
import_reduction_bundle ¶
import_reduction_bundle(
store: RunStore, bundle: ReductionBundle
) -> None
Import reduction artifacts from a bundle into a store.
import_score_bundle ¶
import_score_bundle(
store: RunStore, bundle: ScoreBundle
) -> None
Import score artifacts from a bundle into a store.
snapshot_report ¶
snapshot_report(
snapshot: RunSnapshot,
run_metadata: dict[str, JSONValue] | None = None,
) -> dict[str, JSONValue]
Return a JSON-serializable summary for a compiled snapshot.
Adapters:
themis.adapters ¶
Provider-backed generator adapters for Themis.