Python API reference¶

This page is the generated entry point into the public Python API. Use the smaller reference pages in this section when you already know the category of symbol you need.

Root exports¶

| Name | Kind | Use when | Key constraints / notes |
| --- | --- | --- | --- |
| `__version__` | Constant | You want the installed package version | Useful for docs, debugging, and release checks |
| `Experiment` | Core class | You want the main reusable experiment authoring surface | Use for config-backed or Python-authored experiments |
| `InMemoryRunStore` | Store implementation | You want ephemeral local storage | No cross-process persistence |
| `PromptSpec` | Prompt model | You want prompt instructions, prefixes, suffixes, or prompt blocks as part of experiment identity | Shared across generation and builtin judge workflows |
| `Reporter` | Reporting API | You want exports such as JSON, Markdown, CSV, or LaTeX | Works from stored projections |
| `RunEstimate` | Data model | You want planned task counts and token estimates | Informational only; not pricing |
| `RunResult` | Data model | You want the top-level execution result returned by a run | Includes status and benchmark output |
| `RunSnapshot` | Data model | You want the compiled identity and provenance artifact | Produced by `compile()` |
| `RunStatus` | Enum-like status model | You want run lifecycle state values | Useful in automation and inspection |
| `RunStore` | Storage protocol | You are typing against or implementing custom stores | Abstract interface rather than a concrete backend |
| `RuntimeConfig` | Config model | You want runtime tuning without changing logical identity | Covers concurrency, retries, and deferred execution paths |
| `SqliteRunStore` | Store implementation | You want the default persistent local store | Good default for real runs |
| `StatsEngine` | Analysis helper | You want statistical comparison utilities | Used in comparison and reporting flows |
| `evaluate` | Convenience function | You want the shortest synchronous Python path to a run | Best for simple scripts; call only when no event loop is already running |
| `evaluate_async` | Convenience function | You want the shortest async Python path to a run | Use in notebooks, async apps, and any environment with a running event loop |
| `export_evaluation_bundle` | Artifact helper | You want portable evaluation workflow artifacts | Best for judge-backed replay or handoff |
| `export_generation_bundle` | Artifact helper | You want portable generation artifacts | Good for external evaluation pipelines |
| `export_parse_bundle` | Artifact helper | You want portable parsed-output artifacts | Python-only today |
| `export_reduction_bundle` | Artifact helper | You want portable reduction-stage artifacts | Python-only today |
| `export_score_bundle` | Artifact helper | You want portable pure-score artifacts | Python-only today |
| `get_evaluation_execution` | Inspection helper | You want one stored workflow execution | Judge-backed metrics only; pass `dataset_id` or `case_key` when duplicate `case_id`s exist across datasets |
| `get_execution_state` | Inspection helper | You want stored progress and failure details | Best before resume or replay decisions |
| `get_run_snapshot` | Inspection helper | You want compiled identity and provenance details | Read-only lookup |
| `import_evaluation_bundle` | Artifact helper | You want to ingest external evaluation artifacts into a store | Match bundle shape to the target run |
| `import_generation_bundle` | Artifact helper | You want to ingest generation artifacts into a store | Enables later replay without regeneration |
| `import_parse_bundle` | Artifact helper | You want to ingest parsed-output artifacts | Python-only today |
| `import_reduction_bundle` | Artifact helper | You want to ingest reduction-stage artifacts | Python-only today |
| `import_score_bundle` | Artifact helper | You want to ingest score artifacts | Python-only today |
| `quickcheck` | Inspection helper | You want a compact run summary | Smaller surface than full reporting |
| `snapshot_report` | Reporting helper | You want a concise Python report from stored snapshot data | Lighter than `Reporter` |
| `sqlite_store` | Store factory helper | You want a quick SQLite store constructor | Shortcut for the persistent local backend |
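
The `evaluate` and `evaluate_async` rows differ on one point: whether an event loop is already running. The sketch below shows one way to make that check with the standard library only. `fake_evaluate_async` is a hypothetical stand-in for an async entry point and is not part of Themis; the real functions take the keyword arguments documented further down this page.

```python
import asyncio


async def fake_evaluate_async() -> str:
    """Hypothetical stand-in for an async entry point such as evaluate_async."""
    await asyncio.sleep(0)
    return "done"


def loop_is_running() -> bool:
    """Return True when called from inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def run_blocking() -> str:
    """Synchronous path: only safe when no loop is running in this thread."""
    if loop_is_running():
        raise RuntimeError("event loop already running; await the async variant")
    return asyncio.run(fake_evaluate_async())


async def run_from_async_code() -> str:
    """Async path: await the coroutine directly from notebook or app code."""
    return await fake_evaluate_async()


sync_result = run_blocking()
async_result = asyncio.run(run_from_async_code())
```

In a notebook, where a loop is usually already running, the blocking helper in this sketch would raise, which is the situation the table's note on `evaluate` warns about.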

Generated modules¶

Root package:

themis ¶

Public package surface for Themis.

Experiment ¶

Bases: FrozenModel

Authoring model for a Themis experiment.

An experiment owns the compile-time inputs required to build a RunSnapshot and provides sync and async helpers for running, replaying, or rejudging that snapshot.

compile ¶

compile() -> RunSnapshot

Compile the experiment into an immutable RunSnapshot.

from_config `classmethod` ¶

from_config(
    path: str | Path, *, overrides: list[str] | None = None
) -> Experiment

Load an experiment definition from YAML or TOML configuration.

rejudge ¶

rejudge(
    *,
    metric_ids: list[str] | None = None,
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Re-run workflow-backed metrics synchronously.

rejudge_async `async` ¶

rejudge_async(
    *,
    metric_ids: list[str] | None = None,
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Re-run workflow-backed metrics from stored upstream artifacts.

replay ¶

replay(
    *,
    stage: Literal["reduce", "parse", "score", "judge"],
    metric_ids: list[str] | None = None,
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Replay persisted runs from a downstream stage synchronously.

replay_async `async` ¶

replay_async(
    *,
    stage: Literal["reduce", "parse", "score", "judge"],
    metric_ids: list[str] | None = None,
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Replay persisted runs from a downstream stage.

run ¶

run(
    *,
    until_stage: Literal[
        "generate", "reduce", "parse", "score", "judge"
    ] = "judge",
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Run the compiled snapshot synchronously.
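
The `until_stage` literal lists the pipeline stages in order. The sketch below is a hypothetical helper, not part of Themis, that computes which stages a run would cover. It assumes stages execute in the listed order and that `until_stage` is inclusive; neither is stated on this page.

```python
from typing import Literal, get_args

# Stage names copied from the `until_stage` annotation on run().
Stage = Literal["generate", "reduce", "parse", "score", "judge"]
STAGES = get_args(Stage)  # tuple of the literal values, in declaration order


def stages_through(until_stage: str = "judge") -> list[str]:
    """Return every stage up to and including `until_stage` (assumed inclusive)."""
    if until_stage not in STAGES:
        raise ValueError(f"unknown stage: {until_stage!r}")
    return list(STAGES[: STAGES.index(until_stage) + 1])


parse_only_plan = stages_through("parse")
full_plan = stages_through()
```

Stopping at `"parse"` under these assumptions would leave scoring and judging for a later `replay` call.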

run_async `async` ¶

run_async(
    *,
    until_stage: Literal[
        "generate", "reduce", "parse", "score", "judge"
    ] = "judge",
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Run the compiled snapshot asynchronously.

InMemoryRunStore ¶

Bases: ProjectionRefreshingStore

Simple in-memory store used by tests and local development.

PromptSpec ¶

Bases: HashableModel

Generic prompt instructions and structured prompt material.

render_input ¶

render_input(prompt_input: JSONValue) -> JSONValue

Render prompt-oriented input for provider adapters.

render_sections ¶

render_sections() -> list[str]

Render prompt sections that can prefix a prompt body.

Reporter ¶

Export persisted run projections in JSON, Markdown, CSV, or LaTeX.

export_csv ¶

export_csv(run_id: str) -> str

Export benchmark score rows as CSV.

export_json ¶

export_json(run_id: str) -> str

Export all major persisted projections for a run as formatted JSON.

export_latex ¶

export_latex(run_id: str) -> str

Export benchmark score rows as a compact LaTeX table.

export_markdown ¶

export_markdown(run_id: str) -> str

Export a human-readable Markdown summary for a persisted run.

export_score_table ¶

export_score_table(
    run_id: str,
) -> list[dict[str, JSONValue]]

Return benchmark score rows in a normalized table structure.
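
Rows shaped like `list[dict[str, JSONValue]]` are easy to post-process with the standard library. The sketch below writes such rows to CSV. The column names (`metric_id`, `mean`, `count`) are invented for illustration; the actual keys returned by `export_score_table` are not documented on this page.

```python
import csv
import io

# Hypothetical rows; real column names may differ.
rows = [
    {"metric_id": "exact_match", "mean": 0.75, "count": 4},
    {"metric_id": "f1", "mean": 0.5, "count": 4},
]


def rows_to_csv(table: list[dict]) -> str:
    """Serialize dict rows to CSV, using the union of keys as the header."""
    if not table:
        return ""
    # Preserve first-seen key order across all rows.
    fieldnames = list(dict.fromkeys(key for row in table for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
```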

RunEstimate ¶

Bases: FrozenModel

Planner estimate for the work implied by a compiled run.

RunResult ¶

Bases: FrozenModel

Final run-level result returned from execution.

RunSnapshot ¶

Bases: FrozenModel

Immutable executable artifact produced by Experiment.compile().

RunStatus ¶

Bases: StrEnum

User-facing run status values.

RunStore ¶

Bases: Protocol

Persistence contract used by Themis runtime components.

RuntimeConfig ¶

Bases: HashableModel

Execution-time controls that do not affect snapshot identity.

SqliteRunStore ¶

Bases: ProjectionRefreshingStore

Small SQLite-backed run store.

evaluate_async `async` ¶

evaluate_async(
    *,
    model: object,
    data: Dataset
    | Sequence[Dataset]
    | Sequence[Mapping[str, Any]],
    metric: object | Sequence[object],
    parser: object | Sequence[object] | None = None,
    judge: object | Sequence[object] | None = None,
    samples: int = 1,
    reducer: object | None = None,
    storage: StorageConfig | None = None,
    runtime: RuntimeConfig | None = None,
    seeds: list[int] | None = None,
    workflow_overrides: dict[str, object] | None = None,
    judge_config: dict[str, object] | None = None,
    environment_metadata: dict[str, str] | None = None,
    themis_version: str | None = None,
    python_version: str = "3.12",
    platform: str = "unknown",
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
) -> RunResult

Compile and run a Themis experiment asynchronously through the Layer 1 API.

export_evaluation_bundle ¶

export_evaluation_bundle(
    store: RunStore, run_id: str
) -> EvaluationBundle

Export stored evaluation artifacts into a portable bundle.

export_generation_bundle ¶

export_generation_bundle(
    store: RunStore, run_id: str
) -> GenerationBundle

Export stored generation artifacts into a portable bundle.

export_parse_bundle ¶

export_parse_bundle(
    store: RunStore, run_id: str
) -> ParseBundle

Export stored parse artifacts into a portable bundle.

export_reduction_bundle ¶

export_reduction_bundle(
    store: RunStore, run_id: str
) -> ReductionBundle

Export stored reduction artifacts into a portable bundle.

export_score_bundle ¶

export_score_bundle(
    store: RunStore, run_id: str
) -> ScoreBundle

Export stored score artifacts into a portable bundle.

get_evaluation_execution ¶

get_evaluation_execution(
    store: RunStore,
    run_id: str,
    case_id: str,
    metric_id: str,
    *,
    dataset_id: str | None = None,
    case_key: str | None = None,
) -> EvaluationExecution | None

Return one stored workflow execution for a case and metric.

get_execution_state ¶

get_execution_state(
    store: RunStore, run_id: str
) -> ExecutionState

Return the persisted execution state for a run.

get_run_snapshot ¶

get_run_snapshot(
    store: RunStore, run_id: str
) -> RunSnapshot

Return the persisted snapshot for a run.

import_evaluation_bundle ¶

import_evaluation_bundle(
    store: RunStore, bundle: EvaluationBundle
) -> None

Import evaluation artifacts from a bundle into a store.

import_generation_bundle ¶

import_generation_bundle(
    store: RunStore, bundle: GenerationBundle
) -> None

Import generation artifacts from a bundle into a store.

import_parse_bundle ¶

import_parse_bundle(
    store: RunStore, bundle: ParseBundle
) -> None

Import parse artifacts from a bundle into a store.

import_reduction_bundle ¶

import_reduction_bundle(
    store: RunStore, bundle: ReductionBundle
) -> None

Import reduction artifacts from a bundle into a store.

import_score_bundle ¶

import_score_bundle(
    store: RunStore, bundle: ScoreBundle
) -> None

Import score artifacts from a bundle into a store.

snapshot_report ¶

snapshot_report(
    snapshot: RunSnapshot,
    run_metadata: dict[str, JSONValue] | None = None,
) -> dict[str, JSONValue]

Return a JSON-serializable summary for a compiled snapshot.

sqlite_store ¶

sqlite_store(path: str | Path) -> SqliteRunStore

Build a SQLite-backed store.
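
To illustrate what a file-backed store gives you over an in-memory one, the sketch below uses Python's built-in `sqlite3` module directly. It does not use or reproduce `SqliteRunStore`; the table layout and helper names are hypothetical. The point is only that data written through one connection is still there after that connection is closed and a new one is opened.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def append_event(path: Path, run_id: str, payload: dict) -> None:
    """Append one JSON payload for a run (hypothetical schema)."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS events (run_id TEXT, payload TEXT)"
            )
            conn.execute(
                "INSERT INTO events VALUES (?, ?)", (run_id, json.dumps(payload))
            )
    finally:
        conn.close()


def load_events(path: Path, run_id: str) -> list[dict]:
    """Read payloads back through a fresh connection, in insertion order."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT payload FROM events WHERE run_id = ? ORDER BY rowid",
            (run_id,),
        ).fetchall()
    finally:
        conn.close()
    return [json.loads(row[0]) for row in rows]


with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "runs.sqlite3"
    append_event(db_path, "run-1", {"type": "started"})
    append_event(db_path, "run-1", {"type": "completed"})
    events = load_events(db_path, "run-1")
```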

Catalog namespace:

themis.catalog ¶

Manifest-backed catalog entry points.

builtin_component_refs ¶

builtin_component_refs() -> dict[str, Any]

Return component references for the builtin shipped catalog entries.

get_benchmark ¶

get_benchmark(name: str) -> BenchmarkCatalogEntry

Return structured metadata for a shipped catalog benchmark.

list_benchmark_ids ¶

list_benchmark_ids() -> list[str]

List canonical benchmark identifiers from the shipped catalog.

list_benchmarks ¶

list_benchmarks() -> list[BenchmarkCatalogEntry]

Return structured metadata for shipped catalog benchmarks.

list_component_ids ¶

list_component_ids(*, kind: str | None = None) -> list[str]

List builtin component identifiers, optionally filtered by kind.

load ¶

load(name: str) -> object

Load a builtin component or named benchmark from the shipped catalog.

run ¶

run(
    name: str,
    *,
    model: object | None = None,
    store: RunStore | None = None,
) -> RunResult

Execute a named benchmark through the catalog convenience layer.

validate_benchmark ¶

validate_benchmark(name: str) -> BenchmarkValidationResult

Validate that a shipped benchmark can load, materialize, and score.

Core namespace:

themis.core ¶

Core namespace for Themis.

AfterGenerate ¶

Bases: Protocol

Hook invoked after a generator returns a candidate.

AfterJudge ¶

Bases: Protocol

Hook invoked after a workflow-backed metric finishes.

AfterParse ¶

Bases: Protocol

Hook invoked after parsing completes.

AfterReduce ¶

Bases: Protocol

Hook invoked after reduction produces a final candidate.

AfterScore ¶

Bases: Protocol

Hook invoked after a pure metric emits a score or error.

BeforeGenerate ¶

Bases: Protocol

Hook invoked before a generator runs.

BeforeJudge ¶

Bases: Protocol

Hook invoked before a workflow-backed metric begins judging.

BeforeParse ¶

Bases: Protocol

Hook invoked before parsing a reduced candidate.

BeforeReduce ¶

Bases: Protocol

Hook invoked before reduction starts.

BeforeScore ¶

Bases: Protocol

Hook invoked before a pure metric runs.

BenchmarkResult ¶

Bases: FrozenModel

Aggregate benchmark-style projection for a run.

CandidateReducer ¶

Bases: Protocol

Protocol for reducers that collapse multiple candidates into one.
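
As a rough illustration of a reducer that collapses several candidates into one, the sketch below implements majority voting over plain strings. The protocol name `ReducerSketch`, the method name `reduce`, and the string candidate type are all assumptions; the real `CandidateReducer` signature is not shown on this page.

```python
from collections import Counter
from typing import Protocol, Sequence


class ReducerSketch(Protocol):
    """Hypothetical reducer shape: many candidates in, one out."""

    def reduce(self, candidates: Sequence[str]) -> str: ...


class MajorityVote:
    """Pick the most frequent candidate; ties go to the first one seen."""

    def reduce(self, candidates: Sequence[str]) -> str:
        if not candidates:
            raise ValueError("at least one candidate is required")
        return Counter(candidates).most_common(1)[0][0]


def collapse(reducer: ReducerSketch, candidates: Sequence[str]) -> str:
    """Accept any object that structurally matches ReducerSketch."""
    return reducer.reduce(candidates)


winner = collapse(MajorityVote(), ["A", "B", "A"])
```

Because the protocol is structural, `MajorityVote` does not need to inherit from it.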

CandidateSelector ¶

Bases: Protocol

Protocol for selectors that choose candidates before reduction.

Case ¶

Bases: HashableModel

One dataset case evaluated by the runtime.

CaseResult ¶

Bases: FrozenModel

Final case-level result returned from a run.

ComponentRefs ¶

Bases: FrozenModel

Resolved component refs stored with the snapshot.

ConversationTrace ¶

Bases: HashableModel

Conversation trace captured during generation.

Dataset ¶

Bases: HashableModel

A collection of cases evaluated together.

DatasetRef ¶

Bases: HashableModel

Identity-bearing reference to one dataset.

DefaultWorkflowRunner ¶

Concurrent interpreter for Themis-owned evaluation workflows.

EvalScoreContext ¶

Bases: ScoreContext

Score context extended with judge workflow configuration.

EvaluationBundle ¶

Bases: FrozenModel

Portable bundle of evaluation artifacts for a run.

EvaluationBundleRecord ¶

Bases: FrozenModel

One portable evaluation execution record.

EvaluationCompletedEvent ¶

Bases: CaseRunEvent

Event emitted when a workflow-backed metric finishes.

EvaluationConfig ¶

Bases: HashableModel

Evaluation-stage configuration for parsing, metrics, and judges.

EvaluationFailedEvent ¶

Bases: CaseRunEvent

Event emitted when a workflow-backed metric fails.

EvaluationWorkflow ¶

Bases: Protocol

Protocol for workflow-backed metrics driven by judge model calls.

ExecutionState ¶

Bases: FrozenModel

Persisted run state rebuilt from the run event stream.

Experiment ¶

Bases: FrozenModel

Authoring model for a Themis experiment.

An experiment owns the compile-time inputs required to build a RunSnapshot and provides sync and async helpers for running, replaying, or rejudging that snapshot.

compile ¶

compile() -> RunSnapshot

Compile the experiment into an immutable RunSnapshot.

from_config `classmethod` ¶

from_config(
    path: str | Path, *, overrides: list[str] | None = None
) -> Experiment

Load an experiment definition from YAML or TOML configuration.

rejudge ¶

rejudge(
    *,
    metric_ids: list[str] | None = None,
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Re-run workflow-backed metrics synchronously.

rejudge_async `async` ¶

rejudge_async(
    *,
    metric_ids: list[str] | None = None,
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Re-run workflow-backed metrics from stored upstream artifacts.

replay ¶

replay(
    *,
    stage: Literal["reduce", "parse", "score", "judge"],
    metric_ids: list[str] | None = None,
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Replay persisted runs from a downstream stage synchronously.

replay_async `async` ¶

replay_async(
    *,
    stage: Literal["reduce", "parse", "score", "judge"],
    metric_ids: list[str] | None = None,
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Replay persisted runs from a downstream stage.

run ¶

run(
    *,
    until_stage: Literal[
        "generate", "reduce", "parse", "score", "judge"
    ] = "judge",
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Run the compiled snapshot synchronously.

run_async `async` ¶

run_async(
    *,
    until_stage: Literal[
        "generate", "reduce", "parse", "score", "judge"
    ] = "judge",
    runtime: RuntimeConfig | None = None,
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
)

Run the compiled snapshot asynchronously.

FrozenModel ¶

Bases: BaseModel

Base Pydantic model used by the immutable core.

GenerateContext ¶

Bases: HashableModel

Context passed to generators for one case execution.

GenerationBundle ¶

Bases: FrozenModel

Portable bundle of generation artifacts for a run.

GenerationBundleRecord ¶

Bases: FrozenModel

One portable generation artifact record.

GenerationCompletedEvent ¶

Bases: CaseRunEvent

Event emitted when candidate generation finishes for a case.

GenerationConfig ¶

Bases: HashableModel

Generation-stage configuration for a run.

GenerationFailedEvent ¶

Bases: CaseRunEvent

Event emitted when candidate generation fails for a case.

GenerationResult ¶

Bases: HashableModel

The candidate artifact returned by a generator call.

GenerationWorkItem ¶

Bases: FrozenModel

Planner output for one generation task.

Generator ¶

Bases: Protocol

Protocol for generation components that produce candidate outputs.

HashableModel ¶

Bases: FrozenModel

Immutable model with stable content-addressable hashing.

InMemoryRunStore ¶

Bases: ProjectionRefreshingStore

Simple in-memory store used by tests and local development.

JudgeModel ¶

Bases: Protocol

Protocol for judge models used inside evaluation workflows.

LLMMetric ¶

Bases: Protocol

Protocol for metrics that judge a reduced candidate set with an LLM.

LifecycleSubscriber ¶

Bases: BeforeGenerate, AfterGenerate, BeforeReduce, AfterReduce, BeforeParse, AfterParse, BeforeScore, AfterScore, BeforeJudge, AfterJudge, OnEvent, Protocol

Aggregate lifecycle subscriber protocol.
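
An aggregate protocol of this kind can be built by inheriting from several smaller `typing.Protocol` classes. The sketch below shows the pattern with two hooks only. The hook method names and parameters (`before_generate`, `after_generate`, `case_id`, `output`) are hypothetical; the real hook signatures are not listed on this page.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class BeforeGenerateSketch(Protocol):
    def before_generate(self, case_id: str) -> None: ...


@runtime_checkable
class AfterGenerateSketch(Protocol):
    def after_generate(self, case_id: str, output: str) -> None: ...


@runtime_checkable
class SubscriberSketch(BeforeGenerateSketch, AfterGenerateSketch, Protocol):
    """Aggregate protocol: an object must provide every inherited hook."""


class RecordingSubscriber:
    """Plain class that satisfies the aggregate protocol structurally."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def before_generate(self, case_id: str) -> None:
        self.calls.append(("before", case_id))

    def after_generate(self, case_id: str, output: str) -> None:
        self.calls.append(("after", case_id))


subscriber = RecordingSubscriber()
subscriber.before_generate("case-1")
subscriber.after_generate("case-1", "answer")
matches_protocol = isinstance(subscriber, SubscriberSketch)
```

Note that `isinstance` against a runtime-checkable protocol only checks that the methods exist, not that their signatures match.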

Message ¶

Bases: HashableModel

One conversation message captured as an artifact.

OnEvent ¶

Bases: Protocol

Hook invoked after an execution event is persisted.

ParseBundle ¶

Bases: FrozenModel

Portable bundle of parse artifacts for a run.

ParseBundleRecord ¶

Bases: FrozenModel

One portable parse artifact record.

ParseCompletedEvent ¶

Bases: CaseRunEvent

Event emitted when parsing a reduced candidate succeeds.

ParseContext ¶

Bases: HashableModel

Context passed to parsers for a reduced candidate.

ParseFailedEvent ¶

Bases: CaseRunEvent

Event emitted when parsing a reduced candidate fails.

ParsedOutput ¶

Bases: HashableModel

Normalized output produced by a parser before scoring.

Parser ¶

Bases: Protocol

Protocol for parsers that normalize reduced candidate outputs.

ProgressSnapshot ¶

Bases: FrozenModel

Aggregate case progress for a run.

PromptSpec ¶

Bases: HashableModel

Generic prompt instructions and structured prompt material.

render_input ¶

render_input(prompt_input: JSONValue) -> JSONValue

Render prompt-oriented input for provider adapters.

render_sections ¶

render_sections() -> list[str]

Render prompt sections that can prefix a prompt body.

PureMetric ¶

Bases: Protocol

Protocol for deterministic metrics that score parsed outputs directly.

ReduceContext ¶

Bases: HashableModel

Context passed to reducers choosing a final candidate.

ReducedCandidate ¶

Bases: HashableModel

Candidate selected or synthesized by the reduction stage.

ReductionBundle ¶

Bases: FrozenModel

Portable bundle of reduction artifacts for a run.

ReductionBundleRecord ¶

Bases: FrozenModel

One portable reduction artifact record.

ReductionCompletedEvent ¶

Bases: CaseRunEvent

Event emitted when candidate reduction succeeds.

ReductionFailedEvent ¶

Bases: CaseRunEvent

Event emitted when candidate reduction fails.

Reporter ¶

Export persisted run projections in JSON, Markdown, CSV, or LaTeX.

export_csv ¶

export_csv(run_id: str) -> str

Export benchmark score rows as CSV.

export_json ¶

export_json(run_id: str) -> str

Export all major persisted projections for a run as formatted JSON.

export_latex ¶

export_latex(run_id: str) -> str

Export benchmark score rows as a compact LaTeX table.

export_markdown ¶

export_markdown(run_id: str) -> str

Export a human-readable Markdown summary for a persisted run.

export_score_table ¶

export_score_table(
    run_id: str,
) -> list[dict[str, JSONValue]]

Return benchmark score rows in a normalized table structure.

RunCompletedEvent ¶

Bases: RunEvent

Event emitted when orchestration completes successfully.

RunEstimate ¶

Bases: FrozenModel

Planner estimate for the work implied by a compiled run.

RunEvent ¶

Bases: HashableModel

Base event persisted for a compiled run.

RunFailedEvent ¶

Bases: RunEvent

Event emitted when orchestration aborts with an unrecoverable error.

RunIdentity ¶

Bases: HashableModel

Inputs that determine the logical identity and run_id of a run.

RunProvenance ¶

Bases: FrozenModel

Environment metadata recorded with a run but excluded from run_id.
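
The split between identity and provenance can be pictured as hashing only the identity fields. The sketch below is one possible way to do that with `hashlib` and canonical JSON. The class names, field names, and the hashing scheme are assumptions for illustration and may not match how Themis derives `run_id`.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IdentitySketch:
    """Hypothetical identity-bearing inputs."""

    model: str
    dataset: str
    samples: int


@dataclass(frozen=True)
class ProvenanceSketch:
    """Hypothetical environment metadata, excluded from the id."""

    python_version: str
    platform: str


@dataclass(frozen=True)
class RunSketch:
    identity: IdentitySketch
    provenance: ProvenanceSketch

    @property
    def run_id(self) -> str:
        # Hash identity only; sort_keys makes the encoding order-independent.
        canonical = json.dumps(
            asdict(self.identity), sort_keys=True, separators=(",", ":")
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


identity = IdentitySketch(model="m1", dataset="d1", samples=1)
run_a = RunSketch(identity, ProvenanceSketch("3.12", "linux"))
run_b = RunSketch(identity, ProvenanceSketch("3.11", "darwin"))
run_c = RunSketch(
    IdentitySketch(model="m1", dataset="d1", samples=2),
    ProvenanceSketch("3.12", "linux"),
)
```

Here `run_a` and `run_b` share an id despite different environments, while changing `samples` in `run_c` produces a different one.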

RunResult ¶

Bases: FrozenModel

Final run-level result returned from execution.

RunSnapshot ¶

Bases: FrozenModel

Immutable executable artifact produced by Experiment.compile().

RunStartedEvent ¶

Bases: RunEvent

Event emitted when orchestration starts for a run.

RunStatus ¶

Bases: StrEnum

User-facing run status values.

RunStore ¶

Bases: Protocol

Persistence contract used by Themis runtime components.

RuntimeConfig ¶

Bases: HashableModel

Execution-time controls that do not affect snapshot identity.

Score ¶

Bases: HashableModel

Successful metric output.

ScoreBundle ¶

Bases: FrozenModel

Portable bundle of score artifacts for a run.

ScoreBundleRecord ¶

Bases: FrozenModel

One portable score artifact record.

ScoreCompletedEvent ¶

Bases: CaseRunEvent

Event emitted when a pure metric succeeds.

ScoreContext ¶

Bases: HashableModel

Context passed to deterministic scoring metrics.

ScoreError ¶

Bases: HashableModel

Structured score failure recorded by the runtime.

ScoreFailedEvent ¶

Bases: CaseRunEvent

Event emitted when a pure metric produces an error payload.

SelectContext ¶

Bases: HashableModel

Context passed to candidate selectors before reduction.

SelectionMetric ¶

Bases: Protocol

Protocol for metrics that judge multiple generated candidates.

SqliteRunStore ¶

Bases: ProjectionRefreshingStore

Small SQLite-backed run store.

StepCompletedEvent ¶

Bases: RunEvent

Event emitted when a workflow step completes.

StepFailedEvent ¶

Bases: RunEvent

Event emitted when a workflow step fails.

StepStartedEvent ¶

Bases: RunEvent

Event emitted when a workflow step starts.

StorageConfig ¶

Bases: HashableModel

Store backend configuration used for persistence.

StoredRun ¶

Bases: FrozenModel

Snapshot plus stored events loaded back from a run store.

TimelineView ¶

Bases: FrozenModel

Timeline projection for a run.

TraceMetric ¶

Bases: Protocol

Protocol for metrics that score traces or conversations.

TraceStep ¶

Bases: HashableModel

One structured step in a generation or evaluation trace.

TraceView ¶

Bases: FrozenModel

Trace-oriented projection for a run.

TracingProvider ¶

Bases: Protocol

Protocol for span-based tracing integrations.

WorkflowBuildError ¶

Bases: ValueError

Raised when a metric cannot build a valid evaluation workflow.

WorkflowRunner ¶

Bases: Protocol

Protocol for executing evaluation workflows and returning traces.

WorkflowTrace ¶

Bases: HashableModel

Trace emitted by a workflow-backed evaluation.

evaluate_async `async` ¶

evaluate_async(
    *,
    model: object,
    data: Dataset
    | Sequence[Dataset]
    | Sequence[Mapping[str, Any]],
    metric: object | Sequence[object],
    parser: object | Sequence[object] | None = None,
    judge: object | Sequence[object] | None = None,
    samples: int = 1,
    reducer: object | None = None,
    storage: StorageConfig | None = None,
    runtime: RuntimeConfig | None = None,
    seeds: list[int] | None = None,
    workflow_overrides: dict[str, object] | None = None,
    judge_config: dict[str, object] | None = None,
    environment_metadata: dict[str, str] | None = None,
    themis_version: str | None = None,
    python_version: str = "3.12",
    platform: str = "unknown",
    store: RunStore | None = None,
    subscribers: list[LifecycleSubscriber] | None = None,
    tracing_provider: TracingProvider | None = None,
) -> RunResult

Compile and run a Themis experiment asynchronously through the Layer 1 API.

event_from_dict ¶

event_from_dict(payload: dict[str, Any]) -> RunEvent

Deserialize a stored event payload into the correct event model.
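
Deserializing into "the correct event model" typically means dispatching on a discriminator field. The sketch below shows that pattern with plain dataclasses. The discriminator key `event_type`, its values, and the event classes are hypothetical; the stored payload format used by Themis is not described on this page.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class RunStartedSketch:
    run_id: str


@dataclass(frozen=True)
class RunFailedSketch:
    run_id: str
    error: str


# Hypothetical registry mapping discriminator values to event classes.
EVENT_TYPES = {
    "run_started": RunStartedSketch,
    "run_failed": RunFailedSketch,
}


def event_from_dict_sketch(payload: dict[str, Any]) -> object:
    """Build the event class named by the payload's `event_type` key."""
    fields = dict(payload)  # copy so the caller's dict is not mutated
    try:
        event_cls = EVENT_TYPES[fields.pop("event_type")]
    except KeyError as exc:
        raise ValueError(f"unknown or missing event_type: {exc}") from exc
    return event_cls(**fields)


started = event_from_dict_sketch({"event_type": "run_started", "run_id": "r1"})
failed = event_from_dict_sketch(
    {"event_type": "run_failed", "run_id": "r1", "error": "boom"}
)
```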

export_evaluation_bundle ¶

export_evaluation_bundle(
    store: RunStore, run_id: str
) -> EvaluationBundle

Export stored evaluation artifacts into a portable bundle.

export_generation_bundle ¶

export_generation_bundle(
    store: RunStore, run_id: str
) -> GenerationBundle

Export stored generation artifacts into a portable bundle.

export_parse_bundle ¶

export_parse_bundle(
    store: RunStore, run_id: str
) -> ParseBundle

Export stored parse artifacts into a portable bundle.

export_reduction_bundle ¶

export_reduction_bundle(
    store: RunStore, run_id: str
) -> ReductionBundle

Export stored reduction artifacts into a portable bundle.

export_score_bundle ¶

export_score_bundle(
    store: RunStore, run_id: str
) -> ScoreBundle

Export stored score artifacts into a portable bundle.

get_evaluation_execution ¶

get_evaluation_execution(
    store: RunStore,
    run_id: str,
    case_id: str,
    metric_id: str,
    *,
    dataset_id: str | None = None,
    case_key: str | None = None,
) -> EvaluationExecution | None

Return one stored workflow execution for a case and metric.

get_execution_state ¶

get_execution_state(
    store: RunStore, run_id: str
) -> ExecutionState

Return the persisted execution state for a run.

get_run_snapshot ¶

get_run_snapshot(
    store: RunStore, run_id: str
) -> RunSnapshot

Return the persisted snapshot for a run.

import_evaluation_bundle ¶

import_evaluation_bundle(
    store: RunStore, bundle: EvaluationBundle
) -> None

Import evaluation artifacts from a bundle into a store.

import_generation_bundle ¶

import_generation_bundle(
    store: RunStore, bundle: GenerationBundle
) -> None

Import generation artifacts from a bundle into a store.

import_parse_bundle ¶

import_parse_bundle(
    store: RunStore, bundle: ParseBundle
) -> None

Import parse artifacts from a bundle into a store.

import_reduction_bundle ¶

import_reduction_bundle(
    store: RunStore, bundle: ReductionBundle
) -> None

Import reduction artifacts from a bundle into a store.

import_score_bundle ¶

import_score_bundle(
    store: RunStore, bundle: ScoreBundle
) -> None

Import score artifacts from a bundle into a store.

snapshot_report ¶

snapshot_report(
    snapshot: RunSnapshot,
    run_metadata: dict[str, JSONValue] | None = None,
) -> dict[str, JSONValue]

Return a JSON-serializable summary for a compiled snapshot.

sqlite_store ¶

sqlite_store(path: str | Path) -> SqliteRunStore

Build a SQLite-backed store.

Adapters:

themis.adapters ¶

Provider-backed generator adapters for Themis.
