Skip to content

CLI reference

Command groups

Command What it does When to use it Key inputs / constraints
quick-eval Runs inline examples, files, Hugging Face datasets, or catalog benchmarks with minimal setup You want the shortest path to an evaluation run Trades flexibility for convenience
run Compiles and executes a config-backed experiment You want the main config-driven runtime path Accepts --config and optional --until-stage
replay Re-runs downstream stages from stored upstream artifacts You want to regenerate reduction, parse, score, or judge results without fresh generation Requires persisted upstream artifacts
submit Writes deferred-execution manifests You want worker-pool or batch execution instead of immediate in-process execution Requires --mode worker-pool or --mode batch
resume Reopens a stored run and continues according to runtime policy You want to continue interrupted or partially completed persistent work Depends on a persistent store
estimate Prints planner and token-estimate output for a compiled snapshot You want execution counts and token assumptions before running Estimates are informational, not pricing
report Exports score and outcome reports in multiple formats You want shareable output from a stored run Requires a stored run and a format choice
inspect Reads snapshots, state, or evaluation executions from the store You want to diagnose or inspect persisted artifacts Uses subcommands for each payload type
quickcheck Prints a compact status summary for one stored run You want a fast health check instead of a full report Depends on persisted state
compare Compares two persisted benchmark results You want baseline vs candidate analysis across completed runs Requires two config-backed runs
export Writes stage-aware artifact bundles You want portable generation or evaluation artifacts CLI currently covers generation and evaluation bundles
init Scaffolds starter files You want help bootstrapping a new workflow from the shell Keep using Python authoring if you need live objects
worker Executes queued manifests in worker-pool mode You want queue-driven deferred execution Use with manifests produced by submit --mode worker-pool
batch Executes explicit request manifests in batch mode You want request-file driven deferred execution Use with manifests produced by submit --mode batch

Command behavior

Command What it does When to use it Key inputs / constraints
run --config ... [--until-stage ...] Executes the experiment and prints JSON with run_id, status, completed_through_stage, and metric_means You want the main config-driven execution path --until-stage stops intentionally at a stage boundary
replay --config ... --stage reduce|parse|score|judge Re-runs downstream stages from stored upstream artifacts You want fresh downstream scoring without new generation Requires stored upstream artifacts
resume --config ... Reopens the compiled run_id and continues if the store shows pending work You want to continue interrupted persistent work Depends on a persistent store
estimate --config ... Prints planner output, task counts, token estimates, and estimate assumptions You want pre-run sizing and cost-model inputs No pricing is applied by Themis
quickcheck --config ... Prints a compact status summary for a stored run You want a quick operational check Less detail than report or inspect
report --config ... --format ... Exports JSON, Markdown, CSV, or LaTeX projections You want a shareable report from stored state Requires a supported --format
inspect snapshot --config ... Prints the stored RunSnapshot You want identity and provenance details Snapshot inspection is read-only
inspect state --config ... Prints stored execution state You want stage completion, counters, and failure visibility Persistent storage is required outside a still-live memory store
inspect evaluation --config ... --case-id ... --metric-id ... Prints one stored workflow execution You want judge prompts, responses, or workflow artifacts for a specific case Only applies to workflow-backed metrics
compare --baseline-config ... --candidate-config ... Compares persisted benchmark projections from two runs You want paired benchmark analysis Both runs must already exist
submit --config ... --mode worker-pool|batch Writes a manifest for deferred execution You want worker-pool or batch handoff instead of direct execution Follow with worker run or batch run

Subcommands

Command What it does When to use it Key inputs / constraints
quick-eval inline Runs a small inline dataset or prompt set You want the shortest shell-driven smoke test Best for local examples
quick-eval file Loads input data from a file for a quick evaluation run You want a lightweight file-backed run Less structured than a full config-driven experiment
quick-eval huggingface Loads a dataset from Hugging Face for quick evaluation You want a short path to remote dataset-backed evaluation Requires the datasets extra
quick-eval benchmark Runs a shipped named benchmark recipe You want catalog convenience from the shell Requires benchmark-specific extras such as dataset access or code execution backends
inspect snapshot Prints the stored compiled snapshot You want identity and provenance details for a run Requires a stored run
inspect state Prints stored execution state You want progress and failure details Requires a stored run
inspect evaluation Prints one stored workflow evaluation artifact You want per-case judge execution details Requires workflow-backed metrics
export generation Exports generation artifacts to a portable bundle You want portable candidate outputs CLI export currently covers this stage directly
export evaluation Exports evaluation workflow artifacts to a portable bundle You want portable judge execution artifacts Best for workflow-backed metrics
worker run Pulls manifests from the worker queue and executes them You want queue-driven deferred execution Requires a queue root
batch run Executes one explicit batch request manifest You want request-file driven deferred execution Requires a request manifest path

Output notes

JSON-producing commands generally emit compact machine-readable JSON to stdout. Commands that inspect stored runs require a persistent store unless the current process still owns the original memory store.

report and exported score tables include outcome, error_category, error_message, and details columns alongside metric values.

Current CLI boundary

Surface Current behavior Use instead when Notes
run Exposes --until-stage directly in the CLI You want deliberate stage-boundary stopping from the shell Good fit for later replay or export
export Exposes only generation and evaluation bundle export in the CLI You need reduction, parse, or score bundles Use the Python export helpers for those intermediate stages
Reduction, parse, and score bundle handoff Supported by the runtime but not surfaced in the CLI You need portable intermediate-stage artifacts Use Python helpers until the CLI grows those commands