CLI reference¶

Command groups¶

Command	What it does	When to use it	Key inputs / constraints
`quick-eval`	Runs inline examples, files, Hugging Face datasets, or catalog benchmarks with minimal setup	You want the shortest path to an evaluation run	Trades flexibility for convenience
`run`	Compiles and executes a config-backed experiment	You want the main config-driven runtime path	Accepts `--config` and optional `--until-stage`
`replay`	Re-runs downstream stages from stored upstream artifacts	You want to regenerate reduction, parse, score, or judge results without fresh generation	Requires persisted upstream artifacts
`submit`	Writes deferred-execution manifests	You want worker-pool or batch execution instead of immediate in-process execution	Requires `--mode worker-pool` or `--mode batch`
`resume`	Reopens a stored run and continues according to runtime policy	You want to continue interrupted or partially completed persistent work	Depends on a persistent store
`estimate`	Prints planner and token-estimate output for a compiled snapshot	You want execution counts and token assumptions before running	Estimates are informational, not pricing
`report`	Exports score and outcome reports in multiple formats	You want shareable output from a stored run	Requires a stored run and a format choice
`inspect`	Reads snapshots, state, or evaluation executions from the store	You want to diagnose or inspect persisted artifacts	Uses subcommands for each payload type
`quickcheck`	Prints a compact status summary for one stored run	You want a fast health check instead of a full report	Depends on persisted state
`compare`	Compares two persisted benchmark results	You want baseline vs candidate analysis across completed runs	Requires two config-backed runs
`export`	Writes stage-aware artifact bundles	You want portable generation or evaluation artifacts	CLI currently covers `generation` and `evaluation` bundles
`init`	Scaffolds starter files	You want help bootstrapping a new workflow from the shell	Keep using Python authoring if you need live objects
`worker`	Executes queued manifests in worker-pool mode	You want queue-driven deferred execution	Use with manifests produced by `submit --mode worker-pool`
`batch`	Executes explicit request manifests in batch mode	You want request-file driven deferred execution	Use with manifests produced by `submit --mode batch`

Command behavior¶

Command	What it does	When to use it	Key inputs / constraints
`run --config ... [--until-stage ...]`	Executes the experiment and prints JSON with `run_id`, `status`, `completed_through_stage`, and `metric_means`	You want the main config-driven execution path	`--until-stage` stops intentionally at a stage boundary
`replay --config ... --stage reduce\|parse\|score\|judge`	Re-runs downstream stages from stored upstream artifacts	You want fresh downstream scoring without new generation	Requires stored upstream artifacts
`resume --config ...`	Reopens the compiled `run_id` and continues if the store shows pending work	You want to continue interrupted persistent work	Depends on a persistent store
`estimate --config ...`	Prints planner output, task counts, token estimates, and estimate assumptions	You want pre-run sizing and cost-model inputs	No pricing is applied by Themis
`quickcheck --config ...`	Prints a compact status summary for a stored run	You want a quick operational check	Less detail than `report` or `inspect`
`report --config ... --format ...`	Exports JSON, Markdown, CSV, or LaTeX projections	You want a shareable report from stored state	Requires a supported `--format`
`inspect snapshot --config ...`	Prints the stored `RunSnapshot`	You want identity and provenance details	Snapshot inspection is read-only
`inspect state --config ...`	Prints stored execution state	You want stage completion, counters, and failure visibility	Persistent storage is required outside a still-live memory store
`inspect evaluation --config ... --case-id ... --metric-id ...`	Prints one stored workflow execution	You want judge prompts, responses, or workflow artifacts for a specific case	Only applies to workflow-backed metrics
`compare --baseline-config ... --candidate-config ...`	Compares persisted benchmark projections from two runs	You want paired benchmark analysis	Both runs must already exist
`submit --config ... --mode worker-pool\|batch`	Writes a manifest for deferred execution	You want worker-pool or batch handoff instead of direct execution	Follow with `worker run` or `batch run`

Subcommands¶

Command	What it does	When to use it	Key inputs / constraints
`quick-eval inline`	Runs a small inline dataset or prompt set	You want the shortest shell-driven smoke test	Best for local examples
`quick-eval file`	Loads input data from a file for a quick evaluation run	You want a lightweight file-backed run	Less structured than a full config-driven experiment
`quick-eval huggingface`	Loads a dataset from Hugging Face for quick evaluation	You want a short path to remote dataset-backed evaluation	Requires the `datasets` extra
`quick-eval benchmark`	Runs a shipped named benchmark recipe	You want catalog convenience from the shell	Requires benchmark-specific extras such as dataset access or code execution backends
`inspect snapshot`	Prints the stored compiled snapshot	You want identity and provenance details for a run	Requires a stored run
`inspect state`	Prints stored execution state	You want progress and failure details	Requires a stored run
`inspect evaluation`	Prints one stored workflow evaluation artifact	You want per-case judge execution details	Requires workflow-backed metrics
`export generation`	Exports generation artifacts to a portable bundle	You want portable candidate outputs	CLI export currently covers this stage directly
`export evaluation`	Exports evaluation workflow artifacts to a portable bundle	You want portable judge execution artifacts	Best for workflow-backed metrics
`worker run`	Pulls manifests from the worker queue and executes them	You want queue-driven deferred execution	Requires a queue root
`batch run`	Executes one explicit batch request manifest	You want request-file driven deferred execution	Requires a request manifest path

Output notes¶

JSON-producing commands generally emit compact machine-readable JSON to stdout. Commands that inspect stored runs require a persistent store unless the current process still owns the original memory store.

report and exported score tables include outcome, error_category, error_message, and details columns alongside metric values.

Current CLI boundary¶

Surface	Current behavior	Use instead when	Notes
`run`	Exposes `--until-stage` directly in the CLI	You want deliberate stage-boundary stopping from the shell	Good fit for later replay or export
`export`	Exposes only `generation` and `evaluation` bundle export in the CLI	You need reduction, parse, or score bundles	Use the Python export helpers for those intermediate stages
Reduction, parse, and score bundle handoff	Supported by the runtime but not surfaced in the CLI	You need portable intermediate-stage artifacts	Use Python helpers until the CLI grows those commands