Compare, export, and report

Goal: generate portable output from stored runs and compare two completed experiments.

When to use this:

Use this guide when execution is already done and the next task is inspection, reporting, export, or comparison.

Procedure

Use:

Reporter.export_json(...), export_markdown(...), export_csv(...), and export_latex(...)
themis report --config ... --format ...
themis compare --baseline-config ... --candidate-config ...
themis export generation|evaluation --config ...

Portable artifact handoff is stage-aware:

generation artifacts: export_generation_bundle(...)
reduction artifacts: export_reduction_bundle(...)
parse artifacts: export_parse_bundle(...)
pure-score artifacts: export_score_bundle(...)
workflow execution artifacts: export_evaluation_bundle(...)

Reporting output is now outcome-aware. benchmark_result.score_rows and CSV exports include:

outcome: correct, incorrect, or error
error_category: for example parse_failure, parse_null, parse_invalid, evaluation_failure, evaluation_partial_failure, or score_failure
error_message: the stored failure reason when the row is an error
details: metric-specific structured payload for downstream qualitative analysis
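
As a rough illustration of how these columns might be consumed downstream, the sketch below tallies rows by outcome and groups error rows by error_category. It assumes a CSV export with a header row containing the outcome, error_category, and error_message columns listed above. The item_id column and the inline sample data are hypothetical stand-ins, not the actual export layout.

```python
import csv
import io
from collections import Counter


def summarize_outcomes(csv_text: str) -> tuple[Counter, Counter]:
    """Count rows per outcome and, for error rows, per error_category."""
    outcomes: Counter = Counter()
    error_categories: Counter = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        outcome = row["outcome"]
        outcomes[outcome] += 1
        if outcome == "error":
            # Fall back to "unknown" if an error row has no category.
            error_categories[row["error_category"] or "unknown"] += 1
    return outcomes, error_categories


# Hypothetical sample standing in for a file written by export_csv(...).
# The item_id column is an assumption made for this illustration.
SAMPLE = """item_id,outcome,error_category,error_message
a,correct,,
b,incorrect,,
c,error,parse_failure,could not parse answer
d,error,score_failure,metric raised
e,correct,,
"""

outcomes, error_categories = summarize_outcomes(SAMPLE)
print(dict(outcomes))
print(dict(error_categories))
```

Keeping error rows separate from incorrect rows lets a report distinguish model mistakes from pipeline failures before any accuracy figure is computed.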

Outside Themis workflows:

external leaderboard construction: export JSON or CSV, then aggregate benchmark_result across runs in your notebook, warehouse, or dashboard job
prompt sweep aggregation: run one experiment per prompt variant, then compare the exported benchmark_result payloads outside Themis
external LM-judge handoff: export generation artifacts, run your own judge or provider batch API outside Themis, then convert the results back with a custom script and import the matching evaluation-stage artifacts

Artifact-interop support for R3-R4 is intentionally script-driven: Themis owns the persistent stage artifacts and the downstream replay path, while mapping an external job's results into Themis-compatible bundle records still happens in your code.

Use the benchmark_result output shape described above when you build downstream leaderboards or prompt-sweep dashboards outside Themis. Themis owns the per-run read models; cross-run aggregation is expected to happen in your notebook, warehouse, or reporting job.
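
A minimal sketch of that cross-run aggregation, assuming each run's score rows have already been loaded from a JSON or CSV export into a list of dicts carrying the outcome field. The run names, the RunSummary type, and the choice to count error rows against accuracy are illustrative assumptions, not Themis behavior.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Per-run totals derived from exported score rows (hypothetical type)."""
    run_name: str
    total: int
    correct: int
    errors: int

    @property
    def accuracy(self) -> float:
        # Error rows stay in the denominator, so failures lower accuracy.
        return self.correct / self.total if self.total else 0.0


def summarize_run(run_name: str, rows: list[dict]) -> RunSummary:
    """Reduce one run's score rows to counts by outcome."""
    correct = sum(1 for r in rows if r["outcome"] == "correct")
    errors = sum(1 for r in rows if r["outcome"] == "error")
    return RunSummary(run_name, len(rows), correct, errors)


def build_leaderboard(runs: dict[str, list[dict]]) -> list[RunSummary]:
    """Rank runs by accuracy, best first."""
    summaries = [summarize_run(name, rows) for name, rows in runs.items()]
    return sorted(summaries, key=lambda s: s.accuracy, reverse=True)


# Hypothetical rows; in practice these would come from exported files.
runs = {
    "baseline": [{"outcome": "correct"}, {"outcome": "incorrect"}],
    "candidate": [
        {"outcome": "correct"},
        {"outcome": "correct"},
        {"outcome": "correct"},
        {"outcome": "error"},
    ],
}

leaderboard = build_leaderboard(runs)
for entry in leaderboard:
    print(f"{entry.run_name}: {entry.accuracy:.2f} ({entry.errors} errors)")
```

Whether error rows belong in the denominator is a reporting decision. Excluding them gives accuracy over scored items only, which can hide pipeline failures.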

Variants

| Variant | Best when | Tradeoff | Related APIs / commands |
| --- | --- | --- | --- |
| One-run reporting | You want human-readable or machine-readable output for one completed run | Does not compare multiple runs by itself | `Reporter`, `themis report --config ... --format ...` |
| Portable artifact handoff | Another system should consume stored stage artifacts | You need to manage exported bundles explicitly | `themis export generation\|evaluation`, `export_generation_bundle(...)`, `export_evaluation_bundle(...)` |
| Side-by-side benchmark comparison | You want baseline vs candidate analysis inside Themis | Requires two completed persisted runs | `themis compare --baseline-config ... --candidate-config ...` |
| External leaderboard or dashboard | Aggregation belongs in notebooks, warehouses, or dashboards outside Themis | You own cross-run aggregation logic | JSON or CSV exports, `BenchmarkResult` payloads |
| Prompt sweep analysis | You are comparing multiple prompt variants over repeated runs | Sweep aggregation still happens outside Themis | Exported `benchmark_result` payloads |
| External judge pipeline | Judging should happen outside Themis, then come back as imported artifacts | Requires custom mapping code at the handoff boundary | Export/import bundle helpers |

Expected result

You should have machine-readable or human-readable output that can be shared without rerunning the experiment.
