Statistical Comparisons¶

The benchmark-first read side starts with semantic aggregation.

Aggregate First¶

rows = result.aggregate(
    group_by=["model_id", "slice_id", "metric_id", "prompt_variant_id"]
)

Use dimensions directly when they matter:

rows = result.aggregate(group_by=["model_id", "source", "metric_id"])

Pair By Benchmark Semantics¶

comparison = result.paired_compare(
    metric_id="exact_match",
    group_by="slice_id",
)

This compares models on shared benchmark items within the requested grouping key. It replaces the old public habit of thinking in task_id tables first.

CLI Analogs¶

Use themis-quickcheck scores with:

--slice qa
--dimension source=synthetic

Those filters read the same benchmark summary fields stored in SQLite.