Statistical Comparisons
The benchmark-first read side starts with semantic aggregation.
Aggregate First
Use dimensions directly when they matter:
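As a library-agnostic illustration of aggregating by dimension, the sketch below groups score rows by model, slice, and one dimension, then averages them. It is plain Python and does not use the Themis API; the row fields (`model`, `slice`, `source`, `score`) and the `aggregate` helper are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical score rows; field names are illustrative, not the Themis schema.
rows = [
    {"model": "a", "slice": "qa", "source": "synthetic", "score": 1.0},
    {"model": "a", "slice": "qa", "source": "human", "score": 0.0},
    {"model": "b", "slice": "qa", "source": "synthetic", "score": 0.5},
    {"model": "b", "slice": "qa", "source": "synthetic", "score": 1.0},
]


def aggregate(rows, keys):
    """Return the mean score for each unique combination of the given keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["score"])
    return {key: mean(scores) for key, scores in groups.items()}


# Include the "source" dimension in the grouping key only when it matters.
by_source = aggregate(rows, ["model", "slice", "source"])
by_slice = aggregate(rows, ["model", "slice"])
```

Dropping `source` from the key list collapses the result back to one mean per model and slice.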
Pair By Benchmark Semantics
This compares models on shared benchmark items within the requested grouping
key. It replaces the old public habit of thinking in task_id tables first.
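The pairing idea can be sketched in plain Python: keep only the items that both models scored, then take per-item differences inside each group. This is a minimal sketch under stated assumptions, not the Themis implementation; `paired_differences`, `item_id`, and the other field names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def paired_differences(rows, model_a, model_b, group_key):
    """Per-group score differences (a - b) on items scored by both models."""
    scores = defaultdict(dict)  # (group, item_id) -> {model: score}
    for row in rows:
        scores[(row[group_key], row["item_id"])][row["model"]] = row["score"]

    diffs = defaultdict(list)
    for (group, _item), by_model in scores.items():
        # Items seen by only one model are not shared, so they are skipped.
        if model_a in by_model and model_b in by_model:
            diffs[group].append(by_model[model_a] - by_model[model_b])
    return dict(diffs)


# Hypothetical rows; "q3" is scored only by model "a" and is dropped.
rows = [
    {"model": "a", "slice": "qa", "item_id": "q1", "score": 1.0},
    {"model": "b", "slice": "qa", "item_id": "q1", "score": 0.0},
    {"model": "a", "slice": "qa", "item_id": "q2", "score": 1.0},
    {"model": "b", "slice": "qa", "item_id": "q2", "score": 1.0},
    {"model": "a", "slice": "qa", "item_id": "q3", "score": 0.0},
]

diffs = paired_differences(rows, "a", "b", "slice")
mean_diff = {group: mean(values) for group, values in diffs.items()}
```

The per-item differences in `diffs` are the natural input to a paired test; the mean is computed here only to keep the example small.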
CLI Analogs
Use themis-quickcheck scores with:
--slice qa
--dimension source=synthetic
Those filters read the same benchmark summary fields stored in SQLite.
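To make the filter semantics concrete, the sketch below runs an equivalent slice and dimension filter against an in-memory SQLite database using the standard `sqlite3` module. The table name `score_summary` and its columns are assumptions for illustration; the actual schema is not shown here and may differ.

```python
import sqlite3

# Hypothetical summary table; the real Themis schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE score_summary (model TEXT, slice TEXT, source TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO score_summary VALUES (?, ?, ?, ?)",
    [
        ("a", "qa", "synthetic", 1.0),
        ("a", "qa", "human", 0.0),
        ("b", "qa", "synthetic", 0.5),
        ("b", "math", "synthetic", 1.0),
    ],
)

# Rough equivalent of filtering on slice "qa" and dimension source=synthetic.
filtered = conn.execute(
    "SELECT model, AVG(score) FROM score_summary "
    "WHERE slice = ? AND source = ? GROUP BY model ORDER BY model",
    ("qa", "synthetic"),
).fetchall()
conn.close()
```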