This benchmark measures a model's ability to suggest relevant category names for a given set of related concepts or items.
Each test is one prompt sent to every model in the benchmark.
With 10 tests and 110 models, the benchmark collects 2200 arena votes for reliable rankings.
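The test layout described above, where each test prompt goes to every model, can be sketched as a simple cross product. This is only an illustrative sketch: the names `build_runs`, `prompts`, and `models`, and the placeholder identifiers, are assumptions and not part of the benchmark itself. How the resulting responses are paired up and turned into arena votes is not specified here, so the sketch does not model it.

```python
from itertools import product


def build_runs(prompts, models):
    """Pair every test prompt with every model: one run per (prompt, model)."""
    return list(product(prompts, models))


# Hypothetical placeholders standing in for the real prompts and model names.
prompts = [f"test-{i}" for i in range(1, 11)]    # 10 tests
models = [f"model-{j}" for j in range(1, 111)]   # 110 models

runs = build_runs(prompts, models)
print(len(runs))  # number of (prompt, model) responses to collect
```

Each entry in `runs` is one response to generate, so every model answers exactly the same set of prompts and can be compared on equal footing.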