This benchmark measures a model's ability to suggest relevant category names for a given set of related concepts or items.
54 of 110 models are on the leaderboard so far. More join with each arena vote.
Expand each prompt to see per-model responses and reasoning.
Compare performance across models and prompts.
Find models with the best balance of quality, cost, and speed.
Average number of tokens each model uses across all prompts.