Categorization Bench

This benchmark measures the model's ability to suggest relevant category names for a given set of related concepts or items.

May 16, 2026
10 tasks
110 models
$2.1566
user_c636b9d7
Public

Results (Preliminary)

52 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

Rank  Model              Provider   Score
1     GPT-5.5            OpenAI     99%
2     Gemini 2.5 Pro     Google     91%
3     Claude Sonnet 4.6  Anthropic  72%
4     Claude Opus 4.7    Anthropic  70%
5     Llama 4 Scout      Meta       68%

Model Comparison

Compare performance across models and prompts.

Model                   Provider   Host        Duration  Cost     Score
GPT-5.5                 OpenAI     OpenRouter  5.6s      $0.0525  99%
Gemini 2.5 Pro          Google     OpenRouter  15.3s     $0.1209  91%
Claude Sonnet 4.6       Anthropic  OpenRouter  2.8s      $0.0109  72%
Claude Opus 4.7         Anthropic  OpenRouter  3.3s      $0.0285  70%
Llama 4 Scout           Meta       OpenRouter  1.1s      $0.0003  68%
Claude Opus 4.6         Anthropic  OpenRouter  3.6s      $0.0220  68%
Llama 3.3 70B Instruct  Meta       OpenRouter  3.1s      $0.0003  65%
GPT-5.5 Pro             OpenAI     OpenRouter  34.8s     $1.2578  63%
Claude Haiku 4.5        Anthropic  OpenRouter  1.8s      $0.0040  62%
Gemini 3.1 Pro Preview  Google     OpenRouter  8.8s      $0.0778  62%

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart legend: Best value frontier; Best value; Size = duration]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
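
One way to read "best score at their price point" is as a Pareto frontier over cost and score: a model is kept only if no cheaper (or equally cheap) model scores at least as well. The sketch below applies that reading to the ten rows listed under Model Comparison. The page does not state its actual selection rule, so the `Result` type, the `best_value_frontier` function, and the decision to ignore duration are all assumptions made for illustration.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    cost_usd: float
    score_pct: int


# Cost and score values copied from the Model Comparison rows.
RESULTS = [
    Result("GPT-5.5", 0.0525, 99),
    Result("Gemini 2.5 Pro", 0.1209, 91),
    Result("Claude Sonnet 4.6", 0.0109, 72),
    Result("Claude Opus 4.7", 0.0285, 70),
    Result("Llama 4 Scout", 0.0003, 68),
    Result("Claude Opus 4.6", 0.0220, 68),
    Result("Llama 3.3 70B Instruct", 0.0003, 65),
    Result("GPT-5.5 Pro", 1.2578, 63),
    Result("Claude Haiku 4.5", 0.0040, 62),
    Result("Gemini 3.1 Pro Preview", 0.0778, 62),
]


def best_value_frontier(results: list[Result]) -> list[Result]:
    """Hypothetical frontier rule: keep a model only if it beats every
    model that costs the same or less (ties keep the first one seen)."""
    frontier: list[Result] = []
    best_score = -1
    # Walk from cheapest to most expensive; at equal cost, higher score first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score_pct)):
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


if __name__ == "__main__":
    for r in best_value_frontier(RESULTS):
        print(f"{r.model}: {r.score_pct}% at ${r.cost_usd:.4f}")
```

Under this assumed rule, the frontier for the listed rows is Llama 4 Scout, Claude Sonnet 4.6, and GPT-5.5. The chart on the page may highlight a different set, since it covers more models and may weigh duration as well.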

Token Usage

Average tokens used per model across all prompts.

Model          Host        Avg tokens  In   Out
Hy3 preview    OpenRouter  2,215       116  2,099
GLM 4.7        OpenRouter  1,892       108  1,784
GLM 4.7 Flash  OpenRouter  1,693       106  1,587
Qwen3.6 Plus   OpenRouter  1,588       118  1,470
Qwen3.6 Flash  OpenRouter  1,529       119  1,410
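
Each average above is the sum of its input and output token counts. The short sketch below recomputes the totals and the share contributed by output tokens for the five listed models. The `USAGE` mapping and the `summarize` helper are illustrative names, not part of the benchmark tooling.

```python
# (input tokens, output tokens) per model, copied from the Token Usage list.
USAGE = {
    "Hy3 preview": (116, 2099),
    "GLM 4.7": (108, 1784),
    "GLM 4.7 Flash": (106, 1587),
    "Qwen3.6 Plus": (118, 1470),
    "Qwen3.6 Flash": (119, 1410),
}


def summarize(tokens_in: int, tokens_out: int) -> tuple[int, float]:
    """Return the total token count and the fraction that is output."""
    total = tokens_in + tokens_out
    return total, tokens_out / total


if __name__ == "__main__":
    for model, (tokens_in, tokens_out) in USAGE.items():
        total, out_share = summarize(tokens_in, tokens_out)
        print(f"{model}: {total:,} avg tokens, {out_share:.1%} output")
```

For all five models, output tokens account for more than 90% of the average, so differences in token usage here come almost entirely from response length rather than prompt length.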