Categorization Bench

This benchmark measures the model's ability to suggest relevant category names for a given set of related concepts or items.

May 16, 2026
10 tasks
110 models
$2.1665
karllorey
Public

Results (Preliminary)

54 of 110 models are on the leaderboard so far; more join with each arena vote.

Rank  Model              Provider   Score
1     GPT-5.5            OpenAI     100%
2     Gemini 2.5 Pro     Google     90%
3     Claude Opus 4.6    Anthropic  79%
4     Claude Sonnet 4.6  Anthropic  77%
5     Llama 4 Scout      Meta       76%

Model Comparison

Compare performance across models and prompts.

Model                   Provider   Via         Duration  Cost     Score
GPT-5.5                 OpenAI     OpenRouter  5.6s      $0.0525  100%
Gemini 2.5 Pro          Google     OpenRouter  15.3s     $0.1209  90%
Claude Opus 4.6         Anthropic  OpenRouter  3.6s      $0.0220  79%
Claude Sonnet 4.6       Anthropic  OpenRouter  2.8s      $0.0109  77%
Llama 4 Scout           Meta       OpenRouter  1.1s      $0.0003  76%
Llama 3.3 70B Instruct  Meta       OpenRouter  3.1s      $0.0003  74%
gpt-oss-20b             OpenAI     OpenRouter  4.2s      $0.0005  73%
Claude Opus 4.7         Anthropic  OpenRouter  3.3s      $0.0285  70%
GLM 5.1                 Z.ai       OpenRouter  24.6s     $0.0310  65%
DeepSeek V4 Pro         DeepSeek   OpenRouter  16.1s     $0.0196  65%

Value Analysis

Find models with the best balance of quality, cost, and speed.

Chart legend: best value frontier; best value (highlighted); dot size = duration.

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
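
One way to read "best score at their price point" is as a Pareto frontier over cost and score: a model is on the frontier if no other model is both cheaper (or equally cheap) and higher scoring. The sketch below applies that reading to the ten rows listed under Model Comparison. It is an assumption about how the highlighting works, not the site's confirmed method; the names `Result`, `RESULTS`, and `value_frontier` are hypothetical, and duration is ignored.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    cost_usd: float  # cost figure listed under Model Comparison
    score_pct: int   # benchmark score in percent


# Figures copied from the Model Comparison section (ten models only).
RESULTS = [
    Result("GPT-5.5", 0.0525, 100),
    Result("Gemini 2.5 Pro", 0.1209, 90),
    Result("Claude Opus 4.6", 0.0220, 79),
    Result("Claude Sonnet 4.6", 0.0109, 77),
    Result("Llama 4 Scout", 0.0003, 76),
    Result("Llama 3.3 70B Instruct", 0.0003, 74),
    Result("gpt-oss-20b", 0.0005, 73),
    Result("Claude Opus 4.7", 0.0285, 70),
    Result("GLM 5.1", 0.0310, 65),
    Result("DeepSeek V4 Pro", 0.0196, 65),
]


def value_frontier(results):
    """Return the results that no cheaper-or-equal result outscores."""
    frontier = []
    best_score = -1
    # Walk from cheapest to most expensive; on a cost tie, higher score first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score_pct)):
        # Keep a result only if it beats every cheaper result seen so far.
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


if __name__ == "__main__":
    for r in value_frontier(RESULTS):
        print(f"{r.model}: ${r.cost_usd:.4f}, {r.score_pct}%")
```

Under this assumption, and restricted to these ten rows, the frontier is Llama 4 Scout, Claude Sonnet 4.6, Claude Opus 4.6, and GPT-5.5. The page's own chart may cover more models and may weigh duration as well, so its highlighted set can differ.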

Token Usage

Average tokens used per model across all prompts.

Model          Via         Avg tokens  In   Out
Hy3 preview    OpenRouter  2,215       116  2,099
GLM 4.7        OpenRouter  1,892       108  1,784
GLM 4.7 Flash  OpenRouter  1,693       106  1,587
Qwen3.6 Plus   OpenRouter  1,588       118  1,470
Qwen3.6 Flash  OpenRouter  1,529       119  1,410
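
The averages above are the sum of input and output tokens, and output dominates in every listed row. A minimal sketch that recomputes the totals and the output share from the listed figures; the names `TOKEN_USAGE` and `summarize` are hypothetical.

```python
# (input tokens, output tokens) per model, copied from the Token Usage list.
TOKEN_USAGE = {
    "Hy3 preview": (116, 2099),
    "GLM 4.7": (108, 1784),
    "GLM 4.7 Flash": (106, 1587),
    "Qwen3.6 Plus": (118, 1470),
    "Qwen3.6 Flash": (119, 1410),
}


def summarize(tokens_in, tokens_out):
    """Return (total tokens, fraction of the total that is output)."""
    total = tokens_in + tokens_out
    return total, tokens_out / total


if __name__ == "__main__":
    for model, (tokens_in, tokens_out) in TOKEN_USAGE.items():
        total, output_share = summarize(tokens_in, tokens_out)
        print(f"{model}: {total:,} total, {output_share:.0%} output")
```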