Categorization Bench

This benchmark measures a model's ability to suggest relevant category names for a given set of related concepts or items.

May 16, 2026

10 tasks

110 models

$2.1566

user_c636b9d7

Public

Results (Preliminary)

52 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

GPT-5.5 by OpenAI: 5.6s, $0.0525, 99% score
Gemini 2.5 Pro by Google: 15.3s, $0.1209, 91% score
Claude Sonnet 4.6 by Anthropic: 2.8s, $0.0109, 72% score
Claude Opus 4.7 by Anthropic: 3.3s, $0.0285, 70% score
Llama 4 Scout by Meta: 1.1s, $0.0003, 68% score

Model Comparison

Compare performance across models and prompts.

Model	Duration	Cost	Score
GPT-5.5 by OpenAI on OpenRouter	5.6s	$0.0525	99%
Gemini 2.5 Pro by Google on OpenRouter	15.3s	$0.1209	91%
Claude Sonnet 4.6 by Anthropic on OpenRouter	2.8s	$0.0109	72%
Claude Opus 4.7 by Anthropic on OpenRouter	3.3s	$0.0285	70%
Llama 4 Scout by Meta on OpenRouter	1.1s	$0.0003	68%
Claude Opus 4.6 by Anthropic on OpenRouter	3.6s	$0.0220	68%
Llama 3.3 70B Instruct by Meta on OpenRouter	3.1s	$0.0003	65%
GPT-5.5 Pro by OpenAI on OpenRouter	34.8s	$1.2578	63%
Claude Haiku 4.5 by Anthropic on OpenRouter	1.8s	$0.0040	62%
Gemini 3.1 Pro Preview by Google on OpenRouter	8.8s	$0.0778	62%
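For readers who want to work with the table programmatically, the following is a minimal sketch of parsing one row into typed fields. It assumes the rows are tab-separated and that every model label follows the "<model> by <vendor> on OpenRouter" pattern seen above; the `Result` type and `parse_row` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Result(NamedTuple):
    """One row of the Model Comparison table (hypothetical structure)."""
    model: str
    vendor: str
    seconds: float
    cost_usd: float
    score_pct: int


def parse_row(line: str) -> Result:
    """Parse a tab-separated row such as 'X by Y on OpenRouter<TAB>5.6s<TAB>$0.05<TAB>99%'."""
    label, duration, cost, score = line.split("\t")
    # Assumption: the label always contains " by " and ends with " on OpenRouter".
    model, rest = label.split(" by ", 1)
    vendor = rest.removesuffix(" on OpenRouter")
    return Result(
        model=model,
        vendor=vendor,
        seconds=float(duration.rstrip("s")),
        cost_usd=float(cost.lstrip("$")),
        score_pct=int(score.rstrip("%")),
    )


row = parse_row("GPT-5.5 Pro by OpenAI on OpenRouter\t34.8s\t$1.2578\t63%")
print(row)
```

Keeping duration, cost, and score as numbers rather than strings makes it straightforward to sort or filter the rows afterwards.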

Value Analysis

Find models with the best balance of quality, cost, and speed.

Chart legend: best value frontier; best value models; dot size = duration.

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
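One way to read "best score at their price point" is as a Pareto frontier over cost and score. The sketch below applies that reading to the ten rows of the Model Comparison table only. The frontier definition is an assumption (a model is kept when no cheaper or equally cheap model scores at least as well), and the page's own chart may use a different rule or a larger set of models; `best_value_frontier` is a hypothetical helper name.

```python
# Rows copied from the Model Comparison table: (model, cost in USD, score in %).
ROWS = [
    ("GPT-5.5", 0.0525, 99),
    ("Gemini 2.5 Pro", 0.1209, 91),
    ("Claude Sonnet 4.6", 0.0109, 72),
    ("Claude Opus 4.7", 0.0285, 70),
    ("Llama 4 Scout", 0.0003, 68),
    ("Claude Opus 4.6", 0.0220, 68),
    ("Llama 3.3 70B Instruct", 0.0003, 65),
    ("GPT-5.5 Pro", 1.2578, 63),
    ("Claude Haiku 4.5", 0.0040, 62),
    ("Gemini 3.1 Pro Preview", 0.0778, 62),
]


def best_value_frontier(rows):
    """Return models that no cheaper-or-equal model matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; among equal costs, highest score first.
    for model, cost, score in sorted(rows, key=lambda r: (r[1], -r[2])):
        # Keep a model only if it improves on every cheaper model's score.
        if score > best_score:
            frontier.append(model)
            best_score = score
    return frontier


frontier = best_value_frontier(ROWS)
print(frontier)
```

Under this assumed definition, a model that costs more than another without scoring higher drops off the frontier, regardless of how fast it is; duration is not part of the comparison here.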

Token Usage

Average tokens used per model across all prompts.

Hy3 preview on OpenRouter: 2,215 avg (116 in / 2,099 out)
GLM 4.7 on OpenRouter: 1,892 avg (108 in / 1,784 out)
GLM 4.7 Flash on OpenRouter: 1,693 avg (106 in / 1,587 out)
Qwen3.6 Plus on OpenRouter: 1,588 avg (118 in / 1,470 out)
Qwen3.6 Flash on OpenRouter: 1,529 avg (119 in / 1,410 out)
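Each average shown here is the sum of the input and output figures in parentheses. As a small worked check, the sketch below recomputes those totals and the share of tokens that are output for the five models listed; the `summarize` helper is a hypothetical name, and the numbers are copied from the entries above.

```python
# (model, average input tokens, average output tokens), copied from the list above.
USAGE = [
    ("Hy3 preview", 116, 2099),
    ("GLM 4.7", 108, 1784),
    ("GLM 4.7 Flash", 106, 1587),
    ("Qwen3.6 Plus", 118, 1470),
    ("Qwen3.6 Flash", 119, 1410),
]


def summarize(usage):
    """Return (model, total tokens, output share of total) for each entry."""
    return [(model, tin + tout, tout / (tin + tout)) for model, tin, tout in usage]


summary = summarize(USAGE)
for model, total, share in summary:
    print(f"{model}: {total:,} avg tokens, {share:.1%} output")
```

For these five entries the input side is nearly constant (106 to 119 tokens), so the differences in average usage come almost entirely from output length.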