Karlsruhe Local Knowledge Benchmark

Tests a model's specific knowledge of the history, geography, transportation, and culture of the German city of Karlsruhe.

Jan 7, 2026

9 tasks

110 models

$1.1126

user_c636b9d7

Results (Preliminary)

27 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

Model	Duration	Cost	Score
GPT-5.2 by OpenAI	11.2s	$0.1344	98%
Claude Opus 4.5 by Anthropic	6.3s	$0.1530	87%
Claude Sonnet 4.5 by Anthropic	6.7s	$0.0992	85%
Gemini 3 Flash Preview by Google	3.8s	$0.0276	84%
Gemini 2.5 Pro by Google	17.7s	$0.4417	81%

Model Comparison

Compare performance across models and prompts.

Model	Duration	Cost	Score
GPT-5.2 by OpenAI on OpenRouter	11.2s	$0.1344	98%
Claude Opus 4.5 by Anthropic on OpenRouter	6.3s	$0.1530	87%
Claude Sonnet 4.5 by Anthropic on OpenRouter	6.7s	$0.0992	85%
Gemini 3 Flash Preview by Google on OpenRouter	3.8s	$0.0276	84%
Gemini 2.5 Pro by Google on OpenRouter	17.7s	$0.4417	81%
Claude 3 Haiku by Anthropic on OpenRouter	4.3s	$0.0088	76%
GLM 4.7 by Z.ai on OpenRouter	38.1s	$0.0473	74%
DeepSeek V3.2 by DeepSeek on OpenRouter	22.9s	$0.0041	72%
Claude Haiku 4.5 by Anthropic on OpenRouter	3.1s	$0.0291	72%
Ministral 3 8B 2512 by Mistral on OpenRouter	3.2s	$0.0021	71%

Value Analysis

Find models with the best balance of quality, cost, and speed.

Chart: best value frontier. Highlighted models offer the best score at their price point. Dot size represents duration: larger dots take longer to produce a result.
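The idea of "best score at their price point" can be illustrated with a small sketch. It assumes the frontier is a plain Pareto frontier over cost and score (a model is kept only if no cheaper or equally priced model scores at least as well), which may differ from how the page actually computes it. The `Result` type and `best_value_frontier` function are hypothetical names, and the data covers only the ten models listed in the comparison table, not all 110.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    cost_usd: float
    score_pct: int


# Cost and score values copied from the Model Comparison table.
RESULTS = [
    Result("GPT-5.2", 0.1344, 98),
    Result("Claude Opus 4.5", 0.1530, 87),
    Result("Claude Sonnet 4.5", 0.0992, 85),
    Result("Gemini 3 Flash Preview", 0.0276, 84),
    Result("Gemini 2.5 Pro", 0.4417, 81),
    Result("Claude 3 Haiku", 0.0088, 76),
    Result("GLM 4.7", 0.0473, 74),
    Result("DeepSeek V3.2", 0.0041, 72),
    Result("Claude Haiku 4.5", 0.0291, 72),
    Result("Ministral 3 8B 2512", 0.0021, 71),
]


def best_value_frontier(results: list[Result]) -> list[Result]:
    """Return the models not beaten on score by any cheaper (or equally priced) model.

    Assumption: a simple Pareto frontier over (cost, score); duration is ignored.
    """
    frontier = []
    best_score = -1
    # Walk from cheapest to most expensive; on equal cost, try the higher score first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score_pct)):
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


if __name__ == "__main__":
    for r in best_value_frontier(RESULTS):
        print(f"{r.model}: ${r.cost_usd:.4f}, {r.score_pct}%")
```

Under these assumptions, the frontier for the ten listed models runs from Ministral 3 8B 2512 through DeepSeek V3.2, Claude 3 Haiku, Gemini 3 Flash Preview, and Claude Sonnet 4.5 to GPT-5.2. Each of the remaining four models is matched or beaten on score by a cheaper model.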

Token Usage

Average tokens used per model across all prompts.

Model	Provider	Avg tokens	Input	Output
GPT-5 Nano	OpenRouter	2,374	19	2,355
GLM 4.7	OpenRouter	2,253	21	2,232
Gemini 2.5 Pro	OpenRouter	1,758	14	1,744
gpt-oss-20b	OpenRouter	1,441	78	1,363
gpt-oss-120b	OpenRouter	1,280	80	1,200
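As a quick sanity check on the figures above, the following sketch confirms that each reported average equals input plus output tokens and computes what fraction of the tokens are output. The `TOKEN_USAGE` mapping and the `output_share` helper are hypothetical names introduced for illustration; the numbers are copied from the listing above.

```python
# model -> (average total tokens, input tokens, output tokens), as reported above
TOKEN_USAGE = {
    "GPT-5 Nano": (2374, 19, 2355),
    "GLM 4.7": (2253, 21, 2232),
    "Gemini 2.5 Pro": (1758, 14, 1744),
    "gpt-oss-20b": (1441, 78, 1363),
    "gpt-oss-120b": (1280, 80, 1200),
}


def output_share(tokens_in: int, tokens_out: int) -> float:
    """Fraction of all tokens that are output tokens."""
    return tokens_out / (tokens_in + tokens_out)


if __name__ == "__main__":
    for model, (avg, tokens_in, tokens_out) in TOKEN_USAGE.items():
        consistent = avg == tokens_in + tokens_out
        share = output_share(tokens_in, tokens_out)
        print(f"{model}: total matches = {consistent}, output share = {share:.1%}")
```

For all five models the total matches, and output tokens make up well over 90% of usage, which fits short prompts answered at length.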