Spatial Reasoning: Germany

This benchmark evaluates an LLM's knowledge of German and Central European geography, focusing on relative positioning, proximity, and spatial orientation between cities and landmarks.

Jan 9, 2026
11 tasks
110 models
$1.6412
karllorey
Link only

Results (Preliminary)

29 of the 110 models are on the leaderboard so far. More are added with each arena vote.

1. Claude Sonnet 4.5 (Anthropic): 97%
2. Claude Haiku 4.5 (Anthropic): 97%
3. Claude Opus 4.5 (Anthropic): 95%
4. Gemini 3 Flash Preview (Google): 95%
5. DeepSeek V3.2 Speciale (DeepSeek): 95%

Model Comparison

Compare performance across models and prompts.

All models were run on OpenRouter.

Model                    Provider    Duration  Cost     Score
Claude Sonnet 4.5        Anthropic   2.9s      $0.0254  97%
Claude Haiku 4.5         Anthropic   1.7s      $0.0075  97%
Claude Opus 4.5          Anthropic   3.4s      $0.0424  95%
Gemini 3 Flash Preview   Google      1.3s      $0.0032  95%
DeepSeek V3.2 Speciale   DeepSeek    111.1s    $0.0263  95%
Kimi K2 Thinking         MoonshotAI  26.6s     $0.0059  95%
GPT-5 Mini               OpenAI      12.0s     $0.0252  95%
GPT-5.2                  OpenAI      4.2s      $0.0363  95%
Gemini 2.5 Pro           Google      10.4s     $0.2173  95%
Claude 3.5 Haiku         Anthropic   2.2s      $0.0059  94%

Value Analysis

Find models with the best balance of quality, cost, and speed.

Chart legend: highlighted dots = best value (on the best value frontier); dot size = duration.

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
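
One way to read "best score at their price point" is as a Pareto frontier over cost and score: a model is on the frontier if no other model is both at least as cheap and at least as accurate. The sketch below applies that reading to the ten models listed under Model Comparison. The frontier rule, the `Model` type, and the function name are assumptions for illustration, not the site's actual method, and the scores are the rounded percentages shown on the page.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost_usd: float   # cost as listed under Model Comparison
    score_pct: int    # rounded score as listed
    seconds: float    # duration as listed (dot size in the chart)


MODELS = [
    Model("Claude Sonnet 4.5", 0.0254, 97, 2.9),
    Model("Claude Haiku 4.5", 0.0075, 97, 1.7),
    Model("Claude Opus 4.5", 0.0424, 95, 3.4),
    Model("Gemini 3 Flash Preview", 0.0032, 95, 1.3),
    Model("DeepSeek V3.2 Speciale", 0.0263, 95, 111.1),
    Model("Kimi K2 Thinking", 0.0059, 95, 26.6),
    Model("GPT-5 Mini", 0.0252, 95, 12.0),
    Model("GPT-5.2", 0.0363, 95, 4.2),
    Model("Gemini 2.5 Pro", 0.2173, 95, 10.4),
    Model("Claude 3.5 Haiku", 0.0059, 94, 2.2),
]


def best_value_frontier(models):
    """Return names of models that no cheaper-or-equal model matches or beats on score."""
    # Walk from cheapest to most expensive; on equal cost, try the higher score first.
    ordered = sorted(models, key=lambda m: (m.cost_usd, -m.score_pct))
    frontier = []
    best_score = float("-inf")
    for m in ordered:
        # Keep a model only if it strictly improves on every cheaper model's score.
        if m.score_pct > best_score:
            frontier.append(m.name)
            best_score = m.score_pct
    return frontier


print(best_value_frontier(MODELS))
```

Under these assumptions the frontier contains only Gemini 3 Flash Preview and Claude Haiku 4.5. Because the listed scores are rounded, ties at 95% may hide small differences that the real chart resolves differently.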

Token Usage

Average tokens used per model across all prompts.

All models were run on OpenRouter.

Model                    Avg tokens  Input  Output
DeepSeek V3.2 Speciale   2,920       29     2,891
GPT-5 Nano               1,312       27     1,284
Gemini 2.5 Pro           1,007       22     985
GLM 4.7                  947         27     920
gpt-oss-20b              714         87     627
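
The token figures above can be sanity-checked with a little arithmetic: each average total should equal input plus output, and the output share shows how much of the usage is generated text. The snippet below is a minimal sketch using only the five rows listed. The helper names are hypothetical, and any one-token gap is assumed to come from rounding the averages independently.

```python
# (average total, average input, average output) tokens as listed above
USAGE = {
    "DeepSeek V3.2 Speciale": (2920, 29, 2891),
    "GPT-5 Nano": (1312, 27, 1284),
    "Gemini 2.5 Pro": (1007, 22, 985),
    "GLM 4.7": (947, 27, 920),
    "gpt-oss-20b": (714, 87, 627),
}


def rounding_gap(total, tokens_in, tokens_out):
    """Difference between the listed average and the sum of its parts."""
    return total - (tokens_in + tokens_out)


def output_share(total, tokens_in, tokens_out):
    """Fraction of the input+output tokens that are output tokens."""
    return tokens_out / (tokens_in + tokens_out)


for name, row in USAGE.items():
    print(f"{name}: gap={rounding_gap(*row)}, output share={output_share(*row):.1%}")
```

Four of the five rows add up exactly. GPT-5 Nano is off by one token (27 + 1,284 = 1,311 against a listed 1,312), which is consistent with rounding.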