Niederstetten Benchmark

Tests LLMs' knowledge of a specific town in rural Germany.

Jan 28, 2026

5 tasks

110 models

$0.1588

user_c636b9d7

Link only

Results (Preliminary)

33 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

Model	Duration	Cost	Score
Gemini 3 Flash Preview by Google	1.6s	$0.0011	99%
Gemini 2.5 Pro by Google	10.7s	$0.0484	96%
DeepSeek V3.2 by DeepSeek	6.6s	$0.0004	92%
Claude Sonnet 4.5 by Anthropic	3.8s	$0.0102	85%
Mistral Large 3 2512 by Mistral	3.9s	$0.0017	72%

Model Comparison

Compare performance across models and prompts.

Model	Duration	Cost	Score
Gemini 3 Flash Preview by Google on OpenRouter	1.6s	$0.0011	99%
Gemini 2.5 Pro by Google on OpenRouter	10.7s	$0.0484	96%
DeepSeek V3.2 by DeepSeek on OpenRouter	6.6s	$0.0004	92%
Claude Sonnet 4.5 by Anthropic on OpenRouter	3.8s	$0.0102	85%
Mistral Large 3 2512 by Mistral on OpenRouter	3.9s	$0.0017	72%
Claude 3 Haiku by Anthropic on OpenRouter	1.4s	$0.0008	71%
DeepSeek V3 0324 by DeepSeek on OpenRouter	8.4s	$0.0011	66%
Claude Opus 4.5 by Anthropic on OpenRouter	3.8s	$0.0141	64%
Llama 3.3 70B Instruct by Meta on OpenRouter	3.1s	$0.0002	61%
Claude Haiku 4.5 by Anthropic on OpenRouter	2.1s	$0.0033	60%

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart: best value frontier. Legend: best value (highlighted); dot size = duration.]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
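
One way to read "best score at their price point" is as a Pareto frontier over cost and score: a model is on the frontier if no model that costs the same or less scores higher. The sketch below applies that reading to the ten rows of the Model Comparison table. The page does not state how it computes the frontier, so the rule itself, the `Result` type, and the `value_frontier` function are assumptions for illustration, and duration is ignored.

```python
from typing import NamedTuple


class Result(NamedTuple):
    """One row of the Model Comparison table (hypothetical container type)."""
    model: str
    duration_s: float
    cost_usd: float
    score_pct: int


# Values copied from the Model Comparison table.
RESULTS = [
    Result("Gemini 3 Flash Preview", 1.6, 0.0011, 99),
    Result("Gemini 2.5 Pro", 10.7, 0.0484, 96),
    Result("DeepSeek V3.2", 6.6, 0.0004, 92),
    Result("Claude Sonnet 4.5", 3.8, 0.0102, 85),
    Result("Mistral Large 3 2512", 3.9, 0.0017, 72),
    Result("Claude 3 Haiku", 1.4, 0.0008, 71),
    Result("DeepSeek V3 0324", 8.4, 0.0011, 66),
    Result("Claude Opus 4.5", 3.8, 0.0141, 64),
    Result("Llama 3.3 70B Instruct", 3.1, 0.0002, 61),
    Result("Claude Haiku 4.5", 2.1, 0.0033, 60),
]


def value_frontier(results: list[Result]) -> list[Result]:
    """Return the results that no cheaper-or-equal-cost result outscores.

    Assumed rule: walk the models from cheapest to most expensive and keep
    each one that beats the best score seen so far.
    """
    frontier: list[Result] = []
    best_score = float("-inf")
    # Sort by cost ascending; on a cost tie, the higher score comes first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score_pct)):
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


if __name__ == "__main__":
    for r in value_frontier(RESULTS):
        print(f"{r.model}: ${r.cost_usd:.4f}, {r.score_pct}%")
```

Under this assumed rule, the frontier for the listed rows is Llama 3.3 70B Instruct, DeepSeek V3.2, and Gemini 3 Flash Preview. The page's own highlighting may differ, since it covers more models than the table shows and may weigh duration as well.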

Token Usage

Average tokens used per model across all prompts.

Model	Provider	Avg tokens	Input	Output
GPT-5 Nano	OpenRouter	1,594	17	1,577
Gemini 2.5 Pro	OpenRouter	979	12	967
gpt-oss-20b	OpenRouter	894	74	820
gpt-oss-120b	OpenRouter	737	74	663
GLM 4.7	OpenRouter	675	17	658