German Memelord Bench

Benchmarking LLMs' ability to detect and understand German memes across a wide range of questions.

Jan 9, 2026

35 tasks

110 models

$1.1346

user_c636b9d7

Results (Preliminary)

27 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

Model	Duration	Cost	Score
Gemini 3 Flash Preview by Google	2.5s	$0.0192	93%
Gemini 2.5 Pro by Google	15.8s	$0.4450	81%
Claude Sonnet 4.5 by Anthropic	5.5s	$0.0793	70%
GPT-5.2 by OpenAI	9.2s	$0.1775	65%
Claude Opus 4.5 by Anthropic	6.7s	$0.1277	53%

Model Comparison

Compare performance across models and prompts.

Model	Duration	Cost	Score
Gemini 3 Flash Preview by Google on OpenRouter	2.5s	$0.0192	93%
Gemini 2.5 Pro by Google on OpenRouter	15.8s	$0.4450	81%
Claude Sonnet 4.5 by Anthropic on OpenRouter	5.5s	$0.0793	70%
GPT-5.2 by OpenAI on OpenRouter	9.2s	$0.1775	65%
Claude Opus 4.5 by Anthropic on OpenRouter	6.7s	$0.1277	53%
GPT-5 Mini by OpenAI on OpenRouter	12.3s	$0.0397	50%
Claude 3 Haiku by Anthropic on OpenRouter	1.7s	$0.0060	49%
GPT-5 Nano by OpenAI on OpenRouter	19.8s	$0.0255	43%
Claude Haiku 4.5 by Anthropic on OpenRouter	2.6s	$0.0234	43%
Gemini 2.5 Flash by Google on OpenRouter	1.5s	$0.0112	42%

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart: best value frontier. Highlighted = best value; dot size = duration.]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
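The "best score at their price point" idea can be sketched as a simple frontier computation. The snippet below is a minimal illustration, not the page's actual method: it assumes the frontier is defined by cost and score only, ignores duration, and uses just the ten models listed in the comparison table. The names `Result`, `RESULTS`, and `best_value_frontier` are hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    duration_s: float
    cost_usd: float
    score_pct: int


# Values copied from the Model Comparison table (ten listed models only).
RESULTS = [
    Result("Gemini 3 Flash Preview", 2.5, 0.0192, 93),
    Result("Gemini 2.5 Pro", 15.8, 0.4450, 81),
    Result("Claude Sonnet 4.5", 5.5, 0.0793, 70),
    Result("GPT-5.2", 9.2, 0.1775, 65),
    Result("Claude Opus 4.5", 6.7, 0.1277, 53),
    Result("GPT-5 Mini", 12.3, 0.0397, 50),
    Result("Claude 3 Haiku", 1.7, 0.0060, 49),
    Result("GPT-5 Nano", 19.8, 0.0255, 43),
    Result("Claude Haiku 4.5", 2.6, 0.0234, 43),
    Result("Gemini 2.5 Flash", 1.5, 0.0112, 42),
]


def best_value_frontier(results):
    """Return models that no cheaper (or equally cheap) model outscores.

    Assumption: "best value" means a cost/score frontier; the page may
    use a different definition.
    """
    frontier = []
    best_score = -1
    # Walk from cheapest to most expensive; at equal cost, higher score first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score_pct)):
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


if __name__ == "__main__":
    for r in best_value_frontier(RESULTS):
        print(f"{r.model}: ${r.cost_usd:.4f}, {r.score_pct}%")
```

Under these assumptions only two of the ten listed models land on the frontier, because the top-scoring model is also among the cheapest. The highlighted set on the actual chart may differ, since it can draw on more models and possibly a different rule.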

Token Usage

Average tokens used per model across all prompts.

Model	Avg tokens	Input	Output
GLM 4.7 on OpenRouter	4,637	24	4,612
gpt-oss-20b on OpenRouter	2,323	84	2,239
GPT-5 Nano on OpenRouter	2,215	23	2,191
Gemini 2.5 Pro on OpenRouter	1,550	17	1,532
gpt-oss-120b on OpenRouter	1,040	84	956
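To see how much of each average is output rather than prompt, the figures above can be turned into an output share. This is a rough sketch using only the five models listed here; `TOKEN_USAGE` and `output_share` are hypothetical names. The share is computed from the listed input and output averages, whose sum can differ by one token from the displayed total, presumably because of rounding.

```python
# (input tokens, output tokens) averages as listed above.
TOKEN_USAGE = {
    "GLM 4.7": (24, 4612),
    "gpt-oss-20b": (84, 2239),
    "GPT-5 Nano": (23, 2191),
    "Gemini 2.5 Pro": (17, 1532),
    "gpt-oss-120b": (84, 956),
}


def output_share(tokens_in: int, tokens_out: int) -> float:
    """Fraction of the average token count that is model output."""
    return tokens_out / (tokens_in + tokens_out)


SHARES = {model: output_share(*usage) for model, usage in TOKEN_USAGE.items()}

if __name__ == "__main__":
    for model, share in SHARES.items():
        print(f"{model}: {share:.1%} output")
```

For every model listed, output makes up more than 90% of the average, so differences in token usage here come almost entirely from response length rather than from the prompts.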