Bad Idea Bench

Tests LLMs' ability to spot bad ideas and nudge the user toward better ones.

May 21, 2026
4 tasks
110 models
$0.0752
user_c636b9d7
Public

Results (Preliminary)

23 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

Rank  Model            By         Score
1     Claude Opus 4.5  Anthropic  100%
2     Gemini 2.5 Pro   Google     98%
3     DeepSeek V3      DeepSeek   98%
4     GLM 4.5 Air      Z.ai       93%
5     GPT-4.1          OpenAI     89%

Model Comparison

Compare performance across models and prompts.

Model                          By         Via         Duration  Cost     Score
Claude Opus 4.5                Anthropic  OpenRouter  5.0s      $0.0046  100%
Gemini 2.5 Pro                 Google     OpenRouter  32.2s     $0.0133  98%
DeepSeek V3                    DeepSeek   OpenRouter  6.5s      $0.0003  98%
GLM 4.5 Air                    Z.ai       OpenRouter  1.7s      $0.0001  93%
GPT-4.1                        OpenAI     OpenRouter  2.4s      $0.0026  89%
Qwen3.5-Flash                  Qwen       OpenRouter  12.9s     $0.0005  83%
Gemini 3.1 Pro Preview         Google     OpenRouter  21.0s     $0.0131  83%
Gemma 4 26B A4B                Google     OpenRouter  4.9s      $0.0002  82%
Claude 3 Haiku                 Anthropic  OpenRouter  2.7s      $0.0002  76%
Qwen3 235B A22B Instruct 2507  Qwen       OpenRouter  2.1s      $0.0002  76%

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart legend: best value frontier; best value; dot size = duration]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
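
One way to read "best score at their price point" is as a Pareto frontier over cost and score: a model is highlighted if no cheaper or equally priced model scores higher. The sketch below applies that reading to the ten rows listed under Model Comparison. The page's actual selection rule is not stated, so the function name `best_value_frontier`, the `Result` type, and the tie-breaking behavior are all assumptions made for illustration, and duration is ignored here.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    cost_usd: float
    score_pct: int


# Rows copied from the Model Comparison listing (cost and score only).
RESULTS = [
    Result("Claude Opus 4.5", 0.0046, 100),
    Result("Gemini 2.5 Pro", 0.0133, 98),
    Result("DeepSeek V3", 0.0003, 98),
    Result("GLM 4.5 Air", 0.0001, 93),
    Result("GPT-4.1", 0.0026, 89),
    Result("Qwen3.5-Flash", 0.0005, 83),
    Result("Gemini 3.1 Pro Preview", 0.0131, 83),
    Result("Gemma 4 26B A4B", 0.0002, 82),
    Result("Claude 3 Haiku", 0.0002, 76),
    Result("Qwen3 235B A22B Instruct 2507", 0.0002, 76),
]


def best_value_frontier(results):
    """Return models not beaten on score by any model of equal or lower cost.

    Hypothetical helper: an assumed interpretation of the chart's
    highlighting, not the site's documented method.
    """
    # Cheapest first; among equal costs, highest score first.
    ordered = sorted(results, key=lambda r: (r.cost_usd, -r.score_pct))
    frontier = []
    best_score = -1
    for r in ordered:
        # Keep a model only if it improves on every cheaper model's score.
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


FRONTIER = best_value_frontier(RESULTS)
for r in FRONTIER:
    print(f"{r.model}: ${r.cost_usd:.4f}, {r.score_pct}%")
```

Under this assumed rule, each model on the frontier costs more than the one before it and scores strictly higher. Because only 10 of the 110 models appear in the listing, the frontier on the full chart may differ.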

Token Usage

Average tokens used per model across all prompts.

Model           Via         Avg tokens  In   Out
Qwen3.6 Plus    OpenRouter  2,315       149  2,167
Qwen3.5-Flash   OpenRouter  2,089       132  1,957
GPT-5           OpenRouter  1,732       116  1,616
Step 3.5 Flash  OpenRouter  1,540       132  1,408
Gemini 2.5 Pro  OpenRouter  1,429       117  1,312