Bad Idea Bench

Tests LLMs' ability to spot bad ideas and nudge the user toward better ones.

May 21, 2026
3 tasks
110 models
$0.0615
user_c636b9d7
Public

Results (Preliminary)


13 of 110 models on the leaderboard so far. More join with each arena vote.

Rank  Model                   Developer  Score
1     GPT-4.1                 OpenAI     100%
2     Claude Opus 4.5         Anthropic  100%
3     DeepSeek V3             DeepSeek   78%
4     Gemma 4 26B A4B         Google     75%
5     Gemini 3.1 Pro Preview  Google     52%

Prompt Details

Expand each prompt to see per-model responses and reasoning.

Model Comparison

Compare performance across models and prompts.

Model                   Developer  Provider    Duration  Cost     Score
GPT-4.1                 OpenAI     OpenRouter  2.2s      $0.0010  100%
Claude Opus 4.5         Anthropic  OpenRouter  5.0s      $0.0046  100%
DeepSeek V3             DeepSeek   OpenRouter  6.5s      $0.0003  78%
Gemma 4 26B A4B         Google     OpenRouter  4.9s      $0.0002  75%
Gemini 3.1 Pro Preview  Google     OpenRouter  21.0s     $0.0131  52%
GPT-5.5                 OpenAI     OpenRouter  8.4s      $0.0103  27%
gpt-oss-20b             OpenAI     OpenRouter  9.6s      $0.0001  25%
Llama 3.2 3B Instruct   Meta       OpenRouter  613ms     $0.0000  3%
GPT-5                   OpenAI     OpenRouter  27.8s     $0.0163  0%

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart: best value frontier. Highlighted dots = best value; dot size = duration.]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
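
One plausible reading of "best score at their price point" is a Pareto frontier over cost and score: a model is kept only if no cheaper-or-equally-priced model scores at least as well. The sketch below applies that reading to the cost and score figures listed under Model Comparison. The `Model` type and `value_frontier` function are hypothetical names chosen for illustration, and the rule itself is an assumption; the page does not say how it actually selects the highlighted models.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost_usd: float   # cost as listed under Model Comparison
    score_pct: float  # score as listed under Model Comparison


# Figures copied from the Model Comparison section.
MODELS = [
    Model("GPT-4.1", 0.0010, 100),
    Model("Claude Opus 4.5", 0.0046, 100),
    Model("DeepSeek V3", 0.0003, 78),
    Model("Gemma 4 26B A4B", 0.0002, 75),
    Model("Gemini 3.1 Pro Preview", 0.0131, 52),
    Model("GPT-5.5", 0.0103, 27),
    Model("gpt-oss-20b", 0.0001, 25),
    Model("Llama 3.2 3B Instruct", 0.0000, 3),
    Model("GPT-5", 0.0163, 0),
]


def value_frontier(models: list[Model]) -> list[Model]:
    """Return models not beaten on score by any cheaper-or-equal model.

    Assumed rule (not confirmed by the page): walk the models from
    cheapest to most expensive and keep each one that strictly improves
    on the best score seen so far.
    """
    frontier: list[Model] = []
    best_score = float("-inf")
    # Sort by cost ascending; on a cost tie, the higher score comes first.
    for model in sorted(models, key=lambda m: (m.cost_usd, -m.score_pct)):
        if model.score_pct > best_score:
            frontier.append(model)
            best_score = model.score_pct
    return frontier


frontier_names = [m.name for m in value_frontier(MODELS)]
print(frontier_names)
```

Under this assumed rule the frontier runs from Llama 3.2 3B Instruct through gpt-oss-20b, Gemma 4 26B A4B, and DeepSeek V3 up to GPT-4.1. Claude Opus 4.5 drops out because GPT-4.1 matches its score at a lower cost. The chart's actual highlighting may differ, for instance if it also weighs duration.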

Token Usage

Average tokens used per model across all prompts.

Model                   Provider    Avg tokens  In   Out
GPT-5                   OpenRouter  1,732       116  1,616
Gemini 3.1 Pro Preview  OpenRouter  1,183       114  1,069
gpt-oss-20b             OpenRouter  848         175  673
Gemma 4 26B A4B         OpenRouter  452         137  315
GPT-5.5                 OpenRouter  445         124  321
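
In each token figure listed here, the average equals input plus output tokens (for example, 116 + 1,616 = 1,732). The sketch below parses figures in the page's "N avg (A in / B out)" form and checks that relationship. The `parse_token_line` helper and the regular expression are hypothetical and written for illustration; they are not part of the benchmark's tooling.

```python
import re

# Matches strings such as "1,732 avg (116 in / 1,616 out)".
TOKEN_LINE = re.compile(r"([\d,]+) avg \(([\d,]+) in / ([\d,]+) out\)")


def parse_token_line(line: str) -> tuple[int, int, int]:
    """Parse a token-usage string into (average, input, output) counts."""
    match = TOKEN_LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized token line: {line!r}")
    avg, tokens_in, tokens_out = (
        int(group.replace(",", "")) for group in match.groups()
    )
    return avg, tokens_in, tokens_out


# Figures copied from the Token Usage section.
TOKEN_USAGE = {
    "GPT-5": "1,732 avg (116 in / 1,616 out)",
    "Gemini 3.1 Pro Preview": "1,183 avg (114 in / 1,069 out)",
    "gpt-oss-20b": "848 avg (175 in / 673 out)",
    "Gemma 4 26B A4B": "452 avg (137 in / 315 out)",
    "GPT-5.5": "445 avg (124 in / 321 out)",
}

for name, line in TOKEN_USAGE.items():
    avg, tokens_in, tokens_out = parse_token_line(line)
    # Check that the reported average is the sum of input and output tokens.
    consistent = avg == tokens_in + tokens_out
    print(f"{name}: output share {tokens_out / avg:.0%}, consistent={consistent}")
```

The output share makes the spread visible: most of each model's token count comes from the response, since the prompts are short.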