Bad Idea Bench

Tests LLMs' ability to spot bad ideas and nudge the user toward better ones.

May 21, 2026

4 tasks

110 models

$0.0752

user_c636b9d7

Public

Results (Preliminary)

23 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

Model	Provider	Duration	Cost	Score
Claude Opus 4.5	Anthropic	5.0s	$0.0046	100%
Gemini 2.5 Pro	Google	32.2s	$0.0133	98%
DeepSeek V3	DeepSeek	6.5s	$0.0003	98%
GLM 4.5 Air	Z.ai	1.7s	$0.0001	93%
GPT-4.1	OpenAI	2.4s	$0.0026	89%

Model Comparison

Compare performance across models and prompts.

Model	Duration	Cost	Score
Claude Opus 4.5 by Anthropic on OpenRouter	5.0s	$0.0046	100%
Gemini 2.5 Pro by Google on OpenRouter	32.2s	$0.0133	98%
DeepSeek V3 by DeepSeek on OpenRouter	6.5s	$0.0003	98%
GLM 4.5 Air by Z.ai on OpenRouter	1.7s	$0.0001	93%
GPT-4.1 by OpenAI on OpenRouter	2.4s	$0.0026	89%
Qwen3.5-Flash by Qwen on OpenRouter	12.9s	$0.0005	83%
Gemini 3.1 Pro Preview by Google on OpenRouter	21.0s	$0.0131	83%
Gemma 4 26B A4B by Google on OpenRouter	4.9s	$0.0002	82%
Claude 3 Haiku by Anthropic on OpenRouter	2.7s	$0.0002	76%
Qwen3 235B A22B Instruct 2507 by Qwen on OpenRouter	2.1s	$0.0002	76%

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart: best value frontier. Highlighted points mark the best-value models; dot size indicates duration.]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
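
One way to read "best score at their price point" is as a Pareto frontier over cost and score: a model stays on the frontier only if no model costing the same or less scores at least as well. The sketch below applies that reading to the ten rows of the Model Comparison table. The `Model` type and the `best_value_frontier` function are hypothetical names introduced for illustration, and the page's actual highlighting rule is not stated, so it may differ (for example, it might also weigh duration).

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost_usd: float
    score_pct: int


# Cost and score values copied from the Model Comparison table.
MODELS = [
    Model("Claude Opus 4.5", 0.0046, 100),
    Model("Gemini 2.5 Pro", 0.0133, 98),
    Model("DeepSeek V3", 0.0003, 98),
    Model("GLM 4.5 Air", 0.0001, 93),
    Model("GPT-4.1", 0.0026, 89),
    Model("Qwen3.5-Flash", 0.0005, 83),
    Model("Gemini 3.1 Pro Preview", 0.0131, 83),
    Model("Gemma 4 26B A4B", 0.0002, 82),
    Model("Claude 3 Haiku", 0.0002, 76),
    Model("Qwen3 235B A22B Instruct 2507", 0.0002, 76),
]


def best_value_frontier(models):
    """Keep each model that scores strictly higher than every model
    visited before it, visiting cheapest first (assumed frontier rule)."""
    frontier = []
    best_score = -1
    # Cheapest first; among equal costs, highest score first.
    for m in sorted(models, key=lambda m: (m.cost_usd, -m.score_pct)):
        if m.score_pct > best_score:
            frontier.append(m)
            best_score = m.score_pct
    return frontier


if __name__ == "__main__":
    for m in best_value_frontier(MODELS):
        print(f"{m.name}\t${m.cost_usd:.4f}\t{m.score_pct}%")
```

Under this assumed rule, the frontier for the listed rows is GLM 4.5 Air, DeepSeek V3, and Claude Opus 4.5. Every other listed model is matched or beaten on score by a model that costs the same or less.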

Token Usage

Average tokens used per model across all prompts.

Model	Provider	Avg tokens	Input	Output
Qwen3.6 Plus	OpenRouter	2,315	149	2,167
Qwen3.5-Flash	OpenRouter	2,089	132	1,957
GPT-5	OpenRouter	1,732	116	1,616
Step 3.5 Flash	OpenRouter	1,540	132	1,408
Gemini 2.5 Pro	OpenRouter	1,429	117	1,312