Character Frequency Bench

Tests an LLM's ability to accurately count and categorize specific characters, symbols, and patterns within strings. The benchmark evaluates tokenization-independent, character-level processing and precise substring analysis.
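To make the task concrete, here is a minimal sketch of how a reference answer for a character-counting prompt could be computed and compared against a model's reply. The function names (`count_char`, `categorize`, `is_correct`) and the four coarse categories are assumptions made for illustration; the page does not describe the benchmark's actual prompts or grader.

```python
from collections import Counter


def count_char(text: str, target: str) -> int:
    """Count exact, case-sensitive occurrences of one character."""
    if len(target) != 1:
        raise ValueError("target must be a single character")
    return text.count(target)


def categorize(text: str) -> dict[str, int]:
    """Tally characters into coarse categories (hypothetical scheme)."""
    tallies: Counter[str] = Counter()
    for ch in text:
        if ch.isalpha():
            tallies["letter"] += 1
        elif ch.isdigit():
            tallies["digit"] += 1
        elif ch.isspace():
            tallies["whitespace"] += 1
        else:
            tallies["symbol"] += 1
    return dict(tallies)


def is_correct(text: str, target: str, model_answer: int) -> bool:
    """Exact-match check of a model's numeric answer (assumed grading rule)."""
    return model_answer == count_char(text, target)


print(count_char("strawberry", "r"))
print(categorize("a1 b2!"))
```

Because the reference count operates on characters rather than tokens, it is unaffected by how a given model's tokenizer splits the string, which is the property the benchmark description points to.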

Jan 9, 2026

11 tasks

110 models

$0.2378

user_c636b9d7

Link only

Results (Preliminary)

27 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

Model	Duration	Cost	Score
gpt-oss-20b by OpenAI	9.0s	$0.0002	100%
Kimi K2 Thinking by MoonshotAI	15.2s	$0.0032	100%
DeepSeek V3 0324 by DeepSeek	26.7s	$0.0079	100%
GPT-5.2 by OpenAI	4.1s	$0.0185	100%
Claude Opus 4.5 by Anthropic	4.3s	$0.0437	100%

Model Comparison

Compare performance across models and prompts.

Model	Duration	Cost	Score
gpt-oss-20b by OpenAI on OpenRouter	9.0s	$0.0002	100%
Kimi K2 Thinking by MoonshotAI on OpenRouter	15.2s	$0.0032	100%
DeepSeek V3 0324 by DeepSeek on OpenRouter	26.7s	$0.0079	100%
GPT-5.2 by OpenAI on OpenRouter	4.1s	$0.0185	100%
Claude Opus 4.5 by Anthropic on OpenRouter	4.3s	$0.0437	100%
Claude Haiku 4.5 by Anthropic on OpenRouter	1.9s	$0.0077	95%
gpt-oss-120b by OpenAI on OpenRouter	2.7s	$0.0003	91%
Llama 4 Maverick by Meta on OpenRouter	7.3s	$0.0018	91%
GLM 4.7 by Z.ai on OpenRouter	12.6s	$0.0063	91%
GPT-5 Mini by OpenAI on OpenRouter	11.4s	$0.0079	91%

Value Analysis

Find models with the best balance of quality, cost, and speed.

Chart legend: best value frontier; best value; dot size = duration.

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
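One plausible reading of "best score at their price point" is a Pareto frontier over cost and score: a model is kept only if no cheaper-or-equal model scores at least as well. The sketch below applies that reading to the ten rows of the comparison table. The `Row` type and `pareto_frontier` function are hypothetical names, and the page's chart may use a different definition and a larger set of models.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    duration_s: float
    cost_usd: float
    score_pct: int


# Values copied from the Model Comparison table.
ROWS = [
    Row("gpt-oss-20b", 9.0, 0.0002, 100),
    Row("Kimi K2 Thinking", 15.2, 0.0032, 100),
    Row("DeepSeek V3 0324", 26.7, 0.0079, 100),
    Row("GPT-5.2", 4.1, 0.0185, 100),
    Row("Claude Opus 4.5", 4.3, 0.0437, 100),
    Row("Claude Haiku 4.5", 1.9, 0.0077, 95),
    Row("gpt-oss-120b", 2.7, 0.0003, 91),
    Row("Llama 4 Maverick", 7.3, 0.0018, 91),
    Row("GLM 4.7", 12.6, 0.0063, 91),
    Row("GPT-5 Mini", 11.4, 0.0079, 91),
]


def pareto_frontier(rows: list[Row]) -> list[Row]:
    """Keep rows not beaten on score by any cheaper-or-equal row.

    Sweeps from cheapest to most expensive and keeps a row only when it
    strictly improves on the best score seen so far. Exact ties on both
    cost and score keep only the first row encountered.
    """
    frontier: list[Row] = []
    best_score = float("-inf")
    for row in sorted(rows, key=lambda r: (r.cost_usd, -r.score_pct)):
        if row.score_pct > best_score:
            frontier.append(row)
            best_score = row.score_pct
    return frontier


print([row.name for row in pareto_frontier(ROWS)])
```

On these ten rows the frontier collapses to gpt-oss-20b alone, because it is both the cheapest model and tied for the top score. Duration is ignored here; the chart encodes it only as dot size.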

Token Usage

Average tokens used per model across all prompts.

Model	Provider	Avg tokens	In	Out
Llama 3.2 1B Instruct	OpenRouter	2,645	39	2,606
DeepSeek V3 0324	OpenRouter	842	27	815
GPT-5 Nano	OpenRouter	838	29	809
GLM 4.7	OpenRouter	745	29	715
Gemini 2.5 Pro	OpenRouter	744	24	720