Benchmarks

Browse what others are evaluating. Each benchmark is a set of prompts run against a group of models. Open one to see how they scored.

Untitled Benchmark
No description
50 tasks · 42 models · May 26
Bad Idea Bench
Tests LLMs' ability to spot bad ideas and nudge the user towards better ones.
4 tasks · 110 models · May 21
Categorization Bench
This benchmark measures the model's ability to suggest relevant category names for a given set of related concepts or items.
10 tasks · 110 models · May 16
Explain Like I'm 5
This benchmark measures the ability to explain complex topics simply and concisely for a five-year-old.
7 tasks · 110 models · May 14
Niederstetten Benchmark
Tests an LLM's knowledge of a specific town in rural Germany.
5 tasks · 110 models · Jan 28
Evalry Knowledge Benchmark
No description
4 tasks · 320 models · Jan 28
Product Recommendation Bench
Benchmarks product recommendations for a diverse set of SaaS products.
9 tasks · 110 models · Jan 13
Venture Capital Terms Benchmark
An LLM benchmark on typical venture capital terms, so you know which model to discuss your next fundraising round with.
6 tasks · 88 models · Jan 10
German Memelord Bench
Benchmarks LLMs' ability to detect and understand German memes across a wide range of questions.
35 tasks · 110 models · Jan 9
Character Frequency Bench
Tests an LLM's ability to accurately count and categorize specific characters, symbols, and patterns within strings. This benchmark evaluates tokenization-independent visual processing and precise sub-string analysis.
11 tasks · 110 models · Jan 9