What we've shipped lately.
Vote on answers in the Arena.
See two model answers side by side and pick the better one. Your votes feed into the benchmark's score.
A new landing page.
A clearer explanation of what Evalry does and how to pick the right model for what you're building.
Vote directly from a benchmark.
You can now vote on model answers right next to the results, without switching pages.
Benchmark pages on phones.
The benchmarks list and individual benchmark pages now lay out properly on small screens.
Browse every model.
A new page lists every model we test, with its score, speed, and price.
Easier-to-read model answers.
Long answers are now formatted properly and collapsed by default, so the results page stays scannable.
Newer models in the recommended lists.
The pre-made model selections include the latest releases from OpenAI, Anthropic, Google, and others.
Edit benchmarks in place.
Change your tasks from the benchmark page itself, with suggestions to help you refine them.
Better-looking shared links.
When you paste a public benchmark link into Slack, X, or anywhere else, it shows a proper preview with title, description, and image.
Spending limits.
Set a credit cap on your account so a long-running benchmark can't surprise you on cost.
Featured benchmarks.
A short list of benchmarks we've hand-picked, so you don't have to start from scratch.
Faster public pages.
Public benchmark pages open near-instantly, even on a fresh visit.
Paged benchmark list.
The benchmarks page is now split into pages instead of showing everything at once, which keeps it fast as the catalog grows.
Overall model rankings.
See which models score best across every benchmark, not only within one.
Pick a model group instead of picking models one by one.
Choose from ready-made model sets and run your benchmark against the whole group at once.
Results page on phones.
Tables and charts on the results page now fit small screens.
Public or private benchmarks.
Choose whether each benchmark is visible to anyone with the link or only to you.
A page for every model.
Each model has its own page with its scores, speed, and price.
Compare token usage and export results.
See how many tokens each model used per task, and download the full results.
Accounts and sharing.
Sign in to keep your benchmarks and share them with a link.
Evalry is live.
First public release.
Run against hundreds of models.
You can now run a benchmark against any of hundreds of models from the major providers, all from one place.
Rankings page.
A single page that ranks models across every public benchmark.