AIAny - Arena Leaderboard (formerly LMArena)

Static benchmarks reward models that overfit to known test sets; this leaderboard sidesteps that by never telling voters which model they are judging. Each ranking is built from millions of blind pairwise human votes, so a model cannot easily game it without actually winning voters' preferences, which is why frontier labs treat a strong placement here as a launch milestone.
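As a rough illustration of how blind pairwise votes can be turned into a ranking, here is a minimal Python sketch of an Elo-style update. It is not the leaderboard's actual pipeline: the K-factor, the starting rating, the function names, and the model names are all assumptions made up for this example.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_from_votes(votes, k=4.0, start=1000.0):
    """Replay votes in order and return a dict of ratings.

    votes: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if model_a won, 0.0 if model_b won, and 0.5 for a tie.
    k and start are illustrative assumptions, not the leaderboard's values.
    """
    ratings = defaultdict(lambda: start)
    for a, b, outcome in votes:
        e_a = expected_score(ratings[a], ratings[b])
        delta = k * (outcome - e_a)
        ratings[a] += delta  # winner gains what the loser gives up,
        ratings[b] -= delta  # so the total rating mass stays constant
    return dict(ratings)


# Hypothetical votes; voters never see these names while judging.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 1.0),
    ("model-x", "model-z", 0.5),
]
ratings = elo_from_votes(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update depends only on who won a given comparison, a model's rating can rise only by being preferred in more head-to-head matchups than its current rating predicts.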

What Sets It Apart

Rankings come from human preference votes, not curated question banks, so they track what people actually prefer rather than what a model memorized.
Models are anonymized during voting, removing brand bias and making the Elo scores hard to manipulate.
Coverage spans text, coding, vision, image, and video arenas, letting you compare the same model's standing across very different tasks.
Scores carry confidence intervals, so a one-rank gap with overlapping margins signals a statistical tie rather than a real difference.
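The point about confidence intervals can be made concrete with a small Python sketch that checks whether two leaderboard entries overlap. The scores, interval bounds, and model names below are invented for illustration, and treating any overlap as a tie is a conservative rule of thumb rather than a formal significance test.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    score: float
    ci_low: float   # lower bound of the confidence interval
    ci_high: float  # upper bound of the confidence interval


def intervals_overlap(a, b):
    """True when the two confidence intervals share at least one point."""
    return a.ci_low <= b.ci_high and b.ci_low <= a.ci_high


def describe_gap(a, b):
    """Summarize whether a rank gap looks real under the overlap heuristic."""
    if intervals_overlap(a, b):
        return "statistical tie"
    leader = a if a.score > b.score else b
    return f"{leader.name} leads"


# Hypothetical entries, not real leaderboard data.
first = Entry("model-x", 1350.0, 1342.0, 1358.0)
second = Entry("model-y", 1345.0, 1338.0, 1352.0)
third = Entry("model-z", 1320.0, 1314.0, 1326.0)

print(describe_gap(first, second))  # intervals overlap
print(describe_gap(first, third))   # intervals are disjoint
```

Here the first two entries sit one rank apart but share part of their intervals, so the sketch reports a tie; the third entry's interval falls entirely below the first's, so that gap is reported as a lead.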

Who It's For

Great fit if you want a directional, preference-based read on which frontier models people like right now, or a public reference point when picking between similar top-tier models. Look elsewhere if you need reproducible, task-specific scores: votes skew toward conversational style and presentation, ranks shift as new models arrive, and the methodology rewards what feels better, not necessarily what is most correct on your particular workload.

More Items

OpenAI/parameter-golf

VLMEvalKit

OpenCompass CompassRank