AI Leaderboard, 2023

OpenCompass CompassRank

Public leaderboard ranking LLMs and multimodal models across 70+ datasets — reasoning, knowledge, coding, math, and long-context. Blends open-source and proprietary benchmarks into one comparative view spanning GPT-4, Claude, Qwen, and InternLM.

Introduction

Most model leaderboards optimize for a single headline metric, which is exactly how a model can top a chart while being mediocre at the thing you actually need. CompassRank takes the opposite bet: it spreads each model across five capability dimensions (reasoning, knowledge, coding, math, and long-context) and 70+ datasets, so a strong average score can no longer hide a weak spot in coding, long-context, or math reasoning. The interesting signal is often not who is on top, but where a given model quietly falls off.

What Sets It Apart

Mixes open-source benchmarks with proprietary, harder-to-game ones, so a model can't simply train on the public test sets and climb. That makes the relative ordering more trustworthy than single-benchmark leaderboards.
Reports per-dimension breakdowns (reasoning, knowledge, coding, math, long-context) rather than one number, which is useful when you care about a specific skill instead of an aggregate.
Covers both API models (GPT-4, Claude) and open-weight families (Qwen, InternLM, Llama), letting you compare a closed frontier model against a deployable open one on the same yardstick.
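To make the per-dimension point concrete, here is a minimal sketch of how two models with the same average can differ sharply on one skill. The model names and scores are invented for illustration and are not CompassRank data; the five dimension names follow the description above, and the plain unweighted mean is an assumption, not necessarily how CompassRank aggregates.

```python
from statistics import mean

# Hypothetical scores (0-100) for two made-up models; not real leaderboard data.
scores = {
    "model_a": {"reasoning": 80, "knowledge": 85, "coding": 45, "math": 82, "long_context": 83},
    "model_b": {"reasoning": 74, "knowledge": 75, "coding": 76, "math": 73, "long_context": 77},
}

def summarize(per_dimension):
    """Return the unweighted average plus the weakest dimension and its score."""
    weakest = min(per_dimension, key=per_dimension.get)
    return {
        "average": mean(per_dimension.values()),
        "weakest": weakest,
        "weakest_score": per_dimension[weakest],
    }

summaries = {name: summarize(dims) for name, dims in scores.items()}

for name, s in summaries.items():
    print(f"{name}: average={s['average']:.1f}, "
          f"weakest={s['weakest']} ({s['weakest_score']})")
```

Both hypothetical models average 75, but only the breakdown shows that model_a drops to 45 on coding. If coding is the skill you need, the aggregate alone would have pointed you at the wrong model.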

Who It's For

Great fit if you're choosing a base model and want capability-level evidence rather than vibes, or if you're tracking how open-weight models close the gap with frontier APIs. Look elsewhere if you need a live arena of human preference votes, or if your use case is narrow enough that one targeted benchmark tells you more than a broad aggregate — the breadth that makes CompassRank fair also dilutes signal for very specific niches.

Information

Website: rank.opencompass.org.cn
Organizations: Shanghai AI Laboratory
Authors: OpenCompass Contributors
Published date: 2023/06/01

More Items

AI Train, 2026

OpenAI/parameter-golf

OpenAI

A challenge repository for training the best language model that fits inside a 16,000,000‑byte (16MB) submission artifact; provides baseline training code, FineWeb bpb evaluation, a public leaderboard, and compute-grant instructions for short 8×H100 runs.

openai ai-train ai-leaderboard github pytorch (+2 more)

AI Leaderboard, 2023

VLMEvalKit

open-compass (OpenCompass community); OpenCompass, Shanghai AI Laboratory

Runs one-command evaluation of vision-language models across 80+ multimodal benchmarks, handling data download, inference, and metric scoring in a single pass. Supports 220+ LMMs; adding a new model means writing one generate_inner() function.

vision ai-leaderboard huggingface github ai-tools (+1 more)

AI Leaderboard, 2023

Arena Leaderboard (formerly LMArena)

LMSYS Org, Arena; Arena Intelligence Inc.

Blind side-by-side voting site where users send one prompt to two anonymous chat models, pick the winner, and millions of votes become Elo rankings across text, coding, vision, image, and video. Crowd preference, not static benchmarks, decides the order.

ai-leaderboard ai-rank LLM chatbot ai-tools
