Updated 23 Mar 2026

Open Source LLM Leaderboard

Live benchmark performance for open-weight and open-source models. Data refreshed from Vellum.ai on every page load.

Top open-source models by task

Best in Reasoning (GPQA Diamond)

Kimi K2.5 – 87.6%
Kimi K2 Thinking – 84.5%
GPT oss 120b – 80.1%
Nemotron Ultra 253B – 76%
Llama 4 Behemoth – 73.7%

Best in High School Math (AIME 2025)

Kimi K2 Thinking – 99.1%
GPT oss 20b – 98.7%
GPT oss 120b – 97.9%
Kimi K2.5 – 96.1%
DeepSeek-R1 – 74%

Best in Agentic Coding (SWE-Bench Verified)

Kimi K2.5 – 76.8%
Kimi K2 Thinking – 71.3%
DeepSeek-R1 – 49.2%
DeepSeek V3 0324 – 38.8%
Qwen2.5-VL-32B – 18.8%

Best Overall (Humanity's Last Exam)

Kimi K2 Thinking – 44.9%
Kimi K2.5 – 30.1%
GPT oss 120b – 14.9%
GPT oss 20b – 10.9%
DeepSeek-R1 – 8.6%

Best in Visual Reasoning (ARC-AGI 2)

Kimi K2.5 – 12%

Best in Multilingual Reasoning (MMMLU)

Llama 4 Behemoth – 85.8%
Llama 4 Maverick – 84.6%

Fastest & most affordable

Fastest (tokens/sec)

  1. Llama 4 Scout – 2,600 t/s
  2. Llama 3.3 70b – 2,500 t/s
  3. Llama 3.1 70b – 2,100 t/s
  4. Llama 3.1 405b – 969 t/s
  5. GPT oss 20b – 564 t/s

Lowest Latency (TTFT)

  1. Llama 4 Scout – 0.33s
  2. Llama 4 Maverick – 0.45s
  3. Llama 3.3 70b – 0.52s
  4. Gemma 3 27b – 0.72s
  5. Llama 3.1 405b – 0.73s

Cheapest (input / output cost per 1M tokens)

  1. Gemma 3 27b – $0.07 / $0.07
  2. GPT oss 20b – $0.08 / $0.35
  3. Llama 4 Scout – $0.11 / $0.34
  4. GPT oss 120b – $0.15 / $0.6
  5. Llama 4 Maverick – $0.2 / $0.6
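The two figures in each entry are the input and output prices per million tokens. As a quick sketch of how they translate into a per-request bill (prices taken from the list above; the request sizes are made-up example values):

```python
# Estimate the dollar cost of one request from per-1M-token prices.

def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Cost in dollars; price_in / price_out are $ per 1M tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# GPT oss 20b: $0.08 in / $0.35 out per 1M tokens (from the list above).
cost = request_cost(input_tokens=20_000, output_tokens=2_000,
                    price_in=0.08, price_out=0.35)
print(f"${cost:.4f}")  # -> $0.0023
```

Note that output tokens usually dominate the bill at these price ratios, which is why the cheapest model for chat is not always the cheapest for long-form generation.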

Model Comparison

Model | Context | Cutoff | I/O Cost (per 1M) | Max Output | Latency | Speed
Kimi K2.5 | 256,000 | Apr 2024 | $0.6 / $2.5 | 33,000 | 1.2s | 45 t/s
Llama 3.1 70b | 128,000 | Dec 2023 | n/a | 4,096 | n/a | 2,100 t/s
Llama 3.1 405b | 128,000 | Dec 2023 | $3.5 / $3.5 | 4,096 | 0.73s | 969 t/s
Llama 3.3 70b | 128,000 | Jul 2024 | $0.59 / $0.7 | 32,768 | 0.52s | 2,500 t/s
DeepSeek V3 0324 | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4s | 33 t/s
Qwen2.5-VL-32B | 131,000 | Dec 2024 | n/a | 8,000 | n/a | n/a
DeepSeek-R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s
Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s
Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.2 / $0.6 | 8,000 | 0.45s | 126 t/s
Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2,600 t/s
Llama 4 Behemoth | n/a | Nov 2024 | n/a | n/a | n/a | n/a
Nemotron Ultra 253B | n/a | n/a | n/a | n/a | n/a | n/a
GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.6 | 131,072 | 8.1s | 260 t/s
GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s
Kimi K2 Thinking | 256,000 | Apr 2025 | $0.6 / $2.5 | 16,400 | 25.3s | 79 t/s
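The Latency (time to first token) and Speed (generation throughput) columns combine into a rough wall-clock estimate for a streamed response: total time ≈ TTFT + output_tokens / throughput. A minimal sketch using figures from the table (the output-token count is an example value, not from the source):

```python
# Rough wall-clock estimate for a streamed response:
# time to first token, plus generation time at the quoted throughput.

def response_time(ttft_s: float, speed_tps: float, output_tokens: int) -> float:
    """Seconds until the full response has streamed."""
    return ttft_s + output_tokens / speed_tps

# Llama 4 Scout: 0.33s TTFT, 2,600 t/s (from the comparison table).
t = response_time(ttft_s=0.33, speed_tps=2600, output_tokens=1000)
print(f"{t:.2f}s")  # -> 0.71s
```

This is why a high-throughput model can still feel slow for short replies if its TTFT is large, and vice versa: for a 1,000-token answer, Kimi K2 Thinking's 25.3s TTFT dwarfs its generation time.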

Benchmark Glossary

GPQA Diamond

Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.

AIME 2025

Problems from the 2025 American Invitational Mathematics Examination. Measures multi-step mathematical problem solving.

SWE-Bench Verified

Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.

Humanity's Last Exam

A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.

ARC-AGI 2

Abstract visual puzzles requiring novel pattern recognition. Tests fluid intelligence and generalization beyond training data.

MMMLU

A multilingual version of MMLU (Massive Multitask Language Understanding). Evaluates knowledge and reasoning in non-English contexts.