Updated 23 Mar 2026

Open Source LLM Leaderboard

Live benchmark performance for open-weight and open-source models. Data refreshed from Vellum.ai on every page load.

Top open-source models by task

Best in Reasoning (GPQA Diamond)

Kimi K2.5 – 87.6%
Kimi K2 Thinking – 84.5%
GPT oss 120b – 80.1%
Nemotron Ultra 253B – 76%
Llama 4 Behemoth – 73.7%

Best in High School Math (AIME 2025)

Kimi K2 Thinking – 99.1%
GPT oss 20b – 98.7%
GPT oss 120b – 97.9%
Kimi K2.5 – 96.1%
DeepSeek-R1 – 74%

Best in Agentic Coding (SWE-Bench Verified)

Kimi K2.5 – 76.8%
Kimi K2 Thinking – 71.3%
DeepSeek-R1 – 49.2%
DeepSeek V3 0324 – 38.8%
Qwen2.5-VL-32B – 18.8%

Best Overall (Humanity's Last Exam)

Kimi K2 Thinking – 44.9%
Kimi K2.5 – 30.1%
GPT oss 120b – 14.9%
GPT oss 20b – 10.9%
DeepSeek-R1 – 8.6%

Best in Visual Reasoning (ARC-AGI 2)

Kimi K2.5 – 12%

Best in Multilingual Reasoning (MMMLU)

Llama 4 Behemoth – 85.8%
Llama 4 Maverick – 84.6%

Fastest & most affordable

Fastest (tokens/sec)

  1. Llama 4 Scout – 2,600 t/s
  2. Llama 3.3 70b – 2,500 t/s
  3. Llama 3.1 70b – 2,100 t/s
  4. Llama 3.1 405b – 969 t/s
  5. GPT oss 20b – 564 t/s

Lowest Latency (TTFT)

  1. Llama 4 Scout – 0.33s
  2. Llama 4 Maverick – 0.45s
  3. Llama 3.3 70b – 0.52s
  4. Gemma 3 27b – 0.72s
  5. Llama 3.1 405b – 0.73s

Cheapest (input / output cost per 1M tokens)

  1. Gemma 3 27b – $0.07 / $0.07
  2. GPT oss 20b – $0.08 / $0.35
  3. Llama 4 Scout – $0.11 / $0.34
  4. GPT oss 120b – $0.15 / $0.6
  5. Llama 4 Maverick – $0.2 / $0.6
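The two figures in each entry are the input and output prices per million tokens. As a quick sketch of how they translate into a per-request bill (prices taken from the list above; the request sizes are made-up example values):

```python
# Estimate the dollar cost of one request from per-1M-token prices.

def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Cost in dollars; price_in / price_out are $ per 1M tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# GPT oss 20b: $0.08 in / $0.35 out per 1M tokens (from the list above).
cost = request_cost(input_tokens=20_000, output_tokens=2_000,
                    price_in=0.08, price_out=0.35)
print(f"${cost:.4f}")  # -> $0.0023
```

Note that output tokens usually dominate the bill at these price ratios, which is why the cheapest model for chat is not always the cheapest for long-form generation.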

Model Comparison

Model | Context | Cutoff | I/O Cost (per 1M) | Max Output | Latency | Speed
Kimi K2.5 | 256,000 | Apr 2024 | $0.6 / $2.5 | 33,000 | 1.2s | 45 t/s
Llama 3.1 70b | 128,000 | Dec 2023 | n/a | 4,096 | n/a | 2,100 t/s
Llama 3.1 405b | 128,000 | Dec 2023 | $3.5 / $3.5 | 4,096 | 0.73s | 969 t/s
Llama 3.3 70b | 128,000 | Jul 2024 | $0.59 / $0.7 | 32,768 | 0.52s | 2,500 t/s
DeepSeek V3 0324 | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4s | 33 t/s
Qwen2.5-VL-32B | 131,000 | Dec 2024 | n/a | 8,000 | n/a | n/a
DeepSeek-R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s
Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s
Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.2 / $0.6 | 8,000 | 0.45s | 126 t/s
Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2,600 t/s
Llama 4 Behemoth | n/a | Nov 2024 | n/a | n/a | n/a | n/a
Nemotron Ultra 253B | n/a | n/a | n/a | n/a | n/a | n/a
GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.6 | 131,072 | 8.1s | 260 t/s
GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s
Kimi K2 Thinking | 256,000 | Apr 2025 | $0.6 / $2.5 | 16,400 | 25.3s | 79 t/s
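The Latency (time to first token) and Speed (generation throughput) columns combine into a rough wall-clock estimate for a streamed response: total time ≈ TTFT + output_tokens / throughput. A minimal sketch using figures from the table (the output-token count is an example value, not from the source):

```python
# Rough wall-clock estimate for a streamed response:
# time to first token, plus generation time at the quoted throughput.

def response_time(ttft_s: float, speed_tps: float, output_tokens: int) -> float:
    """Seconds until the full response has streamed."""
    return ttft_s + output_tokens / speed_tps

# Llama 4 Scout: 0.33s TTFT, 2,600 t/s (from the comparison table).
t = response_time(ttft_s=0.33, speed_tps=2600, output_tokens=1000)
print(f"{t:.2f}s")  # -> 0.71s
```

This is why a high-throughput model can still feel slow for short replies if its TTFT is large, and vice versa: for a 1,000-token answer, Kimi K2 Thinking's 25.3s TTFT dwarfs its generation time.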

Benchmark Glossary

GPQA Diamond

Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.

AIME 2025

Problems from the 2025 American Invitational Mathematics Examination. Measures multi-step mathematical problem solving.

SWE-Bench Verified

Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.

Humanity's Last Exam

A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.

ARC-AGI 2

Abstract visual puzzles requiring novel pattern recognition. Tests fluid intelligence and generalization beyond training data.

MMMLU

A multilingual version of MMLU (Massive Multitask Language Understanding). Evaluates knowledge and reasoning in non-English contexts.