Updated 23 Mar 2026
Open Source LLM Leaderboard
Live benchmark performance for open-weight and open-source models. Data refreshed from Vellum.ai on every page load.
Top open source models per task
- Best in Reasoning (GPQA Diamond)
- Best in High School Math (AIME 2025)
- Best in Agentic Coding (SWE-Bench)
- Best Overall (Humanity's Last Exam)
- Best in Visual Reasoning (ARC-AGI 2)
- Best in Multilingual Reasoning (MMMLU)
Fastest & most affordable
Fastest (tokens/sec)
1. Llama 4 Scout: 2600 t/s
2. Llama 3.3 70b: 2500 t/s
3. Llama 3.1 70b: 2100 t/s
4. Llama 3.1 405b: 969 t/s
5. GPT oss 20b: 564 t/s
Lowest Latency (TTFT, time to first token)
1. Llama 4 Scout: 0.33s
2. Llama 4 Maverick: 0.45s
3. Llama 3.3 70b: 0.52s
4. Gemma 3 27b: 0.72s
5. Llama 3.1 405b: 0.73s
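Both metrics are easy to sanity-check yourself. The sketch below measures TTFT and decode throughput against any OpenAI-compatible endpoint; the `base_url`, `api_key`, and model name are placeholders, and counting one token per stream chunk is an approximation, since providers may batch several tokens into one event.

```python
import time
from openai import OpenAI

# Point this at any OpenAI-compatible endpoint (vLLM, TGI, a hosted provider).
# The base_url, api_key, and model name below are placeholders, not real values.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def measure(model: str, prompt: str) -> tuple[float, float]:
    """Return (TTFT in seconds, decode throughput in tokens/sec)."""
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # first generated token arrives
            chunks += 1  # approximation: one stream chunk ~ one token
    if first is None:
        raise RuntimeError("stream produced no content")
    ttft = first - start
    decode_time = time.perf_counter() - first
    return ttft, chunks / decode_time if decode_time > 0 else float("nan")

ttft, tps = measure("llama-4-scout", "Explain KV caching in two sentences.")
print(f"TTFT {ttft:.2f}s, {tps:.0f} t/s")
```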
Cheapest ($ per 1M tokens, input / output)
1. Gemma 3 27b: $0.07 / $0.07
2. GPT oss 20b: $0.08 / $0.35
3. Llama 4 Scout: $0.11 / $0.34
4. GPT oss 120b: $0.15 / $0.60
5. Llama 4 Maverick: $0.20 / $0.60
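The two figures in each row are input and output prices per million tokens, so the dollar cost of a request is a weighted sum of the two. A worked example using the Gemma 3 27b and GPT oss 20b rows above:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of one request, with prices quoted per 1M tokens."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# A 10,000-token prompt with a 2,000-token answer, priced per the list above:
print(request_cost(10_000, 2_000, 0.07, 0.07))  # Gemma 3 27b: ~$0.00084
print(request_cost(10_000, 2_000, 0.08, 0.35))  # GPT oss 20b: ~$0.0015
```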
Model Comparison
| Model | Context | Cutoff | I/O Cost ($ per 1M tokens) | Max Output | Latency (TTFT) | Speed |
|---|---|---|---|---|---|---|
| n/a | 256,000 | Apr 2024 | $0.60 / $2.50 | 33,000 | 1.2s | 45 t/s |
| Llama 3.1 70b | 128,000 | Dec 2023 | n/a | 4,096 | n/a | 2100 t/s |
| Llama 3.1 405b | 128,000 | Dec 2023 | $3.50 / $3.50 | 4,096 | 0.73s | 969 t/s |
| Llama 3.3 70b | 128,000 | Jul 2024 | $0.59 / $0.70 | 32,768 | 0.52s | 2500 t/s |
| DeepSeek V3 | 128,000 | Dec 2024 | $0.27 / $1.10 | 8,000 | 4s | 33 t/s |
| n/a | 131,000 | Dec 2024 | n/a | 8,000 | n/a | n/a |
| DeepSeek R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s |
| Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s |
| Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.20 / $0.60 | 8,000 | 0.45s | 126 t/s |
| Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2600 t/s |
| GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.60 | 131,072 | 8.1s | 260 t/s |
| GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s |
| n/a | 256,000 | Apr 2025 | $0.60 / $2.50 | 16,400 | 25.3s | 79 t/s |
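To query the table programmatically, its rows reduce to a small list of records. The sketch below ranks models by blended price under an assumed 3:1 input-to-output token ratio; that blend is a common convention, not something this leaderboard defines, and only rows with identified model names are transcribed.

```python
# Rows transcribed from the comparison table above (identified models only).
MODELS = [
    {"name": "Llama 3.3 70b",    "context": 128_000,    "in": 0.59, "out": 0.70, "tps": 2500},
    {"name": "Gemma 3 27b",      "context": 128_000,    "in": 0.07, "out": 0.07, "tps": 59},
    {"name": "Llama 4 Maverick", "context": 10_000_000, "in": 0.20, "out": 0.60, "tps": 126},
    {"name": "Llama 4 Scout",    "context": 10_000_000, "in": 0.11, "out": 0.34, "tps": 2600},
    {"name": "GPT oss 120b",     "context": 131_072,    "in": 0.15, "out": 0.60, "tps": 260},
    {"name": "GPT oss 20b",      "context": 131_072,    "in": 0.08, "out": 0.35, "tps": 564},
]

def blended_price(m: dict, ratio: float = 3.0) -> float:
    """Blended $/1M tokens, assuming `ratio` input tokens per output token."""
    return (ratio * m["in"] + m["out"]) / (ratio + 1)

# Cheapest model whose context window fits a 130k-token workload:
eligible = [m for m in MODELS if m["context"] >= 130_000]
print(min(eligible, key=blended_price)["name"])  # -> GPT oss 20b
```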
Benchmark Glossary
GPQA Diamond
Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
AIME 2025
Problems from the 2025 American Invitational Mathematics Examination. Measures multi-step mathematical problem solving.
SWE-Bench Verified
Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
Humanity's Last Exam
A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.
ARC-AGI 2
Abstract visual puzzles requiring novel pattern recognition. Tests fluid intelligence and generalization beyond training data.
MMMLU
A multilingual version of MMLU (Massive Multitask Language Understanding). Evaluates knowledge and reasoning in non-English contexts.