Benchmark
A standardised test used to measure and compare the performance of AI models.
A benchmark is a set of tasks or questions with known correct answers that researchers use to evaluate how capable an AI model is. Common benchmarks test reasoning, coding, maths, language understanding, and factual knowledge. Models are scored and ranked against each other on these tests, which helps track progress in the field.
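As a rough illustration of the scoring and ranking described above, here is a minimal sketch in Python. Everything in it is hypothetical: the four-question `BENCHMARK`, the stand-in functions `model_a` and `model_b` (plain lookup tables, not real AI models), and the exact-match scoring rule. Real benchmarks are far larger and often use more forgiving or more elaborate grading.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy benchmark: (question, known correct answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("Capital of France", "Paris"),
    ("Opposite of hot", "cold"),
    ("3 * 3", "9"),
]


def score(model: Callable[[str], str], benchmark: List[Tuple[str, str]]) -> float:
    """Return the fraction of questions answered with an exact (case-insensitive) match."""
    correct = sum(
        model(question).strip().lower() == answer.strip().lower()
        for question, answer in benchmark
    )
    return correct / len(benchmark)


# Stand-in "models": simple lookup tables used only for illustration.
def model_a(question: str) -> str:
    answers = {
        "2 + 2": "4",
        "Capital of France": "Paris",
        "Opposite of hot": "cold",
        "3 * 3": "6",  # deliberately wrong
    }
    return answers.get(question, "")


def model_b(question: str) -> str:
    answers = {
        "2 + 2": "4",
        "Capital of France": "Lyon",  # deliberately wrong
    }
    return answers.get(question, "")


def rank(
    models: Dict[str, Callable[[str], str]],
    benchmark: List[Tuple[str, str]],
) -> List[Tuple[str, float]]:
    """Score every model on the same benchmark and sort from best to worst."""
    scores = {name: score(fn, benchmark) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


leaderboard = rank({"model_a": model_a, "model_b": model_b}, BENCHMARK)
for name, accuracy in leaderboard:
    print(f"{name}: {accuracy:.0%}")
```

Because both stand-in models answer the same questions and are graded by the same rule, their scores can be compared directly, which is the point of a benchmark.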
Think of standardised exams like the SAT or IELTS. They don't measure everything about a person's ability, but they provide a consistent, comparable score across many candidates. AI benchmarks work the same way: imperfect, but useful for comparison. And just as students can be coached to score well on exams without truly understanding the material, AI models can be optimised specifically for benchmarks without being genuinely more capable.
A high benchmark score doesn't necessarily mean a model is better for your use case. Many benchmarks test narrow academic tasks that don't reflect real-world performance. A model that tops a reasoning benchmark might still fail badly at following specific instructions or maintaining a consistent tone, which matter far more in most products.