Benchmark

A standardized set of tasks with a fixed scoring method, used to compare model performance under the same conditions (e.g., SWE-bench for coding).