AI Benchmarks: A Quest for Meaningful Measurement

The pursuit of superior artificial intelligence (AI) is often fueled by benchmark scores, but are these scores truly indicative of real-world capabilities? The AI community is grappling with this question as traditional benchmarks face increasing scrutiny.

SWE-Bench, introduced in late 2023, rapidly gained traction as a popular tool for assessing an AI model’s coding prowess. It draws on more than 2,000 real programming problems pulled from public GitHub repositories across a dozen Python-based projects. A strong SWE-Bench score has become a coveted badge, prominently displayed in major model releases from leading AI developers such as OpenAI, Anthropic, and Google. Beyond these giants, AI firms specializing in fine-tuning constantly vie for supremacy on the SWE-Bench leaderboard.
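The task format itself is easy to inspect. The sketch below loads a single SWE-Bench task from the copy hosted on Hugging Face; the dataset name and field names are taken from the public dataset card and are assumptions that may differ across SWE-Bench variants.

```python
# Minimal sketch: pull one SWE-Bench task and look at what a model is asked to do.
# Field names follow the public dataset card and may vary between SWE-Bench variants.
from datasets import load_dataset

# Each instance pairs a real GitHub issue with the repository state at the time
# it was filed; a model is scored on whether its proposed patch makes the
# repository's failing tests pass.
swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")

example = swe_bench[0]
print(example["repo"])               # the Python project the issue comes from
print(example["problem_statement"])  # the issue text the model must resolve
print(example["patch"])              # the human-written "gold" fix used for reference
```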

However, the fervor surrounding these benchmarks may be misleading. John Yang, a researcher at Princeton University involved in SWE-Bench’s development, notes that the intense competition for the top spot has led to "gaming" of the system. This raises concerns about whether these benchmarks accurately reflect genuine AI achievement.

The issue isn’t necessarily overt cheating, but rather the development of strategies specifically tailored to exploit the benchmark’s limitations. For example, the initial SWE-Bench focused solely on Python code, incentivizing developers to train their models exclusively on Python. Yang observed that these high-scoring models often faltered when confronted with different programming languages, exposing a superficial understanding he describes as "gilded."

"It looks nice and shiny at first glance, but then you try to run it on a different language and the whole thing just kind of falls apart," Yang explains. "At that point, you’re not designing a software engineering agent. You’re designing to make a SWE-Bench agent, which is much less interesting."

This "SWE-Bench issue" reflects a broader challenge in AI evaluation. Benchmarks, once considered reliable indicators of progress, are increasingly detached from real-world capabilities. Compounding the problem, concerns about transparency have surfaced, further eroding trust in these metrics. Despite these issues, benchmarks continue to play a pivotal role in model development, even though many experts question their inherent value. OpenAI co-founder Andrej Karpathy has even termed the current situation an "evaluation crisis," lamenting the lack of trusted methods for measuring AI capabilities and the absence of a clear path forward.

Vanessa Parli, director of research at Stanford University’s Institute for Human-Centered AI, asks, "Historically, benchmarks were the way we evaluated AI systems. Is that the way we want to evaluate systems going forward? And if it’s not, what is the way?"

A growing contingent of academics and AI researchers advocates for a more focused approach, drawing inspiration from the social sciences. They propose prioritizing "validity," a concept central to quantitative social science, which assesses how well a measurement tool accurately captures the intended construct. This emphasis on validity could challenge benchmarks that evaluate vaguely defined concepts such as "reasoning" or "scientific knowledge." While it may temper the pursuit of artificial general intelligence (AGI), it would provide a more solid foundation for evaluating individual models.

Abigail Jacobs, a professor at the University of Michigan and a leading voice in the push for validity, asserts, "Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does. I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim."

The Limits of Traditional Testing

The AI industry’s reliance on benchmarks stems from their past successes, particularly in challenges like ImageNet.

ImageNet, launched in 2010, presented researchers with a database of over 3 million images categorized into 1,000 different classes. The challenge was method-agnostic, allowing any successful algorithm to gain credibility regardless of its underlying approach. The breakthrough of AlexNet in 2012, which utilized an unconventional form of GPU training, became a cornerstone of modern AI. While few could have predicted that AlexNet’s convolutional neural networks would unlock image recognition, its high score silenced any doubts. (Notably, one of AlexNet’s developers went on to co-found OpenAI.)

ImageNet’s effectiveness stemmed from the close alignment between the challenge and real-world image recognition tasks. Even with debates about methods, the highest-scoring model invariably demonstrated superior performance in practical applications.

However, in the years since, AI researchers have applied this same method-agnostic approach to increasingly general tasks. SWE-Bench, for example, is often used as a proxy for broader coding ability, while other exam-style benchmarks are used to gauge reasoning ability. This broad scope makes it difficult to rigorously define what a specific benchmark measures, hindering responsible interpretation of the findings.

Where Things Break Down

Anka Reuel, a PhD student at Stanford, argues that the push toward generality is at the root of the evaluation problem. "We’ve moved from task-specific models to general-purpose models," Reuel says. "It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder."

Like Jacobs, Reuel believes that "the main issue with benchmarks is validity, even more than the practical implementation," noting: "That’s where a lot of things break down." For complex tasks like coding, it’s nearly impossible to encompass every conceivable scenario in a problem set. Consequently, it becomes difficult to discern whether a model’s higher score reflects genuine coding skill or simply clever manipulation of the problem set. The intense pressure to achieve record scores further incentivizes shortcuts.

Developers hope that success across a multitude of specific benchmarks will translate into a generally capable model. However, the rise of agentic AI, where a single system can incorporate a complex array of models, makes it difficult to evaluate whether improvements on specific tasks will generalize. "There’s just many more knobs you can turn," says Sayash Kapoor, a computer scientist at Princeton and a critic of sloppy practices in the AI industry. "When it comes to agents, they have sort of given up on the best practices for evaluation."

In a paper published last July, Kapoor highlighted specific issues with how AI models in 2024 approached WebArena, a benchmark that tests an AI agent’s ability to navigate the web. The benchmark consists of over 800 tasks performed on cloned websites mimicking Reddit, Wikipedia, and others. Kapoor and his team discovered that the winning model, STeP, exploited the structure of Reddit URLs to directly access user profile pages, a frequent requirement in WebArena tasks.

While not outright cheating, Kapoor considers this a "serious misrepresentation of how well the agent would work had it seen the tasks in WebArena for the first time." Despite this, OpenAI’s web agent, Operator, has since adopted a similar policy.
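To make that shortcut concrete, the sketch below contrasts the two strategies. Everything in it, the clone’s address, the `/user/<name>` path, and the `browse` helper, is a hypothetical stand-in rather than WebArena’s real interface or STeP’s published code.

```python
# Deliberately simplified illustration of the two strategies. The clone URL,
# the /user/<name> path, and the browse() helper are hypothetical stand-ins,
# not WebArena's real interface or STeP's actual code.

BASE_URL = "https://reddit-clone.example"  # hypothetical cloned-site address


def profile_via_navigation(browse, username: str) -> str:
    """What the task intends: start at the front page and work through the
    site step by step, the way an agent seeing it for the first time would."""
    page = browse(f"{BASE_URL}/")         # load the front page
    page = browse(page.search(username))  # search for the user
    return page.first_result_url()        # follow the link the search surfaces


def profile_via_url_template(username: str) -> str:
    """The shortcut: bake in prior knowledge of how the site lays out its
    URLs and jump straight to the profile, skipping navigation entirely."""
    return f"{BASE_URL}/user/{username}"  # assumed URL pattern


print(profile_via_url_template("example_user"))
```

The second function scores just as well on tasks that end at a profile page, but it tells you nothing about how the agent would fare on a site whose URL scheme it has never seen.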

Further illustrating the problems with AI benchmarks, Kapoor and a team of researchers recently published a paper revealing significant issues in Chatbot Arena, a popular crowdsourced evaluation system. Their findings indicated that the leaderboard was being manipulated, with some top foundation model companies engaging in undisclosed private testing and selectively releasing their scores.

Even ImageNet, the benchmark that started it all, is now facing validity problems. A 2023 study by researchers at the University of Washington and Google Research found that ImageNet-winning algorithms showed "little to no progress" when applied to six real-world datasets, suggesting that the test’s external validity had reached its limit.

Going Smaller

To address the validity problem, some researchers propose reconnecting benchmarks to specific tasks. As Reuel puts it, AI developers "have to resort to these high-level benchmarks that are almost meaningless for downstream consumers, because the benchmark developers can’t anticipate the downstream task anymore."

In November 2024, Reuel launched BetterBench, a public ranking project that evaluates benchmarks based on various criteria, including the clarity of code documentation and, crucially, the validity of the benchmark in measuring its stated capability. BetterBench challenges designers to clearly define what their benchmark tests and how it relates to the tasks that comprise the benchmark.

"You need to have a structural breakdown of the capabilities," Reuel says. "What are the actual skills you care about, and how do you operationalize them into something we can measure?"

The results are revealing. The Arcade Learning Environment (ALE), established in 2013 to test models’ ability to learn how to play Atari 2600 games, emerges as one of the highest-scoring benchmarks. Conversely, the Massive Multitask Language Understanding (MMLU) benchmark, a widely used test for general language skills, receives one of the lowest scores due to a poorly defined connection between the questions and the underlying skill.

While BetterBench has yet to significantly impact the reputations of specific benchmarks, it has successfully brought validity to the forefront of discussions about how to improve AI benchmarks. Reuel has joined a new research group hosted by Hugging Face, the University of Edinburgh, and EleutherAI, where she will further develop her ideas on validity and AI model evaluation.

Irene Solaiman, Hugging Face’s head of global policy, says the group will focus on building valid benchmarks that go beyond measuring straightforward capabilities. "There’s just so much hunger for a good benchmark off the shelf that already works," Solaiman says. "A lot of evaluations are trying to do too much."

The broader industry appears to be converging on this view. In a paper published in March, researchers from Google, Microsoft, Anthropic, and others outlined a new framework for improving evaluations, with validity as the cornerstone.

"AI evaluation science must," the researchers argue, "move beyond coarse grained claims of ‘general intelligence’ towards more task-specific and real-world relevant measures of progress."

Measuring the “Squishy” Things

To facilitate this shift, some researchers are turning to the tools of social science. A February position paper argued that "evaluating GenAI systems is a social science measurement challenge," specifically exploring how social science validity systems can be applied to AI benchmarking.

The authors, primarily from Microsoft’s research branch but also including academics from Stanford and the University of Michigan, point to the standards that social scientists use to measure contested concepts like ideology, democracy, and media bias. Applied to AI benchmarks, these same procedures could provide a way to measure concepts like "reasoning" and "math proficiency" without resorting to hazy generalizations.

Social science literature emphasizes the importance of rigorously defining the concept being measured. For example, a test designed to measure the level of democracy in a society must first establish a clear definition of a "democratic society" and then formulate questions relevant to that definition.

To apply this to a benchmark like SWE-Bench, designers would need to abandon the traditional machine learning approach of collecting programming problems from GitHub and creating a scheme to validate answers. Instead, they would first define what the benchmark aims to measure (e.g., "ability to resolve flagged issues in software"), break that down into subskills (e.g., different types of problems or program structures), and then construct questions that accurately cover those subskills.
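As a rough illustration of that workflow, here is a minimal sketch of a benchmark specification written the "define, decompose, then sample" way. The construct, definition, and subskills are illustrative placeholders drawn from the example above, not an actual SWE-Bench redesign.

```python
# Sketch of a construct-first benchmark spec: state what is measured, decompose
# it into subskills, and only then collect items to cover each subskill.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subskill:
    name: str
    description: str
    items: List[str] = field(default_factory=list)  # test items written to cover this subskill


@dataclass
class BenchmarkSpec:
    construct: str             # what the benchmark claims to measure, stated up front
    definition: str            # a precise working definition of that construct
    subskills: List[Subskill]  # the decomposition that items must cover


spec = BenchmarkSpec(
    construct="ability to resolve flagged issues in software",
    definition=("given a reported issue and its repository, produce a change "
                "that fixes the issue without breaking existing tests"),
    subskills=[
        Subskill("bug localization", "identify the files and functions implicated by the issue"),
        Subskill("patch writing", "produce a minimal change that resolves the issue"),
        Subskill("regression safety", "keep previously passing tests passing"),
    ],
)

# Item collection is then driven by coverage of the declared subskills, so a
# final score can be traced back to the construct it claims to measure.
for skill in spec.subskills:
    print(f"{spec.construct} -> {skill.name}: {len(skill.items)} item(s) so far")
```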

For researchers like Jacobs, this profound departure from how AI researchers typically approach benchmarking is precisely the point. "There’s a mismatch between what’s happening in the tech industry and these tools from social science," she says. "We have decades and decades of thinking about how we want to measure these squishy things about humans."

Despite the growing traction of these ideas in the research community, they have been slow to change how AI companies actually use benchmarks.

Recent model releases from OpenAI, Anthropic, Google, and Meta continue to rely heavily on multiple-choice knowledge benchmarks like MMLU, the very approach that validity researchers are attempting to move beyond. Model releases, for the most part, still focus on demonstrating increases in general intelligence, and broad benchmarks are used to support these claims.

Some observers find this satisfactory. Wharton professor Ethan Mollick suggests that benchmarks, despite being "bad measures of things, are also what we’ve got." He adds, "At the same time, the models are getting better. A lot of sins are forgiven by fast progress."

For now, the industry’s long-standing focus on artificial general intelligence appears to be overshadowing a more focused, validity-based approach. As long as AI models continue to advance in general intelligence, specific applications seem less compelling, even if practitioners are using tools they no longer fully trust.

"This is the tightrope we’re walking," says Hugging Face’s Solaiman. "It’s too easy to throw the system out, but evaluations are really helpful in understanding our models, even with these limitations."