A team of AI researchers at the nonprofit research organization METR has proposed a new metric that quantifies the capabilities of AI systems in terms of human capabilities. They have published a paper on the arXiv preprint server describing the metric, which they call the “task-completion time horizon” (TCTH).
Large language models (LLMs), such as those in the GPT series, are getting better at producing reliable results with each new iteration. In this new study, the California-based team notes that the ways such models are currently evaluated do not fully capture a system’s capabilities. In response, they have devised a metric that quantifies those capabilities in a way that applies across multiple fields, such as writing computer programs or generating the steps needed to carry out a task.
With TCTH, an AI system is rated by the length of the tasks it can complete, where length is the time human experts need to finish the same tasks. As one example, the researchers found that early versions of LLMs failed to complete any of a certain group of tasks that human experts could get done in one minute. In sharp contrast, the recently released Claude 3.7 Sonnet can complete 50% of certain tasks that took humans an average of 59 minutes.

By assembling a set of tasks and measuring how long humans take to complete each one, the new metric could serve as the basis for a benchmark that tracks how well AI models are progressing. Such benchmarks, the researchers suggest, should be based on a 50% success rate, because that threshold has so far proved the most robust to variations in the data distribution.
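The idea of a 50% time horizon can be illustrated with a short sketch. The article does not spell out the fitting procedure, so the code below assumes one plausible approach: fit a logistic curve to a model's successes and failures as a function of the base-2 logarithm of human completion time, then read off the task length at which the predicted success rate is 50%. The data in RESULTS are made up for illustration, and the function names fit_logistic and time_horizon are hypothetical, not taken from the paper.

```python
import math

# Hypothetical results for one model:
# (minutes a human expert needed, whether the model succeeded).
RESULTS = [
    (1, True), (1, True),
    (2, True), (2, True),
    (4, True), (4, False),
    (8, True), (8, False),
    (16, False), (16, True),
    (32, False), (32, False),
    (64, False), (64, False),
]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, iterations=50, damping=1e-6):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method.

    Assumes the data are not perfectly separable; otherwise the
    maximum-likelihood estimate does not exist and the fit diverges.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = 0.0            # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0   # negative Hessian entries
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        # Tiny damping on the diagonal keeps the 2x2 system invertible.
        h_aa += damping
        h_bb += damping
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def time_horizon(results, success_rate=0.5):
    """Human task length (minutes) at which the fitted success rate equals success_rate."""
    xs = [math.log2(minutes) for minutes, _ in results]
    ys = [1.0 if ok else 0.0 for _, ok in results]
    a, b = fit_logistic(xs, ys)
    target = math.log(success_rate / (1.0 - success_rate))  # logit of the rate
    return 2.0 ** ((target - a) / b)


print(f"50% time horizon: {time_horizon(RESULTS):.1f} human-minutes")
```

The made-up results are deliberately symmetric around the 8-minute tasks, so the fitted 50% horizon lands at about 8 human-minutes. Passing a stricter threshold, for example time_horizon(RESULTS, 0.8), gives a shorter horizon, since the model must succeed more often at that task length.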
As part of their work with the new metric, the research team found that AI models are improving dramatically at completing long tasks in areas such as programming, cybersecurity, general reasoning and machine learning. Such progress suggests that AI systems could soon be used to carry out major assignments such as chemical discovery or even entire engineering projects.
More information:
Thomas Kwa et al, Measuring AI Ability to Complete Long Tasks, arXiv (2025). DOI: 10.48550/arxiv.2503.14499
© 2025 Science X Network
Citation: A new metric to quantify capabilities of AI systems in terms of human capabilities (2025, March 20), retrieved 21 March 2025