A new metric to quantify capabilities of AI systems in terms of human capabilities


Our methodology for measuring AI agent time horizon. Credit: arXiv (2025). DOI: 10.48550/arxiv.2503.14499

A team of AI researchers at startup METR is proposing a new metric to quantify the capabilities of AI systems in terms of human capabilities. They have published a paper on the arXiv preprint server describing the new metric, which they call “task-completion time horizon” (TCTH).

LLMs have become better at producing reliable results with each iteration since early models such as GPT-2. In this new study, the California-based team noted that such models are still being described in ways that fail to fully capture a system's capabilities. They have therefore devised a metric that quantifies capabilities in a way that applies across multiple fields, such as writing computer programs or generating the steps needed to carry out a task.

With TCTH, an AI system's abilities are quantified by testing it against humans on the same tasks. As one example, the researchers found that early LLMs failed to complete any of a set of tasks that human experts could finish in about one minute. In sharp contrast, the latest version of Claude 3.7 Sonnet can complete 50% of tasks that took humans an average of 59 minutes.

The length of tasks (measured by how long they take human professionals) that generalist autonomous frontier model agents can complete with 50% reliability has been doubling approximately every 7 months for the last 6 years. Credit: arXiv (2025). DOI: 10.48550/arxiv.2503.14499

By compiling a list of tasks and measuring how long humans take to complete them, the new metric could be used to build a benchmark for how well AI models are stacking up. Such benchmarks, the researchers suggest, should be based on a 50% success rate, which has thus far proven the most robust threshold in their data distribution analysis.
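The 50% threshold can be made concrete with a small sketch: fit a logistic curve relating success probability to the logarithm of human task duration, then solve for the task length at which the model succeeds half the time. The benchmark data below is purely illustrative (not from the METR paper), and the plain gradient-ascent fit is an assumption standing in for whatever regression procedure the authors actually used.

```python
import math

# Hypothetical benchmark results: (human_minutes, model_succeeded).
# These numbers are illustrative only, not from the METR paper.
results = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (30, 1),
    (15, 0), (30, 0), (60, 0), (120, 0), (240, 0),
]

# Fit P(success) = 1 / (1 + exp(-(a + b * log2(minutes))))
# by gradient ascent on the log-likelihood.
a, b = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    grad_a = grad_b = 0.0
    for minutes, succeeded in results:
        x = math.log2(minutes)
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        grad_a += succeeded - p
        grad_b += (succeeded - p) * x
    a += lr * grad_a / len(results)
    b += lr * grad_b / len(results)

# P = 0.5 exactly when a + b * log2(m) = 0, so m = 2^(-a/b):
# the 50% task-completion time horizon in human-minutes.
horizon_minutes = 2.0 ** (-a / b)
print(f"50% time horizon: about {horizon_minutes:.1f} human-minutes")
```

With synthetic data like this, the fitted horizon falls where successes give way to failures (here, somewhere in the tens of minutes), which is the single number the metric reports for a model.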

Using the new metric, the research team found that AI models are improving dramatically at completing long tasks, such as programming, cybersecurity assignments, general reasoning and machine learning work. Such progress suggests that they could soon be used to carry out major assignments like chemical discovery or even whole engineering projects.
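The doubling trend from the figure caption (roughly every 7 months) can be extrapolated naively. The 59-minute starting horizon comes from the article; the projected values below are a simple compounding illustration, not a forecast from the paper.

```python
# Naive extrapolation of the ~7-month doubling trend reported by METR.
# Starting horizon (59 human-minutes) is from the article; everything
# projected forward is an illustrative assumption.
doubling_months = 7
horizon_minutes = 59.0  # current 50% time horizon

for months_ahead in (0, 12, 24, 36):
    projected = horizon_minutes * 2 ** (months_ahead / doubling_months)
    print(f"+{months_ahead:2d} months: ~{projected / 60:.1f} human-hours")
```

Under this trend, a one-hour horizon would pass a full human workday within about two years, which is why the authors frame the metric as a way to track progress toward week-long "whole project" assignments.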

More information:
Thomas Kwa et al, Measuring AI Ability to Complete Long Tasks, arXiv (2025). DOI: 10.48550/arxiv.2503.14499

Journal information:
arXiv


© 2025 Science X Network

Citation:
A new metric to quantify capabilities of AI systems in terms of human capabilities (2025, March 20), retrieved 21 March 2025
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.



