A new metric to quantify capabilities of AI systems in terms of human capabilities

Our methodology for measuring AI agent time horizon. Credit: arXiv (2025). DOI: 10.48550/arxiv.2503.14499

A team of AI researchers at METR, a nonprofit research organization, is proposing a new metric that quantifies the capabilities of AI systems in terms of human capabilities. They have published a paper on the arXiv preprint server describing the metric, which they call “task-completion time horizon” (TCTH).

LLMs, from early models such as GPT-2 onward, have been getting better at producing reliable results with each new iteration. In this new study, the team in California notes that the ways such models are currently described fall short of capturing a system’s full capabilities. To address that, they have devised a metric that quantifies capabilities in a way that applies across multiple fields, such as writing computer programs or generating the steps needed to carry out a task.

With TCTH, a task is quantified by how long it takes humans to complete it, and AI models are then tested on the same task. As one example, the researchers found that early LLMs failed to complete any of a certain group of tasks that human experts could finish in one minute. In sharp contrast, Claude 3.7 Sonnet, one of the latest models, can complete 50% of certain tasks that took humans an average of 59 minutes to complete.

The length of tasks (measured by how long they take human professionals) that generalist autonomous frontier model agents can complete with 50% reliability has been doubling approximately every 7 months for the last 6 years. Credit: arXiv (2025). DOI: 10.48550/arxiv.2503.14499
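
To get a feel for what that doubling rate implies, the arithmetic can be sketched in a few lines of Python. The 7-month doubling time and 6-year span come from the caption above; the function names are illustrative, and the calculation assumes perfectly steady exponential growth, which the real data only approximate.

```python
import math

DOUBLING_MONTHS = 7.0  # approximate doubling time quoted in the caption
SPAN_MONTHS = 6 * 12   # the six-year period quoted in the caption


def growth_factor(span_months: float, doubling_months: float) -> float:
    """Multiplicative growth of a quantity that doubles every `doubling_months`."""
    return 2.0 ** (span_months / doubling_months)


def months_to_grow(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` at a steady doubling rate."""
    return doubling_months * math.log2(factor)


doublings = SPAN_MONTHS / DOUBLING_MONTHS
factor = growth_factor(SPAN_MONTHS, DOUBLING_MONTHS)

print(f"Doublings in six years: {doublings:.1f}")
print(f"Implied growth in task length: about {factor:,.0f}x")
print(f"Months for a 10x increase: {months_to_grow(10, DOUBLING_MONTHS):.1f}")
```

Under those assumptions, six years corresponds to a little over ten doublings, so the task length handled at 50% reliability grows by a factor on the order of a thousand.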

By assembling a list of tasks and timing how long humans take to complete them, the researchers suggest, the new metric could be used to develop benchmarks that measure how well AI models stack up. Such benchmarks, they add, should be based on a 50% success rate, because that threshold has so far proved the most robust to changes in the data distribution.
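
As a rough illustration of how a 50% time horizon could be read off such a task list, the Python sketch below fits a logistic curve of success probability against the logarithm of human completion time and solves for the task length at which the curve crosses 50%. The logistic model, the function names and the results table are all assumptions made for illustration: the numbers are invented, and this is not presented as the researchers' own code.

```python
import math

# Hypothetical results: (human completion time in minutes, successes, attempts).
# These numbers are invented for illustration and are not from the paper.
RESULTS = [
    (1, 10, 10), (2, 10, 10), (4, 9, 10), (8, 8, 10), (16, 6, 10),
    (32, 5, 10), (64, 3, 10), (128, 1, 10), (256, 0, 10),
]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(results, iterations: int = 50):
    """Fit P(success) = sigmoid(a * log2(minutes) + b) by Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for minutes, successes, attempts in results:
            x = math.log2(minutes)
            p = sigmoid(a * x + b)
            resid = successes - attempts * p   # gradient contribution
            w = attempts * p * (1.0 - p)       # curvature contribution
            g_a += resid * x
            g_b += resid
            h_aa += w * x * x
            h_ab += w * x
            h_bb += w
        # Solve the 2x2 Newton system for the parameter update.
        det = h_aa * h_bb - h_ab * h_ab
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < 1e-10:
            break
    return a, b


def time_horizon(a: float, b: float, success_rate: float = 0.5) -> float:
    """Task length in minutes at which the fitted success probability equals success_rate."""
    logit = math.log(success_rate / (1.0 - success_rate))
    return 2.0 ** ((logit - b) / a)


a, b = fit_logistic(RESULTS)
h50 = time_horizon(a, b, 0.5)
h80 = time_horizon(a, b, 0.8)
print(f"50% time horizon: {h50:.1f} minutes")
print(f"80% time horizon: {h80:.1f} minutes")
```

Asking for a higher success rate, such as 80%, yields a shorter horizon from the same fitted curve, which is why the choice of threshold has to be stated alongside any reported figure.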

As part of their work with the new metric, the research team found that AI models are improving dramatically at completing long tasks in areas such as programming, cybersecurity, general reasoning and machine learning. Such progress suggests that they could soon be used to carry out major assignments like chemical discovery or even whole engineering projects.

More information:
Thomas Kwa et al, Measuring AI Ability to Complete Long Tasks, arXiv (2025). DOI: 10.48550/arxiv.2503.14499

Journal information:
arXiv

Citation:
A new metric to quantify capabilities of AI systems in terms of human capabilities (2025, March 20)
retrieved 21 March 2025
from

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.
