Given the recent explosion of large language models (LLMs) that can make convincingly human-like statements, it makes sense that there has been a deepened focus on developing models that can explain how they make decisions. But how can we be sure that what they’re saying is the truth?
In a new paper, researchers from Microsoft and MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) propose a novel method for measuring LLM explanations with respect to their “faithfulness”—that is, how accurately an explanation represents the reasoning process behind the model’s answer.
As lead author and Ph.D. student Katie Matton explains, faithfulness is no minor concern: if an LLM produces explanations that are plausible but unfaithful, users might develop false confidence in its responses and fail to recognize when recommendations are misaligned with their own values, like avoiding bias in hiring.
In areas like health care or law, unfaithful explanations could have serious consequences: the researchers specifically call out an example in which GPT-3.5 gave higher ratings to female candidates for a nursing position than to male ones even when genders were swapped, yet claimed that its answers were based only on age, skills, and traits.
Prior methods for measuring faithfulness produce quantitative scores that can be difficult for users to interpret—what does it mean for an explanation to be, say, 0.63 faithful? Matton and colleagues focused on developing a faithfulness metric that could help users to understand the ways in which explanations are misleading.
To accomplish this, they introduced “causal concept faithfulness,” which measures the difference between the set of concepts in the input text that the LLM’s explanations imply were influential and the set that truly had a causal effect on the model’s answer. Examining the discrepancy between these two concept sets reveals interpretable patterns of unfaithfulness—for example, that an LLM’s explanations don’t mention gender when they should.
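To make the idea concrete, here is a minimal sketch (not the paper’s exact formulation) of how such a discrepancy might be surfaced, assuming we already have two hypothetical per-concept scores: one estimating how often a concept is cited as a reason in the model’s explanations, and one estimating its causal effect on the model’s answers.

```python
# Minimal sketch of a causal-concept-faithfulness style comparison.
# Assumes two hypothetical per-concept scores are already available:
#   explained[c] - fraction of explanations that cite concept c as a reason
#   causal[c]    - estimated causal effect of concept c on the model's answer
# The paper's actual metric is more involved; this only illustrates the idea
# of surfacing concepts whose stated and actual influence diverge.

explained = {"gender": 0.02, "age": 0.60, "skills": 0.75, "traits": 0.40}
causal    = {"gender": 0.55, "age": 0.10, "skills": 0.70, "traits": 0.05}

def unfaithfulness_report(explained, causal):
    """Rank concepts by the gap between causal influence and stated influence."""
    gaps = {c: causal[c] - explained[c] for c in causal}
    return sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True)

for concept, gap in unfaithfulness_report(explained, causal):
    direction = "under-reported" if gap > 0 else "over-reported"
    print(f"{concept:>8}: gap={gap:+.2f} ({direction})")
```

On these toy numbers, the report would flag gender as strongly under-reported: it has a large causal effect on the answers but is almost never mentioned in the explanations.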
The researchers first use an auxiliary LLM to identify the key concepts in the input question. Next, to assess the causal effect of each concept on the primary LLM’s answer, they examine whether changing that concept changes the LLM’s answer.
To do this, they use the auxiliary LLM to generate realistic counterfactual questions in which the value of a concept is modified—for example, changing a candidate’s gender or removing a piece of clinical information. They then collect the primary LLM’s responses to the counterfactual questions and examine how its answers change.
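A stripped-down version of this intervention loop might look like the sketch below. The two callables, make_counterfactual and ask_primary_llm, are hypothetical stand-ins for the auxiliary- and primary-LLM calls described above, so the function runs with whatever stubs or real API wrappers you supply.

```python
from collections import defaultdict

def estimate_concept_effects(question, concepts, make_counterfactual, ask_primary_llm,
                             n_counterfactuals=5):
    """Sketch: estimate how often editing each concept flips the model's answer.

    make_counterfactual(question, concept) -> a rewritten question in which that
        concept's value is changed (done by an auxiliary LLM in the paper).
    ask_primary_llm(question) -> the primary model's answer as a string.
    Both callables are hypothetical stand-ins, not a real API.
    """
    original_answer = ask_primary_llm(question)
    flip_rates = defaultdict(float)
    for concept in concepts:
        flips = 0
        for _ in range(n_counterfactuals):
            cf_question = make_counterfactual(question, concept)
            if ask_primary_llm(cf_question) != original_answer:
                flips += 1
        flip_rates[concept] = flips / n_counterfactuals
    return dict(flip_rates)

# Toy usage with stubs (no real LLM calls): only the gender edit flips the answer.
stub_cf = lambda q, c: f"{q} [with {c} changed]"
stub_llm = lambda q: "hire" if "gender changed" not in q else "do not hire"
print(estimate_concept_effects("Rate this candidate.", ["gender", "age", "skills"],
                               stub_cf, stub_llm))
```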
Estimating concept effects can be expensive because it involves repeated calls to the LLM to collect its answers to the counterfactual questions. To address this, the team employs a Bayesian hierarchical model to estimate concept effects across multiple questions jointly.
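The paper’s hierarchical model is more sophisticated than this, but the sketch below illustrates the underlying idea of partial pooling: each question’s noisy per-concept flip rate is shrunk toward the pooled rate across questions, so questions with few counterfactual samples borrow strength from the rest. The function name and the prior_strength parameter are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

def partial_pool(flip_counts, totals, prior_strength=2.0):
    """Simplified partial pooling (empirical-Bayes flavored, not the paper's model).

    flip_counts[q] - number of answer flips for this concept on question q
    totals[q]      - number of counterfactuals tried for this concept on question q
    Each per-question rate is shrunk toward the pooled rate across questions,
    with more shrinkage when a question has fewer samples.
    """
    flip_counts = np.asarray(flip_counts, dtype=float)
    totals = np.asarray(totals, dtype=float)
    pooled_rate = flip_counts.sum() / totals.sum()          # concept-level mean
    shrunk = (flip_counts + prior_strength * pooled_rate) / (totals + prior_strength)
    return pooled_rate, shrunk

# Three questions with different numbers of counterfactual samples:
pooled, per_question = partial_pool(flip_counts=[4, 1, 0], totals=[5, 5, 1])
print(f"pooled effect: {pooled:.2f}, per-question: {np.round(per_question, 2)}")
```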
In empirical tests, the researchers compared GPT-3.5, GPT-4o, and Claude-3.5-Sonnet on two question-answering datasets. Matton cites two particularly important findings:
- On a dataset of questions designed to test for social biases in language models, they found cases in which LLMs provide explanations that mask their reliance on social biases. In other words, the LLMs make decisions that are influenced by social identity information, such as race, income, and gender—but then they justify their decisions based on other factors, such as an individual’s behavior.
- On a dataset of medical questions involving hypothetical patient scenarios, the team’s method revealed cases in which LLM explanations omit pieces of evidence that have a large effect on the model’s answers regarding patient treatment and care.
The authors do note some limitations to their method and analysis, including their reliance on the auxiliary LLM, which can make occasional errors. Their approach can also sometimes underestimate the causal effects of concepts that are highly correlated with other concepts in the input; they suggest multi-concept interventions as a future improvement.
The research team says that, by uncovering specific patterns in misleading explanations, their method can enable a targeted response to unfaithful explanations. For example, a user who sees that an LLM exhibits gender bias may avoid using it to compare candidates of different genders—and a model developer could deploy a tailored fix to correct the bias. Matton says that she sees their method as an important step toward building more trustworthy and transparent AI systems.
More information:
Katie Matton et al. Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations. ICLR 2025 Spotlight. openreview.net/forum?id=4ub9gpx9xw