As large language model (LLM) systems grow in complexity, the challenge of ensuring their outputs align with human intentions has become critical. Interpretability—the ability to explain how models reach their decisions—and control—the ability to steer them toward desired outcomes—are two sides of the same coin.
In “Towards Unifying Interpretability and Control: Evaluation via Intervention,” Usha Bhalla, a PhD student at Harvard University, Graduate Fellow at the Kempner Institute, and fellow at the Digital Data Design Institute (D^3) Trustworthy AI Lab; Suraj Srinivas, Research Scientist at Bosch AI; Himabindu Lakkaraju, Assistant Professor of Business Administration at Harvard Business School and PI in D^3’s Trustworthy AI Lab; and Asma Ghandeharioun, Senior Research Scientist at Google DeepMind, find that many methods developed to address these issues focus on one aspect while neglecting the other. The study introduces an approach that unifies interpretability and control, proposes intervention as interpretability’s primary goal, and evaluates how well different methods enable control through intervention.
Key Insight: Intervention as a Fundamental Goal of Interpretability
“[W]e view intervention as a fundamental goal of interpretability, and propose to measure the correctness of interpretability methods by their ability to successfully edit model behaviour.” [1]
The authors define intervention as the deliberate modification of specific human-interpretable features within a model’s latent representations (1) to achieve desired changes in its outputs, or its responses to prompts. They argue that the ability to intervene in a model’s behavior this way should be a core objective of interpretability methods. By focusing on intervention, they provide a practical way to assess the effectiveness of various interpretability techniques. This approach shifts the focus from understanding a model’s inner workings to actively influencing its outputs, bridging the gap between theory and application.
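To make this concrete, the minimal Python sketch below shows what such an intervention could look like. It assumes hypothetical `encode` and `decode` functions that map between a model’s hidden state and a human-interpretable feature space; the names and shapes are illustrative, not the paper’s API.

```python
import torch

def intervene(hidden_state: torch.Tensor, encode, decode,
              feature_idx: int, new_value: float) -> torch.Tensor:
    """Edit one interpretable feature and map the result back to latent space.

    `encode` and `decode` are hypothetical placeholders for whatever
    interpretability method is in use (logit lens, tuned lens, SAE, probe).
    """
    features = encode(hidden_state)          # latent state -> interpretable features
    features[..., feature_idx] = new_value   # overwrite the targeted feature (e.g., "sentiment")
    return decode(features)                  # interpretable features -> latent state

# The edited hidden state replaces the original activation at the chosen layer,
# and the forward pass then continues, producing the steered output.
```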
Key Insight: A Unified Framework for Interpretability and Control
“[W]e present an encoder-decoder framework that unifies four popular mechanistic interpretability methods: sparse autoencoders, logit lens, tuned lens, and probing.” [2]
The study uncovered a critical limitation in current interpretability methods: their performance varies significantly across different models and features. To address this, Bhalla et al. present an encoder-decoder framework that unifies diverse interpretability methods. The framework maps intermediate latent representations to human-understandable feature spaces, allowing interventions on those features; the edited features can then be translated back into latent representations to influence the model’s outputs.

The study evaluates four methods within this unified framework to determine their relative strengths and weaknesses for both interpretability and control (the code sketch after the list below illustrates the shared encoder-decoder structure):
- Logit Lens: Easy to use, requires no training, maps features directly to individual tokens in the model’s vocabulary, and generally has high causal fidelity (2), but is limited by predefined, static features
- Tuned Lens: Extends Logit Lens with an additional learned linear transformation (3), which improves its flexibility and effectiveness, but requires additional training and tuning
- Sparse autoencoders (SAEs): Can learn a large dictionary of low-level and high-level (abstract) features, but are difficult to train and label, and show lower causal fidelity
- Probing: Trains simple classifiers (often linear) on top of model representations to predict specific features or concepts, but is prone to spurious correlations, leading to low causal fidelity
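Under the paper’s unifying view, each of these methods supplies a pair of maps: an encoder from hidden states to interpretable features and a decoder back to hidden-state space. The sketch below illustrates that shared interface; the logit-lens-style pair shown (a stand-in unembedding matrix as the encoder and its pseudo-inverse as the decoder) is an illustrative simplification, not the paper’s exact parameterization, and the dimensions are placeholders.

```python
import torch

class EncoderDecoderMethod:
    """One interface for interpretability methods: encode, edit, decode."""

    def __init__(self, encoder, decoder):
        self.encoder = encoder  # hidden state -> interpretable features
        self.decoder = decoder  # interpretable features -> hidden state

    def edit(self, h: torch.Tensor, feature_idx: int, value: float) -> torch.Tensor:
        z = self.encoder(h)          # interpret: read off the features
        z[..., feature_idx] = value  # intervene: overwrite one feature
        return self.decoder(z)       # control: map the edit back to latent space

# Illustrative "logit lens"-style instantiation: features are token logits,
# the encoder is a random stand-in unembedding matrix, and the decoder is its
# pseudo-inverse (a least-squares map back to the hidden-state space).
d_model, vocab_size = 64, 1000
W_U = torch.randn(vocab_size, d_model)

logit_lens = EncoderDecoderMethod(
    encoder=lambda h: h @ W_U.T,                     # project onto the vocabulary
    decoder=lambda z: z @ torch.linalg.pinv(W_U).T,  # map logits back to a hidden state
)

h = torch.randn(d_model)
h_edited = logit_lens.edit(h, feature_idx=42, value=10.0)  # boost one token's feature
```

A sparse autoencoder or a probe would slot into the same interface with learned encoder and decoder maps, which is what allows the paper to compare the methods on equal footing.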
Key Insight: Measuring Success Through Interventions
“[W]e propose two evaluation metrics for encoder-decoder interpretability methods, namely (1) intervention success rate; and (2) the coherence-intervention tradeoff to evaluate the ability of interpretability methods to control model behavior.” [3]
The authors introduce two metrics to determine whether interventions are accurate and maintain the integrity and functionality of AI systems in real-world applications (a sketch of how these metrics might be computed follows the list):
- Intervention success rate: Measures effectiveness, i.e., whether the intervention achieves its intended goal
- Coherence-intervention tradeoff: Measures practical utility, ensuring the intervention does not make the model’s outputs unusable by affecting its coherence and quality
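As a rough illustration only, the loop below sketches how these two quantities might be computed. It assumes hypothetical helpers: `generate_with_intervention` (produces text after editing the target feature), `feature_present` (judges whether the targeted concept appears in the output), and `coherence_score` (rates fluency, for example via perplexity or an LLM judge). None of these names come from the paper.

```python
def evaluate_interventions(prompts, target_feature,
                           generate_with_intervention,
                           feature_present, coherence_score):
    """Return (intervention success rate, average coherence) over a prompt set."""
    successes, coherences = [], []
    for prompt in prompts:
        output = generate_with_intervention(prompt, target_feature)
        successes.append(float(feature_present(output, target_feature)))  # did the edit take effect?
        coherences.append(coherence_score(output))                        # is the text still usable?
    success_rate = sum(successes) / len(successes)
    avg_coherence = sum(coherences) / len(coherences)
    # Sweeping the intervention strength and plotting success rate against
    # average coherence traces out the coherence-intervention tradeoff.
    return success_rate, avg_coherence
```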
Among the methods evaluated, the two lens-based approaches had the highest intervention success rates. However, given current shortcomings, such as inconsistency across models and features and the risk of degrading output coherence and quality, the authors found that simpler options, such as prompting, still outperform intervention methods for steering model behavior.
Why This Matters
For business professionals and C-suite executives, the insights presented by Bhalla and her team represent a pivotal development in the practical application of AI technologies. As organizations increasingly rely on AI for tasks ranging from low-level to critical, understanding how to align these systems with human and organizational values is paramount. The proposed framework and metrics provide actionable tools to ensure AI systems are both correct and usable. The study also underscores the need to select and evaluate interpretability methods carefully based on the specific models used and tasks involved.
Footnotes
(1) Latent representation refers to the internal, abstract representation of data within a machine learning model. These representations are not directly interpretable by humans but encode meaningful patterns or features of the input data.
(2) Causal fidelity is the extent to which intervening on a specific feature of an explanation results in the corresponding change in the model’s output.
(3) A linear transformation is a mathematical function that converts one vector into another while maintaining the properties of vector addition and scalar multiplication. Put simply, it can rotate, stretch, or shear vectors, but it keeps straight lines straight and leaves the origin fixed.
References
[1] Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, and Himabindu Lakkaraju, “Towards Unifying Interpretability and Control: Evaluation via Intervention”, arXiv preprint arXiv:2411.04430v1 (November 7, 2024): 2.
[2] Bhalla et al., “Towards Unifying Interpretability and Control: Evaluation via Intervention”, 3.
[3] Bhalla et al., “Towards Unifying Interpretability and Control: Evaluation via Intervention”, 3.
Meet the Authors

Usha Bhalla is a PhD student in Computer Science at Harvard University and the Kempner Institute, and a fellow at the Digital Data Design Institute (D^3) Trustworthy AI Lab. Advised by Hima Lakkaraju, her research focuses on machine learning interpretability. Bhalla is also a dedicated advocate for diversity in computer science, mentoring early-career minority students to support their growth in the field.

Suraj Srinivas is a Research Scientist at Bosch AI with a focus on model interpretability, data-centric machine learning, and the “science” of deep learning. They completed their Ph.D. with François Fleuret at Idiap Research Institute & EPFL, Switzerland, and were a postdoctoral research fellow with Hima Lakkaraju at Harvard University. They have organized workshops and seminars on interpretable AI, including sessions at NeurIPS 2023 and 2024, and contributed to teaching an explainable AI course at Harvard. Their work bridges theoretical advancements and practical applications of explainable AI.

Himabindu Lakkaraju is an Assistant Professor of Business Administration at Harvard Business School and PI in D^3’s Trustworthy AI Lab. She is also a faculty affiliate in the Department of Computer Science at Harvard University, the Harvard Data Science Initiative, the Center for Research on Computation and Society, and the Laboratory of Innovation Science at Harvard. She teaches the first-year course on Technology and Operations Management and has previously offered multiple courses and guest lectures on a diverse set of topics pertaining to Artificial Intelligence (AI) and Machine Learning (ML) and their real-world implications.

Asma Ghandeharioun is a Senior Research Scientist at Google DeepMind, where she focuses on aligning AI with human values by understanding, controlling, and demystifying language models. She earned her Ph.D. from the MIT Media Lab’s Affective Computing Group and has conducted research at Google Research, Microsoft Research, and EPFL. Previously, she worked in digital mental health, collaborating with Harvard medical professionals and publishing in leading journals.