As artificial intelligence systems become increasingly complex, understanding their behavior has become a critical challenge for businesses and researchers alike. In a recent preprint paper, “Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability,” authors Shichang Zhang, a postdoctoral fellow in the Trustworthy AI Lab at the Digital Data Design (D^3) Institute at Harvard; Tessa Han, a PhD student in Bioinformatics and Integrative Genomics at Harvard Medical School; Usha Bhalla, a PhD student in computer science at Harvard; and Himabindu Lakkaraju, Assistant Professor of Business Administration at HBS and lead researcher at D^3’s Trustworthy AI Lab, propose a unified view of three traditionally separate methods for attributing model behavior. This approach aims to bridge the fragmented landscape of AI interpretability and support a more holistic understanding of how models work.
Key Insight: The Unified Attribution Framework
“We take the position that […] feature, data, and component attribution share core techniques despite their different perspectives.” [1]
In this paper, Zhang and colleagues propose a unified framework that brings together three traditionally separate attribution methods: feature attribution (FA), which identifies the input features that most influence an AI model’s output; data attribution (DA), which traces how specific training-data points shape a model’s behavior; and component attribution (CA), which examines how internal parts of a model contribute to its output. Although these methods have evolved independently, they share fundamental techniques such as perturbations, gradients, and linear approximations. By unifying them, the researchers aim to provide a more comprehensive understanding of AI systems’ behavior.
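To make the shared mechanics concrete, here is a minimal, illustrative sketch (our own, not code from the paper) of perturbation-based attribution: each input feature is ablated in turn and the change in the model’s score is recorded. The function and variable names are hypothetical, and the same perturb-and-measure recipe can in principle be pointed at training examples (DA) or internal model components (CA) instead of input features.

```python
import numpy as np

def perturbation_attribution(model_fn, x, baseline=0.0):
    """Estimate per-feature attributions by perturbing one feature at a time.

    model_fn: callable mapping a 1-D feature vector to a scalar score.
    x: the input to explain (1-D numpy array).
    baseline: value used to "remove" a feature (an illustrative choice).
    """
    base_score = model_fn(x)
    attributions = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_perturbed = x.copy()
        x_perturbed[i] = baseline                              # ablate feature i
        attributions[i] = base_score - model_fn(x_perturbed)   # drop in score when i is removed
    return attributions

# Toy usage: a linear "model" whose true weights the attributions should recover.
weights = np.array([2.0, -1.0, 0.5])
model = lambda v: float(weights @ v)
x = np.array([1.0, 1.0, 1.0])
print(perturbation_attribution(model, x))  # approximately [2.0, -1.0, 0.5]
```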
Key Insight: Supporting Further Research
“Attribution methods also hold immense potential to benefit broader AI research for other applications.” [2]
The unified framework offers several advantages for advancing AI interpretability research. By reducing fragmented terminology, it promotes conceptual coherence and facilitates more effective communication and collaboration. It also enables cross-attribution innovation, allowing researchers to adapt solutions developed for one attribution type to another, for example, applying the efficient sampling techniques developed for perturbation-based FA, which perturbs parts of an input and measures the effect on the model’s output, to improve DA methods. Finally, it simplifies theoretical analysis by identifying common mathematical underpinnings, streamlining research efforts and paving the way for more robust and generalizable techniques.
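As a toy illustration of that transfer (again our own sketch under simplifying assumptions, not the authors’ method), the same subset-sampling idea used to approximate perturbation-based FA can estimate a training point’s influence: compare validation loss across random training subsets that exclude versus include that point. Here the model is ordinary least squares so the example stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_and_score(X, y, X_val, y_val):
    """Fit ordinary least squares on (X, y) and return validation MSE."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X_val @ w - y_val) ** 2))

def sampled_data_attribution(X, y, X_val, y_val, idx, n_samples=200):
    """Monte-Carlo estimate of training point idx's influence on validation loss:
    average loss difference between random subsets without vs. with that point."""
    n = len(y)
    others = np.array([i for i in range(n) if i != idx])
    deltas = []
    for _ in range(n_samples):
        subset = others[rng.random(len(others)) < 0.5]   # random half of the other points
        if len(subset) < X.shape[1] + 1:                  # need enough points to fit
            continue
        loss_without = fit_and_score(X[subset], y[subset], X_val, y_val)
        with_idx = np.append(subset, idx)
        loss_with = fit_and_score(X[with_idx], y[with_idx], X_val, y_val)
        deltas.append(loss_without - loss_with)           # > 0 means the point helps
    return float(np.mean(deltas))

# Toy data: y = 3*x + noise, with one corrupted training label (index 0).
X = rng.normal(size=(40, 1)); y = 3 * X[:, 0] + 0.1 * rng.normal(size=40)
y[0] += 10.0                                              # corrupt one label
X_val = rng.normal(size=(20, 1)); y_val = 3 * X_val[:, 0]
print(sampled_data_attribution(X, y, X_val, y_val, idx=0))  # likely negative: the point hurts
```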
Key Insight: Implications for AI Regulation and Ethics
“FA reveals input processing patterns, DA exposes training data influences, and CA illuminates architectural roles. This multi-faceted understanding enables more targeted and effective regulation.” [3]
By providing a comprehensive view of AI system behavior, the unified attribution framework enables more informed and targeted regulatory approaches. The authors illustrate this with a real-world example: when tackling bias in AI, the framework lets regulators pinpoint potentially discriminatory input features (FA), identify and track problematic or copyrighted training materials (DA), and highlight specific components within the AI’s architecture that may contribute to biased outcomes (CA).
The authors note that regulation and policy frequently stress the need for transparency in AI systems and users’ right to an explanation. The unified attribution framework provides a powerful tool for practitioners to meet these legal and ethical requirements by offering detailed insights into both overall AI system behavior and specific input-output relationships.
Why This Matters
For business leaders, this unified framework means gaining more comprehensive and reliable insights into how your AI systems function. Instead of fragmented views, leaders get a holistic understanding of what drives AI decisions. This is essential for building trust, ensuring regulatory compliance, and effectively identifying and addressing issues like bias or errors, whether they stem from data, inputs, or the model’s structure. Ultimately, the unified attribution framework proposed in this research supports more informed model management and governance, which can translate into cost savings and greater value for the organization.
References
[1] Shichang Zhang et al., “Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability,” arXiv preprint arXiv:2501.18887v3 (May 29, 2025): 1.
[2] Zhang et al., “Towards Unified Attribution,” 8.
[3] Zhang et al., “Towards Unified Attribution,” 8.
Meet the Authors

Shichang Zhang is a postdoctoral fellow at the D^3 Institute at Harvard University working with Professor Hima Lakkaraju. He received his Ph.D. in Computer Science from the University of California, Los Angeles (UCLA).

Tessa Han is a PhD student in the Bioinformatics and Integrative Genomics Program at Harvard Medical School.

Usha Bhalla is a PhD student in the Harvard Computer Science program, working on machine learning interpretability and advised by Hima Lakkaraju. She is a strong advocate for increasing diversity in CS through direct mentorship of early-career minority students.

Himabindu Lakkaraju is an Assistant Professor of Business Administration at Harvard Business School and PI in D^3’s Trustworthy AI Lab. She is also a faculty affiliate in the Department of Computer Science at Harvard University, the Harvard Data Science Initiative, the Center for Research on Computation and Society, and the Laboratory of Innovation Science at Harvard. Professor Lakkaraju’s research focuses on the algorithmic, practical, and ethical implications of deploying AI models in domains involving high-stakes decisions such as healthcare, business, and policy.