
Data Contracts: Data Quality for AI

The following insights come from a recent Insights from the Field event featuring Chad Sanderson on using data contracts to achieve high-quality data for Artificial Intelligence (AI).

Overview

Chad began the session by challenging some hyperbolic claims about Artificial Intelligence (AI), including the idea that AI models could replace analysts, data scientists, and even product teams. Human input remains critical to using AI successfully. Chad then emphasized the importance of high-quality data to the success of AI systems, stating that the quality of the data is as crucial as the AI model itself. He pointed to several problems that arise when AI models lack high-quality data, including incorrect predictions, model outages, and “hallucinations”.

During the talk, Chad explored the complexity of data ecosystems, presenting a data lineage graph to illustrate how hard it is to understand where data comes from and how it is transformed. Replication of data across microservices and constant changes to the data make it difficult to maintain high quality across the board. Chad highlighted four primary reasons why high data quality is hard to achieve:

  1. Lack of ownership by data producers
  2. Limited awareness of where data is used and its significance
  3. Absence of effective change management processes
  4. Lack of agreement on semantic truth, leading to “garbage in, garbage out” scenarios

Conventional solutions such as data catalogs and monitoring are often put in place to address these challenges. Such solutions, however, are inherently reactive. Preventive mechanisms that keep low-quality data from entering an AI system in the first place should be prioritized instead.

Introduction to Data Contracts

Data contracts are a preventive and scalable solution for ensuring data quality for AI systems. A data contract is analogous to an API: it represents the state of the data that consumers need, including schema, semantics/business logic, and service level agreements (SLAs). The goal of data contracts is not to prevent change but to establish a feedback loop that makes data producers aware of how their changes affect data consumers and helps consumers anticipate and prepare for upcoming changes.
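To make the three parts of a contract concrete, here is a minimal sketch of what a contract definition might look like when written as JSON and loaded in Python. The dataset name, owner, field names, and SLA key are all hypothetical and chosen only for illustration; the talk did not prescribe a specific layout.

```python
import json

# Hypothetical contract for an illustrative "orders" dataset.
# The keys below (schema, semantics, sla) mirror the three parts named
# in the text; their exact layout is an assumption, not a standard.
CONTRACT_JSON = """
{
  "dataset": "orders",
  "owner": "checkout-team",
  "schema": {
    "order_id": {"type": "string", "nullable": false},
    "amount_usd": {"type": "number", "nullable": false},
    "coupon_code": {"type": "string", "nullable": true}
  },
  "semantics": {
    "amount_usd": "Order total in US dollars after discounts; never negative."
  },
  "sla": {"max_staleness_hours": 24}
}
"""

contract = json.loads(CONTRACT_JSON)


def required_fields(contract: dict) -> list[str]:
    """Return the fields consumers rely on being present and non-null."""
    return sorted(
        name
        for name, spec in contract["schema"].items()
        if not spec["nullable"]
    )


print(required_fields(contract))
```

Keeping the definition in a plain data format means producers and consumers can both read it and review changes to it like any other file under version control.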

Data contracts consist of two main components: the definition (or spec) and the enforcement mechanism. Neither can function effectively without the other. The ideal workflow involves defining the data contract in advance in a human-readable format such as YAML or JSON, and enforcing it during integration and delivery of the data. Enforcement means comparing the new version of the data to the contract and responding according to severity: breaking the build for critical issues, or issuing informational alerts for potential impacts.
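The enforcement step can be sketched as a small check that a build pipeline might run. This is an illustrative assumption about how such a check could work, not a description of any particular tool: the schemas, the `Finding` type, and the rule that removed or retyped fields are critical while new fields are informational are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    severity: str  # "critical" or "info"
    message: str


def check_schema(contract_schema: dict[str, str],
                 proposed_schema: dict[str, str]) -> list[Finding]:
    """Compare a proposed schema (field name -> type) against the contract."""
    findings = []
    for name, expected_type in contract_schema.items():
        if name not in proposed_schema:
            findings.append(Finding("critical", f"field '{name}' was removed"))
        elif proposed_schema[name] != expected_type:
            findings.append(Finding(
                "critical",
                f"field '{name}' changed type: "
                f"{expected_type} -> {proposed_schema[name]}",
            ))
    for name in proposed_schema:
        if name not in contract_schema:
            findings.append(Finding(
                "info", f"new field '{name}' is not covered by the contract"
            ))
    return findings


def should_break_build(findings: list[Finding]) -> bool:
    """Only critical findings fail the build; info findings are alerts."""
    return any(f.severity == "critical" for f in findings)


# Hypothetical example: a producer changes one type and adds one field.
CONTRACT_SCHEMA = {"order_id": "string", "amount_usd": "number"}
PROPOSED_SCHEMA = {"order_id": "string", "amount_usd": "string", "channel": "string"}

findings = check_schema(CONTRACT_SCHEMA, PROPOSED_SCHEMA)
for f in findings:
    print(f"[{f.severity}] {f.message}")
print("break build:", should_break_build(findings))
```

Separating the findings from the build decision keeps the feedback loop intact: the producer sees every potential impact, but only contract violations block the change.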

Data contracts create incentives for data producers to care about data quality, fostering a culture of data ownership and management that leads to more explainable and trustworthy data. They also establish visibility into data usage and impact, which in turn bridges communication gaps between data producers and consumers. When implementing data contracts, a vertical approach is preferred over a horizontal one. A sound process is to start with valuable data products (e.g., machine learning models), identify their upstream sources, and then create contracts that align with each data product’s expectations. A vertical approach also has implications for an organization’s overall culture around data quality. Rather than trying to change the organizational culture first, introducing data contracts and their associated tools in small increments is often more successful. Once it becomes easier for teams to do the right thing, cultural change follows as a natural progression.
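The vertical approach can be illustrated with a short traversal over a lineage graph: start from one data product and walk upstream to find the assets whose contracts would matter to it. The lineage below is entirely hypothetical, and the breadth-first walk is just one simple way to do this under the assumption that lineage is available as a mapping from each asset to the assets it reads from.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it reads from.
LINEAGE = {
    "churn_model": ["customer_features"],
    "customer_features": ["orders_clean", "support_tickets"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "support_tickets": [],
    "marketing_dashboard": ["web_events"],
    "web_events": [],
}


def upstream_assets(lineage: dict[str, list[str]], product: str) -> list[str]:
    """Breadth-first walk from a data product to everything it depends on."""
    seen = set()
    order = []
    queue = deque(lineage.get(product, []))
    while queue:
        asset = queue.popleft()
        if asset in seen:
            continue
        seen.add(asset)
        order.append(asset)
        queue.extend(lineage.get(asset, []))
    return order


def source_assets(lineage: dict[str, list[str]], product: str) -> list[str]:
    """Upstream assets with no further inputs: the original producers."""
    return [a for a in upstream_assets(lineage, product) if not lineage.get(a)]


print(upstream_assets(LINEAGE, "churn_model"))
print(source_assets(LINEAGE, "churn_model"))
```

Note that `web_events` never appears in the output for `churn_model`. Scoping the work to one product's upstream chain, rather than contracting every dataset at once, is what keeps the rollout small.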

In summary, data contracts offer businesses and organizations many benefits. They make it easier to advocate for data products and provide visibility into data usage, benefiting data engineers, producers, and consumers alike.


Disclaimer

The “Insights from the Field” initiative is a platform for guest contributors – who are industry leaders, subject-matter experts, and academics – to share their expert opinions and valuable perspectives on topics related to the fields of Business, Artificial Intelligence (AI), and Machine Learning (ML). Our guest contributors bring a wealth of knowledge and experience in their respective fields, and we believe that their insights can significantly enrich our community’s understanding of the dynamic and intertwined spaces of business, technology, and society.

It’s important to note, however, that the Digital, Data, and Design (D^3) Institute does not explicitly endorse opinions expressed by our guest contributors. With this initiative, we hope to facilitate the exchange of diverse perspectives and encourage critical thinking, with an overarching goal of fostering meaningful and informed discussions on topics we consider important to our community.
