Croissant: A Game-Changing Metadata Format for Machine Learning Datasets

Data is a vital input for machine learning (ML), but managing it remains a significant challenge. Croissant, a new metadata format, aims to change how datasets are discovered, shared, and used across ML tools and platforms. It was introduced by Mubashara Akhtar, a PhD student at King’s College London; Satyapriya Krishna, a PhD student at the Harvard School of Engineering and Applied Sciences (SEAS) and the Trustworthy AI Lab at the Digital Data Design (D^3) Institute at Harvard Business School; and 29 other researchers (see the Meet the Authors section for details). The team’s research, “Croissant: A Metadata Format for ML-Ready Datasets,” describes how the format addresses key friction points in ML data management, potentially accelerating progress in the field and making advanced ML applications more accessible to businesses of all sizes.

Key Insight: Standardizing Dataset Metadata

“Croissant makes datasets ‘ML-ready’ by recording ML-specific metadata that enables them to be loaded directly into ML frameworks and tools.” [1]

Croissant aims to create a unified language for describing ML datasets. This standardization allows datasets to be easily shared and used across different ML platforms and tools. By providing a consistent format for metadata, Croissant enables researchers and developers to quickly understand and use new datasets, potentially saving hours of data preparation time. Major repositories like Hugging Face Datasets, Kaggle Datasets, and OpenML have integrated Croissant, making it immediately useful to a wide range of ML practitioners.
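
To make the idea of a shared metadata format more concrete, here is a minimal sketch of what a Croissant-style description might look like, built as a Python dictionary and serialized as JSON. The dataset name, URLs, and field names are hypothetical placeholders. The structure is deliberately simplified: it follows the general shape of the format (dataset-level information, file resources, and record sets with typed fields) but omits the JSON-LD context and some properties that the actual specification requires.

```python
import json

# Hypothetical dataset: the name, URLs, and fields are placeholders.
# Simplified on purpose: the JSON-LD "@context" and some properties
# required by the real specification are omitted.
metadata = {
    "@type": "sc:Dataset",
    "name": "example_reviews",
    "description": "Product reviews with sentiment labels.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/datasets/example_reviews",
    # Resources: the files that hold the data.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/datasets/example_reviews/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Record sets: how the file contents map to typed fields.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "reviews/label",
                    "dataType": "sc:Integer",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "label"},
                    },
                },
            ],
        }
    ],
}


def field_ids(meta):
    """List the identifiers of all fields across all record sets."""
    return [
        field["@id"]
        for record_set in meta.get("recordSet", [])
        for field in record_set.get("field", [])
    ]


print(json.dumps(metadata, indent=2))
print(field_ids(metadata))
```

Because the file location, file format, and column-to-field mapping are all declared in one place, a loader can in principle read the data without any dataset-specific code.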

Key Insight: Enhancing Dataset Discoverability and Portability

“Croissant improves the discoverability, portability, and interoperability of ML datasets across data repositories, ML tools, frameworks, and platforms.” [2]

One of the key challenges in ML is finding and using appropriate datasets for specific tasks. Croissant addresses this by making datasets more discoverable and portable. Its standardized format allows for better indexing and searching of datasets, making it easier for researchers and businesses to find the right data for their projects. To test the format in practice, the authors conducted a user study in which nine expert ML practitioners annotated ten widely used ML datasets using Croissant, demonstrating its applicability across various types of datasets.
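
As a rough illustration of why consistent metadata helps with discovery, the sketch below searches a small, made-up catalog of Croissant-style records by keyword. The catalog entries and the `search` helper are hypothetical; real repositories and search engines index far richer metadata than this.

```python
# Hypothetical catalog: every entry uses the same keys, so a single
# search function works regardless of where a dataset is hosted.
catalog = [
    {
        "name": "example_reviews",
        "description": "Product reviews with sentiment labels.",
        "keywords": ["text", "classification"],
    },
    {
        "name": "example_street_scenes",
        "description": "Street images with bounding boxes.",
        "keywords": ["images", "object detection"],
    },
    {
        "name": "example_speech_clips",
        "description": "Short audio clips with transcripts.",
        "keywords": ["audio", "speech recognition"],
    },
]


def search(entries, term):
    """Return names of datasets whose metadata mentions the term (case-insensitive)."""
    term = term.lower()
    hits = []
    for entry in entries:
        searchable = " ".join(
            [entry["name"], entry["description"], *entry.get("keywords", [])]
        ).lower()
        if term in searchable:
            hits.append(entry["name"])
    return hits


print(search(catalog, "sentiment"))
print(search(catalog, "audio"))
```

The search logic itself is trivial. What makes it work is that every record exposes the same fields, which is the property a shared metadata format is meant to guarantee.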

Key Insight: Promoting Responsible AI (RAI) Practices

“Croissant-RAI is an extension of the Croissant format that builds on existing responsible AI (RAI) dataset documentation approaches, such as Data Cards and Datasheets for Datasets, making it easier to publish, discover, and reuse RAI metadata.” [3]

In an era where AI ethics and responsibility are increasingly important, Croissant incorporates features to support RAI practices. The Croissant-RAI extension allows for the documentation of important ethical considerations, such as data collection methods, potential biases, and intended use cases.
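
The sketch below shows one way such documentation could be checked for completeness. It assumes the property names `rai:dataCollection`, `rai:dataBiases`, and `rai:dataUseCases`, which correspond to the three considerations named above; treat them as assumptions to verify against the Croissant-RAI specification. The dataset record and the `missing_rai_fields` helper are hypothetical.

```python
# Property names are assumed from the Croissant-RAI vocabulary;
# verify them against the specification before relying on them.
RAI_FIELDS = ("rai:dataCollection", "rai:dataBiases", "rai:dataUseCases")

# Hypothetical record with two of the three RAI properties filled in.
rai_metadata = {
    "name": "example_reviews",
    "rai:dataCollection": "Reviews gathered from a public product site.",
    "rai:dataBiases": "Reviews skew toward English-speaking customers.",
}


def missing_rai_fields(meta, expected=RAI_FIELDS):
    """Return the expected RAI properties that are absent or empty."""
    return [key for key in expected if not meta.get(key)]


print(missing_rai_fields(rai_metadata))
```

A check like this could run before a dataset is published, flagging gaps in the ethical documentation in the same way a linter flags missing fields in code.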

Key Insight: User-Friendly Tools for Adoption

“We developed the Croissant Editor, (also on GitHub), a tool that lets users visually create and modify Croissant datasets.” [4]

To facilitate widespread adoption, the Croissant team developed user-friendly tools such as the Croissant Editor. The editor provides a visual interface for creating and modifying Croissant metadata, making the format accessible even to those without deep technical knowledge. In the user study, the majority of participants took 15 to 30 minutes to create a Croissant description of a dataset, which suggests the format is easy to use.

Why This Matters

For business professionals and executives, Croissant represents a significant advancement in ML data management. By standardizing dataset metadata and improving discoverability, Croissant can potentially reduce the time and resources required for ML projects. Moreover, the emphasis on responsible AI practices aligns with growing regulatory and ethical concerns, helping businesses navigate the complex landscape of AI governance. As ML plays an increasingly crucial role in business operations and decision-making, tools like Croissant that streamline data management address critical challenges in the ML ecosystem, potentially accelerating research and development while fostering more ethical and efficient use of data.

References

[1] Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Luca Foschini, Pieter Gijsbers, Joan Giner-Miguelez, Sujata Goswami, et al. “Croissant: A Metadata Format for ML-Ready Datasets”, arXiv preprint arXiv:2403.19546v3 (December 9, 2024): 1-26, 1.

[2] Mubashara Akhtar et al. “Croissant: A Metadata Format for ML-Ready Datasets”, arXiv, 10.

[3] Mubashara Akhtar et al. “Croissant: A Metadata Format for ML-Ready Datasets”, arXiv, 5.

[4] Mubashara Akhtar et al. “Croissant: A Metadata Format for ML-Ready Datasets”, arXiv, 6.

Meet the Authors

* Core contributors
