Croissant: A Game-Changing Metadata Format for Machine Learning Datasets

Data is a vital input for machine learning (ML), but managing it remains a significant challenge. Croissant, a new metadata format, aims to change how datasets are discovered, shared, and used across ML tools and platforms. It was introduced by Mubashara Akhtar, a PhD student at King’s College London; Satyapriya Krishna, a PhD student at the Harvard School of Engineering and Applied Sciences (SEAS) and the Trustworthy AI Lab at the Harvard Business School AI Institute; and 29 other researchers (see the Meet the Authors section for details). The team’s paper, “Croissant: A Metadata Format for ML-Ready Datasets,” describes how the format addresses key friction points in ML data management, potentially accelerating progress in the field and making advanced ML applications more accessible to businesses of all sizes.

Key Insight: Standardizing Dataset Metadata

“Croissant makes datasets ‘ML-ready’ by recording ML-specific metadata that enables them to be loaded directly into ML frameworks and tools.”[1]

Croissant aims to create a unified language for describing ML datasets. This standardization allows datasets to be easily shared and used across different ML platforms and tools. By providing a consistent format for metadata, Croissant enables researchers and developers to quickly understand and use new datasets, potentially saving hours of data preparation time. Major repositories like Hugging Face Datasets, Kaggle Datasets, and OpenML have integrated Croissant, making it immediately useful to a wide range of ML practitioners.
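
To make the idea of a shared metadata description more concrete, here is a minimal sketch in Python of what a Croissant-style record might contain. It is a simplified illustration, not a complete or validated Croissant file: a real file is JSON-LD and also carries an "@context" block, and the dataset name, URLs, and column names below are hypothetical.

```python
import json

# Simplified, illustrative Croissant-style description of a hypothetical dataset.
# A real Croissant file is JSON-LD and also needs an "@context" block; the name,
# URLs, and columns here are invented for the example.
metadata = {
    "@type": "sc:Dataset",
    "name": "toy_reviews",
    "description": "Hypothetical product reviews with sentiment labels.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/toy_reviews",
    # Resources: the files that make up the dataset.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/toy_reviews/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Structure: how the raw files map onto records and typed fields.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {"@type": "cr:Field", "@id": "reviews/text", "dataType": "sc:Text"},
                {"@type": "cr:Field", "@id": "reviews/label", "dataType": "sc:Integer"},
            ],
        }
    ],
}


def field_names(meta):
    """Return the field identifiers declared by every record set."""
    return [field["@id"] for rs in meta["recordSet"] for field in rs["field"]]


# Serialize to JSON so the description can be published next to the data.
serialized = json.dumps(metadata, indent=2)
print(field_names(json.loads(serialized)))
```

Because the description is plain JSON, any tool that understands the format can read the same file to find where the data lives and how its columns are typed, without dataset-specific loading code.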

Key Insight: Enhancing Dataset Discoverability and Portability

“Croissant improves the discoverability, portability, and interoperability of ML datasets across data repositories, ML tools, frameworks, and platforms.” [2]

One of the key challenges in ML is finding and employing appropriate datasets for specific tasks. Croissant addresses this by making datasets more discoverable and portable. Its standardized format allows for better indexing and searching of datasets, making it easier for researchers and businesses to find the right data for their projects. The authors conducted a user study where nine expert ML practitioners annotated ten widely used ML datasets using Croissant, demonstrating its applicability across various types of datasets.
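
As a rough illustration of why consistent metadata helps with search, the sketch below filters a small in-memory catalog of Croissant-style records by keyword and license. The catalog entries and the `search` helper are hypothetical; real repositories use their own indexing infrastructure, and this only shows that uniform field names make such queries straightforward.

```python
# Hypothetical catalog: each entry uses the same field names, so one query
# function works for records from any repository.
catalog = [
    {
        "name": "toy_reviews",
        "description": "Product reviews with sentiment labels.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
    },
    {
        "name": "street_signs",
        "description": "Images of traffic signs for classification.",
        "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    },
    {
        "name": "clinic_notes",
        "description": "Synthetic clinical notes with sentiment annotations.",
        "license": "proprietary",
    },
]


def search(records, keyword, license_contains=None):
    """Return names of records whose name or description mentions the keyword."""
    keyword = keyword.lower()
    hits = []
    for meta in records:
        text = f"{meta.get('name', '')} {meta.get('description', '')}".lower()
        if keyword not in text:
            continue
        # Optionally keep only records whose license string matches.
        if license_contains and license_contains not in meta.get("license", ""):
            continue
        hits.append(meta["name"])
    return hits


print(search(catalog, "sentiment"))
print(search(catalog, "sentiment", license_contains="creativecommons.org"))
```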

Key Insight: Promoting Responsible AI (RAI) Practices

“Croissant-RAI is an extension of the Croissant format that builds on existing responsible AI (RAI) dataset documentation approaches, such as Data Cards and Datasheets for Datasets, making it easier to publish, discover, and reuse RAI metadata.” [3]

In an era where AI ethics and responsibility are increasingly important, Croissant incorporates features to support RAI practices. The Croissant-RAI extension allows for the documentation of important ethical considerations, such as data collection methods, potential biases, and intended use cases.
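
A small sketch of how such documentation could be checked automatically follows. The property names are assumed from the Croissant-RAI vocabulary and should be verified against the current specification; the `missing_rai_fields` helper and the sample record are hypothetical.

```python
# Assumed Croissant-RAI property names covering the three considerations
# mentioned above; check the specification before relying on them.
RAI_FIELDS = ("rai:dataCollection", "rai:dataBiases", "rai:dataUseCases")

# Hypothetical record with only part of its RAI documentation filled in.
record = {
    "name": "toy_reviews",
    "rai:dataCollection": "Scraped from public product pages in 2023.",
    "rai:dataBiases": "",
}


def missing_rai_fields(meta):
    """List the RAI properties that are absent or left empty."""
    return [key for key in RAI_FIELDS if not meta.get(key)]


print(missing_rai_fields(record))
```

A check like this could run before publication so that gaps in responsible AI documentation are flagged early.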

Key Insight: User-Friendly Tools for Adoption

“We developed the Croissant Editor (also on GitHub), a tool that lets users visually create and modify Croissant datasets.” [4]

To facilitate widespread adoption, the Croissant team developed user-friendly tools, such as the Croissant Editor, which provides a visual interface for creating and modifying Croissant metadata, making it accessible even to those without deep technical knowledge. In the user study, the majority of participants took 15-30 minutes to create a Croissant description of a dataset, indicating its ease of use.

Why This Matters

For business professionals and executives, Croissant represents a significant advancement in ML data management. By standardizing dataset metadata and improving discoverability, Croissant can potentially reduce the time and resources required for ML projects. Moreover, the emphasis on responsible AI practices aligns with growing regulatory and ethical concerns, helping businesses navigate the complex landscape of AI governance. As ML plays an increasingly crucial role in business operations and decision-making, tools like Croissant that streamline data management address critical challenges in the ML ecosystem, potentially accelerating research and development while fostering more ethical and efficient use of data.

References

[1] Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Luca Foschini, Pieter Gijsbers, Joan Giner-Miguelez, Sujata Goswami, et al. “Croissant: A Metadata Format for ML-Ready Datasets”, arXiv preprint arXiv:2403.19546v3 (December 9, 2024): 1-26, 1.

[2] Mubashara Akhtar et al. “Croissant: A Metadata Format for ML-Ready Datasets”, arXiv, 10.

[3] Mubashara Akhtar et al. “Croissant: A Metadata Format for ML-Ready Datasets”, arXiv, 5.

[4] Mubashara Akhtar et al. “Croissant: A Metadata Format for ML-Ready Datasets”, arXiv, 6.

Meet the Authors

* denotes a core contributor

*Mubashara Akhtar, PhD student at King’s College London
*Omar Benjelloun, Software Engineer at Google
*Costanza Conforti, Software Engineer at Google
*Luca Foschini, President and CEO, Sage Bionetworks
Pieter Gijsbers, ICT Developer, Eindhoven University of Technology
*Joan Giner-Miguelez, Researcher at Universitat Oberta de Catalunya and Barcelona Supercomputing Center (BSC)
Sujata Goswami, Software Engineer at Oak Ridge National Laboratory
*Nitisha Jain, Postdoctoral Research Associate at King’s College London
Michalis Karamousadakis, Software Engineer and Co-Founder of Plaixus Ltd
Satyapriya Krishna, PhD student at Harvard University
*Michael Kuchnik, Research Scientist at Meta
*Sylvain Lesage, Software Engineer at Hugging Face
*Quentin Lhoest, Open Source and Machine Learning Engineer at Hugging Face
*Pierre Marcenac, Senior Software Engineer at Google
Manil Maskey, Senior Research Scientist at NASA
Peter Mattson, Senior Staff Engineer at Google
*Luis Oala, Head of Machine Learning at Dotphoton
Hamidah Oderinwale, Fellow at McGill University
*Pierre Ruyssen, Software Engineer at Google
Tim Santos, Director of Product, AI Cloud Solutions at Graphcore
*Rajat Shinde, Computer Scientist at NASA IMPACT and University of Alabama in Huntsville
*Elena Simperl, Professor of Computer Science at King’s College London and Open Data Institute
Arjun Suresh, Co-Founder of GATE Overflow, India
*Goeff Thomas, Google and Kaggle
*Slava Tykhonov, Senior Information Scientist, Data Archiving and Networked Services (DANS) at the Royal Netherlands Academy of Arts and Sciences (KNAW)
*Joaquin Vanschoren, Associate Professor at Eindhoven University of Technology
Susheel Varma, Chief Data Officer, Sage Bionetworks
*Jos van der Velde, Researcher at Eindhoven University of Technology
Steffen Vogler, Principal Data Scientist at Bayer
Carole-Jean Wu, Research Scientist at Meta
Luyao Zhang, Assistant Professor of Economics at Duke Kunshan University