Navigating the challenges of artificial intelligence

We’ve all observed the complexities that AI is introducing into our modern world. You might wonder what role Software Heritage plays in this. Our original motivation was to create a long-term archive, a “telescope” for exploring source code, not necessarily to aid AI. But it has become evident that Software Heritage can play a crucial role in building better AI. That’s why the French government has funded CodeCommons: to build a common reference platform for large-scale source code analysis, for AI and beyond.
Here’s a brief recap of how we got to this point. When we started nearly 10 years ago, we didn’t have generative AI in mind, since it wasn’t common back then. However, about a year and a half ago, we were approached by researchers interested in building models using our archive. This raised the question: should we say yes or no? After extensive discussions, we published principles to guide our collaborations: open models (at least open weights), full transparency on training data, and an opt-out mechanism for developers.

These principles proved viable. In February 2024, the BigCode project, working with Hugging Face, built StarCoder2, a leading coding model, using a subset of Software Heritage data with an opt-out mechanism.
However, we encountered several challenges in the AI development process. First, the same data is downloaded redundantly, primarily from GitHub. Second, data cleaning and preprocessing are performed inconsistently. Third, training datasets lack transparency. Fourth, extracting subsets for specific functions is difficult. Finally, there is a lack of attribution.
CodeCommons aims to address these issues, making source code and metadata available in a single, accessible location. It will implement standardized data pipelines for cleaning and preprocessing, provide traceability through identifiers, and incorporate ethical considerations, such as attribution and similarity checks.
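To make the idea of traceability through identifiers concrete, here is a minimal sketch. It assumes the identifiers in question are Software Heritage persistent identifiers (SWHIDs), which the talk does not spell out. A SWHID for a file’s content is intrinsic: it is derived from the bytes themselves using the same hashing scheme as Git blobs. The helper names `content_swhid` and `build_manifest` are hypothetical, chosen only for this illustration, and are not part of any CodeCommons API.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return a SWHID-style identifier for a file's raw bytes.

    Content identifiers reuse Git's blob hashing: SHA-1 over the
    header b"blob <length>\\0" followed by the bytes themselves.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Map each path in a (hypothetical) training subset to its identifier."""
    return {path: content_swhid(data) for path, data in sorted(files.items())}


if __name__ == "__main__":
    # Illustrative in-memory files; a real dataset would be read from storage.
    subset = {
        "src/main.py": b"print('hello')\n",
        "vendored/main.py": b"print('hello')\n",  # same bytes, different path
        "README": b"A sample project\n",
    }
    for path, swhid in build_manifest(subset).items():
        print(path, swhid)
```

Because the identifier depends only on the bytes, two copies of the same file get the same identifier wherever they were downloaded from. A manifest like this would let a dataset publisher state exactly which contents went into training, and let anyone else check whether a given file was included.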
Technically, the project builds upon Software Heritage, partnering with GENCI for scalable infrastructure. Metadata enrichment, including issues, pull requests, and research article connections, will yield an attribution graph and a unified data model, in turn streamlining dataset selection and similarity checks.
Our vision is to provide a platform that empowers model builders and users, fostering transparency and trust. The recent CodeCommons project kickoff, with participation from numerous research communities, underscores the project’s potential to address critical societal challenges.
Software Heritage’s engagement with AI is not a deviation from our core mission. Rather, it represents an adaptation to the evolving technological landscape. Our commitment to preserving and providing access to our shared software heritage remains paramount, as it is foundational to responsible technological advancement.
This post is adapted from a talk given by Software Heritage Co-Founder Roberto Di Cosmo at the latest Software Heritage Symposium. You can catch the 13-minute video on YouTube.