December 4, 2023

Hugging Face

Data is the powerhouse of modern machine learning and the foundation of every large language model. The Software Heritage archive allows us to collect vast amounts of code to train state-of-the-art large language models, especially for code. At the same time, their principles ensure that the resulting models benefit all of humanity: the models are released openly, the training data is made transparent, and data owners can opt out of the training.

— Leandro von Werra,
Machine Learning Engineer at
Hugging Face and co-lead of BigCode

Software Heritage

Follow us