Publications

Olivier Barais, Roberto Di Cosmo, Ludovic Mé, Stefano Zacchiroli, Olivier Zendra

Software Identification for Cybersecurity: Survey and Recommendations for Regulators Unpublished

2025, (working paper or preprint).

Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli

Source Code Archiving to the Rescue of Reproducible Deployment Proceedings Article

In: 2024 ACM Conference on Reproducibility and Replicability, 10 pages, ACM, 2024.

Tommaso Fontana, Sebastiano Vigna, Stefano Zacchiroli

WebGraph: The Next Generation (Is in Rust) Proceedings Article

In: Companion Proceedings of the ACM Web Conference 2024 (WWW '24 Companion), pp. 686-689, ACM, 2024.

Annalí Casanueva, Davide Rossi, Stefano Zacchiroli, Théo Zimmermann

The Impact of the COVID-19 Pandemic on Women’s Contribution to Public Code Journal Article

In: Empirical Software Engineering, 2024.

Mathilde Fichen, Morane Gruenpeter, Jérémy Bobbio, Sabrina Granger, Roberto Di Cosmo, Jean-François Abramatic, Isabelle Astic, Emmanuelle Bermès, Camille Françoise, Claude Gomez, Wendy Hagenmaier, Grégory Miura, Carlo Montangero, Simon Phipps, Kenneth Seals-Nutt

SWHAP Workshop, September 14th and 15th, 2023 Proceedings

HAL, 2023.

Valentin Lorentz, Roberto Di Cosmo, Stefano Zacchiroli

The Popular Content Filenames Dataset: Deriving Most Likely Filenames from the Software Heritage Archive Unpublished

2023, (working paper or preprint).

Romain Lefeuvre, Jessie Galasso, Benoit Combemale, Houari Sahraoui, Stefano Zacchiroli

Fingerprinting and Building Large Reproducible Datasets Proceedings Article

In: 2023 ACM Conference on Reproducibility and Replicability, pp. 27-36, ACM, 2023.

Jesus M. Gonzalez-Barahona, Sergio Montes-Leon, Gregorio Robles, Stefano Zacchiroli

The Software Heritage License Dataset (2022 Edition) Journal Article

In: Empirical Software Engineering, 2023, ISSN: 1382-3256.

@article{emse-2023-swh-license-dataset,

title = {The Software Heritage License Dataset (2022 Edition)},

author = {Jesus M. Gonzalez-Barahona and Sergio Montes-Leon and Gregorio Robles and Stefano Zacchiroli},

doi = {10.1007/s10664-023-10377-w},

issn = {1382-3256},

year  = {2023},

date = {2023-01-01},

journal = {Empirical Software Engineering},

publisher = {Springer},

abstract = {Context: When software is released publicly, it is common to include with it either the full text of the license or licenses under which it is published, or a detailed reference to them. Therefore public licenses, including FOSS (free, open source software) licenses, are usually publicly available in source code repositories. Objective: To compile a dataset containing as many documents as possible that contain the text of software licenses, or references to the license terms. Once compiled, characterize the dataset so that it can be used for further research, or practical purposes related to license analysis. Method: Retrieve from Software Heritage—the largest publicly available archive of FOSS source code—all versions of all files whose names are commonly used to convey licensing terms. All retrieved documents will be characterized in various ways, using automated and manual analyses. Results: The dataset consists of 6.9 million unique license files. Additional metadata about shipped license files is also provided, making the dataset ready to use in various contexts, including: file length measures, MIME type, SPDX license (detected using ScanCode), and oldest appearance. The results of a manual analysis of 8102 documents is also included, providing a ground truth for further analysis. The dataset is released as open data as an archive file containing all deduplicated license files, plus several portable CSV files with metadata, referencing files via cryptographic checksums. Conclusions: Thanks to the extensive coverage of Software Heritage, the dataset presented in this paper covers a very large fraction of all software licenses for public code. We have assembled a large body of software licenses, characterized it quantitatively and qualitatively, and validated that it is mostly composed of licensing information and includes almost all known license texts. 
The dataset can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. It can also be used in practice to improve tools detecting licenses in source code.},

keywords = {},

pubstate = {published},

tppubtype = {article}

}

Roberto Di Cosmo, Stefano Zacchiroli

The Software Heritage Open Science Ecosystem Book Chapter

In: Mens, Tom; Roover, Coen De; Cleve, Anthony (Ed.): Software Ecosystems: Tooling and Analytics, pp. 33–61, Springer International Publishing, Cham, 2023, ISBN: 978-3-031-36060-2.

Roberto Di Cosmo

Code Source Book Section

In: Dictionnaire du Numérique, vol. February, 2022.

Kevin Wellenzohn, Michael H. Böhlen, Sven Helmer, Antoine Pietri, Stefano Zacchiroli

Robust and Scalable Content-and-Structure Indexing Journal Article

In: The VLDB Journal, 2022, ISSN: 1066-8888.

Davide Rossi, Stefano Zacchiroli

Worldwide Gender Differences in Public Code Contributions (and How They Have Been Affected by the COVID-19 Pandemic) Proceedings Article

In: 44th International Conference on Software Engineering (ICSE 2022) – Software Engineering in Society (SEIS) Track, pp. 172-183, ACM, 2022.

Stefano Zacchiroli

A Large-scale Dataset of (Open Source) License Text Variants Proceedings Article

In: The 2022 Mining Software Repositories Conference (MSR 2022), pp. 757-761, ACM, 2022.

Davide Rossi, Stefano Zacchiroli

Geographic Diversity in Public Code Contributions: An Exploratory Large-Scale Study Over 50 Years Proceedings Article

In: The 2022 Mining Software Repositories Conference (MSR 2022), pp. 80-85, ACM, 2022.

Daniele Serafini, Stefano Zacchiroli

Efficient Prior Publication Identification for Open Source Code Proceedings Article

In: 18th International Conference on Open Source Systems (OSS 2022), ACM, 2022.

@inproceedings{oss-2022-swh-scanner,

title = {Efficient Prior Publication Identification for Open Source Code},

author = {Daniele Serafini and Stefano Zacchiroli},

url = {https://www.softwareheritage.org/wp-content/uploads/2022/12/oss-2022-swh-scanner.pdf},

doi = {10.1145/3555051.3555068},

year  = {2022},

date = {2022-01-01},

urldate = {2022-01-01},

booktitle = {18th International Conference on Open Source Systems (OSS 2022)},

publisher = {ACM},

abstract = {Free/Open Source Software (FOSS) enables large-scale reuse of preexisting software components. The main drawback is increased complexity in software supply chain management. A common approach to tame such complexity is automated open source compliance, which consists in automating the verification of adherence to various open source management best practices about license obligation fulfillment, vulnerability tracking, software composition analysis, and nearby concerns. We consider the problem of auditing a source code base to determine which of its parts have been published before, which is an important building block of automated open source compliance toolchains. Indeed, if source code allegedly developed in house is recognized as having been previously published elsewhere, alerts should be raised to investigate where it comes from and whether this entails that additional obligations shall be fulfilled before product shipment. We propose an efficient approach for prior publication identification that relies on a knowledge base of known source code artifacts linked together in a global Merkle direct acyclic graph and a dedicated discovery protocol. We introduce swh-scanner, a source code scanner that realizes the proposed approach in practice using as knowledge base Software Heritage, the largest public archive of source code artifacts. We validate experimentally the proposed approach, showing its efficiency in both abstract (number of queries) and concrete terms (wall-clock time), performing benchmarks on 16'845 real-world public code bases of various sizes, from small to very large.},

keywords = {},

pubstate = {published},

tppubtype = {inproceedings}

}

Roberto Di Cosmo

Should We Preserve the World's Software History, And Can We? Proceedings Article

In: Silvello, Gianmaria; Corcho, Óscar; Manghi, Paolo; Nunzio, Giorgio Maria Di; Golub, Koraljka; Ferro, Nicola; Poggi, Antonella (Ed.): Linking Theory and Practice of Digital Libraries – 26th International Conference on Theory and Practice of Digital Libraries, TPDL 2022, Padua, Italy, September 20-23, 2022, Proceedings, pp. 3–7, Springer, 2022.

Roberto Di Cosmo

Building the software pillar of Open Science Proceedings Article

In: Open Science European Conference (OSEC 2022), pp. 183–193, OpenEdition Press, 2022, ISBN: 9791036545627.

Antoine Pietri

Organizing the graph of public software development for large-scale mining PhD Thesis

Université Paris Cité, 2021.

Morane Gruenpeter, Roberto Di Cosmo, Katherine Thornton, Kenneth Seals-Nutt, Carlo Montangero, Guido Scatena

Software Stories for landmark legacy code Technical Report

Inria, 2021.

Laura Bussi, Roberto Di Cosmo, Carlo Montangero, Guido Scatena

Preserving landmark legacy software with the Software Heritage Acquisition Process Proceedings Article

In: iPres2021 – 17th International Conference on Digital Preservation, Beijing, China, 2021.

Stefano Zacchiroli

Gender Differences in Public Code Contributions: a 50-year Perspective Journal Article

In: IEEE Software, 2021, ISSN: 0740-7459.

Thibault Allançon, Antoine Pietri, Stefano Zacchiroli

The Software Heritage Filesystem (SwhFS): Integrating Source Code Archival with Development Proceedings Article

In: ICSE 2021: The 43rd International Conference on Software Engineering, pp. 45-48, IEEE, 2021.

Morane Gruenpeter, Roberto Di Cosmo, Alice Allen, Anita Bandrowski, Peter Chan, Martin Fenner, Leyla Garcia, Catherine M Jones, Daniel S Katz, John Kunze, Moritz Schubotz, Ilian T Todorov

Use cases and identifier schemes for persistent software source code identification Technical Report

2020, (Output from the Research Data Alliance/FORCE11 Software Source Code Identification Working group).

Morane Gruenpeter, Roberto Di Cosmo, Hylke Koers, Patricia Herterich, Rob Hooft, Jessica Parland-von Essen, Jonas Tana, Tero Aalto, Sarah Jones

M2.15 Assessment report on 'FAIRness of software' Miscellaneous

2020.

Roberto Di Cosmo

Archiving and Referencing Source Code with Software Heritage Proceedings Article

In: ICMS, pp. 362–373, Springer, 2020, ISBN: 978-3-030-52200-1.

Guillaume Rousseau, Roberto Di Cosmo, Stefano Zacchiroli

Software provenance tracking at the scale of public source code Journal Article

In: Empirical Software Engineering, pp. 1-30, 2020, ISSN: 1573-7616.

Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli

Determining the Intrinsic Structure of Public Software Development History Proceedings Article

In: MSR 2020: The 17th International Conference on Mining Software Repositories, pp. 602-605, IEEE, 2020.

Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli

Forking Without Clicking: on How to Identify Software Repository Forks Proceedings Article

In: MSR 2020: The 17th International Conference on Mining Software Repositories, pp. 277-287, IEEE, 2020.

@inproceedings{msr-2020-forks,

title = {Forking Without Clicking: on How to Identify Software Repository Forks},

author = {Antoine Pietri and Guillaume Rousseau and Stefano Zacchiroli},

url = {https://arxiv.org/abs/2011.07821 

https://www.softwareheritage.org/wp-content/uploads/2021/03/msr-2020-forks.pdf},

doi = {10.1145/3379597.3387450},

year  = {2020},

date = {2020-05-01},

booktitle = {MSR 2020: The 17th International Conference on Mining Software Repositories},

pages = {277-287},

publisher = {IEEE},

abstract = {The notion of software "fork" has been shifting over time from the (negative) phenomenon of community disagreements that result in the creation of separate development lines and ultimately software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single product without stepping on each others toes. In both cases the VCS repositories participating in a fork share parts of a common development history. Studies of software forks generally rely on hosting platform metadata, such as GitHub, as the source of truth for what constitutes a fork. These “forge forks” however can only identify as forks repositories that have been created on the platform, e.g., by clicking a "fork" button on the platform user interface. The increased diversity in code hosting platforms (e.g., GitLab) and the habits of significant development communities (e.g., the Linux kernel, which is not primarily hosted on any single platform) call into question the reliability of trusting code hosting platforms to identify forks. Doing so might introduce selection and methodological biases in empirical studies. In this article we explore various definitions of "software forks", trying to capture forking workflows that exist in the real world. We quantify the differences in how many repositories would be identified as forks on GitHub according to the various definitions, confirming that a significant number could be overlooked by only considering forge forks. We study the structure and size of fork networks, observing how they are affected by the proposed definitions and discuss the potential impact on empirical research.},

keywords = {},

pubstate = {published},

tppubtype = {inproceedings}

}

Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli

The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History Proceedings Article

In: MSR 2020: The 17th International Conference on Mining Software Repositories, pp. 1-5, IEEE, 2020.

Roberto Di Cosmo, Marco Danelutto

[Rp] Reproducing and replicating the OCamlP3l experiment Journal Article

In: ReScience C, vol. 6, no. 1, 2020.

Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli

Ultra-Large-Scale Repository Analysis via Graph Compression Proceedings Article

In: SANER 2020: The 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, pp. 184-194, IEEE, 2020.

Pierre Alliez, Roberto Di Cosmo, Benjamin Guedj, Alain Girault, Mohand-Said Hacid, Arnaud Legrand, Nicolas Rougier

Attributing and Referencing (Research) Software: Best Practices and Outlook From Inria Journal Article

In: Computing in Science & Engineering, vol. 22, no. 1, pp. 39-52, 2020, ISSN: 1558-366X.

Roberto Di Cosmo, Jose Benito Gonzalez Lopez, Jean-François Abramatic, Kay Graf, Miguel Colom, Paolo Manghi, Melissa Harrison, Yannick Barborini, Ville Tenhunen, Michael Wagner, Wolfgang Dalitz, Jason Maassen, Carlos Martinez-Ortiz, Elisabetta Ronchieri, Sam Yates, Moritz Schubotz, Leonardo Candela, Martin Fenner, Eric Jeangirard

Scholarly Infrastructures for Research Software Book

European Commission. Directorate General for Research and Innovation., 2020, ISBN: 978-92-76-25568-0.

Roberto Di Cosmo

Announcing biblatex-software Journal Article

In: ACM SIGSOFT Software Engineering Notes, vol. 45, no. 4, pp. 22–23, 2020.

Roberto Di Cosmo, Morane Gruenpeter, Bruno Marmol, Alain Monteil, Laurent Romary, Jozefina Sadowska

Curated Archiving of Research Software Artifacts: Lessons Learned from the French Open Archive (HAL) Journal Article

In: International Journal of Digital Curation, vol. 15, no. 1, pp. 16, 2020.

Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli

Referencing Source Code Artifacts: a Separate Concern in Software Citation Journal Article

In: Computing in Science & Engineering, 2020, ISSN: 1521-9615.

Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli

The Software Heritage Graph Dataset: Public software development under one roof Proceedings Article

In: Proceedings of the 16th International Conference on Mining Software Repositories, pp. 138-142, IEEE Press, 2019.

@inproceedings{msr-2019-swh,

title = {The Software Heritage Graph Dataset: Public software development under one roof},

author = {Antoine Pietri and Diomidis Spinellis and Stefano Zacchiroli},

url = {https://www.softwareheritage.org/wp-content/uploads/2020/01/msr-2019-swh.pdf 

https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf},

doi = {10.1109/MSR.2019.00030},

year  = {2019},

date = {2019-05-27},

booktitle = {Proceedings of the 16th International Conference on Mining Software Repositories},

pages = {138-142},

publisher = {IEEE Press},

series = {MSR '19},

abstract = {Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. This paper introduces the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset's contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing. Source code file contents are cross-referenced at the graph leaves, and can be retrieved through individual requests using the Software Heritage archive API.},

keywords = {},

pubstate = {published},

tppubtype = {inproceedings}

}

Mélanie Clément-Fontaine, Roberto Di Cosmo, Bastien Guerry, Patrick Moreau, François Pellegrini

Encouraging a wider usage of software derived from research Online

2019, (Position paper of the software working group of the French National Council for Open Science).

Antoine Pietri, Stefano Zacchiroli

Towards Universal Software Evolution Analysis Proceedings Article

In: BENEVOL 2018: The 17th Belgium-Netherlands Software Evolution Workshop, pp. 6-10, 2018, ISSN: 1613-0073.

Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli

Building the Universal Archive of Source Code Journal Article

In: Communications of the ACM, vol. 61, no. 10, pp. 29-31, 2018, ISSN: 0001-0782.

Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli

Identifiers for Digital Objects: the Case of Software Source Code Preservation Proceedings Article

In: iPRES 2018 – 15th International Conference on Digital Preservation, 2018.

Yannick Barborini, Roberto Di Cosmo, Antoine R. Dumont, Morane Gruenpeter, Bruno P. Marmol, Alain Monteil, Jozefina Sadowska, Stefano Zacchiroli

The creation of a new type of scientific deposit: Software Miscellaneous

RDA Eleventh Plenary Meeting, Berlin, Germany, 2018, (poster).

Yannick Barborini, Roberto Di Cosmo, Antoine R. Dumont, Morane Gruenpeter, Bruno P. Marmol, Alain Monteil, Jozefina Sadowska, Stefano Zacchiroli

La création du nouveau type de dépôt scientifique – Le logiciel Miscellaneous

JSO 2018 – 7es journées Science Ouverte Couperin : 100 % open access : initiatives pour une transition réussie, 2018, (poster).

Roberto Di Cosmo, Stefano Zacchiroli

Software Heritage: Why and How to Preserve Software Source Code Proceedings Article

In: iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 2017.
