Acknowledging and Referencing Software Heritage
Below you will find a list of relevant scientific publications produced as part of our mission.
If your scientific work benefited from Software Heritage, we encourage you to acknowledge it in your publications. The preferred way of doing that is to: (1) add a footnote to the title page of your papers like this: “This work was made possible by Software Heritage, the great library of source code: https://www.softwareheritage.org”; and (2) cite at least one of the iPres 2017 and CACM 2018 papers (from the list below) in the References section of your scientific publications.
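As an illustration of the two steps above, here is a minimal LaTeX sketch. The citation keys `swh-ipres-2017` and `swh-cacm-2018` and the bibliography file name `references` are placeholders chosen for this example, not keys defined on this page; replace them with the keys of the iPres 2017 and CACM 2018 entries as they appear in your own .bib file.

```latex
\documentclass{article}

\begin{document}

% Step 1: acknowledgement as a footnote attached to the title.
\title{Your Paper Title\thanks{This work was made possible by
  Software Heritage, the great library of source code:
  \texttt{https://www.softwareheritage.org}}}
\author{A. Author}
\maketitle

% Step 2: cite at least one of the two reference papers.
% Both keys below are placeholders for entries in your own .bib file.
Our dataset was extracted from the Software Heritage
archive~\cite{swh-ipres-2017,swh-cacm-2018}.

\bibliographystyle{plain}
\bibliography{references} % assumes a file named references.bib
\end{document}
```

Some publisher classes handle title footnotes differently, so check your venue's template before relying on `\thanks`.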
Publication policy
We are committed to Open Access, and we strive to make available openly all publications funded by or for Software Heritage, if possible under a CC-BY-4.0 license. When needed, we make a copy of the (pre)publication available through links on this page.
Publications
2024
Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli
Source Code Archiving to the Rescue of Reproducible Deployment Proceedings Article
In: 2024 ACM Conference on Reproducibility and Replicability, pp. 10 pages, ACM, 2024.
@inproceedings{acm-rep-2024-guix-swh,
title = {Source Code Archiving to the Rescue of Reproducible Deployment},
author = {Ludovic Courtès and Timothy Sample and Simon Tournier and Stefano Zacchiroli},
doi = {10.1145/3641525.3663622},
year = {2024},
date = {2024-07-11},
urldate = {2024-01-01},
booktitle = {2024 ACM Conference on Reproducibility and Replicability},
pages = {10 pages},
publisher = {ACM},
abstract = {The ability to verify research results and to experiment with methodologies are core tenets of science. As research results are increasingly the outcome of computational processes, software plays a central role. GNU Guix is a software deployment tool that supports reproducible software deployment, making it a foundation for computational research workflows. To achieve reproducibility, we must first ensure the source code of software packages Guix deploys remains available. We describe our work connecting Guix with Software Heritage, the universal source code archive, making Guix the first free software distribution and tool backed by a stable archive. Our contribution is twofold: we explain the rationale and present the design and implementation we came up with; second, we report on the archival coverage for package source code with data collected over five years and discuss remaining challenges.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Tommaso Fontana, Sebastiano Vigna, Stefano Zacchiroli
WebGraph: The Next Generation (Is in Rust) Proceedings Article
In: Companion Proceedings of the ACM Web Conference 2024 (WWW '24 Companion), pp. 686-689, ACM, 2024.
@inproceedings{www-2024-webgraph-rs,
title = {WebGraph: The Next Generation (Is in Rust)},
author = {Tommaso Fontana and Sebastiano Vigna and Stefano Zacchiroli},
doi = {10.1145/3589335.3651581},
year = {2024},
date = {2024-05-13},
booktitle = {Companion Proceedings of the ACM Web Conference 2024 (WWW '24 Companion)},
pages = {686-689},
publisher = {ACM},
abstract = {We report the results of a yearlong effort to port the WebGraph framework from Java to Rust. For two decades WebGraph has been instrumental in the analysis and distribution of large graphs for the research community of TheWebConf, but the intrinsic limitations of the Java Virtual Machine had become a bottleneck for very large use cases, such as the Software Heritage Merkle graph with its half a trillion arcs. As part of this clean-slate implementation of WebGraph in Rust, we developed a few ancillary projects bringing to the Rust ecosystem some missing features of independent interest, such as easy, consistent and zero-cost memory mapping of data structures. WebGraph in Rust offers impressive performance improvements over the previous implementation, enabling open-source graph analytics on very large datasets on top of a modern system programming language.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
2023
Mathilde Fichen, Morane Gruenpeter, Jérémy Bobbio, Sabrina Granger, Roberto Di Cosmo, Jean-François Abramatic, Isabelle Astic, Emmanuelle Bermès, Camille Françoise, Claude Gomez, Wendy Hagenmaier, Grégory Miura, Carlo Montangero, Simon Phipps, Kenneth Seals-Nutt
SWHAP Workshop, September 14th and 15th, 2023 Proceedings
HAL, 2023.
@proceedings{Fichen2023,
title = {SWHAP Workshop, September 14th and 15th, 2023},
author = {Mathilde Fichen and Morane Gruenpeter and Jérémy Bobbio and Sabrina Granger and Roberto Di Cosmo and Jean-François Abramatic and Isabelle Astic and Emmanuelle Bermès and Camille Françoise and Claude Gomez and Wendy Hagenmaier and Grégory Miura and Carlo Montangero and Simon Phipps and Kenneth Seals-Nutt},
url = {https://hal.science/hal-04251779},
year = {2023},
date = {2023-10-25},
urldate = {2023-10-25},
abstract = {In October 2022, Software Heritage hosted its inaugural SWHAP Days, a two-day conference dedicated to software preservation. In 2023, the Software Heritage team decided to organize a two-day, hands-on workshop in a closed committee format, scheduled for September 14th and 15th, 2023, at the Inria Paris centre. The workshop aimed to bring together professionals from diverse backgrounds, including conservation and heritage experts, researchers, and engineers. The objective was to foster collaboration, leveraging their collective knowledge and expertise to generate tangible and valuable outcomes for the community. Two topics were selected for this workshop: (1) building a guidebook on legacy software preservation and (2) telling the stories of legacy software.},
howpublished = {HAL},
keywords = {},
pubstate = {published},
tppubtype = {proceedings}
}
Valentin Lorentz, Roberto Di Cosmo, Stefano Zacchiroli
The Popular Content Filenames Dataset: Deriving Most Likely Filenames from the Software Heritage Archive Unpublished
2023, (working paper or preprint).
@unpublished{lorentz:hal-04171177,
title = {The Popular Content Filenames Dataset: Deriving Most Likely Filenames from the Software Heritage Archive},
author = {Valentin Lorentz and Di Cosmo, Roberto and Stefano Zacchiroli},
url = {https://inria.hal.science/hal-04171177},
year = {2023},
date = {2023-07-01},
urldate = {2023-07-01},
abstract = {The Popular Content Filenames Dataset provides for each unique file content present in the Software Heritage Graph dataset its most popular filename. For the 2022-04-25 version, it contains over 12 billion entries and weighs 413 gigabytes. This dataset allows to easily select subsets of the file contents from the Software Heritage archive based on file name patterns, facilitating research tasks in areas like data compression and machine learning.},
note = {working paper or preprint},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
Romain Lefeuvre, Jessie Galasso, Benoit Combemale, Houari Sahraoui, Stefano Zacchiroli
Fingerprinting and Building Large Reproducible Datasets Proceedings Article
In: 2023 ACM Conference on Reproducibility and Replicability, pp. 27-36, ACM, 2023.
@inproceedings{acm-rep-2023-reproducible-datasets,
title = {Fingerprinting and Building Large Reproducible Datasets},
author = {Romain Lefeuvre and Jessie Galasso and Benoit Combemale and Houari Sahraoui and Stefano Zacchiroli},
doi = {10.1145/3589806.3600043},
year = {2023},
date = {2023-01-01},
booktitle = {2023 ACM Conference on Reproducibility and Replicability},
pages = {27-36},
publisher = {ACM},
abstract = {Obtaining a relevant dataset is central to conducting empirical studies in software engineering. However, in the context of mining software repositories, the lack of appropriate tooling for large scale mining tasks hinders the creation of new datasets. Moreover, limitations related to data sources that change over time (e.g., code bases) and the lack of documentation of extraction processes make it difficult to reproduce datasets over time. This threatens the quality and reproducibility of empirical studies. In this paper, we propose a tool-supported approach facilitating the creation of large tailored datasets while ensuring their reproducibility. We leveraged all the sources feeding the Software Heritage append-only archive which are accessible through a unified programming interface to outline a reproducible and generic extraction process. We propose a way to define a unique fingerprint to characterize a dataset which, when provided to the extraction process, ensures that the same dataset will be extracted. We demonstrate the feasibility of our approach by implementing a prototype. We show how it can help reduce the limitations researchers face when creating or reproducing datasets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Jesus M. Gonzalez-Barahona, Sergio Montes-Leon, Gregorio Robles, Stefano Zacchiroli
The Software Heritage License Dataset (2022 Edition) Journal Article
In: Empirical Software Engineering, 2023, ISSN: 1382-3256.
@article{emse-2023-swh-license-dataset,
title = {The Software Heritage License Dataset (2022 Edition)},
author = {Jesus M. Gonzalez-Barahona and Sergio Montes-Leon and Gregorio Robles and Stefano Zacchiroli},
doi = {10.1007/s10664-023-10377-w},
issn = {1382-3256},
year = {2023},
date = {2023-01-01},
journal = {Empirical Software Engineering},
publisher = {Springer},
abstract = {Context: When software is released publicly, it is common to include with it either the full text of the license or licenses under which it is published, or a detailed reference to them. Therefore public licenses, including FOSS (free, open source software) licenses, are usually publicly available in source code repositories. Objective: To compile a dataset containing as many documents as possible that contain the text of software licenses, or references to the license terms. Once compiled, characterize the dataset so that it can be used for further research, or practical purposes related to license analysis. Method: Retrieve from Software Heritage—the largest publicly available archive of FOSS source code—all versions of all files whose names are commonly used to convey licensing terms. All retrieved documents will be characterized in various ways, using automated and manual analyses. Results: The dataset consists of 6.9 million unique license files. Additional metadata about shipped license files is also provided, making the dataset ready to use in various contexts, including: file length measures, MIME type, SPDX license (detected using ScanCode), and oldest appearance. The results of a manual analysis of 8102 documents is also included, providing a ground truth for further analysis. The dataset is released as open data as an archive file containing all deduplicated license files, plus several portable CSV files with metadata, referencing files via cryptographic checksums. Conclusions: Thanks to the extensive coverage of Software Heritage, the dataset presented in this paper covers a very large fraction of all software licenses for public code. We have assembled a large body of software licenses, characterized it quantitatively and qualitatively, and validated that it is mostly composed of licensing information and includes almost all known license texts. 
The dataset can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. It can also be used in practice to improve tools detecting licenses in source code.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Roberto Di Cosmo, Stefano Zacchiroli
The Software Heritage Open Science Ecosystem Book Chapter
In: Mens, Tom; Roover, Coen De; Cleve, Anthony (Ed.): Software Ecosystems: Tooling and Analytics, pp. 33–61, Springer International Publishing, Cham, 2023, ISBN: 978-3-031-36060-2.
@inbook{SWHecosystems2023,
title = {The Software Heritage Open Science Ecosystem},
author = {Roberto Di Cosmo and Stefano Zacchiroli},
editor = {Tom Mens and Coen De Roover and Anthony Cleve},
url = {https://doi.org/10.1007/978-3-031-36060-2_2},
doi = {10.1007/978-3-031-36060-2_2},
isbn = {978-3-031-36060-2},
year = {2023},
date = {2023-01-01},
booktitle = {Software Ecosystems: Tooling and Analytics},
pages = {33–61},
publisher = {Springer International Publishing},
address = {Cham},
abstract = {Software Heritage is the largest public archive of software
source code and associated development history, as captured by
modern version control systems. As of July 2023, it has
archived more than 16 billion unique source code files coming
from more than 250 million collaborative development
projects. In this chapter, we describe the Software Heritage
ecosystem, focusing on research and open science use cases.},
keywords = {},
pubstate = {published},
tppubtype = {inbook}
}
2022
Roberto Di Cosmo
Code Source Book Section
In: Dictionnaire du Numérique, vol. February, 2022.
@incollection{dicosmo-hal-03587026,
title = {Code Source},
author = {Roberto Di Cosmo},
url = {https://hal.inria.fr/hal-03587026
http://www.dicosmo.org/Articles/2022-02-code-source_EN.pdf},
year = {2022},
date = {2022-02-01},
urldate = {2022-02-01},
booktitle = {Dictionnaire du Numérique},
volume = {February},
series = {Dictionnaire du Numérique},
keywords = {},
pubstate = {published},
tppubtype = {incollection}
}
Kevin Wellenzohn, Michael H. Böhlen, Sven Helmer, Antoine Pietri, Stefano Zacchiroli
Robust and Scalable Content-and-Structure Indexing Journal Article
In: The VLDB Journal, 2022, ISSN: 1066-8888.
@article{vldb-2022-rscas-swh,
title = {Robust and Scalable Content-and-Structure Indexing},
author = {Kevin Wellenzohn and Michael H. Böhlen and Sven Helmer and Antoine Pietri and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2022/12/vldb-2022-rscas-swh.pdf},
doi = {10.1007/s00778-022-00764-y},
issn = {1066-8888},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
journal = {The VLDB Journal},
publisher = {Springer},
abstract = {Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data from the Software Heritage (SWH) archive, which is the world's largest, publicly-available source code archive.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Davide Rossi, Stefano Zacchiroli
Worldwide Gender Differences in Public Code Contributions (and How They Have Been Affected by the COVID-19 Pandemic) Proceedings Article
In: 44th International Conference on Software Engineering (ICSE 2022) – Software Engineering in Society (SEIS) Track, pp. 172-183, ACM, 2022.
@inproceedings{icse-seis-2022-gender,
title = {Worldwide Gender Differences in Public Code Contributions (and How They Have Been Affected by the COVID-19 Pandemic)},
author = {Davide Rossi and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2022/12/icse-seis-2022-gender.pdf},
doi = {10.1109/ICSE-SEIS55304.2022.9794118},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {44th International Conference on Software Engineering (ICSE 2022) - Software Engineering in Society (SEIS) Track},
pages = {172-183},
publisher = {ACM},
abstract = {Gender imbalance is a well-known phenomenon observed throughout sciences which is particularly severe in software development and Free/Open Source Software communities. Little is known yet about the geography of this phenomenon in particular when considering large scales for both its time and space dimensions. We contribute to fill this gap with a longitudinal study of the population of contributors to publicly available software source code. We analyze the development history of 160 million software projects for a total of 2.2 billion commits contributed by 43 million distinct authors over a period of 50 years. We classify author names by gender using name frequencies and author geographical locations using heuristics based on email addresses and time zones. We study the evolution over time of contributions to public code by gender and by world region. For the world overall, we confirm previous findings about the low but steadily increasing ratio of contributions by female authors. When breaking down by world regions we find that the long-term growth of female participation is a world-wide phenomenon. We also observe a decrease in the ratio of female participation during the COVID-19 pandemic, suggesting that women’s ability to contribute to public code has been more hindered than that of men.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Roberto Di Cosmo
Building the software pillar of Open Science Proceedings Article
In: Open Science European Conference (OSEC 2022), pp. 183–193, OpenEdition Press, 2022, ISBN: 9791036545627.
@inproceedings{osec_2022_en,
title = {Building the software pillar of Open Science},
author = {Roberto Di Cosmo},
url = {http://www.dicosmo.org/Articles/2022-osec-en.pdf},
doi = {10.4000/books.oep.15829},
isbn = {9791036545627},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Open Science European Conference (OSEC 2022)},
pages = {183--193},
publisher = {OpenEdition Press},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Roberto Di Cosmo
Should We Preserve the World's Software History, And Can We? Proceedings Article
In: Silvello, Gianmaria; Corcho, Óscar; Manghi, Paolo; Nunzio, Giorgio Maria Di; Golub, Koraljka; Ferro, Nicola; Poggi, Antonella (Ed.): Linking Theory and Practice of Digital Libraries – 26th International Conference on Theory and Practice of Digital Libraries, TPDL 2022, Padua, Italy, September 20-23, 2022, Proceedings, pp. 3–7, Springer, 2022.
@inproceedings{dicosmo_tpdl_2022,
title = {Should We Preserve the World's Software History, And Can We?},
author = {Roberto Di Cosmo},
editor = {Gianmaria Silvello and Óscar Corcho and Paolo Manghi and Giorgio Maria Di Nunzio and Koraljka Golub and Nicola Ferro and Antonella Poggi},
url = {http://www.dicosmo.org/Articles/2022-TPDL.pdf},
doi = {10.1007/978-3-031-16802-4_1},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Linking Theory and Practice of Digital Libraries - 26th International
Conference on Theory and Practice of Digital Libraries, TPDL 2022,
Padua, Italy, September 20-23, 2022, Proceedings},
volume = {13541},
pages = {3--7},
publisher = {Springer},
series = {Lecture Notes in Computer Science},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Daniele Serafini, Stefano Zacchiroli
Efficient Prior Publication Identification for Open Source Code Proceedings Article
In: 18th International Conference on Open Source Systems (OSS 2022), ACM, 2022.
@inproceedings{oss-2022-swh-scanner,
title = {Efficient Prior Publication Identification for Open Source Code},
author = {Daniele Serafini and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2022/12/oss-2022-swh-scanner.pdf},
doi = {10.1145/3555051.3555068},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {18th International Conference on Open Source Systems (OSS 2022)},
publisher = {ACM},
abstract = {Free/Open Source Software (FOSS) enables large-scale reuse of preexisting software components. The main drawback is increased complexity in software supply chain management. A common approach to tame such complexity is automated open source compliance, which consists in automating the verification of adherence to various open source management best practices about license obligation fulfillment, vulnerability tracking, software composition analysis, and nearby concerns. We consider the problem of auditing a source code base to determine which of its parts have been published before, which is an important building block of automated open source compliance toolchains. Indeed, if source code allegedly developed in house is recognized as having been previously published elsewhere, alerts should be raised to investigate where it comes from and whether this entails that additional obligations shall be fulfilled before product shipment. We propose an efficient approach for prior publication identification that relies on a knowledge base of known source code artifacts linked together in a global Merkle directed acyclic graph and a dedicated discovery protocol. We introduce swh-scanner, a source code scanner that realizes the proposed approach in practice using as knowledge base Software Heritage, the largest public archive of source code artifacts. We validate experimentally the proposed approach, showing its efficiency in both abstract (number of queries) and concrete terms (wall-clock time), performing benchmarks on 16'845 real-world public code bases of various sizes, from small to very large.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Davide Rossi, Stefano Zacchiroli
Geographic Diversity in Public Code Contributions: An Exploratory Large-Scale Study Over 50 Years Proceedings Article
In: The 2022 Mining Software Repositories Conference (MSR 2022), pp. 80-85, ACM, 2022.
@inproceedings{msr-2022-foss-geography,
title = {Geographic Diversity in Public Code Contributions: An Exploratory Large-Scale Study Over 50 Years},
author = {Davide Rossi and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2022/12/msr-2022-foss-geography.pdf},
doi = {10.1145/3524842.3528471},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {The 2022 Mining Software Repositories Conference (MSR 2022)},
pages = {80-85},
publisher = {ACM},
abstract = {We conduct an exploratory, large-scale, longitudinal study of 50 years of commits to publicly available version control system repositories, in order to characterize the geographic diversity of contributors to public code and its evolution over time. We analyze in total 2.2 billion commits collected by Software Heritage from 160 million projects and authored by 43 million authors during the 1971–2021 time period. We geolocate developers to 12 world regions derived from the United Nations geoscheme, using as signals email top-level domains, author names compared with names distributions around the world, and UTC offsets mined from commit metadata. We find evidence of the early dominance of North America in open source software, later joined by Europe. After that period, the geographic diversity in public code has been constantly increasing. We also identify relevant historical shifts related to the UNIX wars, the increase of coding literacy in Central and South Asia, and broader phenomena like colonialism and people movement across countries (immigration/emigration).},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Stefano Zacchiroli
A Large-scale Dataset of (Open Source) License Text Variants Proceedings Article
In: The 2022 Mining Software Repositories Conference (MSR 2022), pp. 757-761, ACM, 2022.
@inproceedings{msr-2022-foss-licenses,
title = {A Large-scale Dataset of (Open Source) License Text Variants},
author = {Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2022/12/msr-2022-foss-licenses.pdf},
doi = {10.1145/3524842.3528491},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {The 2022 Mining Software Repositories Conference (MSR 2022)},
pages = {757-761},
publisher = {ACM},
abstract = {We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license files, plus several portable CSV files for metadata, referencing files via cryptographic checksums.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
2021
Antoine Pietri
Organizing the graph of public software development for large-scale mining PhD Thesis
Université Paris Cité, 2021.
@phdthesis{pietri:tel-03515795,
title = {Organizing the graph of public software development for large-scale mining},
author = {Antoine Pietri},
editor = {Université Paris Cité},
url = {https://hal.science/tel-03515795
https://hal.science/tel-03515795v2/file/va_Pietri_Antoine.pdf},
year = {2021},
date = {2021-11-01},
urldate = {2021-11-01},
number = {2021UNIP7183},
school = {Université Paris Cité},
key = {2021UNIP7183},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
Morane Gruenpeter, Roberto Di Cosmo, Katherine Thornton, Kenneth Seals-Nutt, Carlo Montangero, Guido Scatena
Software Stories for landmark legacy code Technical Report
Inria, 2021.
@techreport{gruenpeter:hal-03483982,
title = {Software Stories for landmark legacy code},
author = {Morane Gruenpeter and Roberto Di Cosmo and Katherine Thornton and Kenneth Seals-Nutt and Carlo Montangero and Guido Scatena},
url = {https://hal.archives-ouvertes.fr/hal-03483982},
year = {2021},
date = {2021-11-01},
institution = {Inria},
keywords = {},
pubstate = {published},
tppubtype = {techreport}
}
Laura Bussi, Roberto Di Cosmo, Carlo Montangero, Guido Scatena
Preserving landmark legacy software with the Software Heritage Acquisition Process Proceedings Article
In: iPres2021 – 17th International Conference on Digital Preservation, Beijing, China, 2021.
@inproceedings{bussi:hal-03375572,
title = {Preserving landmark legacy software with the Software Heritage Acquisition Process},
author = {Laura Bussi and Roberto Di Cosmo and Carlo Montangero and Guido Scatena},
url = {https://hal.archives-ouvertes.fr/hal-03375572},
year = {2021},
date = {2021-10-01},
booktitle = {iPres2021 - 17th International Conference on Digital Preservation},
address = {Beijing, China},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Stefano Zacchiroli
Gender Differences in Public Code Contributions: a 50-year Perspective Journal Article
In: IEEE Software, 2021, ISSN: 0740-7459.
@article{ieee-sw-gender-swh,
title = {Gender Differences in Public Code Contributions: a 50-year Perspective},
author = {Stefano Zacchiroli},
url = {https://arxiv.org/abs/2011.08488
https://www.softwareheritage.org/wp-content/uploads/2021/03/ieee-sw-gender-swh.pdf},
doi = {10.1109/MS.2020.3038765},
issn = {0740-7459},
year = {2021},
date = {2021-01-01},
journal = {IEEE Software},
publisher = {IEEE Computer Society},
abstract = {Gender imbalance in information technology in general, and Free/Open Source Software specifically, is a well-known problem in the field. Still, little is known yet about the large-scale extent and long-term trends that underpin the phenomenon. We contribute to fill this gap by conducting a longitudinal study of the population of contributors to publicly available software source code. We analyze 1.6 billion commits corresponding to the development history of 120 million projects, contributed by 33 million distinct authors over a period of 50 years. We classify author names by gender and study their evolution over time. We show that, while the amount of commits by female authors remains low overall, there is evidence of a stable long-term increase in their proportion over all contributions, providing hope of a more gender-balanced future for collaborative software development.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Thibault Allançon, Antoine Pietri, Stefano Zacchiroli
The Software Heritage Filesystem (SwhFS): Integrating Source Code Archival with Development Proceedings Article
In: ICSE 2021: The 43rd International Conference on Software Engineering, pp. 45-48, IEEE, 2021.
@inproceedings{swh-fuse-icse2021,
title = {The Software Heritage Filesystem (SwhFS): Integrating Source Code Archival with Development},
author = {Thibault Allançon and Antoine Pietri and Stefano Zacchiroli},
url = {https://arxiv.org/abs/2102.06390
https://www.softwareheritage.org/wp-content/uploads/2021/03/swh-fuse-icse2021.pdf},
doi = {10.1109/ICSE-Companion52605.2021.00032},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
booktitle = {ICSE 2021: The 43rd International Conference on Software Engineering},
pages = {45-48},
publisher = {IEEE},
abstract = {We introduce the Software Heritage filesystem (SwhFS), a user-space filesystem that integrates large-scale open source software archival with development workflows. SwhFS provides a POSIX filesystem view of Software Heritage, the largest public archive of software source code and version control system (VCS) development history. Using SwhFS, developers can quickly “checkout” any of the 2 billion commits archived by Software Heritage, even after they disappear from their previous known location and without incurring the performance cost of repository cloning. SwhFS works across unrelated repositories and different VCS technologies. Other source code artifacts archived by Software Heritage—individual source code files and trees, releases, and branches—can also be accessed using common programming tools and custom scripts, as if they were locally available. A screencast of SwhFS is available online at dx.doi.org/10.5281/zenodo.4531411.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
2020
Morane Gruenpeter, Roberto Di Cosmo, Alice Allen, Anita Bandrowski, Peter Chan, Martin Fenner, Leyla Garcia, Catherine M Jones, Daniel S Katz, John Kunze, Moritz Schubotz, Ilian T Todorov
Use cases and identifier schemes for persistent software source code identification Technical Report
2020, (Output from the Research Data Alliance/FORCE11 Software Source Code Identification Working group).
@techreport{SCIDWG2020,
title = {Use cases and identifier schemes for persistent software source code identification},
author = {Morane Gruenpeter and Roberto Di Cosmo and Alice Allen and Anita Bandrowski and Peter Chan and Martin Fenner and Leyla Garcia and Catherine M Jones and Daniel S Katz and John Kunze and Moritz Schubotz and Ilian T Todorov},
editor = {Morane Gruenpeter},
url = {https://doi.org/10.15497/RDA00053},
doi = {10.15497/RDA00053},
year = {2020},
date = {2020-10-06},
publisher = {Zenodo},
note = {Output from the Research Data Alliance/FORCE11 Software Source Code Identification Working group},
keywords = {},
pubstate = {published},
tppubtype = {techreport}
}
Morane Gruenpeter, Roberto Di Cosmo, Hylke Koers, Patricia Herterich, Rob Hooft, Jessica Parland-von Essen, Jonas Tana, Tero Aalto, Sarah Jones
M2.15 Assessment report on 'FAIRness of software' Miscellaneous
2020.
@misc{gruenpeter_morane_2020_5472911,
title = {M2.15 Assessment report on 'FAIRness of software'},
author = {Morane Gruenpeter and Roberto Di Cosmo and Hylke Koers and Patricia Herterich and Rob Hooft and Jessica Parland-von Essen and Jonas Tana and Tero Aalto and Sarah Jones},
url = {https://www.softwareheritage.org/wp-content/uploads/2022/12/M2.15_FAIRsFAIR_Assessment_report_on_FAIRness_of_software_20201016_v1.1.pdf},
doi = {10.5281/zenodo.5472911},
year = {2020},
date = {2020-10-01},
urldate = {2020-10-01},
publisher = {Zenodo},
keywords = {},
pubstate = {published},
tppubtype = {misc}
}
Roberto Di Cosmo
Archiving and Referencing Source Code with Software Heritage Proceedings Article
In: ICMS, pp. 362–373, Springer, 2020, ISBN: 978-3-030-52200-1.
@inproceedings{DBLP:conf/icms/Cosmo20,
title = {Archiving and Referencing Source Code with Software Heritage},
author = {Roberto Di Cosmo},
doi = {10.1007/978-3-030-52200-1_36},
isbn = {978-3-030-52200-1},
year = {2020},
date = {2020-07-15},
booktitle = {ICMS},
volume = {12097},
pages = {362--373},
publisher = {Springer},
series = {Lecture Notes in Computer Science},
abstract = {Software, and software source code in particular, is widely used in modern research. It must be properly archived, referenced, described and cited in order to build a stable and long lasting corpus of scientific knowledge. In this article we show how the Software Heritage universal source code archive provides a means to fully address the first two concerns, by archiving seamlessly all publicly available software source code, and by providing intrinsic persistent identifiers that allow to reference it at various granularities in a way that is at the same time convenient and effective.
We call upon the research community to adopt widely this approach.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Guillaume Rousseau, Roberto Di Cosmo, Stefano Zacchiroli
Software provenance tracking at the scale of public source code Journal Article
In: Empirical Software Engineering, pp. 1-30, 2020, ISSN: 1573-7616.
@article{Rousseau:2020,
title = {Software provenance tracking at the scale of public source code},
author = {Guillaume Rousseau and Roberto Di Cosmo and Stefano Zacchiroli},
url = {https://hal.archives-ouvertes.fr/hal-02543794},
doi = {10.1007/s10664-020-09828-5},
issn = {1573-7616},
year = {2020},
date = {2020-05-29},
journal = {Empirical Software Engineering},
pages = {1-30},
abstract = {We study the possibilities to track provenance of software source code artifacts within the largest publicly accessible corpus of publicly available source code, the Software Heritage archive, with over 4 billions unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how much the same artifacts (e.g., SLOC, files or commits) appear in different contexts (e.g., files, commits or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implication of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original, i.e. never-seen-before, source code files and commits, and find it to be exponential over a period of more than 40 years.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli
Determining the Intrinsic Structure of Public Software Development History Proceedings Article
In: MSR 2020: The 17th International Conference on Mining Software Repositories, pp. 602-605, IEEE, 2020.
@inproceedings{msr-2020-topology,
title = {Determining the Intrinsic Structure of Public Software Development History},
author = {Antoine Pietri and Guillaume Rousseau and Stefano Zacchiroli},
url = {https://arxiv.org/abs/2011.07914
https://www.softwareheritage.org/wp-content/uploads/2021/03/msr-2020-topology.pdf},
doi = {10.1145/3379597.3387506},
year = {2020},
date = {2020-05-01},
booktitle = {MSR 2020: The 17th International Conference on Mining Software Repositories},
pages = {602-605},
publisher = {IEEE},
abstract = {Background: Collaborative software development has produced a wealth of version control system (VCS) data that can now be analyzed in full. Little is known about the intrinsic structure of the entire corpus of publicly available VCS as an interconnected graph. Understanding its structure is needed to determine the best approach to analyze it in full and to avoid methodological pitfalls when doing so. Objective: We intend to determine the most salient network topology properties of public software development history as captured by VCS. We will explore: degree distributions, determining whether they are scale-free or not; distribution of connected component sizes; distribution of shortest path lengths. Method: We will use Software Heritage---which is the largest corpus of public VCS data---compress it using webgraph compression techniques, and analyze it in-memory using classic graph algorithms. Analyses will be performed both on the full graph and on relevant subgraphs. Limitations: The study is exploratory in nature; as such no hypothesis on the findings is stated at this time. Chosen graph algorithms are expected to scale to the corpus size, but it will need to be confirmed experimentally. External validity will depend on how representative Software Heritage is of the software commons.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli
Forking Without Clicking: on How to Identify Software Repository Forks Proceedings Article
In: MSR 2020: The 17th International Conference on Mining Software Repositories, pp. 277-287, IEEE, 2020.
@inproceedings{msr-2020-forks,
title = {Forking Without Clicking: on How to Identify Software Repository Forks},
author = {Antoine Pietri and Guillaume Rousseau and Stefano Zacchiroli},
url = {https://arxiv.org/abs/2011.07821
https://www.softwareheritage.org/wp-content/uploads/2021/03/msr-2020-forks.pdf},
doi = {10.1145/3379597.3387450},
year = {2020},
date = {2020-05-01},
booktitle = {MSR 2020: The 17th International Conference on Mining Software Repositories},
pages = {277-287},
publisher = {IEEE},
abstract = {The notion of software "fork" has been shifting over time from the (negative) phenomenon of community disagreements that result in the creation of separate development lines and ultimately software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single product without stepping on each other's toes. In both cases the VCS repositories participating in a fork share parts of a common development history. Studies of software forks generally rely on hosting platform metadata, such as GitHub, as the source of truth for what constitutes a fork. These “forge forks” however can only identify as forks repositories that have been created on the platform, e.g., by clicking a "fork" button on the platform user interface. The increased diversity in code hosting platforms (e.g., GitLab) and the habits of significant development communities (e.g., the Linux kernel, which is not primarily hosted on any single platform) call into question the reliability of trusting code hosting platforms to identify forks. Doing so might introduce selection and methodological biases in empirical studies. In this article we explore various definitions of "software forks", trying to capture forking workflows that exist in the real world. We quantify the differences in how many repositories would be identified as forks on GitHub according to the various definitions, confirming that a significant number could be overlooked by only considering forge forks. We study the structure and size of fork networks, observing how they are affected by the proposed definitions and discuss the potential impact on empirical research.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli
The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History Proceedings Article
In: MSR 2020: The 17th International Conference on Mining Software Repositories, pp. 1-5, IEEE, 2020.
@inproceedings{msr-2020-challenge,
title = {The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History},
author = {Antoine Pietri and Diomidis Spinellis and Stefano Zacchiroli},
url = {https://arxiv.org/abs/2011.07824
https://www.softwareheritage.org/wp-content/uploads/2021/03/msr-2020-challenge.pdf},
doi = {10.1145/3379597.3387510},
year = {2020},
date = {2020-05-01},
booktitle = {MSR 2020: The 17th International Conference on Mining Software Repositories},
pages = {1-5},
publisher = {IEEE},
abstract = {Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control systems (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and allows to explore a number of research questions on the long tail of public software development, instead of solely focusing on "most starred" repositories as it often happens.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Roberto Di Cosmo, Marco Danelutto
[Rp] Reproducing and replicating the OCamlP3l experiment Journal Article
In: ReScience C, vol. 6, no. 1, 2020.
@article{dicosmo-rescience-2020,
title = {[Rp] Reproducing and replicating the OCamlP3l experiment},
author = {Roberto Di Cosmo and Marco Danelutto},
url = {https://www.softwareheritage.org/wp-content/uploads/2021/03/dicosmo-rescience-2020.pdf
https://zenodo.org/record/3763416/files/article.pdf
https://rescience.github.io/read/#volume-6-2020},
doi = {10.5281/zenodo.3763416},
year = {2020},
date = {2020-04-30},
journal = {ReScience C},
volume = {6},
number = {1},
abstract = {This article provides a full report on the effort to reproduce the work described in the article “Parallel Functional Programming with Skeletons: the OCamlP3L experiment”, written in 1998. It presented OCamlP3L, a parallel programming system written in the OCaml programming language. It turns out that we found the source code of the OCamlP3L system only in Software Heritage: since it was saved with all its development history, we could perform this reproduction experiment.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli
Ultra-Large-Scale Repository Analysis via Graph Compression Proceedings Article
In: SANER 2020: The 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, pp. 184-194, IEEE, 2020.
@inproceedings{saner-2020-swh-graph,
title = {Ultra-Large-Scale Repository Analysis via Graph Compression},
author = {Paolo Boldi and Antoine Pietri and Sebastiano Vigna and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2020/02/saner-2020-swh-graph.pdf
https://upsilon.cc/~zack/research/publications/saner-2020-swh-graph.pdf},
doi = {10.1109/SANER48275.2020.9054827},
year = {2020},
date = {2020-02-21},
booktitle = {SANER 2020: The 27th IEEE International Conference on Software Analysis, Evolution and Reengineering},
pages = {184-194},
publisher = {IEEE},
abstract = {We consider the problem of mining the development history—as captured by modern version control systems—of ultra-large-scale software archives (e.g., tens of millions software repositories corresponding). We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine similarly-sized corpus. As a concrete use case we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits, harvested from more than 80 million software projects—encompassing a full mirror of GitHub. The resulting compressed graph fits in less than 100 GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed in-memory representation of the full corpus can be accessed with excellent performances, with edge lookup times close to memory random access. As a sample exploitation experiment we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main memory access speed.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Pierre Alliez, Roberto Di Cosmo, Benjamin Guedj, Alain Girault, Mohand-Said Hacid, Arnaud Legrand, Nicolas Rougier
Attributing and Referencing (Research) Software: Best Practices and Outlook From Inria Journal Article
In: Computing in Science & Engineering, vol. 22, no. 1, pp. 39-52, 2020, ISSN: 1558-366X.
@article{2020GtCitation,
title = {Attributing and Referencing (Research) Software: Best Practices and Outlook From Inria},
author = {Pierre Alliez and Roberto Di Cosmo and Benjamin Guedj and Alain Girault and Mohand-Said Hacid and Arnaud Legrand and Nicolas Rougier},
url = {https://www.softwareheritage.org/wp-content/uploads/2020/01/2020GtCitation.pdf
https://hal.archives-ouvertes.fr/hal-02135891},
doi = {10.1109/MCSE.2019.2949413},
issn = {1558-366X},
year = {2020},
date = {2020-01-01},
journal = {Computing in Science \& Engineering},
volume = {22},
number = {1},
pages = {39-52},
abstract = {Software is a fundamental pillar of modern scientific research, across all fields and disciplines. However, there is a lack of adequate means to cite and reference software due to the complexity of the problem in terms of authorship, roles, and credits. This complexity is further increased when it is considered over the lifetime of a software that can span up to several decades. Building upon the internal experience of Inria, the French research institute for digital sciences, we provide in this article a contribution to the ongoing efforts in order to develop proper guidelines and recommendations for software citation and reference. Namely, we recommend: first, a richer taxonomy for software contributions with a qualitative scale; second, to put humans at the heart of the evaluation; and third, to distinguish citation from reference.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Roberto Di Cosmo, Jose Benito Gonzalez Lopez, Jean-François Abramatic, Kay Graf, Miguel Colom, Paolo Manghi, Melissa Harrison, Yannick Barborini, Ville Tenhunen, Michael Wagner, Wolfgang Dalitz, Jason Maassen, Carlos Martinez-Ortiz, Elisabetta Ronchieri, Sam Yates, Moritz Schubotz, Leonardo Candela, Martin Fenner, Eric Jeangirard
Scholarly Infrastructures for Research Software Book
European Commission. Directorate General for Research and Innovation., 2020, ISBN: 978-92-76-25568-0.
@book{SIRSReport2020,
title = {Scholarly Infrastructures for Research Software},
author = {Roberto Di Cosmo and Jose Benito Gonzalez Lopez and Jean-François Abramatic and Kay Graf and Miguel Colom and Paolo Manghi and Melissa Harrison and Yannick Barborini and Ville Tenhunen and Michael Wagner and Wolfgang Dalitz and Jason Maassen and Carlos Martinez-Ortiz and Elisabetta Ronchieri and Sam Yates and Moritz Schubotz and Leonardo Candela and Martin Fenner and Eric Jeangirard},
url = {https://data.europa.eu/doi/10.2777/28598},
doi = {10.2777/28598},
isbn = {978-92-76-25568-0},
year = {2020},
date = {2020-01-01},
publisher = {European Commission. Directorate General for Research and Innovation.},
keywords = {},
pubstate = {published},
tppubtype = {book}
}
Roberto Di Cosmo
Announcing biblatex-software Journal Article
In: ACM SIGSOFT Software Engineering Notes, vol. 45, no. 4, pp. 22–23, 2020.
@article{DiCosmo2020b,
title = {Announcing biblatex-software},
author = {Roberto Di Cosmo},
url = {https://hal.archives-ouvertes.fr/hal-02977711},
doi = {10.1145/3417564.3417570},
year = {2020},
date = {2020-01-01},
journal = {ACM SIGSOFT Software Engineering Notes},
volume = {45},
number = {4},
pages = {22--23},
publisher = {Association for Computing Machinery (ACM)},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Roberto Di Cosmo, Morane Gruenpeter, Bruno Marmol, Alain Monteil, Laurent Romary, Jozefina Sadowska
Curated Archiving of Research Software Artifacts: Lessons Learned from the French Open Archive (HAL) Journal Article
In: International Journal of Digital Curation, vol. 15, no. 1, pp. 16, 2020.
@article{DiCosmo2020,
title = {Curated Archiving of Research Software Artifacts: Lessons Learned from the French Open Archive (HAL)},
author = {Roberto Di Cosmo and Morane Gruenpeter and Bruno Marmol and Alain Monteil and Laurent Romary and Jozefina Sadowska},
url = {https://doi.org/10.2218/ijdc.v15i1.698},
doi = {10.2218/ijdc.v15i1.698},
year = {2020},
date = {2020-01-01},
journal = {International Journal of Digital Curation},
volume = {15},
number = {1},
pages = {16},
publisher = {Edinburgh University Library},
abstract = {Software has become an indissociable support of technical and scientific knowledge. The preservation of this universal body of knowledge is as essential as preserving research articles and data sets. In the quest to make scientific results reproducible, and pass knowledge to future generations, we must preserve these three main pillars: research articles that describe the results, the data sets used or produced, and the software that embodies the logic of the data transformation.
The collaboration between Software Heritage (SWH), the Center for Direct Scientific Communication (CCSD) and the scientific and technical information services (IES) of The French Institute for Research in Computer Science and Automation (Inria) has resulted in a specified moderation and curation workflow for research software artifacts deposited in HAL, the French global open access repository. The curation workflow was developed to help digital librarians and archivists handle this new and peculiar artifact - software source code. While implementing the workflow, a set of guidelines has emerged from the challenges and the solutions put in place to help all actors involved in the process.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli
Referencing Source Code Artifacts: a Separate Concern in Software Citation Journal Article
In: Computing in Science & Engineering, 2020, ISSN: 1521-9615.
@article{cise-2020-doi,
title = {Referencing Source Code Artifacts: a Separate Concern in Software Citation},
author = {Roberto Di Cosmo and Morane Gruenpeter and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2020/01/2020-CiSE-swhid-1.pdf
http://www.dicosmo.org/Articles/2020-CiSE-swhid.pdf
https://hal.archives-ouvertes.fr/hal-02446202},
doi = {10.1109/MCSE.2019.2963148},
issn = {1521-9615},
year = {2020},
date = {2020-01-01},
journal = {Computing in Science \& Engineering},
publisher = {IEEE},
abstract = {Among the entities involved in software citation, software
source code requires special attention, due to the role it
plays in ensuring scientific reproducibility. To reference
source code we need identifiers that are not only unique
and persistent, but also support integrity checking
intrinsically. Suitable identifiers must guarantee that
denoted objects will always stay the same, without relying
on external third parties and administrative processes. We
analyze the role of identifiers for digital objects (IDOs),
whose properties are different from, and complementary to,
those of the various digital identifiers of objects (DIOs)
that are today popular building blocks of software and data
citation toolchains. We argue that both kinds of
identifiers are needed and detail the syntax, semantics,
and practical implementation of the persistent identifiers
(PIDs) adopted by the Software Heritage project to
reference billions of software source code artifacts such
as source code files, directories, and commits.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2019
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli
The Software Heritage Graph Dataset: Public software development under one roof Proceedings Article
In: Proceedings of the 16th International Conference on Mining Software Repositories, pp. 138-142, IEEE Press, 2019.
@inproceedings{msr-2019-swh,
title = {The Software Heritage Graph Dataset: Public software development under one roof},
author = {Antoine Pietri and Diomidis Spinellis and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2020/01/msr-2019-swh.pdf
https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf},
doi = {10.1109/MSR.2019.00030},
year = {2019},
date = {2019-05-27},
booktitle = {Proceedings of the 16th International Conference on Mining Software Repositories},
pages = {138-142},
publisher = {IEEE Press},
series = {MSR '19},
abstract = {Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. This paper introduces the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset's contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing. Source code file contents are cross-referenced at the graph leaves, and can be retrieved through individual requests using the Software Heritage archive API.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Mélanie Clément-Fontaine, Roberto Di Cosmo, Bastien Guerry, Patrick Moreau, François Pellegrini
Encouraging a wider usage of software derived from research Online
2019, (Position paper of the software working group of the French National Council for Open Science).
@online{gplo-note-2020,
title = {Encouraging a wider usage of software derived from research},
author = {Mélanie Clément-Fontaine and Roberto Di Cosmo and Bastien Guerry and Patrick Moreau and François Pellegrini},
url = {https://hal.archives-ouvertes.fr/hal-02545142},
year = {2019},
date = {2019-01-01},
institution = {Committee for Open Science's Free Software and Open Source Project Group},
note = {Position paper of the software working group of the French National Council for Open Science},
keywords = {},
pubstate = {published},
tppubtype = {online}
}
2018
Antoine Pietri, Stefano Zacchiroli
Towards Universal Software Evolution Analysis Proceedings Article
In: BENEVOL 2018: The 17th Belgium-Netherlands Software Evolution Workshop, pp. 6-10, 2018, ISSN: 1613-0073.
@inproceedings{benevol-2018-swh,
title = {Towards Universal Software Evolution Analysis},
author = {Antoine Pietri and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2020/01/benevol-2018-swh.pdf
https://upsilon.cc/~zack/research/publications/benevol-2018-swh.pdf},
issn = {1613-0073},
year = {2018},
date = {2018-12-01},
booktitle = {BENEVOL 2018: The 17th Belgium-Netherlands Software Evolution Workshop},
volume = {2361},
pages = {6-10},
series = {CEUR Workshop Proceedings (CEUR-WS)},
abstract = {Software evolution studies have mostly focused on individual software products, generally developed as Free/Open Source Software (FOSS) projects, and more sparingly on software collections like component and package ecosystems. We argue in this paper that the next step in this organic scale expansion is universal software evolution analysis, i.e., the study of software evolution at the scale of the whole body of publicly available software. We consider the case of Software Heritage, the largest existing archive of publicly available software source code artifacts (more than 5 B unique files archived and 1 B commits, coming from more than 80 M software projects). We propose research requirements that would allow to leverage the Software Heritage archive to study universal software evolution. We discuss the challenges that need to be overcome to address such requirements and outline a research roadmap to do so.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli
Building the Universal Archive of Source Code Journal Article
In: Communications of the ACM, vol. 61, no. 10, pp. 29-31, 2018, ISSN: 0001-0782.
@article{cacm-2018-software-heritage,
title = {Building the Universal Archive of Source Code},
author = {Jean-François Abramatic and Roberto Di Cosmo and Stefano Zacchiroli},
editor = {ACM},
url = {https://cacm.acm.org/magazines/2018/10/231366-building-the-universal-archive-of-source-code/fulltext},
doi = {10.1145/3183558},
issn = {0001-0782},
year = {2018},
date = {2018-10-01},
journal = {Communications of the ACM},
volume = {61},
number = {10},
pages = {29-31},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli
Identifiers for Digital Objects: the Case of Software Source Code Preservation Proceedings Article
In: iPRES 2018 – 15th International Conference on Digital Preservation, 2018.
@inproceedings{dicosmo:hal-01865790,
title = {Identifiers for Digital Objects: the Case of Software Source Code Preservation},
author = {Roberto Di Cosmo and Morane Gruenpeter and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2020/01/ipres-2018-swh.pdf
https://hal.archives-ouvertes.fr/hal-01865790},
doi = {10.17605/OSF.IO/KDE56},
year = {2018},
date = {2018-09-01},
booktitle = {iPRES 2018 - 15th International Conference on Digital Preservation},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Yannick Barborini, Roberto Di Cosmo, Antoine R. Dumont, Morane Gruenpeter, Bruno P. Marmol, Alain Monteil, Jozefina Sadowska, Stefano Zacchiroli
The creation of a new type of scientific deposit: Software Miscellaneous
RDA Eleventh Plenary Meeting, Berlin, Germany, 2018, (poster).
@misc{barborini:hal-01738741,
title = {The creation of a new type of scientific deposit: Software},
author = {Yannick Barborini and Roberto Di Cosmo and Antoine R. Dumont and Morane Gruenpeter and Bruno P. Marmol and Alain Monteil and Jozefina Sadowska and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2020/01/barborini-rda-poster.pdf
https://hal.inria.fr/hal-01738741},
year = {2018},
date = {2018-03-21},
howpublished = {RDA Eleventh Plenary Meeting, Berlin, Germany},
note = {poster},
keywords = {},
pubstate = {published},
tppubtype = {misc}
}
Yannick Barborini, Roberto Di Cosmo, Antoine R. Dumont, Morane Gruenpeter, Bruno P. Marmol, Alain Monteil, Jozefina Sadowska, Stefano Zacchiroli
La création du nouveau type de dépôt scientifique – Le logiciel Miscellaneous
JSO 2018 – 7es journées Science Ouverte Couperin : 100 % open access : initiatives pour une transition réussie, 2018, (poster).
@misc{barborini:hal-01688726,
title = {La création du nouveau type de dépôt scientifique - Le logiciel},
author = {Yannick Barborini and Roberto Di Cosmo and Antoine R. Dumont and Morane Gruenpeter and Bruno P. Marmol and Alain Monteil and Jozefina Sadowska and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2020/01/barborini-jso2018-poster.pdf
https://hal.inria.fr/hal-01688726},
year = {2018},
date = {2018-01-22},
howpublished = {JSO 2018 - 7es journées Science Ouverte Couperin : 100 % open access : initiatives pour une transition réussie},
note = {poster},
keywords = {},
pubstate = {published},
tppubtype = {misc}
}
2017
Roberto Di Cosmo, Stefano Zacchiroli
Software Heritage: Why and How to Preserve Software Source Code Proceedings Article
In: iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 2017.
@inproceedings{dicosmo:hal-01590958,
title = {Software Heritage: Why and How to Preserve Software Source Code},
author = {Roberto Di Cosmo and Stefano Zacchiroli},
url = {https://www.softwareheritage.org/wp-content/uploads/2020/01/ipres-2017-swh.pdf
https://hal.archives-ouvertes.fr/hal-01590958},
year = {2017},
date = {2017-09-25},
booktitle = {iPRES 2017: 14th International Conference on Digital Preservation},
address = {Kyoto, Japan},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}