Software Heritage in 2019: a progress report
2019 was a year of great achievements for Software Heritage in pursuing our mission to collect, preserve and share the source code of all software ever written.
Today is a good time to look back and talk about what has been accomplished in 2019 since our last activity report, and give some perspective on the future.
Policy and Awareness
An important part of our mission is to raise awareness about the importance of software, and software source code, in all aspects of human activity.
This year we pursued and intensified our collaboration with UNESCO, leading to the publication of the Paris Call on Software Source Code, signed by more than 40 international experts.
The Paris Call provides a strong basis to support a variety of policy actions, ranging from source code preservation, to sustainability of Free and Open Source Software communities.
Pursuing our efforts to have software source code recognised as a pillar of Open Science, we welcomed the French Ministry of Research and Innovation, and started our work in the FAIRsFAIR and EOSC-Pillar European research projects.
On the key issue of attributing and referencing research software, we intensified our engagement with the RDA and FORCE international communities, and published key recommendations leveraging Inria’s 50 years of experience in the field.
To help researchers in all disciplines improve the reproducibility of their work and enhance their articles using the Software Heritage intrinsic identifiers, we published detailed and actionable guidelines for saving and referencing research software source code, and reached out to the artefact evaluation committees of major international conferences.
We also joined forces with GNU Guix to enable long term reproducibility.
Last, but not least, together with many international organisations, we played an essential albeit inconspicuous role in protecting software development from extremely damaging provisions in the European copyright reform adopted on April 15th, 2019.
Progress on the roadmap
A significant effort went into making progress on our technical and strategic roadmap, continuing our work to collect, preserve and share an ever-growing part of the source code of all software ever written.
Collect
The mission of Software Heritage is to collect the source code of all software ever written, and this is a complex undertaking: some software is easily available (online), some is not (offline), and while a growing part of it is open, a lot is still behind closed doors.
We have laid out the different strategies we plan to adopt, depending on the kind of source code at stake, summarised in the diagram shown here: automation, crowdsourcing, focused search and escrow.
This year we made progress on the automation side by adding npm and PyPI to the list of software origins that are harvested systematically. To support crowdsourcing, Software Heritage now allows you to issue "save code now" requests to archive version control repositories hosted on forges it does not yet harvest systematically. This lets you ensure that work you cherish is preserved, and you can trigger new archivals whenever you want.
A remarkable use of this new feature is made by the French Digital Directorate, which maintains a list of public sector software source code and leverages the save code now API to ensure that each and every project on it is archived in Software Heritage.
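As an illustration of how such a request might be prepared programmatically, here is a minimal Python sketch. It assumes the save code now endpoint of the public Software Heritage Web API has the form `/api/1/origin/save/<visit type>/url/<origin url>/` and is called with POST; the helper name `build_save_request`, the set of accepted visit types and the example repository URL are hypothetical choices made for this sketch. The request is only built here, not sent.

```python
import urllib.request

# Assumed root of the public Software Heritage Web API.
API_ROOT = "https://archive.softwareheritage.org/api/1"

# Visit types assumed for this sketch; the archive may accept others.
VISIT_TYPES = {"git", "hg", "svn"}


def build_save_request(origin_url: str, visit_type: str = "git") -> urllib.request.Request:
    """Prepare (without sending) a POST request asking the archive to save a repository."""
    if visit_type not in VISIT_TYPES:
        raise ValueError(f"unsupported visit type: {visit_type!r}")
    # The origin URL is placed as-is in the path (assumption about the endpoint layout).
    endpoint = f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"
    return urllib.request.Request(endpoint, method="POST")


# Hypothetical repository URL, for illustration only.
request = build_save_request("https://example.org/some/project.git")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would submit it; a maintained list of repositories, such as the one described above, could be processed by calling the helper once per entry.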
The Software Heritage Acquisition Process
We also took an important step forward to kickstart the focused search work needed to collect and curate landmark legacy source code written by pioneers of the digital age, many of whom are still around and willing to contribute their knowledge.
In collaboration with UNESCO and the University of Pisa, we developed the Software Heritage Acquisition Process (SWHAP), intended to support and empower all those who are interested in contributing to this effort.
The first set of SWHAP guidelines is available, providing concrete, actionable instructions, as well as a detailed walkthrough of the process on a medium-sized landmark legacy software system developed over twenty years ago at the Department of Computing of the University of Pisa.
You can contribute to this important mission: use the SWHAP process, and start your curation journey today!
Preserve
An important part of our long-term strategy to ensure that the precious source code we collect is preserved and passed on to future generations is the development of a geographically distributed network of mirrors, implemented using a variety of storage technologies, running in various administrative domains, controlled by different institutions, and located in different jurisdictions.
We are delighted to report that the mirror network has grown: after the first industry member from Sweden, this year we have been thrilled to welcome ENEA, from Italy, as the first institutional partner. This is an important step forward for the Software Heritage mirror network, which we hope to see start operating next year.
Another part of the long-term strategy is to establish collaborations with institutional archives to store regular offline snapshots of the archive contents: this year we took a first step in this direction by partnering with Cines in the framework of the EOSC-Pillar European research project.
Share
This year has also seen significant progress in our efforts to make the contents of the archive easily accessible and referenceable for a variety of users.
Software Heritage intrinsic identifiers are showcased in a blog post published on the anniversary of the first manned landing on the moon.
The permalinks tab that provides these identifiers for the tens of billions of software artifacts in the archive has been improved: it now offers badges that you can use to enhance your web pages, and point to the archived version of the artifact you are interested in.
We also made the whole graph of Software Heritage available to researchers, on both AWS and Azure: this dataset is used for the mining challenge of the MSR 2020 international conference.
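To give a sense of what "intrinsic" means here, the following Python sketch computes an identifier for a single file directly from its bytes. It assumes that a content identifier has the form `swh:1:cnt:<hex digest>` and that the digest is the same SHA-1 that Git computes for a blob (the bytes prefixed with a `blob <length>` header and a NUL byte); the function name `content_identifier` is a hypothetical helper for this sketch, and identifiers for directories, revisions, releases and snapshots are not covered.

```python
import hashlib


def content_identifier(data: bytes) -> str:
    """Compute a Software Heritage style identifier for raw file contents.

    Assumption: the digest is the Git blob hash, i.e. SHA-1 over
    b"blob <length>\\0" followed by the content itself.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# The identifier depends only on the bytes, not on where the file is stored.
print(content_identifier(b""))
# → swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier is derived from the content alone, anyone holding a copy of the file can recompute it and check that it matches the archived version, without relying on a central registry.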
Looking ahead
There are so many exciting areas of development and collaboration that will keep us busy in the coming years, so it’s now time to set some priorities.
Saving massively endangered source code will always be a topmost priority: we already know that time and energy will need to be devoted to salvaging the 250K+ Mercurial repositories that Bitbucket is planning to remove by June 2020.
After that, our primary goal will be to ensure that the key functionalities that Software Heritage offers are rock solid: browsing, referencing, and saving source code.
Then, we will focus on scaling up, and deploying the mirror network, to cope with the growing amount of source code that will need to be harvested. We count on the recent partnership established with GitHub to improve the efficiency of archiving the software projects hosted on GitHub, through the GitHub Archive Program and dedicated support from GitHub’s teams, and hope to see more forges following GitHub’s example and establishing partnerships that ease the archival of their contents.
We’ll also work on several new exciting functionalities to make the archive even more usable.
Last, but not least, after some first steps made with GSoC and a few other collaborators, we look forward to fostering the emergence of a broad community of contributors to complement the effort of the Software Heritage core team: it is essential that everybody interested and concerned steps up, if we collectively want to take up the huge challenge that underlies the mission we have undertaken!
— Roberto Di Cosmo