1. General
1.1 What is Software Heritage?
Software Heritage is an open, non-profit infrastructure launched in 2016 by Inria. It is supported by a broad panel of institutional and industry partners, in collaboration with UNESCO.
The long-term goal is to collect all publicly available software in source code form together with its development history, replicate it massively to ensure its preservation, and share it with everyone who needs it.
For more information, see the description of the Software Heritage mission.
1.2 What is the Software Heritage archive?
The Software Heritage archive is the largest public collection of source code in existence. Visit the archive at https://archive.softwareheritage.org.
1.3 What is the size of the archive?
The archive is growing over time as we crawl new source code from software projects and development forges. You can see live counters of the archive contents, as well as a breakdown by crawled origins, on https://archive.softwareheritage.org.
1.4 What are the services provided by Software Heritage?
Software Heritage is a shared platform that offers a growing number of services to a wide range of users.
The features page provides an overview of the features currently available. This includes, for example, archiving software repositories, browsing the archived source code and providing persistent identification.
2. Archiving software
2.1 Which software platforms (forges, package managers, etc.) are archived?
The software origins that are currently regularly archived are listed on the main archive page.
Here is an excerpt of this list:
- Git repositories from multiple forges (GitHub, Bitbucket, GitLab instances, cgit instances, Gitea instances, Phabricator instances, etc.)
- SVN repositories…
- Mercurial repositories…
- Debian packages in apt
- Python packages in PyPI
- R packages in CRAN
- NPM packages in npm.org
- zip archives or tarballs in gnu.org
2.2 If my code is on GitHub/GitLab/Bitbucket, is it already archived in Software Heritage?
It might be, as we crawl these and other popular forges regularly.
Search for your code repository on https://archive.softwareheritage.org/browse/search/.
If it is not there yet, or if the latest snapshot is not the most recent state of your repository, you can trigger a new archival using the Save Code Now functionality at https://archive.softwareheritage.org/save/ or by clicking on the « Save again » button in the browse view.
A GitHub action is available to automatically push a save code now request. Here is an example of this action configured to run each time a new release is issued.
You can also use the browser extension.
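A Save Code Now request can also be sent from a script. The following is a minimal sketch, not an official client: it assumes that the public Web API exposes the save endpoint as /api/1/origin/save/<visit_type>/url/<origin_url>/ (check the API documentation before relying on it). The function names and the example.org repository URL are made up for illustration.

```python
import json
import urllib.request

# Assumed base URL of the public Web API.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def save_request_url(origin_url: str, visit_type: str = "git") -> str:
    """Build the (assumed) Save Code Now endpoint URL for a repository."""
    return f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"


def request_save(origin_url: str, visit_type: str = "git") -> dict:
    """Send the save request with POST and return the decoded JSON answer.

    This function needs network access; it is not called at import time.
    """
    request = urllib.request.Request(
        save_request_url(origin_url, visit_type), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical repository URL, used only to show the shape of the endpoint.
print(save_request_url("https://example.org/user/project.git"))
```

Calling request_save() with the URL of your own repository would submit the request; the JSON answer describes the state of the save request.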
2.3 If I delete the repository that contains my code, will the data stay in Software Heritage?
Yes, all software source code artifacts are preserved for the long term.
2.4 What is the policy for determining what deserves to be archived? Are there requirements for a GitHub, GitLab or XXX repository to be archived by Software Heritage?
We do not inspect or filter the source code, and archive anything that we can get hold of. As a consequence, there are no requirements, but we do suggest following the Software Heritage guidelines for best results.
The reason for this approach is that the value of software source code cannot be known in advance. When a project starts, one cannot predict whether it will become a key software component or not. For example, when Rasmus Lerdorf released the first version of PHP back in 1995, who could have predicted that it would become one of the most popular tools for the Web?
It also happens that very precious pieces of source code go unnoticed for decades, until one day some unexpected bug reveals that a big part of our digital infrastructure relies on them.
2.5 Is the code checked for a LICENSE file or any specific characteristic in the repository before archiving?
Software Heritage archives everything that is publicly available, without preliminary tests or checks.
This means that you are responsible for checking whether the source code you find in the archive can be reused, and under which terms.
For the code that you produce, we do suggest following standard best practices, which are recalled in the Software Heritage guidelines; this includes adding licensing information.
2.6 Do you also archive software executables (aka binaries)?
Our core mission is to preserve source code, because it is human readable and contains precious information that is stripped out in the executables. As a consequence, we do not actively archive binaries, but if binaries are included in a software repository, we do not filter them out in the archival process. Hence you can find a few binaries in the archive.
2.7 I can’t find all my « releases » in a git repository in Software Heritage, what should I do?
Do not worry, your repository has been saved in full.
What you are witnessing is just a terminological difference between what platforms like GitHub call « releases » (any non-annotated git tag) and what we call « releases » (a node in the Merkle tree, which corresponds to a git annotated tag). This is a common issue, as you can see for example in this discussion thread.
Let’s say you tagged your release naming it « FinalSubmission », but you did not use an annotated tag: in this case, it will not show up in the Releases tab on Software Heritage, but it is there nonetheless! Click on the branch dropdown menu on the Software Heritage Web interface and you’ll find it listed as « refs/tags/FinalSubmission ». If you want a release to appear in our web interface you should create your tags using « git tag -a », instead of simply « git tag », or create the release directly on the code hosting platform, which uses the proper « git tag -a » behind the scenes, and then archive your repository again.
3. Referencing and identification
3.1 What is a SWHID (SoftWare Hash Identifier)?
The SWHID (SoftWare Hash Identifier) is a persistent intrinsic identifier that is computed uniquely from the software artifact itself. See the dedicated blog post to learn more about intrinsic and extrinsic identifiers.
All details about the syntax, semantics, interoperability and implementation can be found in the formal specification.
The following diagram shows concisely the key components of a SWHID:
The top yellow box in the diagram corresponds to the « core SWHID ». It is possible to add qualifiers to a core SWHID in order to provide additional information about the location of the object in the Software Heritage graph, or its origin, and to identify code fragments.
3.2 What can be identified with a SWHID?
First, let’s notice that software can be identified at quite different levels of granularity, ranging from a conceptual level (e.g. the name of a software project), to concrete software artifacts (e.g. a directory containing plenty of files).
The SoftWare Hash Identifiers are designed to identify permanently and intrinsically all the levels of granularity that correspond to concrete software artifacts: snapshots, releases, commits, directories, files and code fragments.
A core SWHID can be used to identify the following source code artifacts:
- file contents; for example, swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 points to the content of a file containing the full text of the GPL3 license
- directories; for example, swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505 points to a directory containing the source code of the Darktable photography application as it was at some point on 4 May 2017
- revisions (a.k.a. commits); for example, swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d points to a commit in the development history of Darktable, dated 16 January 2017, that added undo/redo support for masks
- releases; for example, swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f points to Darktable release 2.3.0, dated 24 December 2016
- snapshots; for example, swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453 points to a snapshot of the entire Darktable Git repository taken on 4 May 2017 from GitHub
Using the « lines » qualifier, it is also possible to identify « code fragments », i.e. selected lines of code.
For example, swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=4-6 pinpoints lines 4 to 6 of the full text of the GPL3 license.
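The structure of these identifiers (a core SWHID, optionally followed by qualifiers separated by semicolons) can be made explicit with a small parser. This is an illustrative sketch written for this FAQ, not the reference implementation: the class and function names are hypothetical, and it only checks the core part against the five object types listed above. The formal specification remains the authority, in particular for the escaping rules of qualifier values.

```python
import re
from typing import Dict, NamedTuple

# Core SWHID: swh:<scheme version>:<object type>:<40 hexadecimal digits>
CORE_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})$")


class SWHID(NamedTuple):
    object_type: str            # cnt, dir, rev, rel or snp
    object_id: str              # 40 hexadecimal digits
    qualifiers: Dict[str, str]  # e.g. {"lines": "4-6"}

    @property
    def core(self) -> str:
        """The core SWHID, without any qualifier."""
        return f"swh:1:{self.object_type}:{self.object_id}"


def parse_swhid(text: str) -> SWHID:
    """Split a SWHID into its core part and its key=value qualifiers."""
    core, *rest = text.split(";")
    match = CORE_RE.match(core)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {core!r}")
    # Split on the first '=' only, since qualifier values may contain '='.
    qualifiers = dict(item.split("=", 1) for item in rest)
    return SWHID(match.group(1), match.group(2), qualifiers)


fragment = parse_swhid("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=4-6")
print(fragment.object_type, fragment.qualifiers, fragment.core)
```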
More generally, using a fully qualified SWHID provides all relevant information for placing a software artifact in context. For example, the following SWHID pinpoints the core mapping algorithm contained in the file parmap.ml, located in the src directory of a specific revision of the Parmap project retrieved from https://github.com/rdicosmo/parmap
3.3 How can I get a SWHID for my software?
The core SWHID identifier is intrinsic, so yes, you can compute the core SWHID of any software artifact locally on your machine! You can find instructions in the documentation available here!
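To give an idea of what computing an identifier locally means, here is a minimal sketch for the simplest case, a single file content. It assumes, as described in the SWHID specification, that content identifiers use the same salted SHA-1 as Git blobs; the function name is hypothetical, and directories, revisions, releases and snapshots need more elaborate rules that are not shown here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Core SWHID of a file content, computed like a Git blob hash."""
    # Git-style header: object type, length in bytes, then a NUL byte.
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The empty content has the same hash as the empty Git blob.
print(content_swhid(b""))
```

For a real file, read it in binary mode and pass the bytes to the function; the result depends only on the bytes of the file, not on its name or location.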
You can also get the full SWHID for any archived software artifact directly from the Software Heritage archive: using the red vertical tab called « Permalinks », present on every page that shows source code (see this HOWTO for more details). The advantage of this second approach is that you can get a SWHID with relevant contextual information (e.g. the position of your artifact in the global graph of software development).
The « Permalinks » tab lets you obtain a SWHID for the content that you are browsing. Here is an example:
Clicking on « Copy identifier » you can get the SWHID in your clipboard. Clicking on « Copy permalink » you can get in your clipboard the corresponding URL.
The « Add contextual information » checkbox allows you to choose whether you will get a core SWHID or the SWHID with the extra qualifiers that provide contextual information.
Notice that the Permalinks tab offers a plurality of options to pick a SWHID (you may get the one for the file content, the directory that contains it, the revision, the release or the snapshot). See the following question to understand which is best for your use case.
3.4 Which type of SWHID should I use in my article/documentation?
It really depends on your use case, but as a general suggestion we recommend taking the full SWHID of a directory (with the contextual information).
When writing a research article, a blog post or technical documentation, one may face some tension between the need to provide the maximum amount of information, using the full SWHID, or keeping the reference short (for example due to page limitations).
Here is the recommended best practice to address this issue:
- get the full SWHID for the ‘directory’ containing the version of the code you want to reference. Here is an example of such a full SWHID:
swh:1:dir:013573086777370b558b1a9ecb6d0dca9bb8ea18;origin=https://gitlab.com/lemta-rheosol/craft-virtual-dma;visit=swh:1:snp:ef1a939275f05b667e189afbeed5fd59cca51c9d;anchor=swh:1:rev:ad74f6a7f73c7906f9b36ee28dd006231f42552e
- ensure the « core SWHID » (swh:1:dir:013573086777370b558b1a9ecb6d0dca9bb8ea18 in the example above) is printed, and the full SWHID is available at least as a hyperlink.
This effect can be achieved as follows in LaTeX:
\href{https://archive.softwareheritage.org/swh:1:dir:013573086777370b558b1a9ecb6d0dca9bb8ea18;origin=https://gitlab.com/lemta-rheosol/craft-virtual-dma;visit=swh:1:snp:ef1a939275f05b667e189afbeed5fd59cca51c9d;anchor=swh:1:rev:ad74f6a7f73c7906f9b36ee28dd006231f42552e/}{swh:1:dir:013573086777370b558b1a9ecb6d0dca9bb8ea18}
or in Markdown:
[swh:1:dir:013573086777370b558b1a9ecb6d0dca9bb8ea18](https://archive.softwareheritage.org/swh:1:dir:013573086777370b558b1a9ecb6d0dca9bb8ea18;origin=https://gitlab.com/lemta-rheosol/craft-virtual-dma;visit=swh:1:snp:ef1a939275f05b667e189afbeed5fd59cca51c9d;anchor=swh:1:rev:ad74f6a7f73c7906f9b36ee28dd006231f42552e/)
This approach ensures that in any printed version the reader will find the identifier that is most useful for reproducibility: the core SWHID of the directory. Indeed, the core SWHID of a directory can be locally computed from any copy of the source code, independently of the release or commit that contains it in a specific project.
In the digital version the clickable link uses the full SWHID to let the reader browse the code in Software Heritage with all the proper context (version, origin, etc.).
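If you cite many artifacts, the two snippets above can be generated rather than typed by hand. The helper below is a small sketch with hypothetical function names: it prints the core SWHID as the visible text and uses the full SWHID in the link target, following the URL pattern of the examples above.

```python
ARCHIVE_ROOT = "https://archive.softwareheritage.org/"


def core_of(full_swhid: str) -> str:
    """Drop the qualifiers and keep only the core SWHID."""
    return full_swhid.split(";", 1)[0]


def latex_reference(full_swhid: str) -> str:
    """LaTeX snippet (hyperref's \\href): core SWHID printed, full SWHID linked."""
    return "\\href{%s%s/}{%s}" % (ARCHIVE_ROOT, full_swhid, core_of(full_swhid))


def markdown_reference(full_swhid: str) -> str:
    """Markdown link: core SWHID printed, full SWHID in the target URL."""
    return "[%s](%s%s/)" % (core_of(full_swhid), ARCHIVE_ROOT, full_swhid)


full = (
    "swh:1:dir:013573086777370b558b1a9ecb6d0dca9bb8ea18"
    ";origin=https://gitlab.com/lemta-rheosol/craft-virtual-dma"
    ";visit=swh:1:snp:ef1a939275f05b667e189afbeed5fd59cca51c9d"
    ";anchor=swh:1:rev:ad74f6a7f73c7906f9b36ee28dd006231f42552e"
)
print(latex_reference(full))
print(markdown_reference(full))
```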
3.5 I want the full SWHID for a source code that is not already in the archive. How can I proceed? How long will it take?
If your code (or the latest version of it) is not yet in the archive, you need first to trigger its archival. This can be done with a « Save Code Now » request, or via the deposit API.
Once a Save Code Now request is issued, the ingestion of the code is usually completed in a few minutes, depending on the size of the repository. Once it’s done, the status of the save request is updated and you can get the SWHID as shown before.
When a deposit is submitted, the ingestion is also usually completed in a few minutes and the SWHID is accessible through the SWORD status response.
4. Access and Reuse
4.1 Can I reuse the source code artifacts I find on Software Heritage?
It depends on the license of the artifact, as stored alongside the source code: you must check this license before downloading or reusing it. If you cannot find the license information, you should assume that you have no right to reuse it.
All software components present in the Archive may be covered by copyright, or other rights like patents or trademarks. Software Heritage may provide automatically derived information on the software license(s) that may apply to a given software component, but it makes no claim of correctness and the licence information provided does not constitute legal advice. You are solely responsible for determining the license, or other rights that apply to any software component in the Archive, and you must abide by its terms.
4.2 Can I clone a repository using Software Heritage?
Please do not clone a full repository directly from Software Heritage: it is an archive, not a forge. Try first to clone a repository from the place where it is developed: it will be faster and, as an added bonus, you will already be in the right place to interact with its developers.
Software Heritage stores all the software artifacts in a massive shared Merkle tree, so that exporting (a specific version of) an archived repository implies traversing the graph to get all the relevant contents and packaging them up for your consumption. This operation is much more expensive than downloading an existing tar file or cloning a repository from a forge.
If Software Heritage really is your last resort, and you cannot find the source code of interest elsewhere, we recommend that you download only the version you are interested in, using the « directory » option of the Download button that you find when you browse the archive.
If absolutely needed, you can use the more expensive « revision » option of the Download button, which will prepare for you the equivalent of a git bare clone that you will be able to use offline. This may require quite some time (hours, or even days for huge repositories).
4.3 Can I retrieve a source code artifact through the API?
Yes, you can. If you have the SWHID at hand, you can call the API endpoint that matches its object type, then follow the links in the responses to navigate the graph of project artifacts. Check out the API documentation for the complete list of endpoints.
- /api/1/snapshot/ which allows you to get the snapshot’s branches and tags, each with a target_url key that contains the URL to:
- /api/1/release/ or /api/1/revision/ which allow you to get the release’s or revision’s data. Assuming you get a revision, the directory_url key contains a URL to:
- /api/1/directory/ which lists entries of the root directory, with links to other directories and content objects
- /api/1/content/ which returns all the information about a given file content, including a link to the raw data.
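As a starting point for scripts, the mapping between the object type of a core SWHID and these endpoints can be sketched as follows. This is a hedged illustration, not an official client: it assumes that each endpoint takes the hexadecimal identifier as the next path component and that contents are addressed with a sha1_git: prefix; the function name is hypothetical, so check the API documentation for the exact URL forms.

```python
# Assumed base URL of the public Web API.
API_ROOT = "https://archive.softwareheritage.org/api/1"

# Object type of the core SWHID -> (assumed) endpoint path template.
ENDPOINTS = {
    "snp": "snapshot/{id}/",
    "rel": "release/{id}/",
    "rev": "revision/{id}/",
    "dir": "directory/{id}/",
    "cnt": "content/sha1_git:{id}/",
}


def api_url(core_swhid: str) -> str:
    """Return the API URL describing the object named by a core SWHID."""
    # Unpacking raises ValueError if the identifier does not have four parts.
    scheme, version, object_type, object_id = core_swhid.split(":")
    if (scheme, version) != ("swh", "1") or object_type not in ENDPOINTS:
        raise ValueError(f"unsupported identifier: {core_swhid!r}")
    return f"{API_ROOT}/{ENDPOINTS[object_type].format(id=object_id)}"


print(api_url("swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505"))
```

The JSON document returned by such a URL contains the target_url or directory_url keys mentioned above, which can then be followed one request at a time.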
You can also look up an origin, and follow its visits:
- /api/1/origin/search/ allows you to search the exact URL of the code repository
- /api/1/origin/visits/ allows you to list the times Software Heritage visited the repository, and get the snapshot associated with each visit. For each visit, this snapshot is available as a snapshot_url key, that contains the URL to get the corresponding snapshot object.
If you are interested in downloading a large part of the repository (a directory or a set of revisions), you should use the download service called the Vault. The Vault allows you to fetch them in batch and download a tarball. The list of Vault endpoints is available at the end of the list of all API endpoints.
5. Software metadata
5.1 Can I add metadata to my software?
Yes. Any user can add metadata files to the repository; they are ingested and indexed when they use one of the supported file formats (codemeta.json, package.json, pom.xml, etc.).
Follow the Software Heritage guidelines on how to prepare your code for archival.
More information about the formats that are indexed, and a general overview of the metadata workflow, can be found in the blog post about mining for software metadata.
5.2 What metadata are preserved from a code repository, with Save Code Now?
All metadata contained in the source code repository itself is preserved. This includes the development history, commit dates and commit messages.
At the moment, other metadata artifacts which are not part of the repository (known as extrinsic metadata) are not preserved when using the Save Code Now feature.
5.3 What metadata are preserved with a deposited software artifact?
All metadata sent via the SWORD protocol along with the software artifact is preserved. For more information visit the deposit documentation.
5.4 What is the codemeta.json file, why should I use it?
As software developers, we may want to provide a machine readable description of our projects, but there are (too) many metadata schemas for describing software, and one can easily get lost.
The CodeMeta initiative created a common vocabulary to address this issue, based on (a slight extension of) the SoftwareApplication and SoftwareSourceCode classes of the well-established schema.org initiative, and provides tools to convert back and forth from other metadata schemas.
The codemeta.json file is a JSON-LD representation of the CodeMeta vocabulary, that can be easily created and validated using the Open Source codemeta generator tool. By adding a codemeta.json file to your project, you make it easy to share metadata information, and reduce the burden of retyping a lot of information in data entry forms.
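To show what such a file looks like, here is a minimal sketch that builds a codemeta.json document with Python's standard library. The project name, description, repository URL and author are placeholders invented for this example, and the selection of properties is only a plausible minimum; the CodeMeta generator and the CodeMeta documentation remain the reference for the available terms.

```python
import json

# Placeholder metadata: every value below is made up for illustration.
codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "example-project",
    "description": "A short description of what the software does.",
    "codeRepository": "https://example.org/user/example-project",
    "license": "https://spdx.org/licenses/GPL-3.0-or-later",
    "programmingLanguage": "Python",
    "author": [
        {"@type": "Person", "givenName": "Ada", "familyName": "Lovelace"}
    ],
}


def render_codemeta(metadata: dict) -> str:
    """Serialize the metadata as the text of a codemeta.json file."""
    return json.dumps(metadata, indent=2, ensure_ascii=False) + "\n"


print(render_codemeta(codemeta))
```

Saving the printed text as codemeta.json at the root of the repository, and archiving the repository again, is enough for the file to be picked up with the rest of the source code.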
For example, the French HAL national open access archive looks for a codemeta.json file when a software project archived in Software Heritage is deposited, and pre-fills the deposit form using the information it contains, a real time saver!
Last but not least, Software Heritage indexes the metadata contained in codemeta.json files and makes it searchable on the web-app using the CodeMeta crosswalk table. The crosswalk table is the Rosetta stone of software metadata, facilitating translation between ontologies and metadata standards for software.
5.5 Does Software Heritage perform a check on the metadata (e.g. to verify whether a licence is declared)?
The short answer is no. Software Heritage does not perform any a priori filtering of the repositories that are archived.