The Software Heritage features
Sharing the source code commons with a set of trustworthy and open services, providing access to the largest source code library in the world.
Browse & Search
The SWH archive is the gateway to all captured source code and its entire development history. On the browsable platform, you can see every visit made to a given code location (collected from different forges, package managers and distros) and read the source code captured at each visit.
SWHID provider & resolver
SWH provides a Persistent IDentifier (PID), called a SWHID, that identifies each and every source code artifact and allows its integrity to be checked. SWHIDs are intrinsic identifiers, intimately bound to the object they designate: they do not need a registry, only agreement on a standard to resolve them.
The SWHID can also be used as a badge.
Go to the resolver API endpoint
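To illustrate what "intrinsic" means here, the identifier of a file content can be computed from its bytes alone, without asking any registry. The following is a minimal sketch, assuming content SWHIDs reuse Git's blob hashing (SHA-1 over a "blob <length>\0" header followed by the data), as the SWHID specification describes. The function name content_swhid is hypothetical.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content (a "cnt" object).

    Assumption: the identifier is the Git blob hash of the bytes,
    prefixed with "swh:1:cnt:".
    """
    # Git-style header: object type, payload length, NUL separator.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# The empty file has a well-known identifier.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier depends only on the content, anyone holding the same bytes obtains the same SWHID, and any change to the bytes yields a different one.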
Download
The Vault is the service in charge of reconstructing parts of the archive as self-contained bundles that can then be imported locally, for instance into a Git repository. With the Vault, users can download directories and revisions from the web platform or through the API.
Go to the download directory API endpoint
Save Code Now
It will take some time to get to every repository in the world, especially since these repositories may change several times a day. This is why the “Save Code Now” service is provided: it lets anyone notify SWH of a repository to archive by submitting a save request.
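A save request can also be submitted programmatically. The sketch below only builds the HTTP request and does not send it. It assumes the Web API exposes save requests as a POST to origin/save/<visit_type>/url/<origin_url>/ under the API root; check the endpoint index before relying on it. The helper names and the example repository URL are hypothetical.

```python
import urllib.request

API_ROOT = "https://archive.softwareheritage.org/api/1/"


def save_request_url(origin_url: str, visit_type: str = "git") -> str:
    """Build the URL of a save request (assumed endpoint layout)."""
    return f"{API_ROOT}origin/save/{visit_type}/url/{origin_url}/"


def build_save_request(origin_url: str, visit_type: str = "git") -> urllib.request.Request:
    """Prepare, without sending, the POST that files a save request."""
    return urllib.request.Request(
        save_request_url(origin_url, visit_type),
        method="POST",
        headers={"Accept": "application/json"},
    )


# Hypothetical repository, used only to show the resulting URL.
request = build_save_request("https://example.org/user/project.git")
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would actually submit it; this step is left out so the sketch has no side effects.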
Deposit
The deposit feature is a SWORD 2.0 server implementation. SWORD (Simple Web-service Offering Repository Deposit) is an interoperability standard for digital file deposit. The deposit allows a client (a repository, e.g. HAL) to submit software source archives and their associated metadata to the SWH archive. Metadata can also be submitted with a reference to a repository URL (origin) or a SWHID.
Add Forge Now
“Add Forge Now” lets Software Heritage users have a complete forge saved in the Software Heritage archive by requesting the addition of the forge URL to the list of regularly visited forges.
The process follows a validation workflow, including curation, and verification that the forge technology is supported by Software Heritage tools.
Crawling
The SWH archive harvests source code from different sources and converts all the source code into a single and universal data structure, an enormous Merkle directed acyclic graph [Merkle, 1987], which is a classical cryptographic construction, combining a tree and a hash function.
Crawling is separated into three phases: listing software sources, scheduling updates and collecting the software artifacts into the archive.
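The Merkle structure mentioned above can be illustrated with a toy example. This is a simplified sketch, not the hashing format actually used by the archive: files are modeled as bytes, directories as dictionaries, and the name node_id is hypothetical. Each directory identifier is derived from the names and identifiers of its entries, so identical subtrees get the same identifier and any change propagates up to the root.

```python
import hashlib


def node_id(node) -> str:
    """Identify a file (bytes) or a directory (dict of name -> node)."""
    if isinstance(node, bytes):
        # Leaf: hash of the file content.
        return hashlib.sha1(b"file\x00" + node).hexdigest()
    # Inner node: hash over the sorted (name, child identifier) pairs.
    hasher = hashlib.sha1(b"dir\x00")
    for name in sorted(node):
        hasher.update(name.encode() + b"\x00" + node_id(node[name]).encode() + b"\n")
    return hasher.hexdigest()


tree_a = {"README": b"hello\n", "src": {"main.c": b"int main(void) { return 0; }\n"}}
tree_b = {"README": b"hello\n", "src": {"main.c": b"int main(void) { return 1; }\n"}}

# The unchanged file keeps its identifier, so it only needs to be stored once.
print(node_id(tree_a["README"]) == node_id(tree_b["README"]))
# The modified file changes the identifier of "src" and of the root.
print(node_id(tree_a) == node_id(tree_b))
```

This sharing of identical nodes is what allows a single graph to hold many repositories and versions without duplicating unchanged files and directories.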
Behind the scenes
Archiving all the source code is a daunting task and there are different mechanisms put in place to ensure the preservation of source code from different types of origins.
API
API access is over HTTPS. All API endpoints are rooted at https://archive.softwareheritage.org/api/1/ and the data is sent and received as JSON by default.
You can jump directly to the endpoint index, which lists all available API functionalities, or read on for more general information about the API.
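As a minimal sketch of a client for this API, the helper below joins a path to the API root given above and decodes the JSON response. The function names endpoint_url and get_json are hypothetical, and the resolve/<SWHID>/ path used in the usage note is an assumption to be checked against the endpoint index.

```python
import json
import urllib.request

API_ROOT = "https://archive.softwareheritage.org/api/1/"


def endpoint_url(path: str) -> str:
    """Join a relative endpoint path to the API root."""
    return API_ROOT + path.lstrip("/")


def get_json(path: str, timeout: float = 30.0):
    """Send a GET request to an endpoint and decode its JSON body."""
    request = urllib.request.Request(
        endpoint_url(path), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(endpoint_url("/resolve/swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391/"))
```

Under these assumptions, calling get_json("resolve/swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391/") would ask the archive to resolve that SWHID. The call is not made here because it requires network access.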
Architecture
Archiving a repository from a forge isn’t the same action as archiving source code from a package manager. It becomes even harder when you realize that version control systems have evolved a lot over the last decades. The SWH architecture was designed to harmonize different sources into a robust infrastructure.
Data model
The data model adopted by Software Heritage to represent the information that it collects is centered around the notion of software artifact, using the following canonical names, from bottom to top: contents, directories, revisions and releases. In addition, origins, visits and snapshots are used to store provenance information. Read more in Software Heritage: Why and How to Preserve Software Source Code.
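The layering described above can be pictured with a few plain data classes. This is only an illustrative sketch: the class names follow the canonical names from the text, but every field name is an assumption and the real model carries more information (identifiers, authors, dates and so on).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Content:  # a file's bytes
    data: bytes


@dataclass(frozen=True)
class Directory:  # named entries pointing to contents or sub-directories
    entries: tuple[tuple[str, Content | Directory], ...]


@dataclass(frozen=True)
class Revision:  # a commit: a root directory plus its history
    directory: Directory
    message: str
    parents: tuple[Revision, ...] = ()


@dataclass(frozen=True)
class Release:  # a named tag on a revision
    name: str
    target: Revision


@dataclass(frozen=True)
class Snapshot:  # the state of all branches seen during one visit
    branches: tuple[tuple[str, Revision | Release], ...]


@dataclass(frozen=True)
class Origin:  # where the code was found
    url: str


@dataclass(frozen=True)
class Visit:  # provenance: which origin, when, and what was observed
    origin: Origin
    date: str
    snapshot: Snapshot


readme = Content(b"hello\n")
root = Directory((("README", readme),))
first = Revision(root, "Initial commit")
v1 = Release("v1.0", first)
snapshot = Snapshot((("refs/heads/main", first), ("refs/tags/v1.0", v1)))
visit = Visit(Origin("https://example.org/user/project.git"), "2024-01-01", snapshot)
print(visit.origin.url, len(visit.snapshot.branches))
```

Reading the example from the bottom up follows the provenance chain: a visit of an origin records a snapshot, whose branches point to revisions and releases, which in turn point to directories and contents.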
Mirrors
SWH mirrors are full, synchronized copies of the Software Heritage universal source code archive, operated independently from the Software Heritage initiative. Mirrors improve software availability, reduce the risk of data loss due to uncontrolled events and ultimately ensure unfettered access to software source code for all.
Metadata
SWH collects and extracts metadata that describes and provides additional information on source code.
- Extrinsic metadata are metadata which aren’t found in the software source code.
- Intrinsic metadata are metadata included in the source code, in a specific file or as part of a source code file.
Indexing
The swh-indexer module is in charge of processing source code files to extract the following information:
- mimetype
- ctags
- language
- fossology-license (detecting the license of a file)
- Intrinsic descriptive metadata which can be found in metadata files in the source code (e.g. package.json, codemeta.json, pom.xml)
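To give an idea of what extracting intrinsic descriptive metadata involves, the sketch below reads a few common fields from the text of a package.json file. It is not the swh-indexer implementation: the function name and the selection of fields are assumptions made for illustration.

```python
import json


def extract_package_json_metadata(text: str) -> dict:
    """Pick a few descriptive fields out of a package.json document."""
    manifest = json.loads(text)
    wanted = ("name", "version", "description", "license")
    # Keep only the fields that are actually present in the file.
    return {key: manifest[key] for key in wanted if key in manifest}


# Hypothetical manifest, used only as sample input.
sample = '{"name": "demo", "version": "1.2.0", "license": "MIT", "main": "index.js"}'
print(extract_package_json_metadata(sample))
# {'name': 'demo', 'version': '1.2.0', 'license': 'MIT'}
```

The same pattern applies to other metadata files such as codemeta.json or pom.xml, each with its own parser and field mapping.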