
Features

The Software Heritage features

Sharing the source code commons with a set of trustworthy and open services, providing access to the largest source code library in the world.

Browse & Search

The SWH archive is the gateway to all captured source code and its entire development history. The browsable platform shows every visit made to a given code location (collected from different forges, package managers and distros) and lets you read the source code content captured at each visit.

SWHID provider & resolver

SWH provides a Persistent IDentifier (PID), called a SWHID, that identifies each and every source code artifact and guarantees its integrity. SWHIDs are intrinsic identifiers, intimately bound to the designated object: they do not need a registry, only agreement on a standard to resolve them.
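As an illustration of what "intrinsic" means here, the sketch below computes the SWHID of a file content from its bytes alone. It assumes that content identifiers (object type cnt) reuse the Git blob hashing scheme, which is our understanding of the SWHID specification; the function name content_swhid is hypothetical and not part of any SWH library.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content, derived only from its bytes."""
    # Assumption: same construction as a Git blob, i.e. SHA-1 over the
    # header "blob <size>\0" followed by the raw content.
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    # The empty file has the same hash as Git's well-known empty blob.
    print(content_swhid(b""))
    # prints swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier is recomputed from the object itself, anyone holding the bytes can check that they match the SWHID without consulting a central registry.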

The SWHID can also be used as a badge.

For more information

Go to the resolver API endpoint

Download

The Vault is the service in charge of reconstructing parts of the archive as self-contained bundles that can then be imported locally, for instance into a Git repository. With the Vault, users can download directories and revisions on the web platform or through the API.

For more information

Go to the download directory API endpoint

Save Code Now

It will take some time to get to every repository in the world, especially if these repositories keep changing several times a day. This is why the “Save Code Now” service is provided: it lets users notify SWH with a save request.
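A save request can be prepared programmatically. The sketch below only builds the HTTP request and does not send it. The endpoint path origin/save/<visit_type>/url/<origin_url>/ and the use of POST are assumptions based on our reading of the public API documentation; the helper name build_save_request and the example origin are hypothetical.

```python
from urllib.request import Request

API_ROOT = "https://archive.softwareheritage.org/api/1/"


def build_save_request(origin_url: str, visit_type: str = "git") -> Request:
    """Prepare (but do not send) a Save Code Now request for an origin."""
    # Assumed endpoint layout: origin/save/<visit_type>/url/<origin_url>/
    url = f"{API_ROOT}origin/save/{visit_type}/url/{origin_url}/"
    return Request(url, method="POST")


if __name__ == "__main__":
    req = build_save_request("https://example.org/project.git")
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would submit the request; check the API endpoint documentation for the supported visit types and rate limits before doing so.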

Go to API endpoint

Deposit

The deposit feature is a SWORD 2.0 server implementation. SWORD (Simple Web-Service Offering Repository Deposit) is an interoperability standard for digital file deposit. The deposit allows a client (a repository, e.g. HAL) to submit software source archives and their associated metadata to the SWH archive. Metadata can also be submitted on its own, referencing a repository URL (origin) or a SWHID.

For more information

Add Forge Now

“Add forge now” provides a service for Software Heritage users to save a complete forge in the Software Heritage archive by requesting the addition of the forge URL to the list of regularly visited forges.

The process follows a validation workflow, including curation, and verification that the forge technology is supported by Software Heritage tools.

Crawling

The SWH archive harvests source code from different sources and converts all the source code into a single and universal data structure, an enormous Merkle directed acyclic graph [Merkle, 1987], which is a classical cryptographic construction, combining a tree and a hash function.
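The combination of a tree and a hash function can be illustrated with a toy example. This is a minimal sketch, not the archive's actual hashing scheme: it uses SHA-256 and an ad hoc serialization, and the names leaf_id and node_id are hypothetical.

```python
import hashlib
from typing import Mapping


def leaf_id(data: bytes) -> str:
    """Identifier of a leaf (a file content): hash of its bytes."""
    return hashlib.sha256(b"leaf:" + data).hexdigest()


def node_id(children: Mapping[str, str]) -> str:
    """Identifier of an inner node (a directory): hash of its named children."""
    h = hashlib.sha256(b"node:")
    # Sort entries so the identifier does not depend on insertion order.
    for name in sorted(children):
        h.update(name.encode() + b"\0" + children[name].encode() + b"\n")
    return h.hexdigest()


readme = leaf_id(b"hello\n")
main_v1 = leaf_id(b"print('v1')\n")
main_v2 = leaf_id(b"print('v2')\n")

# Two versions of a directory share the unchanged README node.
tree_v1 = node_id({"README": readme, "main.py": main_v1})
tree_v2 = node_id({"README": readme, "main.py": main_v2})

if __name__ == "__main__":
    print(tree_v1 != tree_v2)  # changing one leaf changes the parent's identifier
```

Since a node's identifier depends on the identifiers of everything below it, identical files and directories found in different repositories collapse into a single node, which is what makes the structure a graph rather than a forest of separate trees.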

Crawling is separated into three phases: listing software sources, scheduling updates, and collecting the software artifacts into the archive.

Behind the scenes

Archiving all the source code is a daunting task and there are different mechanisms put in place to ensure the preservation of source code from different types of origins.

API

API access is over HTTPS. All API endpoints are rooted at https://archive.softwareheritage.org/api/1/ and the data is sent and received as JSON by default.

You can jump directly to the endpoint index, which lists all available API functionalities, or read on for more general information about the API.
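A minimal client needs little more than the API root and a JSON decoder. The sketch below uses only the Python standard library; the helper names api_url and get_json are hypothetical, and the resolve/<swhid>/ path used in the example is an assumption to be checked against the endpoint index.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://archive.softwareheritage.org/api/1/"


def api_url(*parts: str) -> str:
    """Join path components under the API root, with a trailing slash."""
    return API_ROOT + "/".join(p.strip("/") for p in parts) + "/"


def get_json(url: str, timeout: float = 30.0):
    """Fetch a URL over HTTPS and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Assumed endpoint: resolve a SWHID to its location in the archive.
    swhid = "swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
    print(api_url("resolve", swhid))
```

Calling get_json on the resulting URL performs the actual network request; it is left out of the example so the sketch runs offline.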

For more information

Architecture

Archiving a repository from a forge isn’t the same action as archiving source code from a package manager. It becomes even harder when you realize that version control systems have evolved a lot over the last decades. The SWH architecture was designed to harmonize different sources into a robust infrastructure.

Data model

The data model adopted by Software Heritage to represent the information that it collects is centered around the notion of software artifact, using the following canonical names, from bottom to top: contents, directories, revisions and releases. Origins, visits and snapshots are also used to store provenance information. Read more in Software Heritage: Why and How to Preserve Software Source Code.
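The relationships between these objects can be sketched with a few record types. This is an illustrative simplification, not the schema used by the archive: every class and field name below is hypothetical, and identifiers are plain strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Content:            # a file's bytes, identified by a hash
    id: str


@dataclass
class Directory:          # maps entry names to content or directory ids
    id: str
    entries: Dict[str, str] = field(default_factory=dict)


@dataclass
class Revision:           # a commit: one root directory plus parent revisions
    id: str
    directory: str
    parents: List[str] = field(default_factory=list)


@dataclass
class Release:            # a named tag pointing at a revision
    id: str
    name: str
    target: str


@dataclass
class Snapshot:           # the state of all branches seen during one visit
    id: str
    branches: Dict[str, str] = field(default_factory=dict)


@dataclass
class Visit:              # provenance: when an origin was crawled, and what was found
    origin_url: str
    date: str
    snapshot: str


content = Content("c1")
root = Directory("d1", {"README": content.id})
rev = Revision("r1", directory=root.id)
tag = Release("rel1", name="v1.0", target=rev.id)
snap = Snapshot("s1", {"refs/heads/main": rev.id, "refs/tags/v1.0": tag.id})
visit = Visit("https://example.org/project.git", "2024-01-01", snap.id)
```

Reading the example from the bottom up follows the provenance chain: a visit of an origin yields a snapshot, whose branches point at releases and revisions, which in turn point at directories and contents.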

Mirrors

SWH mirrors are full, synchronized copies of the Software Heritage universal source code archive, operated independently from the Software Heritage initiative. Mirrors will improve software availability, prevent information loss due to uncontrolled events, and ultimately ensure unfettered access to software source code for all.

For more information

Metadata

SWH collects and extracts metadata that describes and provides additional information on source code.

  • Extrinsic metadata are metadata which aren’t found in the software source code.
  • Intrinsic metadata are metadata included in the source code, in a specific file or as part of a source code file.

The metadata indexer docs

blog post

Indexing

The swh-indexer module is in charge of processing source code files to extract the following information:

  • mimetype
  • ctags
  • language
  • fossology-license (detecting the license of a file)
  • intrinsic descriptive metadata, found in metadata files in the source code (e.g. package.json, codemeta.json, pom.xml)
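To make the list above concrete, here is a toy indexer that records a MIME type guess and flags the intrinsic metadata files named in the last item. It is only a sketch under simplifying assumptions: it works from file names alone, whereas a real indexer would inspect file contents, and the function name index_path is hypothetical and unrelated to the swh-indexer API.

```python
import mimetypes
from pathlib import PurePosixPath

# File names taken from the examples above; a real list would be longer.
METADATA_FILES = {"package.json", "codemeta.json", "pom.xml"}


def index_path(path: str) -> dict:
    """Return a small index record for one file path."""
    name = PurePosixPath(path).name
    # Name-based guess only; may be None for unknown extensions.
    mimetype, _encoding = mimetypes.guess_type(name)
    return {
        "path": path,
        "mimetype": mimetype,
        "is_metadata_file": name in METADATA_FILES,
    }


if __name__ == "__main__":
    for p in ["src/app/package.json", "src/main.c", "codemeta.json"]:
        print(index_path(p))
```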