Opening the archive, one API (and one FOSDEM) at a time
Today we open our public API, which allows anyone to programmatically browse the Software Heritage archive.
As we are posting this, Roberto and Stefano are keynoting at FOSDEM 2017, discussing the role Software Heritage plays in preserving the Free Software commons. To accompany the talk, today we release our first public API, which allows you to navigate the entire content of the Software Heritage archive as a graph of connected development objects (e.g., blobs, directories, commits, releases, etc.).
Over the past months we have been busy working on getting source code (with full development history) into the archive, to minimize the risk that important bits of Free/Open Source Software that are publicly available today disappear forever from the net, for whatever reason — crashes, black hat hacking, business decisions, you name it. As a result, our archive is already one of the largest collections of source code in existence, spanning a GitHub mirror, injections of important Free Software collections such as Debian and GNU, and an ongoing import of all Google Code and Gitorious repositories.
Up to now, however, the archive was deposit-only: there was no way for the public to access its content. While there is a lot of value in archival per se, our mission is to Collect, Preserve, and Share all this material with everybody. Plus, we totally get that a deposit-only library is much less exciting than a store-and-retrieve one! Today we are taking a first important step towards providing full access to the content of our archive: we release version 1 of our public API, which allows you to navigate the Software Heritage archive programmatically.
You can have a look at the API documentation for full details about how it works. But to briefly recap: conceptually, our archive is a giant Merkle DAG connecting together all development-related objects we encounter while crawling public VCS repositories, source code releases, and GNU/Linux distribution packages. Examples of the objects we store include file contents, directories, commits, and releases, as well as their metadata: log messages, author information, permission bits, etc.
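To make this concrete, here is a minimal sketch in Python (using the requests library) that fetches a single commit object and prints the graph edges it carries. The revision identifier is a placeholder, and the endpoint path and field names are taken from the API documentation; check it for the authoritative details.

```python
import requests

API = "https://archive.softwareheritage.org/api/1"

# Placeholder: the SHA1 identifier of a commit you are interested in.
revision_id = "aafb16d69fd30ff58afdd69036a26047f3aebdc6"

# Fetch the revision and print the graph edges it carries: the root
# directory it points to, and its parent revisions.
resp = requests.get(f"{API}/revision/{revision_id}/")
resp.raise_for_status()
rev = resp.json()

print("message:", rev["message"])
print("directory:", rev["directory"])                 # edge to a directory object
print("parents:", [p["id"] for p in rev["parents"]])  # edges to parent commits
```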
The API we open today allows you to navigate this huge graph pointwise. Using the API you can look up individual objects by their IDs, retrieve their metadata, and jump from one object to another following links — e.g., from a commit to the corresponding directory or parent commits, from a release to the annotated commit, etc. Additionally, you can retrieve crawling-related information, such as the software origins we track (usually as VCS clone/checkout URLs), and the full list of visits we have made to any known software origin. This lets you know, for instance, when we took snapshots of a Git repository you care about and, for each visit, where each branch of the repo was pointing at that time.
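As a sketch of what this pointwise navigation looks like, the snippet below jumps from a revision to the directory it points to, then lists the visits of a software origin. Both identifiers are placeholders, and the exact shape of the origin and visit endpoints is an assumption to be verified against the API documentation.

```python
import requests

API = "https://archive.softwareheritage.org/api/1"

def get(path):
    """Fetch one API endpoint and decode its JSON body."""
    resp = requests.get(API + path)
    resp.raise_for_status()
    return resp.json()

# Placeholder: a revision (commit) identifier.
revision_id = "aafb16d69fd30ff58afdd69036a26047f3aebdc6"

# Follow the link from the revision to its root directory,
# then list the directory entries.
rev = get(f"/revision/{revision_id}/")
for entry in get(f"/directory/{rev['directory']}/"):
    print(entry["type"], entry["name"])

# Placeholder: a numeric origin identifier, as returned by the
# origin lookup endpoints. List all our visits to that origin.
origin_id = 1
for visit in get(f"/origin/{origin_id}/visits/"):
    print(visit["date"], visit["status"])
```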
Our resources for offering the API as a public service are still quite limited, which is why you will encounter a couple of limitations. First, no download of the actual content of files we have stored is possible yet — you can retrieve all content-related metadata (e.g., checksums, detected file types and languages, etc.), but not the actual content as a byte sequence. Second, some pretty severe rate limits apply: API access is entirely anonymous, users are identified by their IP address, and each “user” will be able to make a little more than 100 requests/hour. This is to keep our infrastructure sane while we grow in capacity and focus our attention on developing other archive features.
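If you script against the API, it is worth watching the rate-limit information in the responses and backing off when your quota runs out. The sketch below assumes the conventional X-RateLimit-* response headers; check the headers your client actually receives.

```python
import time
import requests

resp = requests.get("https://archive.softwareheritage.org/api/1/origin/1/visits/")

# Assumption: quota usage is advertised via conventional X-RateLimit-*
# response headers; fall back to safe defaults if they are absent.
remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
reset_at = int(resp.headers.get("X-RateLimit-Reset", "0"))

if remaining == 0:
    # Quota exhausted: sleep until the advertised reset time
    # (a Unix timestamp), then retry.
    time.sleep(max(0, reset_at - time.time()))
```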
If you’re interested in having rate limits lifted for a specific use case or experiment, please contact us and we will see what we can do to help.
If you’d like to help increase our resource pool, have a look at our sponsorship program!