March 25, 2025

Why software development history is worth preserving

By Alex Khrustalev, Software Heritage Ambassador

We know a lot about ancient Mesopotamia. Some events can be dated precisely to the year despite having occurred thousands of years ago. Why? The Sumerians, the main inhabitants of the area, wrote a lot, and luckily for us, they wrote on clay tablets. Clay is a very durable medium, much more durable than papyrus. While papyrus quickly turns to ashes in a fire, clay tablets become even more resilient when exposed to heat and can survive for thousands of years.

Of course, we no longer use clay tablets; they’re not exactly practical. But we can learn a lesson from the Sumerians: preserving information is just as crucial as producing it. The way we store data matters. The medium has evolved: we are moving away from paper and other analog formats toward digital storage. This shift reduces some of the risks attached to earlier forms of storage, but it also presents a new set of challenges. Fire is no longer such a significant risk, because copying digital media is far cheaper and easier than copying a book. But there is another risk: what if someone tampers with the data? How do we ensure its integrity?

The mission of Software Heritage is to solve these kinds of problems related to preserving software source code. If you’re not a developer, you might not be familiar with how software is created, stored, and shared. Let me break it down for you.

Software source code consists of files, organized in folders, written according to the rules of the programming language used. Software developers modify the source code by creating new files, updating existing content, reorganizing folders, or deleting unnecessary files. That means source code is always evolving; it’s never in a static state.

Each software developer working on a project has a local copy of the entire source code and makes changes independently. Here’s the problem: how do you combine all of the local changes from different developers into a final version? For that, a special kind of software was created, called a version control system (VCS). When source code is placed in a VCS, it becomes a repository. In modern VCS tools such as Git (the most widely used) there’s usually a main repository (often called “upstream”) that serves as the central source of truth. A local copy of the same repository is called “downstream.” The upstream repository is often hosted on a platform such as GitHub or GitLab. When a developer makes a change to a file or a set of files, they record this change with the VCS, a process known as making a commit. Then the commit is pushed from downstream to upstream, making the changes available to other developers.

Developers find this incredibly convenient for reviewing changes over time. By examining a commit I can understand which changes were made, and most importantly, who made them, so I can reach out to the author if I have any questions or need clarification about the changes.

This is all cool, but it works only within a specific repository. What happens if that repository is deleted? This is a pretty common problem. The most notable incident, which shook the industry, was the removal of the “left-pad” package from npm. Although npm Inc. quickly restored the package, its removal caused significant disruptions, affecting large companies like Facebook and Netflix.

Software Heritage is developing a system to address this. Going back to the concept of a commit, each commit has a unique identifier that developers can use to reference a specific point in a repository’s history. Software Heritage has its own unique identifier, called the Software Heritage ID (SWHID), and it goes beyond commits: SWHIDs can identify a wide variety of software artifacts, including files, directories, revisions (aka commits), and more. Unlike commit identifiers in traditional version control systems, SWHIDs are not tied to a specific repository. Because the archive collects repositories from multiple sources (GitHub, GitLab, and npm, to name a few), it can offer a unified archive with persistent identifiers across different platforms. It’s possible to adapt any kind of software source, even if it’s not currently stored in a VCS. To see the full list of origins, go to the archive.

Here’s an explanation of how Software Heritage differs from something like Git.

SWHID has a wider scope. The first fundamental difference is that a Git commit hash identifies a specific commit in a single repository. In contrast, the SWHID identifies a wide range of software artifacts (not only commits) including files and directories.
These SWHID examples demonstrate its capabilities across various artifacts.

Directory:

swh:1:dir:717248067ccd951a4dd64d63353ad491fcb7b7eb – a SWHID that identifies a directory in the Git repository of the Elixir Ecto project, at path /lib/ecto/adapter/.

File Content:

swh:1:cnt:7590da26689a6269ef5a6dbe75dbecf6531c7d8f – a SWHID that identifies the content of a file in the Git repository of the Elixir Ecto project, at path /lib/ecto/adapter/storage.ex.

Commit (Revision):

swh:1:rev:6c612ca358b567242a13fee1fcc3fceb2edce6a6 – a SWHID that identifies a commit (revision) in the Git repository of the Elixir Ecto project, authored by José Valim on 26 February 2025 at 11:34:18 UTC, with the message “Update changeset.ex”.

There are other artifacts such as origins, projects, releases, snapshots, and visits.
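For readers who do write code, the three identifiers above share one shape: the prefix swh, a schema version, a three-letter object type, and a 40-character hexadecimal hash, all separated by colons. The Python sketch below splits a SWHID into those parts. It is only an illustration: the names parse_swhid, Swhid, and OBJECT_TYPES are made up for this example and do not come from any Software Heritage library, and the sketch deliberately ignores the optional qualifiers that can follow the core identifier.

```python
import re
from typing import NamedTuple

# Three-letter type tags that can appear in a core SWHID,
# mapped to a human-readable description (descriptions are ours).
OBJECT_TYPES = {
    "cnt": "file content",
    "dir": "directory",
    "rev": "revision (commit)",
    "rel": "release",
    "snp": "snapshot",
}

# swh:<schema version>:<object type>:<40 lowercase hex digits>
SWHID_PATTERN = re.compile(r"^swh:(\d+):([a-z]{3}):([0-9a-f]{40})$")


class Swhid(NamedTuple):
    """Hypothetical container for the parts of a core SWHID."""
    version: int
    object_type: str
    object_id: str


def parse_swhid(text: str) -> Swhid:
    """Split a core SWHID into its parts, or raise ValueError."""
    match = SWHID_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a core SWHID: {text!r}")
    version, object_type, object_id = match.groups()
    if object_type not in OBJECT_TYPES:
        raise ValueError(f"unknown object type: {object_type!r}")
    return Swhid(int(version), object_type, object_id)


if __name__ == "__main__":
    parsed = parse_swhid("swh:1:dir:717248067ccd951a4dd64d63353ad491fcb7b7eb")
    print(OBJECT_TYPES[parsed.object_type], parsed.object_id)
```

Nothing in the identifier mentions a hosting platform or a repository URL, which is what lets the same SWHID be used wherever the artifact appears.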

SWHID is platform agnostic. This makes it possible to crawl source code from different origins (e.g. GitHub, GitLab, Bitbucket) across many different VCSs (Git, Mercurial, Subversion, etc.), and it’s possible to adapt any kind of software source, even if it’s not currently stored in a VCS.
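One reason a SWHID can be platform agnostic is that it is derived from the artifact itself, not from the place where the artifact is hosted. For file contents, the hash is, to my understanding, the same one Git computes for a blob: a SHA-1 over a short header followed by the file’s bytes. The Python sketch below illustrates the idea under that assumption. The function names content_swhid and matches are invented for this example; directories, commits, and other artifact types are hashed differently and are not covered here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute a content SWHID, assuming it uses Git's blob hashing.

    Git hashes a file as SHA-1 of b"blob <length>\\0" followed by the bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def matches(data: bytes, swhid: str) -> bool:
    """Return True if the bytes hash to the given content SWHID."""
    return content_swhid(data) == swhid


if __name__ == "__main__":
    original = b"hello world\n"
    swhid = content_swhid(original)
    print(swhid)
    # Changing even one byte yields a different identifier.
    print(matches(original, swhid))
    print(matches(b"hello world!\n", swhid))
```

The same bytes always produce the same identifier, whichever platform they were downloaded from. It also speaks to the integrity question raised earlier: if someone tampers with a file, the file no longer matches its identifier.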

SWHID is designed to be persistent indefinitely. Even if the original repository disappears, the Software Heritage archive will keep its archived copy intact.

Hopefully, this article has given you insight into how development history is preserved, why it’s important, the unique role of Software Heritage and how it adds value to traditional VCS. By archiving source code on a global scale, Software Heritage stores knowledge of software development for current and future generations.

About the author

Alex Khrustalev is a full-stack developer with experience delivering small- to large-scale web applications across various domains. His main interests include functional programming, UI/UX development, and distributed systems. A huge open-source software enthusiast, he’s been using it for years at Prosapient. Although Khrustalev is adept at working with both server-side and client-side codebases, his true passion lies in crafting user interfaces and delving into the latest trends in web development. He has a blog at Hackernoon where he writes on software topics related to web development and programming in general.

You can book a free consultation with him or an ambassador in your field to learn more about Software Heritage by sending an email to: ambassadorprogramATsoftwareheritage.org
