Preserving source code archive files
As announced here before, we have partnered with funders around the world to provide grants for experts who are willing to engage with the long-term mission of Software Heritage.
Today we are delighted to share that one more subgrant has been awarded! Thanks to the Alfred P. Sloan Foundation, Timothy Sample will be working to enable the preservation of source code archive files (e.g., “.tar.gz” files) in the Software Heritage archive.
What is an “archive file”?
Software source code is often distributed in archive files, in formats such as “.zip”, “.tar”, “.tar.gz”, and the like, that pack several files and directories into a single file, often called a container. This is a handy way to distribute a given version of a software project; it was widely used before the advent of version control systems and is still quite popular. Software developers usually sign entire archive files, and downstream software distributors usually verify software using a checksum computed over entire archive files.
Most software packaging tools that verify external sources use the entire source code archive file for verification. This includes Guix, Nix, Spack, Gentoo, Arch, MacPorts, Opam, and many more.
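To make the verification step concrete, here is a minimal sketch in Python of a whole-file checksum check. The function name and the stand-in bytes are hypothetical, and SHA-256 is only an assumption for illustration; each packaging tool has its own hash formats and conventions.

```python
import hashlib


def verify_source(archive_bytes, expected_sha256):
    """Return True if the whole archive file matches the pinned checksum."""
    actual = hashlib.sha256(archive_bytes).hexdigest()
    return actual == expected_sha256


# Stand-in for the bytes of a downloaded archive (hypothetical data).
archive_bytes = b"stand-in for the bytes of a source archive"

# The packager records this value once, in the package definition.
pinned = hashlib.sha256(archive_bytes).hexdigest()

ok = verify_source(archive_bytes, pinned)
# Any change to the container, even a single byte, makes the check fail.
tampered_ok = verify_source(archive_bytes + b"\x00", pinned)
print(ok, tampered_ok)
```

The important point is that the checksum covers the container as a sequence of bytes, not the files inside it.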
How does Software Heritage handle archive files today?
When Software Heritage encounters such a container, it unpacks it and archives its content, but not the container itself. There are two reasons for this: storing the entire archive file greatly reduces the effectiveness of file-level deduplication, and the same source code can be distributed in a variety of different container formats.
This approach prevents Software Heritage from reproducing the exact container file that packaging tools expect when they verify the integrity of the source code.
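The mismatch can be illustrated with a short Python sketch. It packs the same two files into two “.tar.gz” containers that differ only in member order and timestamps, then compares per-file content hashes with whole-file hashes. The helper names and sample files are hypothetical, and this is not how Software Heritage computes its identifiers.

```python
import gzip
import hashlib
import io
import tarfile


def make_tar_gz(files, order, mtime):
    """Pack `files` (name -> bytes) into an in-memory .tar.gz in the given order."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in order:
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(files[name]))
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=out, mode="wb", mtime=0) as gz:
        gz.write(raw.getvalue())
    return out.getvalue()


def content_hashes(archive):
    """Return {member name: SHA-256 of its content}, ignoring container details."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        return {
            m.name: hashlib.sha256(tar.extractfile(m).read()).hexdigest()
            for m in tar.getmembers()
            if m.isfile()
        }


files = {"hello.c": b"int main(void) { return 0; }\n", "README": b"demo\n"}
a = make_tar_gz(files, ["README", "hello.c"], mtime=0)
b = make_tar_gz(files, ["hello.c", "README"], mtime=1_000_000)

same_contents = content_hashes(a) == content_hashes(b)
same_container = hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest()
print(same_contents, same_container)
```

An archive that keeps only the unpacked files can confirm the first comparison, but it has no way to regenerate either container byte for byte, so the checksum pinned by a packaging tool cannot be matched.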
The way forward
There are two ways to improve this state of affairs. On the one hand, we can advocate for the adoption of SWHID by packaging tools, but that will take quite some time. On the other hand, we can find a way to reproduce the exact container file from the Software Heritage archive without storing it in full. This will enable Software Heritage to provide artifacts that can be verified easily using current, widely deployed mechanisms, and will also make it easy to check how well the archive covers these collections (by checking the container hashes directly).
Ideally, we would be able to supplement the unpacked content by storing, separately, whatever information is required to reproduce the original archive file.
Timothy Sample has developed a tool called Disarchive to do exactly that!
Disarchive examines the archive file and captures information such as file order and compression parameters, so that the original archive file can be reproduced exactly from its contents. Combined with deduplication, this technique provides a space-efficient means to preserve entire source code archive files.
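The general idea can be sketched in a few lines of Python. This is an illustration under strong simplifying assumptions, not Disarchive's actual format or interface: the record layout and the `pack` and `describe` helpers are hypothetical, only a few tar header fields are tracked, and the gzip level is passed in rather than inferred from the archive, which is the hard part a real tool has to solve.

```python
import gzip
import hashlib
import io
import tarfile


def pack(members, contents, gzip_level):
    """Build a .tar.gz from recorded member metadata plus a content store."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for meta in members:  # list order is the recorded file order
            info = tarfile.TarInfo(meta["name"])
            info.mode, info.mtime = meta["mode"], meta["mtime"]
            data = contents[meta["sha256"]]
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=out, mode="wb", compresslevel=gzip_level, mtime=0) as gz:
        gz.write(raw.getvalue())
    return out.getvalue()


def describe(archive, gzip_level):
    """Split an archive into a small metadata record and a content-addressed store."""
    members, contents = [], {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for m in tar.getmembers():
            data = tar.extractfile(m).read()
            digest = hashlib.sha256(data).hexdigest()
            contents[digest] = data  # identical files are stored only once
            members.append(
                {"name": m.name, "mode": m.mode, "mtime": m.mtime, "sha256": digest}
            )
    record = {
        "members": members,
        "gzip_level": gzip_level,
        "archive_sha256": hashlib.sha256(archive).hexdigest(),
    }
    return record, contents


# Build a sample "upstream" archive (hypothetical files and metadata).
source = {"README": b"demo\n", "src/main.c": b"int main(void) { return 0; }\n"}
store = {hashlib.sha256(d).hexdigest(): d for d in source.values()}
original_members = [
    {"name": n, "mode": 0o644, "mtime": 1_600_000_000,
     "sha256": hashlib.sha256(d).hexdigest()}
    for n, d in source.items()
]
original = pack(original_members, store, gzip_level=6)

# Keep only the record and the deduplicated contents, then rebuild on demand.
record, contents = describe(original, gzip_level=6)
rebuilt = pack(record["members"], contents, record["gzip_level"])
reproduced = hashlib.sha256(rebuilt).hexdigest() == record["archive_sha256"]
print(reproduced)
```

The record is small compared with the archive, and the file contents live in a store keyed by hash, so two archives that share files share storage while each container can still be rebuilt and checked against its original checksum.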
This grant will allow Timothy to integrate Disarchive into Software Heritage, so that Software Heritage can reproduce entire source code archive files on demand.
You too can contribute to Software Heritage’s mission! If you want to get involved, just fill out this simple form to start the process!