What’s next in research for Software Heritage

After nearly a decade of Software Heritage, changes are afoot. Co-founder Stefano Zacchiroli is shifting focus to become the Chief Scientific Officer (CSO), while Thomas Aynaud joins the team as the new Chief Technical Officer, taking over Zacchiroli’s previous responsibilities.
The move to CSO allows Zacchiroli to focus on his research interests: digital commons, open-source software engineering, computer security, and the software supply chain. He’s a full professor of computer science at Télécom Paris, Polytechnic Institute of Paris. A Debian developer since 2001, he served as Debian project leader from 2010 to 2013.
In this interview, he talks about getting back to research full-time, how Software Heritage helps make open source more secure for everyone, and why keeping the hobbyist ethos alive is important.
You’re the co-founder of Software Heritage. What was the pivot point for changing roles?
In the beginning, we were a fairly small team, so we had to distribute the roles among ourselves. I was technically inclined, since I’d been doing free software technical work for decades, so I picked up the CTO role. I was happy to lay the technical foundations of Software Heritage, which are still fundamental for the archive today. But my real life is as a researcher – I’m a computer science professor, and I’d been doing research for most of my career. Over the past 10 years, all my research has been built upon what Software Heritage enables: very large-scale empirical analyses of the software commons. Eventually, my vocation pulled me in that direction more and more, and I was delighted to find someone to step into the CTO role so I can go back to full-time research work.
What’s a Chief Scientific Officer (CSO)?
Doing research and enabling research has been part of the mission of Software Heritage from the very beginning. We have a lot of research work conducted by team members themselves (like me), members of the technical team helping researchers conduct their work, as well as outside researchers from universities and research labs mining the Software Heritage archive. The CSO role is about coordinating all of that. It doesn’t mean running every possible research project: we want people to be able to do research on Software Heritage on their own. But it does mean keeping an eye on how we interact with researchers, to make their work easier.
“Research used to be a side activity of Software Heritage, not part of our main strategy. Now, it’s a key focus and we’re making it visible.”
What are you most excited about in your new role?
There’s a lot of research going on around Software Heritage, from all kinds of angles. I’m the principal investigator for many of these projects, especially around security use cases. On that front, the question is: how can we effectively leverage all the knowledge collected by Software Heritage, and use it so that free software developers can create more secure software for all of us? The same general principle of efficiently leveraging Software Heritage knowledge to enable the public good can be applied to other fields. I’m actively pursuing two: first, how Software Heritage can enable reproducible research when it comes to software, and second, how it can enable build reproducibility of open-source software. The latter lets people using open-source software check where it comes from and have a solid way to trust the link between the code and what’s running on their devices.
Finally, another angle I’ve been working on extensively is the human aspect of software engineering, specifically the global collaboration involved in creating open-source software. We’ve previously studied diversity in terms of gender and geographic origin. Software Heritage offers a unique perspective on how these trends evolve worldwide.
Software Heritage recently held a partner kickoff for CodeCommons. What’s your involvement in that project?
So as you can imagine, a lot of people have been interested in trying to use the Software Heritage archive for large language models (LLMs) for code. We’re not providers of models, but we want to understand our role in helping create ethical datasets, where ethical means, first and foremost, keeping track of where the code comes from when it is used to train LLMs. We want people to be able to know if their code has been used in LLM training. We want to be able to produce open datasets for that, and we want to provide all the relevant information that enables LLM producers to respect the license of the code used in the model and have information about the provenance of the code that has been used.
What are you most looking forward to leaving behind?
The technical work to maintain and develop the Software Heritage archive has really changed in scale over time. We have a larger code database, of course, because that’s ever-growing, and we have a larger team. We have about a dozen engineers now. We have changed technologies, so we’re using more large-scale technologies than we were using in the very beginning. I’m very glad to hand that off to the capable hands of Thomas Aynaud, who’s far more skilled than I am on that front and is taking over the leadership of that team, too.
How will this transition impact the scientific direction of SWH, and what changes can we expect in the short and long term?
Software Heritage as an initiative is primarily an enabler of many use cases, and large-scale research on the software commons is one of them. I don’t expect the focus of Software Heritage to change as part of my role change. What will change is that Software Heritage will support more independent research teams doing this kind of large-scale research. I expect more collaborations to materialize and more researchers to use Software Heritage data, leading to more research results, both directly and indirectly.
Do you see your contributions changing much, or is it a question of quantity or focus?
No, it’s more about peace of mind. I’ve been primarily doing research for years. This is simply aligning my title with my actual work, not changing what I’m doing. Research used to be a side activity of Software Heritage, not part of our main strategy. Now, it’s a key focus and we’re making it visible.
Follow-up question from the recent Symposium: You mentioned that SWHSec empowers anyone, not just large corporations, to contribute to software security.
SWHSec is a large and diverse research project, with many teams looking at how Software Heritage can help secure open source for everyone. My presentation was about one specific R&D task: figuring out if a specific software version in the archive is vulnerable. This isn’t a new problem, and there are industry solutions for it. But Software Heritage has something unique: deep source code visibility. We can track vulnerabilities down to individual commits and see if even a random fork on a little-known development forge is affected by a specific vulnerability or not. This is the kind of hidden problem that industry usually ignores. For example, if you forked a project two years ago when it was vulnerable, and never merged back a fix, we can tell you it’s still vulnerable. Sure, there’s not a lot of money in this, but we can give developers data and tools to fix these issues. We’ve found real vulnerabilities that maintainers didn’t know about, and helped fix them. Since we can do this, it’s part of our social responsibility to do it.
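To make the fork example concrete, here is a minimal sketch of the underlying idea, not a description of how SWHSec actually works. It assumes a toy commit graph stored as a dictionary of parent links, and it assumes we already know which commit introduced a vulnerability and which commit fixed it. The names `ancestors`, `is_vulnerable`, and `PARENTS` are hypothetical, chosen only for illustration. A revision is flagged when its history contains the introducing commit but not the fixing commit.

```python
from collections import deque


def ancestors(commit, parents):
    """Return `commit` plus every commit reachable through parent links."""
    seen = {commit}
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def is_vulnerable(head, parents, introduced_by, fixed_by):
    """Flag `head` if its history has the bug but not the fix (illustrative only)."""
    history = ancestors(head, parents)
    return introduced_by in history and fixed_by not in history


# Hypothetical toy history (each commit maps to its parents).
# Upstream: a -> b (introduces the bug) -> c -> d (fix) -> e
# Fork:     branches off at c, then adds f and g, never merging d.
PARENTS = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "d": ["c"],
    "e": ["d"],
    "f": ["c"],
    "g": ["f"],
}

print(is_vulnerable("g", PARENTS, introduced_by="b", fixed_by="d"))  # fork head
print(is_vulnerable("e", PARENTS, introduced_by="b", fixed_by="d"))  # upstream head
```

In this toy graph the fork head `g` is flagged and the upstream head `e` is not. The sketch deliberately ignores cases a real analysis would need to handle, such as fixes that are cherry-picked under a different commit identifier or later reverted.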
How do you see Software Heritage evolving in terms of its impact on open-source sustainability?
Open-source sustainability is a huge topic, and people are trying all sorts of things to improve the status quo. A common approach is to convince maintainers to become independent businesses, so they get paid for their work. Another is to look at how civil servants build software for governments. Software Heritage helps by showing all the ways people contribute to keeping open source going, and how funding works. For instance, we’ve recently used Software Heritage to show that researchers are really important for maintaining open-source data and machine learning software, which is super important these days. The role of civil servants and researchers is often overlooked in sustainability discussions, but you can see it in our data – by showing who’s really contributing to important software, we can help shape how funding is used. We can also spot software that’s important but isn’t getting enough attention – it needs more maintenance, more contributions, or more funding – so it’s at risk. We’re the only ones who can do this on such a large scale, and see all the different kinds of contributions you can’t see just by looking at GitHub.
You’ve been a Debian developer for a long time. How has your perspective on open-source development changed over the years?
Well, back in the early days, it was all volunteers, right? People were doing this just for fun or to “scratch an itch,” as the saying goes. Free software made collaboration possible, and then the industry took notice. Now there’s a lot of intermingling between paid and volunteer contributions. There are very relevant projects like the Linux kernel that are almost entirely maintained by paid contributions. But I think it’s important to keep the door open for hobbyists, both volunteer maintainers and drive-by contributors, to participate. We’re risking a strict separation between commercial open source, paid for by large companies, and hobbyist open source, which gives everyone the ability to contribute to and understand the code they run.