Reproducible builds: a status update

Ludovic Courtès — October 31, 2017

With the yearly Reproducible Build Summit starting today, now’s a good time for an update on what has happened in Guix land in that area!

Isolated build environments are very helpful to achieve reproducible builds, but they are not sufficient: timestamps and non-determinism can still make a package build non-reproducible. Developers can rely on guix build --check and guix challenge to identify non-reproducible builds.

This article provides an overview of the progress made to fix non-reproducibility issues in packages over the year, and then goes on to show a very concrete way for Guix to take advantage of reproducible builds.

Building reproducibly

Tools that produce build artifacts occupy a key role: if their output is non-reproducible, then lots of packages that use them will be non-reproducible as a result. Among those packages, we fixed:

  • GNU R (timestamps in .rds files and man pages; random temporary file names recorded in generated files);
  • GNU Guile (order-sensitive symbol generation during macro expansion);
  • Ghostscript (timestamps and UUIDs in generated PDF files);
  • GNU groff (timestamps in generated files);
  • gdk-pixbuf (unsorted directory entries ending up in generated cache files);
  • Perl build system (perllocal.pod files were produced in a non-deterministic way).

Sometimes we think that an issue is rare and we embark on a trip to fix individual packages that are affected… until we realize that it’s common enough to deserve a global, once-and-for-all fix. This is what happened with timestamps in gzip headers: after blissfully assuming that “almost everyone” uses the -n flag of gzip, we finally introduced a build phase to automatically strip timestamps for gzip headers—this is a subset of what Debian’s strip-nondeterminism achieves, but hey, Scheme integration matters to us!

There’s a number of well-identified issues left to be addressed: Python bytecode, GTK+ icon them caches, TeX Live, and more. Often, the issue database initiated by Debian is a great resource to find about issues and fixes.

And the result is…

We recently gained a new build farm called berlin.guixsd.org, which is slated to replace our existing build farm at mirror.hydra.gnu.org. Having set it up as an independent build farm—berlin does not download binaries from hydra—we can challenge build reproducibility by comparing the binaries produced on each of these build farms. Comparing the results of two independent build farms, with different hardware and kernel versions, maximizes the chances to catch all sorts of non-reproducibility issues. The result with today’s master is… drum rolls

$ guix challenge $(guix package -A | cut -f1) \
    --substitute-urls="https://mirror.hydra.gnu.org https://berlin.guixsd.org"

…

6,501 store items were analyzed:
  - 5,048 (77.6%) were identical
  - 533 (8.2%) differed
  - 920 (14.2%) were inconclusive

We’re somewhere between 78% and 91%—not as good as Debian yet, but we know what to do next! The inconclusive comparisons here can be due to a package that failed to build on one machine, for instance because its test suite fails in a non-deterministic way, or simply because one of the build farms is lagging behind. guix challenge lists all the problematic packages, which makes it easy to retrieve the faulty binaries and investigate.

Reproducible builds = faster downloads!

There’s a very practical advantage to reproducible builds: anyone who publishes binaries is in essence a mirror of our build farm.

Until now, Guix’s public key infrastructure (PKI) was used pretty rigidly: you could download binaries from a server if and only if you had previously authorized its public key. So to download binaries from the person next to you, you would first need to retrieve their public key and authorize it. In addition to being inconvenient, it has the drawback of being an all-or-nothing decision: you would now accept any binary coming from that person. Can’t we do better?

We realized there’s an easy way to exploit the mirroring property mentioned above: assuming I trust binaries from mirror.hydra.gnu.org, then I can download from anyone who publishes the exact same binaries. Put this way, it seems obvious, but it required some adjustments to the substitute code.

To understand what’s going on, let’s look at the metadata guix publish produces, in a format inherited from Hydra:

$ wget -q -O - https://berlin.guixsd.org/8kib1cirdv0qbmn9hdkjzjfx3n5nw1yw.narinfo
StorePath: /gnu/store/8kib1cirdv0qbmn9hdkjzjfx3n5nw1yw-sed-4.4
URL: nar/gzip/8kib1cirdv0qbmn9hdkjzjfx3n5nw1yw-sed-4.4
Compression: gzip
NarHash: sha256:18v7dgny1xna7f53mbkj8bk4y2f00l5rjk2k6hj166kjv964lz7r
NarSize: 637360
References: 3x53yv4v144c9xp02rs64z7j597kkqax-gcc-5.4.0-lib 8kib1cirdv0qbmn9hdkjzjfx3n5nw1yw-sed-4.4 n6nvxlk2j8ysffjh3jphn1k5silnakh6-glibc-2.25
FileSize: 218663
System: x86_64-linux
Deriver: pi8686q63rwr4md90vm3qxwhk2g2fvqa-sed-4.4.drv
Signature: 1;berlin.guixsd.org;KHNpZ25hdHVyZSAKIChkYXRhIAogIChmbGFncyByZmM2OTc5KQogIChoYXNoIHNoYTI1NiAjQTRDRjUyMTVGNzlBOEUxRkFBNjIyOEQwQjk0QjMyMTZCRkY1RjA1NkQxMzZENUEzNTFGM0I2OTYzQzc1MzQzMiMpCiAgKQogKHNpZy12YWwgCiAgKGVjZHNhIAogICAociAjMDFDM0NGMEIzRUMwNkIwRUNGMTJEMTU4MkNCMzA2RjkzMEU2Njc1NDNFOEQ2NkZCRjhDRUY4QkQwMkMzOTg1NCMpCiAgIChzICMwRTg2MUEyRjI3MDg2MjVBRDkzMDg5RjFFRjE4NzUwQjIzQjM0RTA5MkFFRkQ3RTlFNkZCMjlCMkMwMURFNjI5IykKICAgKQogICkKIChwdWJsaWMta2V5IAogIChlY2MgCiAgIChjdXJ2ZSBFZDI1NTE5KQogICAocSAjOEQxNTZGMjk1RDI0QjBEOUE4NkZBNTc0MUE4NDBGRjJEMjRGNjBGN0I2QzQxMzQ4MTRBRDU1NjI1OTcxQjM5NCMpCiAgICkKICApCiApCg==

This “narinfo” gives us, among other things, the hash of the sed binary that berlin.guixsd.org obtained, the URL where it can be downloaded, and a signature on this metadata (a base64-encoded canonical s-expression).

Guix has supported the ability to specify several substitute servers for a while, with --substitute-urls, but it would filter out narinfos signed by an unauthorized key. The main change was thus to keep narinfos with a hash identical to that advertised by one of the authorized narinfos. Thus, if I run:

$ guix build sed \
  --substitute-urls="https://somebody.example.org https://mirror.hydra.gnu.org"

Guix will fetch a narinfo from both URLs. If somebody’s narinfo claims the same hash as hydra, then Guix will download the actual binary from somebody—which, hopefully, may be faster than downloading from hydra. Of course, when the download completes, guix-daemon verifies that the hash is really as advertised in the narinfo, such that somebody.example.org cannot tweak me into downloading a different binary.

The future

This feature landed in September, and will be in the forthcoming Guix release.

Among the ideas floating around, one is to have guix publish advertise itself on the local network via Avahi. Guix could, optionally, discover neighboring guix publish instances and add them to its list of substitute servers. Binaries could sometimes be downloaded from the local network, which should be faster.

More generally, the role of our build farm shifts from providing binaries to providing meta-data about binaries. We can entirely decouple the choice of a meta-data server from the choice of a binary provider.

Longer-term, binaries could very well be downloaded from a content-addressed store such as IPFS or GNUnet without having to forego our existing infrastructure. Peer-to-peer distribution of binaries has been on our mind for a while, but we hadn’t quite realized this decoupling and how it would allow us to support a smooth transition.

These are all exciting perspectives, and a nice practical consequence of reproducible builds!