Incremental Compilation in Build Systems
Build systems are used by every software engineer but rarely get any love. For decades, the best tools engineers had was make
. JavaScript developers are now plagued by slow webpack
build times.
In this post, I'll unpack some of the different differentiators between build systems and where I think the most exciting opportunities are.
What order are tasks built in?
Make: Constructs a dependency graph from the Makefile
and executes the tasks in topological order.
Excel: The build system that nobody realizes is a build system. Excel is unique because it doesn't need to know the dependencies upfront – it handles dynamic build dependencies. For instance, you might have a formula, INDIRECT, which can return the address of a cell and change the task dependency graph during execution. Of course, this means that a topological sort won't work.
Excel uses a calculation chain, in which the program marks cells "dirty" for recalculation and greedily starts to execute cells in the chain. If it reaches a cell that requires a value that hasn't been computed yet, it moves that cell and its dependents down the chain.
This means that calculation times can improve in a worksheet after a few calculation cycles.
What tasks are rebuilt?
Nix Package Manager: The Nix package manager deals with a higher abstraction level than tasks: packages. As a functional model, it installs packages into unique directories identified by a hash of the package.
One article claims that Nix fixes dependency hell on all Linux distributions (article) – it helps, but the claim might be a bit dubious.
Bazel: The open-source version of Google's internal build system, Blaze, uses a content-addressable cache to download a previously built task given the hash of its inputs.
Some of my work related to build systems:
An Alternative to the Dockerfile – The current Dockerfile format limits how users express the dependency graph for Docker image builds. This leads to unnecessary calculation. While this format doesn't entirely solve that issue, it allows for more parallelism and proof that the underlying layer executor can be optimized.
Virgo: a Graph-based Configuration Language – There aren't any configuration tools to write or serialize dependency graphs, despite how embedded they are in nearly all of our tools. Virgo is a configuration language where graphs are first-class citizens.
Docker is a compiler – What it means if we think of Docker as a generic build system for all sorts of artifacts.
Skaffold – Optimizing the recalculation engine for the container workflow. Skaffold lets users encode information about their build dependency graph and optimizes it behind the scenes. Take a monorepo with React code, a Go backend, and Kubernetes configuration: Skaffold can automatically sync the JavaScript code to a running container to be hot-reloaded, recompile the Go code and redeploy it, and reload the infrastructure configuration.
Live Programming – Thinking about what happens when the calculation engine reaches speeds that allow for rapid iteration. How does a developer workflow change when we can instantly get feedback about our code changes?
Some further reading on this subject:
Build Systems à la Carte - A classic on what makes a build system out of Microsoft Research Lab. It compares Excel, Make, Shake, and other build systems across a few dimensions.
Noira: data-flow for high-performance web applications – A project based on Jon Ferdinand's Ph.D. thesis at MIT. Focused on incremental build systems for database queries (found in startups like Materialize)