Push and Pray No More: Why We Need Static Correctness Guarantees for Modern CI/CD

Jul 23, 2025
@maxgallup

Continuous Integration and Deployment (CI/CD) is the intricate machinery that lets a software company operate like a software factory. Changes made to the source code flow down a long conveyor belt to pass tests, sanity checks, builds, and more tests, and eventually get shipped out of the factory and deployed to the public. From the users' perspective, their software silently bumps up a version number and they get to enjoy the bug fix or new feature made by the engineers back in the factory. The technology is widely adopted, so what could possibly be so wrong with it?

In short, CI/CD systems fundamentally lack the ability to make any static correctness guarantees about the setup and execution of a single "pipeline" as well as the orchestration of a large collection of interdependent pipelines. To make matters worse, some CI/CD providers make it very difficult to even test pipelines locally due to their deeply rooted dependency on the CI/CD service provider. The phrase "push and pray" naturally emerged amidst developers' frustration with such systems.

But what are Pipelines?

Put simply, a pipeline is a collection of jobs, each containing a series of steps where the input of one step usually depends on the output of the previous. While the concept is commonly used in software engineering to automate testing and deployment, it's widely applicable to a range of other domains that benefit from automation, such as data science. Most pipelines today, however, generally boil down to the same thing: running commands in a specific order (shell scripts!). But what exactly are shells and why were they created?

In the early '70s, engineers at Bell Labs faced a productivity problem: there was a high demand for software development, yet the process of writing and maintaining code was too arduous and stunted progress. The Unix operating system and Ken Thompson's initial shell gave his colleagues the ability to string together a number of existing programs, from which new functionality emerged. By redirecting the output of one program to the input of another, the shell filled the productivity gap because it allowed developers to better reuse the work done by their colleagues [1]. While Thompson's first shell was rather minimal, the Bourne Shell, released in 1979, quickly became a more feature-rich successor comparable to modern-day shells [2]. In principle, today's shell solves the same problem as it did in 1970 by providing a language to declare the execution of multiple programs and the flow of data from one to another. While the notion of a pipeline is rather abstract and general, pipelines are the building blocks of modern CI/CD systems.
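
For example, a single line of shell composes several unrelated programs into a new tool, with the output of each feeding the input of the next (access.log here stands in for any web server log in the common log format):

# List the five most frequent client IPs by chaining standard Unix programs.
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head -5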

Pipelines in the wild: CI/CD Today

In order to understand the inherent design flaws of CI/CD, let's imagine a simple, yet realistic example of what it's like to build CI/CD circa 2025. Imagine you and your team of developers are in the second week of building a brand new software project. The codebase has grown to a point where you would like to start running tests alongside your development workflow. Ideally, you would run your test suite on a server every time you push changes to your remote git repository. Companies like GitHub provide exactly this kind of service, because they can react whenever you push new code. The automation infrastructure that GitHub provides is configured through a declarative language that is analogous to a cooking recipe: it describes which steps need to be performed, and in what order, to accomplish the job. In our example, the steps are rather simple:

  1. Check out the latest version of the source code.
  2. Install any necessary dependencies.
  3. Run the test suite and report any errors.

In order for GitHub to understand our recipe, we can define a "workflow" file right alongside our code. For example, in a file called .github/workflows/test.yaml we can define our recipe from above as follows:

on:
  push:
    branches:
      - main

jobs:
  run-test-suite:
    runs-on: ubuntu-latest

    steps:
      # Step 1
      - name: Checkout Code
        uses: actions/checkout@v3

      # Step 2
      - name: Install Dependencies
        run: sudo apt install -y some_python_package

      # Step 3
      - name: Run Build and Tests
        run: |
          mkdir output
          make
          python test.py

First, at the top we indicate that this workflow should be triggered on a push to the main branch. After that, we specify the jobs; GitHub runs jobs in parallel by default (unless you declare dependencies between them), but in this case we only specify one, namely run-test-suite, and state that it runs on the latest version of Ubuntu (more on that in a bit). Then, at the heart of the workflow, we define the individual steps that get executed one after another. Note that any step in any job can fail, resulting in a failed workflow. The first step of the pipeline actually uses a reusable, pre-built action (actions/checkout) that clones the repository associated with the current workflow. The second step uses the apt package manager to install dependencies, since GitHub's runner is based on Ubuntu. Finally, the last step performs what we are actually interested in, namely building the project and running our tests.

While this example pipeline is rather simple, it is a realistic snippet of what most CI roughly looks like (not "CD", since this does not deploy anything). You might be asking yourself: What's wrong with it?

No Type Safety Within Pipelines

Within any pipeline, there are a number of steps executed by independent programs, each of which must be set up correctly. They might rely on environment variables, preinstalled programs or any number of external factors that the pipeline author must be aware of. Without a strong and expressive type system to warn the programmer about potential misuses before deploying the pipeline, precious time will continue to be wasted.
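
As a purely illustrative sketch (the deploy-tool binary and the DEPLOY_TOKEN secret below are made up), consider a workflow step that is perfectly valid YAML yet can only fail once it runs: nothing checks ahead of time that the secret is actually configured or that the binary exists on the runner.

      # Hypothetical step: it parses and validates as YAML just fine.
      - name: Publish Artifact
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}  # silently empty if the secret was never set
        run: deploy-tool publish --token "$DEPLOY_TOKEN"  # fails only at run time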

Mismatched Development and Runner Environment

The server that actually runs the steps defined in our pipeline is referred to as a "runner", which in most cases is just an ephemeral VM hosted by the git provider. So in our case, GitHub spins up a modified Ubuntu VM (since we defined runs-on: ubuntu-latest) and runs our steps inside of it. Yes, this means workflows have an inherently slow start and there's no caching by default, because each run starts with a fresh VM. GitHub had to make an executive decision about which commonly used programs it ships with its Ubuntu VM, since if it were too bare-bones, the VM would spend most of its time installing dependencies every time a pipeline is run. As Amos Wenger points out, this VM faces an identity crisis, since part of its responsibility is to act as a package manager to meet the needs of "most" common pipelines. As a result, the VM weighs in at a whopping 51.2 GB, making it uninviting to try locally [3]. The mismatch between the environment of the developer's laptop and the CI/CD server creates a messy problem that results in Frankenstein VMs like GitHub's, as well as countless developer and runtime hours spent installing dependencies onto the CI runner.
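
One partial mitigation, sketched below with an arbitrary python:3.12-slim image, is to pin the job to a specific container image. GitHub Actions can run a job's steps inside a container, and the very same image can be pulled onto a laptop, which narrows the mismatch even though it still offers no static guarantees:

jobs:
  run-test-suite:
    runs-on: ubuntu-latest
    # Run the steps inside a pinned image instead of directly on the runner VM.
    # The identical image can be pulled locally (e.g. with `docker run`), which
    # narrows, but does not close, the laptop/runner gap. Any build tools the
    # project needs (make, compilers, ...) still have to be baked into the image.
    container:
      image: python:3.12-slim
    steps:
      - uses: actions/checkout@v3
      - run: python test.py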

I argue that the root of this problem runs much deeper than pipelines themselves: it stems from heterogeneous architectures, operating systems, languages and, most importantly, incompatible software packaging efforts. Unlike the universally spoken Internet Protocol for connecting computers over the network, software packaging is different almost everywhere you look. Each operating system chose its own way to package and distribute software, effectively reinventing the wheel every time. This led to "ecosystems" of software, which meant siloing the packaging effort and burdening developers with making their software compatible with each silo. For example, the extensive list of platforms that curl supports shows how much work is placed on the developer to support all platforms. This is a tale as old as Java's "write once, run anywhere", which was followed by attempt after attempt to unify software distribution.

Shell Scripting

A related problem is that today's YAML-based configurations depend heavily on shell scripting itself and thus provide no static correctness checking capabilities. In other words, the static validity of the pipeline declaration does not entail the validity of the pipeline's execution. For example, a pipeline might be as simple as:

#!/bin/bash
mkdir output
make
python test.py

However, the pipeline definition provides no static information that allows a system to infer the most basic premises: Is python actually installed, and is it the correct version? Is test.py even in the current working directory? It's common to simply delay those questions to run time and fix them after having run the pipeline. So, after installing the correct version of python and making sure we're in the right directory, we can re-run the pipeline and suddenly it executes in a valid environment. The validity of the pipeline's code is entirely decoupled from whether the pipeline can execute correctly, and that decoupling forces configuration errors to be discovered at run time.
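
Today, the best a script author can do is encode those premises as run-time guards, as in the defensive sketch below (the Python 3.10 floor is an arbitrary example). The checks still only fire once the pipeline is already running, which is exactly the problem:

#!/bin/bash
set -euo pipefail  # abort on errors, unset variables and broken pipes

# Run-time guards: they can only report problems during execution,
# never before the pipeline is launched.
command -v python >/dev/null || { echo "python is not installed" >&2; exit 1; }
python -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)' \
  || { echo "python 3.10+ is required" >&2; exit 1; }
[ -f test.py ] || { echo "test.py not found in $PWD" >&2; exit 1; }
mkdir -p output
make
python test.py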

In another article I propose some low-hanging fruit we can reach by unifying the CLI experience and rethinking the shell itself.

No Type Safety Between Pipelines

Another problem with modern CI/CD that static guarantees could greatly improve is the management of complex, interdependent pipelines. Big tech companies tackle the complexity of their build systems with tools like Buck2, which make it possible to manage dependencies between build targets. This concept could be applied further to CI/CD pipelines, making it possible to rebuild dependent pipelines automatically. Adding a type-safe layer on top of such a system would make complex multi-repository setups much more manageable and maintainable.
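
As a sketch of how loosely pipelines can be chained today (the "Build" workflow name and the run-integration-tests.sh script are hypothetical), GitHub Actions lets one workflow trigger when another finishes, but nothing describes, let alone type-checks, what the upstream pipeline was supposed to produce:

# Downstream workflow: runs whenever a workflow named "Build" completes.
on:
  workflow_run:
    workflows: ["Build"]
    types: [completed]

jobs:
  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Implicitly assumes "Build" published whatever this script expects;
      # if it did not, the failure only surfaces here, at run time.
      - run: ./scripts/run-integration-tests.sh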

Conclusion

The software factory metaphor we began with reveals an uncomfortable truth: our CI/CD conveyor belts are held together with duct tape and hope. While these systems have scaled to serve millions of developers and enabled the rapid iteration cycles we've come to expect, they remain fundamentally fragile. The lack of static correctness guarantees means that every pipeline change is a potential breakage waiting to be discovered at runtime, and the heterogeneous mess of environments, package managers, and shell scripts compounds the problem at every turn. Until we close the gap between what we know is possible in terms of static verification and what our CI/CD systems actually provide, developers will continue to waste countless hours debugging environment mismatches and praying their pipelines work on the first try. The software factory can do better than "push and pray"—it's time we built the machinery to prove it.

References

  1. AT&T Tech Channel. (2014, January 27). UNIX: Making Computers Easier To Use -- AT&T Archives film from 1982, Bell Laboratories [Video]. YouTube. https://www.youtube.com/watch?v=XvDZLjaCJuw

  2. Wikipedia. Bourne shell. https://en.wikipedia.org/wiki/Bourne_shell

  3. fasterthanlime. (2023, Dec 23). GitHub Actions Feels Bad [Video]. YouTube. https://www.youtube.com/watch?v=9qljpi5jiMQ