Brute-forcing race conditions in CI

Unfortunately, no one can be told what the Matrix is. You have to see it for yourself.

Morpheus, 1999

Most projects I've worked on are occasionally afflicted by the dreaded flaky tests - intermittent failures that show up in CI but that you can't recreate on your own machine.

These failures may occur only once every few tens or hundreds of test runs. The less frequently failures occur, the harder they are to catch, understand, and resolve.

Whatever the cause - a subtle bug in an underlying library, a poorly crafted browser-driver assertion, a missing await in some JavaScript - until a problem can be reliably recreated, it's impossible to have confidence that it's been genuinely fixed.

Here's how you can use brute force to more reliably recreate these failures, and have more confidence they are fixed.

Enter the Matrix

For this example I'll use GitHub Actions, but you can probably take a similar approach with GitLab, Jenkins, CircleCI, or even Travis (if you still use it).

What?

Here's the basic approach:

  1. isolate the test
  2. make it run as many times as possible, as quickly as possible, until failure
  3. read the logs

How?

1. Isolate the test

For a JavaScript test framework like Jest or Mocha, this is usually as simple as:

-  it('some test...', ...
+  it.only('some test...', ...

Other approaches might be to delete irrelevant tests, jobs or workflows.

2. Run it loads

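To run the job 100 times per CI run, add a matrix to the job in your GitHub Actions workflow YAML. GitHub Actions starts one job for every combination of the matrix values, so two 10-element lists give 10 × 10 = 100 copies of the job; the x and y values themselves are never used. By default (fail-fast: true) the first failing job cancels the rest of the matrix, and a matrix can generate at most 256 jobs per workflow run.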

jobs:
  your-job:
+    strategy:
+      matrix:
+        x: [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
+        y: [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
...
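Each matrix job runs the test once. If job startup time dominates, it may be worth repeating the test command inside each job as well, stopping at the first failure. The following is a minimal Node sketch of that idea, not part of the workflow above: the function name repeatUntilFailure, the run count, and the Jest command in the usage comment are all assumptions for illustration.

```javascript
const { spawnSync } = require('node:child_process');

/**
 * Run `command` with `args` up to `maxRuns` times, stopping at the first
 * run that does not exit cleanly. Returns the 1-based number of the failing
 * run, or null if every run passed.
 * (Hypothetical helper, named here for illustration only.)
 */
function repeatUntilFailure(command, args, maxRuns) {
  for (let attempt = 1; attempt <= maxRuns; attempt++) {
    // stdio: 'inherit' streams the test output straight into the CI log.
    const result = spawnSync(command, args, { stdio: 'inherit' });
    // result.error is set if the process could not be spawned;
    // result.status is null if it was killed by a signal.
    if (result.error || result.status !== 0) {
      console.error(`Run ${attempt} of ${maxRuns} failed`);
      return attempt;
    }
  }
  return null;
}

// Possible usage (the test path is a placeholder):
//   const failedAt = repeatUntilFailure('npx', ['jest', 'path/to/flaky.test.js'], 50);
//   process.exitCode = failedAt === null ? 0 : 1;
```

Combined with the 10 × 10 matrix, 50 repeats per job would give up to 5,000 runs per CI run, at the cost of longer jobs and logs that take more scrolling to read.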

3. Read the logs

From here on in, you're on your own - good luck!