Flaky tests are the bane of a developer’s existence - they pass locally but fail in CI, or worse, pass 9 times out of 10 only to mysteriously fail when you least expect it. A failing CI build leads to fatigue and frustration for everyone involved, so any flaky test should be analyzed and fixed.
We quickly realized that a good share of our flaky tests had the same root cause: database queries without explicit ordering.
The Problem: Unstable Sorting in Database Queries
When we write a query like this:
Repo.all(from u in User)
we might expect the results to always come back in the same order. SQL makes no such guarantee unless you explicitly specify an ORDER BY clause, but since PostgreSQL often returns results in the same order due to how data is stored and accessed, this leads to subtle bugs in tests: we might write assertions that expect a specific ordering, and they’ll pass most of the time by coincidence.
But occasionally, when database statistics change or PostgreSQL chooses a different query plan, the order will change and our tests will fail. These are the dreaded flaky tests. Surprisingly, this happens more often in CI than locally, leading to hard-to-reproduce failures.
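To make this concrete, here is the shape of a test that bites us. This is a minimal sketch, assuming an ex_machina-style insert/2 factory, a WebApp.DataCase test case, and a WebApp.User schema; none of these names come from the article itself.

defmodule WebApp.UserQueriesTest do
  use WebApp.DataCase, async: true

  test "lists users" do
    insert(:user, name: "Alice")
    insert(:user, name: "Bob")

    # No order_by anywhere: the database is free to return Bob first,
    # so this assertion only passes by coincidence.
    assert [%{name: "Alice"}, %{name: "Bob"}] = WebApp.Repo.all(WebApp.User)
  end
end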
A Proposed Solution: Enforcing Randomness
I came up with a solution that’s both simple and effective: what if we could make the ordering explicitly random during tests? This would turn intermittent failures into consistent ones, forcing us to fix the underlying issues.
Here’s the trick: we can override Ecto’s prepare_query/3 callback to add an ORDER BY RANDOM() clause to any query that doesn’t already have an ordering specified:
defmodule WebApp.Repo do
  # NOTE: the otp_app and adapter values are assumptions for this example;
  # use your application's actual settings here.
  use Ecto.Repo,
    otp_app: :webapp,
    adapter: Ecto.Adapters.Postgres

  if Mix.env() == :test do
    @doc """
    Testing helper that enforces a random sort on test queries. It helps track
    flaky tests and makes them more obvious, as the order won't be stable
    unless an order_by clause is specified.
    """
    @impl true
    def prepare_query(operation, query, opts) when operation == :all do
      query =
        if query.order_bys == [] and is_nil(query.distinct) and query.combinations == [] do
          import Ecto.Query
          order_by(query, fragment("RANDOM()"))
        else
          query
        end

      {query, opts}
    end

    def prepare_query(_operation, query, opts) do
      {query, opts}
    end
  end
end
This implementation uses the prepare_query/3 callback from Ecto.Repo, which lets us transform queries before they’re executed. When it detects a query without ordering, it adds ORDER BY RANDOM(), skipping queries where a random ordering wouldn’t be valid or meaningful, such as DISTINCT queries and combinations like UNION.
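You can sanity-check the override by calling the callback directly from an iex session started in the test environment. This is a rough sketch: it assumes a WebApp.User schema and simply inspects the order_bys field of the resulting Ecto.Query struct.

# MIX_ENV=test iex -S mix
import Ecto.Query

# A bare query gets a random ordering added by prepare_query/3 ...
query = from(u in WebApp.User)
{prepared, _opts} = WebApp.Repo.prepare_query(:all, query, [])
prepared.order_bys != []
#=> true

# ... while a query that already specifies an order_by is left untouched.
ordered = from(u in WebApp.User, order_by: u.id)
{prepared_ordered, _opts} = WebApp.Repo.prepare_query(:all, ordered, [])
length(prepared_ordered.order_bys)
#=> 1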
The beauty of this approach is that it only affects your test environment and even prevents new flaky tests from being introduced - they’ll fail immediately during development rather than randomly in CI months later.
Finding All the Flaky Tests
With our new randomized ordering in place, we’ll quickly discover tests that were silently depending on implicit ordering. But there’s a catch - since the ordering is random, we need to run our test suite multiple times to catch all the issues.
I knew about the mix test --failed command, which reruns the tests that failed in the previous run, so I figured there had to be a manifest somewhere that stores the failing tests on disk. After some digging, I found it in the _build directory, more specifically in the _build/test/lib/webapp/.mix/.mix_test_failures file.
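Out of curiosity, you can peek at that manifest yourself. A quick sketch: the exact format is an internal detail of ExUnit and may change between versions, but on the version we used it is a term_to_binary-encoded {version, map} tuple whose map is keyed by {module, test_name}.

# MIX_ENV=test iex -S mix, or a throwaway .exs script
manifest_path = "_build/test/lib/webapp/.mix/.mix_test_failures"

manifest_path
|> File.read!()
|> :erlang.binary_to_term()
|> elem(1)
|> Enum.take(3)
|> IO.inspect(label: "sample of failed tests")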
So let’s write a script to capture these failures and store them in a more stable file:
defmodule AddStableFailures do
  # The failures manifest written by ExUnit after each `mix test` run
  @failed_manifest "_build/test/lib/webapp/.mix/.mix_test_failures"
  # The file where we accumulate failures across runs
  @stable_manifest "stable_failures"

  def run do
    if not File.exists?(@failed_manifest) do
      IO.puts("Failed manifest file not found: #{@failed_manifest}")
      # Halt immediately: System.stop/0 is asynchronous, so the script
      # would otherwise keep running past this point.
      System.halt(1)
    end

    failed_tests =
      @failed_manifest
      |> File.read!()
      |> :erlang.binary_to_term()
      |> elem(1)
      |> Enum.group_by(fn {_, file} -> file end, fn {{_mod, name}, _file} -> name end)

    existing_stable_failures = read_stable_failures()

    new_stable_failures =
      Enum.reduce(failed_tests, existing_stable_failures, fn {file, tests}, acc ->
        Enum.reduce(tests, acc, fn test_name, acc ->
          entry = {file, test_name}

          if Enum.member?(acc, entry) do
            IO.puts("Test already in stable manifest: #{test_name} from #{file}")
            acc
          else
            IO.puts("Adding stable failure: #{test_name} from #{file}")
            [entry | acc]
          end
        end)
      end)

    # Write the entire list back to the file as a single term.
    write_stable_failures(new_stable_failures)
  end

  defp write_stable_failures(terms) do
    File.write!(@stable_manifest, :erlang.term_to_binary(terms))
  end

  defp read_stable_failures do
    case File.read(@stable_manifest) do
      {:ok, binary} -> :erlang.binary_to_term(binary)
      {:error, :enoent} -> []
    end
  end
end

AddStableFailures.run()
Here’s how to use it:

1. Run your test suite with mix test; some, but not all, of the flaky tests will fail.
2. Run this script to capture the failures: elixir accumulate_failures.exs
3. Repeat steps 1-2 several times to build up a complete list.
After a few iterations, you’ll have a pretty comprehensive list of all the tests affected by ordering issues.
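If you’d rather not drive that loop by hand, a small script can do it for you. This is only a sketch: it assumes the accumulation script above is saved as accumulate_failures.exs at the project root, and the iteration count is arbitrary.

# run_flaky_hunt.exs
iterations = 5

for i <- 1..iterations do
  IO.puts("=== Run #{i}/#{iterations} ===")

  # Run the suite; a non-zero exit status just means some tests failed,
  # which is exactly what we're hunting for.
  System.cmd("mix", ["test"], into: IO.stream(:stdio, :line))

  # Fold this run's failures into the stable manifest.
  System.cmd("elixir", ["accumulate_failures.exs"], into: IO.stream(:stdio, :line))
end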
Fixing the Tests
Once you’ve identified the flaky tests, fixing them is usually straightforward:
- Add explicit ordering to your queries:

    # Before
    users = Repo.all(from u in User)

    # After
    users = Repo.all(from u in User, order_by: u.id)
- Sort results in memory if the database order doesn’t matter for the assertion:

    # Before
    assert [^user1, ^user2, ^user3] = Repo.all(User)

    # After
    assert Enum.sort_by([user1, user2, user3], & &1.id) == Enum.sort_by(Repo.all(User), & &1.id)
- Remove order dependence from the test if possible:

    # Before
    assert [%{name: "Alice"}, %{name: "Bob"}] = result

    # After
    assert Enum.count(result) == 2
    assert Enum.any?(result, &(&1.name == "Alice"))
    assert Enum.any?(result, &(&1.name == "Bob"))
The Results
After implementing this approach at work, we quickly identified and fixed a dozen flaky tests, most of which had never surfaced before. Our CI became more reliable, which is always a good thing.
What’s more, this approach catches potential flaky tests during development. When a developer writes a new test that implicitly depends on ordering, it fails immediately (well, it’s still random so it might take a few runs) on their machine rather than randomly in CI weeks later.
As an added bonus, we’ve made our codebase more robust by adding explicit ordering in places we had forgotten about.
Conclusion
Flaky tests cause a lot of frustration, but they are surprisingly easy to fix once you know what to look for. A good first step is to ask your team to start tracking failing builds so you can accumulate data on the frequency and nature of these issues. This will help you prioritize fixes and keep your CI pipeline stable over time.