Stop procrastinating and just fix your flaky tests; they may be catching nasty bugs

I believe every developer knows this in their heart, yet sometimes we still ignore it. I procrastinated on one flaky test for a long time in my project Nun-db.

It only happened on GitHub Actions, and I could never replicate it locally, so I got used to re-running the build and moving on. Since I only open or merge one or two PRs a week on Nun-db, it ran like that for longer than I would like.

This Saturday morning, while I was in a coffee shop, I found a broken build on the master branch from a PR I had merged two hours before. It kills me to see a broken build on master in my projects, so I decided to fix it and not accept this flakiness anymore.

To my surprise, it was not a flaky test; it was a real problem. The code was flaky, not the test.

Allow me to give some context first. The test I am talking about starts three Nun-db processes as a replica set and sends a command to make sure the value is replicated correctly to all processes. You can see the test here.

After reading the test, I was convinced there was nothing wrong with the test itself, so I decided to look at the CI logs, which I present next. I omitted most of the log for simplicity, but paying particular attention to the lines starting with # will help you understand what is happening.

2022-03-19T13:09:30.4500089Z test tests::should_replicate_as_expected_set_safe ... [2022-03-19T13:09:30.449713727Z nun_db] nundb starting!
# Primary starts fine

2022-03-19T13:09:35.6648602Z [2022-03-19T13:09:35.664777173Z nun_db::lib::process_request] [139909448636160] Server processed message 'join 127.0.0.1:3018
2022-03-19T13:09:35.6671592Z [2022-03-19T13:09:35.667074838Z nun_db::lib::process_request] Setting 127.0.0.1:3017 as primary!
# Secondary 1 starts fine

# Secondary 2 starts

2022-03-19T13:09:40.8938860Z [2022-03-19T13:09:40.893781615Z nun_db::lib::process_request] [139909444421376] Server processed message 'join 127.0.0.1:3016
# Primary fails to connect to Secondary 2
2022-03-19T13:09:40.8949291Z [2022-03-19T13:09:40.894793987Z nun_db::lib::replication_ops] Could not stablish the connection to replicate to 127.0.0.1:3016 error : Connection refused (os error 111)
# Test fails here

What does that mean?

The primary process failed to connect to the secondary as if it were off. However, in the Nun-db flow, a new secondary joining the cluster is the one that tells the others it is alive, which means this is absurd and should never happen. Well, it did, so let's look at the code to see what caused it.

The bug is in the next block of code; let’s check it together.


fn start_db(
    user: &str,
    pwd: &str,
    ws_address: &str,
    http_address: &str,
    tcp_address: &str,
    replicate_address: &str,
) -> Result<(), String> {
//... omitted for simplicity
    let db_replication_start = dbs.clone();
    let tcp_address_to_relication = Arc::new(tcp_address.to_string());
//... start replication threads
    let join_thread = thread::spawn(move || {
// Here we notify the other nodes that this process has started
        lib::replication_ops::ask_to_join_all_replicas(
            &replicate_address_to_thread,
            &tcp_address_to_election.to_string(),
            &dbs_self_election.user.to_string(),
            &dbs_self_election.pwd.to_string(),
        );
        lib::election_ops::start_inital_election(dbs_self_election)
    });
//  Other steps here
    let _snapshot_thread = thread::spawn(|| lib::disk_ops::start_snap_shot_timer(timer, db_snap));
// Start WS connections
// Start HTTP connection
// Only now do we start the TCP listener (nodes communicate over TCP) (oops!)
    let tcp_thread =
        thread::spawn(move || lib::network::tcp_ops::start_tcp_client(dbs_tcp, &tcp_address));
//... omitted
}

Looking at it now, it seems evident that I should never notify another process before preparing the current one to respond to it; in this case, before starting the TCP listener.
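
To see the race outside of Nun-db, here is a minimal, self-contained sketch using only the Rust standard library; none of it is Nun-db code, and the ports are arbitrary. It first sends the "join" notification before the local listener is bound (the buggy order), then repeats it in the correct order:

use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

fn main() {
    // Buggy order: tell the "other node" to connect back before our own listener exists.
    let notifier = thread::spawn(|| {
        // Simulates the other node reacting to the join message by connecting back to us.
        match TcpStream::connect("127.0.0.1:3016") {
            Ok(_) => println!("connected back: ok"),
            Err(e) => println!("could not establish the connection: {}", e),
        }
    });
    thread::sleep(Duration::from_millis(100)); // our listener is still not up...
    let _late_listener = TcpListener::bind("127.0.0.1:3016").expect("bind failed");
    notifier.join().unwrap(); // prints "Connection refused (os error 111)" on Linux

    // Correct order: bind the listener first, only then ask the other node to connect back.
    let listener = TcpListener::bind("127.0.0.1:3017").expect("bind failed");
    let notifier = thread::spawn(|| {
        TcpStream::connect("127.0.0.1:3017").expect("connect failed");
        println!("connected back: ok");
    });
    let _conn = listener.accept().expect("accept failed");
    notifier.join().unwrap();
}

The first half fails with Connection refused (os error 111) on Linux, just like the CI log above; the second half succeeds because the listener is already bound when the other side connects back.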

Well, you already know the fix at this point: move the TCP start thread up. That is what I did in the PR, and it is what I show next.
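
Roughly, the reordered version looks like the sketch below. It keeps the same elisions as the block above and is not the exact diff from the PR:

fn start_db(/* ... same arguments as above ... */) -> Result<(), String> {
//... omitted for simplicity
// Start the TCP listener first, so the other nodes can connect back to this process
    let tcp_thread =
        thread::spawn(move || lib::network::tcp_ops::start_tcp_client(dbs_tcp, &tcp_address));
//... start replication threads
    let join_thread = thread::spawn(move || {
// Only now notify the other nodes, since this process can already answer them
        lib::replication_ops::ask_to_join_all_replicas(
            &replicate_address_to_thread,
            &tcp_address_to_election.to_string(),
            &dbs_self_election.user.to_string(),
            &dbs_self_election.pwd.to_string(),
        );
        lib::election_ops::start_inital_election(dbs_self_election)
    });
//... WS, HTTP and snapshot threads start here, as before
//... omitted
}

Strictly speaking, spawning the listener thread first only makes the window much smaller; a stricter version would wait until the listener is actually bound before sending the join message, as in the standard-library sketch above.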

Conclusion

Dear future me: flaky tests may not just be flaky tests; they may be hiding a real issue in your system. The bug may be the flaky part, not the test. That is especially true for race conditions like this one, and when dealing with distributed systems it is even more critical. DO NOT IGNORE FLAKY TESTS. Thanks.

Written on March 19, 2022