An approach to hunt and fix non-reproducible bugs - Case study - Fixing a race condition in Nun-db's replication algorithm in Rust

This is the story behind the fix of a race condition in Nun-db. In this post, I go over the process of finding the race condition by eliminating two initial theories, then identifying the root cause, fixing it, and validating the fix. If you only want to see the results, go straight to “3. Third and correct theory: Are nodes acknowledging messages before they are registered as pending?”.

What is Nun-db?

Nun-db is a real-time, open-source (MIT licensed) key-value database I am building in Rust. You can read more about it on GitHub. It gives web apps superpowers with a super simple API.

Wait, but doesn’t Rust prevent race conditions?

No. Rust, with its ownership system, prevents data races. Race conditions, however, are probably impossible for any language to prevent, since they may involve the network, hardware, or multiple processes outside of the language's control. I encourage you to read the section Data Races and Race Conditions for a better clarification.
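
To make the distinction concrete, here is a minimal sketch (not Nun-db code): it compiles and has no data race, because every access to the counter goes through a Mutex, yet it still has a race condition, because the read and the write happen under two separate lock acquisitions, so two threads can read the same value before either of them writes.

    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        // No data race: every access to the counter is synchronized by the Mutex.
        let counter = Arc::new(Mutex::new(0));
        let mut handles = Vec::new();

        for _ in 0..2 {
            let counter = Arc::clone(&counter);
            handles.push(thread::spawn(move || {
                // Race condition: another thread may update the counter between
                // this read and the write below.
                let current = *counter.lock().unwrap();
                *counter.lock().unwrap() = current + 1;
            }));
        }

        for handle in handles {
            handle.join().unwrap();
        }

        // Depending on the interleaving, this may print 1 instead of 2.
        println!("counter = {}", *counter.lock().unwrap());
    }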

How did I find the problem?

I run a Nun-db cluster for a few users (including my own applications). While I was working on adding more observability to Nun-db, I noticed the number of oplog pending ops was growing slowly, day by day, at a rate of ~0.25%. For context: at the moment, Nun-db has a single primary that handles all the writes and replicates them asynchronously to the secondary nodes. The secondary nodes then send an ack message back to the primary to finish the replication transaction. The operation is considered pending in the time between the replication and the ack message. You now probably know enough about the Nun-db replication algorithm to continue; if you want to read more about it, read “The Nun-db data replication model”.
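
As a mental model only (this is a sketch, not the actual Nun-db code), the flow looks roughly like the following, with channels standing in for the network and a set standing in for the pending-ops structure:

    use std::collections::HashSet;
    use std::sync::mpsc;
    use std::thread;

    fn main() {
        // Channel the secondaries use to send acks back to the primary.
        let (ack_tx, ack_rx) = mpsc::channel::<u64>();
        let mut secondaries = Vec::new();

        // Two "secondary nodes": apply the replicated message and ack it.
        for _ in 0..2 {
            let (rep_tx, rep_rx) = mpsc::channel::<(u64, String)>();
            let ack_tx = ack_tx.clone();
            thread::spawn(move || {
                for (op_id, message) in rep_rx {
                    println!("secondary applied: {}", message);
                    ack_tx.send(op_id).unwrap();
                }
            });
            secondaries.push(rep_tx);
        }

        // The "primary": mark the op as pending, then replicate it to every secondary.
        let op_id = 1;
        let mut pending: HashSet<u64> = HashSet::new();
        pending.insert(op_id);
        for rep_tx in &secondaries {
            rep_tx.send((op_id, "set user-1 some-value".to_string())).unwrap();
        }

        // The op stays pending until every secondary has acked it.
        for _ in 0..secondaries.len() {
            ack_rx.recv().unwrap();
        }
        pending.remove(&op_id);
        println!("pending ops left: {}", pending.len());
    }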

Moreover, in the blog post “Writing a Prometheus exporter in rust” I shared an image that pictures exactly this bug, the same one I am sharing next, but without the WTF arrow that I added only to this post. Note in the plot that the pending ops keep growing and never return to their previous state.

[Image: plot of pending ops growing over time, with the WTF arrow pointing at the climb]

That means some messages were either not replicating correctly or, even worse, getting lost. I tried to reconcile the data from the primary and the secondaries and did not find any difference in the values. That is the good news: there was no data loss or difference between the primary and the other nodes. So it was time to start hunting the problem. My approach for fixing this kind of problem is: theorize -> investigate/invalidate the theory -> formulate a new theory based on the results -> repeat.

The path to the fix, theory by theory

1. The Initial Theory: Did I miss the replication implementation of some command(s)?

First, I guessed I had not implemented the replication of some commands correctly, and all commands of one kind were never appropriately replicated. Those commands might only be used by a few external users and not be covered by the unit tests, which is why I would have missed them. Since there was no data difference between the primary and the other nodes, the command would have to be something that does not change data, like election messages or retrieve commands (which should not be replicated).

1.1. Validating the theory: I needed a way to see which operations were not replicated; then I would just fix their replication implementation. I failed to reproduce this problem locally and with unit or integration tests, so I had no option other than to add a way to debug it in a running environment. So I created a new admin command in Nun-db for debugging purposes. I wrote more about that in the post Nun-db the debug command.

1.2. Invalidating the theory: To my surprise, as soon as I deployed the updated version, I saw normal set messages as pending, for example:

### Pending message
message: "replicate org-559 lastTreatment { "id": 1641746754477, ...}}", opp_id: 1641746754499484856, ack_count: 1, replicate_count: 2 

ack_count: 1, replicate_count: 2 means it was replicated to 2 nodes and only one responded with an ack. Theory one was eliminated, because the most basic command, set, was present among the unacknowledged messages.

2. The second theory: Is there a problem with some of the nodes in the cluster?

Next, I thought some Nun-db node was not working correctly, or its acks were not arriving because of network problems.

2.1. Validating the guess: I tailed the logs from the cluster to see the ack messages coming back from the secondary nodes, which would tell me which node was not acknowledging the messages so I could concentrate my investigation on that node.

2.2. Invalidating the theory: Once again, reality proved my prediction wrong. In the logs, I saw messages from both nodes proving both were acknowledging the message, but I found something very interesting: the Acknowledging invalid opp warning. Check it out:

[2022-01-09T16:45:54.498664265Z  'set lastTreatment { "id": 1641746754477, "value" : {"$persisted":true,"user":618812,"id":5608309,...}}' in 8.021µs
# ...
[2022-01-09T16:45:54.504963203Z nun_db::lib::bo] Acknowledging invalid opp 1641746754499484856
# 6ms later
[2022-01-09T16:45:54.504980963Z nun_db::lib::process_request]  'ack 1641746754499484856 nun-db-2-1.nun-db-2.nun-db.svc.cluster.local:3019' in 28.6µs
[2022-01-09T16:45:54.509355868Z nun_db::lib::process_request] 'ack 1641746754499484856 nun-db-2-0.nun-db-2.nun-db.svc.cluster.local:3019
    ' in 10.56µs

The message Acknowledging invalid opp means the id is invalid or no longer exists in the pending message list, as we can see in the code on GitHub or in the next block.

    pub fn acknowledge_pending_opp(&self, opp_id: u64, server_name: String) {
        let mut pending_opps = self.pending_opps.write().unwrap();
        match pending_opps.get_mut(&opp_id) {
            Some(replicated_opp) => {
                // ...
            }
            None => {
                log::warn!("Acknowledging invalid opp {}", opp_id);
            }
        };
    }

But the id I am searching for is a valid one, captured from the log. So we found something that should be impossible.

As Thorin says in “The Hobbit,” “There is nothing like looking if you want to find something… You certainly usually find something, if you look, but it is not always quite the something you were after”.

I noticed that the time between the original set message and the error was only 6ms. That is a lot of time, but log lines are not written at the exact moment things happen, so the ack and the replication could have been just a few milliseconds from each other.

3. Third and correct theory: Are nodes acknowledging messages before they are registered as pending?

Based on my last observations, it looked like nodes were acknowledging messages before they were added to the pending messages list.

3.1. Validating the theory: Unfortunately, no easy debugging would prove this guess correct or incorrect. So I needed to jump into the code and find where this could happen. Luckily, it is isolated in a few functions that deal with replication.

Here is the old version of that code (you can also see it on GitHub here).

[Image: the old version of the replication code, annotated]

By now, you have probably already noticed the problem. We send the message to a channel on line 167 (that channel will replicate the message to the node) and only add it as pending after sending it to all nodes, on line 179.

That was not a good idea. If looping over the channels to replicate the message is slower than the first replication thread, we risk receiving the acknowledgment before registering the message as pending.
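
A stripped-down sketch of that ordering problem (the names and the sleep are stand-ins for illustration, not the real Nun-db code): the secondary acks while the pending map is still empty, producing exactly the warning seen in the logs.

    use std::collections::HashMap;
    use std::sync::mpsc;
    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::Duration;

    fn main() {
        let pending: Arc<Mutex<HashMap<u64, String>>> = Arc::new(Mutex::new(HashMap::new()));
        let (rep_tx, rep_rx) = mpsc::channel::<u64>();

        // The "secondary": acks as soon as it receives the replicated op id.
        let pending_for_ack = Arc::clone(&pending);
        let secondary = thread::spawn(move || {
            let op_id = rep_rx.recv().unwrap();
            match pending_for_ack.lock().unwrap().remove(&op_id) {
                Some(_) => println!("acknowledged {}", op_id),
                None => println!("Acknowledging invalid opp {}", op_id), // the warning from the logs
            }
        });

        // The "primary": sends the op to the secondary first...
        let op_id = 42;
        rep_tx.send(op_id).unwrap();

        // ...and only registers it as pending after looping over all the other nodes,
        // simulated here by a small sleep. The ack can easily win this race.
        thread::sleep(Duration::from_millis(5));
        pending.lock().unwrap().insert(op_id, "replicate ...".to_string());

        secondary.join().unwrap();
    }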

The fix

It is hard to be sure you have found the root cause of these problems. Nevertheless, you should follow the strong leads and test the fixes in running environments. The first fix I needed to make was to register the pending ops earlier, and I needed to ensure each op was added only once and not once per node.

I made add_pending_opp a more general (and more complex) method, renaming it to register_pending_opp. Keep in mind that this method has to own the ReplicationMessage to add it to the object used by all other threads. It became more costly since it needs to check whether the message already exists; if it does not, it is added to the pending ops, an additional operation that was not needed before, since I used to add it at the end. There are now two locks in the function, one read lock to check the existence and one write lock if needed. I think that is the cost of safety, but it is a piece of code to improve (if you are reading this blog and have ideas to improve it, contributions are welcome :) ).

Here is how the code looks in the end.
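
In simplified form (the HashMap<u64, String> storage and the Option<String> return type here are stand-ins used only to show the two-lock idea, not the real Nun-db types), the method does something like this:

    use std::collections::HashMap;
    use std::sync::RwLock;

    struct Databases {
        // Stand-in for the real pending-ops structure.
        pending_opps: RwLock<HashMap<u64, String>>,
    }

    impl Databases {
        // Registers the opp as pending exactly once, no matter how many secondary
        // nodes the caller will send it to, and returns the message to replicate.
        fn register_pending_opp(&self, opp_id: u64, message: String) -> Option<String> {
            // First lock: a read lock, only to check whether the opp already exists.
            let already_pending = self.pending_opps.read().unwrap().contains_key(&opp_id);

            // Second lock: a write lock, taken only when the opp is new.
            if !already_pending {
                self.pending_opps
                    .write()
                    .unwrap()
                    .insert(opp_id, message.clone());
            }

            Some(message)
        }
    }

The real pending-ops entries also keep the ack and replicate counters shown in the pending message earlier; a plain String is enough here to show the pattern.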

Next, I made a simple change to call register_pending_opp earlier in the replicate_message_to_secoundary function, so a valid message can no longer be acknowledged before the rest of the system knows it exists, making that impossible state truly impossible.

[Image: the code change calling register_pending_opp earlier in the replication flow]

I think the biggest lesson of this journey is “Making Impossible States Impossible” by Richard Feldman. It is a talk about the Elm programming language, but the idea is a solution to many other problems. If you don't allow your system to get into a state that should not be possible, you prevent a whole family of problems that are hard to find and debug, like this one.
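
As a toy illustration of the principle (this is just the idea expressed in types, not how Nun-db is structured), an acknowledged op can be made constructible only from a pending one, so “acknowledged before it was ever pending” cannot even be expressed:

    // Toy types for illustration only, not Nun-db code.
    struct PendingOpp {
        opp_id: u64,
        ack_count: u32,
        replicate_count: u32,
    }

    struct AcknowledgedOpp {
        opp_id: u64,
    }

    enum AckOutcome {
        Done(AcknowledgedOpp),
        StillPending(PendingOpp),
    }

    impl PendingOpp {
        // The only way to obtain an AcknowledgedOpp is through a PendingOpp.
        fn acknowledge(mut self) -> AckOutcome {
            self.ack_count += 1;
            if self.ack_count >= self.replicate_count {
                AckOutcome::Done(AcknowledgedOpp { opp_id: self.opp_id })
            } else {
                AckOutcome::StillPending(self)
            }
        }
    }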

Final Code

The final code in replicate_message_to_secoundary is simpler than the initial version because the complexity is now hidden inside the Databases method register_pending_opp. I liked that because the replication modules are already too complicated to follow, given their multi-threaded and asynchronous communication patterns, so removing complexity from them is great.

    fn replicate_message_to_secoundary(op_log_id: u64, message: String, dbs: &Arc<Databases>) {
        log::debug!("Got the message {} to replicate ", message);
        let state = dbs.cluster_state.lock().unwrap();
        for (_name, member) in state.members.lock().unwrap().iter() {
            match member.role {
                ClusterRole::Secoundary => {
                    let message_to_replicate = dbs.register_pending_opp(op_log_id, message.clone());
                    replicate_if_some(&member.sender, &message_to_replicate, &member.name)
                }
                ClusterRole::Primary => (),
                ClusterRole::StartingUp => (),
            }
        }
    }

Validating the fix

I was not 100% sure this was the root cause of the problem until I deployed this code; as mentioned before, it was a problem for sure, but was it “the” problem?

To validate it, I deployed the fix on Jan-16, and as I write this blog post on Jan-23 at 07:26 am, there have been 0 pending ops since the deployment. There have been ~250k transactions in the last seven days with 0 replication problems; check out the next plot, where I share both charts, pending ops and total transactions.

[Image: pending ops and total transactions charts after the deployment]

With that, we can be confident the change is a positive one. Of course, I will keep following it in the weeks to come, but it is already the success story of a debugging session that went well.

Conclusion

Lessons I would like to share at the end of this post:

  1. Neither Rust nor any other language will prevent you from getting race conditions. You need to take care of them yourself.
  2. When coding your system, make impossible states impossible to happen. That will prevent you from having complex problems to find and fix, like this one.

  3. In the face of a problem that is not reproducible in test/dev, follow a method to validate/invalidate theories and eliminate them once they are disproved:

3.1 Based on what you know, create a new “Guess” or theory.

3.2 Find ways to validate/invalidate your guess with a good amount of confidence.

3.3 Upon invalidating a theory, formulate a new one, using what you already found to make a well-informed guess. Document eliminated guesses with notes on why they were eliminated, so you avoid running in circles and, when pairing with others, you can re-validate your findings.

3.4 Repeat 3.1 with your new guess.

Even the wrong theories led me to find the real problem in a short amount of time. Follow strong leads deep, and you will probably find something, maybe not exactly what you expected.

Sometimes validating theories will require code changes and deploys; it is good to have strong evidence that you may be right. Share your findings in the PR/MR before engaging too much in endless code changes hunting errors that may not be what you think they are.

Logs and metrics are your allies; use them as much as you can to validate your theories. Read code to understand how “impossible” things happened. Make sure you are on the same code version as the running system you are investigating.

If you want to know more about Nun-db, star or watch us on GitHub, where I also share all the important posts I write about it. If you want to hear more from me, I also write about books, remote work, and general software development on this blog. You can follow me on Twitter.

Thanks for reading, and see you in the next post.

Written on January 23, 2022