A fast-to-sync/search and space-optimized replication algorithm written in Rust: the Nun-db data replication model

I have been working on Nun-db as a side project for over two years. Finally, in June 2020, I got it running as the only real-time database for my main application, and I shared that milestone in that post. Since then, a couple of other applications have started using Nun-db, e.g., a chatbot, real-time visitor tracking for my blog, a house price scraper, a multiplayer tic-tac-toe game, etc.

Most of them are small applications, but they were good enough to test the stability and the main concepts of Nun-db as a viable option. So it was time for the next step: make it ready to scale and safe to use in more prominent use cases. That means not relying on a single VM to keep it running, it means making it distributed, and it means lots of new challenges and lessons. This is the first post of a series on how I am making that possible.

Why is it important?

Replicating data is essential for several reasons. Disaster recovery and high availability are probably the most important ones for databases. In addition, data replication is crucial for Nun-db to help drive adoption by applications with more critical loads and to serve them as their primary database.

The model

Several decisions are necessary if you are thinking of implementing a replication algorithm. Every decision means choosing among some options and embracing the challenges that come with them:

  • NunDB will be single primary: I decided that Nun-db will be a single-leader database because it simplifies the implementation of the distributed algorithms, since the leader is the source of truth. However, it also brings the need for a leader election (I will talk about this shortly as it deserves its own post).
  • NunDB will have eventual consistency: the synchronization between nodes is done asynchronously. For me, it is ok if, for a couple of ms, the value in a replica differs from the leader. On the other hand, I guarantee that if you watch a key (the main feature) from a replica, you will eventually get all the updates (if you are wondering, lots of database providers work like that, MongoDB, for example).
  • NunDB will push all writes to the primary: you can send any write request to a replica, but it will forward the request to the primary.

All of these decisions were made to keep the initial model simple to implement and deliver while keeping the speed for end-users. For that reason, Nun-db won't scale horizontally for writes yet, but we now support many more reading clients, we can survive outages and guarantee availability even with VMs crashing, and we can even do zero-downtime updates.

When all replicas are online, the replication is straightforward. All we do is replicate every event that causes some change, like the set, remove, and snapshot commands, from the primary to the replicas. However, when a new replica set is added to the cluster or recovers from a long time offline, things get more interesting. In the following image, I present what happens in the simple case when all replicas are online; for the rest of the post, I will focus on the more complicated case of a syncing node connecting to the cluster.

  1. A client sends a set command to the primary.
  2. The primary stores that command in the oplog.
  3. The primary replicates the command to all the other nodes (since these happen in parallel, I marked them as 3.1 and 3.2 in the image).

Nun-Db Sync model

A single replication thread guarantees that the oplog file is written in the same order as the replication messages are sent to the replicas.

Still, each replica has its own sync thread, so one slow replica cannot slow down the replication to the others.
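The original post does not show this part as code, but to make those two guarantees concrete, here is a minimal sketch of the idea: a single writer thread consumes change events from a channel, appends them to the oplog, and fans them out to per-replica channels. The types, signatures, and file handling here are my own illustration, not Nun-db's actual API.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Hypothetical change event; the real Nun-db message types are richer.
#[derive(Clone, Debug)]
struct ChangeEvent {
    command: String, // e.g. "set my-key some-value"
}

// Placeholder for the real oplog write (its on-disk format is discussed later).
fn write_op_log(event: &ChangeEvent) {
    println!("oplog <- {}", event.command);
}

// A single replication thread appends to the oplog and fans events out, so the
// oplog order always matches the order in which messages are sent to replicas.
// Each replica gets its own channel, so a slow replica only delays itself.
fn start_replication_thread(rx: Receiver<ChangeEvent>, replicas: Vec<Sender<ChangeEvent>>) {
    thread::spawn(move || {
        for event in rx {
            write_op_log(&event);
            for replica in &replicas {
                let _ = replica.send(event.clone());
            }
        }
    });
}

fn main() {
    let (tx, rx) = channel();
    let (replica_tx, replica_rx) = channel();
    start_replication_thread(rx, vec![replica_tx]);
    tx.send(ChangeEvent { command: "set my-key some-value".to_string() }).unwrap();
    // In the real database, the receiving end would push this over the network.
    println!("replica got: {:?}", replica_rx.recv().unwrap());
}
```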

Once a replica comes back online after some time offline, the following steps happen:

  1. It sends the primary the last operation id it has received.
  2. The primary then searches for all the commands since that id and sends them to the replica so it can get back in sync.

That is usually how this process works. We have made a couple of improvements that make the Nun-db sync process faster, as I will show in the following sections.

Log-based replication

Nun-db being a key-value database is a tremendous advantage for data replication: if a node is up, we can replicate each command live as it happens; if a node went down and came back online, we can replicate only the last state of each key, with no need to replay all the operations that happened in between.

For example, if a key is updated 200 times while a node is offline, once the node is back we don't need to replicate all 200 operations. Instead, we can replicate only one to update the replica with the latest data. Nevertheless, we need a log-based oplog to determine which keys to replicate.

Here is what I would like to accomplish with this implementation:

  1. Fast writes: since every operation ends up being written to disk, the write needs to be fast so it does not delay the database operations.
  2. Space optimized: I run the current instances of Nun-db on small, cheap servers, and I would like to keep it that way for a long time.
  3. Fast to search: the initial idea of the oplog is to serve disaster recovery from a node in the cluster, but if possible, I would like to use it for future “offline” synchronization from the clients.

Testing the writing speed

After reading around, I learned that an append-only log is probably the fastest way to write to a file, so I tested how fast it would be in my case with a single thread.

  1. Single write thread: using an append-only, single-thread write approach, I was able to get 300,000 writes/s, not bad (a sketch of such a benchmark is below).
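The benchmark block itself is not reproduced here, so the following is a minimal sketch of that kind of single-thread append-only test; the file name and record contents are only illustrative:

```rust
use std::fs::OpenOptions;
use std::io::{BufWriter, Write};
use std::time::Instant;

// Rough single-thread append-only write benchmark. File name and record
// contents are illustrative only, not Nun-db's real oplog format.
fn main() -> std::io::Result<()> {
    let file = OpenOptions::new()
        .create(true)
        .append(true)
        .open("bench.oplog")?;
    let mut writer = BufWriter::new(file);

    let total: u64 = 1_000_000;
    let start = Instant::now();
    for i in 0..total {
        // Append one small fixed-size record per iteration.
        writer.write_all(&i.to_le_bytes())?;
        writer.write_all(&[1u8])?; // a one-byte op code, for example
    }
    writer.flush()?;
    let secs = start.elapsed().as_secs_f64();
    println!("{} writes in {:.3}s ({:.0} writes/s)", total, secs, total as f64 / secs);
    Ok(())
}
```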

It is impressive how fast this code runs, and it should be fast enough for me to keep a single write thread at Nun-db's load today (and for a long time; when I get close to hitting 300k+ operations per second, I will be more than happy to come back to this code and improve it).

Optimizing for space and search speed

As mentioned in the last section, we don't need to store the value in the oplog, since we can recover it from the running database once we need it. That is already a big space saver, as the values would probably account for most of the space usage.

Therefore we only need to store a reference to the key and the database, enough to recover the value once required. Nun-db does not limit the size of the keys, so storing the key itself would also require more space. Additionally, I would like each record to have a known size so I could implement the search as a binary search, which would be much faster.

To do that, I have two options: 1) use a stable hash algorithm and store a hash of the key instead of the key itself; 2) create some metadata about the keys so each key has a single id that I can use to convert it back to the original value. (Check Ps1.)

At the moment, it makes little difference to me. I like the first one better because I wouldn't need to worry about keeping the metadata. Still, it would also require a significant change in the internal database objects, so I chose the second one because it would be faster to code.

With that, a Nun-db oplog record will look on disk the way I present in the next block:
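Roughly like the sketch below; the field names are illustrative, and what matters is the fixed 25-byte size made of three 8-byte fields plus a 1-byte operation code.

```rust
// One oplog record as laid out on disk: 8 + 8 + 8 + 1 = 25 bytes.
// Field names are illustrative; the fixed sizes are what matter.
struct OpLogRecord {
    timestamp: u64, // 8 bytes - when the operation happened (used as the operation id)
    db_id: u64,     // 8 bytes - metadata id of the database
    key_id: u64,    // 8 bytes - metadata id of the key (not the key itself)
    opp: u8,        // 1 byte  - operation code: set, remove, snapshot, ...
}
```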

The predictable record size gives me some significant advantages:

  1. Each record uses 25 bytes on disk, which means I can store ~42 million operations in a GB (not too bad).
  2. Fast search: since I know the size of each record, I can jump over any number of records and implement optimized searches. Here is a POC implementation of the search:
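The POC from the original post is not reproduced here; the following sketch shows the same idea, a binary search over fixed-size 25-byte records using the timestamp stored in the first 8 bytes of each record (file name and layout assumed from the record sketch above):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

const RECORD_SIZE: u64 = 25; // 8 (timestamp) + 8 (db id) + 8 (key id) + 1 (op code)

// Binary search over fixed-size records: find the byte offset of the first
// record whose timestamp (first 8 bytes) is >= the id we are looking for.
fn find_offset_since(oplog: &mut File, since: u64) -> std::io::Result<u64> {
    let records = oplog.metadata()?.len() / RECORD_SIZE;
    let (mut lo, mut hi) = (0u64, records);
    while lo < hi {
        let mid = (lo + hi) / 2;
        oplog.seek(SeekFrom::Start(mid * RECORD_SIZE))?;
        let mut buf = [0u8; 8];
        oplog.read_exact(&mut buf)?;
        if u64::from_le_bytes(buf) < since {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    Ok(lo * RECORD_SIZE)
}

fn main() -> std::io::Result<()> {
    // Assumes an oplog file written with the 25-byte records sketched above.
    let mut oplog = File::open("oplog.bin")?;
    let offset = find_offset_since(&mut oplog, 150_000)?;
    println!("operations since id 150000 start at byte {}", offset);
    Ok(())
}
```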

With this implementation, I could find key 150000 in a log with 1 million records in 72.9µs (0.0729 ms); again, impressively fast and more than fast enough for my current load.

Replicating fast

Given that we can write, find, and read the records fast, let's now think about how to speed up the sync over the network. As already mentioned, Nun-db's nature makes the sync easier, since we only need to sync the last state of each key and not every single command.

To implement that, I created a hashmap to hold the operations. The key of the hashmap is ${db_id}_${key_id} (database id and key id concatenated), and the value is the last operation.

First, we find the offset we want to start from and read all the sequential operations, setting each one of them in the hashmap. In the end, the hashmap holds only the last operation for each key in a database.

Check out the next block of code, where I present the important parts of that process.
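Since the original block is not reproduced here, the following is a simplified sketch of the idea rather than the actual read_operations_since / get_pendding_opps_since code: seek to the starting offset, read records sequentially, and let later records overwrite earlier ones in the hashmap.

```rust
use std::collections::HashMap;
use std::convert::TryInto;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

const RECORD_SIZE: usize = 25; // 8 (timestamp) + 8 (db id) + 8 (key id) + 1 (op code)

#[derive(Debug, Clone, Copy)]
struct OpRecord {
    timestamp: u64,
    db_id: u64,
    key_id: u64,
    opp: u8,
}

// Read every record from `start_offset` onward and keep only the last operation
// per "{db_id}_{key_id}" key. A sketch of the idea, not the actual Nun-db code.
fn last_op_per_key(
    oplog: &mut File,
    start_offset: u64,
) -> std::io::Result<HashMap<String, OpRecord>> {
    let mut last_ops = HashMap::new();
    oplog.seek(SeekFrom::Start(start_offset))?;
    let mut buf = [0u8; RECORD_SIZE];
    while oplog.read_exact(&mut buf).is_ok() {
        let record = OpRecord {
            timestamp: u64::from_le_bytes(buf[0..8].try_into().unwrap()),
            db_id: u64::from_le_bytes(buf[8..16].try_into().unwrap()),
            key_id: u64::from_le_bytes(buf[16..24].try_into().unwrap()),
            opp: buf[24],
        };
        // Later records simply overwrite earlier ones for the same key, so in
        // the end only the last operation per key remains.
        last_ops.insert(format!("{}_{}", record.db_id, record.key_id), record);
    }
    Ok(last_ops)
}
```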

Now we need a second step to add the values that we need to send to the new replica set.

As my replication layer expects an array of strings to replicate, it is now time to convert this HashMap into a Vector of Strings; here is how I did it.
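Again, this is a simplified sketch of the idea rather than the code from the repository; the lookup maps below stand in for Nun-db's in-memory databases and keys metadata.

```rust
use std::collections::HashMap;

// Convert the "last operation per key" map into the plain string commands the
// replication layer sends over the wire. The lookup maps are placeholders for
// Nun-db's in-memory databases and keys metadata.
fn pending_ops_as_commands(
    last_ops: &HashMap<String, u8>,           // "{db_id}_{key_id}" -> last op code
    key_names: &HashMap<u64, String>,          // key id -> key name
    current_values: &HashMap<String, String>,  // key name -> current value
) -> Vec<String> {
    last_ops
        .keys()
        .filter_map(|db_and_key| {
            let key_id: u64 = db_and_key.split('_').nth(1)?.parse().ok()?;
            let key = key_names.get(&key_id)?;
            // Only the latest value is sent, no matter how many times the key changed.
            // (A real implementation would also handle remove and snapshot operations.)
            Some(format!("set {} {}", key, current_values.get(key)?))
        })
        .collect()
}
```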

After that, all I need to do is send the vector of strings to the joining node, and it will be in sync with the current node as the primary. Next, I will present a quick video showing the replication in action. In the video, I start with two replicas. I send hundreds of updates to the primary, and we see them replicate live to the secondary; then I add a new node to the cluster (in this case, completely clean). Only three messages are sent to the new node: 1. to create the db, 2. to snapshot the db, and the last one to set the last state. Because Nun-db knows it does not need to sync all the commands, the new node gets in sync with the others in seconds.

I figured the video would be complicated to follow without more context, so, in the following image, I point to the hot spots and the logs for each process.

Video context

Conclusion

You can find the final implementation in the Nun-db repository, mainly in the files disk_ops.rs, in the functions read_operations_since and write_op_log, and replication_ops.rs, in the functions start_replication_thread and get_pendding_opps_since. By the time you are reading this, it may have changed. It is hard to understand how the whole process starts, from the election to the synchronization, so I will keep that for a future post. Still, I think this one can give you some context to read those functions.

I don’t expect this to be a novel replication method, as it is mainly based on well-established theory with a couple of tweaks here and there to leverage the Nun-db model, but it has served me as great learning (which was my main goal when I started Nun-db).

The implementation turned out to work great, and I have to thank the authors of the books (1) Designing Data-Intensive Applications and (2) Principles of Distributed Database Systems, which helped me understand the problems and come up with this final implementation.

Implementing the distributed capabilities of Nun-db has been much harder than I had anticipated, but that also made it much more fun. So far, I have needed to read about three books on distributed systems and database systems and uncountable posts and papers from the internet to get to this point. I am looking forward to testing these new capabilities in my production cluster, deploying them in different data centers, and seeing the issues I may find in the field. All this code is already merged and running as part of my main deployment, but it is not yet a multi-VM deployment, as that is my next step. By the time I am writing this post, it has been running for about four days with no problems (after two months of coding in a branch).

There is still a lot to share and learn on this journey, like the election implementation and how I changed the client libs to support the multi-node cluster. I want to share those in a near-future post.

Ps1: Fixing invalid oplog after restart 28-Jul-21

As we keep a keys metadata file and only store it on disk when a database does a snapshot, it was possible for new keys to be added and the server to be restarted before the keys were saved to disk. In that case, after the restart, the op-log would contain key ids with no equivalent in the stored metadata. To solve this problem, we had two options.

  1. Store new keys synchronously: I did not like this option because it would slow down operations, since they would no longer happen only in memory.

  2. Track the validity/invalidity of the oplog file while new keys are not yet safely saved. This seemed to me the best option. It requires additional control to know whether the oplog file is safe, but I can maintain a file with a single byte on disk to keep that flag across restarts.

I decided to go with option 2, focusing on performance over safety. So, next, I present the main changes I needed to make (with a rough sketch of the flag idea after the list).

  • Check at startup whether the oplog is valid, and clean it up if not, here.
  • Mark the op-log as invalid when new keys are added, here.
  • Mark the op-log as valid when the keys file is saved, here.
  • Finally, I added a trap for the terminate signal to save the keys file before shutting down the process, to ensure this error rarely happens, here.
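As a rough illustration of the single-byte flag idea (file names and helper functions here are hypothetical, not the actual Nun-db code linked above):

```rust
use std::fs;
use std::io;

// Hypothetical path; the real implementation uses its own file layout.
const OPLOG_STATE_FILE: &str = "oplog-state.dat";

// Mark the op-log invalid (0) whenever a new key exists only in memory, and
// valid (1) again once the keys metadata file is safely written to disk.
fn mark_oplog(valid: bool) -> io::Result<()> {
    fs::write(OPLOG_STATE_FILE, [if valid { 1u8 } else { 0u8 }])
}

// On startup: if the flag says the op-log may reference unsaved key ids,
// drop it and fall back to a full sync.
fn clean_oplog_if_invalid(oplog_file: &str) -> io::Result<()> {
    let valid = fs::read(OPLOG_STATE_FILE).map(|b| b == [1u8]).unwrap_or(false);
    if !valid {
        let _ = fs::remove_file(oplog_file); // safe: a missing op-log triggers a full sync
        mark_oplog(true)?;
    }
    Ok(())
}
```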

It is safe to clean the op-log because the database will do a full sync if the op-log is missing. Check the full PR here.

References

  1. Kleppmann, M. (2017). Designing Data-Intensive Applications. Beijing: O’Reilly. ISBN: 978-1-4493-7332-0

  2. Özsu, M. T. and Valduriez, P. (2011). Principles of Distributed Database Systems (3rd ed.). Springer Publishing Company.

Reviewers

@Daymannovaes @igoroctaviano

Written on May 23, 2021