Writing a prometheus exporter in rust from idea to grafana chart
In this post, I will show my thought process to get the Nun-db Prometheus exporter from idea to POC to repository to the final Grafana chart. I have used Prometheus for a long time, but I never took the time to understand in depth how the exporters work. So it seemed the obvious choice when I started looking for a way to improve Nun-db observability. Observability is a must-have for infrastructure components like Nun-db, and I want to make sure we monitor what we need to make it successful.
If you don’t know, Nun-db is an open-source real-time database made to be fast, light, and easy to use. Read more about it on GitHub.
What is a Prometheus exporter?
In my own words, an exporter is a program that runs close to the component you want to observe, and it needs to be reachable from Prometheus so that Prometheus can scrape the metrics from it.
How Prometheus exporters work
There is excellent documentation on how to write exporters in WRITING EXPORTERS, yet I did not find a high-level explanation of how they work, so here I am drawing what I understood of the process. The first important thing is that you don’t push metrics to Prometheus; it scrapes the exporters to pick up the metrics, as I show in the next diagram.
This was an interesting decision, but it means our exporter has to expose some URLs so Prometheus can capture the metrics.
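To make the pull model a bit more concrete, here is a minimal sketch, using only the Rust standard library, of the plain-text payload an exporter hands back when Prometheus scrapes it. The `render_gauge` and `http_response` helpers, the help text, and the sample value are assumptions made up for illustration; this is not the code of the Nun-db exporter, which relies on a library for this part.

```rust
/// Renders one gauge in the Prometheus text exposition format:
/// a HELP line, a TYPE line, and then the sample itself.
/// (Hypothetical helper, for illustration only.)
fn render_gauge(name: &str, help: &str, value: f64) -> String {
    format!(
        "# HELP {0} {1}\n# TYPE {0} gauge\n{0} {2}\n",
        name, help, value
    )
}

/// Wraps a metrics body in a bare HTTP/1.1 response, which is roughly
/// what a scrape of the exporter URL would receive.
/// (Hypothetical helper, for illustration only.)
fn http_response(body: &str) -> String {
    format!(
        "HTTP/1.1 200 OK\r\nContent-Type: text/plain; version=0.0.4\r\nContent-Length: {}\r\n\r\n{}",
        body.len(),
        body
    )
}

fn main() {
    // The value 22.0 is an arbitrary sample, not a real measurement.
    let body = render_gauge("pending_ops", "Operations pending replication", 22.0);
    print!("{}", http_response(&body));
}
```

Prometheus only needs to reach that URL on its scrape interval; the exporter never has to know where Prometheus lives.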
Writing the exporter
It looks like most exporters are written in Go, but I wanted to use Rust since Nun-db is written in Rust. So I chose the unofficial library from tikv on GitHub.
This exporter will probably grow as Nun-db’s available metrics grow, yet I want to keep it as simple as I can, so I won’t be creating too many abstractions at the moment.
The initial POC
To make sure I understood all the concepts after reading a bit about exporters and going through some of their code, I decided to code a minimal POC. I created a simple version only exposing
My initial code was simple, as I only wanted to make sure Prometheus would be able to scrape it the way I expected. The following block shows my first code version. The comments here are not present in the original code; I added them to make it easier to read if you are not familiar with rust.
At this point, I thought about creating a Docker image and deploying it as a service to my cluster to test, but I ran into too many problems building even a simple Docker image on my M1 Mac. So instead, I decided to run it locally and connect my Prometheus to it through a public tunnel, to see it working quickly.
I decided to use ngrok since it is easy and stable and I am already used to it.
Running the exporter:
cargo run
Listening on http://0.0.0.0:9898
ngrok http 9898
The documentation is not clear on connecting additional exporters to the operator, but the secret is to create additional job configurations in a YAML file like I named
Then generate a secret file to be applied to the cluster.
additional-scrape-configs.yaml file will look like
Then apply the secret to the cluster.
It worked like a charm. Check it in action in the next screenshot.
And the first chart in Grafana.
What a success, though I thought I would get to this point on the first day. In reality, it took me two more days (in the mornings before I started working) to study the Prometheus operator and get it working.
Now it is time to make it real with more metrics.
Deciding what metrics to expose
Now with the essentials working and the confidence that it will serve the purpose, it is time to decide what metrics to expose. Oplog stats are the most obvious as they are already implemented, and pending_ops is the one I used to build the initial POC. Now I will add op_log_file_size and op_log_count since they are returned in the same response, as we can see in the next block.
In this commit, you can see the changes made to accomplish this.
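As a rough illustration of that kind of change, the sketch below pulls the three values out of a single response line. The comma-separated `key: value` format, the `OplogStats` struct, and the `parse_oplog_stats` function are all hypothetical, chosen only to show the idea; they are not Nun-db’s actual wire format nor the code from the commit.

```rust
/// Oplog statistics the exporter would publish as gauges.
/// (Hypothetical struct, for illustration only.)
#[derive(Debug, Default, PartialEq)]
struct OplogStats {
    pending_ops: u64,
    op_log_file_size: u64,
    op_log_count: u64,
}

/// Parses a comma-separated `key: value` line into `OplogStats`.
/// The line format is an assumption, not Nun-db's real response format.
/// Returns `None` if a pair has no colon or a value is not a number.
fn parse_oplog_stats(line: &str) -> Option<OplogStats> {
    let mut stats = OplogStats::default();
    for pair in line.split(',') {
        let (key, value) = pair.split_once(':')?;
        let value: u64 = value.trim().parse().ok()?;
        match key.trim() {
            "pending_ops" => stats.pending_ops = value,
            "op_log_file_size" => stats.op_log_file_size = value,
            "op_log_count" => stats.op_log_count = value,
            _ => {} // ignore fields this sketch does not know about
        }
    }
    Some(stats)
}

fn main() {
    // Made-up sample values.
    let line = "pending_ops: 22, op_log_file_size: 4096, op_log_count: 128";
    println!("{:?}", parse_oplog_stats(line));
}
```

Since all three numbers arrive in one response, one request per scrape is enough to refresh every gauge.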
Now, what else would be great to have? I think the minimum would be query and replication time.
As the exporter does not receive a push for every query or every replication, I needed a way to aggregate the data. Aggregating metrics can lead to a rabbit hole of what stats to expose, and I want this to be working rather than perfect, so for now I will go with something simple that gives me some visibility and improve it as I make progress. After a couple of hours reading around, I found and liked the way this guy did it here, with an exponential moving average (EMA). It is fast to calculate, light on memory, and still sensitive to recent movements in the query or replication time. It does not help if one specific query is slower than the others, but I will deal with debugging slow queries later. For now, I am happy with the EMA idea.
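To show why the EMA is so cheap, here is a minimal sketch of one in Rust. The `Ema` type, the smoothing factor of 0.5, and the sample timings are assumptions for illustration; the version in Nun-db may be shaped differently.

```rust
/// Exponential moving average: one f64 of state, O(1) work per update.
/// (Hypothetical type, for illustration only.)
struct Ema {
    alpha: f64,         // smoothing factor in (0, 1]; higher reacts faster
    value: Option<f64>, // None until the first sample arrives
}

impl Ema {
    fn new(alpha: f64) -> Self {
        Ema { alpha, value: None }
    }

    /// Folds a new sample in: ema = alpha * sample + (1 - alpha) * ema.
    fn update(&mut self, sample: f64) -> f64 {
        let next = match self.value {
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
            None => sample, // seed with the first observation
        };
        self.value = Some(next);
        next
    }
}

fn main() {
    // Made-up query times in milliseconds.
    let mut query_time_ms = Ema::new(0.5);
    for &sample in &[10.0, 20.0, 40.0] {
        println!("{}", query_time_ms.update(sample));
    }
}
```

A smaller alpha smooths more and reacts more slowly. Either way, a single slow query gets blended into the average, which is exactly the limitation mentioned above.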
I decided that monitoring would be a separate module in Nun-db, given its volatility and importance, instead of adding code changes to the already existing modules.
That change resulted in the first commit of this PR in the Nun-db main code. But I decided to change a couple of other points. Let’s see how that will look. I will change the command OplogState to be more general and return all metrics, and call it MetricsState.
The hard part here, in fact, was convincing myself that these metrics were enough. Coding it was surprisingly simple. Now it is time to make the Docker build and push it to Docker Hub. For that, I created 2 GitHub Actions. One for running the Rust unit tests and building, as follows.
And the other to auto-publish the Docker image on each merge to master.
To publish the Docker image automatically, I had to add 2 secrets, DOCKER_PASSWORD and DOCKER_USERNAME, to the GitHub configuration.
And add a minimal Dockerfile to build and expose the service like the following.
git push and now we have the https://hub.docker.com/repository/docker/mateusfreira/nun-db-prometheus-exporter out there :).
Testing it to see if it works
docker run --env NUN_USER=$user --env NUN_PWD=$pwd --env NUN_URL="$nun_url" -p 9898:9898 mateusfreira/nun-db-prometheus-exporter:latest
It worked gracefully. Time to add it to my running cluster. For that, I created a StatefulSet to live close to the Nun-db running in the cluster. The only important point there is the NUN_URL, set to nun-db-0.nun-db.nun-db.svc.cluster.local so it can connect internally to the cluster to collect the stats it needs. The YAML of the StatefulSet is like the following:
I needed to add a service so Prometheus could find the exporter in the internal DNS (odd).
With that, I was able to get my metrics flowing and created the next chart.
Now it is time to get it all wired up and move on to my last activities in Nun-db this year.
These metrics also drew my attention to some operations that seem not to be replicating as expected. Since I published the last version, there have been 22 pending operations, growing by about 2 a day. It is time to fix it. I think I will add a log to see what ops are pending replication and see what I find, but that may be the subject of the next post.
I expected this change to be much simpler than it turned out to be. Initially, I hoped to finish this on a Saturday morning before my wife woke up. Instead, here I am days later, finishing the last details. Yet this is the first move toward a more observable Nun-db, so it is great that I could make it happen in a few days despite the year’s events and my full-time job. It feels excellent to get to the finish line.
Check out the final version in https://github.com/mateusfreira/nundb-prometheus-exporter and watch Nun-db on GitHub for future posts.