Book review / personal notes - Distributed Tracing in Practice: Instrumenting, Analyzing, and Debugging Microservices

This book was a recommendation from a very senior close friend, which is why I started reading it. When I started the book, I was in the middle of implementing distributed tracing for a customer I was consulting for, and it gave me rich, practical information on the implementation, such as what I might want to store on the traces and the trade-offs between tracing every request and sampling.

The system I was instrumenting did not use microservices, so the complexity was not as high as in the systems described in the book. It took me too long to finish the book, and by the later chapters I was no longer as interested in the subject, since the project had been delivered and was, from my point of view, a success. Nevertheless, let me present a few core ideas from the book.

The Problem with Distributed Tracing

In the first chapter, the author defines the scope of the problem; there is a bit of history, too, about how the book came to be and why it is important in today’s context.

Spans

Distributed traces today are built on the concept of a span. Each span represents one operation, and the combination of spans, connected to each other by timing, tells the history of one “request”. A span contains tags that you can use to store information that may be useful for later debugging.
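To make the idea concrete, here is a minimal sketch of what a span might hold. It is not taken from the book or from any tracing library; the `Span` class, its field names, and the example tag are my own assumptions for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """Hypothetical, simplified span: one timed operation inside a trace."""

    name: str
    trace_id: str                      # shared by every span of one request
    parent_id: Optional[str] = None    # None for the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    tags: Dict[str, str] = field(default_factory=dict)

    def child(self, name: str) -> "Span":
        """Start a child span that belongs to the same trace."""
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self) -> None:
        """Record the end time of the operation."""
        self.end = time.time()

    @property
    def duration(self) -> Optional[float]:
        """Seconds between start and end, or None while still running."""
        return None if self.end is None else self.end - self.start


# One request: a root span with a child span for the database call.
root = Span(name="GET /orders", trace_id=uuid.uuid4().hex)
db = root.child("SELECT orders")
db.tags["db.system"] = "postgresql"   # illustrative tag
db.finish()
root.finish()
```

The shared `trace_id` plus the `parent_id` link is what lets a tracing backend stitch the individual spans back into the history of one request.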

Important information to hold when implementing distributed tracing

This small list may seem obvious, but if you forget some of these items, you may regret not having them to filter on later. In my case, when I first implemented tracing in my application, I forgot to add the backend version that processed the request, and that turned out to be one of the first pieces of information I needed when debugging important problems after distributed tracing was deployed. Here is my own small list of must-have information for the tags.

Software version (What client version made the request? What commit hash was the backend running when the request was processed?)
Tenant information (organization, user, unit, or any other identifier that helps you cluster the information)
Timing (When did it happen? Most tools have this out of the box)
Parent request (or source of the action: what service originated the action?)
Hardware information (What pod processed this request? What node?)
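The list above could be turned into a small helper that builds the same tags for every span. This is only a sketch: the tag keys and the environment variable names (`GIT_COMMIT`, `POD_NAME`, `NODE_NAME`) are hypothetical and would have to be set by your own build and deployment.

```python
import os
from datetime import datetime, timezone
from typing import Dict, Optional


def base_tags(
    client_version: str,
    tenant_id: str,
    user_id: str,
    parent_service: Optional[str] = None,
) -> Dict[str, str]:
    """Build the tags I would want on every span (hypothetical key names)."""
    tags = {
        # Software version: which client called, which backend build answered.
        "client.version": client_version,
        "backend.commit": os.environ.get("GIT_COMMIT", "unknown"),
        # Tenant information, to cluster traces per organization or user.
        "tenant.id": tenant_id,
        "user.id": user_id,
        # Timing: most tools record this already; shown for completeness.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        # Hardware: which pod and node processed the request.
        "pod.name": os.environ.get("POD_NAME", "unknown"),
        "node.name": os.environ.get("NODE_NAME", "unknown"),
    }
    if parent_service is not None:
        # Parent request: which service originated the action.
        tags["parent.service"] = parent_service
    return tags


tags = base_tags("ios-4.2.0", tenant_id="acme", user_id="u-42",
                 parent_service="checkout")
```

Building the tags in one place makes it harder to forget one of them, which is exactly the mistake I made with the backend version.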

Why did the span model win?

There were other attempts to solve the same problem of “debugging” in production with other models, but the span representation is the clear winner. The author gives several reasons why the span model won over the models that existed before; I list some of them here:

The first system to need this was a web system, and therefore a request/response system was a perfect match to the span model.
Spans abstract the request/response model well and work well to represent remote procedure calls.
The abstraction is easy to implement in subsystems.
Most systems are request/response based, which favors the simple model even if some aspects are lost.
It is simple and adaptable to most situations; simplicity often wins in life, even if there are losses.

Tracing will probably remain span-based forever. Spans are easy to scale because they are decoupled, and as long as you can relate them using context propagation, it is easy to rebuild the history of a request.
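As an illustration of context propagation, the sketch below passes the trace ID and the caller's span ID between two services in a header shaped like the W3C `traceparent` header (version, trace ID, parent span ID, flags). It is a simplified assumption of how this could look, not a complete implementation of the specification; the `inject` and `extract` helpers are my own names.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# Shape of a `traceparent` value: version-traceid-parentid-flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject(headers: Dict[str, str], trace_id: str, span_id: str) -> None:
    """Caller side: write the current trace context into outgoing headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"  # 01 = sampled


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Callee side: return (trace_id, parent_span_id), or None if missing/invalid."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None


# Service A starts a trace and calls service B.
trace_id = secrets.token_hex(16)   # 32 hex characters
span_a = secrets.token_hex(8)      # 16 hex characters
outgoing: Dict[str, str] = {}
inject(outgoing, trace_id, span_a)

# Service B reads the header and can continue the same trace,
# using span A as the parent of its own new span.
received = extract(outgoing)
```

Each service only needs the incoming header to attach its spans to the right trace, which is why the spans themselves can stay decoupled.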

Why are spans not enough?

Spans won as the model, but they are not perfect for all cases; the author lists a few kinds of systems that will be hard to fit into span models when implementing distributed tracing.

Streaming systems, where the end of an event's processing is not clear and the system processes the information as it comes, can be difficult to represent in a span-only system.
When a request may call multiple downstream systems and later join them to give one or more responses (graph), some systems may “give up” on a response if it takes more than a certain amount of time. (Google search, for example)
Batch requests (batch queue): when many requests are put together and processed all at once, one may delay the other 10, and it is hard to find which one is to blame.
Decoupled dependencies: for example, in the PubSub case, where one request puts a message into a queue and that queue is processed at a later time, spans can represent that, but they will not give a clear view. One message may be processed by many clients, and in this case the trace may look messy, or a complicated way to trace it may need to be created.
Machine learning: while performing inference, it is easy to map the work to a request/response model, but for training, it is not that simple.

Causality?

Traces can be useful for identifying some slowness. But what do you do once you’ve found it? How do you determine why it was slow? It can be complicated, and usually you will need more tools to investigate further. Reading the code, reading the metrics from that time, and reviewing the communication in that span may help. You may also deploy another version with new sub-steps instrumented to help identify which sub-step is slow within the slow step. Today, this work requires a lot of brainpower and coffee, plus instrumenting sub-steps to identify the slow ones. You may also register the sub-steps as events in the trace.
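Instrumenting sub-steps can be as simple as a small timing helper. The sketch below is one possible way to do it with a context manager that records each sub-step as an event; `sub_step`, `handle_request`, and the step names are made up for the example, and the `time.sleep` call only stands in for real work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def sub_step(events: List[Dict[str, object]], name: str) -> Iterator[None]:
    """Time one sub-step and append it to the list of events for the span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the sub-step raises, so failures are visible too.
        elapsed_ms = (time.perf_counter() - start) * 1000
        events.append({"name": name, "duration_ms": elapsed_ms})


def handle_request(events: List[Dict[str, object]]) -> None:
    """Hypothetical request handler split into two instrumented sub-steps."""
    with sub_step(events, "load_from_db"):
        time.sleep(0.02)            # stand-in for a database query
    with sub_step(events, "render"):
        sum(range(1000))            # stand-in for cheap processing


events: List[Dict[str, object]] = []
handle_request(events)
slowest = max(events, key=lambda e: e["duration_ms"])
```

With the events attached to the span, the slow sub-step shows up in the trace itself instead of requiring a second round of guessing.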

Beyond distributed tracing

Anticipating Problems: Code and infrastructure require instrumentation. When you do not anticipate what could go wrong, you may miss one detail that will be crucial when trying to debug a problem. For example, if one request is slow and you only inspect the full request, you may not know whether it was the database, the processing, or some other component that took the time.
Completeness Versus Cost: You would probably want to trace all requests, but will the cost be worth it? How frequently will the traces be used? It is a trade-off; you can decide to use sampling, but in that case you may lose some requests (or calls) that you would like to debug. There is no perfect solution here yet. One option may be to trace requests only if they take longer than X ms, but that is hard to implement with multiple downstream services.
Open-Ended Use Cases: Tracing adds little to no value for a single-span request (agreed). There are other options that propagate the tags instead of the trace ID, which may be better in some cases but not all.
Is it “All or nothing”: It is costly, and we need to think about whether storing it all is worth it for our case. Think about being able to enable and disable tracing on demand to control cost. It may not be worth instrumenting all your code at once in the first version; for recurring errors, you will most likely have the opportunity to instrument later. Keep in mind that this will have a cost in users' goodwill.
It does not need to be an “All or nothing” solution: it may be useful to collect very detailed traces in some cases and keep others as simple as possible (detailing them later if they turn out to be important).
The hardest problem is not propagating context and rebuilding the history at the end; it is the borders of your system. Make sure to think about how your system will propagate the context across those borders.
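The completeness versus cost trade-off above can be sketched as two sampling decisions. This is an illustration under my own assumptions, not the book's implementation: `head_sample` decides up front from the trace ID, and `tail_sample` decides after the fact, keeping a trace only if some span took longer than a threshold.

```python
from typing import List


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based sampling: decide up front, from the trace ID alone.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is either kept whole or dropped whole.
    """
    bucket = int(trace_id[-8:], 16) / 2**32   # maps the ID into [0, 1)
    return bucket < rate


def tail_sample(span_durations_ms: List[float], threshold_ms: float) -> bool:
    """Tail-based sampling: decide once all spans of the trace are known.

    Keeps the trace if any span was slower than the threshold. The cost is
    that every span must be buffered somewhere until the decision is made.
    """
    return any(d > threshold_ms for d in span_durations_ms)


# Head-based: keep roughly 10% of traces, chosen by trace ID.
keep_by_id = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", rate=0.10)

# Tail-based: keep only traces containing a span slower than 500 ms.
keep_fast = tail_sample([12.0, 30.5], threshold_ms=500)
keep_slow = tail_sample([12.0, 730.0], threshold_ms=500)
```

Head-based sampling is cheap but may drop exactly the request you wanted to debug; the tail-based variant keeps the slow ones but needs every downstream service to hold its spans until the whole trace is finished, which is the difficulty mentioned above.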

The future of context propagation

The author cites and describes a few tools and techniques, tries to predict how context propagation will evolve, and shows a few research tools and how they solve (or try to solve) the problem. This chapter was not particularly interesting to me, as I am not following closely how those tools are evolving. Some time has passed since the book's publication, and no revolutionary approach has become popular yet, at least in this field.

Who should read the book?

Anyone involved in developing distributed systems should read (or at least cast a quick eye over) this book; it gives a clear view of how distributed tracing can help you debug real problems in production. Yet it is neither an easy nor a fast read. I would suggest skimming the chapters that do not interest you and focusing on the points that add the most value for the problems you are facing at the moment.

For example, the sampling chapter caught my attention more than the others when considering the trade-offs on what to store and how; storing everything can be and probably will be costly, so you need to think about the level of detail you will really need.

Conclusion

This was a good book but a challenging read. I am storing the notes for future reference, but I will not make this book a standard recommendation for any particular role. It is worth reading when implementing distributed tracing in your application. If you are on the journey of becoming an architect or a performance engineer, try to read it without the obligation to read it cover to cover in a few sessions.

Where to find it

Amazon

Written on May 14, 2023