What is Tracing - metaruslan's blog
Home

What is Tracing

Summary

Tracing (or distributed tracing) is just "visual logging", but by being visual it becomes more useful in certain scenarios (e.g. performance management), makes the tracing tools and developers approach it differently and also presents storage limitations.
The article uses 2 example tools: AWX X-Ray and Datadog APM

Unlike logging tracing can nest events.

Here is an example with pseudo-code
@Path("/user/add")
handleWebRequest() {
    runSqlQuery("select * from table")
    doSomethingElse()
    makeOutgoingWebRequest("https://example.com")
}
Imagine a problem where we want to investigate performance. Here is how it would look in tracing (mimicking a beautiful UI with simple text here). Scroll horizontally if needed.
        /user/add            |--------------------------------------------| (5 seconds)
          select* from table  |-----------------------------| (3.5 seconds)
          https://example.com                                       |----| (1 second)
Look how easy it is to spot the performance culprit here (the sql query). Because tracing is so useful to spot performance bottlenecks - sometimes it's called "Application Performance Management" (APM for short). For example, Datadog tracing tool is called APM - Datadog APM.
Let's try to use logging for the same thing, here is what the code would look like (don't look too close, just appreciate all the logging noise added)
@Path("/user/add")
handleWebRequest() {
    int start = now()
    logInfo("starting request")

    int startSql = now()
    logInfo("starting sql query")
    runSqlQuery("select * from table")
    logInfo("finished sql query in " + format(startSql, now())

    doSomethingElse()

    int startOutgoingRequest = now()
    logInfo("starting outgoing request")
    makeOutgoingWebRequest("https://example.com")
    logInfo("finished outgoing request in " + format(startOutgoingRequest, now())

    logInfo("finished request in " + format(start, now()))
}
And we get something like this in the logs
        starting request
        starting sql query
        finished sql query in 3.5 seconds
        starting outgoing request
        finished outgoing request in 1 second
        finished request in 5 seconds
Now imagine 2 web requests hit your server at the same time, the logs will be intermingled and thus insanely difficult to see what's going on. There are workarounds like putting more context to logs, so you could have more filtering options, but none are ideal. A tracing tool would normally give you the list of "basic units of work" events (web requests in our case)
        /user/add 2000-01-01T00:00:00Z (5 seconds)
        /user/add 2000-01-01T00:00:01Z (15 seconds)
Then you would drill down to one of those and investigate one request in isolation. Now let's explore what's "basic unit of work".

Because tracing is nested - "basic unit of work" needs to be defined.

Otherwise your tracing tool won't be able to do the nesting, it needs to know what are the top-level events. Here are example of "basic units of work": Normally your tracing tool would start tracing web requests automatically among other things it detects. But sometimes the developer needs to put a hint to things like thread pool tasks.

Because tracing is nested - it makes sense to trace almost everything without making too much noise.

It's the opposite with logging - you would normally dislike too many logs as it makes it difficult for analysis. For example, logging every sql query could be disabled by default in your sql client library. Tracing every single sql query is perfectly fine in tracing, because those sql queries are nested within "basic units of work" (e.g. web request top-level events).
On the other hand tracing absolutely everything would be insane, so normally we would trace long-running things and things that make up physical interactions with other systems (sql queries are a good example). Tracing physical interactions with other systems makes some people use the term distributed tracing instead of just tracing. In real distributed tracing the traces would be propagated among micro services using http headers. That is just an example and goes beyond the scope of the article.

Because it makes sense to trace almost everything - the tracing tools usually do this for you out of the box.

For example, AWS X-Ray would give you a set of tools for Java (like a servlet filter and jdbc interceptor), so you get web requests, sql queries traced out of the box. Another approach is Datadog APM that gives you a java agent that detects thinks like sql queries, rabbit events, web requests, etc and so you don't even need to touch your code.

Because tracing tools trace almost everything - the developer may not realize he/she still needs to tweak.

Consider the code example
@Path("/user/add")
handleWebRequest() {
    runSqlQuery("select * from table")
    querySomeExoticDatabase()
    makeOutgoingWebRequest("https://example.com")
}
and this trace (scroll horizontally if needed):
        /user/add            |--------------------------------------------| (5 seconds)
          sql                 |--| (0.2 seconds)
          https://example.com                                          |-| (0.1 second)
There is a huge blind spot in the middle. The developer needs to eliminate all such blind spots to make use of tracing. What we know here is that the middle is taken by something that is not automatically traced by the tool (not sql, not web, etc). It could be some exotic database or some CPU-intensive operation. We need to wrap the querySomeExoticDatabase() method as a custom trace (we know our code, and we know this method can take a time). Normally it's super easy in those tools, for example, in Datadog APM you would just need to put @Trace annotation to the method if it's Java. Then you get the blind spot covered and see what takes most of the time (scroll horizontally if needed):
        /user/add                   |--------------------------------------------| (5 seconds)
          select* from table         |--| (0.2 seconds)
          querySomeExoticDatabase        |-----------------------------------| (4.7 seconds)
          https://example.com                                                 |-| (0.1 second)

In logging you would normally start with nothing, and thus you become aware right away that you need to put logs here and there. In tracing the developers get the wrong expectations that the tracing tool traces everything. Then they mistakenly decide that the tracing tool is useless when they see such traces with huge blind spots. Even some top-level events may not be traced by default. A good example is thread pool tasks. Your app may not be serving web requests, but doing scheduled tasks only. In that case the tracing tool would trace nothing out of the box unless you wrap those tasks in custom traces.

Because tracing tools trace almost everything - we hit storage limitations.

Tracing tools would normally have some sort of "sampling" enabled by default, for example, only 1 web request per second is traced. This is fine when all you care is performance investigations, but sometimes you would really want to disable sampling. A good example is when

Comments