What is Tracing
Summary
Tracing (or distributed tracing) is essentially "visual logging", but being visual makes it more useful in certain scenarios (e.g. performance management), makes tools and developers approach it differently, and introduces storage limitations.
The article uses 2 example tools:
AWS X-Ray and
Datadog APM
Unlike logging, tracing can nest events.
Here is an example with pseudo-code
@Path("/user/add")
handleWebRequest() {
runSqlQuery("select * from table")
doSomethingElse()
makeOutgoingWebRequest("https://example.com")
}
Imagine a problem where we want to investigate performance. Here is how it would look in tracing (mimicking a beautiful UI with simple text here). Scroll horizontally if needed.
/user/add |--------------------------------------------| (5 seconds)
select * from table |-----------------------------| (3.5 seconds)
https://example.com |----| (1 second)
Look how easy it is to spot the performance culprit here (the sql query).
Because tracing is so useful for spotting performance bottlenecks, it's sometimes called "Application Performance Management" (APM for short). For example, Datadog's tracing tool is called exactly that -
Datadog APM.
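The nesting that makes this view possible can be sketched with a toy span recorder. This is illustrative only - Span and its methods are invented here, not any real tracing API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy span recorder: each span knows its children, so a tool can render the tree.
class Span {
    final String name;
    final List<Span> children = new ArrayList<>();
    long startNanos, endNanos;

    Span(String name) {
        this.name = name;
        this.startNanos = System.nanoTime();
    }

    // Start a nested span; this is what tracing has and flat logging lacks.
    Span child(String childName) {
        Span c = new Span(childName);
        children.add(c);
        return c;
    }

    void finish() {
        this.endNanos = System.nanoTime();
    }

    // Render the tree with indentation, mimicking the ASCII timeline above.
    void print(String indent) {
        System.out.println(indent + name + " (" + (endNanos - startNanos) / 1_000_000 + " ms)");
        for (Span c : children) c.print(indent + "  ");
    }
}

public class TraceDemo {
    public static void main(String[] args) {
        Span root = new Span("/user/add");              // the basic unit of work
        Span sql = root.child("select * from table");
        sql.finish();
        Span http = root.child("https://example.com");
        http.finish();
        root.finish();
        root.print("");
    }
}
```

Real tools record spans like this behind the scenes and draw them as horizontal bars instead of indented text.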
Let's try to use logging for the same thing.
Here is what the code would look like (don't look too closely, just appreciate all the logging noise added):
@Path("/user/add")
handleWebRequest() {
int start = now()
logInfo("starting request")
int startSql = now()
logInfo("starting sql query")
runSqlQuery("select * from table")
logInfo("finished sql query in " + format(startSql, now()))
doSomethingElse()
int startOutgoingRequest = now()
logInfo("starting outgoing request")
makeOutgoingWebRequest("https://example.com")
logInfo("finished outgoing request in " + format(startOutgoingRequest, now()))
logInfo("finished request in " + format(start, now()))
}
And we get something like this in the logs
starting request
starting sql query
finished sql query in 3.5 seconds
starting outgoing request
finished outgoing request in 1 second
finished request in 5 seconds
Now imagine 2 web requests hitting your server at the same time: the logs will be
intermingled, making it insanely difficult to see what's going on. There are workarounds, like
adding more context to the logs so you have more filtering options, but none are ideal. A tracing tool would normally give you the list of "basic unit of work"
events (web requests in our case):
/user/add 2000-01-01T00:00:00Z (5 seconds)
/user/add 2000-01-01T00:00:01Z (15 seconds)
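The "more context" workaround mentioned above boils down to tagging each log line with a request id and filtering on it. A toy sketch (the id format and helper names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class CorrelatedLogs {
    static final List<String> LOG = new ArrayList<>();

    // Prefix every line with a request id so intermingled logs can be untangled.
    static void logInfo(String requestId, String msg) {
        LOG.add("[" + requestId + "] " + msg);
    }

    // Filtering by id recovers one request's story out of the noise.
    static List<String> linesFor(String requestId) {
        return LOG.stream()
                  .filter(line -> line.startsWith("[" + requestId + "] "))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two concurrent requests write interleaved lines.
        logInfo("req-1", "starting request");
        logInfo("req-2", "starting request");
        logInfo("req-1", "starting sql query");
        logInfo("req-2", "finished request in 15 seconds");
        logInfo("req-1", "finished request in 5 seconds");
        linesFor("req-1").forEach(System.out::println);
    }
}
```

It works, but you still end up reading flat text and doing timing arithmetic in your head - which is exactly what a tracing UI does for you.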
Then you would drill down to one of those trace events and investigate one request in isolation. Now let's explore what a "basic unit of work" is.
Because tracing is nested, a "basic unit of work" needs to be defined.
Otherwise your tracing tool won't be able to do the nesting: it needs to know what the top-level events are. Here are examples of "basic units of work":
- web requests
- thread pool tasks
- if your app is a sql storage engine, then serving sql queries will be its top-level events, i.e. basic units of work
Normally your tracing tool would start tracing web requests automatically, among other things it detects. But sometimes the developer needs to give it a hint for things like
thread pool tasks.
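One way such a hint can look: wrap the Runnable so the tool sees a top-level span around each task. A hand-rolled sketch (TRACES is just a stand-in for the tool's storage; real tools ship their own wrappers):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TracedTasks {
    static final List<String> TRACES = new CopyOnWriteArrayList<>();

    // Wrap a task so it becomes a "basic unit of work" for the tracing tool.
    static Runnable traced(String name, Runnable task) {
        return () -> {
            TRACES.add("start " + name);
            try {
                task.run();
            } finally {
                TRACES.add("finish " + name);   // finish the span even if the task throws
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(traced("nightly-report", () -> { /* the scheduled work */ }));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(TRACES);
    }
}
```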
Because tracing is nested - it makes sense to trace almost everything without making too much noise.
It's the opposite with logging - you would normally dislike too many logs, as they make analysis difficult. For example, logging every sql query could be disabled by
default in your sql client library. Tracing every single sql query is perfectly fine, because those queries are nested within "basic units of work" (e.g. web
request top-level events).
On the other hand, tracing absolutely everything would be insane, so normally we trace long-running things and things that involve physical interactions with other
systems (sql queries are a good example). Tracing physical interactions with other systems is why some people use the term
distributed tracing instead of just tracing.
In real distributed tracing the traces are propagated among microservices using http headers. That is just an example and goes beyond the scope of this article.
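As a rough illustration of that propagation: the caller attaches the trace id to the outgoing request's headers, and the callee reads it back. Header names differ per tool; the "traceparent" name from the W3C Trace Context standard is borrowed here just for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class TracePropagation {
    static final String TRACE_HEADER = "traceparent"; // W3C-style name, illustrative only

    // Caller side: attach the current trace id to the outgoing request headers.
    static Map<String, String> inject(String traceId, Map<String, String> headers) {
        headers.put(TRACE_HEADER, traceId);
        return headers;
    }

    // Callee side: continue the same trace if the header is present, else start a new one.
    static String extractOrStart(Map<String, String> headers) {
        return headers.getOrDefault(TRACE_HEADER, UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        Map<String, String> headers = inject("trace-123", new HashMap<>());
        System.out.println(extractOrStart(headers)); // the callee joins trace-123
    }
}
```

This is how one trace ends up spanning several services: everyone reports spans under the same trace id.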
Because it makes sense to trace almost everything - the tracing tools usually do this for you out of the box.
For example, AWS X-Ray
gives you a set of tools for Java (like a servlet filter and a jdbc
interceptor), so web requests and sql queries are traced out of the box. Datadog APM takes another approach: it
gives you a Java agent that detects things like sql queries, rabbit events, web
requests, etc., so you don't even need to touch your code.
Because tracing tools trace almost everything, developers may not realize they still need to tweak things.
Consider the code example
@Path("/user/add")
handleWebRequest() {
runSqlQuery("select * from table")
querySomeExoticDatabase()
makeOutgoingWebRequest("https://example.com")
}
and this trace (scroll horizontally if needed):
/user/add |--------------------------------------------| (5 seconds)
sql |--| (0.2 seconds)
https://example.com |-| (0.1 second)
There is a huge blind spot in the middle. The developer needs to
eliminate all such blind spots to make use of tracing. What we know here is that the middle
is taken by something that is not automatically traced by the tool (not sql, not web, etc). It could be some exotic database or some CPU-intensive operation. We need to wrap
the querySomeExoticDatabase() method in a custom trace (we know our code, and we know this method can take time). Normally that's super easy in these tools; for example, in
Datadog APM you would just put the @Trace annotation on the method
if it's Java. Then you get the blind spot covered and can see what takes most of the time (scroll horizontally if needed):
/user/add |--------------------------------------------| (5 seconds)
select * from table |--| (0.2 seconds)
querySomeExoticDatabase |-----------------------------------| (4.7 seconds)
https://example.com |-| (0.1 second)
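Conceptually, such a custom trace is just timing the call and recording it as a named span. A hand-rolled sketch of what an annotation like @Trace amounts to (the traceCall helper and SPANS map are invented here, not a real API):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class CustomTrace {
    // name -> duration in nanoseconds, in call order
    static final Map<String, Long> SPANS = new LinkedHashMap<>();

    // Time a call and record it as a custom span, covering the blind spot.
    static <T> T traceCall(String name, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            SPANS.put(name, System.nanoTime() - start);
        }
    }

    public static void main(String[] args) {
        String rows = traceCall("querySomeExoticDatabase",
                                () -> "exotic result");  // stand-in for the slow call
        System.out.println(rows + " -> " + SPANS.keySet());
    }
}
```

An annotation-based tool generates essentially this wrapper for you at runtime, so your business code stays clean.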
In logging you would normally start with nothing, so you become aware right away that you need to put logs here and there. In tracing, developers get the wrong expectation
that the tracing tool traces everything. Then they mistakenly decide the tracing tool is useless when they see traces with huge blind spots.
Even some top-level events may not be traced by default. A good example is thread pool tasks. Your app may not serve web requests at all, running only scheduled tasks.
In that case the tracing tool would trace nothing out of the box unless you wrap those tasks in custom traces.
Because tracing tools trace almost everything - we hit storage limitations.
Tracing tools would normally have some sort of
"sampling" enabled by default; for example, only 1 web request per second is traced. This is fine when all you care about is
performance investigations, but
sometimes you would really want to disable sampling. Good examples are when:
- You are sure there won't be too many events
- You really want to keep track of every event for audit purposes (much like logging is often used)
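That kind of default sampling can be sketched as a simple rate limiter. A toy sampler with an explicit clock parameter for determinism - not any tool's actual algorithm:

```java
public class RateSampler {
    private final long intervalMillis;
    private long lastSampledAt;

    RateSampler(long intervalMillis) {
        this.intervalMillis = intervalMillis;
        this.lastSampledAt = -intervalMillis; // so the very first request is kept
    }

    // Keep at most one trace per interval; drop the rest.
    synchronized boolean shouldSample(long nowMillis) {
        if (nowMillis - lastSampledAt >= intervalMillis) {
            lastSampledAt = nowMillis;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RateSampler sampler = new RateSampler(1000);    // 1 trace per second
        System.out.println(sampler.shouldSample(0));    // first request: kept
        System.out.println(sampler.shouldSample(200));  // same second: dropped
        System.out.println(sampler.shouldSample(1200)); // next second: kept
    }
}
```

Disabling sampling for audit-style use amounts to always returning true here - at the cost of the storage limitations discussed above.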