Monday, 27 March 2017

Named thread flamegraphs

After watching a great talk by Nitsan Wakart at this year's QCon London, I started playing around with flamegraphs a little more.

To get the gist of Nitsan's talk, you can read his blog post Java Flame Graphs Introduction: Fire For Everyone!.

The important thing to take away is that the collapsed stack files that are processed by Brendan Gregg's FlameGraph scripts are just text files, and so can be filtered, hacked, and modified using your favourite methods for such activities (your author is a fan of grep and awk).
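Since each line of a collapsed stack file holds one semicolon-separated stack followed by a trailing sample count, filtering is just line matching. Here is a minimal Python sketch of the idea (the function name and usage are my own illustration, not part of the FlameGraph tooling):

```python
def filter_stacks(lines, needle):
    # a collapsed-stack line looks like:
    #   thread;frame1;frame2;...;frameN 42
    # where the trailing number is the sample count;
    # keep any stack that mentions the given substring
    return [line for line in lines if needle in line]

# usage sketch: keep only GC-related stacks, then feed the
# survivors back into to render a narrower graph
# with open("out-java.collapsed") as f:
#     matching = filter_stacks(f, "GC")
```

A plain grep achieves the same thing, of course; the point is that the intermediate format is friendly to whatever text tooling you prefer.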

The examples in this post are generated using a fork of the excellent perf-map-agent. I've added a couple of scripts to make these examples easier.

To follow the examples, clone this repository.

Aggregate view

One feature that was demonstrated in Nitsan's talk was the ability to collapse the stacks by Java thread. Usually, flamegraphs show all samples aggregated into one view. With thread-level detail though, we can start to see where different threads are spending their time.

This can be useful information when exploring the performance of systems that are unfamiliar.

Let's see what difference this can make to an initial analysis. For this example, we're going to look at a very simple microservice built on dropwizard. The service does virtually nothing except echo a response, so we wouldn't expect the business logic to show up in the trace. Here we are primarily interested in looking at what the framework is doing during request processing.

We'll take an initial recording without a thread breakdown:

$ source ./etc/

$ ./bin/perf-java-flames 7731

You can view the rendered svg file here: flamegraph-aggregate-stacks.svg.

Aggregated stacks

From the aggregated view we can see that most of the time is spent in framework code (jetty, jersey, etc), a smaller proportion is in log-handling (logback), and the rest is spent on garbage collection.

Thread breakdown

Making the same recording, but this time with the stacks assigned to their respective threads, we see much more detail.

$ source ./etc/

$ ./bin/perf-java-flames 7602

In flamegraph-thread-stacks.svg we can immediately see that we have five threads doing most of the work of handling the HTTP requests; they all have very similar profiles, so we can reason that these are probably threads belonging to a generic request-handler pool.

Thread stacks

We can also see another thread that spends most of its time writing log messages to disk. From this, we can reason that the logging framework has a single thread for draining log messages from a queue - something that was more difficult to see in the aggregated view.

Now, we have made some assumptions in the statements above, so is there anything we can do to validate them?

Annotating thread names

With the addition of a simple script to replace thread IDs with thread names, we can gain a better understanding of thread responsibilities within the application.
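The script itself isn't reproduced here, but the idea can be sketched in a few lines of Python: build a tid-to-name map from jstack output (each thread header line carries a quoted name and a native thread id, nid, in hex), then rewrite the leading frame of each collapsed stack. The function names and the exact java-&lt;tid&gt; token format below are my own illustration, not the real script:

```python
import re

def parse_jstack(jstack_output):
    # map OS thread id (decimal) -> java thread name, from lines like:
    #   "dw-21" #21 prio=5 os_prio=0 tid=0x... nid=0x1e0f runnable ...
    mapping = {}
    for m in re.finditer(r'^"([^"]+)".*?nid=(0x[0-9a-f]+)', jstack_output, re.M):
        mapping[int(m.group(2), 16)] = m.group(1)
    return mapping

def rename_stacks(collapsed_lines, tid_to_name):
    # assume each collapsed stack starts with a "java-<tid>" frame (illustrative)
    renamed = []
    for line in collapsed_lines:
        head, _, rest = line.partition(";")
        m = re.match(r"java-(\d+)", head)
        if m and int(m.group(1)) in tid_to_name:
            head = tid_to_name[int(m.group(1))]
        renamed.append(head + ";" + rest)
    return renamed
```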

Let's capture another trace:

$ source ./etc/

$ ./bin/perf-thread-flames 8513

Although flamegraph-named-thread-stacks.svg looks identical when zoomed-out, it contains one more very useful piece of context.

Named stacks

Rolling the mouse over the base of the image shows that the five similar stacks are all from threads named "dw-XX", giving a little more evidence that these are dropwizard handler threads.

There are a couple of narrower stacks named "dw-XX-acceptor"; these are the threads that manage incoming connections before they are handed off to the request processing thread pool.

Further along is a "metrics-csv-reporter" thread, whose responsibility is to write out performance metrics at regular intervals.

The logging framework thread is now more obvious when we can see its name is "AsyncAppender-Worker-async-console-appender".

With nothing more than an external script, we can now infer the following about our application:

  • this application has a request-handling thread-pool
  • fronted by an acceptor thread
  • logging is performed asynchronously
  • metrics reporting is enabled

This kind of overview of system architecture would be much harder to piece together by just reading the framework code.


Filtering by thread name

Now that we have this extra context in place, it is a simple matter to filter the flamegraphs down to a finer focus.

The perf-thread-grep script operates on the result of a previous call to perf-java-flames (as seen above).

Suppose we just wanted to look at what the JIT compiler threads were doing?

$ source ./etc/

$ ./bin/perf-thread-grep 8513 "Compiler"


Compiler threads

Or to focus in on threads that called into any logging functions?

$ source ./etc/

$ ./bin/perf-thread-grep 8513 "logback"


Logging functions


Annotating flamegraphs with java thread names can offer insight into how an application's processing resources are configured. We can use this extra context to easily zoom in on certain functionality.

This technique is particularly powerful when exploring unfamiliar applications with large numbers of threads, whose function may not be immediately obvious.

My learned correspondent Nitsan has suggested that I'm being lazy by using jstack to generate the thread-name mapping. His main complaint is that it causes a safe-point pause in the running application. To make these scripts a little more lightweight, I will explore retrieving the thread names via JVMTI or another low-level interface. But that's a blog post for another day.

Friday, 12 August 2016

Lightweight tracing with BCC

Ever since I read some initial blog posts about the upcoming eBPF tracing functionality in the 4.x Linux kernel, I have been looking for an excuse to get to grips with this technology.

With a planned kernel upgrade in progress at LMAX, I now have access to an interesting environment and workload in order to play around with BCC.

BPF Compiler Collection

BCC is a collection of tools that allows the curious to express programs in C or Lua, and then load those programs as optimised kernel modules, hooked in to the runtime via a number of different mechanisms.

At the time of writing, the motivation subsection of the BCC README explains this best:

  • End-to-end BPF workflow in a shared library
  • A modified C language for BPF backends
  • Integration with llvm-bpf backend for JIT
  • Dynamic (un)loading of JITed programs
  • Support for BPF kernel hooks: socket filters, tc classifiers, tc actions, and kprobes

Tracking process off-cpu time

One of the more in-depth investigations I've been involved in lately is tracking down a throughput problem on one of our services. We have a service that will sometimes, during a test run in our performance environment, fail to keep up with the message rate.

The service should be reading multicast traffic from the network as fast as possible. We dedicate a physical core to the network processing thread, and make sure that the Java thread doing this work is interrupted as little as possible.

Now, given that we run with the SLAB allocator and want useful vmstat updates, our user-land processes will sometimes get pre-empted by kernel threads. Looking in /proc/sched_debug shows the other processes that are potentially runnable on our dedicated core:

runnable tasks:
            task   PID         tree-key  switches  prio  ... 
        cpuhp/35   334         0.943766        14   120 
    migration/35   336         0.000000        26     0  
    ksoftirqd/35   337        -5.236392         6   120   
    kworker/35:0   338 139152554.767297       456   120  
   kworker/35:0H   339        -5.241396        12   100   
    kworker/35:1   530 139227632.577336       902   120  
   kworker/35:1H  7190         6.745434         3   100   
R           java 102825   1306133.326251       479   110  
    kworker/35:2 102845 139065252.390586         2   120  

We know that our Java process will sometimes be kicked off the CPU by one of the kworker threads, so that it can do some house-keeping. In order to find out if there is a correlation between network traffic build-up and the java process being off-cpu, we have traditionally used ftrace and the built-in tracepoint sched:sched_switch.

Determining off-cpu time with ftrace

The kernel scheduler emits an event to the tracing framework every time that a context switch occurs. The event arguments report the process that is being switched out, and the process that is about to be scheduled for execution.

The output from ftrace will show these data, along with a microsecond timestamp:

<idle>-0     [020] 6907585.873080: sched_switch: 0:120:R ==> 33233:110: java
<...>-33233  [020] 6907585.873083: sched_switch: 33233:110:S ==> 0:120: swapper/20
<idle>-0     [020] 6907585.873234: sched_switch: 0:120:R ==> 33233:110: java

The excerpt above tells us that on CPU 20, the idle process (pid 0) was runnable (R) and was switched out in favour of a Java process (pid 33233). Three microseconds later, the Java process entered the sleep state (S), and was switched out in favour of the idle process. After another 150 microseconds or so, the Java process was again ready to run, so the idle process was switched out to make way.

Using the output from ftrace, we could build a simple script to track the timestamps when each process was switched off the CPU, then subtract that timestamp from the next time that the process gets scheduled back onto a CPU. Aggregating over a suitable period would then give us the maximum time off-cpu in microseconds for each process in the trace. This information would be useful in looking for a correlation between network backlog and off-cpu time.

Determining off-cpu time with BCC

So that's the old way, and it has its drawbacks. ftrace can be quite heavy-weight, and we have found that running it can cause significant overhead in our performance-test environment when running under heavy load. Also, we are only really interested in the single dedicated network-processing core. While it is possible to supply ftrace with a cpumask to restrict this, the interface is a little clunky, and doesn't seem to work very well with the trace-cmd front-end to ftrace.

Ideally, we would like a lighter-weight mechanism for hooking in to context switches, and performing the aggregation in kernel-space rather than by post-processing the ftrace output.

Luckily, BCC allows us to do just that.

Looking at the kernel source, we can see that the trace_sched_switch event is emitted immediately before the call to context_switch:

    trace_sched_switch(preempt, prev, next);
    rq = context_switch(rq, prev, next, cookie); /* unlocks the rq */

At the end of the context_switch function, finish_task_switch is called:

    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);
    return finish_task_switch(prev);

The argument passed to finish_task_switch is a task_struct representing the process that is being switched out. This looks like a good spot to attach a trace, where we can track the time when a process is switched out of the CPU.

In order to do this using BCC, we use a kprobe to hook in to the kernel function finish_task_switch. Using this mechanism, we can attach a custom tracing function to the kernel's finish_task_switch function.

BCC programs

The current method for interacting with the various probe types is via a Python-C bridge. Trace functions are written in C, then compiled and injected by a Python program, which can then interact with the tracing function via the kernel's tracing subsystem.

There is a lot of detail to cover in the various examples and user-guides available on the BCC site, so I will just focus on a walk-through of my use-case.

First off, the C function:

#include <linux/types.h>
#include <linux/sched.h>
#include <uapi/linux/ptrace.h>

struct proc_name_t {
    char comm[TASK_COMM_LEN];
};

BPF_TABLE("hash", pid_t, u64, last_time_on_cpu, 1024);
BPF_TABLE("hash", pid_t, u64, max_offcpu, 1024);
BPF_TABLE("hash", pid_t, struct proc_name_t, proc_name, 1024);

int trace_finish_task_switch(struct pt_regs *ctx, struct task_struct *prev) {
    // bail early if this is not the CPU we're interested in
    // (%d is substituted with the target CPU by the Python wrapper)
    u32 target_cpu = %d;
    if(target_cpu != bpf_get_smp_processor_id())
        return 0;

    // get information about previous/next processes
    struct proc_name_t pname = {};
    pid_t next_pid = bpf_get_current_pid_tgid();
    pid_t prev_pid = prev->pid;
    bpf_get_current_comm(&pname.comm, sizeof(pname.comm));

    // store mapping of pid -> command for display
    proc_name.update(&next_pid, &pname);

    // lookup current values for incoming process
    u64 *last_time;
    u64 *current_max_offcpu;
    u64 current_time = bpf_ktime_get_ns();
    last_time = last_time_on_cpu.lookup(&next_pid);
    current_max_offcpu = max_offcpu.lookup(&next_pid);

    // update max offcpu time
    if(last_time != NULL) {
        u64 delta_nanos = current_time - *last_time;
        if(current_max_offcpu == NULL) {
            max_offcpu.update(&next_pid, &delta_nanos);
        } else if(delta_nanos > *current_max_offcpu) {
            max_offcpu.update(&next_pid, &delta_nanos);
        }
    }

    // store outgoing process' time
    last_time_on_cpu.update(&prev_pid, &current_time);
    return 0;
}
Conceptually, all this program does is update a map of pid -> timestamp, every time a process is switched off-cpu.

Then, if a timestamp exists for the task currently being scheduled on-cpu, we calculate the delta in nanoseconds (i.e. the time that the process was not on the cpu), and track the max value seen so far.

This is all executed in the kernel context, with very low overhead.

Next up, we have the Python program, which can read the data structure being populated by the trace function:

from datetime import datetime
from time import sleep

while 1:
    sleep(1)
    for k, v in b["max_offcpu"].iteritems():
        if v != 0:
            proc_name = b["proc_name"][k].comm
            print("%s max offcpu for %s is %dus" %
                (, proc_name, v.value / 1000))
    b["max_offcpu"].clear()

Here we sleep for one second, then iterate over the items in the max_offcpu hash. This will contain an entry for every process observed by the tracing function that has been switched off the CPU, and back in at least once.

After printing the off-cpu duration of each process, we then clear the data-structure so that it will be populated with fresh data on the next iteration.

I'm still a little unclear on the raciness of this operation - I don't understand well enough whether there could be lost updates between the reporting and the call to clear().

Lastly, this all needs to be wired up in the Python script:

b = BPF(text=prog % (int(sys.argv[1])))
b.attach_kprobe(event="finish_task_switch", fn_name="trace_finish_task_switch")

Here we are requesting that the kernel function finish_task_switch be instrumented, and our custom function trace_finish_task_switch attached.


Using my laptop to test this script, I simulated our use-case by isolating a CPU (3) from the OS scheduler via the isolcpus kernel parameter.

Running my script with BCC installed generated the following output:

[root@localhost scheduler]# python ./ 3
2016-08-12 16:13:42.163032 max offcpu for swapper/3 is 5us
2016-08-12 16:13:43.164501 max offcpu for swapper/3 is 18us
2016-08-12 16:13:44.166178 max offcpu for kworker/3:1 is 990961us
2016-08-12 16:13:44.166405 max offcpu for swapper/3 is 10us
2016-08-12 16:13:46.169315 max offcpu for watchdog/3 is 3999989us
2016-08-12 16:13:46.169413 max offcpu for swapper/3 is 6us
2016-08-12 16:13:50.175329 max offcpu for watchdog/3 is 4000011us
2016-08-12 16:13:50.175565 max offcpu for swapper/3 is 9us

This tells me that most of the time the swapper or idle process was scheduled on CPU 3 (this makes sense - the OS scheduler is not allowed to schedule any userland programs on it because of isolcpus).

Every so often, a watchdog or kworker thread is scheduled on, kicking off the swapper process for a few microseconds.

If I now simulate our workload by executing a user process on CPU 3 (just like our network-processing thread that is busy-spinning trying to receive network traffic), I should be able to see it being kicked off by the kernel threads.

The user-space task is not complicated:


while [ 1 ]; do
    echo "" > /dev/null
done
and executed using taskset to make sure it executes on the correct CPU:

taskset -c 3 bash ./

Running the BCC script now generates this output:

[root@localhost scheduler]# python ./ 3
2016-08-12 16:14:29.120617 max offcpu for bash is 4us
2016-08-12 16:14:30.121861 max offcpu for kworker/3:1 is 1000003us
2016-08-12 16:14:30.121925 max offcpu for bash is 5us
2016-08-12 16:14:31.123140 max offcpu for kworker/3:1 is 999986us
2016-08-12 16:14:31.123201 max offcpu for bash is 4us
2016-08-12 16:14:32.123675 max offcpu for kworker/3:1 is 999994us
2016-08-12 16:14:32.123747 max offcpu for bash is 5us
2016-08-12 16:14:33.124976 max offcpu for kworker/3:1 is 999994us
2016-08-12 16:14:33.125046 max offcpu for bash is 5us
2016-08-12 16:14:34.126287 max offcpu for kworker/3:1 is 999995us
2016-08-12 16:14:34.126366 max offcpu for bash is 5us

Great! The tracing function is able to show that my user-space program is being kicked off the CPU for around 5 microseconds.

I can now deploy this script to our performance environment and see whether the network-processing thread is being descheduled for long periods of time, and whether any of these periods correlate with increases in network buffer depth.

Further reading

There are plenty of examples to work through in the BCC repository. Aside from kprobes, there are uprobes that allow user-space code to be instrumented. Some programs also contain user statically defined tracepoints (USDTs), which are akin to the kernel's static tracepoints.

Familiarity with the functions and tracepoints within the kernel is a definite bonus, as it helps us to understand what information is available. My first port of call is usually running perf record while capturing stack-traces to get an idea of what a program is actually doing. After that, it's possible to start looking through the kernel source code looking for useful data to collect.

The scripts referenced in this post are available on github.

Be warned, they may not be the best example of using BCC, and are the result of an afternoon's hacking.