Thursday, 21 March 2019

Performance Tuning E-Book

I've condensed many of the posts on this blog into a series of e-books, available for free download.

The first, on performance tuning for low-latency applications, is available here:

https://s3-eu-west-1.amazonaws.com/aitusoftware-doc-public/downloads/PerformanceTuningHandbookShare.pdf


Tuesday, 19 February 2019

Recall Design

This article discusses the design of Recall, an efficient, low-latency off-heap object store for the JVM.

Recall is designed for use by single-threaded applications, making it simple and performant.

In the multi-core era, it may seem odd that high-performance software is designed for use
by a single thread, but there is a good reason for this.

Developers of low-latency systems will be aware of the Disruptor pattern,
which showed that applying the Single Writer Principle can lead to extreme performance gains.

Recall is also allocation-free (in steady state), meaning that it re-uses the same regions of
memory, which are therefore more likely to be resident in CPU caches. Recall will never cause
unwanted garbage collection pauses, since it will not exhaust any memory region.

Allocations in Recall


While allocation-free in steady state, Recall will allocate new buffers as data is recorded
into a Store. For the most part, these will be direct ByteBuffers, which will not
greatly impact the JVM heap. Users can avoid these runtime allocations by correctly sizing
a Store or SequenceMap before use, ensuring that the necessary storage is available at
system start time.
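
As an illustration, sizing might look something like the sketch below. The factory method and
parameter names here are hypothetical (the example later in this article also leaves
createStore() undefined); consult the Recall documentation for the exact API.

// Hypothetical sizing sketch - the real factory/constructor may differ.
// Sizing for the expected maximum number of live records up front means the
// backing direct ByteBuffer is allocated once, at start-up, and the store
// never needs to grow (and therefore never allocates) in steady state.
private static final int MAX_RECORD_LENGTH = 64;       // bytes per serialised record (assumed)
private static final int MAX_LIVE_RECORDS = 1_000_000;

private final SingleTypeStore<ByteBuffer, Order> store =
    createStore(MAX_RECORD_LENGTH, MAX_LIVE_RECORDS);  // placeholder factory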

Benefits of single-threaded design


Since the data structures in Recall are never accessed by multiple threads, there is no need
for locking, so there is no contention. Resource contention is one of the main things that
can limit throughput in multi-threaded programs, so side-stepping the problem altogether
leads to significant gains over solutions designed for multi-threaded applications.

Highly performant multi-threaded map implementations will use compare-and-swap (CAS)
operations to avoid the cost of standard locks, and so remain lock-free. While this typically
leads to better scalability than a map that uses locks, those CAS operations
are still relatively expensive compared to the normal arithmetic operations
used in single-threaded designs.
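
To illustrate the difference, compare a lock-free counter update with its single-threaded
equivalent (a generic example, not Recall code):

import java.util.concurrent.atomic.AtomicLong;

public final class Counters {
  private final AtomicLong sharedCounter = new AtomicLong(); // safe for multiple threads
  private long localCounter;                                 // single-threaded only

  long incrementShared() {
    // getAndIncrement() compiles down to an atomic instruction (CAS or lock xadd),
    // which is considerably more expensive than a plain register increment
    return sharedCounter.getAndIncrement();
  }

  long incrementLocal() {
    // no atomicity required when only one thread ever touches the field
    return ++localCounter;
  }
}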

 

Memory layout


Recall's Store object is an open-addressed hash map, storing an arbitrary number of bytes
against a 64-bit long key. The developer is responsible for providing functions to
serialise and deserialise objects to and from the Store's buffer. Each entry is of a fixed
length, eliminating the chance of compaction problems.

The mapping of long to byte sequence is performed by a simple open-addressed hash map (from the
Agrona library). Each entry in the map records the long identifier against a
position in the Store's buffer. This makes the "index" for the data structure incredibly
compact (~16 bytes per entry), and has the nice side-effect of making inserts, updates and removals
occur in effectively constant time.

Due to the use of a separate index, Recall's Store does not suffer from compaction problems,
and is always 100% efficient in terms of storage space used.
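
The idea behind the index can be sketched with a minimal open-addressed map from long
identifiers to buffer positions. This is an illustrative simplification, not Recall's
implementation; the real code uses Agrona's collections and handles resizing and removal.

// Minimal illustrative open-addressed index: long id -> int buffer position.
// Capacity must be a power of two; resizing and removal are not handled here.
public final class SimpleIndex {
  private static final long MISSING = Long.MIN_VALUE;
  private final long[] keys;
  private final int[] positions;
  private final int mask;

  public SimpleIndex(int capacityPowerOfTwo) {
    keys = new long[capacityPowerOfTwo];
    positions = new int[capacityPowerOfTwo];
    mask = capacityPowerOfTwo - 1;
    java.util.Arrays.fill(keys, MISSING);
  }

  public void put(long id, int position) {
    int slot = Long.hashCode(id) & mask;
    // linear probing: walk forward until a free slot or the existing key is found
    while (keys[slot] != MISSING && keys[slot] != id) {
      slot = (slot + 1) & mask;
    }
    keys[slot] = id;
    positions[slot] = position;
  }

  public int get(long id) {
    int slot = Long.hashCode(id) & mask;
    while (keys[slot] != MISSING) {
      if (keys[slot] == id) {
        return positions[slot];
      }
      slot = (slot + 1) & mask;
    }
    return -1; // not present
  }
}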

 

Inserting a new record


Record insertion involves adding a new key to the "index", serialising the entry,
and increasing the write pointer by the record size.

Note that if insertion causes the map to exceed its load factor, then a re-hash
will occur, causing each entry to be copied into a new, larger buffer.
For this reason it is recommended to correctly size the Store when it is
constructed.
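
Using the SimpleIndex sketched above, the insert path can be illustrated as follows. Again,
this is a simplification: Recall's real Store delegates serialisation to the user-supplied
functions and handles re-hashing of the index.

// Illustrative fixed-length store sketch (not Recall's implementation).
public final class SimpleStore {
  private final java.nio.ByteBuffer buffer;
  private final SimpleIndex index;   // the long -> position index sketched earlier
  private final int recordLength;
  private int writePointer;          // next free position in the buffer

  public SimpleStore(int maxRecords, int recordLength) {
    this.buffer = java.nio.ByteBuffer.allocateDirect(maxRecords * recordLength);
    // index capacity: a power of two at least as large as maxRecords
    this.index = new SimpleIndex(Integer.highestOneBit(maxRecords) * 2);
    this.recordLength = recordLength;
  }

  public void insert(long id, byte[] serialisedRecord) {
    index.put(id, writePointer);                    // record where this entry will live
    buffer.position(writePointer);
    buffer.put(serialisedRecord, 0, recordLength);  // fixed-length copy into the buffer
    writePointer += recordLength;                   // advance the write pointer
  }
}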

 

Deleting a record


Removing a record from the map involves copying the highest entry in the buffer
over the top of the entry to be removed, updating the "index", and decreasing
the write pointer by the record size.
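
Continuing the SimpleStore sketch, removal can be illustrated like this. A real implementation
would avoid the temporary array, would look up the id of the record in the last slot itself,
and would also remove the old key from the index.

  public void remove(long id, long lastRecordId) {
    int removedPosition = index.get(id);
    int lastPosition = writePointer - recordLength;
    if (removedPosition != lastPosition) {
      // copy the highest entry in the buffer over the entry being removed
      byte[] tmp = new byte[recordLength];      // illustration only - a real store copies in place
      buffer.position(lastPosition);
      buffer.get(tmp);
      buffer.position(removedPosition);
      buffer.put(tmp);
      index.put(lastRecordId, removedPosition); // re-point the moved record's index entry
    }
    writePointer -= recordLength;               // shrink the used region of the buffer
  }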

 

Example usage


Consider the following highly contrived example:

A trading exchange has the requirement to publish possible profit reports to holders
of open positions as market prices fluctuate. When a user creates an open position,
a distribution gateway will need to cache the position, and on every price update
received from the exchange, publish some information indicating the possible profit
associated with the position.

In order to meet the exchange's strict latency requirements, the gateway must be allocation-free,
and is written according to the single-writer principle.

 

Messages


Orders are represented by the Order class:


public final class Order {
  private long accountId;
  private long orderId;
  private int instrumentId;
  private double quantity;
  private double price;

  // getters and setters omitted
}


New Orders are received on the OrderEvents interface:


public interface OrderEvents {
  void onOrderPlaced(Order orderFlyweight);
}



Market data is received on the OrderBook interface:

public interface OrderBook {
  void onPriceUpdate(int instrumentId, double bid, double ask, long timestamp);
}


Profit updates are published on the ProfitEvents interface:


public interface ProfitEvents {
  void onProfitUpdate(long orderId, long accountId, double buyProfit, double sellProfit, long timestamp);
}


Implementation



public final class ProfitPublisher implements OrderBook, OrderEvents {
  private final SingleTypeStore<ByteBuffer, Order> store = createStore();
  private final IdAccessor<Order> idAccessor = new OrderIdAccessor();
  private final Encoder<Order> encoder = new OrderEncoder();
  private final Decoder<Order> decoder = new OrderDecoder();
  private final Int2ObjectMap<LongHashSet> instrumentToOrderSetMap = createMap();
  private final Order flyweight = new Order();
  private final ProfitEvents profitEvents; // initialised in constructor

  public void onOrderPlaced(Order orderFlyweight) {
    // serialise the incoming order into the off-heap store
    store.store(orderFlyweight, encoder, idAccessor);
    // track the order against its instrument (a set per known instrument is assumed to exist)
    instrumentToOrderSetMap.get(orderFlyweight.instrumentId()).add(orderFlyweight.orderId());
  }

  public void onPriceUpdate(int instrumentId, double bid, double ask, long timestamp) {
    for (long orderId : instrumentToOrderSetMap.get(instrumentId)) {
      // deserialise the stored order into the reusable flyweight
      store.load(flyweight, decoder, orderId);
      double buyProfit = flyweight.quantity() * (flyweight.price() - ask);
      double sellProfit = flyweight.quantity() * (flyweight.price() - bid);
      profitEvents.onProfitUpdate(orderId, flyweight.accountId(), buyProfit, sellProfit, timestamp);
    }
  }
}


In this example, the incoming Orders are serialised to a Store. When a price update is received,
the system iterates over any Orders associated with the specified instrument, and publishes
a profit update for the Order.

Operation is allocation-free, meaning that the system will run without any garbage-collection pauses
that could cause unwanted latency spikes. There is no need for object-pooling, as domain objects are
serialised to the Store's buffer until required.
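
For completeness, the publisher might be driven like the sketch below. The single-argument
constructor is assumed here (the example above omits it), and createMap() is assumed to have
registered a LongHashSet for each known instrument.

// Hypothetical wiring - constructor and helper factories are not shown in the example above.
ProfitEvents profitEvents = (orderId, accountId, buyProfit, sellProfit, timestamp) ->
    System.out.printf("order %d, account %d: buy %.2f, sell %.2f%n",
        orderId, accountId, buyProfit, sellProfit);

ProfitPublisher publisher = new ProfitPublisher(profitEvents);

// all calls are made from a single thread, per the single-writer principle
Order order = new Order();
// ... populate the order flyweight from an inbound message ...
publisher.onOrderPlaced(order);
publisher.onPriceUpdate(order.instrumentId(), 101.5, 101.7, System.nanoTime());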


Monday, 18 September 2017

Heap Allocation Flamegraphs

The most recent addition to the grav collection of performance visualisation tools is a utility for tracking heap allocations on the JVM.

This is another Flamegraph-based visualisation that can be used to determine hotspots of garbage creation in a running program.

Usage and mode of operation

Detailed instructions on installation and pre-requisites can be found in the grav repository on github.

Heap allocation flamegraphs use the built-in user statically-defined tracepoints (USDTs), which have been added to recent versions of OpenJDK and Oracle JDK.

To enable the probes, the following command-line flags are required:

-XX:+DTraceAllocProbes -XX:+ExtendedDTraceProbes

Once the JVM is running, the heap-alloc-flames script can be used to generate a heap allocation flamegraph:

$ ./bin/heap-alloc-flames -p $PID -e "java/lang/String" -d 10
...
Wrote allocation-flamegraph-$PID.svg

BE WARNED: this is a fairly heavyweight profiling method - on each allocation, the entire stack is walked and hashed in order to increment a counter. The JVM will also use a slow path for allocations while the extended DTrace probes are enabled.

It is possible to reduce the overhead by recording only every Nth sample, using the '-s' parameter (see the documentation for more info).

For a more lightweight method of heap-profiling, see the excellent async-profiler, which uses a callback on the first TLAB allocation to perform its sampling.

When developing low-garbage or garbage-free applications, it is useful to be able to instrument every allocation, at least within a performance-test environment. This tool could even be used to regression-test allocation rates for low-latency applications, to ensure that changes to the codebase are not increasing allocations.

eBPF + USDT

The allocation profiler works by attaching a uprobe to the dtrace_object_alloc function provided by the JVM.

When the profiler is running, we can confirm that the tracepoint is in place by looking at /sys/kernel/debug/tracing/uprobe_events:

$ cat /sys/kernel/debug/tracing/uprobe_events
p:uprobes/p__usr_lib_jvm_java_8_openjdk_amd64_jre_lib_amd64_server_libjvm_so_0x967fdf_bcc_7043 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so:0x0000000000967fdf
p:uprobes/p__usr_lib_jvm_java_8_openjdk_amd64_jre_lib_amd64_server_libjvm_so_0x96806f_bcc_7043 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so:0x000000000096806f

Given that we know the type signature of the dtrace_object_alloc method, it is a simple matter to extract the class-name of the object that has just been allocated.

As the profiler is running, it is recording a count against a compound key of java class-name and stack-trace id. At the end of the sampling period, the count is used to 'inflate' the occurrences of a given stack-trace, and these stack-traces are then piped through the usual flamegraph machinery.

Controlling output

Allocation flamegraph

Stack frames can be included or excluded from the generated Flamegraph by using regular expressions that are passed to the python program.

For example, to exclude all allocations of java.lang.String and java.lang.Object[], add the following parameters:

-e "java/lang/String" "java.lang.Object\[\]"

To include only allocations of java.util.ArrayList, add the following:

-i "java/util/ArrayList"

Inspiration & Thanks

Thanks to Amir Langer for collaborating on this profiler.

For more information on USDT probes in the JVM, see Sasha's blog posts.

Tuesday, 30 May 2017

Performance visualisation tools

Update


Amir has very kindly contributed a Vagrant box configuration to enable non-Linux users to work with the tools contained in the grav repository.

Thanks Amir!




In my last post, I looked at annotating Flamegraphs with contextual information in order to filter on an interesting subset of the data. One of the things that stuck with me was the idea of using SVGs to render data generated on a server running in headless mode.
Traditionally, I have recorded profiles and traces on remote servers, then pulled the data back to my workstation to filter, aggregate and plot. The scripts used to do this data-wrangling tend to be one-shot affairs, and I've probably lost many useful utilities over the years. I am increasingly coming around to the idea of building the rendering into the server-side script, as it forces me to think about how I want to interpret the data, and also gives the ability to deploy and serve such monitoring from a whole fleet of servers.
Partly to address this, and partly because experimentation is fun, I've been working on some new visualisations of the same ilk as Flamegraphs.
These tools are available in the grav repository.

 

Scheduler Profile

The scheduler profile tool can be used to indicate whether your application's threads are getting enough time on CPU, or whether there is resource contention at play.
In an ideal scenario, your application threads will only ever yield the CPU to one of the kernel's helper threads, such as ksoftirqd, and then only sparingly. Running the scheduling-profile script will record scheduler events in the kernel and will determine the state of your application threads at the point they are pre-empted and replaced with another process.
Threads will tend to be in one of three states when pre-empted by the scheduler:
  • Runnable - happily doing work, moved off the CPU due to scheduler quantum expiry (we want to minimise this)
  • Blocked on I/O - otherwise known as 'Uninterruptible sleep', still an interesting signal but expected for threads handling IO
  • Sleeping - voluntarily yielding the CPU, perhaps due to waiting on a lock
There are a number of other states, which will not be covered here.
Once the profile is collected, these states are rendered as a bar chart for each thread in your application. The examples here are from a JVM-based service, but the approach will work just as well for other runtimes, albeit without the mapping of pid to thread name.
The bar chart will show the proportion of states encountered per-thread as the OS scheduler swapped out the application thread. The sleeping state is marked green (the CPU was intentionally yielded), the runnable state is marked red (the program thread was pre-empted while it still had useful work to do).
Let's take a look at an initial example running on my 4-core laptop:


Scheduling profile

This profile is taken from a simple drop-wizard application; the threads actually processing inbound requests are prefixed with 'dw-'. We can see that these request-processing threads were ready to yield the CPU (i.e. entering the sleep state) about 30% of the time, but they were mostly attempting to do useful work when they were moved off the CPU. This is a hint that the application is resource-constrained in terms of CPU.
This effect is magnified due to the fact that I'm running a desktop OS, the application, and a load-generator all on the same laptop, but these effects will still be present on a larger system.
This can be a useful signal that these threads would benefit from their own dedicated pool of CPUs. Further work is needed to annotate the chart with those processes that were switched in by the scheduler - i.e. the processes that are contending with the application for CPU resource.
Using a combination of kernel tuning and thread-pinning, it should be possible to ensure that the application threads are only very rarely pre-empted by essential kernel threads. More details on how to go about achieving this can be found in previous posts.

 

CPU Tenancy

One of the operating system's responsibilities is to allocate resources to processes that require CPU. In modern multi-core systems, the scheduler must move runnable threads to otherwise idle CPUs to try to maximise system resource usage.
A good example of this is network packet handling. When a network packet is received, it is (by default) processed by the CPU that handles the network card's interrupt. The kernel may then decide to migrate any task that is waiting for data to arrive (e.g. a thread blocked on a socket read) to the receiving CPU, since the packet data is more likely to be available in that CPU's cache.
While we can generally rely on the OS to do a good job of this for us, we may wish to force this cache-locality by having a network-handling thread on an adjacent CPU to the interrupt handler. Such a set-up would mean that the network-handling thread would always be close to the data, without the overhead and jitter introduced by actually running on a CPU responsible for handling interrupts.
This is a common configuration in low-latency applications in the finance industry.
The perf-cpu-tenancy script can be used to build a picture showing how the scheduler allocates CPU to your application threads. In the example below, the threads named dw- are the message-processing threads, and it is clear that they are mostly executed on CPU 2. This correlates with the network card setup on the machine running the application - the IRQ of the network card is associated with CPU 2.


CPU tenancy by thread


 

Further work

To make the scheduling-profile tool more useful, I intend to annotate the runnable state portion of the bar chart with a further breakdown detailing the incoming processes that kicked application threads off-CPU.
This will provide enough information to direct system-tuning efforts to ensure an application has the best chance possible to get CPU-time when required.
If you've read this far, perhaps you're interested in contributing?

Monday, 27 March 2017

Named thread flamegraphs

Update


The utility described in this post has been moved to the grav repository, see "Performance visualisation tools" for more details.


After watching a great talk by Nitsan Wakart at this year's QCon London, I started playing around with flamegraphs a little more.
To get the gist of Nitsan's talk, you can read his blog post Java Flame Graphs Introduction: Fire For Everyone!.
The important thing to take away is that the collapsed stack files that are processed by Brendan Gregg's FlameGraph scripts are just text files, and so can be filtered, hacked, and modified using your favourite methods for such activities (your author is a fan of grep and awk).
The examples in this post are generated using a fork of the excellent perf-map-agent. I've added a couple of scripts to make these examples easier.
To follow the examples, clone this repository.

Aggregate view

One feature that was demonstrated in Nitsan's talk was the ability to collapse the stacks by Java thread. Usually, flamegraphs show all samples aggregated into one view. With thread-level detail though, we can start to see where different threads are spending their time.
This can be useful information when exploring the performance of systems that are unfamiliar.
Let's see what difference this can make to an initial analysis. For this example, we're going to look at a very simple microservice built on dropwizard. The service does virtually nothing except echo a response, so we wouldn't expect the business logic to show up in the trace. Here we are primarily interested in looking at what the framework is doing during request processing.
We'll take an initial recording without a thread breakdown:
$ source ./etc/options-without-threads.sh

$ ./bin/perf-java-flames 7731
You can view the rendered svg file here: flamegraph-aggregate-stacks.svg.


Aggregated stacks

From the aggregated view we can see that most of the time is spent in framework code (jetty, jersey, etc), a smaller proportion is in log-handling (logback), and the rest is spent on garbage collection.

Thread breakdown

Making the same recording, but this time with the stacks assigned to their respective threads, we see much more detail.
$ source ./etc/options-with-threads.sh

$ ./bin/perf-java-flames 7602
In flamegraph-thread-stacks.svg we can immediately see that we have five threads doing most of the work of handling the HTTP requests; they all have very similar profiles, so we can reason that these are probably threads belonging to a generic request-handler pool.


Thread stacks

We can also see another thread that spends most of its time writing log messages to disk. From this, we can reason that the logging framework has a single thread for draining log messages from a queue - something that was more difficult to see in the aggregated view.
Now, we have made some assumptions in the statements above, so is there anything we can do to validate them?

Annotating thread names

With the addition of a simple script to replace thread IDs with thread names, we can gain a better understanding of thread responsibilities within the application.
Let's capture another trace:
$ source ./etc/options-with-threads.sh

$ ./bin/perf-thread-flames 8513
Although flamegraph-named-thread-stacks.svg looks identical when zoomed-out, it contains one more very useful piece of context.


Named stacks

Rolling the mouse over the base of the image shows that the five similar stacks are all from threads named "dw-XX", giving a little more evidence that these are dropwizard handler threads.
There are a couple of narrower stacks named "dw-XX-acceptor"; these are the threads that manage incoming connections before they are handed off to the request processing thread pool.
Further along is a "metrics-csv-reporter" thread, whose responsibility is to write out performance metrics at regular intervals.
The logging framework thread is now more obvious when we can see its name is "AsyncAppender-Worker-async-console-appender".
With nothing more than an external script, we can now infer the following about our application:
  • this application has a request-handling thread-pool
  • fronted by an acceptor thread
  • logging is performed asynchronously
  • metrics reporting is enabled
This kind of overview of system architecture would be much harder to piece together by just reading the framework code.

Filtering

Now that we have this extra context in place, it is a simple matter to filter the flamegraphs down to a finer focus.
The perf-thread-grep script operates on the result of a previous call to perf-java-flames (as seen above).
Suppose we just wanted to look at what the JIT compiler threads were doing?
$ source ./etc/options-with-threads.sh

$ ./bin/perf-thread-grep 8513 "Compiler"
flamegraph-compiler.svg


Compiler threads

or to focus in on threads that called into any logging functions?
$ source ./etc/options-with-threads.sh

$ ./bin/perf-thread-grep 8513 "logback"
flamegraph-logback.svg


Logging functions

Summary

Annotating flamegraphs with java thread names can offer insight into how an application's processing resources are configured. We can use this extra context to easily zoom in on certain functionality.
This technique is particularly powerful when exploring unfamiliar applications with large numbers of threads, whose function may not be immediately obvious.
My learned correspondent Nitsan has suggested that I'm being lazy by using jstack to generate the thread-name mapping. His main complaint is that it causes a safe-point pause in the running application. To make these scripts a little more lightweight, I will explore retrieving the thread names via JVMTI or another low-level interface. But that's a blog post for another day.


Friday, 12 August 2016

Lightweight tracing with BCC

Ever since I read some initial blog posts about the upcoming eBPF tracing functionality in the 4.x Linux kernel, I have been looking for an excuse to get to grips with this technology.

With a planned kernel upgrade in progress at LMAX, I now have access to an interesting environment and workload in order to play around with BCC.

BPF Compiler Collection

BCC is a collection of tools that allows the curious to express programs in C or Lua, and then load those programs as optimised kernel modules, hooked in to the runtime via a number of different mechanisms.

At the time of writing, the motivation subsection of the BCC README explains this best:

  • End-to-end BPF workflow in a shared library
  • A modified C language for BPF backends
  • Integration with llvm-bpf backend for JIT
  • Dynamic (un)loading of JITed programs
  • Support for BPF kernel hooks: socket filters, tc classifiers, tc actions, and kprobes

Tracking process off-cpu time

One of the more in-depth investigations I've been involved in lately is tracking down a throughput problem on one of our services. We have a service that will sometimes, during a test run in our performance environment, fail to keep up with the message rate.

The service should be reading multicast traffic from the network as fast as possible. We dedicate a physical core to the network processing thread, and make sure that the Java thread doing this work is interrupted as little as possible.

Now, given that we run with the SLAB allocator and want useful vmstat updates, our user-land processes will sometimes get pre-empted by kernel threads. Looking in /proc/sched_debug shows the other processes that are potentially runnable on our dedicated core:

runnable tasks:
            task   PID         tree-key  switches  prio  ... 
---------------------------------------------------------
        cpuhp/35   334         0.943766        14   120 
    migration/35   336         0.000000        26     0  
    ksoftirqd/35   337        -5.236392         6   120   
    kworker/35:0   338 139152554.767297       456   120  
   kworker/35:0H   339        -5.241396        12   100   
    kworker/35:1   530 139227632.577336       902   120  
   kworker/35:1H  7190         6.745434         3   100   
R           java 102825   1306133.326251       479   110  
    kworker/35:2 102845 139065252.390586         2   120  

We know that our Java process will sometimes be kicked off the CPU by one of the kworker threads, so that it can do some house-keeping. In order to find out if there is a correlation between network traffic build-up and the java process being off-cpu, we have traditionally used ftrace and the built-in tracepoint sched:sched_switch.

Determining off-cpu time with ftrace

The kernel scheduler emits an event to the tracing framework every time that a context switch occurs. The event arguments report the process that is being switched out, and the process that is about to be scheduled for execution.

The output from ftrace will show these data, along with a microsecond timestamp:

<idle>-0     [020] 6907585.873080: sched_switch: 0:120:R ==> 33233:110: java
<...>-33233  [020] 6907585.873083: sched_switch: 33233:110:S ==> 0:120: swapper/20
<idle>-0     [020] 6907585.873234: sched_switch: 0:120:R ==> 33233:110: java

The excerpt above tells us that on CPU 20, the idle process (pid 0) was runnable (R) and was switched out in favour of a Java process (pid 33233). Three microseconds later, the Java process entered the sleep state (S), and was switched out in favour of the idle process. After another 150 microseconds or so, the Java process was again ready to run, so the idle process was switched out to make way.

Using the output from ftrace, we could build a simple script to track the timestamps when each process was switched off the CPU, then subtract that timestamp from the next time that the process gets scheduled back onto a CPU. Aggregating over a suitable period would then give us the maximum time off-cpu in microseconds for each process in the trace. This information would be useful in looking for a correlation between network backlog and off-cpu time.

Determining off-cpu time with BCC

So that's the old way, and it has its drawbacks. ftrace can be quite heavyweight, and we have found that running it can cause significant overhead in our performance-test environment when running under heavy load. Also, we are only really interested in the single dedicated network-processing core. While it is possible to supply ftrace with a cpumask to restrict this, the interface is a little clunky, and doesn't seem to work very well with the trace-cmd front-end to ftrace.

Ideally, we would like a lighter-weight mechanism for hooking in to context switches, and performing the aggregation in kernel-space rather than by post-processing the ftrace output.

Luckily, BCC allows us to do just that.

Looking at the kernel source, we can see that the trace_sched_switch event is emitted immediately before the call to context_switch:

    trace_sched_switch(preempt, prev, next);
    rq = context_switch(rq, prev, next, cookie); /* unlocks the rq */

At the end of the context_switch function, finish_task_switch is called:

    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);
    barrier();
    
    return finish_task_switch(prev);
}

The argument passed to finish_task_switch is a task_struct representing the process that is being switched out. This looks like a good spot to attach a trace, where we can track the time when a process is switched out of the CPU.

In order to do this using BCC, we use a kprobe to hook in to the kernel function finish_task_switch. Using this mechanism, we can attach a custom tracing function to the kernel's finish_task_switch function.

BCC programs

The current method for interacting with the various probe types is via a Python-C bridge. Trace functions are written in C, then compiled and injected by a Python program, which can then interact with the tracing function via the kernel's tracing subsystem.

There is a lot of detail to cover in the various examples and user-guides available on the BCC site, so I will just focus on a walk-through of my use-case.

First off, the C function:

#include <linux/types.h>
#include <linux/sched.h>
#include <uapi/linux/ptrace.h>

struct proc_name_t {
    char comm[TASK_COMM_LEN];
};

BPF_TABLE("hash", pid_t, u64, last_time_on_cpu, 1024);
BPF_TABLE("hash", pid_t, u64, max_offcpu, 1024);
BPF_TABLE("hash", pid_t, struct proc_name_t, proc_name, 1024);

int trace_finish_task_switch(struct pt_regs *ctx, struct task_struct *prev) {
    // bail early if this is not the CPU we're interested in
    u32 target_cpu = %d;
    if(target_cpu != bpf_get_smp_processor_id())
    {
        return 0;
    }

    // get information about previous/next processes
    struct proc_name_t pname = {};
    pid_t next_pid = bpf_get_current_pid_tgid();
    pid_t prev_pid = prev->pid;
    bpf_get_current_comm(&pname.comm, sizeof(pname.comm));

    // store mapping of pid -> command for display
    proc_name.update(&next_pid, &pname);

    // lookup current values for incoming process
    u64 *last_time;
    u64 *current_max_offcpu;
    u64 current_time = bpf_ktime_get_ns();
    last_time = last_time_on_cpu.lookup(&next_pid);
    current_max_offcpu = max_offcpu.lookup(&next_pid);

    // update max offcpu time
    if(last_time != NULL) {
        u64 delta_nanos = current_time - *last_time;
        if(current_max_offcpu == NULL) {
            max_offcpu.update(&next_pid, &delta_nanos);
        }
        else {
            if(delta_nanos > *current_max_offcpu) {
                max_offcpu.update(&next_pid, &delta_nanos);
            }
        }
    }

    // store outgoing process' time
    last_time_on_cpu.update(&prev_pid, &current_time);
    return 0;
};

Conceptually, all this program does is update a map of pid -> timestamp, every time a process is switched off-cpu.

Then, if a timestamp exists for the task currently being scheduled on-cpu, we calculate the delta in nanoseconds (i.e. the time that the process was not on the cpu), and track the max value seen so far.

This is all executed in the kernel context, with very low overhead.

Next up, we have the Python program, which can read the data structure being populated by the trace function:

while 1:
    time.sleep(1)
    for k,v in b["max_offcpu"].iteritems():
        if v != 0:
            proc_name = b["proc_name"][k].comm
            print("%s max offcpu for %s is %dus" % 
                (datetime.datetime.now(), proc_name, v.value/1000))
    b["max_offcpu"].clear()

Here we sleep for one second, then iterate over the items in the max_offcpu hash. This will contain an entry for every process observed by the tracing function that has been switched off the CPU, and back in at least once.

After printing the off-cpu duration of each process, we then clear the data-structure so that it will be populated with fresh data on the next iteration.

I'm still a little unclear on the raciness of this operation - I don't understand well enough whether there could be lost updates between the reporting and the call to clear().

Lastly, this all needs to be wired up in the Python script:

b = BPF(text=prog % (int(sys.argv[1])))
b.attach_kprobe(event="finish_task_switch", fn_name="trace_finish_task_switch")

Here we are requesting that the kernel function finish_task_switch be instrumented, and our custom function trace_finish_task_switch attached.

Results

Using my laptop to test this script, I simulated our use-case by isolating a CPU (3) from the OS scheduler via the isolcpus kernel parameter.

Running my offcpu.py script with BCC installed generated the following output:

[root@localhost scheduler]# python ./offcpu.py 3
2016-08-12 16:13:42.163032 max offcpu for swapper/3 is 5us
2016-08-12 16:13:43.164501 max offcpu for swapper/3 is 18us
2016-08-12 16:13:44.166178 max offcpu for kworker/3:1 is 990961us
2016-08-12 16:13:44.166405 max offcpu for swapper/3 is 10us
2016-08-12 16:13:46.169315 max offcpu for watchdog/3 is 3999989us
2016-08-12 16:13:46.169413 max offcpu for swapper/3 is 6us
2016-08-12 16:13:50.175329 max offcpu for watchdog/3 is 4000011us
2016-08-12 16:13:50.175565 max offcpu for swapper/3 is 9us

This tells me that most of the time the swapper or idle process was scheduled on CPU 3 (this makes sense - the OS scheduler is not allowed to schedule any userland programs on it because of isolcpus).

Every so often, a watchdog or kworker thread is scheduled on, kicking the swapper process off the CPU for a few microseconds.

If I now simulate our workload by executing a user process on CPU 3 (just like our network-processing thread that is busy-spinning trying to receive network traffic), I should be able to see it being kicked off by the kernel threads.

The user-space task is not complicated:

cat hot_loop.sh

while [ 1 ]; do
    echo "" > /dev/null
done

and executed using taskset to make sure it executes on the correct CPU:

taskset -c 3 bash ./hot_loop.sh

Running the BCC script now generates this output:

[root@localhost scheduler]# python ./offcpu.py 3
2016-08-12 16:14:29.120617 max offcpu for bash is 4us
2016-08-12 16:14:30.121861 max offcpu for kworker/3:1 is 1000003us
2016-08-12 16:14:30.121925 max offcpu for bash is 5us
2016-08-12 16:14:31.123140 max offcpu for kworker/3:1 is 999986us
2016-08-12 16:14:31.123201 max offcpu for bash is 4us
2016-08-12 16:14:32.123675 max offcpu for kworker/3:1 is 999994us
2016-08-12 16:14:32.123747 max offcpu for bash is 5us
2016-08-12 16:14:33.124976 max offcpu for kworker/3:1 is 999994us
2016-08-12 16:14:33.125046 max offcpu for bash is 5us
2016-08-12 16:14:34.126287 max offcpu for kworker/3:1 is 999995us
2016-08-12 16:14:34.126366 max offcpu for bash is 5us

Great! The tracing function is able to show that my user-space program is being kicked off the CPU for around 5 microseconds.

I can now deploy this script to our performance environment and see whether the network-processing thread is being descheduled for long periods of time, and whether any of these periods correlate with increases in network buffer depth.

Further reading

There are plenty of examples to work through in the BCC repository. Aside from kprobes, there are uprobes that allow user-space code to be instrumented. Some programs also contain user statically-defined tracepoints (USDTs), which are akin to the kernel's static tracepoints.

Familiarity with the functions and tracepoints within the kernel is a definite bonus, as it helps us to understand what information is available. My first port of call is usually running perf record while capturing stack-traces to get an idea of what a program is actually doing. After that, it's possible to start looking through the kernel source code looking for useful data to collect.

The scripts referenced in this post are available on github.

Be warned, they may not be the best example of using BCC, and are the result of an afternoon's hacking.