Tuesday, 19 February 2019

Recall Design

This article discusses the design of Recall, an efficient, low-latency off-heap object store for the JVM.

Recall is designed for use by single-threaded applications, making it simple and performant.

In the multi-core era, it may seem odd that high-performance software is designed for use
by a single thread, but there is a good reason for this.

Developers of low-latency systems will be aware of the Disruptor pattern,
which showed that applying the Single Writer Principle could lead to extreme performance gains.

Recall is also allocation-free (in steady state), meaning that it will always use pieces of
memory that are more likely to be resident in CPU caches. Recall will never cause unwanted
garbage collection pauses, since it will not exhaust any memory region.

Allocations in Recall

While allocation-free in steady state, Recall will allocate new buffers as data is recorded
into a Store. For the most part, these should be direct ByteBuffers, which will not
greatly impact the JVM heap. Users can pre-empt these allocations by correctly sizing
a Store or SequenceMap before use, to ensure that necessary storage is available at
system start time.

Benefits of single-threaded design

Since the data structures in Recall are never accessed by multiple threads, there is no need
for locking, so there is no contention. Resource contention is one of the main things that
can limit throughput in multi-threaded programs, so side-stepping the problem altogether
leads to significant gains over solutions designed for multi-threaded applications.

Highly performant multi-threaded map implementations will use compare-and-swap (CAS)
operations to avoid the cost of standard locks, and so remain lock-free. While this
typically leads to better scalability than a map utilising locks, those CAS operations
are still relatively expensive compared to the normal arithmetic operations that
are used in single-threaded designs.
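To illustrate the difference (this is a sketch for exposition, not a benchmark): a thread-safe counter must use an atomic CAS-based instruction per increment, while a single-writer counter is just an ordinary field increment.

```java
import java.util.concurrent.atomic.AtomicLong;

public final class CounterComparison {
    // Multi-threaded safe counter: each increment is backed by an atomic
    // instruction, which carries a memory-ordering cost even when uncontended.
    static long casIncrement(final AtomicLong counter, final int times) {
        for (int i = 0; i < times; i++) {
            counter.incrementAndGet();
        }
        return counter.get();
    }

    // Single-writer counter: a plain long increment, an ordinary ALU
    // operation requiring no atomic instruction at all.
    static long plainIncrement(final int times) {
        long counter = 0;
        for (int i = 0; i < times; i++) {
            counter++;
        }
        return counter;
    }

    public static void main(final String[] args) {
        System.out.println(casIncrement(new AtomicLong(), 1_000_000));
        System.out.println(plainIncrement(1_000_000));
    }
}
```

Both produce the same result; the single-writer version simply pays less per operation.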

Memory layout

Recall's Store object is an open-addressed hash map, storing a 64-bit long value against
an arbitrary number of bytes. The developer is responsible for providing functions to
serialise and deserialise objects to and from the Store's buffer. Each entry is of a fixed
length, eliminating the chance of compaction problems.

The mapping of long to byte sequence is performed by a simple open-addressed hash map
(from the Agrona library). Each entry in the map records the long identifier against a
position in the Store's buffer. This makes the "index" for the data structure incredibly
compact (~16 bytes per entry), and has the nice side-effect of making inserts, updates and removals
occur in effectively constant time.

Due to the use of a separate index, Recall's Store does not suffer from compaction problems,
and is always 100% efficient in terms of storage space used.
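The shape of such an index can be sketched as follows. This is a minimal illustration of an open-addressed long-to-long map with linear probing, in the spirit of Agrona's Long2LongHashMap; it is not Recall's actual implementation, and for brevity it omits rehashing and assumes non-negative keys.

```java
import java.util.Arrays;

// Minimal open-addressed index sketch: maps a long id to a position in a
// buffer. Each entry is two longs (key and position), i.e. ~16 bytes.
public final class LongIndex {
    private static final long EMPTY = -1L; // sentinel; assumes keys are non-negative
    private final long[] table;            // flattened [key, position] pairs
    private final int mask;

    public LongIndex(final int capacityPowerOfTwo) {
        this.table = new long[capacityPowerOfTwo * 2];
        Arrays.fill(table, EMPTY);
        this.mask = capacityPowerOfTwo - 1;
    }

    public void put(final long key, final long position) {
        int slot = (int) (key & mask);
        // linear probing: step to the next slot until empty or matching key
        while (table[slot * 2] != EMPTY && table[slot * 2] != key) {
            slot = (slot + 1) & mask;
        }
        table[slot * 2] = key;
        table[slot * 2 + 1] = position;
    }

    public long get(final long key) {
        int slot = (int) (key & mask);
        while (table[slot * 2] != EMPTY) {
            if (table[slot * 2] == key) {
                return table[slot * 2 + 1];
            }
            slot = (slot + 1) & mask;
        }
        return EMPTY; // not found
    }
}
```

Lookups, inserts and updates touch only a handful of adjacent array slots, which is why the operations are effectively constant time.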

Inserting a new record

Record insertion involves adding a new key to the "index", serialising the entry,
and increasing the write pointer by the record size.

Note that if insertion causes the map to exceed its load factor, then a re-hash
will occur, causing each entry to be copied into a new, larger buffer.
For this reason it is recommended to correctly size the Store when it is created.

Deleting a record

Removing a record from the map involves copying the highest entry in the buffer
over the top of the entry to be remove, updating the "index", and decreasing
the write pointer by the record size.
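The insert and remove mechanics described above can be sketched together. This is an illustrative model only (it uses a byte[] and a HashMap where Recall uses a direct ByteBuffer and an Agrona open-addressed map), and the caller supplies the id of the highest record on removal, where a real implementation would record each entry's id so it can discover this itself.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a fixed-length record store with swap-on-remove compaction.
public final class FixedLengthStore {
    private final int recordSize;
    private final byte[] buffer;
    private final Map<Long, Integer> index = new HashMap<>(); // id -> buffer offset
    private int writePointer = 0;

    public FixedLengthStore(final int recordSize, final int maxRecords) {
        this.recordSize = recordSize;
        this.buffer = new byte[recordSize * maxRecords];
    }

    // Insert: record the id against the current write position, copy the
    // serialised bytes in, and advance the write pointer by one record.
    public void insert(final long id, final byte[] serialised) {
        index.put(id, writePointer);
        System.arraycopy(serialised, 0, buffer, writePointer, recordSize);
        writePointer += recordSize;
    }

    // Remove: copy the highest record over the removed one, repoint the
    // moved record's index entry, and retreat the write pointer. The
    // buffer therefore stays fully compact at all times.
    public void remove(final long id, final long idOfHighestRecord) {
        final int removedOffset = index.remove(id);
        final int highestOffset = writePointer - recordSize;
        if (removedOffset != highestOffset) {
            System.arraycopy(buffer, highestOffset, buffer, removedOffset, recordSize);
            index.put(idOfHighestRecord, removedOffset);
        }
        writePointer -= recordSize;
    }

    public byte[] load(final long id) {
        final byte[] out = new byte[recordSize];
        System.arraycopy(buffer, index.get(id), out, 0, recordSize);
        return out;
    }
}
```

Because removal always back-fills the freed slot with the highest record, there are never holes in the buffer, which is why storage remains 100% utilised.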

Example usage

Consider the following highly contrived example:

A trading exchange has the requirement to publish possible profit reports to holders
of open positions as market prices fluctuate. When a user creates an open position,
a distribution gateway will need to cache the position, and on every price update
received from the exchange, publish some information indicating the possible profit
associated with the position.

In order to meet the exchange's strict latency requirements, the gateway must be allocation-free,
and is written according to the single-writer principle.


Orders are represented by the Order class:

public final class Order {
  private long accountId;
  private long orderId;
  private int instrumentId;
  private double quantity;
  private double price;

  // getters and setters omitted
}

New Orders are received on the OrderEvents interface:

public interface OrderEvents {
  void onOrderPlaced(Order orderFlyweight);
}

Market data is received on the OrderBook interface:

public interface OrderBook {
  void onPriceUpdate(int instrumentId, double bid, double ask, long timestamp);
}

Profit updates are published on the ProfitEvents interface:

public interface ProfitEvents {
  void onProfitUpdate(long orderId, long accountId, double buyProfit, double sellProfit, long timestamp);
}


These components are brought together in the ProfitPublisher class:

public final class ProfitPublisher implements OrderBook, OrderEvents {
  private final SingleTypeStore<ByteBuffer, Order> store = createStore();
  private final IdAccessor<Order> idAccessor = new OrderIdAccessor();
  private final Encoder<Order> encoder = new OrderEncoder();
  private final Decoder<Order> decoder = new OrderDecoder();
  private final Int2ObjectMap<LongHashSet> instrumentToOrderSetMap = createMap();
  private final Order flyweight = new Order();
  private final ProfitEvents profitEvents; // initialised in constructor

  public void onOrderPlaced(Order orderFlyweight) {
    store.store(orderFlyweight, encoder, idAccessor);
    instrumentToOrderSetMap.get(orderFlyweight.instrumentId()).add(orderFlyweight.orderId());
  }

  public void onPriceUpdate(int instrumentId, double bid, double ask, long timestamp) {
    for (long orderId : instrumentToOrderSetMap.get(instrumentId)) {
      store.load(flyweight, decoder, orderId);
      double buyProfit = flyweight.quantity() * (flyweight.price() - ask);
      double sellProfit = flyweight.quantity() * (flyweight.price() - bid);
      profitEvents.onProfitUpdate(orderId, flyweight.accountId(), buyProfit, sellProfit, timestamp);
    }
  }
}

In this example, the incoming Orders are serialised to a Store. When a price update is received,
the system iterates over any Orders associated with the specified instrument, and publishes
a profit update for the Order.

Operation is allocation-free, meaning that the system will run without any garbage-collection pauses
that could cause unwanted latency spikes. There is no need for object-pooling, as domain objects are
serialised to the Store's buffer until required.

Monday, 18 September 2017

Heap Allocation Flamegraphs

The most recent addition to the grav collection of performance visualisation tools is a utility for tracking heap allocations on the JVM.

This is another Flamegraph-based visualisation that can be used to determine hotspots of garbage creation in a running program.

Usage and mode of operation

Detailed instructions on installation and pre-requisites can be found in the grav repository on github.

Heap allocation flamegraphs use the built-in user statically-defined tracepoints (USDTs), which have been added to recent versions of OpenJDK and Oracle JDK.

To enable the probes, the following command-line flags are required:

-XX:+DTraceAllocProbes -XX:+ExtendedDTraceProbes

Once the JVM is running, the heap-alloc-flames script can be used to generate a heap allocation flamegraph:

$ ./bin/heap-alloc-flames -p $PID -e "java/lang/String" -d 10
Wrote allocation-flamegraph-$PID.svg

BE WARNED: this is a fairly heavyweight profiling method - on each allocation, the entire stack-frame is walked and hashed in order to increment a counter. The JVM will also use a slow-path for allocations when extended DTrace probes are enabled.

It is possible to limit the profiling overhead by recording only every Nth allocation with the '-s' parameter (see the documentation for more info).

For a more lightweight method of heap-profiling, see the excellent async-profiler, which uses a callback on the first TLAB allocation to perform its sampling.

When developing low-garbage or garbage-free applications, it is useful to be able to instrument every allocation, at least within a performance-test environment. This tool could even be used to regression-test allocation rates for low-latency applications, to ensure that changes to the codebase are not increasing allocations.


The allocation profiler works by attaching a uprobe to the dtrace_object_alloc function provided by the JVM.

When the profiler is running, we can confirm that the tracepoint is in place by looking at /sys/kernel/debug/tracing/uprobe_events:

$ cat /sys/kernel/debug/tracing/uprobe_events

Given that we know the type signature of the dtrace_object_alloc method, it is a simple matter to extract the class-name of the object that has just been allocated.

As the profiler is running, it is recording a count against a compound key of java class-name and stack-trace id. At the end of the sampling period, the count is used to 'inflate' the occurrences of a given stack-trace, and these stack-traces are then piped through the usual flamegraph machinery.
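The counting and inflation step can be illustrated as follows. This is an assumed sketch, not grav's actual (Python/eBPF) code: each sample is keyed by its stack trace plus the allocated class name, and at the end of the run each key is emitted with its count in the semicolon-separated folded format consumed by flamegraph.pl.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the allocation-sample aggregation: counts per (stack, class)
// key, emitted in folded-stack format ("frame;frame;Class count").
public final class FoldedStacks {
    private final Map<String, Long> countsByKey = new HashMap<>();

    public void onAllocationSample(final String stackTrace, final String className) {
        // compound key: stack frames plus the allocated class as a leaf frame
        countsByKey.merge(stackTrace + ";" + className, 1L, Long::sum);
    }

    public String toFoldedOutput() {
        final StringBuilder out = new StringBuilder();
        countsByKey.forEach((key, count) ->
                out.append(key).append(' ').append(count).append('\n'));
        return out.toString();
    }
}
```

Each output line inflates one observed stack by its sample count, so frame widths in the rendered flamegraph are proportional to allocation counts.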

Controlling output

Allocation flamegraph

Stack frames can be included or excluded from the generated Flamegraph by using regular expressions that are passed to the python program.

For example, to exclude all allocations of java.lang.String and java.lang.Object[], add the following parameters:

-e "java/lang/String" "java/lang/Object\[\]"

To include only allocations of java.util.ArrayList, add the following:

-i "java/util/ArrayList"

Inspiration & Thanks

Thanks to Amir Langer for collaborating on this profiler.

For more information on USDT probes in the JVM, see Sasha's blog posts.