Tuesday 19 February 2019

Recall Design

This article discusses the design of Recall, an efficient, low-latency off-heap object store for the JVM.

Recall is designed for use by single-threaded applications, making it simple and performant.

In the multi-core era, it may seem odd that high-performance software is designed for use
by a single thread, but there is a good reason for this.

Developers of low-latency systems will be aware of the Disruptor pattern,
which showed that applying the Single Writer Principal could lead to extreme performance gains.

Recall is also allocation-free (in steady state), meaning that it will always use pieces of
memory that are more likely to be resident in CPU caches. Recall will never cause unwanted
garbage collection pauses, since it will not exhaust any memory region.

Allocations in Recall

While allocation-free in steady state, Recall will allocate new buffers as data is recorded
into a Store. For the most part, these should be direct ByteBuffers, which will not
greatly impact the JVM heap. Users can pre-empt these allocations by correctly sizing
a Store or SequenceMap before use, to ensure that necessary storage is available at
system start time.

Benefits of single-threaded design

Since the data structures in Recall are never accessed by multiple threads, there is no need
for locking, so there is no contention. Resource contention is one of the main things that
can limit throughput in multi-threaded programs, so side-stepping the problem altogether
leads to significant gains over solutions designed for multi-threaded applications.

Highly peformant multi-threaded map implementations will use compare-and-swap (CAS)
operations to avoid the cost of standard locks, and remain lock-free. While this does
lead to typically better scalability than a map utilising locks, those CAS operations
are still relatively expensive compared to the normal arithmetic operations that
are used in single-threaded designs.


Memory layout

Recall's Store object is an open-addressed hash map, storing a 64-bit long value against
an arbitrary number of bytes. The developer is responsible for providing functions to
serialise and deserialise objects to and from the Store's buffer. Each entry is of a fixed
length, eliminating the chance of compaction problems.

The mapping of long to byte sequence is performed by a simple open-addressed hash map Agrona library).
Each entry in the map records the long identifier against a
position in the Store's buffer. This makes the "index" for the data structure incredibly
compact (~16b per entry), and has the nice side-effect of making inserts, updates and removals
occur in effectively constant time.

Due to the use of a separate index, Recall's Store does not suffer from compaction problems,
and is always 100% efficient in terms of storage space used.


Inserting a new record

Record insertion involves adding a new key to the "index", serialising the entry,
and increasing the write pointer by the record size.

Note that if insertion causes the map to exceed its load factor, then a re-hash
will occur, causing each entry to be copied into a new, larger buffer.
For this reason it is recommended to correctly size the Store when it is


Deleting a record

Removing a record from the map involves copying the highest entry in the buffer
over the top of the entry to be remove, updating the "index", and decreasing
the write pointer by the record size.


Example usage

Consider the following highly contrived example:

A trading exchange has the requirement to publish possible profit reports to holders
of open positions as market prices fluctuate. When a user creates an open position,
a distribution gateway will need to cache the position, and on every price update
received from the exchange, publish some information indicating the possible profit
associated with the position.

In order to meet the exchange's strict latency requirements, the gateway must be allocation-free,
and is written according to the single-writer principle.



Orders are represented by the Order class:

public final class Order {
  private long accountId;
  private long orderId;
  private int instrumentId;
  private double quantity;
  private double price;

  // getters and setters omitted

New Orders are received on the OrderEvents interface:

public interface OrderEvents {
  void onOrderPlaced(Order orderFlyweight);

Market data is received on the OrderBook interface:

public interface OrderBook {
  void onPriceUpdate(int instrumentId, double bid, double ask, long timestamp);

Profit updates are published on the ProfitEvents interface:

public interface ProfitEvents {
  void onProfitUpdate(long orderId, long accountId, double buyProfit, double sellProfit, long timestamp);


public final class ProfitPublisher implements OrderBook, OrderEvents {
  private final SingleTypeStore<ByteBuffer, Order> store = createStore();
  private final IdAccessor<Order> idAccessor = new OrderIdAccessor();
  private final Encoder<Order> encoder = new OrderEncoder();
  private final Decoder<Order> decoder = new OrderDecoder();
  private final Int2ObjectMap<LongHashSet> instrumentToOrderSetMap = createMap();
  private final Order flyweight = new Order();
  private final ProfitEvents profitEvents; // initialised in constructor
  public void onOrderPlaced(Order orderFlyweight) {
    store.store(orderFlyweight, encoder, idAccessor);

  public void onPriceUpdate(int instrumentId, double bid, double ask, long timestamp) {
    for (long orderId : instrumentToOrderSetMap.get(instrumentId)) {
      store.load(flyweight, decoder, orderId);
      double buyProfit = flyweight.quantity() * (flyweight.price() - ask);
      double sellProfit = flyweight.quantity() * (flyweight.price() - bid);
      profitEvents.onProfitUpdate(orderId, flyweight.accountId(), buyProfit, sellProfit, timestamp);

In this example, the incoming Orders are serialised to a Store. When a price update is received,
the system iterates over any Orders associated with the specified instrument, and publishes
a profit update for the Order.

Operation is allocation-free, meaning that the system will run without any garbage-collection pauses
that could cause unwanted latency spikes. There is no need for object-pooling, as domain objects are
serialised to the Store's buffer until required.