Friday, 15 May 2015

Improving journalling latency

For the last few months at LMAX Exchange, we've been working on building out our next generation platform. Every few years we refresh our hardware and upgrade the machines that run our systems, and this time we decided to have a look at upgrading the operating system at the same time.

When our first generation exchange was built, we were happy with low-millisecond-level mean latencies. After a couple of years of operation, we upgraded to newer hardware, made some significant software changes and ended up with mean end-to-end latencies of around 250 microseconds. With our latest set of changes, we are aiming for sub-100 microsecond mean latency and significantly reduced jitter.

These changes should stand us in good stead for another year or two, before we repeat the cycle to further improve performance. In order to achieve this goal, we have modified and tuned our hardware, system architecture, operating system and application software.

In my next few posts, I will be describing our experiences of doing this, and lessons we've learned along the way.


Recap - LMAX Exchange Architecture


For a detailed description of our architecture and how it all fits together, I recommend watching my colleague's talk here: LMAX Exchange Architecture: High-throughput Low-latency and plain old Java.

For now, I'll give a high-level overview of message flow in our system.




  1. Inputs to our exchange arrive either from market makers who are generally responsible for making prices in the market, or customers who generally take the prices. All market maker traffic is based on the FIX protocol, customer traffic can be either FIX protocol, or XML/JSON over HTTP. The services responsible for access control and protocol translation are referred to as 'Gateways'.
  2. Inbound and outbound traffic is tapped at the edge of our network. This allows us to have an authoritative record of data transfer with our customers, and also provides the ideal place to measure end-to-end latency. 
  3. Market maker orders are converted to our internal message protocol, then routed over UDP straight to the matching engine.
  4. Customer orders are translated to internal messages, then routed to the order management and pre-trade risk engine (4a). Assuming that the customer has sufficient funds, their order will be forwarded to the matching engine (4b).
  5. The matching engine and order management engine are what we refer to as 'Core Services'. 
  6. These services journal inbound messages to disk (6a), and have an HA pair that receives and acknowledges messages (6b). Using these mechanisms, we protect ourselves from single-node service failure. Once messages have been journalled and replicated, they are passed on to the business-logic.
  7. Responses are published over UDP to the gateways for transmission back to the market maker or customer.
This describes the data flow in what we refer to as our latency-sensitive path. In the diagram above, each wheel icon represents a Disruptor instance, which is used extensively in our system.

Given this architecture, we necessarily tend to focus our attention on the core services, since these do the most work and actually model the business domain. The gateways are very lightweight and mainly just do the translation work; they also have the nice property of being horizontally scalable if we need to spread the load of the work being performed.

Two of the main costs that we need to address are the time taken to journal messages to disk, and the time taken to synchronously replicate messages out to a secondary. For this reason, it made sense to start by looking at disk journalling performance.
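
As a rough illustration of how this fits together, the sketch below shows journalling and replication wired up as parallel consumers ahead of the business logic, using the Disruptor DSL. The event and handler classes here are purely illustrative, not our actual implementation.

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    public class CoreServicePipeline
    {
        // Illustrative event type carrying the raw inbound message bytes.
        public static class MessageEvent
        {
            byte[] payload;
        }

        public static void main(final String[] args)
        {
            final Disruptor<MessageEvent> disruptor = new Disruptor<>(
                    MessageEvent::new, 1 << 16, DaemonThreadFactory.INSTANCE);

            // Journalling (6a) and replication (6b) run in parallel; the business
            // logic only sees an event once both have processed it.
            final EventHandler<MessageEvent> journaller =
                    (event, sequence, endOfBatch) -> { /* append event.payload to the journal */ };
            final EventHandler<MessageEvent> replicator =
                    (event, sequence, endOfBatch) -> { /* send event.payload to the HA secondary */ };
            final EventHandler<MessageEvent> businessLogic =
                    (event, sequence, endOfBatch) -> { /* matching / order management */ };

            disruptor.handleEventsWith(journaller, replicator).then(businessLogic);
            disruptor.start();
        }
    }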


Journalling Performance


For the last few years, we've been running our systems on CentOS 6.4, kernel 2.6.32 and journalling messages to an ext3 file-system. The file-system is backed by a battery-backed RAID array, and we perform asynchronous writes - meaning that the data is only guaranteed to be in the operating system's page cache after the write() call has returned. At the time, our testing showed this configuration to be the most performant, given the trade-offs of maturity of other file-systems and safety guarantees.

Testing also showed that from a software point-of-view, using the JDK's RandomAccessFile gave the best performance for writes that always append to the end of a file. Using this technique, as messages arrive at a core service, they are appended to the current journal. When a journal reaches a certain size, the journaller rolls to the next file and continues appending data.
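
As a minimal sketch of that style of journaller (the journal-file naming and roll threshold here are illustrative, not our actual values):

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class AppendOnlyJournaller
    {
        // Illustrative roll threshold; the real size limit is configuration-dependent.
        private static final long MAX_JOURNAL_LENGTH = 1L << 30;

        private final File journalDir;
        private RandomAccessFile currentJournal;
        private long bytesWritten;
        private int journalIndex;

        public AppendOnlyJournaller(final File journalDir) throws IOException
        {
            this.journalDir = journalDir;
            rollToNextFile();
        }

        // Appends the message; the write() call returns once the data is in the page cache.
        public void append(final byte[] message) throws IOException
        {
            if (bytesWritten + message.length > MAX_JOURNAL_LENGTH)
            {
                rollToNextFile();
            }
            currentJournal.write(message);
            bytesWritten += message.length;
        }

        private void rollToNextFile() throws IOException
        {
            if (currentJournal != null)
            {
                currentJournal.close();
            }
            currentJournal = new RandomAccessFile(
                    new File(journalDir, "journal-" + journalIndex++ + ".dat"), "rw");
            bytesWritten = 0;
        }
    }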

In order to determine what benefit we would get from changing the operating system/file-system/storage hardware, we needed to be able to accurately measure the time taken to journal incoming messages to disk.

First of all, of course, it's necessary to be able to replicate production-like inbound traffic in a performance-test environment; see previous posts on how you might go about getting to this point.


Measuring the baseline


Once we were happy that we could adequately stress the system-under-test, we found that the best way to measure journalling latency was just to wrap the write call with a timer.

Our journaller was instrumented with a couple of calls to System.nanoTime():
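
In outline, it looks something like the sketch below; the names here are illustrative rather than our exact production code, and the ValueRecorder is shown afterwards:

    // Instrumented version of the append method from the earlier sketch; the
    // writeLatencyRecorder field is a ValueRecorder (see the next snippet).
    public void append(final byte[] message) throws IOException
    {
        final long startNanos = System.nanoTime();
        currentJournal.write(message);
        writeLatencyRecorder.record(System.nanoTime() - startNanos);
    }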


The ValueRecorder component referenced here simply maintains a maximum-seen value and publishes this value to a monitoring endpoint every second or so. Using this small change, we were able to see exactly how long it was taking to perform an asynchronous write to underlying storage.
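
A minimal version of such a recorder might look like the following, with the monitoring endpoint hidden behind a hypothetical publish method:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Tracks the maximum recorded value, publishing and resetting it once a second.
    public class ValueRecorder
    {
        private final AtomicLong maxSeen = new AtomicLong();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public ValueRecorder(final String metricName)
        {
            scheduler.scheduleAtFixedRate(
                    () -> publish(metricName, maxSeen.getAndSet(0L)), 1L, 1L, TimeUnit.SECONDS);
        }

        public void record(final long value)
        {
            maxSeen.accumulateAndGet(value, Math::max);
        }

        private void publish(final String metricName, final long maxValue)
        {
            // Hypothetical: forward the max value to the monitoring endpoint.
        }
    }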

Armed with this ability to extract accurate metrics from our journaller, we ran a baseline test to see how the system was currently performing.



Max write latency in microseconds per-second (kernel 2.6.32/ext3)

In our existing configuration, we had a background noise of 200 - 400 microseconds write latency, with spikes up to 1 millisecond. Clearly, in order to get to consistently low latencies, we needed to address this. Inspecting the data in detail, we can see that the best-case latency for the write call is about 10 microseconds:


Best case write latency of ~10 microseconds


Measuring improvements


Although we knew that we were planning to upgrade to new hardware, when performance testing it is always advisable to change only one thing at a time; otherwise it's impossible to know which single change had a beneficial or detrimental impact. With this methodology in mind, we decided to first upgrade the kernel, then the file-system, on the same hardware, recording the results each time. For brevity's sake, I'll just present the outcome of those tests: we found that the best combination on the old hardware was to upgrade the kernel to a more recent version (3.17), and to use the ext4 file-system in place of ext3. The results of these changes were obvious when we re-ran the previous test.


Max write latency in microseconds per-second (kernel 2.6.32/ext3 vs 3.17/ext4)

Background noise was now down to around 50-100 microseconds, with spikes of around 200 microseconds. Looking in detail again, we can see that the best-case write latency is still around 10 microseconds - suggesting that this is the real time for a write call + JNI overhead when the kernel really just performs async IO (essentially just a memory write).


Best case write latency still ~10 microseconds

Now that we had selected the optimal configuration for the OS/file-system, we tried out upgrading the hardware. Again, in an attempt to change only one thing at a time, we also tried the kernel 2.6.32/ext3 combination on the new hardware, but I will just show the results for kernel 3.17/ext4, which yielded the best results.


Write latency in microseconds after hardware upgrade

The improvement with the new hardware is actually difficult to see at this scale, such is the reduction in jitter and latency. Looking at the charts with a log scale on the y-axis makes things a little clearer.


Write latency in microseconds after hardware upgrade (log scale)

After combining the hardware, operating system and file-system upgrades, background noise is down to 10 - 20 microseconds, with spikes up to around 300 microseconds. This is a great improvement on the baseline, which had a background of 200 - 400 microseconds, with spikes up to 1 millisecond. Also, we can see that the best-case write latency has decreased to 4 - 5 microseconds, about half of what it was on the original configuration.


Further improvements


More analysis of these results revealed that the 300 microsecond spikes are caused by the journaller rolling to a new file, rather than the cost of actually doing a write call (the file-rolling in our journaller is orchestrated at a lower level than the instrumentation that we added). This is something that we will be able to easily fix in software, meaning that we should have consistent write latencies of under 20 microseconds.
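
One possible software fix, sketched here purely as an illustration rather than our actual change, is to prepare the next journal file on a background thread so that rolling on the journalling thread becomes little more than a file-handle swap:

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical helper: opens and pre-allocates the next journal file ahead of time.
    public class PreparingJournalRoller
    {
        private final ExecutorService preparer = Executors.newSingleThreadExecutor();
        private final File journalDir;
        private int journalIndex;
        private Future<RandomAccessFile> nextJournal;

        public PreparingJournalRoller(final File journalDir)
        {
            this.journalDir = journalDir;
            prepareNext();
        }

        // Called by the journaller when the current file is full; usually returns immediately.
        public RandomAccessFile roll() throws Exception
        {
            final RandomAccessFile journal = nextJournal.get();
            prepareNext();
            return journal;
        }

        private void prepareNext()
        {
            final File file = new File(journalDir, "journal-" + journalIndex++ + ".dat");
            nextJournal = preparer.submit(() -> {
                final RandomAccessFile journal = new RandomAccessFile(file, "rw");
                journal.setLength(1L << 30); // pre-allocate so later appends don't extend the file
                return journal;
            });
        }
    }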

We have also spent very little time experimenting with kernel tunables related to I/O performance; there may be further gains to be made by working on I/O schedulers and priorities.


Conclusion


Upgrading our operating system and hardware made a huge difference to the amount of jitter that we saw in our journalling performance. This was, however, a significant undertaking, and not an opportunity that comes along very often.

A modern kernel on commodity hardware seems to be capable of write latencies as low as 5 microseconds when asynchronous I/O is used.

Once again, being able to replay production-like inputs to our system has proven invaluable in testing and tuning for performance in a repeatable manner. Having this ability means that we are able to try out different settings without impacting the performance of our production environments, and generate faster feedback on these changes.




12 comments:

  1. For the algorithm we developed, I've found it is better to append to a file by using the following, to avoid one of the system calls needed to set the position. It effectively calls pwrite().

    http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#write(java.nio.ByteBuffer,%20long)
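
    For illustration only, a positional append using that method might look like the sketch below; the file name and caller-side position tracking are assumed:

      import java.io.IOException;
      import java.nio.ByteBuffer;
      import java.nio.channels.FileChannel;
      import java.nio.file.Paths;
      import java.nio.file.StandardOpenOption;

      public class PositionalAppendExample
      {
          public static void main(final String[] args) throws IOException
          {
              try (FileChannel channel = FileChannel.open(
                      Paths.get("journal-0.dat"),
                      StandardOpenOption.CREATE, StandardOpenOption.WRITE))
              {
                  long position = 0L;
                  final ByteBuffer message = ByteBuffer.wrap("example message".getBytes());

                  // write(ByteBuffer, long) does not move the channel's position, so the
                  // caller tracks the append point itself; on Linux this maps to pwrite().
                  position += channel.write(message, position);
              }
          }
      }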

    1. Hi Martin, we did try with this API call some time ago. We didn't really see much difference, but that was probably down to the background noise of the old kernel/ext3. I'll re-run these tests with our journaller that uses pwrite() and publish the results...

  2. What do you mean when you say "asynchronous writes"? Do you mean that you don't call flush/force? If so, does your battery backed RAID array even get to see the data and thus keep it safe in the event of a power failure?

    1. Hi Craig,
      we don't call force(), so the data is really only in the OS's page-cache.

      So a single service journal is resilient to process crash, but not power failure. We replicate out to a secondary service (6b in the diagram above) before processing any business logic, so in the event of power failure on the primary, we still have a copy of received messages in the secondary.

    2. Resilience against data loss comes from replicating the data. Persistence is replicating to yourself in the future. Replicating to other nodes and to disk provides resilience in space and time. There are no absolutes; you need to determine what level of resilience is required for your application based on real numbers. Too often this is done with no numbers about failure probabilities, which is little more than folklore.

    3. So the mention of the battery backed RAID array is fairly irrelevant in this discussion? Would it also be fair to say that your system does not have the same basic resiliency guarantees of, say, an Oracle or PostgreSQL database? I understand that resilience may be a continuum, but resilience in the event of a power failure is a reasonable point along that continuum that is achievable with flushing and a battery-backed cache. I am assuming that you are not synchronously replicating, and you aren't flushing, so it seems to me the architecture is not resilient to power failure.

      The reason this is so interesting to me is that we are running a similar architecture to LMAX that uses disruptors and journalling and replication. The disruptor pipeline goes at millions of events a second, but the best we can journal without flushing is closer to 100k events/sec and with flushing this drops down to less than 10k events/sec. So the journal is the bottleneck. The typical production sustained event rate in our system is about 5k events/sec. Ideally I would like that to be able to burst to 3 or 4 times that (equity/options markets being what they are) while still having resiliency, even in the event of power failure.

  3. This comment has been removed by the author.

    1. Persistence is one good way of handling power failures that cannot be isolated.

      Persistence of replicated synchronous logs is not really equivalent to Oracle or PostgreSQL; it is better. I worked with these databases for years, and while their journalling is excellent they tend to replicate asynchronously. Need to check the latest version to see if this is still true. To get the same level of resilience you often have to employ storage that synchronously replicates, which is mega expensive.

      Regarding your throughput rates: I would not flush when doing a synchronous journal. You need to write direct with a specific block writing pattern. If you do this it is possible to write at 100K+ events per second, or even much greater depending on storage type. It sounds like your write patterns, file system setup, and configuration are not appropriate for getting max throughput.

      As something to consider: when writing at max throughput the page cache is not helping. So firstly you should not use it, and secondly, you are then limited by the underlying storage throughput for a given write pattern. Given the significant drop you see, this suggests it is not the storage limiting throughput but how you are writing to it.

      The value of the page cache to writing is to handle burst scenarios at low latency, and improve the write patterns down to storage.

      Also note that all SSDs are not equal. Some will respond saying they have been flushed when they have not, and they can reorder writes!

    2. Hi Craig,
      we do indeed synchronously replicate. So a message arriving at our matching engine must be synchronously replicated to the secondary, and written to the local OS page cache before it is processed. This happens in parallel, utilising the Disruptor.

      We've found that the network hop between primary & secondary is acceptable in terms of latency; I have another post queued up discussing the work we've done recently to tackle network latency. Critically, the message only needs to be acknowledged as received by the secondary instance, not journalled by the secondary instance. In this way, we're paying for two network hops over a 10Gb network, which on commodity hardware and stock OS comes in at around 30 microseconds average (*).

      For more detail about message flow in our system, have a look at this talk: https://skillsmatter.com/skillscasts/5247-the-lmax-exchange-architecture-high-throughput-low-latency-and-plain-old-java

      Our resiliency then, is based on the premise that we are unlikely to suffer power failure on both the primary and secondary within a certain time-window. All systems have redundant battery-backed UPS, so we're really talking about a machine going pop, or the datacentre suffering something catastrophic.

      At the cost of adding more complexity to our system, I believe it would be possible to get rid of the need to wait for synchronous acknowledgement from the secondary. This however would need to be balanced against the benefit, and we probably have parts of the system that are easier to optimise. Going for the low-hanging fruit tends to give the best results.

      (*) As I'm sure someone could point out, average is a terrible metric to quote - used for illustration purposes only.

    3. That was the missing piece of information,... you are synchronously replicating. Are the nodes in the same datacenter? If so, I assume they are on different power circuits? So if your replication is synchronous, your "availability" is tied to the availability of both nodes, a factor you can probably reduce by adding more synchronous nodes and ensuring that you get a response from at least one.

      30 microseconds for 2 network hops sounds impressive. I would love to hear how you do that :)

    4. 30 microseconds is pretty middle of the road on 10GigE with a decent messaging system. Try Aeron and see what you get.

      https://github.com/real-logic/Aeron

      If you then run this over something like a Solarflare or Mellanox network card you can expect to at least halve that again.

      Now if your messaging is not "decent", i.e. something JMS or AMQP based, it will be a much longer round trip time.

    5. Martin obviously gets up earlier in the morning than me :)

      When trying to figure out how fast you can expect your network to go, it's worth using a tool like netperf (http://www.netperf.org/netperf/) to see what the raw throughput/latency capabilities are when your systems are not loaded.

      Any software stack running on top of this (including your application code) is going to add overhead. As Martin points out, there are huge differences in the performance you can expect from different open-source and proprietary messaging systems. Once you have some baseline numbers, you can start to see how much overhead you're getting from your software stack.

      If there is a large discrepancy, then further experimentation/instrumentation will be required to find out where the latency is occurring.
