For the last few months at LMAX Exchange, we've been working on building out our next generation platform. Every few years we refresh our hardware and upgrade the machines that run our systems, and this time we decided to have a look at upgrading the operating system at the same time.
When our first generation exchange was built, we were happy with low-millisecond-level mean latencies. After a couple of years of operation, we upgraded to newer hardware, made some significant software changes and ended up with mean end-to-end latencies of around 250 microseconds. With our latest set of changes, we are aiming for sub-100 microsecond mean latency and significantly reduced jitter.
These changes should stand us in good stead for another year or two, before we repeat the cycle to further improve performance. In order to achieve this goal, we have modified and tuned our hardware, system architecture, operating system and application software.
In my next few posts, I will be describing our experiences of doing this, and lessons we've learned along the way.
Recap - LMAX Exchange Architecture
For a detailed description of our architecture and how it all fits together, I recommend watching my colleague's talk here: LMAX Exchange Architecture: High-throughput Low-latency and plain old Java.
For now, I'll give a high-level overview of message flow in our system.
- Inputs to our exchange arrive either from market makers, who are generally responsible for making prices in the market, or customers, who generally take the prices. All market maker traffic is based on the FIX protocol; customer traffic can be either FIX, or XML/JSON over HTTP. The services responsible for access control and protocol translation are referred to as 'Gateways'.
- Inbound and outbound traffic is tapped at the edge of our network. This allows us to have an authoritative record of data transfer with our customers, and also provides the ideal place to measure end-to-end latency.
- Market maker orders are converted to our internal message protocol, then routed over UDP straight to the matching engine.
- Customer orders are translated to internal messages, then routed to the order management and pre-trade risk engine (4a). Assuming that the customer has sufficient funds, their order will be forwarded to the matching engine (4b).
- The matching engine and order management engine are what we refer to as 'Core Services'.
- These services journal inbound messages to disk (6a), and have an HA pair that receive and acknowledge messages (6b). Using these mechanisms, we protect ourselves from single-node service failure. Once messages have been journalled and replicated, they are passed on to the business-logic.
- Responses are published over UDP to the gateways for transmission back to the market maker or customer.
Given this architecture, we necessarily tend to focus our attention on the core services, since these do the most work and actually model the business domain. The gateways are very lightweight and mainly perform translation work; they also have the nice property of being horizontally scalable if we need to reduce the load on any individual gateway.
Two of the main costs that we need to address are the time taken to journal messages to disk, and the time taken to synchronously replicate messages out to a secondary. For this reason, it made sense to start looking at disk journalling performance.
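To show where these two costs sit in the processing path, here is a rough sketch of the ordering a core service follows before applying business logic. The class and interface names are hypothetical, purely for illustration; this is not our actual code.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch (not our actual code) of the ordering in a core
// service's receive path: the message must be journalled to disk and
// acknowledged by the HA secondary before the business logic sees it.
public final class CoreServiceReceivePath
{
    interface Journaller { void append(ByteBuffer message); }
    interface Replicator { void replicateAndAwaitAck(ByteBuffer message); }
    interface BusinessLogic { void process(ByteBuffer message); }

    private final Journaller journaller;
    private final Replicator replicator;
    private final BusinessLogic businessLogic;

    public CoreServiceReceivePath(final Journaller journaller,
                                  final Replicator replicator,
                                  final BusinessLogic businessLogic)
    {
        this.journaller = journaller;
        this.replicator = replicator;
        this.businessLogic = businessLogic;
    }

    public void onMessage(final ByteBuffer message)
    {
        journaller.append(message.duplicate());               // cost 1: journal to disk
        replicator.replicateAndAwaitAck(message.duplicate()); // cost 2: synchronous replication
        businessLogic.process(message);                       // only now is the message processed
    }
}
```

Because both of these steps sit on the critical path of every inbound message, any latency or jitter they introduce is paid by every order in the system.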
Journalling Performance
For the last few years, we've been running our systems on CentOS 6.4, kernel 2.6.32, journalling messages to an ext3 file-system. The file-system is backed by a battery-backed RAID array, and we perform asynchronous writes - meaning that the data is only guaranteed to be in the operating system's page cache once the write() call has returned. At the time, our testing showed this configuration to be the most performant, given the trade-offs between the maturity of other file-systems and the safety guarantees we required.
Testing also showed that from a software point-of-view, using the JDK's RandomAccessFile gave the best performance for writes that always append to the end of a file. Using this technique, as messages arrive at a core service, they are appended to the current journal. When a journal reaches a certain size, the journaller rolls to the next file and continues appending data.
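A minimal sketch of this style of journalling is shown below. The class name, file naming and 1GiB size limit are illustrative assumptions, not our production code; note that nothing here calls sync() or force(), which is what makes the writes asynchronous.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch of append-only journalling with RandomAccessFile,
// rolling to a new file once the current journal reaches a size limit.
// There is no call to getFD().sync() or FileChannel.force(), so a
// successful write only guarantees the data is in the OS page cache.
public final class RollingJournaller
{
    private static final long MAX_JOURNAL_SIZE = 1L << 30; // 1GiB, illustrative

    private final File directory;
    private RandomAccessFile currentJournal;
    private int journalIndex;

    public RollingJournaller(final File directory) throws IOException
    {
        this.directory = directory;
        rollToNextFile();
    }

    public void append(final byte[] message) throws IOException
    {
        if (currentJournal.length() + message.length > MAX_JOURNAL_SIZE)
        {
            rollToNextFile();
        }
        currentJournal.write(message); // the file pointer stays at the end, so this appends
    }

    private void rollToNextFile() throws IOException
    {
        if (currentJournal != null)
        {
            currentJournal.close();
        }
        currentJournal = new RandomAccessFile(
                new File(directory, "journal-" + journalIndex++ + ".log"), "rw");
        currentJournal.seek(currentJournal.length()); // continue appending from the end
    }
}
```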
In order to determine what benefit we would get from changing the operating system/file-system/storage hardware, we needed to be able to accurately measure the time taken to journal incoming messages to disk.
First off, of course, it's necessary to be able to replicate production-like inbound traffic in a performance-test environment; see my previous posts on how you might go about getting to this point.
Measuring the baseline
Once we were happy that we could adequately stress the system-under-test, we found that the best way to measure journalling latency was just to wrap the write call with a timer.
Our journaller was instrumented with a couple of calls to System.nanoTime().
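In outline, the change amounted to something like the sketch below; the ValueRecorder interface and its recordValue method are assumptions used here for illustration, not our actual classes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the instrumentation around the write call. The ValueRecorder
// interface and its recordValue method are assumed here for illustration.
public final class InstrumentedJournaller
{
    interface ValueRecorder { void recordValue(long value); }

    private final RandomAccessFile journal;
    private final ValueRecorder writeLatencyRecorder;

    InstrumentedJournaller(final RandomAccessFile journal,
                           final ValueRecorder writeLatencyRecorder)
    {
        this.journal = journal;
        this.writeLatencyRecorder = writeLatencyRecorder;
    }

    void append(final byte[] message) throws IOException
    {
        final long startNanos = System.nanoTime();

        journal.write(message); // returns once the data has reached the page cache

        writeLatencyRecorder.recordValue(System.nanoTime() - startNanos);
    }
}
```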
The ValueRecorder component referenced here simply maintains a maximum-seen value and publishes this value to a monitoring endpoint every second or so. Using this small change, we were able to see exactly how long it was taking to perform an asynchronous write to underlying storage.
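A minimal version of such a recorder could look something like the sketch below; the LongConsumer standing in for the monitoring endpoint is a placeholder, not our actual monitoring code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Minimal sketch of a max-tracking recorder that publishes (and resets)
// the maximum value seen roughly once per second. The publishing target
// is a placeholder for a real monitoring endpoint.
public final class MaxValueRecorder
{
    private final AtomicLong maxSeen = new AtomicLong(Long.MIN_VALUE);

    public MaxValueRecorder(final LongConsumer monitoringEndpoint)
    {
        final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // publish the interval maximum, then reset ready for the next interval
        scheduler.scheduleAtFixedRate(
                () -> monitoringEndpoint.accept(maxSeen.getAndSet(Long.MIN_VALUE)),
                1, 1, TimeUnit.SECONDS);
    }

    public void recordValue(final long value)
    {
        maxSeen.accumulateAndGet(value, Math::max);
    }
}
```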
Armed with this ability to extract accurate metrics from our journaller, we ran a baseline test to see how the system was currently performing.
Max write latency in microseconds per-second (kernel 2.6.32/ext3)
In our existing configuration, we had a background noise of 200 - 400 microseconds write latency, with spikes up to 1 millisecond. Clearly, in order to get to consistently low latencies, we needed to address this. When inspected in detail, we can see that the best case latency for the write call is about 10 microseconds:
Best case write latency of ~10 microseconds
Measuring improvements
Although we knew that we were planning to upgrade to new hardware, when performance testing it is always advisable to change only one thing at a time; otherwise it's impossible to know which single change had a beneficial or detrimental impact. With this methodology in mind, we first decided to upgrade the kernel, then the file-system, using the same hardware, each time recording the results. For brevity's sake, I'll just present the outcome of those tests: we found that the best combination on the old hardware was to upgrade the kernel to a more recent version (3.17), and to use the ext4 file-system in place of ext3. The results of these changes were obvious when we re-ran the previous test.
Max write latency in microseconds per-second (kernel 2.6.32/ext3 vs 3.17/ext4)
Background noise was now down to around 50-100 microseconds, with spikes of around 200 microseconds. Looking in detail again, we can see that the best-case write latency is still around 10 microseconds - suggesting that this is the true cost of the write call plus JNI overhead when the kernel only performs an asynchronous write (essentially just a memory copy into the page cache).
Best case write latency still ~10 microseconds
Now that we had selected the optimal OS/file-system configuration, we tried upgrading the hardware. Again, in an attempt to change only one thing at a time, we also tried the kernel 2.6.32/ext3 combination on the new hardware; for brevity, I will just show the results for kernel 3.17/ext4, which again yielded the best results.
Write latency in microseconds after hardware upgrade
The improvement with the new hardware is actually difficult to see at this scale, such is the reduction in jitter and latency. Looking at the charts with a log scale on the y-axis makes things a little clearer.
Write latency in microseconds after hardware upgrade (log scale)
With the combination of new hardware, operating system and file-system, background noise is down to 10 - 20 microseconds, with spikes up to around 300 microseconds. This is a great improvement on the baseline, which had a background of 200 - 400 microseconds, with spikes up to 1 millisecond. We can also see that the best-case write latency has decreased to 4 - 5 microseconds, about half of what it was on the original configuration.
Further improvements
More analysis of these results revealed that the 300 microsecond spikes are caused by the journaller rolling to a new file, rather than the cost of actually doing a write call (the file-rolling in our journaller is orchestrated at a lower level than the instrumentation that we added). This is something that we will be able to easily fix in software, meaning that we should have consistent write latencies of under 20 microseconds.
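One way to take the rolling cost off the critical path (an illustrative approach, not a description of the fix we actually made) is to have a background thread open the next journal file ahead of time, so that rolling becomes a simple swap of file handles:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: a background thread opens the next journal file
// ahead of time, so rolling on the critical write path is reduced to
// swapping references rather than creating and opening a new file.
public final class PreallocatingJournaller
{
    private static final long MAX_JOURNAL_SIZE = 1L << 30; // illustrative limit

    private final ExecutorService background = Executors.newSingleThreadExecutor();
    private final File directory;
    private RandomAccessFile currentJournal;
    private Future<RandomAccessFile> nextJournal;
    private int journalIndex;

    public PreallocatingJournaller(final File directory) throws IOException
    {
        this.directory = directory;
        this.currentJournal = openJournal(journalIndex++);
        this.nextJournal = openNextInBackground();
    }

    public void append(final byte[] message) throws Exception
    {
        if (currentJournal.length() + message.length > MAX_JOURNAL_SIZE)
        {
            currentJournal.close();
            currentJournal = nextJournal.get(); // already opened off the critical path
            nextJournal = openNextInBackground();
        }
        currentJournal.write(message);
    }

    private Future<RandomAccessFile> openNextInBackground()
    {
        final int indexToOpen = journalIndex++;
        return background.submit(() -> openJournal(indexToOpen));
    }

    private RandomAccessFile openJournal(final int index) throws IOException
    {
        return new RandomAccessFile(new File(directory, "journal-" + index + ".log"), "rw");
    }
}
```

A pre-opened (or pre-allocated) file also gives the file-system a chance to do its metadata work before the journaller needs the file, rather than at the moment of the roll.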
We have also spent very little time experimenting with kernel tunables related to I/O performance; there may be further gains to be made by working on I/O schedulers and priorities.
Conclusion
Upgrading our operating system and hardware made a huge difference to the amount of jitter that we saw in our journalling performance. This was however, a significant undertaking and not an opportunity that comes along very often.
A modern kernel on commodity hardware seems to be capable of write latencies as low as 5 microseconds when asynchronous I/O is used.
Once again, the importance of being able to replay production-like inputs to our system has proven invaluable in testing and tuning for performance in a repeatable manner. Having this ability means that we are able to try out different settings without impacting the performance of our production environments, and generate faster feedback on these changes.