Monday, 12 May 2014

Performance Testing at LMAX (Part One)

In order to help us in our goal to develop the world's fastest trading exchange, part of the continuous delivery pipeline at LMAX consists of performance tests. There are many pitfalls of performance testing, and we've come across a few of them over the years. I'm going to describe how we carry out performance testing of our system at LMAX. Our process isn't perfect, but it's something we're continually working on to make sure that our customers have the best possible experience when using our products.

Our business

LMAX runs a Multilateral Trading Facility (MTF), where customers can buy and sell a number of different asset classes as contracts-for-difference (CFD). Our success depends on us being able to offer customers the best prices, with tight spreads and fast execution speed, so that they know they will be matching at the price they expect. On the other hand, as we are an exchange, we need to offer our market-makers extremely low latency order flow, in order for them to be able to react to market conditions. When we talk about latency in our system, we are describing the time it takes for an incoming message (e.g. an order placement) to travel from our network (inbound), through our exchange, and back out to the network for transport to the client - this is a single transaction. On its way, the message may pass through several systems, taking in several network hops before a response is generated to send to the client.
Typically, we will see total message rates from our market-makers at between 1,000 and 2,000 messages per second. These rates can, however, spike up to around 10,000 messages per second during times of high market volatility, or during market announcements. At these message rates, our mean latency is currently 200-300 microseconds.
Customers placing orders against the market-makers will supply many hundreds of orders per second, with peaks of up to 1,000 messages per second during market announcements. Typical mean latencies for these clients are under 5 milliseconds (the difference between this and the market-maker latency is mostly down to pre-trade risk-control and extra validation work done before a trade occurs).

Measuring latency

Since we have already defined our latency measurements as being a full round-trip through our system, terminated at the network, it follows that we measure latency as close to the network as possible. To enable traffic capture, we have a network tap installed at the edge of our system, through which we can inspect network packets as they pass. If we were monitoring a simple website, we would only have to measure the time taken for an HTTP request to complete, but since our system is inherently asynchronous, we must use a more complex method of capturing these data.
The FIX Protocol
Many financial institutions communicate with each other using the FIX Protocol. While it is not the most efficient means of transport, it is ubiquitous, and so for our market-makers, and our customers who place orders through automated algorithms, we support the FIX protocol as a means of placing orders on our exchange. The protocol is based on stateful, long-lived sessions, meaning that we cannot just measure the time for a single request to complete. Instead, we need to model the 'conversation' that occurs between client and server when an order is placed.

A typical FIX transaction
Request, captured at 2014-05-09 UTC 06:36:12.192208:
8=FIX.4.2|9=211|35=D|49=userId|56=LMAX-FIX|34=258790|52=20140509-06:36:12.193|22=8|47=P|21=1|54=1|60=20140509-06:36:12.193|59=0|38=10000|40=2|581=1|11=774876524712244|55=GBP/JPY|48=180415|44=148.72674|10=227|
Response, captured at 2014-05-09 UTC 06:36:12.192569:
8=FIX.4.2|9=269|35=8|34=269501|49=LMAX-FIX|52=20140509-06:36:12.193|56=userId|1=1|6=0|11=774876524712244|14=0|17=QLAL2AAAAAABVSJE|20=0|22=8|37=QLAL2AAAAAAANCJV|38=10000|39=0|44=148.72674|48=180415|54=1|55=GBP/JPY|59=0|60=20140509-06:36:12.193|150=0|151=10000|10=185|


The messages above describe a single transaction used to place an order on a specific instrument.
The incoming message (in this case, a request to place a single order) specifies the instrument, quantity, price and identifier of the order, along with additional data. For latency-measuring purposes, we are interested in two pieces of data here: arrival time (at our network), and request identifier. When incoming messages are received, we store the arrival time along with the identifier for measurement purposes.
The corresponding outgoing message (an execution report) contains confirmation that the order was placed, along with the original request identifier. As this message leaves our network for transport to the client, we retrieve the request identifier, and measure the elapsed time since the original request was received.
In this way, we are able to measure the round-trip latency of every message that passes through our system.
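
The correlation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the capture tooling itself: the `parse_fix` helper and `LatencyTracker` class are hypothetical names, the messages are abbreviated versions of the example above, and the `|` delimiter is the display form used in this post. The only protocol facts relied on are that tag 11 carries the request identifier (ClOrdID) and that the execution report echoes it back.

```python
from datetime import datetime, timedelta


def parse_fix(raw, delimiter="|"):
    """Split a FIX message into a {tag: value} dict (hypothetical helper)."""
    fields = (field.split("=", 1) for field in raw.split(delimiter) if field)
    return {int(tag): value for tag, value in fields}


class LatencyTracker:
    """Pairs inbound requests with outbound responses by request identifier."""

    CL_ORD_ID = 11  # FIX tag holding the client-assigned order identifier

    def __init__(self):
        self._arrivals = {}  # request identifier -> capture time of the request

    def on_request(self, captured_at, raw):
        """Remember when an inbound request was seen on the network."""
        self._arrivals[parse_fix(raw)[self.CL_ORD_ID]] = captured_at

    def on_response(self, captured_at, raw):
        """Return round-trip latency in microseconds, or None if unmatched.

        Assumption: only the first response per request is measured; later
        execution reports for the same order find no stored arrival time.
        """
        request_id = parse_fix(raw).get(self.CL_ORD_ID)
        arrival = self._arrivals.pop(request_id, None)
        if arrival is None:
            return None
        return (captured_at - arrival) // timedelta(microseconds=1)


# Capture times taken from the example transaction above.
timestamp_format = "%Y-%m-%d %H:%M:%S.%f"
request_time = datetime.strptime("2014-05-09 06:36:12.192208", timestamp_format)
response_time = datetime.strptime("2014-05-09 06:36:12.192569", timestamp_format)

tracker = LatencyTracker()
tracker.on_request(request_time, "8=FIX.4.2|35=D|11=774876524712244|55=GBP/JPY|")
latency = tracker.on_response(
    response_time, "8=FIX.4.2|35=8|11=774876524712244|55=GBP/JPY|"
)
print(latency)  # prints 361
```

For the transaction shown earlier, the two capture times differ by 361 microseconds, which is the round-trip latency of that order placement.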

Modelling production load profile

In order to exercise your system in the same way that it is used in production, it is necessary to analyse the shape of traffic when your system is under normal load, and when it is under heavy load (peak traffic times). Your modelling should capture the types of request being made (e.g. place an order, request market data) and the frequency with which they are being made (message rate). There are two approaches to using the results of the analysis: either build a mechanism that is able to replay production traffic (may be quite easy for HTTP requests for a mainly static site), or build a statistical model that is used to drive traffic generators (more efficient when traffic is largely stateful, e.g. FIX).
Though the replay option is arguably more correct, it is difficult to achieve for an asynchronous event-driven system. Consider the example where user A places an order and matches against market-maker B; if we are merely replaying traffic, it would be necessary to coordinate between the process replaying A's traffic and the process replaying B's traffic such that the system will be in the correct state for A to match B. The consequent synchronisation costs and extra complexity required to do this would make such a solution prohibitive.
Another good reason to use a statistical model is the ability to easily scale up the model, either in terms of throughput, or in terms of the number of active sessions. In this way, it is possible to easily observe how your system behaves under 2x, 5x, or 10x the current production load. This is a vital part of capacity planning, and staying ahead of the curve in terms of ensuring that your system will perform under increased load.
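
To make the idea of a scalable statistical model concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the `LoadModel` class, the message-type names, the 70/30 mix, and the choice of exponentially distributed gaps between messages (a Poisson arrival process). A real model would take its rates and distributions from analysis of production traffic.

```python
import random


class LoadModel:
    """A toy traffic model: an overall message rate plus a message-type mix."""

    def __init__(self, rate_per_second, message_mix, seed=None):
        self.rate_per_second = rate_per_second
        self.message_mix = dict(message_mix)  # message type -> relative weight
        self._rng = random.Random(seed)

    def scaled(self, factor, seed=None):
        """Return a model with the same mix at `factor` times the throughput."""
        return LoadModel(self.rate_per_second * factor, self.message_mix, seed)

    def events(self, duration_seconds):
        """Yield (time_offset_seconds, message_type) pairs for one run.

        Assumption: gaps between messages are exponentially distributed.
        """
        types = list(self.message_mix)
        weights = list(self.message_mix.values())
        now = self._rng.expovariate(self.rate_per_second)
        while now < duration_seconds:
            yield now, self._rng.choices(types, weights)[0]
            now += self._rng.expovariate(self.rate_per_second)


# Hypothetical baseline: 1,500 messages per second, mostly order placements.
baseline = LoadModel(1500, {"place_order": 0.7, "cancel_order": 0.3}, seed=42)
ten_times = baseline.scaled(10, seed=42)

for model in (baseline, ten_times):
    generated = sum(1 for _ in model.events(duration_seconds=1.0))
    print(model.rate_per_second, "msg/s target ->", generated, "messages generated")
```

Because the throughput is a single parameter, testing at 2x, 5x, or 10x the current load is a one-line change rather than a new capture of production traffic.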
Using the statistical model it is possible to build a simulation that closely resembles your production traffic load. In order to have confidence in your modelling, it is of course necessary to measure the applied load in your performance testing environment and analyse it using the same tools used to inspect the production traffic load. If your modelling is correct, and your load-generator harness is working well, your analyses of the two different environments should look broadly similar.

This is an iterative process that needs to be repeated at regular intervals to ensure that there has not been a significant change in production load profile. Failure to detect such a change will mean that your performance tests are no longer accurately simulating the load on your production systems.
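
One way to automate the comparison between the two environments is to reduce each traffic capture to a small profile and check that the profiles agree within a tolerance. The sketch below is illustrative only: `load_profile`, `profiles_match`, the 10% tolerance, and the synthetic captures are all assumptions, and a fuller analysis would also look at burstiness and per-session behaviour rather than just the mean rate and message mix.

```python
from collections import Counter


def load_profile(events):
    """Summarise (timestamp_seconds, message_type) pairs as a rate and a mix."""
    events = list(events)
    times = [timestamp for timestamp, _ in events]
    duration = max(times) - min(times) if times else 0
    if duration <= 0:
        raise ValueError("need events spanning a non-zero time interval")
    counts = Counter(kind for _, kind in events)
    return {
        "rate": len(events) / duration,  # messages per second
        "mix": {kind: n / len(events) for kind, n in counts.items()},
    }


def profiles_match(production, simulated, tolerance=0.10):
    """True if the rate is within `tolerance` (relative) of production and
    each message type's share of traffic is within `tolerance` (absolute)."""
    rate_gap = abs(simulated["rate"] - production["rate"])
    rate_ok = rate_gap <= tolerance * production["rate"]
    kinds = set(production["mix"]) | set(simulated["mix"])
    mix_ok = all(
        abs(production["mix"].get(kind, 0) - simulated["mix"].get(kind, 0))
        <= tolerance
        for kind in kinds
    )
    return rate_ok and mix_ok


# Synthetic captures standing in for real production and test traffic.
production_capture = [
    (i * 0.001, "place_order" if i % 10 < 7 else "cancel_order")
    for i in range(1000)
]
simulated_capture = [
    (i * 0.001, "place_order" if i % 10 < 6 else "cancel_order")
    for i in range(1000)
]

print(profiles_match(load_profile(production_capture),
                     load_profile(simulated_capture)))
```

Running a check like this on a schedule, against fresh production captures, gives an early warning when the production load profile has drifted away from what the performance tests simulate.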

Conclusion

Successful performance testing depends on the ability to accurately measure and simulate production-like workloads in a performance environment.
Performance testing should be carried out as part of your continuous delivery pipeline - no one wants to find out about a degradation in application performance after new code has been shipped to production.
Accurately simulating production load can be difficult - always measure the load profile being applied in your performance environment to ensure that it accurately models production traffic.




Friday, 2 May 2014

DisruptorProxy 1.5 released

The DisruptorProxy previously suffered from a problem whereby primitive method parameters were auto-boxed, thus generating garbage.


Version 1.5 fixes this by generating argument holder objects, which can handle primitive types, meaning that no extra garbage is allocated after initial setup.


The new release is available on GitHub to build from source.