<h1>Multi-Topic Broadcast in Babl WebSocket Server</h1>
<p>As of version 0.10.0, <a href="https://babl.ws" target="_blank">Babl</a> applications gained the ability to broadcast messages to multiple topics simultaneously. This new functionality makes publishing market data even more efficient.</p>
<p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3e5LuxeJtrytbp4YQXPAsMbKYSIy6okUbRQO03KFQfkvfjaJn2X57mhsi-ERBtwuk9tXNbzlNGhgIpiJXT4PzOC_7bTE1QmwAvcj4tnReO9XXfEnEmgZVyVk7VpM_veyooyxOe9rkokPF/s853/MultiTopicBroadcast.png"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3e5LuxeJtrytbp4YQXPAsMbKYSIy6okUbRQO03KFQfkvfjaJn2X57mhsi-ERBtwuk9tXNbzlNGhgIpiJXT4PzOC_7bTE1QmwAvcj4tnReO9XXfEnEmgZVyVk7VpM_veyooyxOe9rkokPF/w640-h544/MultiTopicBroadcast.png" title="Multi Topic Broadcast" width="640" height="544" /></a></p>
<p>Using multi-topic broadcast, an application can send a single message to multiple topics, and transform that message for each receiving topic. The message is broadcast to all session containers via IPC, and is picked up by those containers hosting sessions that belong to the relevant topics.</p>
<p>This specialisation is designed for situations where the same basic information needs to be sent to a large number of sessions, but with a slightly different view depending on the session.</p>
<p>The obvious example of this use-case is the publication of market data, where a single update needs to be sent to several subscribed sessions. Some of these sessions might subscribe to the full depth of market data, while others might only require top-of-book updates.</p>
<p>Without multi-topic broadcast, it would be necessary to pre-render each of these updates (full-depth, top-of-book) on the application thread, and to publish each one over IPC to the session containers. Instead, it is possible to configure a topic for each view-type, and publish a single message over IPC.
The work of transforming the message is then off-loaded to a message transformer, running on the session container thread.</p>
<h2>Multi-Topic Broadcast API</h2>
<p>To use the API, an application must create and maintain membership of topics:</p>
<pre>
private static final class MarketDataApplication
    implements Application, BroadcastSource
{
    private Broadcast broadcast;
    private int[] topicIds;

    @Override
    public void setBroadcast(final Broadcast broadcast)
    {
        // store the Broadcast implementation
        this.broadcast = broadcast;
        this.topicIds = new int[] {
            MARKET_DATA_FULL_DEPTH, MARKET_DATA_TOP_OF_BOOK};
        // create a topic for each view-type
        broadcast.createTopic(MARKET_DATA_FULL_DEPTH);
        broadcast.createTopic(MARKET_DATA_TOP_OF_BOOK);
    }

    @Override
    public int onSessionConnected(final Session session)
    {
        // add new sessions to a topic
        broadcast.addToTopic(MARKET_DATA_TOP_OF_BOOK, session.id());
        return SendResult.OK;
    }

    public void onMarketDataUpdate(final MarketDataUpdate update)
    {
        final DirectBuffer buffer = serialise(update);
        // send a single message to all sessions registered on the topics
        broadcast.sendToTopics(topicIds, buffer, 0, buffer.capacity());
    }
}
</pre>
<p>Application messages broadcast to multiple topics are handed to the registered message transformer before being written to sessions. A message transformer creates a topic-dependent view based on the topic ID:</p>
<span class="hljs-keyword" style="font-weight: 700;">implements</span> <span class="hljs-title" style="color: #880000; font-weight: 700;">MessageTransformer</span>
</span>{
<span class="hljs-keyword" style="font-weight: 700;">private</span> <span class="hljs-keyword" style="font-weight: 700;">final</span> ExpandableArrayBuffer dst =
<span class="hljs-keyword" style="font-weight: 700;">new</span> ExpandableArrayBuffer();
<span class="hljs-keyword" style="font-weight: 700;">private</span> <span class="hljs-keyword" style="font-weight: 700;">final</span> TransformResult transformResult =
<span class="hljs-keyword" style="font-weight: 700;">new</span> TransformResult();
<span class="hljs-meta" style="color: #1f7199;">@Override</span>
<span class="hljs-function"><span class="hljs-keyword" style="font-weight: 700;">public</span> TransformResult <span class="hljs-title" style="color: #880000; font-weight: 700;">transform</span><span class="hljs-params">(
<span class="hljs-keyword" style="font-weight: 700;">final</span> <span class="hljs-keyword" style="font-weight: 700;">int</span> topicId,
<span class="hljs-keyword" style="font-weight: 700;">final</span> DirectBuffer input,
<span class="hljs-keyword" style="font-weight: 700;">final</span> <span class="hljs-keyword" style="font-weight: 700;">int</span> offset,
<span class="hljs-keyword" style="font-weight: 700;">final</span> <span class="hljs-keyword" style="font-weight: 700;">int</span> length)</span>
</span>{
<span class="hljs-keyword" style="font-weight: 700;">if</span> (topicId == MARKET_DATA_TOP_OF_BOOK)
{
transformResult.set(dst, <span class="hljs-number" style="color: #880000;">0</span>, encodeTopOfBook(input, offset, length));
}
<span class="hljs-keyword" style="font-weight: 700;">else</span>
{
transformResult.set(dst, <span class="hljs-number" style="color: #880000;">0</span>, encodeFullDepth(input, offset, length));
}
<span class="hljs-keyword" style="font-weight: 700;">return</span> transformResult;
}
}</pre><p>The message transformer must be registered before server start-up using the configuration API:</p><pre class="hljs" style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; color: #444444; display: block; overflow-x: auto; padding: 0.5em;"><span class="hljs-keyword" style="font-weight: 700;">new</span> BablConfig().sessionContainerConfig()
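<p>To complete the picture, the application and the transformer factory can be wired into the same configuration before the server is launched. The following is a minimal sketch: <code>applicationConfig().application(...)</code>, <code>BablServer.launch(...)</code> and <code>SessionContainers</code> are taken from Babl's getting-started documentation rather than from this post, so verify them against the version in use.</p>
<pre>
// Sketch: wiring the broadcast application and transformer into Babl.
// applicationConfig().application(...), BablServer.launch(...) and
// SessionContainers are assumptions based on the Babl getting-started
// docs; messageTransformerFactory(...) is shown above.
final BablConfig config = new BablConfig();
config.applicationConfig().application(new MarketDataApplication());
config.sessionContainerConfig()
    .messageTransformerFactory(topicId -> new MarketDataTransformer());

try (SessionContainers containers = BablServer.launch(config))
{
    containers.start();
    // block until the process is signalled to shut down (Agrona utility)
    new org.agrona.concurrent.ShutdownSignalBarrier().await();
}
</pre>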
<h2>More Info</h2>
<p>Babl <a href="https://github.com/babl-ws/babl" target="_blank">source-code</a> is available on Github, and releases are published to <a href="https://mvnrepository.com/artifact/com.aitusoftware/babl" target="_blank">Maven Central</a> and <a href="https://hub.docker.com/r/aitusoftware/babl" target="_blank">Docker Hub</a>.</p>
<p>Full documentation is available at <a href="https://babl.ws" target="_blank">https://babl.ws</a>.</p>
<p>Enjoyed this post? You might be interested in subscribing to the <a href="https://foursteps.software/" target="_blank">Four Steps Newsletter</a>, a monthly newsletter covering techniques for writing faster, more performant software.</p>
<h1>Babl High-Performance WebSocket Server</h1>
<h2>A lockdown project</h2>
<p>A couple of years ago, after evaluating the available open-source solutions, I became interested in the idea of writing a low-latency websocket server for the JVM. The project I was working on required a high-throughput, low-latency server to provide connectivity to a clustered backend service. We started by writing some benchmarks to see whether the most popular servers would be fast enough. In the end, we found that an existing library provided the performance required, so the project moved on. I still had the feeling that a cut-down, minimal implementation of the websocket protocol could provide better overall and tail latencies than a generic solution.</p>
<p>The project in question was the Hydra platform, a product of Adaptive Financial Consulting [1], created to help accelerate development of their clients' applications. The Hydra platform uses Aeron Cluster [2] for fault-tolerant, high-performance state replication, along with Artio [3] for FIX connectivity, and a Vert.x [4]-based websocket server for browser/API connectivity. Users of Hydra get a "batteries included" platform on which to deploy their business logic, with Hydra taking care of all the messaging, fault-tolerance, lifecycle, and state-management functions.</p>
<p>Vert.x was the right choice for the platform, as it provided good performance, and came with lots of other useful functionality such as long-poll fallback. However, I was still curious about whether it would be possible to create a better solution, performance-wise, just for the websocket connectivity component. During the UK's lockdown period, I found time to begin development on a new project, of which <a href="https://babl.ws" target="_blank">Babl</a> is the result.</p>
<h2>High Performance Patterns</h2>
<p>I have been working on low-latency systems of one sort or another for over 10 years. During that time the single-thread event-loop pattern has featured again and again. The consensus is that a busy-spinning thread is the best way to achieve low latencies. Conceptually, this makes sense, as we avoid the wake-up cost associated with signalling a waiting thread. In practice, depending on the deployment environment (such as a virtual cloud instance), busy-spinning can be detrimental to latency due to throttling effects. For this reason, it is usually reserved for bare-metal deployments, or cases where there are known to be no multi-tenanting issues.</p>
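<p>The shape of such an event loop is sketched below. It is illustrative only (the class and method names are not Babl's internals): the important property is that the thread never blocks, polling for socket readiness in a tight loop.</p>
<pre>
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustrative busy-spinning event loop; names are hypothetical,
// not Babl's internal types.
final class BusySpinEventLoop implements Runnable
{
    private final Selector selector;
    private volatile boolean running = true;

    BusySpinEventLoop(final Selector selector)
    {
        this.selector = selector;
    }

    @Override
    public void run()
    {
        while (running)
        {
            try
            {
                // selectNow() never blocks, so the thread busy-spins
                // rather than paying the wake-up cost of a blocking select()
                if (selector.selectNow() != 0)
                {
                    for (final SelectionKey key : selector.selectedKeys())
                    {
                        // read/write ready channels without allocating
                    }
                    selector.selectedKeys().clear();
                }
            }
            catch (final IOException e)
            {
                running = false;
            }
        }
    }
}
</pre>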
<p>Another well-known approach to achieving high performance is to exploit memory locality by using cache-friendly data-structures. In addition, object-reuse in a managed-runtime environment, such as the JVM, can help an application avoid long pause times due to garbage collection.</p>
<p>One other common performance anti-pattern is excessive logging and monitoring. Babl exports metrics counters to shared memory, where they can be read from another process. Other metrics providers, such as JMX, will cause allocation within the JVM, contributing to latency spikes. The shared-memory approach means that a 'side-car' process can be launched, which is responsible for polling metrics and transmitting them to a backend store such as a time-series database.</p>
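<p>The sketch below shows the general idea of publishing a counter through a memory-mapped file, using the Agrona library that Babl builds on. The file name and counter layout here are hypothetical; Babl's actual counter files have their own layout.</p>
<pre>
import java.io.File;

import org.agrona.IoUtil;
import org.agrona.concurrent.UnsafeBuffer;

// Server side: write a counter to shared memory; a side-car process
// can map the same file read-only and poll the value without any
// allocation or inter-process locking. File name and offset are
// hypothetical, for illustration only.
public final class SharedCounterExample
{
    private static final int MESSAGES_RECEIVED_OFFSET = 0;

    public static void main(final String[] args)
    {
        final File file = new File("/dev/shm/server-metrics.dat");
        final UnsafeBuffer buffer = new UnsafeBuffer(
            IoUtil.mapNewFile(file, 4096));

        // ordered write: made visible to the reading process
        buffer.putLongOrdered(MESSAGES_RECEIVED_OFFSET, 42L);
    }
}
</pre>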
<p>By applying these techniques in a configuration inspired by Real Logic's Artio FIX engine, I aimed to create a minimally functional websocket server that could be used in low-latency applications, such as cryptocurrency exchanges. The design allows socket-processing work to be scaled out to the number of available CPUs, fully utilising the resources of today's multi-core machines.</p>
<h2>Architecture</h2>
<p>The Babl architecture separates the concern of socket- and protocol-handling from the needs of the application developer. The intent is that the developer need only implement business logic, with Babl taking care of all protocol-level operations such as framing, heartbeats and connections.</p>
<p>Socket processing and session lifecycle are managed in the session container, instances of which can be scaled horizontally. The session container sends messages to the application container via an efficient IPC, provided by Aeron. In the application container, the user's business logic is executed on a single thread, with outbound messages routed back to the originating session container.</p>
<p>The image below shows how the various components are related.</p>
<p><img src="https://lh6.googleusercontent.com/XWuXjHgMZQrW2zGmOPRh4XpGO4WlpmAOmc3iP66OrgClm8hvbPskL6ssBFxhH1qb8BJrz4iL4xn02kccNoNlwG5PKjXQdiEPrBdlJg-xbesIP9PAyzPfHleP53mXKJppcE4TwY24" width="624" height="503" /></p>
<h2>Use cases</h2>
<p>Babl is designed for modern applications where latency is a differentiator, rather than as a general web-server. For this reason, there is no long-poll fallback for browsers that are unable to establish websocket connections; nor is there any facility for serving web artifacts over HTTP.</p>
<p>If HTTP resources need to be served, the recommended approach is to use a static resource server (e.g. nginx) to handle browser requests, and to proxy websocket connections through the same endpoint. Currently, this must be done through the same domain; CORS/pre-flight checks will be added in a future release. An <a href="https://github.com/babl-ws/babl/blob/master/docker/docker-compose.yaml#L3">example</a> of this approach can be seen in the Babl github repository.</p>
<h2>Configuration</h2>
<p>Babl can be launched using its APIs, or from the command-line using properties-based configuration. The various <a href="https://babl.ws/configuration.html">configuration options</a> control everything from socket parameters to memory usage and performance characteristics.</p>
<p>To quickly get started, add the Babl libraries to your application, implement the <a href="http://babl.ws/application.html">Application</a> interface, and launch Babl server from the <a href="https://babl.ws/getting_started.html#launch-standalone">command-line</a> or in a <a href="https://babl.ws/getting_started.html#launch-docker">docker container</a>.</p>
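<p>An Application implementation can be as small as the echo service used in the benchmarks below. This sketch assumes the callback signatures described in the Babl documentation (<code>onSessionConnected</code>, <code>onSessionDisconnected</code>, <code>onSessionMessage</code>); check the docs for the exact interface in the version you are using.</p>
<pre>
import org.agrona.DirectBuffer;
// Babl user-API imports (Application, Session, SendResult, ContentType,
// DisconnectReason) omitted; signatures assumed from babl.ws/application.html.

public final class EchoApplication implements Application
{
    @Override
    public int onSessionConnected(final Session session)
    {
        return SendResult.OK;
    }

    @Override
    public int onSessionDisconnected(final Session session, final DisconnectReason reason)
    {
        return SendResult.OK;
    }

    @Override
    public int onSessionMessage(
        final Session session,
        final ContentType contentType,
        final DirectBuffer msg,
        final int offset,
        final int length)
    {
        // echo the payload straight back to the originating session
        return session.send(contentType, msg, offset, length);
    }
}
</pre>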
<h2>Performance</h2>
<p>Babl has been designed from the beginning to offer the best available performance. Relevant components of the system have been benchmarked and profiled to make sure that they are optimised by the JVM.</p>
<p><a href="https://github.com/babl-ws/babl/tree/master/src/jmh/java/com/aitusoftware/babl/websocket">JMH benchmarks</a> demonstrate that Babl can process millions of websocket messages per second on a single thread:</p>
<pre>
Benchmark                          Score (ops/sec)        Error
FrameDecoder.decodeMultipleFrames     13446695.952 ±   776489.255
FrameDecoder.decodeSingleFrame        53432264.716 ±   854846.712
FrameeEncoder.encodeMultipleFrames    12328537.512 ±   425162.902
FrameEncoder.encodeSingleFrame        39470675.807 ±  2432772.310
WebSocketSession.processSingleMsg     15821571.760 ±   173051.962
</pre>
<p>Due to Babl's design and tight control over threading, it is possible to use thread-affinity to run the processing threads (socket-processing and application logic) on isolated cores, further reducing latency outliers caused by system jitter. In a comparative benchmark between Babl and another leading open-source solution, Babl showed much lower tail latencies due to its allocation-free design and low-jitter configuration.</p>
<p>In this test, both server implementations use a queue to pass incoming messages to a single-threaded application. In the case of Babl, this is Aeron IPC; for the other server I used the trusted Disruptor with a busy-spin wait strategy. The application logic echoes responses back to the socket-processing component via a similar queue, as shown below:</p>
<p><img src="https://lh4.googleusercontent.com/7Iai1wgTGLgL0p5RRA6b-FyYX34295_KsJwprrTaUgHeV37wGtM8NLNp22wYjaKKUUW1OBZKZ0OsV3yEXb521q2NO0topcOTTu5rK_4Ro33L6BbLO4nOuPEqq10CVFNFH1q6tlGm" width="624" height="269" /></p>
<p>Each load generator thread runs on its own isolated CPU, handling an equal number of sessions; each session publishes a 200-byte payload to the server at a fixed rate, and records the round-trip time (RTT).</p>
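<p>A sketch of the per-session measurement loop is shown below. It is illustrative rather than the actual harness code (the real harness is linked later in this post): it paces sends off a fixed schedule and records RTTs into an HdrHistogram, from which percentile tables like those below can be produced. <code>sendAndAwaitEcho()</code> is a hypothetical helper.</p>
<pre>
import org.HdrHistogram.Histogram;

// Illustrative fixed-rate measurement loop; sendAndAwaitEcho() is a
// hypothetical helper that sends the 200-byte payload and blocks for
// the echoed response.
static Histogram measureRtt(final int messagesPerSecond, final int messageCount)
{
    // track values up to one hour, with 3 significant digits
    final Histogram histogram = new Histogram(3_600_000_000_000L, 3);
    final long intervalNs = 1_000_000_000L / messagesPerSecond;
    long nextSendNs = System.nanoTime();

    for (int i = 0; i < messageCount; i++)
    {
        // spin until the scheduled send time to maintain a fixed rate
        while (System.nanoTime() < nextSendNs)
        {
            Thread.onSpinWait();
        }
        sendAndAwaitEcho();
        // measure against the intended send time, not the actual one,
        // so that back-pressure stalls are not hidden (coordinated omission)
        histogram.recordValue(System.nanoTime() - nextSendNs);
        nextSendNs += intervalNs;
    }
    return histogram;
}
</pre>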
<p>Both server and client processes are run using OpenOnload to bypass the kernel networking stack.</p>
<p>The load generator runs a 100-second warm-up period at the specified throughput before a 100-second measurement period; this ensures that any JIT compilation has had time to complete.</p>
<p>The work done on the server includes the following steps:</p>
pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">Application logic (echo)</span></p></li><li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: decimal; text-decoration: none; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">Thread hop on outbound queue</span></p></li><li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: decimal; text-decoration: none; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">Protocol encoding</span></p></li><li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; list-style-type: decimal; text-decoration: none; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">Socket write</span></p></li></ol><br /><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 16pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">Benchmark Results</span></h2>Response times are expressed in microseconds.<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;"><span style="-webkit-text-decoration-skip: none; background-color: transparent; color: #434343; font-family: Arial; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration-skip-ink: none; text-decoration: underline; vertical-align: baseline; white-space: pre;">10,000 messages per second, 1,000 sessions</span></h3><br /><br /><div align="left" dir="ltr" style="margin-left: 0pt;"><table style="border-collapse: collapse; border: medium none; table-layout: fixed; width: 468pt;"><colgroup><col></col><col></col><col></col><col></col><col></col><col></col><col></col></colgroup><tbody><tr style="height: 0pt;"><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; 
overflow: hidden; padding: 5pt; vertical-align: top;"><br /></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">Min</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">50%</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">90%</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">99%</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">99.99%</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 
0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">Max</span></p></td></tr><tr style="height: 0pt;"><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">Babl</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">29.3</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">56.3</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">81.7</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">144.8</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; 
border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">213.9</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">252.2</span></p></td></tr><tr style="height: 0pt;"><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">Vert.x</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">33.3</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">70.0</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; 
font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">106.8</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">148.0</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">422.4</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">2,013.2</span></p></td></tr></tbody></table></div><br /><h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;"><span style="-webkit-text-decoration-skip: none; background-color: transparent; color: #434343; font-family: Arial; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration-skip-ink: none; text-decoration: underline; vertical-align: baseline; white-space: pre;">100,000 messages per second, 1,000 sessions</span></h3><br /><br /><div align="left" dir="ltr" style="margin-left: 0pt;"><table style="border-collapse: collapse; border: medium none; table-layout: fixed; width: 468pt;"><colgroup><col></col><col></col><col></col><col></col><col></col><col></col><col></col></colgroup><tbody><tr style="height: 0pt;"><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><br /></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 
11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">Min</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">50%</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">90%</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">99%</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">99.99%</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre;">Max</span></p></td></tr><tr style="height: 0pt;"><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: 
break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">Babl</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">25.2</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">55.3</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">81.5</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">131.8</span></p></td><td style="border-bottom: solid #000000 1pt; border-color: rgb(0, 0, 0); border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-style: solid; border-top: solid #000000 1pt; border-width: 1pt; overflow-wrap: break-word; overflow: hidden; padding: 5pt; vertical-align: top;"><p dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt; text-align: right;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">221.1</span></p></td><td style="border-bottom: 
[Benchmark results table, truncated in this feed: the preceding row ends with the value 393.2, and the final row reads Vert.x: 34.2, 73.8, 95.9, 132.7, 422.4, 10,002.4.]<br /><br /><h2 dir="ltr">Getting Babl</h2>Babl's <a href="https://github.com/babl-ws/babl">source-code</a> is available on Github, and releases are published to <a href="https://mvnrepository.com/artifact/com.aitusoftware/babl">Maven Central</a> and <a href="https://hub.docker.com/r/aitusoftware/babl">Docker Hub</a>.<br /><br />The benchmarking harness used to compare relative performance is also on Github, at <a href="https://github.com/babl-ws/ws-harness">https://github.com/babl-ws/ws-harness</a>.<br /><h2 dir="ltr">Thanks</h2>Special thanks to Adaptive for providing access to their performance-testing lab for running the benchmarks.<br /><h2 dir="ltr">Links</h2><ol><li><a href="https://weareadaptive.com">https://weareadaptive.com</a></li><li><a href="https://github.com/real-logic/aeron">https://github.com/real-logic/aeron</a></li><li><a href="https://github.com/real-logic/artio">https://github.com/real-logic/artio</a></li><li><a href="https://vertx.io/">https://vertx.io/</a></li></ol><p><br /></p><p>Enjoyed this post? You might be interested in subscribing to the <a href="https://foursteps.software" target="_blank">Four Steps Newsletter</a>, a monthly newsletter covering techniques for writing faster, more performant software.<br />
</p>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-81914978465939784062019-03-21T07:50:00.001+00:002020-09-14T19:29:16.577+01:00Performance Tuning E-Book<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">I've condensed many of the posts on this blog into a series of e-books, available for free download.<br />
<br />
The first, on performance tuning for low-latency applications, is available here:<br />
<br />
<a href="https://s3-eu-west-1.amazonaws.com/aitusoftware-doc-public/downloads/PerformanceTuningHandbookShare.pdf">https://s3-eu-west-1.amazonaws.com/aitusoftware-doc-public/downloads/PerformanceTuningHandbookShare.pdf</a><br />
<br />
<br /></div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-51099648576000207412019-02-19T18:19:00.000+00:002019-02-22T17:58:12.684+00:00Recall Design<div dir="ltr" style="text-align: left;" trbidi="on">
This article discusses the design of <a href="https://github.com/aitusoftware/recall" target="_blank">Recall</a>, an efficient, low-latency off-heap object store for the JVM. <br />
<br />
Recall is designed for use by single-threaded applications, making it simple and performant.<br />
<br />
In the multi-core era, it may seem odd that high-performance software is designed for use<br />
by a single thread, but there is a good reason for this. <br />
<br />
Developers of low-latency systems will be aware of the Disruptor pattern,<br />
which showed that applying the Single Writer Principle can lead to extreme performance gains.<br />
<br />
Recall is also allocation-free (in steady state), meaning that it re-uses the same regions of memory, which are therefore more likely to be resident in CPU caches. Recall will never cause unwanted garbage-collection pauses, since it will not exhaust any memory region.<br />
<br />
<h4 style="text-align: left;">
Allocations in Recall</h4>
<br />
While allocation-free in steady state, Recall will allocate new buffers as data is recorded <br />
into a <span style="font-family: "courier new" , "courier" , monospace;">Store</span>. For the most part, these should be direct <span style="font-family: "courier new" , "courier" , monospace;">ByteBuffer</span>s, which will not<br />
greatly impact the JVM heap. Users can pre-empt these allocations by correctly sizing<br />
a <span style="font-family: "courier new" , "courier" , monospace;">Store</span> or <span style="font-family: "courier new" , "courier" , monospace;">SequenceMap</span> before use, to ensure that necessary storage is available at<br />
system start time.<br />
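<br />
As a rough illustration of the sizing arithmetic (a sketch only: how a <span style="font-family: "courier new" , "courier" , monospace;">Store</span> actually acquires its backing memory is not shown in this post, so <code>ByteBuffer.allocateDirect</code> stands in for it here), fixed-length entries make the required capacity a simple product:<br />
<pre><code>import java.nio.ByteBuffer;

// Sketch: with fixed-length entries, the storage needed for a full
// population is simply maxRecords * recordSize. Allocating it once,
// off-heap, at start-up leaves nothing to allocate in steady state.
public final class StoreSizing {
    public static ByteBuffer preSizedBuffer(final int maxRecords, final int recordSizeBytes) {
        return ByteBuffer.allocateDirect(maxRecords * recordSizeBytes);
    }

    public static void main(final String[] args) {
        // e.g. room for one million 64-byte records, allocated up-front
        final ByteBuffer storage = preSizedBuffer(1_000_000, 64);
        System.out.println("capacity: " + storage.capacity());
    }
}</code></pre>
<br />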
<br />
<h3 style="text-align: left;">
Benefits of single-threaded design</h3>
<div style="text-align: left;">
<br />
Since the data structures in Recall are never accessed by multiple threads, there is no need<br />
for locking, so there is no contention. Resource contention is one of the main things that<br />
can limit throughput in multi-threaded programs, so side-stepping the problem altogether<br />
leads to significant gains over solutions designed for multi-threaded applications.<br />
<br />
Highly performant multi-threaded map implementations will use <i>compare-and-swap</i> (CAS)<br />
operations to avoid the cost of standard locks, and so remain lock-free. While this typically<br />
leads to better scalability than a map utilising locks, those CAS operations<br />
are still relatively expensive compared to the normal arithmetic operations that<br />
are used in single-threaded designs.</div>
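<div style="text-align: left;">
<br />
To make the cost difference concrete, here is a minimal sketch (not taken from the Recall codebase) contrasting a lock-free CAS increment with the plain increment available under a single-writer design:<br />
<pre><code>import java.util.concurrent.atomic.AtomicLong;

public final class CasVersusPlain {
    private final AtomicLong sharedCounter = new AtomicLong();
    private long localCounter;

    // Lock-free increment, as used by concurrent structures: under
    // contention the CAS fails and the loop retries, burning cycles
    // and memory bandwidth.
    void incrementShared() {
        long current;
        do {
            current = sharedCounter.get();
        } while (!sharedCounter.compareAndSet(current, current + 1));
    }

    // Single-writer increment: a plain load, add and store, with no
    // atomic instruction and no possibility of a retry.
    void incrementLocal() {
        localCounter++;
    }
}</code></pre>
</div>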
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
Memory layout</h3>
<div style="text-align: left;">
<br />
Recall's <span style="font-family: "courier new" , "courier" , monospace;">Store</span> object is an open-addressed hash map, storing an arbitrary number of bytes against<br />
a 64-bit <span style="font-family: "courier new" , "courier" , monospace;">long</span> key. The developer is responsible for providing functions to<br />
serialise and deserialise objects to and from the <span style="font-family: "courier new" , "courier" , monospace;">Store</span>'s buffer. Each entry is of a fixed<br />
length, eliminating the chance of compaction problems.<br />
<br />
The mapping of <span style="font-family: "courier new" , "courier" , monospace;">long</span> to byte sequence is performed by a simple open-addressed hash map (from the <a href="https://github.com/real-logic/agrona" target="_blank">Agrona</a> library).<br />
Each entry in the map records the <span style="font-family: "courier new" , "courier" , monospace;">long</span> identifier against a<br />
position in the <span style="font-family: "courier new" , "courier" , monospace;">Store</span>'s buffer. This makes the "index" for the data structure incredibly<br />
compact (~16 bytes per entry), and has the nice side-effect of making inserts, updates and removals<br />
occur in effectively constant time.<br />
<br />
Due to the use of a separate index, Recall's <span style="font-family: "courier new" , "courier" , monospace;">Store</span> does not suffer from compaction problems,<br />
and is always 100% efficient in terms of storage space used.</div>
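<div style="text-align: left;">
<br />
This is not Recall's actual code, but a minimal sketch of the kind of compact, open-addressed long-to-offset index described above (capacity is assumed to be a power of two; removal is omitted for brevity). Two parallel <code>long[]</code> arrays account for the ~16 bytes per entry:<br />
<pre><code>import java.util.Arrays;

// Illustrative sketch only: a fixed-capacity, open-addressed map from a
// long identifier to an offset in a backing buffer.
public final class LongToOffsetIndex {
    private static final long MISSING = Long.MIN_VALUE;
    private final long[] keys;
    private final long[] offsets;
    private final int mask;

    public LongToOffsetIndex(final int capacity) {   // capacity must be a power of two
        keys = new long[capacity];
        offsets = new long[capacity];
        Arrays.fill(keys, MISSING);
        mask = capacity - 1;
    }

    public void put(final long id, final long offset) {
        int slot = slotFor(id);
        while (keys[slot] != MISSING && keys[slot] != id) {
            slot = (slot + 1) & mask;   // linear probing on collision
        }
        keys[slot] = id;
        offsets[slot] = offset;
    }

    public long get(final long id) {
        int slot = slotFor(id);
        while (keys[slot] != MISSING) {
            if (keys[slot] == id) {
                return offsets[slot];
            }
            slot = (slot + 1) & mask;
        }
        return MISSING;
    }

    private int slotFor(final long id) {
        return (int) ((id * 0x9E3779B97F4A7C15L) & mask);   // cheap spread of key bits
    }
}</code></pre>
</div>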
<h4 style="text-align: left;">
</h4>
<h4 style="text-align: left;">
Inserting a new record</h4>
<div style="text-align: left;">
<br />
Record insertion involves adding a new key to the "index", serialising the entry,<br />
and increasing the write pointer by the record size.<br />
<br />
Note that if insertion causes the map to exceed its load factor, then a re-hash<br />
will occur, causing each entry to be copied into a new, larger buffer.<br />
For this reason it is recommended to correctly size the <span style="font-family: "courier new" , "courier" , monospace;">Store</span> when it is<br />
constructed.</div>
<h4 style="text-align: left;">
</h4>
<h4 style="text-align: left;">
Deleting a record</h4>
<div style="text-align: left;">
<br />
Removing a record from the map involves copying the highest entry in the buffer<br />
over the top of the entry to be removed, updating the "index", and decreasing<br />
the write pointer by the record size.</div>
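<div style="text-align: left;">
<br />
A sketch of this swap-remove scheme over a flat buffer of fixed-size records (again illustrative only, not Recall's source):<br />
<pre><code>import java.nio.ByteBuffer;

// Illustrative sketch: the last record is copied over the victim, so the
// occupied region stays contiguous and no compaction pass is ever needed.
public final class FixedRecordBuffer {
    private final ByteBuffer buffer;
    private final int recordSize;
    private int writePointer;   // offset one past the last record

    public FixedRecordBuffer(final ByteBuffer buffer, final int recordSize) {
        this.buffer = buffer;
        this.recordSize = recordSize;
    }

    // Reserve space for a new record; the caller serialises into
    // [offset, offset + recordSize) and stores the offset in the index.
    public int insert() {
        final int offset = writePointer;
        writePointer += recordSize;
        return offset;
    }

    // Remove the record at the given offset. Returns the old offset of the
    // record that was relocated, so the caller can update its index entry.
    public int remove(final int victimOffset) {
        final int lastOffset = writePointer - recordSize;
        if (victimOffset != lastOffset) {
            for (int i = 0; i < recordSize; i++) {
                buffer.put(victimOffset + i, buffer.get(lastOffset + i));
            }
        }
        writePointer = lastOffset;
        return lastOffset;
    }
}</code></pre>
</div>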
<h2 style="text-align: left;">
</h2>
<h2 style="text-align: left;">
Example usage</h2>
<div style="text-align: left;">
<br />
Consider the following highly contrived example: <br />
<br />
A trading exchange has the requirement to publish possible profit reports to holders <br />
of open positions as market prices fluctuate. When a user creates an open position,<br />
a distribution gateway will need to cache the position, and on every price update<br />
received from the exchange, publish some information indicating the possible profit <br />
associated with the position.<br />
<br />
In order to meet the exchange's strict latency requirements, the gateway must be allocation-free,<br />
and is written according to the single-writer principle.</div>
<h4 style="text-align: left;">
</h4>
<h4 style="text-align: left;">
Messages</h4>
<div style="text-align: left;">
<br />
Orders are represented by the <span style="font-family: "courier new" , "courier" , monospace;">Order</span> class:<br />
<br />
<br />
<pre><code>public final class Order {
    private long accountId;
    private long orderId;
    private int instrumentId;
    private double quantity;
    private double price;

    // getters and setters omitted
}</code></pre><br />
<br />
New Orders are received on the <span style="font-family: "courier new" , "courier" , monospace;">OrderEvents</span> interface:<br />
<br />
<br />
<pre><code>public interface OrderEvents {
    void onOrderPlaced(Order orderFlyweight);
}</code></pre><br />
<br />
<br />
Market data is received on the <span style="font-family: "courier new" , "courier" , monospace;">OrderBook</span> interface:</div>
<div style="text-align: left;">
<br />
<pre><code>public interface OrderBook {
    void onPriceUpdate(int instrumentId, double bid, double ask, long timestamp);
}</code></pre><br />
<br />
Profit updates are published on the <span style="font-family: "courier new" , "courier" , monospace;">ProfitEvents</span> interface:<br />
<br />
<br />
<pre><code>public interface ProfitEvents {
    void onProfitUpdate(long orderId, long accountId, double buyProfit, double sellProfit, long timestamp);
}</code></pre></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<br /></div>
<h4 style="text-align: left;">
Implementation</h4>
<div style="text-align: left;">
<br />
<br />
<pre><code>public final class ProfitPublisher implements OrderBook, OrderEvents {
    private final SingleTypeStore<ByteBuffer, Order> store = createStore();
    private final IdAccessor<Order> idAccessor = new OrderIdAccessor();
    private final Encoder<Order> encoder = new OrderEncoder();
    private final Decoder<Order> decoder = new OrderDecoder();
    private final Int2ObjectMap<LongHashSet> instrumentToOrderSetMap = createMap();
    private final Order flyweight = new Order();
    private final ProfitEvents profitEvents; // initialised in constructor

    public void onOrderPlaced(Order orderFlyweight) {
        store.store(orderFlyweight, encoder, idAccessor);
        instrumentToOrderSetMap.get(orderFlyweight.instrumentId()).add(orderFlyweight.id());
    }

    public void onPriceUpdate(int instrumentId, double bid, double ask, long timestamp) {
        for (long orderId : instrumentToOrderSetMap.get(instrumentId)) {
            store.load(flyweight, decoder, orderId);
            double buyProfit = flyweight.quantity() * (flyweight.price() - ask);
            double sellProfit = flyweight.quantity() * (flyweight.price() - bid);
            profitEvents.onProfitUpdate(orderId, flyweight.accountId(), buyProfit, sellProfit, timestamp);
        }
    }
}</code></pre><br />
<br />
In this example, the incoming <span style="font-family: "courier new" , "courier" , monospace;">Order</span>s are serialised to a <span style="font-family: "courier new" , "courier" , monospace;">Store</span>. When a price update is received,<br />
the system iterates over any <span style="font-family: "courier new" , "courier" , monospace;">Order</span>s associated with the specified instrument, and publishes <br />
a profit update for the <span style="font-family: "courier new" , "courier" , monospace;">Order</span>. <br />
<br />
Operation is allocation-free, meaning that the system will run without any garbage-collection pauses<br />
that could cause unwanted latency spikes. There is no need for object-pooling, as domain objects are<br />
serialised to the <span style="font-family: "courier new" , "courier" , monospace;">Store</span>'s buffer until required.</div>
</div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-6424636284051145442017-09-18T20:19:00.001+01:002020-09-14T19:27:47.879+01:00Heap Allocation Flamegraphs<p><span></span></p><a name='more'></a><p></p><p> </p><p><span style="font-size: large;">Hi there, and welcome. This content is still relevant, but fairly old. If you are interested in keeping up-to-date with similar articles on profiling, performance testing, and writing performant code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span><br /></p><p> <span></span></p><!--more--><p></p><p>The most recent addition to the <a href="https://github.com/epickrram/grav">grav</a> collection of performance visualisation tools is a utility for tracking heap allocations on the JVM.</p>
<p>This is another Flamegraph-based visualisation that can be used to determine hotspots of garbage creation in a running program.</p>
<h2 id="usage-and-mode-of-operation">Usage and mode of operation</h2>
<p>Detailed instructions on installation and pre-requisites can be found in the <a href="https://github.com/epickrram/grav#jvm-heap-allocation-flamegraphs">grav repository</a> on github.</p>
<p>Heap allocation flamegraphs use the built-in user statically-defined tracepoints (USDTs), which have been added to recent versions of OpenJDK and Oracle JDK.</p>
<p>To enable the probes, the following command-line flags are required:</p>
<p><code>-XX:+DTraceAllocProbes -XX:+ExtendedDTraceProbes</code></p>
<p>Once the JVM is running, the <a href="https://github.com/epickrram/grav/blob/master/src/heap/heap_profile.py">heap-alloc-flames script</a> can be used to generate a heap allocation flamegraph:</p>
<pre><code>$ ./bin/heap-alloc-flames -p $PID -e "java/lang/String" -d 10
...
Wrote allocation-flamegraph-$PID.svg</code></pre>
<p>BE WARNED: this is a fairly heavyweight profiling method - on each allocation, the entire stack is walked and hashed in order to increment a counter. The JVM will also use a slow-path for allocations when extended DTrace probes are enabled.</p>
<p>It is possible to reduce the overhead by recording only one in every <code>N</code> allocations, using the '<code>-s</code>' parameter (<a href="https://github.com/epickrram/grav#jvm-heap-allocation-flamegraphs">see the documentation for more info</a>).</p>
<p>For a more lightweight method of heap-profiling, see the excellent <a href="https://github.com/jvm-profiling-tools/async-profiler#heap-profiling">async-profiler</a>, which uses a callback on the first TLAB allocation to perform its sampling.</p>
<p>When developing low-garbage or garbage-free applications, it is useful to be able to instrument <i>every</i> allocation, at least within a performance-test environment. This tool could even be used to regression-test allocation rates for low-latency applications, to ensure that changes to the codebase are not increasing allocations.</p>
<h3 id="ebpf-usdt">eBPF + USDT</h3>
<p>The allocation profiler works by attaching a <a href="https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt">uprobe</a> to the <a href="http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/gc_interface/collectedHeap.inline.hpp#l80">dtrace_object_alloc</a> function provided by the JVM.</p>
<p>When the profiler is running, we can confirm that the tracepoint is in place by looking at <code>/sys/kernel/debug/tracing/uprobe_events</code>:</p>
<pre><code>$ cat /sys/kernel/debug/tracing/uprobe_events
p:uprobes/
p__usr_lib_jvm_java_8_openjdk_amd64_jre_lib_amd64_server_libjvm_so_0x967fdf_bcc_7043
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so:0x0000000000967fdf
p:uprobes/
p__usr_lib_jvm_java_8_openjdk_amd64_jre_lib_amd64_server_libjvm_so_0x96806f_bcc_7043
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so:0x000000000096806f</code></pre>
<p>Given that we know the type signature of the <code>dtrace_object_alloc</code> method, it is a simple matter to extract the class-name of the object that has just been allocated.</p>
<p>As the profiler is running, it is recording a count against a compound key of <i>java class-name</i> and <i>stack-trace id</i>. At the end of the sampling period, the count is used to 'inflate' the occurrences of a given stack-trace, and these stack-traces are then piped through the usual flamegraph machinery.</p>
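<p>As a sketch of that post-processing step (the input format assumed here is one <code>stack count</code> pair per line, which is also the "collapsed" format that flamegraph.pl consumes), merging duplicate stacks and emitting them with their inflated counts is a single pass:</p>
<pre><code>import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: aggregate "className;frame1;frame2;... count" lines by summing
// the counts of identical stacks, then emit collapsed stacks whose trailing
// number is the sample weight used by the flamegraph machinery.
public final class StackAggregator {
    public static void main(final String[] args) throws IOException {
        final Map<String, Long> countByStack = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                final int lastSpace = line.lastIndexOf(' ');
                final String stack = line.substring(0, lastSpace);
                final long count = Long.parseLong(line.substring(lastSpace + 1));
                countByStack.merge(stack, count, Long::sum);
            }
        }
        countByStack.forEach((stack, count) -> System.out.println(stack + " " + count));
    }
}</code></pre>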
<h2 id="controlling-output">Controlling output</h2>
<div class="figure">
<img alt="Allocation flamegraph" src="https://github.com/epickrram/blog-images/raw/master/2017_09/heap_alloc_flamegraph.png" />
</div>
<p>Stack frames can be included or excluded from the generated Flamegraph by using regular expressions that are passed to the python program.</p>
<p>For example, to exclude all allocations of <code>java.lang.String</code> and <code>java.lang.Object[]</code>, add the following parameters:</p>
<pre><code>-e "java/lang/String" "java.lang.Object\[\]"</code></pre>
<p>To include only allocations of <code>java.util.ArrayList</code>, add the following:</p>
<pre><code>-i "java/util/ArrayList"</code></pre>
<h2 id="inspiration-thanks">Inspiration & Thanks</h2>
<p>Thanks to Amir Langer for collaborating on this profiler.</p>
<p>For more information on USDT probes in the JVM, see Sasha's <a href="http://blogs.microsoft.co.il/sasha/2016/03/31/probing-the-jvm-with-bpfbcc/">blog</a> <a href="http://blogs.microsoft.co.il/sasha/2017/07/07/profiling-the-jvm-on-linux-a-hybrid-approach/">posts</a>.</p>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-85407292124574283932017-05-30T19:33:00.003+01:002020-09-14T19:28:29.388+01:00Performance visualisation tools<div dir="ltr" style="text-align: left;" trbidi="on">
<p style="text-align: left;"><span></span></p><a name='more'></a><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span><p></p><p style="text-align: left;"><span style="font-size: large;"><span></span></span></p><!--more--><span style="font-size: large;"> </span> <br /><p></p><h2 style="text-align: left;">Update</h2>
<br />
<a href="https://github.com/langera">Amir</a> has very kindly contributed a <a href="https://github.com/epickrram/grav#vagrant-grav">Vagrant box configuration</a> to enable non-Linux users to work with the tools contained in the grav repository.<br />
<br />
Thanks Amir!<br />
<br />
<br />
<br />
<br />
In my last post, I looked at annotating Flamegraphs with contextual information in order to filter on an interesting subset of the data. One of the things that stuck with me was the idea of using SVGs to render data generated on a server running in headless mode.<br />
Traditionally, I have recorded profiles and traces on remote servers, then pulled the data back to my workstation to filter, aggregate and plot. The scripts used to do this data-wrangling tend to be one-shot affairs, and I've probably lost many useful utilities over the years. I am increasingly coming around to the idea of building the rendering into the server-side script, as it forces me to think about how I want to interpret the data, and also gives the ability to deploy and serve such monitoring from a whole fleet of servers.<br />
Partly to address this, and partly because experimentation is fun, I've been working on some new visualisations of the same ilk as Flamegraphs.<br />
These tools are available in the <a href="https://github.com/epickrram/grav">grav</a> repository.<br />
<h3 id="scheduler-profile">
</h3>
<h3 id="scheduler-profile">
Scheduler Profile</h3>
The scheduler profile tool can be used to indicate whether your application's threads are getting enough time on CPU, or whether there is resource contention at play.<br />
In an ideal scenario, your application threads will only ever yield the CPU to one of the kernel's helper threads, such as <code>ksoftirqd</code>, and then only sparingly. Running the <code>scheduling-profile</code> script will record scheduler events in the kernel and will determine the state of your application threads at the point they are pre-empted and replaced with another process.<br />
Threads will tend to be in one of three states when pre-empted by the scheduler:<br />
<ul class="incremental">
<li>Runnable - happily doing work, moved off the CPU due to scheduler quantum expiry (we want to minimise this)</li>
<li>Blocked on I/O - otherwise known as 'uninterruptible sleep'; still an interesting signal, but expected for threads handling I/O</li>
<li>Sleeping - voluntarily yielding the CPU, perhaps due to waiting on a lock</li>
</ul>
There are a <a href="http://lxr.free-electrons.com/source/include/linux/sched.h?v=4.4#L207">number of other states</a>, which will not be covered here.<br />
Once the profile is collected, these states are rendered as a bar chart for each thread in your application. The examples here are from a JVM-based service, but the approach will work just as well for other runtimes, albeit without the mapping of <code>pid</code> to thread name.<br />
The bar chart will show the proportion of states encountered per-thread as the OS scheduler swapped out the application thread. The <code>sleeping</code> state is marked green (the CPU was intentionally yielded), the <code>runnable</code> state is marked red (the program thread was pre-empted while it still had useful work to do).<br />
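One way to get a feel for these numbers without the tool (a sketch only; the field layout of <code>sched:sched_switch</code> events as printed by <code>perf script</code> varies between kernel versions, so the pattern below is an assumption) is to tally the <code>prev_state</code> field per thread:<br />
<pre><code>import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: tally prev_state per thread from sched:sched_switch events as
// printed by 'perf script' (field layout assumed; adjust to your kernel).
public final class SchedulerStateCounter {
    private static final Pattern SWITCH_OUT =
            Pattern.compile("prev_comm=(\\S+).*prev_state=(\\S+)");

    public static void main(final String[] args) throws IOException {
        final Map<String, Map<String, Long>> statesByThread = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                final Matcher matcher = SWITCH_OUT.matcher(line);
                if (matcher.find()) {
                    statesByThread
                            .computeIfAbsent(matcher.group(1), k -> new HashMap<>())
                            .merge(matcher.group(2), 1L, Long::sum);
                }
            }
        }
        // R = runnable (pre-empted with work to do), S = sleeping, D = blocked on I/O
        statesByThread.forEach((thread, states) -> System.out.println(thread + " " + states));
    }
}</code></pre>
<br />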
Let's take a look at an initial example running on my 4-core laptop:<br />
<br />
<br />
<figure>
<img alt="Scheduling profile" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2017_05/scheduler-profile-fragment.png" />
</figure>
<br />
<a href="https://github.com/epickrram/blog-images/blob/master/2017_05/scheduler-profile-initial.svg">original</a><br />
This profile is taken from a simple drop-wizard application, the threads actually processing inbound requests are prefixed with <code>'dw-'</code>. We can see that these request processing threads were ready to yield the CPU (i.e. entering sleep state) about 30% of the time, but they were mostly attempting to do useful work when they were moved off the CPU. This is a hint that the application is resource constrained in terms of CPU.<br />
This effect is magnified due to the fact that I'm running a desktop OS, the application, and a load-generator all on the same laptop, but these effects will still be present on a larger system.<br />
This can be a useful signal that these threads would benefit from their own dedicated pool of CPUs. Further work is needed to annotate the chart with those processes that were switched <i>in</i> by the scheduler - i.e. the processes that are contending with the application for CPU resource.<br />
Using a combination of kernel tuning and thread-pinning, it should be possible to ensure that the application threads are only very rarely pre-empted by essential kernel threads. More details on how to go about achieving this can be found in <a href="http://epickrram.blogspot.co.uk/2015/09/reducing-system-jitter.html">previous</a> <a href="http://epickrram.blogspot.co.uk/2015/11/reducing-system-jitter-part-2.html">posts</a>.<br />
<h3 id="cpu-tenancy">
</h3>
<h3 id="cpu-tenancy">
CPU Tenancy</h3>
One of the operating system's responsibilities is to allocate resources to processes that require CPU. In modern multi-core systems, the scheduler must move runnable threads to otherwise idle CPUs to try to maximise system resource usage.<br />
A good example of this is network packet handling. When a network packet is received, it is (by default) processed by the CPU that handles the network card's interrupt. The kernel may then decide to migrate any task that is waiting for data to arrive (e.g. a thread blocked on a socket read) to the receiving CPU, since the packet data is more likely to be available in the CPU's cache.<br />
While we can generally rely on the OS to do a good job of this for us, we may wish to force this cache-locality by having a network-handling thread on an adjacent CPU to the interrupt handler. Such a set-up would mean that the network-handling thread would always be close to the data, without the overhead and jitter introduced by actually running on a CPU responsible for handling interrupts.<br />
This is a common configuration in low-latency applications in the finance industry.<br />
The <code>perf-cpu-tenancy</code> script can be used to build a picture showing how the scheduler allocates CPU to your application threads. In the example below, the threads named <code>dw-</code> are the message-processing threads, and it is clear that they are mostly executed on CPU 2. This correlates with the network card setup on the machine running the application - the IRQ of the network card is associated with CPU 2.<br />
<br />
<br />
<figure>
<img alt="CPU tenancy by thread" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2017_05/thread_irq_locality.png" />
</figure>
<br />
<a href="https://github.com/epickrram/blog-images/blob/master/2017_05/cpu-tenancy-17676.svg">original</a><br />
<h3 id="further-work">
</h3>
<h3 id="further-work">
Further work</h3>
To make the <code>scheduling-profile</code> tool more useful, I intend to annotate the <code>runnable</code> state portion of the bar chart with a further breakdown detailing the incoming processes that kicked application threads off-CPU.<br />
This will provide enough information to direct system-tuning efforts to ensure an application has the best chance possible to get CPU-time when required.<br />
If you've read this far, perhaps you're interested in <a href="https://github.com/epickrram/grav">contributing</a>?<br />
</div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-59131920297958468612017-03-27T22:16:00.003+01:002020-09-14T19:28:50.514+01:00Named thread flamegraphs<div dir="ltr" style="text-align: left;" trbidi="on">
<p style="text-align: left;"><span></span></p><a name='more'></a><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span><p></p><p style="text-align: left;"><span style="font-size: large;"><span></span></span></p><!--more--><span style="font-size: large;"> </span> <br /><p></p><h2 style="text-align: left;">Update</h2>
<br />
The utility described in this post has been moved to the <a href="https://github.com/epickrram/grav">grav</a> repository, see <a href="https://epickrram.blogspot.co.uk/2017/05/performance-visualisation-tools.html">"Performance visualisation tools"</a> for more details.<br />
<br />
<br />
After watching a great talk by Nitsan Wakart at this year's QCon London, I started playing around with flamegraphs a little more.<br />
To get the gist of Nitsan's talk, you can read his blog post <a href="http://psy-lob-saw.blogspot.co.uk/2017/02/flamegraphs-intro-fire-for-everyone.html">Java Flame Graphs Introduction: Fire For Everyone!</a>.<br />
The important thing to take away is that the collapsed stack files that are processed by Brendan Gregg's <a href="https://github.com/brendangregg/FlameGraph">FlameGraph scripts</a> are just text files, and so can be filtered, hacked, and modified using your favourite methods for such activities (your author is a fan of <code>grep</code> and <code>awk</code>).<br />
The examples in this post are generated using a fork of the excellent <a href="https://github.com/jvm-profiling-tools/perf-map-agent">perf-map-agent</a>. I've added a couple of scripts to make these examples easier.<br />
To follow the examples, clone <a href="https://github.com/epickrram/perf-map-agent">this repository</a>.<br />
<h2 id="aggregate-view">
Aggregate view</h2>
One feature that was demonstrated in Nitsan's talk was the ability to collapse the stacks by Java thread. Usually, flamegraphs show all samples aggregated into one view. With thread-level detail though, we can start to see where different threads are spending their time.<br />
This can be useful information when exploring the performance of systems that are unfamiliar.<br />
Let's see what difference this can make to an initial analysis. For this example, we're going to look at a very simple microservice built on dropwizard. The service does virtually nothing except echo a response, so we wouldn't expect the business logic to show up in the trace. Here we are primarily interested in looking at what the framework is doing during request processing.<br />
We'll take an initial recording without a thread breakdown:<br />
<pre><code>$ source ./etc/options-without-threads.sh
$ ./bin/perf-java-flames 7731</code></pre>
You can view the rendered svg file here: <a href="https://gist.github.com/epickrram/e3956b86e2a3984b49986ce49a8cf7d0">flamegraph-aggregate-stacks.svg</a>.<br />
<br />
<br />
<figure>
<img alt="Aggregated stacks" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2017_03/flamegraph-aggregate-stacks.png" /><figcaption>Aggregated stacks</figcaption>
</figure>
<br />
From the aggregated view we can see that most of the time is spent in framework code (jetty, jersey, etc), a smaller proportion is in log-handling (logback), and the rest is spent on garbage collection.<br />
<h2 id="thread-breakdown">
Thread breakdown</h2>
Making the same recording, but this time with the stacks assigned to their respective threads, we see much more detail.<br />
<pre><code>$ source ./etc/options-with-threads.sh
$ ./bin/perf-java-flames 7602</code></pre>
In <a href="https://gist.github.com/epickrram/99f4dda169cbb2540300c90393f79d26">flamegraph-thread-stacks.svg</a> we can immediately see that we have five threads doing most of the work of handling the HTTP requests; they all have very similar profiles, so we can reason that these are probably threads belonging to a generic request-handler pool.<br />
<br />
<br />
<figure>
<img alt="Thread stacks" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2017_03/flamegraph-thread-stacks.png" /><figcaption>Thread stacks</figcaption>
</figure>
<br />
We can also see another thread that spends most of its time writing log messages to disk. From this, we can reason that the logging framework has a single thread for draining log messages from a queue - something that was more difficult to see in the aggregated view.<br />
Now, we have made some assumptions in the statements above, so is there anything we can do to validate them?<br />
<h2 id="annotating-thread-names">
Annotating thread names</h2>
With the addition of a simple script to replace thread IDs with thread names, we can gain a better understanding of thread responsibilities within the application.<br />
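A minimal sketch of such a script is shown below. Both input formats are assumptions: <code>jstack</code> headers carrying the native thread id in their <code>nid=</code> field, and collapsed stacks whose first token is <code>java-</code> followed by the thread id:<br />
<pre><code>import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: rewrite "java-&lt;tid&gt;" tokens at the start of collapsed stacks
// to thread names harvested from jstack output (formats assumed).
public final class ThreadNameAnnotator {
    private static final Pattern JSTACK_HEADER = Pattern.compile("^\"([^\"]+)\".* nid=0x([0-9a-f]+) ");
    private static final Pattern STACK_PREFIX = Pattern.compile("^java-(\\d+);");

    public static void main(final String[] args) throws IOException {
        final Map<Long, String> nameByTid = new HashMap<>();
        for (final String line : Files.readAllLines(Paths.get(args[0]))) {   // jstack output
            final Matcher m = JSTACK_HEADER.matcher(line);
            if (m.find()) {
                nameByTid.put(Long.parseLong(m.group(2), 16), m.group(1).replace(';', '_'));
            }
        }
        for (final String line : Files.readAllLines(Paths.get(args[1]))) {   // collapsed stacks
            final Matcher m = STACK_PREFIX.matcher(line);
            String output = line;
            if (m.find()) {
                final String name = nameByTid.get(Long.parseLong(m.group(1)));
                if (name != null) {
                    output = m.replaceFirst(Matcher.quoteReplacement(name) + ";");
                }
            }
            System.out.println(output);
        }
    }
}</code></pre>
<br />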
Let's capture another trace:<br />
<pre><code>$ source ./etc/options-with-threads.sh
$ ./bin/perf-thread-flames 8513</code></pre>
Although <a href="https://gist.github.com/epickrram/39adabacecf2cf57dff2868e2e4b555c">flamegraph-named-thread-stacks.svg</a> looks identical when zoomed-out, it contains one more very useful piece of context.<br />
<br />
<br />
<figure>
<img alt="Named stacks" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2017_03/flamegraph-named-thread-stacks.png" /><figcaption>Named stacks</figcaption>
</figure>
<br />
Rolling the mouse over the base of the image shows that the five similar stacks are all from threads named "dw-XX", giving a little more evidence that these are dropwizard handler threads.<br />
There are a couple of narrower stacks named "dw-XX-acceptor"; these are the threads that manage incoming connections before they are handed off to the request processing thread pool.<br />
Further along is a "metrics-csv-reporter" thread, whose responsibility is to write out performance metrics at regular intervals.<br />
The logging framework thread is now more obvious when we can see its name is "AsyncAppender-Worker-async-console-appender".<br />
With nothing more than an external script, we can now infer the following about our application:<br />
<ul>
<li>this application has a request-handling thread-pool</li>
<li>fronted by an acceptor thread</li>
<li>logging is performed asynchronously</li>
<li>metrics reporting is enabled</li>
</ul>
This kind of overview of system architecture would be much harder to piece together by just reading the framework code.<br />
<h2 id="filtering">
Filtering</h2>
Now that we have this extra context in place, it is a simple matter to filter the flamegraphs down to a finer focus.<br />
The <code>perf-thread-grep</code> script operates on the result of a previous call to <code>perf-java-flames</code> (as seen above).<br />
Suppose we just wanted to look at what the JIT compiler threads were doing?<br />
<pre><code>$ source ./etc/options-with-threads.sh
$ ./bin/perf-thread-grep 8513 "Compiler"</code></pre>
<a href="https://gist.github.com/epickrram/27e93a6da6aa6425084f0c10282633f2">flamegraph-compiler.svg</a><br />
<br />
<br />
<figure>
<img alt="Compiler threads" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2017_03/flamegraph-compiler-threads.png" /><figcaption>Compiler threads</figcaption>
</figure>
<br />
or to focus in on threads that called into any logging functions?<br />
<pre><code>$ source ./etc/options-with-threads.sh
$ ./bin/perf-thread-grep 8513 "logback"</code></pre>
<a href="https://gist.github.com/epickrram/38cb99b43aa7e20df67f50096cf15c4d">flamegraph-logback.svg</a><br />
<br />
<br />
<figure>
<img alt="Logging functions" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2017_03/flamegraph-logback.png" /><figcaption>Logging functions</figcaption>
</figure>
<br />
<h2 id="summary">
Summary</h2>
Annotating flamegraphs with java thread names can offer insight into how an application's processing resources are configured. We can use this extra context to easily zoom in on certain functionality.<br />
This technique is particularly powerful when exploring unfamiliar applications with large numbers of threads, whose function may not be immediately obvious.<br />
My learned correspondent Nitsan has suggested that I'm being lazy by using <code>jstack</code> to generate the thread-name mapping. His main complaint is that it causes a safe-point pause in the running application. To make these scripts a little more lightweight, I will explore retrieving the thread names via JVMTI or another low-level interface. But that's a blog post for another day.<br />
<br />
<br />
</div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-56900712974925866962016-08-12T16:39:00.002+01:002020-09-14T19:29:33.301+01:00Lightweight tracing with BCC<div dir="ltr" style="text-align: left;" trbidi="on">
<p><span></span></p><a name='more'></a><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span><p></p><p><span style="font-size: large;"><span></span></span></p><!--more--><span style="font-size: large;"> </span> <br /><p></p><p> </p><p>Ever since I read some initial blog posts about the upcoming eBPF tracing functionality in the 4.x Linux kernel, I have been looking for an excuse to get to grips with this technology.</p>
<p>With a planned kernel upgrade in progress at <a href="https://lmax.com">LMAX</a>, I now have access to an interesting environment and workload in order to play around with <a href="https://github.com/iovisor/bcc">BCC</a>.</p>
<h2 id="bpf-compiler-collection">BPF Compiler Collection</h2>
<p>BCC is a collection of tools that allows the curious to express programs in C or Lua, and then load those programs as optimised kernel modules, hooked in to the runtime via a number of different mechanisms.</p>
<p>At the time of writing, the <a href="https://github.com/iovisor/bcc#motivation">motivation</a> subsection of the BCC README explains this best:</p>
<pre><code>End-to-end BPF workflow in a shared library
A modified C language for BPF backends
Integration with llvm-bpf backend for JIT
Dynamic (un)loading of JITed programs
Support for BPF kernel hooks: socket filters, tc classifiers, tc actions, and kprobes</code></pre>
<h2 id="tracking-process-off-cpu-time">Tracking process off-cpu time</h2>
<p>One of the more in-depth investigations I've been involved in lately is tracking down a throughput problem on one of our services. We have a service that will sometimes, during a test run in our performance environment, fail to keep up with the message rate.</p>
<p>The service should be reading multicast traffic from the network as fast as possible. We dedicate a physical core to the network processing thread, and make sure that the Java thread doing this work is interrupted as little as possible.</p>
<p>Now, given that we run with the SLAB allocator and want useful <code>vmstat</code> updates, our user-land processes will sometimes get pre-empted by kernel threads. Looking in <code>/proc/sched_debug</code> shows the other processes that are potentially runnable on our dedicated core:</p>
<pre><code>runnable tasks:
task PID tree-key switches prio ...
---------------------------------------------------------
cpuhp/35 334 0.943766 14 120
migration/35 336 0.000000 26 0
ksoftirqd/35 337 -5.236392 6 120
kworker/35:0 338 139152554.767297 456 120
kworker/35:0H 339 -5.241396 12 100
kworker/35:1 530 139227632.577336 902 120
kworker/35:1H 7190 6.745434 3 100
R java 102825 1306133.326251 479 110
kworker/35:2 102845 139065252.390586 2 120 </code></pre>
<p>We know that our Java process will sometimes be kicked off the CPU by one of the <code>kworker</code> threads, so that it can do some house-keeping. In order to find out if there is a correlation between network traffic build-up and the java process being off-cpu, we have traditionally used <code>ftrace</code> and the built-in tracepoint <code>sched:sched_switch</code>.</p>
<h2 id="determining-off-cpu-time-with-ftrace">Determining off-cpu time with <code>ftrace</code></h2>
<p>The kernel scheduler <a href="http://lxr.free-electrons.com/source/kernel/sched/core.c#L3346">emits an event</a> to the tracing framework every time that a context switch occurs. The event arguments report the process that is being switched out, and the process that is about to be scheduled for execution.</p>
<p>The output from <code>ftrace</code> will show these data, along with a microsecond timestamp:</p>
<pre><code><idle>-0 [020] 6907585.873080: sched_switch: 0:120:R ==> 33233:110: java
<...>-33233 [020] 6907585.873083: sched_switch: 33233:110:S ==> 0:120: swapper/20
<idle>-0 [020] 6907585.873234: sched_switch: 0:120:R ==> 33233:110: java</code></pre>
<p>The excerpt above tells us that on CPU 20, the idle process (pid 0) was runnable (R) and was switched out in favour of a Java process (pid 33233). Three microseconds later, the Java process entered the sleep state (S), and was switched out in favour of the idle process. After another 150 microseconds or so, the Java process was again ready to run, so the idle process was switched out to make way.</p>
<p>Using the output from <code>ftrace</code>, we could build a simple script to track the timestamps when each process was switched <i>off</i> the CPU, then subtract that timestamp from the next time that the process gets scheduled back onto a CPU. Aggregating over a suitable period would then give us the maximum time off-cpu in microseconds for each process in the trace. This information would be useful in looking for a correlation between network backlog and off-cpu time.</p>
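<p>Such a script might look like the following sketch (in Java for consistency with the rest of this blog; the field layout is assumed to match the <code>ftrace</code> excerpt above, with timestamps read as seconds.microseconds):</p>
<pre><code>import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: compute max off-cpu time per pid from ftrace sched_switch output.
// Assumed line format, per the excerpt above:
//   &lt;comm&gt;-&lt;pid&gt; [cpu] seconds.micros: sched_switch: prev:prio:state ==&gt; next:prio: comm
public final class OffCpuFromFtrace {
    private static final Pattern SWITCH = Pattern.compile(
            "\\s(\\d+)\\.(\\d+): sched_switch: (\\d+):\\d+:\\S+ ==> (\\d+):\\d+:");

    public static void main(final String[] args) throws IOException {
        final Map<Long, Long> switchedOutAtMicros = new HashMap<>();
        final Map<Long, Long> maxOffCpuMicros = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                final Matcher m = SWITCH.matcher(line);
                if (!m.find()) {
                    continue;
                }
                final long micros = Long.parseLong(m.group(1)) * 1_000_000L + Long.parseLong(m.group(2));
                final long prevPid = Long.parseLong(m.group(3));
                final long nextPid = Long.parseLong(m.group(4));
                switchedOutAtMicros.put(prevPid, micros);   // record switch-out time
                final Long lastOff = switchedOutAtMicros.remove(nextPid);
                if (lastOff != null) {                      // switch-in: compute off-cpu delta
                    maxOffCpuMicros.merge(nextPid, micros - lastOff, Math::max);
                }
            }
        }
        maxOffCpuMicros.forEach((pid, max) -> System.out.println(pid + " max off-cpu " + max + "us"));
    }
}</code></pre>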
<h2 id="determining-off-cpu-time-with-bcc">Determining off-cpu time with <code>BCC</code></h2>
<p>So that's the old way, and it has its drawbacks. <code>ftrace</code> can be quite heavy-weight, and we have found that running it can cause significant overhead in our performance-test environment when running under heavy load. Also, we are only really interested in the single dedicated network-processing core. While it is possible to supply <code>ftrace</code> with a cpumask to restrict this, the interface is a little clunky, and doesn't seem to work very well with the <code>trace-cmd</code> front-end to <code>ftrace</code>.</p>
<p>Ideally, we would like a lighter-weight mechanism for hooking in to context switches, and performing the aggregation in kernel-space rather than by post-processing the <code>ftrace</code> output.</p>
<p>Luckily, <code>BCC</code> allows us to do just that.</p>
<p>Looking at the <a href="http://lxr.free-electrons.com/source/kernel/sched/core.c#L3346">kernel source</a>, we can see that the <code>trace_sched_switch</code> event is emitted immediately before the call to <code>context_switch</code>:</p>
<pre><code> trace_sched_switch(preempt, prev, next);
rq = context_switch(rq, prev, next, cookie); /* unlocks the rq */</code></pre>
<p>At the end of the <a href="http://lxr.free-electrons.com/source/kernel/sched/core.c#L2819"><code>context_switch</code></a> function, <code>finish_task_switch</code> is called:</p>
<pre><code> /* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
barrier();
return finish_task_switch(prev);
}</code></pre>
<p>The argument passed to <code>finish_task_switch</code> is a <code>task_struct</code> representing the process that is being switched out. This looks like a good spot to attach a trace, where we can track the time when a process is switched out of the CPU.</p>
<p>In order to do this using <code>BCC</code>, we use a <a href="https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#1-kprobes"><code>kprobe</code></a> to hook in to the kernel function <code>finish_task_switch</code>. Using this mechanism, we can attach a custom tracing function to the kernel's <code>finish_task_switch</code> function.</p>
<h2 id="bcc-programs"><code>BCC</code> programs</h2>
<p>The current method for interacting with the various probe types is via a Python-C bridge. Trace functions are written in C, then compiled and injected by a Python program, which can then interact with the tracing function via the kernel's tracing subsystem.</p>
<p>There is a lot of detail to cover in the various <a href="https://github.com/iovisor/bcc/tree/master/examples">examples</a> and <a href="https://github.com/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md">user-guides</a> available on the <code>BCC</code> site, so I will just focus on a walk-through of my use-case.</p>
<p>First off, the C function:</p>
<pre><code>#include <linux/types.h>
#include <linux/sched.h>
#include <uapi/linux/ptrace.h>
struct proc_name_t {
char comm[TASK_COMM_LEN];
};
BPF_TABLE("hash", pid_t, u64, last_time_on_cpu, 1024);
BPF_TABLE("hash", pid_t, u64, max_offcpu, 1024);
BPF_TABLE("hash", pid_t, struct proc_name_t, proc_name, 1024);
int trace_finish_task_switch(struct pt_regs *ctx, struct task_struct *prev) {
// bail early if this is not the CPU we're interested in
u32 target_cpu = %d;
if(target_cpu != bpf_get_smp_processor_id())
{
return 0;
}
// get information about previous/next processes
struct proc_name_t pname = {};
pid_t next_pid = bpf_get_current_pid_tgid();
pid_t prev_pid = prev->pid;
bpf_get_current_comm(&pname.comm, sizeof(pname.comm));
// store mapping of pid -> command for display
proc_name.update(&next_pid, &pname);
// lookup current values for incoming process
u64 *last_time;
u64 *current_max_offcpu;
u64 current_time = bpf_ktime_get_ns();
last_time = last_time_on_cpu.lookup(&next_pid);
current_max_offcpu = max_offcpu.lookup(&next_pid);
// update max offcpu time
if(last_time != NULL) {
u64 delta_nanos = current_time - *last_time;
if(current_max_offcpu == NULL) {
max_offcpu.update(&next_pid, &delta_nanos);
}
else {
if(delta_nanos > *current_max_offcpu) {
max_offcpu.update(&next_pid, &delta_nanos);
}
}
}
// store outgoing process' time
last_time_on_cpu.update(&prev_pid, &current_time);
return 0;
};</code></pre>
<p>Conceptually, all this program does is update a map of <code>pid</code> -> <code>timestamp</code>, every time a process is switched off-cpu.</p>
<p>Then, if a timestamp exists for the task currently being scheduled <i>on-cpu</i>, we calculate the delta in nanoseconds (i.e. the time that the process was not on the cpu), and track the max value seen so far.</p>
<p>This is all executed in the kernel context, with very low overhead.</p>
<p>Next up, we have the Python program, which can read the data structure being populated by the trace function:</p>
<pre><code>while 1:
time.sleep(1)
for k,v in b["max_offcpu"].iteritems():
if v != 0:
proc_name = b["proc_name"][k].comm
print("%s max offcpu for %s is %dus" %
(datetime.datetime.now(), proc_name, v.value/1000))
b["max_offcpu"].clear()</code></pre>
<p>Here we sleep for one second, then iterate over the items in the <code>max_offcpu</code> hash. This will contain an entry for every process observed by the tracing function that has been switched off the CPU, and back in at least once.</p>
<p>After printing the off-cpu duration of each process, we then clear the data-structure so that it will be populated with fresh data on the next iteration.</p>
<p>I'm still a little unclear on the raciness of this operation - I don't understand well enough whether there could be lost updates between the reporting and the call to <code>clear()</code>.</p>
<p>Lastly, this all needs to be wired up in the Python script:</p>
<pre><code>b = BPF(text=prog % (int(sys.argv[1])))
b.attach_kprobe(event="finish_task_switch", fn_name="trace_finish_task_switch")</code></pre>
<p>Here we are requesting that the kernel function <code>finish_task_switch</code> be instrumented, and our custom function <code>trace_finish_task_switch</code> attached.</p>
<h2 id="results">Results</h2>
<p>Using my laptop to test this script, I simulated our use-case by isolating a CPU (3) from the OS scheduler via the <code>isolcpus</code> kernel parameter.</p>
<p>Running my <code>offcpu.py</code> script with <code>BCC</code> installed generated the following output:</p>
<pre><code>[root@localhost scheduler]# python ./offcpu.py 3
2016-08-12 16:13:42.163032 max offcpu for swapper/3 is 5us
2016-08-12 16:13:43.164501 max offcpu for swapper/3 is 18us
2016-08-12 16:13:44.166178 max offcpu for kworker/3:1 is 990961us
2016-08-12 16:13:44.166405 max offcpu for swapper/3 is 10us
2016-08-12 16:13:46.169315 max offcpu for watchdog/3 is 3999989us
2016-08-12 16:13:46.169413 max offcpu for swapper/3 is 6us
2016-08-12 16:13:50.175329 max offcpu for watchdog/3 is 4000011us
2016-08-12 16:13:50.175565 max offcpu for swapper/3 is 9us</code></pre>
<p>This tells me that most of the time the <code>swapper</code> or idle process was scheduled on CPU 3 (this makes sense - the OS scheduler is not allowed to schedule any userland programs on it because of <code>isolcpus</code>).</p>
<p>Every so often, a <code>watchdog</code> or <code>kworker</code> thread is scheduled on, kicking off the <code>swapper</code> process for a few microseconds.</p>
<p>If I now simulate our workload by executing a user process on CPU 3 (just like our network-processing thread that is busy-spinning trying to receive network traffic), I should be able to see it being kicked off by the kernel threads.</p>
<p>The user-space task is not complicated:</p>
<pre><code>cat hot_loop.sh
while [ 1 ]; do
echo "" > /dev/null
done</code></pre>
<p>and executed using <code>taskset</code> to make sure it executes on the correct CPU:</p>
<pre><code>taskset -c 3 bash ./hot_loop.sh</code></pre>
<p>Running the <code>BCC</code> script now generates this output:</p>
<pre><code>[root@localhost scheduler]# python ./offcpu.py 3
2016-08-12 16:14:29.120617 max offcpu for bash is 4us
2016-08-12 16:14:30.121861 max offcpu for kworker/3:1 is 1000003us
2016-08-12 16:14:30.121925 max offcpu for bash is 5us
2016-08-12 16:14:31.123140 max offcpu for kworker/3:1 is 999986us
2016-08-12 16:14:31.123201 max offcpu for bash is 4us
2016-08-12 16:14:32.123675 max offcpu for kworker/3:1 is 999994us
2016-08-12 16:14:32.123747 max offcpu for bash is 5us
2016-08-12 16:14:33.124976 max offcpu for kworker/3:1 is 999994us
2016-08-12 16:14:33.125046 max offcpu for bash is 5us
2016-08-12 16:14:34.126287 max offcpu for kworker/3:1 is 999995us
2016-08-12 16:14:34.126366 max offcpu for bash is 5us</code></pre>
<p>Great! The tracing function is able to show that my user-space program is being kicked off the CPU for around 5 microseconds.</p>
<p>I can now deploy this script to our performance environment and see whether the network-processing thread is being descheduled for long periods of time, and whether any of these periods correlate with increases in network buffer depth.</p>
<h2 id="further-reading">Further reading</h2>
<p>There are plenty of examples to work through in the <code>BCC</code> repository. Aside from <code>kprobes</code>, there are <code>uprobes</code> that allow user-space code to be instrumented. Some programs also contain user statically-defined tracepoints (USDTs), which are akin to the kernel's static tracepoints.</p>
<p>Familiarity with the functions and tracepoints within the kernel is a definite bonus, as it helps us to understand what information is available. My first port of call is usually running <code>perf record</code> while capturing stack-traces to get an idea of what a program is actually doing. After that, it's possible to start looking through the kernel source code looking for useful data to collect.</p>
<p>The scripts referenced in this post are <a href="https://github.com/epickrram/tracing/tree/master/scheduler">available on github</a>.</p>
<p>Be warned, they may not be the best example of using <code>BCC</code>, and are the result of an afternoon's hacking.</p>
<br /></div>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-20116834081544116792016-06-23T15:26:00.003+01:002020-09-14T19:29:53.210+01:00Angler<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">In my last <a href="http://epickrram.blogspot.co.uk/2016/05/navigating-linux-kernel-network-stack.html">couple</a> of <a href="http://epickrram.blogspot.co.uk/2016/05/navigating-linux-kernel-network-stack_18.html">posts</a>, I've been looking at how UDP network packets are received by the Linux kernel. While diving through the source code, it has been shown that there are a number of statistics available for monitoring receive errors, buffer overruns, and queue depths.<br />
<br />
In the course of investigating network throughput issues in our systems at <a href="https://www.lmax.com/">LMAX</a>, we have written some tooling for monitoring the available statistics. The result of that work is a small utility that provides an interface for monitoring system-wide or socket-specific statistics from a Java program.<br />
<br />
The code is available in the <a href="https://github.com/LMAX-Exchange/angler">Angler</a> github repository.<br />
<br />
<br />
<h3 style="text-align: left;">
Who is it for?</h3>
<br />
This utility may be of use to you if you are interested in metrics and alerting around network throughput on Linux. Currently, only UDP socket monitoring is available, though we have plans to add similar functionality for TCP sockets.<br />
<br />
Angler works by reading and parsing files in the <span style="font-family: "courier new" , "courier" , monospace;">/proc/</span> filesystem, and reporting metrics back to your application. It is then up to the user to decide how to handle this data. Perhaps the correct action is simply to report the numbers to a time-series database for charting or threshold alerting. Another valid use-case would be to apply back-pressure to a publishing system in the event of buffer overflow or increasing queue depth.<br />
<br />
Angler is designed for use in latency-sensitive systems, and is garbage-free in steady state. It can, of course, be used in systems where garbage-collection is not an issue.<br />
<br />
<br />
<h3 style="text-align: left;">
Available statistics</h3>
<br />
Angler offers an API to monitor individual sockets specified by either a <span style="font-family: "courier new" , "courier" , monospace;">host:port</span> combination (an instance of <span style="font-family: "courier new" , "courier" , monospace;">java.net.InetSocketAddress</span>), or all sockets listening to a particular IP address (an instance of <span style="font-family: "courier new" , "courier" , monospace;">java.net.InetAddress</span>).<br />
<br />
To begin monitoring a socket, use one of the <span style="font-family: "courier new" , "courier" , monospace;">beginMonitoring</span><span style="font-family: inherit;"><span style="font-family: inherit;"> methods on </span><a href="https://github.com/LMAX-Exchange/angler/blob/master/src/main/java/com/lmax/angler/monitoring/network/monitor/socket/udp/UdpSocketMonitor.java"><span style="font-family: "courier new" , "courier" , monospace;">UdpSocketMonitor</span></a><span style="font-family: inherit;">:</span></span><br />
<span style="font-family: inherit;"><br /></span>
<script src="https://gist.github.com/epickrram/b8f1f1f36d834d2f9582fd43d25c8f1a.js"></script>
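In case the embedded gist is not visible, the following is a rough usage sketch; the construction step and handler shape are illustrative assumptions based on the description in this post, not the exact Angler signatures:<br />
<pre><code>// illustrative sketch only -- see the linked UdpSocketMonitor source for the real API
final UdpSocketMonitor monitor = createMonitor(); // hypothetical helper; construction elided

// request monitoring of a specific host:port combination
monitor.beginMonitoringOf(new InetSocketAddress("239.168.1.1", 14500));

// on a polling thread; statistics for each monitored socket are
// pushed to the supplied handler on each invocation
monitor.poll(statisticsHandler);</code></pre>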
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"><br /></span>
<br />
Once a socket monitoring request has been made, available socket statistics will be provided to the application on the next invocation of the monitor's <span style="font-family: "courier new" , "courier" , monospace;">poll</span> method.<br />
<br />
The callback method is invoked for each monitored socket, reporting the receive queue depth and drop count:<br />
<br />
<br />
<script src="https://gist.github.com/epickrram/c2048481aa1cfd7252d8430b405523c9.js"></script>
<br />
<br />
System-wide statistics are available from <span style="font-family: "courier new" , "courier" , monospace;">/proc/net/softnet_stat</span> and <span style="font-family: "courier new" , "courier" , monospace;">/proc/net/snmp</span>. See previous posts for more information on exactly what is reported in these files.<br />
<br />
<br />
The softnet data is provided by <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://github.com/LMAX-Exchange/angler/blob/master/src/main/java/com/lmax/angler/monitoring/network/monitor/system/softnet/SoftnetStatsMonitor.java">SoftnetStatsMonitor</a></span>, and is made available to the following callback method:<br />
<br />
<br />
<script src="https://gist.github.com/epickrram/f1a0b8dc975d0fb64c240c1b65c78770.js"></script>
<br />
Changes in these numbers can indicate that the Linux worker threads are not getting enough time to dequeue incoming packets from the network device.<br />
<br />
<br />
SNMP data is provided by <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://github.com/LMAX-Exchange/angler/blob/master/src/main/java/com/lmax/angler/monitoring/network/monitor/system/snmp/SystemNetworkManagementMonitor.java">SystemNetworkManagementMonitor</a></span>, and is delivered via the following callback:<br />
<br />
<br />
<script src="https://gist.github.com/epickrram/90ff8427ee7cc7b7a56b613ce952790b.js"></script>
<br />
<br />
These statistics report a global view of receive errors, which could be caused by buffer overruns, memory exhaustion or other factors.<br />
<br />
A complete example of these methods can be found in the <a href="https://github.com/LMAX-Exchange/angler/blob/master/src/test/java/com/lmax/angler/monitoring/network/monitor/example/ExampleApplication.java">ExampleApplication</a>.<br />
<br />
<br />
<h3 style="text-align: left;">
Production use</h3>
<br />
At LMAX, we have been using Angler in production for some time, so consider it production-ready. We poll the files in <span style="font-family: "courier new" , "courier" , monospace;">/proc/</span> at up to 100 times per second on some services, in order to get a more fine-grained view of receive buffer depths. So far, we have not encountered any issues with this approach; a careful review of the kernel source code responsible for supplying the statistics indicates only a very small chance of lock contention.<br />
<br />
Version 1.0.3 is currently available on <a href="http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22angler%22">maven central</a>.<br />
<br />
<a href="https://github.com/LMAX-Exchange/angler">Contributions and feedback are welcome!</a><br />
<br />
</div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-70123641430230490822016-05-18T14:15:00.001+01:002020-09-14T19:30:09.421+01:00Navigating the Linux kernel network stack: into user land<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">This is a continuation of my <a href="http://epickrram.blogspot.co.uk/2016/05/navigating-linux-kernel-network-stack.html">previous post</a>, in which we follow the path of an inbound multicast packet to the user application.<br />
At the end of the last post, I mentioned that application back-pressure was most likely to be the cause of receiver packet-loss. As we continue through the code path taken by inbound packets up into user-land, we will see the various points at which data will be discarded due to slow application processing, and what metrics can shed light on these events.<br />
<h2 id="protocol-mapping">
Protocol mapping</h2>
<br />
First of all, let's take a quick look at how a received data packet is matched up to its <a href="http://lxr.free-electrons.com/source/net/core/dev.c?v=4.0#L1735">handler function</a>.<br />
We can see from the stack-trace below that the top-level function calls when dealing with the packets are <code>packet_rcv</code> and <code>ip_rcv</code>:<br />
<br />
<br />
<br />
<pre><code>__netif_receive_skb_core() {
    packet_rcv() {
        skb_push();
        __bpf_prog_run();
        consume_skb();
    }
    bond_handle_frame() {
        bond_3ad_lacpdu_recv();
    }
    packet_rcv() {
        skb_push();
        __bpf_prog_run();
        consume_skb();
    }
    packet_rcv() {
        skb_push();
        __bpf_prog_run();
        consume_skb();
    }
    ip_rcv() {
        ...</code></pre>
This trace captures the <a href="http://lxr.free-electrons.com/source/net/core/dev.c?v=4.0#L3671">part of the receive path</a> where the inbound packet is passed to each registered handler function. In this way, the kernel handles things like VLAN tagging, interface bonding, and packet-capture. Note the <code>__bpf_prog_run</code> function call, which indicates that this is the entry point for <code>pcap</code> packet capture and filtering.<br />
<br />
<br />
The protocol-to-handler mapping can be viewed in the file <code>/proc/net/ptype</code>:<br />
<pre><code>[pricem@metal]$ cat /proc/net/ptype
Type Device Function
ALL em1 packet_rcv
0800 ip_rcv
0011 llc_rcv [llc]
0004 llc_rcv [llc]
88f5 mrp_rcv [mrp]
0806 arp_rcv
86dd ipv6_rcv</code></pre>
<br />
<br />
Comparing against <a href="https://en.wikipedia.org/wiki/EtherType#Notable_values">this reference</a>, it is clear that the kernel reads the value of the ethernet frame's <code>EtherType</code> field and dispatches to the corresponding function. There are also some functions that will be executed for all packet types (signified by type <code>ALL</code>).<br />
<br />
<br />
So our inbound packet will be processed by the capture hooks, even if no packet capture is running (this must presumably be very cheap), before being passed to the correct protocol handler, in this case <code>ip_rcv</code>.<br />
<h2 id="the-socket-buffer">
The socket buffer</h2>
<div>
<br /></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/blog-images/master/2016_05/SocketBufferAccess.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="340" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2016_05/SocketBufferAccess.jpg" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
As discussed previously, each socket on the system has a receive-side FIFO queue that is written to by the kernel, and read from by the user application.<br />
<br />
<code><a href="http://lxr.free-electrons.com/source/net/ipv4/ip_input.c?v=4.0#L376">ip_rcv</a></code> starts by getting its own copy of the incoming packet, copying if the packet is already shared. If the copy fails due to lack of memory, then the packet is discarded, and the <code>Discard</code> count of the global ip statistics table is incremented. Other checks made at this point include the IP checksum, header-length check and truncation check, each of which update the relevant metrics.<br />
<br />
Before calling the <code>ip_rcv_finish</code> function, the packet is diverted through the <code>netfilter</code> module where software-based network filtering can be applied.<br />
<br />
Assuming that the packet is not dropped by a filter, <a href="http://lxr.free-electrons.com/source/net/ipv4/ip_input.c?v=4.0#L312"><code>ip_rcv_finish</code></a> passes the packet on to the next protocol-handler in the chain, in this instance, <code>udp_rcv</code>.<br />
<br />
In the internals of the <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1749"><code>udp_rcv</code></a> function, we finally get to the point of accessing the socket FIFO queue.<br />
Simple validation performed at this point includes packet-length check and checksum. Failure of these checks will cause the relevant statistics to be updated in the global udp statistics table.<br />
<br />
Because the traffic we're tracing <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1801">is multicast</a>, the next handler function is <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1660"><code>__udp4_lib_mcast_deliver</code></a>.<br />
With some exotic locking in place, the kernel determines how many different sockets this multicast packet needs to be delivered to. It is worth noting here that there is a smallish number (<code>256 / sizeof(struct sock *)</code>, i.e. 32 on a 64-bit system) of sockets that can be listening to a given multicast address before the possibility of greater lock-contention creeps in.<br />
<br />
An effort is made to enumerate the registered sockets with a lock held, then perform the packet dispatching without a lock. However, if the number of registered sockets exceeds the specified threshold, then packet dispatch (the <code>flush_stack</code> function) will be handled while a lock is held.<br />
<br />
Once in <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1614"><code>flush_stack</code></a>, the packet is copied for each registered socket, and pushed onto the FIFO queue.<br />
If the kernel is unable to allocate memory in order to copy the buffer, the socket's drop-count will be incremented, along with the <code>RcvbufErrors</code> and <code>InErrors</code> metrics in the global udp statistics table.<br />
<br />
After another checksum test, we are finally at the point where socket-buffer overflow <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1584">is tested</a>:<br />
<br />
<br />
<pre><code>if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {
    UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,
                     is_udplite);
    goto drop;
}</code></pre>
<br />
<br />
If the size of the socket's backlog queue, plus the memory used in the receive queue, is greater than the socket receive buffer size, then the <code>RcvbufErrors</code> and <code>InErrors</code> metrics are updated in the global udp statistics table, along with the socket's drop count.<br />
<br />
<br />
To <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1594">safely handle multi-threaded access to the socket buffer</a>, if an application has locked the socket, then the inbound packet will be queued to the socket's <a href="http://lxr.free-electrons.com/source/include/net/sock.h?v=4.0#L335">backlog queue</a>. The backlog queue will be processed when the lock owner releases the lock.<br />
<br />
Otherwise, pending more checks that socket memory limits have not been exceeded, the packet is added to the socket's <a href="http://lxr.free-electrons.com/source/net/core/sock.c?v=4.0#L439"><code>sk_receive_queue</code></a>.<br />
<br />
Finally, once the packet has been delivered to the socket's receive queue, notification is <a href="http://lxr.free-electrons.com/source/net/core/sock.c?v=4.0#L474">sent to any interested listeners</a>.<br />
<h2 id="kernel-statistics">
Kernel statistics</h2>
<div>
<br /></div>
<br />
Now that we have seen where the various capacity checks are made, we can make some observations in the <code>proc</code> file system.<br />
<br />
There are protocol-specific files in <code>/proc/net</code> that are updated by the kernel whenever data is received.<br />
<h3 id="procnetsnmp">
<code>/proc/net/snmp</code></h3>
<br />
This file contains statistics on multiple protocols, and is the target for updates to the UDP receive buffer errors. So when we want to know about system-wide receive errors, this is the place to look. When this file is read, a <a href="http://lxr.free-electrons.com/source/net/ipv4/proc.c?v=4.0#L374">function call</a> iterates through the stored values and writes them to the supplied destination. Such is the magic of <a href="https://en.wikipedia.org/wiki/Procfs"><code>procfs</code></a>.<br />
<br />
In the output below, we can see that there have been several receive errors (<code>RcvbufErrors</code>), and the number is matched by the <code>InErrors</code> metric:<br />
<br />
<br />
<pre><code>Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 281604018683 4 261929024 42204463516 261929024 0 0</code></pre>
<br />
<br />
Each of the columns maps to a constant specified <a href="http://lxr.free-electrons.com/source/net/ipv4/proc.c?v=4.0#L176">here</a>, so to determine when these values are updated, we simply need to <a href="https://www.google.co.uk/search?q=UDP_MIB_RCVBUFERRORS&sitesearch=lxr.free-electrons.com/source">search for the constant name</a>.<br />
<br />
<br />
There is a subtle difference between <code>InErrors</code> and <code>RcvbufErrors</code>, but in our case when looking for socket buffer overflow, we only care about <code>RcvbufErrors</code>.<br />
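As an illustration, here is a minimal sketch of extracting <code>RcvbufErrors</code> from this file in Java, assuming the layout shown above (a <code>Udp:</code> header row naming the columns, followed by a <code>Udp:</code> value row); unlike monitoring code intended for latency-sensitive systems, this version allocates freely:<br />
<pre><code>import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public final class UdpSnmpReader
{
    public static long readRcvbufErrors() throws IOException
    {
        String[] header = null;
        for (final String line : Files.readAllLines(Paths.get("/proc/net/snmp")))
        {
            if (line.startsWith("Udp:"))
            {
                final String[] columns = line.split("\\s+");
                if (header == null)
                {
                    header = columns; // first Udp: row names the columns
                }
                else
                {
                    // second Udp: row holds the values; match by column index
                    return Long.parseLong(columns[Arrays.asList(header).indexOf("RcvbufErrors")]);
                }
            }
        }
        throw new IllegalStateException("Udp section not found in /proc/net/snmp");
    }
}</code></pre>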
<h3 id="procnetudp">
<code>/proc/net/udp</code></h3>
<br />
This file contains socket-specific statistics, which only live for the lifetime of a socket. As we have seen from the code so far, each time the <code>RcvbufErrors</code> metric has been incremented, so too has the corresponding <code>sk_drops</code> value for the target socket.<br />
<br />
So if we need to differentiate between drop events on a per-socket basis, this file is what we need to observe. The contents below can be explained by looking at the <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c#L2517">handler function</a> for socket data:<br />
<pre><code>[pricem@metal ~]# cat /proc/net/udp
sl local_address rem_address st tx_queue rx_queue ... inode ref pointer drops
47: 00000000:B87A 00000000:0000 07 00000000:00000000 ... 239982339 2 ffff88052885c440 0
95: 4104C1EF:38AA 00000000:0000 07 00000000:00000000 ... 239994712 2 ffff8808507ed800 0
95: BC04C1EF:38AA 00000000:0000 07 00000000:00000000 ... 175113818 2 ffff881054a8b080 0
95: BE04C1EF:38AA 00000000:0000 07 00000000:00000000 ... 175113817 2 ffff881054a8b440 0
...</code></pre>
<br />
<br />
For observing back-pressure, we are interested in the drops column, and the rx_queue column, which is a snapshot of the amount of queued data at the time of reading. Local and remote addresses are hex-encoded <code>ip:port</code> addresses.<br />
<br />
Monitoring this file can tell us if our application is falling behind the packet arrival rate (increase in <code>rx_queue</code>), or whether the kernel was unable to copy a packet into the socket buffer due to it already being full (increase in <code>drops</code>).<br />
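To make that concrete, here is a minimal sketch of pulling those two columns out of the file, based on the layout shown above:<br />
<pre><code>import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class UdpSocketReader
{
    public static void main(final String[] args) throws IOException
    {
        for (final String line : Files.readAllLines(Paths.get("/proc/net/udp")))
        {
            final String[] columns = line.trim().split("\\s+");
            if (columns[0].equals("sl"))
            {
                continue; // skip the header row
            }
            final String localAddress = columns[1]; // hex-encoded ip:port
            // tx_queue:rx_queue appear as a single hex-encoded pair
            final long rxQueue = Long.parseLong(columns[4].split(":")[1], 16);
            final long drops = Long.parseLong(columns[columns.length - 1]);
            System.out.printf("%s rx_queue=%d drops=%d%n", localAddress, rxQueue, drops);
        }
    }
}</code></pre>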
<h2 id="tracepoints">
Tracepoints</h2>
<br />
If we wish to capture more data about packets that are being dropped, there are three tracepoints of interest.<br />
<ol style="list-style-type: decimal;">
<li><a href="http://lxr.free-electrons.com/source/net/core/sock.c?v=4.0#L447"><code>sock:sock_rcvqueue_full</code></a>: called when the amount of memory already allocated to the socket buffer is greater than or equal to the configured socket buffer size.</li>
<li><a href="http://lxr.free-electrons.com/source/net/core/sock.c?v=4.0#L2084"><code>sock:sock_exceed_buf_limit</code></a>: called when the kernel is unable to allocate more memory for the socket buffer.</li>
<li><a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1473"><code>udp:udp_fail_queue_rcv_skb</code></a>: called shortly after the <code>sock_rcvqueue_full</code> event trace for the same reason.</li>
</ol>
<br />
These events can be captured using Linux kernel tracing tools such as <a href="https://perf.wiki.kernel.org/index.php/Main_Page"><code>perf</code></a> or <a href="https://www.kernel.org/doc/Documentation/trace/ftrace.txt"><code>ftrace</code></a>.<br />
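For example, a system-wide capture of the first of these events for ten seconds might look something like this:<br />
<pre><code>perf record -e sock:sock_rcvqueue_full -a -- sleep 10
perf script</code></pre>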
<h2 id="summary">
Summary</h2>
<br />
Let's just recap the points of interest for monitoring throughput issues in the network receive path.<br />
<ol style="list-style-type: decimal;">
<li><code>/proc/net/softnet_stat</code>: contains statistics updated by the ksoftirq daemon. Useful metrics are <code>processed</code>, <code>time_squeeze</code> and <code>dropped</code> (discussed in detail in the <a href="https://epickrram.blogspot.co.uk/2016/05/navigating-linux-kernel-network-stack.html">last post</a>).</li>
<li><code>/proc/net/snmp</code>: contains system-wide statistics for IP, TCP, UDP. Useful metrics indicate memory exhaustion, buffer exhaustion, etc.</li>
<li><code>/proc/net/udp</code>: contains per-socket information. Useful for monitoring queue depths/drops for a particular socket.</li>
<li><code>/proc/net/dev</code>: <a href="http://lxr.free-electrons.com/source/net/core/net-procfs.c?v=4.0#L101">contains statistics</a> provided by network devices present on the system. Some supplied statistics are driver-specific.</li>
<li><code>/sys/class/net/DEV_NAME/statistics</code>: provides <a href="http://lxr.free-electrons.com/source/include/uapi/linux/if_link.h?v=4.0#L41">more detailed statistics</a> from the network device driver.</li>
</ol>
<br />
</div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-29763523526571727842016-05-06T15:54:00.002+01:002020-09-14T19:30:28.085+01:00Navigating the Linux kernel network stack: receive path<div dir="ltr" style="text-align: left;" trbidi="on">
<p id="background" style="text-align: left;"><span></span></p><a name='more'></a><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span><p></p><p id="background" style="text-align: left;"><span style="font-size: large;"><span></span></span></p><!--more--><span style="font-size: large;"> </span> <br /><p></p><h2 id="background">Background</h2>
<br />
At <a href="https://lmax.com/">work</a> we practice continuous integration in terms of <a href="http://epickrram.blogspot.co.uk/2014/05/performance-testing-at-lmax-part-one.html">performance testing</a> alongside different stages of functional testing.<br />
<br />
In order to do this, we have a performance environment that fully replicates the hardware and software used in our production environments. This is necessary in order to be able to find the limits of our system in terms of throughput and latency, and means that we make sure that the environments are identical, right down to the network cables.<br />
<br />
Since we like to be ahead of the curve, we are constantly trying to push the boundaries of our system to find out where it will fall over, and the nature of the failure mode.<br />
<br />
This involves <a href="http://epickrram.blogspot.co.uk/2014/08/performance-testing-at-lmax-part-three.html">running production-like load</a> against the system at a much higher rate than we have ever seen in production. We currently aim to be able to handle a constant throughput of 2-5 times the maximum peak throughput ever observed in production. We believe that this will give us enough headroom to handle future capacity requirements.<br />
<br />
Our Performance & Capacity team has a constant background task of:<br />
<ol style="list-style-type: decimal;">
<li>increase load applied to the system until it breaks</li>
<li>find and fix the bottleneck</li>
</ol>
Using this process, we aim to ensure that we are able to handle spikes in demand, and increases in user numbers, while still achieving a consistent and low latency-profile.<br />
<br />
There is actually a third step to this process, which is something like 'buy new hardware' or 'modify the system to do less work', since there is only so much that tuning will buy you.<br />
<br />
The code that is inspected in this article is that taken by inbound multicast traffic on Linux kernel version 4.0.<br />
<h2 id="identifying-a-bottleneck">
Identifying a bottleneck</h2>
<br />
During the latest iteration of the break/fix cycle, we identified one particular service as problematic. The service in question is one of those responsible for consuming the output of our matching engine, which can peak at 250,000 messages per second at our current performance-test load.<br />
<br />
When we have a bottleneck in one of our services, it usually follows a familiar pattern of back-pressure from the application processing thread, resulting in packets being dropped by the networking card.<br />
<br />
In this case however, we could see from our monitoring that we were not suffering from any processing back-pressure, so it was necessary to delve a little deeper into the network packet receive path in order to understand the problem.<br />
<h2 id="understanding-the-data-flow">
Understanding the data flow</h2>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/blog-images/master/2016_04/SoftIRQProcessing.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="459" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2016_04/SoftIRQProcessing.jpg" width="640" /></a></div>
<div>
<br /></div>
<div style="text-align: center;">
<i>Components involved in network packet receipt</i></div>
<div>
<br /></div>
The Linux kernel provides a number of counters that can give an indication of any problems in the network stack. Since we are concerned with throughput, we will be most interested in things like queue depths and drop counts.<br />
<br />
Before looking at the available statistics, let's take a look at how a packet is handled once it is pulled off the wire.<br />
<br />
The journey begins in the network driver code; this is vendor-specific and in the majority of cases open source. In this example, we're working with an Intel 10Gb card, which uses the ixgbe driver. You can find out the driver used by a network interface by using <code>ethtool</code>:<br />
<pre><code>ethtool -i <device-name></code></pre>
<br />
This will generate output that looks something like:<br />
<pre><code>driver: ixgbe
version: 3.19.1-k
firmware-version: 0x546d0001
bus-info: 0000:41:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no</code></pre>
<br />
The driver code can be found in the Linux kernel source <a href="http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/?v=4.0">here</a>.<br />
<h3 id="napi">
NAPI</h3>
<br />
NAPI, or New API, is a mechanism introduced into the kernel several years ago. More background can be read <a href="http://www.linuxfoundation.org/collaborate/workgroups/networking/napi">here</a>, but in summary, NAPI increases network receive performance by changing packet receipt from interrupt-driven to polling mode.<br />
<br />
Prior to the introduction of NAPI, network cards would typically fire a hardware interrupt for each received packet. Since an interrupt on a CPU always suspends the executing software, a high interrupt rate can interfere with software performance. NAPI addresses this by exposing a poll method to the kernel, which is periodically executed (actually via an interrupt). While the poll method is executing, receive interrupts for the network device are disabled. The effect of this is that the kernel can drain potentially multiple packets from the network device receive buffer, increasing throughput while reducing interrupt overhead.<br />
<h3 id="interrupt-handling">
Interrupt handling</h3>
<br />
When the network device driver is initially configured, it first associates a handler function with the receive interrupt. This function will be invoked whenever the CPU receives a hardware interrupt from the network card.<br />
<br />
For the card that we're looking at, this happens in a method called <a href="http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c?v=4.0#L2740">ixgbe_request_msix_irqs</a>:<br />
<pre><code>request_irq(entry->vector, &ixgbe_msix_clean_rings, 0,
            q_vector->name, q_vector);</code></pre>
<br />
So when an interrupt is received by the CPU, the ixgbe_msix_clean_rings method simply <a href="http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c?v=4.0#L2675">schedules a NAPI poll</a>, and returns IRQ_HANDLED:<br />
<pre><code>static irqreturn_t ixgbe_msix_clean_rings(int irq, void *data)
{
    struct ixgbe_q_vector *q_vector = data;
    ...
    if (q_vector->rx.ring || q_vector->tx.ring)
        napi_schedule(&q_vector->napi);
    return IRQ_HANDLED;
}</code></pre>
<br />
Scheduling the NAPI poll entails <a href="http://lxr.free-electrons.com/source/net/core/dev.c?v=4.0#L3022">adding some work</a> to the per-cpu poll list maintained in the <code>softnet_data</code> structure:<br />
<pre><code>static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
    list_add_tail(&napi->poll_list, &sd->poll_list);
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}</code></pre>
<br />
and then raising a <code>softirq</code> event. Once the <code>softirq</code> event has been raised, the driver knows that the poll function will be called in the near future.<br />
<h3 id="softirq-processing">
softirq processing</h3>
<br />
For more background on interrupt handling, the <a href="https://lwn.net/Kernel/LDD3/">Linux Device Drivers</a> book has a chapter dedicated to this topic. Suffice to say, doing work inside of a hardware interrupt context is generally avoided within the kernel; while handling a hardware interrupt a CPU is not executing user or kernel software threads, and no other hardware interrupts can be handled until the current routine is complete. One mechanism for dealing with this is to use <code>softirqs</code>.<br />
<br />
Each CPU in the system has a bound process called <code>ksoftirqd/<cpu_number></code>, which is responsible for processing <code>softirq</code> events.<br />
<br />
In this manner, when a hardware interrupt is received, the driver raises a softIRQ to be processed on the <code>ksoftirqd</code> process. So it is this process that will be responsible for calling the driver's <code>poll</code> method.<br />
<br />
The <code>softirq</code> handler <a href="http://lxr.free-electrons.com/source/net/core/dev.c?v=4.0#L7475"><code>net_rx_action</code></a> is configured for network packet receive events during device initialisation. All <code>softirq</code> events of type <code>NET_RX_SOFTIRQ</code> will be handled by the <code>net_rx_action</code> function.<br />
<br />
So, having followed the code this far, we can say that when a network packet is in the device's receive ring-buffer, the <code>net_rx_action</code> function will be the top-level entry point for packet processing.<br />
<h3 id="net_rx_action">
net_rx_action</h3>
<br />
At this point, it is instructive to look at a function trace of the <code>ksoftirqd</code> process. This trace was generated using <a href="https://www.kernel.org/doc/Documentation/trace/ftrace.txt"><code>ftrace</code></a>, and gives a high-level overview of the functions involved in processing the available packets on the network device.<br />
<pre><code>net_rx_action() {
  ixgbe_poll() {
    ixgbe_clean_tx_irq();
    ixgbe_clean_rx_irq() {
      ixgbe_fetch_rx_buffer() {
        ... // allocate buffer for packet
      } // returns the buffer containing packet data
      ... // housekeeping
      napi_gro_receive() {
        // generic receive offload
        dev_gro_receive() {
          inet_gro_receive() {
            udp4_gro_receive() {
              udp_gro_receive();
            }
          }
        }
        netif_receive_skb_internal() {
          __netif_receive_skb() {
            __netif_receive_skb_core() {
              ...
              ip_rcv() {
                ...
                ip_rcv_finish() {
                  ...
                  ip_local_deliver() {
                    ip_local_deliver_finish() {
                      raw_local_deliver();
                      udp_rcv() {
                        __udp4_lib_rcv() {
                          __udp4_lib_mcast_deliver() {
                            ...
                            // clone skb & deliver
                            flush_stack() {
                              udp_queue_rcv_skb() {
                                ... // data preparation
                                // <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1497">deliver UDP packet</a>
                                // <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1584">check if buffer is full</a>
                                __udp_queue_rcv_skb() {
                                  // <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1453">deliver to socket queue</a>
                                  // <a href="http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.0#L1464">check for delivery error</a>
                                  sock_queue_rcv_skb() {
                                    ...
                                    _raw_spin_lock_irqsave();
                                    // <a href="http://lxr.free-electrons.com/source/include/linux/skbuff.h?v=4.0#L1481">enqueue packet to socket buffer list</a>
                                    _raw_spin_unlock_irqrestore();
                                    // <a href="http://lxr.free-electrons.com/source/net/core/sock.c?v=4.0#L474">wake up listeners</a>
                                    sock_def_readable() {
                                      __wake_up_sync_key() {
                                        _raw_spin_lock_irqsave();
                                        __wake_up_common() {
                                          ep_poll_callback() {
                                            ...
                                            _raw_spin_unlock_irqrestore();
                                          }
                                        }
                                        _raw_spin_unlock_irqrestore();
                                      }
                                      ...</code></pre>
<br />
The <code>softirq</code> handler performs the following steps:<br />
<ol style="list-style-type: decimal;">
<li>Call the driver's poll method (in this case <code>ixgbe_poll</code>)</li>
<li>Perform some <a href="https://lwn.net/Articles/358910/">GRO</a> functions to group packets together into a larger work unit</li>
<li>Call the packet type's <a href="http://lxr.free-electrons.com/source/net/core/dev.c?v=4.0#L1735">handler function</a> (<code>ip_rcv</code>) to walk down the protocol chain</li>
<li>Parse IP headers, perform checksumming then call <code>ip_rcv_finish</code></li>
<li>The buffer's destination function is invoked, in this case <code>udp_rcv</code></li>
<li>Since these are multicast packets, <code>__udp4_lib_mcast_deliver</code> is called</li>
<li>The packet is copied and delivered to each registered UDP socket queue</li>
<li>In <code>udp_queue_rcv_skb</code>, buffers are checked and if space remains, the skb is added to the end of the socket's queue</li>
</ol>
<h2 id="monitoring-back-pressure">
Monitoring back-pressure</h2>
<br />
When attempting to increase the throughput of an application, we need to understand where back-pressure is coming from.<br />
<br />
At this point in the data receive path, we could have throughput issues for two reasons:<br />
<ol style="list-style-type: decimal;">
<li>The <code>softirq</code> handling mechanism cannot dequeue packets from the network device fast enough</li>
<li>The application processing the destination socket is not dequeuing packets from the socket buffer fast enough</li>
</ol>
<h3 id="softirq-back-pressure">
softirq back-pressure</h3>
<br />
For the first case, we need to look at softnet stats (<code>/proc/net/softnet_stat</code>), which are maintained by the network receive stack.<br />
<br />
The softnet stats are defined <a href="http://lxr.free-electrons.com/source/include/linux/netdevice.h?v=4.0#L2444">here</a> as the per-cpu struct <code>softnet_data</code>, which contains a few fields of interest: <code>processed</code>, <code>time_squeeze</code> and <code>dropped</code>.<br />
<code><br /></code>
<code>processed</code> is the total number of packets processed, so is a good indicator of total throughput.<br />
<code><br /></code>
<code>time_squeeze</code> is updated if the <code>ksoftirq</code> process cannot process all packets available in the network device ring-buffer before its cpu-time is up. The process is limited to 2 jiffies of processing time, or a certain amount of 'work'.<br />
<br />
There are a couple of sysctls that control these parameters:<br />
<ol style="list-style-type: decimal;">
<li><code>net.core.netdev_budget</code> - the total amount of processing to be done in one invocation of <code>net_rx_action</code></li>
<li><code>net.core.dev_weight</code> - an indicator to the network driver of how much work to do per invocation of its NAPI poll method</li>
</ol>
The <code>ksoftirq</code> daemon will continue to <a href="http://lxr.free-electrons.com/source/net/core/dev.c?v=4.0#L4655">call <code>napi_poll</code></a> until either the time has run out, or the amount of work reported as completed by the driver exceeds the value of <code>net.core.netdev_budget</code>.<br />
<br />
This behaviour will be driver-specific; in the Intel 10Gb driver, completed work will always be <a href="http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c?v=4.0#L2727">reported as <code>net.core.dev_weight</code></a> if there are still packets to be processed at the end of a poll invocation.<br />
<br />
Given some example numbers, we can determine how many times the napi_poll function will be called for a <code>softIRQ</code> event:<br />
<pre><code>net.core.netdev_budget = 300
net.core.dev_weight = 64
poll_count = (300 / 64) + 1 => 5</code></pre>
<br />
If there are still packets to be processed in the network device ring-buffer, then the <code>time_squeeze</code> counter will be incremented for the given CPU.<br />
<br />
The <code>dropped</code> counter is only used when the <code>ksoftirq</code> process is attempting to add a packet to the backlog queue of another CPU. This can happen if <a href="https://www.kernel.org/doc/Documentation/networking/scaling.txt">Receive Packet Steering</a> is enabled, but since we are only looking at UDP multicast without RPS, I won't go into detail.<br />
<br />
So if our kernel helper thread is unable to move packets from the network device receive queue to the socket's receive buffer fast enough, we can expect the <code>time_squeeze</code> column in <code>/proc/net/softnet_stat</code> to increase over time.<br />
<br />
In order to interpret the file, it is worth looking at <a href="http://lxr.free-electrons.com/source/net/core/net-procfs.c?v=4.0#L146">the implementation</a>. Each row represents a CPU-local instance of the <code>softnet_stat</code> struct (starting with CPU0 at the top), and the third column is the <code>time_squeeze</code> entry.<br />
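As a short illustration, the per-CPU <code>time_squeeze</code> values can be extracted with something like the following sketch, assuming the layout just described (one row per CPU, hex-encoded columns):<br />
<pre><code>import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public final class SoftnetStatsReader
{
    public static void main(final String[] args) throws IOException
    {
        final List<String> rows = Files.readAllLines(Paths.get("/proc/net/softnet_stat"));
        for (int cpu = 0; cpu < rows.size(); cpu++)
        {
            // row N corresponds to CPU N; the third column is time_squeeze
            final String[] columns = rows.get(cpu).trim().split("\\s+");
            System.out.printf("cpu %d time_squeeze=%d%n", cpu, Long.parseLong(columns[2], 16));
        }
    }
}</code></pre>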
<br />
The only tunable that we have at our disposal is the <code>netdev_budget</code> value. Increasing this will allow the <code>ksoftirq</code> process to do more work. The process will still be limited by a total processing time of 2 jiffies though, so there will be an upper ceiling to packet throughput.<br />
<br />
Given the speeds that modern processors are capable of, it is unlikely that the <code>ksoftirq</code> daemon will be unable to keep up with the flow of data, if properly configured.<br />
<br />
In order to give the kernel the best chance to do so, make sure that there is no contention for CPU resources by assigning network interrupts to a number of cores, and then using <a href="http://lxr.free-electrons.com/source/Documentation/cgroups/cpusets.txt?v=4.0#L395">isolcpus</a> to make sure that no other processes will be running on them. This will give the <code>ksoftirq</code> daemon the best chance of copying the inbound packets in a timely manner.<br />
<br />
If the <code>ksoftirq</code> daemon is squeezed frequently enough, or is just unable to get CPU time, then the network device will be forced to drop packets from the wire. In this case, we can use ethtool to find the rx_missed_errors count:<br />
<pre><code>ethtool -S <device-name> | grep rx_missed
rx_missed_errors: 0</code></pre>
<br />
alternatively, the same data can be found by looking at the following file:<br />
<pre><code>/sys/class/net/<device-name>/statistics/rx_missed_errors</code></pre>
<br />
For a full description of each of the statistics reported by <code>ethtool</code>, refer to <a href="http://lxr.free-electrons.com/source/Documentation/ABI/testing/sysfs-class-net-statistics">this document</a>.<br />
<h3 id="application-back-pressure">
Application back-pressure</h3>
<br />
It is far more likely that our user programs will be the bottleneck here, and in order to determine whether that is the case, we need to look at the next stage in the message receipt path. A continuation of this post will explore that area in more detail.<br />
<h2 id="summary">
Summary</h2>
<br />
For UDP-multicast traffic, we have seen in detail the code paths involved in moving an inbound network packet from a network device to a socket's input buffer. This stage can be broadly summarised as follows:<br />
<ol style="list-style-type: decimal;">
<li>On packet receipt, the network device fires a hardware interrupt to the configured CPU</li>
<li>The hardware interrupt handler schedules a <code>softIRQ</code> on the same CPU</li>
<li>The <code>softIRQ</code> handler thread (<code>ksoftirqd</code>) will disable receive interrupts and poll the card for received data</li>
<li>Data will be copied from the network device's receive buffer into the destination socket's input buffer</li>
<li>After a certain amount of work has been done, or no inbound packets remain, the <code>ksoftirq</code> daemon will re-enable receive interrupts and return</li>
</ol>
In order to optimise for throughput, there are a couple of things to try tuning:<br />
<ol style="list-style-type: decimal;">
<li>Increase the amount of work that the <code>ksoftirq</code> daemon is allowed to do (<code>net.core.netdev_budget</code>)</li>
<li>Make sure that the <code>ksoftirq</code> daemon is not contending for CPU resource or being descheduled due to other hardware interrupts</li>
<li>Increase the size of the network device's ring-buffer (<code>ethtool -g <device-name></code>)</li>
</ol>
As with all performance-related experiments, never attempt to tune the system without being able to measure the impact of any changes in isolation.<br />
<br />
First, make sure that you know what the problem is (i.e. <code>rx_missed_errors</code> or <code>time_squeeze</code> is increasing), then add the relevant monitoring. For this particular case, we would want to be able to correlate the application experiencing message loss with a change in the relevant counters, so recording and charting the numbers would be a good start.<br />
<br />
Once this has been done, changes can be made to system configuration to see if an improvement can be made.<br />
<br />
Lastly, any changes to the tuning parameters that I've mentioned MUST be configured via automation. We have sadly lost a fair amount of time to manual changes being made on machines that have not persisted across reboots.<br />
<br />
It is all too easy (and I speak as a repeat offender) to make adjustments, find the optimal configuration, and then move on to something else. Do yourself and your colleagues a favour and automate configuration management!<br />
<br />
<br />
<b><br /></b>
<b>Resources</b><br />
<b><br /></b>
I read lots of really useful documents while learning about this area of the kernel. Here are a few of them:<br />
<br />
<br />
<a href="https://www.coverfire.com/articles/queueing-in-the-linux-network-stack/">https://www.coverfire.com/articles/queueing-in-the-linux-network-stack/</a><br />
<br />
<a href="https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf">https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf</a><br />
<br />
<a href="https://www.kernel.org/doc/Documentation/networking/scaling.txt">https://www.kernel.org/doc/Documentation/networking/scaling.txt</a><br />
<br />
<a href="https://wiki.openwrt.org/doc/networking/praxis">https://wiki.openwrt.org/doc/networking/praxis</a><br />
<br />
</div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-61941821148755195972016-03-30T09:40:00.001+01:002020-09-14T19:30:46.727+01:00Further notes on Hotspot compiler flags<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">Continuing on from my <a href="http://epickrram.blogspot.com/2016/03/observing-jvm-warm-up-effects.html">last post</a>, here we'll be looking at flags used to control the C2 or server compiler of the Hotspot JVM.<br />
<br />
<br />
<h2 id="configuration">
Configuration</h2>
In order to reduce the noise created in the compilation logs, we'll be disabling <i>tiered compilation</i> so that only the server compiler will be used. This is done using the following flag:<br />
<pre><code>-XX:-TieredCompilation</code></pre>
We'll also be producing more detailed compiler logs using:<br />
<pre><code>-XX:+UnlockDiagnosticVMOptions
-XX:+LogCompilation</code></pre>
These flags will cause the JVM to generate a file called <code>hotspot_<pid>.log</code> in the current working directory, containing detailed information on the operation of the compiler.<br />
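For example, a run might be launched like this (the main class is just a placeholder):<br />
<pre><code>java -XX:-TieredCompilation \
     -XX:+UnlockDiagnosticVMOptions \
     -XX:+LogCompilation \
     com.example.MyBenchmark</code></pre>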
<br />
<h2 id="c2-compile-thresholds">
C2 Compile Thresholds</h2>
The server compiler has the same threshold flags as the profile-guided client compiler.<br />
Looking at the flags related to <i>Tier4</i> compilation options (Tier4 is the server compiler), we can see a similar set to those for the Tier3 thresholds described in the last post:<br />
<pre><code>[mark@metal jvm-warmup-talk]$ java -XX:+PrintFlagsFinal 2>&1 | grep Tier4
intx Tier4BackEdgeThreshold = 40000
intx Tier4CompileThreshold = 15000
intx Tier4InvocationThreshold = 5000
intx Tier4LoadFeedback = 3
intx Tier4MinInvocationThreshold = 600</code></pre>
<br />
So we might expect to be able to run the same experiments regarding the triggering of invocation, back-edge and compile thresholds.<br />
However, since we are disabling tiered-compilation to reduce noise, these thresholds will not affect compilation. In order to control the operation of the server compiler when tiered-mode is disabled, we need to use the following flags:<br />
<pre><code>-XX:CompileThreshold
-XX:BackEdgeThreshold</code></pre>
For the server compiler, <code>CompileThreshold</code> acts as the <i>invocation threshold</i>. Setting an artificially low threshold (of <code>-XX:CompileThreshold=200</code>) shows this:<br />
<pre><code>[mark@metal jvm-warmup-talk]$ bash ./scripts/c2-invocation-threshold.sh
LOG: Loop count is: 200
181 75 com.epickrram.t.w.e.t.C2InvocationThresholdMain::
exerciseServerCompileThreshold (6 bytes)</code></pre>
Note that we are no longer seeing information about the <i>tier</i> in the <code>PrintCompilation</code> output. In order to confirm that the server compiler is operating here, we can look at the more detailed <code>LogCompilation</code> output for compile task <i>75</i>:<br />
<pre><code>[mark@metal jvm-warmup-talk]$ grep "id='75'" hotspot_pid31473.log
<task_queued compile_id='75'
method='com/epickrram/talk/warmup/example/threshold/C2InvocationThresholdMain
exerciseServerCompileThreshold (J)J'
bytes='6' count='100' backedge_count='1' iicount='200'
stamp='0.089' comment='count' hot_count='200'/>
<nmethod compile_id='75' compiler='C2' ...
method='com/epickrram/talk/warmup/example/threshold/C2InvocationThresholdMain
exerciseServerCompileThreshold (J)J'
bytes='6' count='100' backedge_count='1' iicount='200' stamp='0.182'/>
<task compile_id='75'
method='com/epickrram/talk/warmup/example/threshold/C2InvocationThresholdMain
exerciseServerCompileThreshold (J)J'
bytes='6' count='100' backedge_count='1' iicount='200' stamp='0.182'></code></pre>
We can see that the compiler being used in this compile task is <i>C2</i> and that the interpreter invocation count <i>iicount</i> is <i>200</i>.<br />
<br />
<h2 id="c2-backedge-threshold">
C2 BackEdge Threshold</h2>
The server compiler's handling of loop back-edge thresholds is again different to the tiered C1 flags. Using this <a href="https://github.com/epickrram/jvm-warmup-talk/blob/master/src/main/java/com/epickrram/talk/warmup/example/threshold/C2LoopBackedgeThresholdMain.java">example program</a> we can see that an on-stack replacement is triggered when the back-edge count is <i>14563</i>.<br />
This is despite the <code>BackEdgeThreshold</code> flag value being set to a lower value.<br />
<pre><code>[mark@metal jvm-warmup-talk]$ bash ./scripts/c2-loop-backedge-threshold.sh 14600
LOG: Loop count is: 14600
133 2 % com.epickrram.t.w.e.t.C2LoopBackedgeThresholdMain::
exerciseServerLoopBackedgeThreshold @ 5 (25 bytes)
</code></pre>
<pre><code>[mark@metal jvm-warmup-talk]$ grep "id='2'" hotspot_pid32675.log
<task_queued compile_id='2' compile_kind='osr'
method='com/epickrram/talk/warmup/example/threshold/C2LoopBackedgeThresholdMain
exerciseServerLoopBackedgeThreshold (JI)J'
bytes='25' count='1' backedge_count='14563' iicount='1' osr_bci='5'
stamp='0.134' comment='backedge_count' hot_count='14563'/>
<nmethod compile_id='2' compile_kind='osr' compiler='C2' ...
method='com/epickrram/talk/warmup/example/threshold/C2LoopBackedgeThresholdMain
exerciseServerLoopBackedgeThreshold (JI)J' bytes='25'
count='10000' backedge_count='5037' iicount='1' stamp='0.136'/>
<task compile_id='2' compile_kind='osr'
method='com/epickrram/talk/warmup/example/threshold/C2LoopBackedgeThresholdMain
exerciseServerLoopBackedgeThreshold (JI)J'
bytes='25' count='10000' backedge_count='5037' iicount='1' osr_bci='5' stamp='0.134'></code></pre>
What is interesting is that the <i>nmethod</i> node contains a <i>count</i> that is equal to the value of <code>-XX:CompileThreshold</code>. If we reduce this threshold to <i>5000</i>, we can see that the on-stack replacement happens sooner:<br />
<pre><code>[mark@metal jvm-warmup-talk]$ bash ./scripts/c2-loop-backedge-threshold.sh
LOG: Loop count is: 10000
126 6 % com.epickrram.talk.warmup.example.threshold.C2LoopBackedgeThresholdMain::
exerciseServerLoopBackedgeThreshold @ 5 (25 bytes)
</code></pre>
<pre><code>[mark@metal jvm-warmup-talk]$ grep "id='6'" hotspot_pid1598.log
<task_queued compile_id='6' compile_kind='osr'
method='com/epickrram/talk/warmup/example/threshold/C2LoopBackedgeThresholdMain
exerciseServerLoopBackedgeThreshold (JI)J'
bytes='25' count='1' backedge_count='7793' iicount='1' osr_bci='5'
stamp='0.126' comment='backedge_count' hot_count='7793'/>
<nmethod compile_id='6' compile_kind='osr' compiler='C2' ...
method='com/epickrram/talk/warmup/example/threshold/C2LoopBackedgeThresholdMain
exerciseServerLoopBackedgeThreshold (JI)J' bytes='25'
count='5000' backedge_count='2659' iicount='1' stamp='0.128'/>
<task compile_id='6' compile_kind='osr'
method='com/epickrram/talk/warmup/example/threshold/C2LoopBackedgeThresholdMain
exerciseServerLoopBackedgeThreshold (JI)J'
bytes='25' count='5000' backedge_count='2659' iicount='1' osr_bci='5' stamp='0.126'></code></pre>
Here, OSR occurs after a back-edge count of <i>7793</i>, while the <i>nmethod</i> node has <i>count='5000'</i>.<br />
From these observations, we can infer that loop back-edge compilation triggers are related to the <code>CompileThreshold</code> flag, and that if we wish to control when the server compiler kicks in, we need to alter only the <code>CompileThreshold</code> flag.<br />
<br />
<h2 id="inlining">
Inlining</h2>
When a method is converted to a native method, the compiler has the option to perform a further optimisation: <i>inlining</i>.<br />
Inlining callee methods reduces method-dispatch overhead, and can give the compiler a broader scope for further optimisation, e.g. dead-code elimination or escape analysis.<br />
Inlining decisions are based on the size of the method to be inlined. There are two thresholds that we need be concerned with:<br />
<pre><code>-XX:MaxInlineSize
-XX:FreqInlineSize</code></pre>
These thresholds are specified in byte-codes. Let's start with an example of a method that is small enough for inlining:<br />
<pre><code>private static long shouldInline(final long input)
{
    return (input * System.nanoTime()) + 37L;
}</code></pre>
Using <code>javap</code> to inspect the byte-code of this method, we can see that it is only 10 byte-codes in length:<br />
<pre><code>private static long shouldInline(long);
    descriptor: (J)J
    flags: ACC_PRIVATE, ACC_STATIC
    Code:
      stack=4, locals=2, args_size=1
         0: lload_0
         1: invokestatic  #18    // Method java/lang/System.nanoTime:()J
         4: lmul
         5: ldc2_w        #19    // long 37l
         8: ladd
         9: lreturn</code></pre>
Running the example and adding the <code>-XX:+PrintInlining</code> flag will cause the compiler to interleave information about inlining decisions into the compilation output.<br />
<pre><code>[mark@metal jvm-warmup-talk]$ bash ./scripts/small-method-inlining-threshold.sh 250
72 19 3 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
inlineCandidateCaller (5 bytes)
@ 1 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
shouldInline (10 bytes)
@ 1 java.lang.System::nanoTime (0 bytes) intrinsic
72 20 3 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
shouldInline (10 bytes)
@ 1 java.lang.System::nanoTime (0 bytes) intrinsic</code></pre>
In this log excerpt, we can see that after the usual output of <code>PrintCompilation</code>, we also get information about methods being inlined.<br />
If we look at the byte-code of the caller method:<br />
<pre><code>private static long inlineCandidateCaller(long);
    descriptor: (J)J
    flags: ACC_PRIVATE, ACC_STATIC
    Code:
      stack=2, locals=2, args_size=1
         0: lload_0
         1: invokestatic  #17    // Method shouldInline:(J)J
         4: lreturn</code></pre>
Here we can see that the invocation of the <i>shouldInline</i> method is at byte-code <i>1</i>, so the output of <code>PrintInlining</code> is referring to the call-site that is inlined (the <i>@ 1</i> part of the log entry).<br />
If we reduce the <code>MaxInlineSize</code> parameter to be less than 10 byte-codes using <code>-XX:MaxInlineSize=9</code>, then inlining will fail:<br />
<pre><code>[mark@metal jvm-warmup-talk]$ bash ./scripts/reduced-inlining-threshold.sh 250
105 19 3 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
inlineCandidateCaller (5 bytes)
@ 1 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
shouldInline (10 bytes) callee is too large</code></pre>
Note the message <i>callee is too large</i> - this is something to look out for if you expect methods in hot code-paths to be inlined; it means that the compiler did not inline this method due to its size.<br />
Now, the default value of <code>MaxInlineSize</code> is 20 byte-codes, which is not a lot of code. The compilation process is a trade-off between achieving good performance and the space overhead of compiled code, among other things.<br />
<br />
The compiler <i>will</i> inline your 21 byte-code method, <i>if it is called often enough</i>. For methods called frequently enough, the size threshold that determines inlining is <code>FreqInlineSize</code>.<br />
Let's re-run our experiment, and increase the number of invocations:<br />
<pre><code>[mark@metal jvm-warmup-talk]$ bash ./scripts/reduced-inlining-threshold.sh 25000
79 19 3 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
inlineCandidateCaller (5 bytes)
@ 1 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
shouldInline (10 bytes) callee is too large
...
80 22 4 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
shouldInline (10 bytes)
@ 1 java.lang.System::nanoTime (0 bytes) (intrinsic)
@ 1 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
shouldInline (10 bytes) inline (hot)</code></pre>
First, we see the same message declaring the callee method to be too large, but later on in the compilation process, the callee method is inlined. This corresponds with the message <i>inline (hot)</i>, meaning that the runtime has decided this method is called frequently enough to inline.<br />
<br />
If we reduce the <code>FreqInlineSize</code> to be less than 10 byte-codes using <code>-XX:FreqInlineSize=9</code>, then inlining will once again fail:<br />
<pre><code>[mark@metal jvm-warmup-talk]$ bash ./scripts/reduced-freq-inlining-threshold.sh 25000
77 22 4 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
shouldInline (10 bytes)
@ 1 java.lang.System::nanoTime (0 bytes) (intrinsic)
@ 1 com.epickrram.talk.warmup.example.threshold.InliningThresholdMain::
shouldInline (10 bytes) hot method too big</code></pre>
This failure is denoted by the message <i>hot method too big</i>.<br />
<br />
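Putting the two flags together, the size-based part of the inlining decision can be sketched as follows. This is a simplification - the real compiler also weighs factors such as call-site profiles, intrinsics and inlining depth - but it captures the behaviour observed in the logs above:<br />
<pre><code>// simplified sketch of the size-based inlining decision; only the
// -XX:MaxInlineSize and -XX:FreqInlineSize thresholds are modelled here
static boolean shouldInlineCallee(final int calleeBytecodeSize,
                                  final boolean calleeIsHot,
                                  final int maxInlineSize,  // -XX:MaxInlineSize
                                  final int freqInlineSize) // -XX:FreqInlineSize
{
    if (calleeIsHot)
    {
        // frequently-called callees get the larger budget: "inline (hot)"
        // if within FreqInlineSize, otherwise "hot method too big"
        return calleeBytecodeSize <= freqInlineSize;
    }
    // infrequently-called callees must fit within MaxInlineSize,
    // otherwise the log reports "callee is too large"
    return calleeBytecodeSize <= maxInlineSize;
}</code></pre>
<br />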
<h2 id="summary">
Summary</h2>
We have seen that further to the Tier3 client compiler thresholds, there are Tier4 thresholds for the longer-running C2 compiler. When tiered-compilation is disabled, other threshold flags come into play.<br />
<br />
Inlining decisions are based on the size of the callee method, and the frequency with which it is called. The Hotspot compiler will attempt to aggressively inline hot methods, so it is important to understand whether the design of our code is hindering the ability of the compiler to perform available optimisations.<br />
<br />
In my next post, I'll be looking at some of the tooling available to help analyse and understand the operation of the JVM Hotspot compiler.<br />
<br />
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
</div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-50751303572610863112016-03-05T15:45:00.002+00:002020-09-14T19:31:02.347+01:00Observing JVM warm-up effects<div dir="ltr" style="text-align: left;" trbidi="on">In this post, we will explore some of the various flags that can affect the operation of the JVM's JIT compiler.<br />
Anything demonstrated in this post should come with a public health warning - these options are explored for reference only, and modifying them without being able to observe and reason about their effects should be avoided.<br />
You have been warned.<br />
<h2 id="the-two-compilers">
</h2>
<h2 id="the-two-compilers">
The two compilers</h2>
<br />
The JVM that ships with OpenJDK contains two compiler back-ends:<br />
<ol type="1">
<li>C1, also known as 'client'</li>
<li>C2, also known as 'server'</li>
</ol>
The C1 compiler has a number of different modes, and will alter its response to a compilation request given a number of system factors, including, but not limited to, the current workload of the C1 & C2 compiler thread pool.<br />
<br />
Given these different modes, the JDK refers to different <i>tiers</i>, which can be broken down as follows:<br />
<ol type="1">
<li>Tier1 - client compiler with no profiling information</li>
<li>Tier2 - client compiler with basic counters</li>
<li>Tier3 - client compiler with profiling information</li>
<li>Tier4 - server compiler</li>
</ol>
From this point on, when referring to the C1 compiler, I'm talking about Tier3.<br />
<h2 id="thresholds">
</h2>
<h2 id="thresholds">
Thresholds</h2>
<br />
At a very high level, the JVM bytecode interpreter uses method invocation and loop back-edge counting in order to decide when a method should be compiled.<br />
Since it would be wasteful and expensive to compile methods that are only ever called a small number of times, the interpreter will wait until a method's invocation count exceeds a particular threshold before requesting compilation.<br />
<br />
Thresholds for various levels of compilation can be modified using flags passed to the JVM on the command line.<br />
<br />
The first such threshold that is likely to be triggered is the C1 Compilation Threshold.<br />
<h2 id="flags-side-note">
</h2>
<h2 id="flags-side-note">
Flags side-note</h2>
<br />
To view all the available flags that can be passed to the JVM, run the following command:<br />
<br />
<code>java -XX:+PrintFlagsFinal</code><br />
<br />
Running this on my local install of <code>JDK 1.8.0_60-b27</code> shows that there are 772 flags available:<br />
<pre><code>[pricem@metal ~]$ java -XX:+PrintFlagsFinal 2>&1 | wc -l
772</code></pre>
For the truly intrepid, there are even more tunables available if we unlock diagnostic options (more on this later):<br />
<pre><code>[pricem@metal ~]$ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal 2>&1 | wc -l
873</code></pre>
<h2 id="example-code">
</h2>
<h2 id="example-code">
Example code</h2>
<br />
Code samples from this post are available on <a href="https://github.com/epickrram/jvm-warmup-talk">github</a>.<br />
Clone the repository, then build with:<br />
<br />
<code>./gradlew clean jar</code><br />
<h2 id="c1-compilation-invocation-threshold">
</h2>
<h2 id="c1-compilation-invocation-threshold">
C1 Compilation Invocation Threshold</h2>
<br />
The first trigger that a method is likely to hit is the C1 compilation threshold. This threshold is specified by the flag:<br />
<pre><code>[pricem@metal ~]$ java -XX:+PrintFlagsFinal 2>&1 | grep Tier3InvocationThreshold
intx Tier3InvocationThreshold = 200</code></pre>
This setting informs the interpreter that it should emit a compile task to the C1 compiler when an interpreted method is executed <code>200</code> times.<br />
<br />
Observing this should be simple - all we need to do is <a href="https://github.com/epickrram/jvm-warmup-talk/blob/master/src/main/java/com/epickrram/talk/warmup/example/threshold/C1InvocationThresholdMain.java#L43">write a method</a>, call it <code>200</code> times and watch the compiler doing its work.<br />
Enabling logging of compiler operation is a simple matter of supplying another JVM argument on start-up:<br />
<br />
<code>-XX:+PrintCompilation</code><br />
<br />
<br />
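For illustration, a minimal driver in the spirit of the linked example might look like the following sketch (class and method names here are hypothetical, not the exact code from the repository):<br />
<pre><code>public final class C1InvocationThresholdExample
{
    private static long accumulator = 0L;

    public static void main(final String[] args)
    {
        final int loopCount = Integer.parseInt(args[0]);
        for (int i = 0; i < loopCount; i++)
        {
            // each call increments the method's invocation counter
            accumulator += exerciseInvocationThreshold(i);
        }
        System.out.println("LOG: Loop count is: " + loopCount);
        // print the result so the work cannot be eliminated as dead code
        System.out.println("LOG: Accumulator is: " + accumulator);
    }

    private static long exerciseInvocationThreshold(final long input)
    {
        return input * 17L;
    }
}</code></pre>
<br />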
Without further ado, let us try to observe our method being compiled after <code>200</code> invocations. The script being called will print log statements from the program, along with any compilation output relevant to this project's classes.<br />
<br />
We would expect to see a message saying that the <a href="https://github.com/epickrram/jvm-warmup-talk/blob/master/src/main/java/com/epickrram/talk/warmup/example/threshold/C1InvocationThresholdMain.java#L43">exerciseTier3InvocationThreshold</a> method is compiled.<br />
<pre><code>[pricem@metal jvm-warmup-talk]$ bash ./scripts/c1-invocation-threshold.sh
LOG: Loop count is: 200</code></pre>
No compilation message. I'll shortcut a bit of investigation here and point out that the Tier3 compile threshold seems to work on power-of-two boundaries:<br />
<pre><code>[pricem@metal jvm-warmup-talk]$ bash ./scripts/c1-invocation-threshold.sh 255
LOG: Loop count is: 255</code></pre>
Still no compilation; let's perform one more invocation...<br />
<pre><code>[pricem@metal jvm-warmup-talk]$ bash ./scripts/c1-invocation-threshold.sh 256
LOG: Loop count is: 256
132 47 3 com.epickrram.t.w.e.t.C1InvocationThresholdMain::exerciseTier3InvocationThreshold (6 bytes)</code></pre>
Finally our method is compiled. This pattern repeats for larger invocation thresholds:<br />
<pre><code>[pricem@metal jvm-warmup-talk]$ java -cp build/libs/jvm-warmup-talk-0.0.1.jar \
-XX:+PrintCompilation \
-XX:Tier3InvocationThreshold=1000 \
com.epickrram.t.w.e.t.C1InvocationThresholdMain 1023 | grep -E "(LOG|epickrram)"
LOG: Loop count is: 1023</code></pre>
No compilation at 1023 invocations.<br />
<pre><code>[pricem@metal jvm-warmup-talk]$ java -cp build/libs/jvm-warmup-talk-0.0.1.jar \
-XX:+PrintCompilation \
-XX:Tier3InvocationThreshold=1000 \
com.epickrram.t.w.e.t.C1InvocationThresholdMain 1024 | grep -E "(LOG|epickrram)"
LOG: Loop count is: 1024
128 18 3 com.epickrram.t.w.e.t.C1InvocationThresholdMain::exerciseTier3InvocationThreshold (6 bytes)</code></pre>
1024 invocations triggers compilation.<br />
<h2 id="c1-loop-back-edge-threshold">
</h2>
<h2 id="c1-loop-back-edge-threshold">
C1 Loop Back-edge Threshold</h2>
<br />
As mentioned earlier, the JVM bytecode interpreter will also monitor loop counts within a method. This mechanism allows the runtime to spot that a method is <i>hot</i> despite it not being invoked many times.<br />
For example, if we have a method that contains a loop executing many thousands of times, we would want that method to be compiled, even if it was only invoked relatively infrequently.<br />
<br />
The relevant flag for this setting is:<br />
<br />
<code>Tier3BackEdgeThreshold</code><br />
<pre><code>[pricem@metal jvm-warmup-talk]$ java -XX:+PrintFlagsFinal 2>&1 | grep Tier3BackEdgeThreshold
intx Tier3BackEdgeThreshold = 60000</code></pre>
Using <a href="https://github.com/epickrram/jvm-warmup-talk/blob/master/src/main/java/com/epickrram/talk/warmup/example/threshold/C1LoopBackedgeThresholdMain.java">another example program</a>, we can observe the interpreter emitting a compile task once the loop count within a method reaches the specified threshold:<br />
<pre><code>[pricem@metal jvm-warmup-talk]$ bash scripts/c1-loop-backedge-threshold.sh 60416
LOG: Loop count is: 60416
137 48 % 3 com.epickrram.t.w.e.t.C1LoopBackedgeThresholdMain::exerciseTier3LoopBackedgeThreshold @ 5 (25 bytes)</code></pre>
<br />
Once again, there seems to be a slight difference in the required number of loop iterations and the specified threshold. In this case, we need to execute the loop <code>60416</code> times in order for the interpreter to recognise this method as <i>hot</i>. <code>60416</code> just happens to be <code>1024 * 59</code>, it's almost as though there's a pattern here...<br />
<h2 id="printcompilation-format">
</h2>
<h2 id="printcompilation-format">
PrintCompilation format</h2>
<br />
In order to understand what is happening here, we need to take a brief foray into understanding the output from the <code>PrintCompilation</code> command. Rather than draw my own fancy graphic, I'm going to reference a slide from Doug Hawkins' excellent talk <a href="http://www.slideshare.net/dougqh/jvm-mechanics-when-does-the"><i>JVM Mechanics</i></a>.<br />
<br />
<br />
<br />
<figure>
<img alt="PrintCompilation log format" src="https://raw.githubusercontent.com/epickrram/jvm-warmup-talk/master/img/print-compilation-format.png" title="PrintCompilation format" /><figcaption style="text-align: center;"><i>PrintCompilation log format</i></figcaption>
</figure>
<br />
<br />
<br />
Using this reference, we can break down the information in the log output from our test program:<br />
<br />
<pre><code> 137 48 % 3 com.epickrram.t.w.e.t.C1LoopBackedgeThresholdMain::exerciseTier3LoopBackedgeThreshold @ 5 (25 bytes)</code></pre>
<ol type="1">
<li>This compile happened 137 milliseconds after JVM startup</li>
<li>Compilation ID was 48</li>
<li>This was an <i>on-stack replacement</i> (more on this later)</li>
<li>This compilation happened at Tier3 (C1 profile-guided)</li>
<li>The OSR loop bytecode index is <i>5</i></li>
<li>The compiled method was 25 bytecodes</li>
</ol>
<h2 id="verifying-the-detail">
</h2>
<h2 id="verifying-the-detail">
Verifying the detail</h2>
<br />
Let's take a quick look at what these bytecode references are. If we disassemble the method using <code>javap</code>:<br />
<pre><code>[pricem@metal jvm-warmup-talk]$ javap -cp build/libs/jvm-warmup-talk-0.0.1.jar -c -p \
com.epickrram.t.w.e.t.C1LoopBackedgeThresholdMain</code></pre>
We can see the disassembled bytecode of the method in question:<br />
<pre><code> private static long exerciseTier3LoopBackedgeThreshold(long, int);
Code:
0: lload_0
1: lstore_3
2: iconst_0
3: istore 5
5: iload 5
7: iload_2
8: if_icmpge 23
11: ldc2_w #22 // long 17l
14: lload_3
15: lmul
16: lstore_3
17: iinc 5, 1
20: goto 5
23: lload_3
24: lreturn</code></pre>
<br />
This tells us that the method contains 25 bytecodes, so that explains one number. We can also see the <code>goto</code> instruction at bytecode index <code>20</code>, and its target bytecode index <code>5</code>.<br />
<br />
Comparing this with the method source:<br />
<pre><code>private static long exerciseTier3LoopBackedgeThreshold(final long input, final int loopCount)
{
long value = input;
for(int i = 0; i < loopCount; i++)
{
value = 17L * value;
}
return value;
}</code></pre>
<br />
With a little bit of reasoning, we can figure out that bytecode <code>5</code> is the point at which we load the loop counter variable <i>i</i> in order to do the comparison to the <i>loopCount</i> parameter.<br />
<br />
This bytecode index, then, is at the start of the loop, and would be an ideal place at which to jump into the newly compiled method.<br />
<h2 id="on-stack-replacement">
</h2>
<h2 id="on-stack-replacement">
On-Stack Replacement</h2>
<br />
On-Stack replacement is a mechanism that allows the interpreter to take advantage of compiled code, even when it is still executing a loop for that method in interpreted mode.<br />
If we imagine a hypothetical workflow for our JVM to be:<br />
<ol type="1">
<li>Start executing a method <i>loopyMethod</i> in the interpreter</li>
<li>Within <i>loopyMethod</i>, we execute an expensive loop body 1,000,000 times</li>
<li>The interpreter will see that the loop count has exceeded the <code>Tier3BackEdgeThreshold</code> setting</li>
<li>The interpreter will request compilation of <i>loopyMethod</i></li>
<li>The method body is expensive and slow, and we want to start using the compiled version immediately. Without OSR, the interpreter would have to complete the 1,000,000 iterations of slow interpreted code, dispatching to the compiled method on the next call to <i>loopyMethod()</i></li>
<li>With OSR, the interpreter can dispatch to the compiled frame at the start of the next loop iteration</li>
<li>Execution will now continue in the compiled method body</li>
</ol>
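To make the workflow concrete, a hypothetical <i>loopyMethod</i> might look like the following sketch:<br />
<pre><code>// hypothetical sketch: invoked only once, but the loop back-edge count
// quickly exceeds Tier3BackEdgeThreshold, so the interpreter requests
// compilation and jumps into the compiled frame via OSR mid-loop
private static long loopyMethod(final long seed)
{
    long value = seed;
    for (int i = 0; i < 1_000_000; i++)
    {
        value = (value * 31L) + 7L; // stand-in for an expensive loop body
    }
    return value;
}</code></pre>
<br />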
<h2 id="c1-compilation-threshold">
</h2>
<h2 id="c1-compilation-threshold">
C1 Compilation Threshold</h2>
<br />
There is one other threshold that we need to concern ourselves with, and that is <code>Tier3CompileThreshold</code>. This particular setting is used to catch a method containing a moderately hot loop, whose back-edge count alone is not high enough to trigger on-stack replacement.<br />
<br />
The heuristic for determining whether a method should be compiled, <a href="http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2010-November/004239.html">described here</a>, looks something like this:<br />
<pre><code>boolean shouldCompileMethod(final int invocationCount, final int backEdgeCount)
{
    if (invocationCount > Tier3InvocationThreshold)
    {
        return true;
    }
    if (invocationCount > Tier3MinInvocationThreshold &&
        invocationCount + backEdgeCount > Tier3CompileThreshold)
    {
        return true;
    }
    return false;
}</code></pre>
<br />
Given this formula, we should be able to create a scenario for triggering this threshold. In order to exercise the trigger, let's look at the defaults on my version of java:<br />
<pre><code>intx Tier3CompileThreshold = 2000
intx Tier3InvocationThreshold = 200
intx Tier3MinInvocationThreshold = 100</code></pre>
We need to make sure that the method is called fewer than <code>Tier3InvocationThreshold</code> times and greater than <code>Tier3MinInvocationThreshold</code> times, while increasing the back-edge count to greater than <code>Tier3CompileThreshold</code>. On the next invocation of the method, compilation should occur.<br />
<br />
So, if we invoke a method 100 times, and it generates a loop back-edge count of 21 per invocation, then we should exceed the <code>Tier3CompileThreshold</code>:<br />
<pre><code>100 + (100 * 21) == 2200 > Tier3CompileThreshold</code></pre>
On the 101st invocation, the interpreter should trigger a compilation.<br />
<br />
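A method shaped for this scenario might look like the following sketch (hypothetical; the linked test program below does the equivalent):<br />
<pre><code>// hypothetical sketch: each invocation adds 1 to the invocation counter
// and 21 to the back-edge counter
private static long exerciseTier3CompilationThreshold(final long input)
{
    long value = input;
    for (int i = 0; i < 21; i++) // 21 loop back-edges per invocation
    {
        value = (value * 17L) + 1L;
    }
    return value;
}</code></pre>
<br />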
Of course, given that so far each threshold seems to have had some power-of-two-based wiggle room as far as the interpreter is concerned, this magic formula doesn't work out exactly. In fact, in this example, the method must be executed 147 times in order for compilation to occur!<br />
<br />
Executing <a href="https://github.com/epickrram/jvm-warmup-talk/blob/master/src/main/java/com/epickrram/talk/warmup/example/threshold/C1CompilationThresholdMain.java">this test program</a> yields the following output:<br />
<pre><code>[pricem@metal jvm-warmup-talk]$ bash ./scripts/c1-compilation-threshold.sh
LOG: Loop count is: 21
LOG: Finished invocation: 1, back-edge count should be 21
LOG: Finished invocation: 2, back-edge count should be 42
LOG: Finished invocation: 3, back-edge count should be 63
LOG: Finished invocation: 4, back-edge count should be 84
...
LOG: Finished invocation: 145, back-edge count should be 3045
LOG: Finished invocation: 146, back-edge count should be 3066
LOG: Pausing for a few seconds to make sure compile hasn't been triggered yet...
LOG: About to perform invocation 147
5151 176 3 com.epickrram.t.w.e.t.C1CompilationThresholdMain::exerciseTier3CompilationThreshold (27 bytes)</code></pre>
<br />
<br />
It can be seen that in this scenario, we have not triggered the invocation threshold (i.e. invocation count < 200), nor have we triggered the back-edge threshold. The interpreter has correctly identified the method as being worthy of compilation, so the runtime is able to provide an optimised version for future invocations.<br />
<h2 id="summary">
<br /></h2>
<h2 id="summary">
Summary</h2>
<br />
We have seen that for the C1 compiler when operating in tiered mode, there are 3 flags that control when a method is considered for compilation.<br />
<br />
In my next post, I'll be looking at the corresponding flags for the C2 compiler, and how they are affected by tiered and non-tiered mode.
<br /><br /><br /><br />
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
</div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-43146636683546911492016-02-10T10:08:00.002+00:002020-09-14T19:31:24.495+01:00Qcon London Talks<div dir="ltr" style="text-align: left;" trbidi="on">
LMAX Exchange developers are giving two talks at QCon London this year.<br />
<br />
<br />
<br />
Sam Adams, our Head of Software, will be discussing the awesome LMAX Continuous Delivery process in his talk "<a href="https://qconlondon.com/presentation/cd-lmax-testing-production-and-back-again">CD at LMAX: Testing into production and back again</a>".<br />
<br />
<br />
<br />
I will be talking about JVM warm-up strategies and how to inspect the machinations of the Hotspot compiler in "<a href="https://qconlondon.com/presentation/hot-code-faster-code-addressing-jvm-warm">Hot code is faster code - addressing JVM warm-up</a>".<br />
<br />
<br />
<br />
If you're at the conference, please come and say hello. We're always happy to talk about the work and technology at LMAX Exchange.<br />
<br />
<br />
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<br />
<br /></div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-51683835734892154182016-01-27T16:29:00.001+00:002020-09-14T19:31:57.897+01:00Timing is everything<div dir="ltr" style="text-align: left;" trbidi="on">Monitoring of various metrics is a large part of ensuring that our systems are behaving in the way that we expect. For low-latency systems in particular, we need to be able to develop an understanding of where in the system any latency spikes are occurring.<br />
<br />
Ideally, we want to be able to detect and diagnose a problem before it's noticed by any of our customers. In order to do this, at <a href="https://www.lmax.com/">LMAX Exchange</a> we have developed extensive tracing capabilities that allow us to inspect request latency at many different points in our infrastructure.<br />
<br />
This (often) helps us to narrow down the source of a problem to something along the lines of <i>"a cache battery has died on host A, so disk writes are causing latency spikes"</i>. Sometimes of course, there's no such easy answer, and we need to take retrospective action to improve our monitoring when we find a new and interesting problem.<br />
<br />
One such problem occurred in our production environment a few months ago. This was definitely one of the cases where we couldn't easily identify the root cause of the issue. Our only symptom was that order requests were being processed much slower than expected.<br />
<br />
Since we <a href="http://epickrram.blogspot.co.uk/2015/05/improving-journalling-latency.html">rebuilt</a> and <a href="http://epickrram.blogspot.co.uk/2015/09/reducing-system-jitter.html">tuned</a> our exchange, our latencies have been so good that in this case we were able to detect the problem before any of our users complained about a deterioration in performance. Even though we could see that there was a problem, overall latency was still within our SLAs.<br />
<br />
<br />
<h3 style="text-align: left;">
Measuring, not sampling</h3>
<br />
<br />
In order to explain the symptom we observed, I'll describe in more detail the in-application monitoring that we use to measure latencies within the core of our system.<br />
<br />
For a more comprehensive view of our overall architecture, please refer to <a href="http://epickrram.blogspot.co.uk/2015/05/improving-journalling-latency.html">previous posts</a>.<br />
<br />
The image below depicts the data-flow through the matching engine at the core of our exchange. Moving from left-to-right, the steps involved in message processing are:<br />
<br />
<br />
<ol style="text-align: left;">
<li>Message arrives from the network</li>
<li>Message is copied into an in-memory ring-buffer (a <a href="https://github.com/LMAX-Exchange/disruptor">Disruptor</a> instance)</li>
<li>In parallel, the message is replicated to a secondary, and journalled to disk</li>
<li>Once replicated, the message is handled by the application thread, executing business logic</li>
<li>The business logic will publish zero-to-many response messages into publisher ring-buffers</li>
<li>A consumer thread for each publisher ring-buffer will write outbound messages to the network</li>
</ol>
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/blog-images/376661bf9567c2b649995255cbd6cceb808ea129/2016_01/LMAX-arch-MonitoringDetail-publisher.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="482" src="https://raw.githubusercontent.com/epickrram/blog-images/376661bf9567c2b649995255cbd6cceb808ea129/2016_01/LMAX-arch-MonitoringDetail-publisher.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: center;">
<i><br /></i></div>
<div style="text-align: center;">
<i>Message processing in the matching engine</i></div>
<br />
<br />
While performing investigation that eventually led to a decent increase in <a href="http://epickrram.blogspot.co.uk/2015/07/seek-write-vs-pwrite.html">journalling performance</a>, we found that it was extremely useful to instrument our code and monitor exactly how long it was taking to process a message at various points within our matching engine.<br />
<br />
This led to a technique whereby we trace the execution time of every message through the service, and report out a detailed breakdown if the total processing time exceeds some reporting threshold.<br />
<br />
The image below shows where in the message flow we record nanosecond-precision timestamps during message processing. For more detail on just how precise those timestamps are likely to be, please refer to the mighty Shipilev's '<a href="http://shipilev.net/blog/2014/nanotrusting-nanotime/">Nanotrusting the Nanotime</a>'.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/blog-images/master/2016_01/LMAX-arch-MonitoringDetail-durations.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2016_01/LMAX-arch-MonitoringDetail-durations.png" width="640" /></a></div>
<br />
<div style="text-align: center;">
<i>Recording 'infrastructure' and 'logic' processing times</i></div>
<br />
<br />
<br />
For the purposes of our monitoring, what we care about is that <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span> calls are cheap. Due to this property, we can perform very low-overhead monitoring all the time, and only report out detail when something interesting happens, such as a threshold being triggered.<br />
<br />
We classify two types of processing duration within the matching engine:<br />
<br />
- '<i>infrastructure</i>' duration, which is the time taken between message receipt from the network, and the business logic beginning to execute. This includes the time taken to journal and replicate synchronously to the secondary. As such, it is a good indicator of problems external to the application.<br />
<br />
- '<i>logic</i>' duration, which is the time taken from the start of logic execution until processing of that particular message is complete.<br />
<br />
Within the logic duration, we also have a breakdown of the time taken between each published outbound message, which we will refer to as '<i>inter-publish</i>' latency. Consider some example order-matching logic:<br />
<br />
<br />
<br />
<script src="https://gist.github.com/epickrram/bca9261d279a09dbb447.js"></script>
<br />
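For readers viewing this without the embedded gist, a representative sketch of the shape of that logic (hypothetical names and types, inferred from the discussion below) is:<br />
<pre><code>// hypothetical sketch - not the actual LMAX matching logic;
// Order, MatchResult, engine and the reporters are assumed types
private void onOrder(final Order order)
{
    final MatchResult result = engine.match(order);

    transactionReporter.onTransaction(result.transaction()); // first publish
    final long afterFirstPublish = System.nanoTime();        // trace-point

    engine.updateAggregateOrderStatistics(order);            // inter-publish work

    tradeReporter.onTrade(result.trade());                   // second publish
    final long afterSecondPublish = System.nanoTime();       // trace-point

    // 'inter-publish' latency = afterSecondPublish - afterFirstPublish
}</code></pre>
<br />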
In this example, <span style="font-family: "courier new" , "courier" , monospace;">transactionReporter</span> and <span style="font-family: "courier new" , "courier" , monospace;">tradeReporter</span> are both proxies to a publisher ring-buffer. Given our instrumented trace-points in the code, we can determine how long it took the intervening methods to execute (the 'inter-publish' latency).<br />
<br />
Now if we consider the scenario where due to a code or data change, the time taken to execute <span style="font-family: "courier new" , "courier" , monospace;">engine.updateAggregateOrderStatistics()</span> has dramatically increased (enough to trip the reporting threshold), we will have the necessary information to pinpoint this function as the culprit.<br />
<br />
This monitoring capability has proved to be extremely useful in tracking down a number of performance issues, some of which have been very difficult to replicate in performance environments.<br />
<br />
<br />
<h3 style="text-align: left;">
<span style="font-family: inherit;">The symptom</span></h3>
<br />
<br />
In addition to reporting out detailed information when a threshold is triggered, we also utilise the recorded data and report some aggregated metrics every second. This can be a valuable tool in detecting modal changes in behaviour over time, for example after a new code release.<br />
<br />
The first change in behaviour we noticed was the mean average 'logic' time per-second. This is calculated by summing total logic time within a second and dividing by the number of messages processed. Now, we all know that relying on averages for anything is evil, but they do have a use when comparing overall behaviour.<br />
<br />
Below is a screen-grab of a chart comparing our average logic processing time (blue) to the same metric from the previous week (green). In this chart, lower is better, so we could see that there had been a clear regression in performance at some point in the last week.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/blog-images/master/2016_01/mean-logic-processing-time.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="376" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2016_01/mean-logic-processing-time.png" width="640" /></a></div>
<br />
<div style="text-align: center;">
<i>Comparing the change in average logic time</i></div>
<br />
<br />
This regression was not evident in any of our other production environments, nor possible to replicate in performance environments.<br />
<br />
Given that the average logic time was increased, it followed that the per-request logic time had also increased. Careful comparison of data from different environments running the same code release showed us that our 'inter-publish' latency was up to <b>15x</b> higher in the affected environment.<br />
<br />
After making this comparison, we were fairly sure that this problem was environmental, as the inter-publish latencies are recorded during the execution of a single thread, without locks or I/O of any kind. Since the execution is all in-memory, we were unable to come up with a scenario in which the code would run slower in one environment compared to another, given that the systems were running on identical hardware.<br />
<br />
One data item we did have was the fact that we had performed a code release over the previous weekend. This pointed towards a code change of some sort that did not affect all instances of the matching engine equally.<br />
<br />
<br />
<h3 style="text-align: left;">
A complicating factor</h3>
<br />
<br />
Looking at the code changes released at the weekend, we could see that one of the most major changes was a different <a href="https://github.com/LMAX-Exchange/disruptor/blob/master/src/main/java/com/lmax/disruptor/WaitStrategy.java"><span style="font-family: "courier new" , "courier" , monospace;">WaitStrategy</span></a> that we were using in our main application Disruptor instances. We had deployed a hybrid implementation that merged the behaviour of both the <a href="https://github.com/LMAX-Exchange/disruptor/blob/master/src/main/java/com/lmax/disruptor/BusySpinWaitStrategy.java"><span style="font-family: "courier new" , "courier" , monospace;">BusySpinWaitStrategy</span></a> (for lowest inter-thread latency) and the <a href="https://github.com/LMAX-Exchange/disruptor/blob/master/src/main/java/com/lmax/disruptor/TimeoutBlockingWaitStrategy.java"><span style="font-family: "courier new" , "courier" , monospace;">TimeoutBlockingWaitStrategy</span></a> (for time-out events when the ring-buffer is empty).<br />
<br />
The implementation of our <span style="font-family: "courier new" , "courier" , monospace;">TimeoutBusySpinWaitStrategy</span> involved some busy-spinning for a set number of loops, followed by a poll of <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span><span style="font-family: inherit;"> to check whether the time-out period has elapsed. This change meant that we were making 1000s of extra calls to </span><span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span><span style="font-family: inherit;"> every second.</span><br />
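A sketch of such a hybrid strategy, inferred from the description above (this is not the actual LMAX implementation), might look like this:<br />
<pre><code>import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// hypothetical sketch: busy-spin for a fixed number of tries for the lowest
// wake-up latency, then fall back to polling System.nanoTime() to honour
// the time-out when the ring-buffer is empty
final class TimeoutBusySpinWait
{
    private static final int SPIN_TRIES = 100_000;

    static void waitUntilAvailable(final AtomicBoolean available, final long timeoutNanos)
        throws TimeoutException
    {
        for (int spins = SPIN_TRIES; spins != 0 && !available.get(); spins--)
        {
            // hot phase: pure busy-spin, no clock reads
        }
        final long deadline = System.nanoTime() + timeoutNanos;
        while (!available.get())
        {
            // cool phase: the source of the extra System.nanoTime() calls
            if (System.nanoTime() >= deadline)
            {
                throw new TimeoutException();
            }
        }
    }
}</code></pre>
<br />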
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">This in itself seemed fairly innocuous - the change had passed through our performance environment without any hint of regression, but the evidence pointed towards some kind of contention introduced with a higher frequency of </span><span style="font-family: "courier new" , "courier" , monospace;">nanoTime</span><span style="font-family: inherit;"> calls. This hypothesis was backed up by the fact that our inter-publish latency was higher - if we consider that in order to calculate this latency, we need to call </span><span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span><span style="font-family: inherit;">.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">So, if there is some kind of contention in retrieving the timestamp, and we have massively increased the rate at which we are making the call, then we would expect to see an increase in processing times, due to the recording of inter-publish latency.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"><br /></span>
<br />
<h3 style="text-align: left;">
<span style="font-family: inherit;">The root cause</span></h3>
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Given our theory that the calls to </span><span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span><span style="font-family: inherit;"> are taking longer than usual in this particular environment, we starting digging a bit deeper on the box in question. Very quickly, we found the relevant line in the syslog:</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">WARNING: Clocksource tsc unstable (delta = XXXXXX ns)</span><br />
<br />
followed a little later by:<br />
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">WARNING: CPU: 27 PID: 2470 at kernel/time/tick-sched.c:192 can_stop_full_tick+0x13b/0x200()</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">NO_HZ FULL will not work with unstable sched clock</span><br />
<br />
<br />
So now we had a hint that the clocksource had changed on that machine. Comparing the clocksource on all other machines showed that the affected host was using the <span style="font-family: "courier new" , "courier" , monospace;">hpet</span> clocksource instead of <span style="font-family: "courier new" , "courier" , monospace;">tsc</span>.<br />
<br />
To view the available and currently-selected clocksources on a Linux machine, consult the following files:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">/sys/devices/system/clocksource/clocksource0/available_clocksource</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">/sys/devices/system/clocksource/clocksource0/current_clocksource</span><br />
<br />
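These files can be checked from the shell with <code>cat</code>, or programmatically; a trivial Java sketch (Linux-only, reading the sysfs paths given above):<br />
<pre><code>import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class ClockSourceCheck
{
    private static final String BASE = "/sys/devices/system/clocksource/clocksource0/";

    public static void main(final String[] args) throws IOException
    {
        System.out.println("available: " + read("available_clocksource"));
        System.out.println("current:   " + read("current_clocksource"));
    }

    private static String read(final String file) throws IOException
    {
        return new String(Files.readAllBytes(Paths.get(BASE + file))).trim();
    }
}</code></pre>
<br />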
<br />
<br />
<h3 style="text-align: left;">
Experimentation and verification</h3>
<br />
<br />
Before making any changes or deploying fixes, we wanted to make sure that we definitely understood the problem at hand. To test our theory it was a simple matter to design an experiment that would replicate the change in behaviour due to the code release: create a thread that continuously calls <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span>, recording the time taken between two calls; scale up a number of worker threads that are just calling <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span> many times per second.<br />
<br />
There is a <a href="https://github.com/epickrram/nanotiming">small application</a> on Github to do exactly this, and it clearly demonstrates the difference between clock sources.<br />
<br />
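The core of that measurement might be sketched as follows (a simplified, hypothetical version; the linked application is more complete):<br />
<pre><code>public final class NanoTimingExperiment
{
    public static void main(final String[] args)
    {
        final int contendingThreads = Integer.parseInt(args[0]);
        for (int i = 0; i < contendingThreads; i++)
        {
            final Thread spinner = new Thread(() -> {
                while (true)
                {
                    System.nanoTime(); // contend on the clocksource
                }
            });
            spinner.setDaemon(true);
            spinner.start();
        }

        // measure the average gap between successive System.nanoTime() calls
        final int samples = 10_000_000;
        while (true)
        {
            final long start = System.nanoTime();
            long last = start;
            for (int i = 0; i < samples; i++)
            {
                last = System.nanoTime();
            }
            System.out.printf("avg. time between calls to System.nanoTime() %dns%n",
                    (last - start) / samples);
        }
    }
}</code></pre>
<br />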
When the clocksource is <span style="font-family: "courier new" , "courier" , monospace;">tsc</span>, calls to retrieve the system nanoseconds have a granularity of ~25ns. Increasing the number of threads does not seem to impact this number a great deal.<br />
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">java -jar ./nanotiming-all-0.0.1.jar 0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Measuring time to invoke System.nanoTime() with 0 contending threads.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Available logical CPUs on this machine: 32</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Current clocksource is: tsc</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">15:14:02.861 avg. time between calls to System.nanoTime() 25ns</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">15:14:03.760 avg. time between calls to System.nanoTime() 25ns</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">15:14:04.760 avg. time between calls to System.nanoTime() 25ns</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">15:14:05.760 avg. time between calls to System.nanoTime() 25ns</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">java -jar ./nanotiming-all-0.0.1.jar 4</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Measuring time to invoke System.nanoTime() with 4 contending threads.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Available logical CPUs on this machine: 32</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Current clocksource is: tsc</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:14:58.751 avg. time between calls to System.nanoTime() 25ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:14:59.730 avg. time between calls to System.nanoTime() 25ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:15:00.730 avg. time between calls to System.nanoTime() 25ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:15:01.730 avg. time between calls to System.nanoTime() 25ns</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">java -jar ./nanotiming-all-0.0.1.jar 16</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Measuring time to invoke System.nanoTime() with 16 contending threads.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Available logical CPUs on this machine: 32</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Current clocksource is: tsc</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:16:17.637 avg. time between calls to System.nanoTime() 27ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:16:18.616 avg. time between calls to System.nanoTime() 29ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:16:19.616 avg. time between calls to System.nanoTime() 29ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:16:20.616 avg. time between calls to System.nanoTime() 30ns</span></div>
<div>
<br /></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Things look somewhat worse once we are using the <span style="font-family: "courier new" , "courier" , monospace;">hpet</span> clocksource.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">java -jar ./nanotiming-all-0.0.1.jar 0</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Measuring time to invoke System.nanoTime() with 0 contending threads.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Available logical CPUs on this machine: 32</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Current clocksource is: hpet</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:00.029 avg. time between calls to System.nanoTime() 612ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:01.267 avg. time between calls to System.nanoTime() 675ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:02.267 avg. time between calls to System.nanoTime() 610ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:03.267 avg. time between calls to System.nanoTime() 610ns</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">java -jar ./nanotiming-all-0.0.1.jar 4</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Measuring time to invoke System.nanoTime() with 4 contending threads.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Available logical CPUs on this machine: 32</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Current clocksource is: hpet</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:24.522 avg. time between calls to System.nanoTime() 1443ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:25.498 avg. time between calls to System.nanoTime() 1451ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:26.498 avg. time between calls to System.nanoTime() 1438ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:27.497 avg. time between calls to System.nanoTime() 1443ns</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">java -jar ./nanotiming-all-0.0.1.jar 16</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Measuring time to invoke System.nanoTime() with 16 contending threads.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Available logical CPUs on this machine: 32</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Current clocksource is: hpet</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:44.542 avg. time between calls to System.nanoTime() 7949ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:45.465 avg. time between calls to System.nanoTime() 7466ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:46.464 avg. time between calls to System.nanoTime() 9202ns</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">15:19:47.464 avg. time between calls to System.nanoTime() 7082ns</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
The <span style="font-family: "courier new" , "courier" , monospace;">tsc</span> clock source seems to scale perfectly with a number of competing threads attempting to read the time. This stands to reason, as the <span style="font-family: "courier new" , "courier" , monospace;">tsc</span> clock reads the CPU-local <a href="https://en.wikipedia.org/wiki/Time_Stamp_Counter">Time Stamp Counter</a>, and so there is no contention.</div>
<div>
<br /></div>
<div>
The <span style="font-family: "courier new" , "courier" , monospace;">hpet</span> clock, apart from being slower by 20x even when uncontended, does not scale with contending threads, so calls to <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span> take longer as more threads try to read the current timestamp. </div>
<div>
<br /></div>
<div>
This notable difference is due to the <span style="font-family: "courier new" , "courier" , monospace;">hpet</span> clock being provided to the kernel as <a href="https://en.wikipedia.org/wiki/High_Precision_Event_Timer">memory-mapped I/O</a>, rather than a CPU-local register read.</div>
<div>
<br /></div>
<div>
While investigating this problem and in trying to understand why <span style="font-family: "courier new" , "courier" , monospace;">hpet</span> was so much slower and more prone to contention, the only reference to the phenomenon that I could find was <a href="https://lkml.org/lkml/2015/8/31/456">this quote</a>:</div>
<div>
<br /></div>
<div>
<div>
"I'm quite sure that you are staring at the HPET scalability bottleneck</div>
<div>
and not at some actual kernel bug."</div>
</div>
<div>
<br /></div>
<div>
In some circles, this is obviously a known issue. </div>
<div>
<br /></div>
<div>
Application developers who happily sprinkle calls to <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span> (or the equivalent <span style="font-family: "courier new" , "courier" , monospace;">clock_gettime(CLOCK_MONOTONIC)</span> call in other languages) around their code should be aware that these calls could be more costly than expected.</div>
<div>
<br /></div>
<div>
<br /></div>
<h3 style="text-align: left;">
The Fix</h3>
<div>
<br /></div>
<div>
<br /></div>
<div>
Having identified the problem, we were keen to roll out a fix - surely we could just switch back to the <span style="font-family: "courier new" , "courier" , monospace;">tsc</span> clocksource? Unfortunately, once the kernel watchdog has <a href="http://lxr.free-electrons.com/source/kernel/time/clocksource.c?v=3.18#L298">marked the clocksource as unstable</a>, it is removed from the list of available clocksources. In this case, we had to wait until the following weekend to switch out the hardware before the problem was properly solved.</div>
<div>
<br /></div>
<div>
Longer-term, how can we ensure that we don't have the same problem if the kernel once again determines that the <span style="font-family: "courier new" , "courier" , monospace;">tsc</span> clock is unstable? Since we use Azul's Zing JVM at LMAX, we were able to take advantage of their <span style="font-family: "courier new" , "courier" , monospace;">-XX:+UseRdtsc</span> runtime flag, which forces the runtime to <a href="http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/2681e95121ac/src/share/vm/classfile/vmSymbols.hpp#l724">intrinsify</a> <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span> calls to a direct read of the local TSC, rather than going through the kernel's <a href="http://lxr.free-electrons.com/source/arch/x86/vdso/vclock_gettime.c?v=3.18#L292">gettime vdso</a>.</div>
<div>
<br /></div>
<div>
This neatly side-steps the problem if you're running Zing, but on Oracle/OpenJDK there is no such flag.</div>
<div>
<br /></div>
<div>
<br /></div>
<h3 style="text-align: left;">
Conclusion</h3>
<div>
<br /></div>
<div>
<br /></div>
<div>
Good monitoring strategies require precise and accurate timestamping capabilities in order to correctly measure latencies<span style="font-family: inherit;">. Using </span><span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span> is the recommended way to sample an accurate timestamp from Java. Depending on the clock in use by the (Linux) system under test, actual timestamp granularity may vary wildly.</div>
<div>
<br /></div>
<div>
As usual, digging down through the layers of abstraction provided by various libraries and virtual machines has proven to be a valuable experience. Through the application of scientific principles and ensuring that problems are repeatable and understood, we can more successfully reason about what is actually happening in these complex machines that run our software.<br />
<br />
<br />
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script><br /></div>
</div>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-18890393906359673732015-12-09T12:40:00.001+00:002020-09-14T19:32:18.820+01:00Journalling Revisited<div dir="ltr" style="text-align: left;" trbidi="on">A few months ago, I wrote about how we had improved our journalling write latency at <a href="https://www.lmax.com/">LMAX</a> by <a href="http://epickrram.blogspot.co.uk/2015/05/improving-journalling-latency.html">upgrading our kernel and file-system</a>. As a follow-up to some discussion on write techniques, I then explored the <a href="http://epickrram.blogspot.co.uk/2015/07/seek-write-vs-pwrite.html">difference between a seek/write and positional write strategy</a>.<br />
<br />
The journey did not end at that point, and we carried on testing to see if we could improve things even further. Our initial upgrade work involved changing the file-system from <span style="font-family: "courier new" , "courier" , monospace;">ext3</span> to <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> (reflecting the default choice of the kernel version that we upgraded to).<br />
<br />
However, it turns out that there is <a href="http://lxr.free-electrons.com/source/fs/">quite a lot of choice</a> available in the kernel in terms of file-systems. After doing a little reading, it seemed as though it would be worth trying out <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> for our journalling file-system.<br />
<br />
The rest of this post looks at the suitability of the <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> file-system for a very specific workload.<br />
<br />
At LMAX our low-latency core services journal every inbound message to disk before they are processed by business-logic, giving us recoverability in the event of a crash. This has the nice side-effect of allowing us to replay events at a later date, which has proven invaluable in tracking down the occasional logic error.<br />
<br />
Due to the nature of our workload, we are interested in write-biased, append-only performance. This is in contrast to a database or desktop system, which will necessarily require a more random access strategy.<br />
<br />
<br />
<h3 style="text-align: left;">
Test Setup</h3>
<div>
<br /></div>
<div style="text-align: left;">
Since I already have a <a href="https://github.com/epickrram/journalling-benchmark">test harness</a> used for this kind of testing, I can use it to do a quick comparison of a production-like workload on both <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> and <span style="font-family: "courier new" , "courier" , monospace;">xfs</span>. Tests are performed on a Dell PowerEdge 720xd; file-systems are mounted as logical volumes on the same battery-backed RAID array. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Mount options for <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> are <span style="font-family: "courier new" , "courier" , monospace;">noatime,barrier=0</span>; mount options for <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> are <span style="font-family: "courier new" , "courier" , monospace;">noatime</span>.</div>
<div style="text-align: left;">
<br /></div>
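<div style="text-align: left;">For reference, the corresponding mount invocations look something like the following (device and mount-point names are illustrative):</div>
<pre class="hljs" style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; color: #444444; display: block; overflow-x: auto; padding: 0.5em;">mount -t ext4 -o noatime,barrier=0 /dev/vg0/journal-ext4 /mnt/journal-ext4
mount -t xfs  -o noatime           /dev/vg0/journal-xfs  /mnt/journal-xfs</pre>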
<div style="text-align: left;">
No other IO is performed on the file-systems while the tests are in progress.<br />
<br />
Over the years, our journalling code has acquired some performance-related features. While running on <span style="font-family: "courier new" , "courier" , monospace;">ext3</span>, we found that the best performance was achieved by preallocating file blocks, and using a reader thread to ensure that the files were hot in the kernel's page cache before attempting to write to them.<br />
<br />
We could assume that this feature will still be beneficial when using <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> and <span style="font-family: "courier new" , "courier" , monospace;">xfs</span>, since it seems logical that keeping things cache-hot and reducing the work needed to write will make things quicker. When making such a fundamental change to a system though, it's best to re-validate any previous findings.<br />
<br />
In order to do this, I ran three different configurations on each file-system:<br />
<br />
1) <a href="https://github.com/epickrram/journalling-benchmark/blob/master/src/main/java/com/epickrram/benchmark/journal/setup/FilePreallocator.java#L48">no pre-allocation of journal files</a> - i.e. first write creates a sparse file and writes to it<br />
2) <a href="https://github.com/epickrram/journalling-benchmark/blob/master/src/main/java/com/epickrram/benchmark/journal/setup/FilePreallocator.java#L57">pre-allocate sparse files</a> - i.e. the inodes exist, but size-on-disk is zero<br />
3) <a href="https://github.com/epickrram/journalling-benchmark/blob/master/src/main/java/com/epickrram/benchmark/journal/setup/FilePreallocator.java#L59">pre-allocate and fill files</a> - i.e. write zeroes for the total file-length so that the page-cache can be populated<br />
<br /></div>
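<div style="text-align: left;">To make the three configurations concrete, here is a minimal Java sketch of each strategy (the class name and journal length are illustrative, and this is not the benchmark harness code):</div>
<pre class="hljs" style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; color: #444444; display: block; overflow-x: auto; padding: 0.5em;">import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public final class PreallocationModes
{
    private static final long JOURNAL_LENGTH = 512 * 1024 * 1024L;

    // 1) no pre-allocation: the first write creates a sparse file
    static void none(final Path journal)
    {
        // nothing to do before the first write
    }

    // 2) pre-allocate sparse files: the inode exists, size-on-disk is zero
    static void sparse(final Path journal) throws IOException
    {
        Files.createFile(journal);
    }

    // 3) pre-allocate and fill: write zeroes for the total file-length,
    //    pulling the journal's pages into the kernel's page cache
    static void preallocateAndFill(final Path journal) throws IOException
    {
        try (RandomAccessFile file = new RandomAccessFile(journal.toFile(), "rw"))
        {
            final byte[] zeroes = new byte[4096];
            long remaining = JOURNAL_LENGTH;
            while (remaining > 0)
            {
                final int toWrite = (int) Math.min(zeroes.length, remaining);
                file.write(zeroes, 0, toWrite);
                remaining -= toWrite;
            }
        }
    }
}</pre>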
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<h3 style="text-align: left;">
Initial Results</h3>
</div>
<div style="text-align: left;">
<br />
As I've stated in previous posts, what we care about most is consistent low-latency. Our customers expect instructions to be processed in a timely manner, <i>all of the time</i>. For this reason, we really care about taming long-tail latency, while still providing excellent best-case response times.<br />
<br />
The test harness records a full histogram of write latencies over the course of several test runs; these data can be used to inspect the difference in maximum write latency for each configuration:<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/blog-images/master/2015_12/img_max_latency_no_read_load.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="426" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2015_12/img_max_latency_no_read_load.png" width="640" /></a></div>
<br />
<br />
<br />
The best results are from <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> when there is sparse pre-allocation; actually priming the pages before writing does not seem to make things any better. This result isn't particularly intuitive, but seems to hold true for <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> also. For all configurations on <span style="font-family: "courier new" , "courier" , monospace;">ext4</span>, multi-millisecond write latencies are apparent.<br />
<br />
In order to get an idea of what sort of write performance we can expect <i>most of the time</i>, we can look at the mean write latency over the same set of experiment data:<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/blog-images/master/2015_12/img_mean_latency_no_read_load.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="426" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2015_12/img_mean_latency_no_read_load.png" width="640" /></a></div>
<br />
<br />
<br />
Here it's clear that <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> gives better results, with the average write beating all <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> modes by a few hundred nanoseconds. Interestingly, in the average case, pre-allocating and priming files seems to give the best results for both file-systems. This result fits better with intuition, since writes to existing cache-hot pages <i>should</i> be doing less work.<br />
<br />
<br />
So far, the average and worst-case latencies seem to be best served by using <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> with sparsely allocated files. Looking more closely at the results however shows that <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> has a better four-nines latency profile when writing to fully-primed journal files:<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/blog-images/master/2015_12/img_four_nines_latency_no_read_load.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="426" src="https://raw.githubusercontent.com/epickrram/blog-images/master/2015_12/img_four_nines_latency_no_read_load.png" width="640" /></a></div>
<br />
<br />
Still, since we are mostly concerned with long-tail latency, <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> is the winner so far.</div>
<div style="text-align: left;">
<br /></div>
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
Digging Deeper</h3>
<div>
<br /></div>
<div>
Before making the transition to <span style="font-family: "courier new" , "courier" , monospace;">xfs</span>, we spent a long time trying to figure out where these poor <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> write times were coming from. Conceptually, both file-systems are just writing to battery-backed cache, so it must be something within the file-system layer that is causing the problem.</div>
<div>
<br /></div>
<div>
After much detailed investigation, we tracked the problem down to an interaction between the writer code and the <a href="http://lxr.free-electrons.com/source/fs/ext4/ext4_jbd2.c?v=3.18">ext4 journalling daemon</a> (jbd2). The system would regularly get into a state where the writer thread needed to write the file, but the journalling daemon owned the lock required by the writer thread.</div>
<div>
<br /></div>
<div>
The response of the writer thread is to register to wait on the resource to be freed, and then call schedule(), effectively yielding the CPU. The writer thread will not be scheduled to run again until the journalling daemon has done its work. We found that this interaction could cause latency spikes of hundreds of milliseconds.</div>
<div>
<br /></div>
<div>
After switching over to <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> in our performance and production environments, we saw a greatly reduced latency in our journalling code.</div>
<div>
<br /></div>
<div>
<br /></div>
<h3 style="text-align: left;">
You wanted a free lunch?</h3>
<div>
<br /></div>
<div>
Of course, nothing comes for free, and <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> has its own particular foibles. While investigating another, unrelated issue, we came across an interesting feature of <span style="font-family: "courier new" , "courier" , monospace;">xfs</span>' design.<br />
<br />
We have another workload for which we care about the file-system performance - one of our other services performs fast append-only writes to a number of journal files, but also requires that random-access reads can be done on those files.<br />
<br />
Since we had such good results using <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> for our core services, we naturally thought it would be worth trying out for other services. Switching to <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> on the mixed-mode workload was disastrous, however, with frequent long stalls on the writer thread. No such long stalls were observed when using <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> for this workload.</div>
<div>
<br />
The reason for the difference became clear when we took a closer look at what actually happens when performing reads with the two file-systems.<br />
<br />
When performing a file read using the <span style="font-family: "courier new" , "courier" , monospace;">read</span> syscall, we end up calling the <a href="http://lxr.free-electrons.com/source/fs/read_write.c?v=3.18#L415"><span style="font-family: "courier new" , "courier" , monospace;">vfs_read</span></a> function, which in turn delegates the call to the file-system's <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://lxr.free-electrons.com/source/include/linux/fs.h?v=3.18#L1486">file_operations</a></span> struct by invoking the <span style="font-family: "courier new" , "courier" , monospace;">read</span> method. In the case of both <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> and <span style="font-family: "courier new" , "courier" , monospace;">ext4</span>, the <span style="font-family: "courier new" , "courier" , monospace;">file_operations</span>' read method is a pointer to the <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://lxr.free-electrons.com/source/fs/read_write.c?v=3.18#L394">new_sync_read</a></span> function.</div>
<div>
<br />
Within <span style="font-family: "courier new" , "courier" , monospace;">new_sync_read</span>, the call is delegated back to the <span style="font-family: "courier new" , "courier" , monospace;">file_operations</span>' <span style="font-family: "courier new" , "courier" , monospace;">read_iter</span> function pointer:<br />
<br />
<br />
<script src="https://gist.github.com/epickrram/e51ae703eb5a860546e4.js"></script>
<br />
Here's where things start to get interesting. The <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> <span style="font-family: "courier new" , "courier" , monospace;">file_operations</span> <a href="http://lxr.free-electrons.com/source/fs/ext4/file.c?v=3.18#L588">simply forward</a> the <span style="font-family: "courier new" , "courier" , monospace;">read_iter</span> call to <span style="font-family: "courier new" , "courier" , monospace;">generic_file_read_iter</span>, which ends up invoking <a href="http://lxr.free-electrons.com/source/mm/filemap.c?v=3.18#L1467"><span style="font-family: "courier new" , "courier" , monospace;">do_generic_file_read</span></a>, the function that actually reads data from the kernel's page cache.<br />
<br />
The behaviour of <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> differs in that when <span style="font-family: "courier new" , "courier" , monospace;">read_iter</span> is called, it delegates to <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://lxr.free-electrons.com/source/fs/xfs/xfs_file.c?v=3.18#L233">xfs_file_read_iter</a></span>. The <span style="font-family: "courier new" , "courier" , monospace;">xfs</span>-specific function wraps its own call to <span style="font-family: "courier new" , "courier" , monospace;">generic_file_read_iter</span> in a lock:<br />
<br />
<br />
<script src="https://gist.github.com/epickrram/feabdf8620b809b1825a.js"></script>
So, in order to perform a file read in <span style="font-family: "courier new" , "courier" , monospace;">xfs</span>, it is necessary to acquire a read-lock for the file. The locks used in <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> are the kernel's multi-reader locks, which allow a single writer or multiple readers.<br />
<br />
In order to perform a write to an <span style="font-family: "courier new" , "courier" , monospace;">xfs</span>-managed file, the writer must acquire an exclusive write lock:<br />
<br />
<br />
<script src="https://gist.github.com/epickrram/922387b56cec6a77f291.js"></script>
Herein lies the problem we encountered. If you want to write to an <span style="font-family: "courier new" , "courier" , monospace;">xfs</span>-managed file, you need an exclusive lock. If reads are in progress, the writer will block until they are complete. Likewise, reads will not occur until the current write completes.<br />
<br />
In summary, <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> seems to be a safer option than <span style="font-family: "courier new" , "courier" , monospace;">ext4</span> from a data-integrity point of view, since no reads will ever occur while a writer is modifying a file's pages. There is a cost to this safety however, and whether it's the correct choice will be dependent on the particular workload.<br />
<br />
<br />
<h3 style="text-align: left;">
Conclusion</h3>
<br />
Since moving to <span style="font-family: "courier new" , "courier" , monospace;">xfs</span> as our journalling file-system, we have observed no recurrence of the periodic latency spikes seen when running on <span style="font-family: "courier new" , "courier" , monospace;">ext4</span>. This improvement is highly workload-dependent, and the findings presented here will not suit all cases.<br />
<br />
When making changes to the system underlying your application, be it kernel, file-system, or hardware, it is important to re-validate any tunings or features implemented in the name of improved performance. What holds true for one configuration may not have the same positive impact, or could even have a negative impact on a different configuration.<br />
<br />
<br />
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<br /></div>
<div>
<br /></div>
</div>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-35828206062520197692015-11-04T14:21:00.004+00:002020-09-14T19:32:36.301+01:00Reducing system jitter - part 2<div dir="ltr" style="text-align: left;" trbidi="on"><p style="text-align: left;"><span></span></p><a name='more'></a><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span><p></p><p style="text-align: left;"><span style="font-size: large;"><span></span></span></p><!--more--><span style="font-size: large;"> </span> <br /><p></p></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">In my last post, I demonstrated some techniques for reducing system jitter in low-latency systems. Those techniques brought worst-case inter-thread latency from 11 milliseconds down to 15 <i>microseconds</i>.<br />
<br />
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
<br /></h3>
<h3 style="text-align: left;">
Next steps</h3>
<br />
I left off having reserved some CPU capacity using the <span style="font-family: "courier new" , "courier" , monospace;">isolcpus</span> boot parameter. While this parameter stopped any other user-land processes from being scheduled on the specified CPUs, there were still some kernel threads restricted to running on those CPUs. These threads can cause unwanted jitter if they are scheduled to execute while your application is running.<br />
<br />
To see what other processes are running on the reserved CPUs (in this case 1 & 3), and how long they are running for, <span style="font-family: inherit;"><a href="https://perf.wiki.kernel.org/index.php/Main_Page">perf_events</a></span> can be used to monitor <a href="https://en.wikipedia.org/wiki/Context_switch">context switches</a> during a run of the benchmark application:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">perf record -e "sched:sched_switch" -C 1,3</span><br />
<br />
A context switch will occur whenever the scheduler decides that another process should be executed on the CPU. This will certainly be a cause of latency, so monitoring this number during a run of the benchmark harness can reveal whether these events are having a negative effect.<br />
<br />
Removing <span style="font-family: "courier new" , "courier" , monospace;">swapper</span> (the idle process) and <span style="font-family: "courier new" , "courier" , monospace;">java</span> from the trace output, it is possible to determine how many times these other threads were executed on the supposedly isolated CPUs:<br />
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">count process</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">--------------------</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1 ksoftirqd/1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 4 kworker/1:0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 2 kworker/3:1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 4 migration/1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 4 migration/3</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 13 watchdog/1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 13 watchdog/3</span><br />
<div>
<br /></div>
<div>
<br /></div>
<div>
In order to determine whether these processes are contributing significantly to system jitter, the <span style="font-family: "courier new" , "courier" , monospace;">sched_switch</span> event can be studied in more detail to find out how long these processes take to execute.</div>
<div>
<br />
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
<br /></h3>
<h3 style="text-align: left;">
Cost of context switching</h3>
</div>
<div>
<br /></div>
<div>
Given a pair of <span style="font-family: "courier new" , "courier" , monospace;">sched_switch</span> events, the delta between the trace timestamps will reveal the interesting piece of data:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">java 4154 [003] 5456.403072: sched:sched_switch: java:4154 [120] <b>R</b> ==> watchdog/3:27 [0]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">watchdog/3:27 [003] 5456.403074: sched:sched_switch: watchdog/3:27 [0] <b>S</b> ==> java:4154 [120]</span> </div>
<div>
<br /></div>
<div>
<br /></div>
<div>
In the example above, the runnable <span style="font-family: "courier new" , "courier" , monospace;">java</span> process was switched out in favour of the <span style="font-family: "courier new" , "courier" , monospace;">watchdog</span> process. After 2 microseconds, the now sleeping watchdog is replaced by the java process that was previously running on the CPU.<br />
<br />
In this case, the two-microsecond context switch doesn't match the worst case of 15 microseconds, but it is a good indicator that these events could contribute significantly to multi-microsecond levels of jitter within the application.</div>
<div>
<br /></div>
<div>
Armed with this knowledge, it is a simple matter to write a script to report the observed runtimes of each of these kernel threads:</div>
<div>
<br /></div>
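<div>One possible shape for such a script, sketched in Java (the class name is hypothetical, and the parsing assumes the perf script output format shown above):</div>
<pre class="hljs" style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; color: #444444; display: block; overflow-x: auto; padding: 0.5em;">import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class KernelThreadRuntimes
{
    // matches e.g. "[003] 5456.403072: sched:sched_switch: java:4154 [120] R ==> watchdog/3:27 [0]"
    private static final Pattern SCHED_SWITCH = Pattern.compile(
        "\\[(\\d+)\\]\\s+(\\d+\\.\\d+):\\s+sched:sched_switch:.*==>\\s+(\\S+):\\d+\\s+\\[\\d+\\]");

    public static void main(final String[] args) throws IOException
    {
        final Map<String, long[]> stats = new HashMap<>();       // comm -> {min, max, sum, count}
        final Map<String, String> switchedIn = new HashMap<>();  // cpu -> comm last switched in
        final Map<String, Long> switchedInAt = new HashMap<>();  // cpu -> switch-in time (us)

        for (final String line : Files.readAllLines(Paths.get(args[0])))
        {
            final Matcher matcher = SCHED_SWITCH.matcher(line);
            if (!matcher.find())
            {
                continue;
            }
            final String cpu = matcher.group(1);
            final long timeMicros = Math.round(Double.parseDouble(matcher.group(2)) * 1_000_000);
            final String previous = switchedIn.put(cpu, matcher.group(3));
            final Long startMicros = switchedInAt.put(cpu, timeMicros);

            // attribute the elapsed time to whatever was last switched in on this CPU
            if (previous != null && !"java".equals(previous) && !previous.startsWith("swapper"))
            {
                final long runtime = timeMicros - startMicros;
                final long[] s = stats.computeIfAbsent(previous, k -> new long[] {Long.MAX_VALUE, 0, 0, 0});
                s[0] = Math.min(s[0], runtime);
                s[1] = Math.max(s[1], runtime);
                s[2] += runtime;
                s[3]++;
            }
        }

        stats.forEach((comm, s) -> System.out.printf("%s min: %dus, max: %dus, avg: %dus%n",
            comm, s[0], s[1], s[2] / s[3]));
    }
}</pre>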
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">ksoftirqd/1 min: 11us, max: 11us, avg: 11us</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">kworker/1:0 min: 1us, max: 7us, avg: 5us</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">kworker/3:1 min: 6us, max: 18us, avg: 12us</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">watchdog/1 min: 3us, max: 5us, avg: 3us</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">watchdog/3 min: 2us, max: 5us, avg: 2us</span></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
From these results, it can be seen that the <span style="font-family: "courier new" , "courier" , monospace;">kworker</span> threads took up to 18us, soft IRQ processing took 11us, and <span style="font-family: "courier new" , "courier" , monospace;">watchdog</span> up to 5us.</div>
<div>
<br /></div>
<div>
Given the current worst-case latency of ~15us, it appears as though these kernel threads could be responsible for the remaining inter-thread latency.</div>
<div>
<br /></div>
<div>
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
<br /></h3>
<h3 style="text-align: left;">
Dealing with kernel threads</h3>
<br /></div>
<div>
Starting with the process that is taking the most time, the first thing to do is to work out what that process is actually doing.</div>
<div>
<br /></div>
<div>
My favourite way to see what a process is up to is to use another Linux tracing tool, <a href="https://www.kernel.org/doc/Documentation/trace/ftrace.txt">ftrace</a>. The name of this tool is a shortening of 'function tracer', and its purpose is exactly as described - to trace functions.<br />
<br />
Ftrace has many uses and warrants a whole separate post; the interested reader can find more information in the online documentation. For now, I'm concerned with the function_graph plugin, which provides the ability to record any kernel function calls made by a process. Ftrace makes use of the debug filesystem (usually mounted at <span style="font-family: "courier new" , "courier" , monospace;">/sys/kernel/debug/</span>) for control and communication. Luckily, the authors have provided a nice usable front-end in the form of <span style="font-family: "courier new" , "courier" , monospace;">trace-cmd</span>.</div>
<div>
<br /></div>
<div>
To discover why the <span style="font-family: "courier new" , "courier" , monospace;">kworker</span> process is being scheduled, <span style="font-family: "courier new" , "courier" , monospace;">trace-cmd</span> can be used to attach and record function calls:</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">trace-cmd record -p function_graph -P <PID> </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> plugin 'function_graph'</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Hit Ctrl^C to stop recording</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">trace-cmd report</span></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">kworker/3:1-656 [003] 8440.438017: funcgraph_entry: | process_one_work() {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">kworker/3:1-656 [003] 8440.438025: funcgraph_entry: | vmstat_update() {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">kworker/3:1-656 [003] 8440.438025: funcgraph_entry: | refresh_cpu_vm_stats() {</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
The important function here is <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://lxr.free-electrons.com/source/kernel/workqueue.c?v=4.0#L1921">process_one_work()</a></span>. It is the function called when there is work to be done in the kernel's per-CPU <a href="https://lwn.net/Articles/11360/">workqueue</a>. In this particular invocation, the work to be done was a call to the <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://lxr.free-electrons.com/source/mm/vmstat.c?v=4.0#L1364">vmstat_update()</a></span> function.</div>
<div>
<br /></div>
<div>
Since it can be tedious (though enlightening) to walk through the verbose ftrace output, perf_events can be used to get a better view of what workqueue processing is being done via the <span style="font-family: "courier new" , "courier" , monospace;">workqueue</span> trace points compiled into the kernel:</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">perf record -e "workqueue:*" -C 1,3</span></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Workqueue start/end events can be seen in the trace output:</div>
<div>
<br /></div>
<div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">kworker/3:1 656 [003] 9411.223605: workqueue:workqueue_execute_start: work struct 0xffff88021f590520: function vmstat_update</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">kworker/3:1 656 [003] 9411.223607: workqueue:workqueue_execute_end: work struct 0xffff88021f590520</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
A simple script can then be used to determine how long each workqueue item takes:</div>
<div>
<br /></div>
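<div>One way to write such a script, sketched here in Java (the class name is hypothetical, and the parsing assumes the event format shown above), is to pair start/end events by work struct address:</div>
<pre class="hljs" style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; color: #444444; display: block; overflow-x: auto; padding: 0.5em;">import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class WorkqueueDurations
{
    private static final Pattern START = Pattern.compile(
        "(\\d+\\.\\d+):\\s+workqueue:workqueue_execute_start:\\s+work struct (\\S+):\\s+function (\\S+)");
    private static final Pattern END = Pattern.compile(
        "(\\d+\\.\\d+):\\s+workqueue:workqueue_execute_end:\\s+work struct (\\S+)");

    public static void main(final String[] args) throws IOException
    {
        final Map<String, String[]> inFlight = new HashMap<>(); // work struct -> {start time, function}
        for (final String line : Files.readAllLines(Paths.get(args[0])))
        {
            final Matcher start = START.matcher(line);
            if (start.find())
            {
                inFlight.put(start.group(2), new String[] {start.group(1), start.group(3)});
                continue;
            }
            final Matcher end = END.matcher(line);
            if (end.find() && inFlight.containsKey(end.group(2)))
            {
                final String[] work = inFlight.remove(end.group(2));
                final long micros = Math.round(
                    (Double.parseDouble(end.group(1)) - Double.parseDouble(work[0])) * 1_000_000);
                System.out.printf("%dus function %s ended at %s%n", micros, work[1], end.group(1));
            }
        }
    }
}</pre>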
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">4000us function vmstat_update ended at 9592.370051</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">4000us function vmstat_update ended at 9592.370051</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">4000us function vmstat_update ended at 9593.366948</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">13000us function vmstat_update ended at 9594.367875</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Here the kernel is scheduling calls to update vm per-CPU stats, sometimes taking over 10 microseconds to complete. The function is scheduled to run every second by default.<br />
<br />
In order to determine whether this is the source of the max inter-thread latency, it is possible to defer this work so that the <span style="font-family: "courier new" , "courier" , monospace;">kworker</span> thread remains dormant. This can be done by changing the system setting <span style="font-family: "courier new" , "courier" , monospace;">vm.stat_interval</span>:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">sysctl -w vm.stat_interval=<b>3600</b></span></div>
<div>
<br /></div>
<div>
<br />
This command will tell the kernel to only update these values every hour. Note that this may not be a particularly good thing to do - but for the purposes of this investigation it is safe.</div>
<div>
<br />
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
<br /></h3>
<h3 style="text-align: left;">
Validating changes</h3>
</div>
<div>
<br /></div>
<div>
Since a change has been made, it is good practice to validate that the desired effect has occurred. Re-running the test while monitoring workqueue events should show that there are no more calls to <span style="font-family: "courier new" , "courier" , monospace;">vmstat_update()</span>. Using the method already covered above to observe context switches, it can be seen that other than application threads, only the <span style="font-family: "courier new" , "courier" , monospace;">watchdog</span> processes have executed on the isolated CPUs, proving that kernel worker threads are no longer interfering with the scheduling of the application threads:<br />
<br /></div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">count process</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">-------------------</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 58 watchdog/1</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 58 watchdog/3</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
So, has this helped the inter-thread latency? Charting the results against the baseline shows that there is still something else happening to cause jitter, though the worst-case is now down to 7us. There is a clear difference at the higher percentiles - something that used to take 5-7us is gone; this is almost certainly the change to remove calls to <span style="font-family: "courier new" , "courier" , monospace;">vmstat_update()</span>, which, aside from doing actual work, also caused a context switch, which comes with its own overhead.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/baseline_vs_no_workqueue.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="426" src="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/baseline_vs_no_workqueue.png" width="640" /></a></div>
<div style="text-align: center;">
<i>Effect of removing kernel workqueue tasks from isolated cpus</i></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Next on the list are the <span style="font-family: "courier new" , "courier" , monospace;">watchdog</span> threads. These are processes created by the kernel to <a href="http://lxr.free-electrons.com/source/kernel/watchdog.c?v=4.0">monitor the operating system</a>, making sure that there aren't any runaway software processes causing system instability. It is possible to disable these with the boot parameter <span style="font-family: "courier new" , "courier" , monospace;">nosoftlockup</span>. An exhaustive list of kernel boot parameters can be found <a href="https://www.kernel.org/doc/Documentation/kernel-parameters.txt">here</a>. The effect of disabling these processes is that the system is no longer able to detect CPU-hogging processes, so this should be done with caution.<br />
<br />
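As an illustration, on a grub-based distribution the parameter can be added to the kernel command line like this (the file location, existing options, and config-regeneration command all vary by distribution):<br />
<pre class="hljs" style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; color: #444444; display: block; overflow-x: auto; padding: 0.5em;"># /etc/default/grub (illustrative)
GRUB_CMDLINE_LINUX="isolcpus=1,3 nosoftlockup"

# then regenerate the boot configuration, e.g.:
grub2-mkconfig -o /boot/grub2/grub.cfg</pre>
<br />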
After rebooting and confirming that the watchdog processes are no longer created, the test must be executed again to observe whether this change has had any effect.</div>
<div>
<br /></div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/no_watchdog.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="426" src="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/no_watchdog.png" width="640" /></a></div>
<div style="text-align: center;">
<i>Effect of removing kernel watchdog processes</i></div>
<br /></div>
<div>
<br /></div>
<div>
After removing the watchdog processes and re-running the test, the worst-case latency is down to 3us - a clear win. Since the watchdog's workload is actually very small, it's reasonable to put most of the saving down to the fact that the application threads are no longer context-switched out.</div>
<div>
<br /></div>
<div>
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
<br /></h3>
<h3 style="text-align: left;">
Deeper down the stack</h3>
<br />
Despite significant progress, there is still a large variation in inter-thread latency. At this point, it is clear that our user code is executing all the time (no context switches are now observed); runtime stalls such as safepoints and garbage-collection have been ruled out, and the OS is no longer causing scheduler jitter by switching out application threads in favour of kernel threads.<br />
<br />
Clearly something else is still causing the application to take a variable amount of time to pass messages between threads.<br />
<br />
<br />
Once the software (runtime & OS) is out of the frame, it is time to turn to the hardware. One source of system jitter that originates in the hardware is <a href="https://en.wikipedia.org/wiki/Interrupt">interrupts</a>. Hardware interrupts sent to the CPU can be viewed in <span style="font-family: "courier new" , "courier" , monospace;">/proc/interrupts</span>:<br />
<br />
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">cat /proc/interrupts</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> CPU0 CPU1 CPU2 CPU3 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0: 43 0 0 0 IR-IO-APIC-edge timer</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">1: 342 96 3598 16 IR-IO-APIC-edge i8042</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">3: 3 0 0 0 IR-IO-APIC-fasteoi AudioDSP, dw_dmac</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">6: 0 0 0 0 IR-IO-APIC-fasteoi dw_dmac</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">7: 930 5 198 134 IR-IO-APIC-fasteoi INT3432:00, INT3433:00</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">8: 0 0 1 0 IR-IO-APIC-edge rtc0</span></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
This file can be used to check whether any hardware interrupts were sent to the isolated CPUs during the course of a test run. Note that perf_events can also be used via the <span style="font-family: "courier new" , "courier" , monospace;">irq</span> trace points.</div>
<div>
<br /></div>
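<div>A simple way to capture snapshots around a test run (file names are illustrative):</div>
<pre class="hljs" style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; color: #444444; display: block; overflow-x: auto; padding: 0.5em;">cat /proc/interrupts > /tmp/interrupts.before
# ... run the benchmark ...
cat /proc/interrupts > /tmp/interrupts.after
diff /tmp/interrupts.before /tmp/interrupts.after</pre>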
<div>
Comparing snapshots of the file shows that the following interrupts occurred during the run:</div>
<div>
<br />
<span style="font-family: "courier new" , "courier" , monospace;"> CPU0 CPU1 CPU2 CPU3 </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">47: 66878 1362-><b>1366</b> 29700 1951-><b>1952</b> IR-PCI-MSI-edge i915</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">LOC: 123968 449426-><b>489201</b> 627266 469447-><b>509306</b> Local timer interrupts</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">RES: 35257 227-><b>247</b> 42175 241-><b>261</b> Rescheduling interrupts</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">TLB: 4198 59 4388 76-><b>92</b> TLB shootdowns</span><br />
<br />
<br />
Here I'm highlighting interrupt counts that changed during the execution of the benchmark, and that may have contributed to inter-thread latency.<br />
<br />
Each interrupt has an associated name that, in some cases, can be used to determine what it is used for.<br />
<br />
On my laptop, IRQ 47 is associated with the <a href="http://lxr.free-electrons.com/source/drivers/gpu/drm/i915/?v=4.0">i915 Intel graphics driver</a>, and some hardware events have been raised on the isolated CPUs. This will have an impact on the executing processes, so these interrupts should be removed if possible. Bear in mind that some systems have a dynamic affinity allocator in the form of <span style="font-family: "courier new" , "courier" , monospace;">irqbalance</span><span style="font-family: inherit;">; if you want to manually set interrupt affinity, make sure that nothing is going to interfere with your settings.</span><br />
<br />
Most of the numbered IRQs can be shifted onto another CPU by interacting with the <span style="font-family: "courier new" , "courier" , monospace;">proc</span> filesystem. To find out where the device can currently send interrupts, inspect the current value of <span style="font-family: "courier new" , "courier" , monospace;">smp_affinity_list</span>, which is a list of allowed CPUs for handling this interrupt:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">cat /proc/irq/47/smp_affinity_list</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0-3</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit;">The value can be updated (in most cases) to steer those interrupts away from protected resources:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">echo 0,2 > /proc/irq/47/smp_affinity_list </span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">cat /proc/irq/47/smp_affinity_list </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">0,2</span></div>
</div>
<div>
<br /></div>
<div>
<br />
Once again, a change has been made, and effects should be observed:</div>
<div>
<br /></div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/reduced_interrupts.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="426" src="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/reduced_interrupts.png" width="640" /></a></div>
<div style="text-align: center;">
<i>Effect of reducing driver interrupts</i></div>
<br /></div>
<div>
<br />
That change looks positive, but doesn't remove all the variation. There is still the matter of the other listed interrupts, and these cannot be so easily dealt with. They will be the subject of a later post.<br />
<br />
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
<br /></h3>
<h3 style="text-align: left;">
Conclusion</h3>
<br /></div>
<div>
Protecting CPU resource from unwanted scheduler pressure is vital when tuning for low-latency. There are several mechanisms available for reducing scheduler jitter, with a minimum being to remove the required CPUs from the default scheduling domain.<br />
<br />
Further steps must be taken to reduce work done on the kernel's behalf, as this workload is able to pre-empt your application at any time. Kernel processes run at a higher priority than user-land processes managed by the default scheduler, so will always win.<br />
<br />
Even ruling out software interaction is not enough to fully remove adverse effects from running code - hardware interrupts will cause your program's execution to stop while the interrupt is being serviced.<br />
<br />
There is information available in the kernel via the proc file-system, or from tracing tools, that will shine a light on what is happening deeper in the guts of the kernel.<br />
<br />
Further tuning using the methods described above has reduced inter-thread latency jitter from 15 microseconds to 2.5 microseconds. Still a long way off the mean value of a few hundred nanoseconds, but definitely getting closer.<br />
<br /></div>
<br /><br /><br />
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
</div>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-22524362747925699812015-09-26T21:57:00.001+01:002020-09-14T19:32:53.933+01:00Reducing system jitter<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">For the next instalment of this series on low-latency tuning at <a href="https://www.lmax.com/">LMAX Exchange</a>, I'm going to talk about reducing jitter introduced by the operating system.<br />
<br />
Our applications typically execute many threads, running within a JVM, which in turn runs atop the Linux operating system. Linux is a general-purpose multi-tasking OS, which can target phones, tablets, laptops, desktops and server-class machines. Due to this broad reach, it can sometimes be necessary to supply some guidance in order to achieve the lowest latency.<br />
<br />
LMAX Exchange services rely heavily on the <a href="https://github.com/LMAX-Exchange/disruptor">Disruptor</a> for fast inter-thread communication, and as such, we have a handful of 'hot' threads that we wish to always be on-CPU.<br />
<br />
Below is a simplified diagram of one of the low-latency paths through our exchange. We receive FIX requests from customers at a gateway; these requests are multiplexed into a Disruptor instance, where the consumer thread sends messages onto a 10Gb network via UDP. Those messages then arrive at the matching engine, where they are processed before a response is sent out to the customer via the FIX gateway.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/LMAX-Low-latency-path.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="http://img.epickrram.com/blog/LMAX-Low-latency-path.png" width="380" /></a></div>
<br />
<br />
Focussing on the (simplified) matching engine, we can see that there are 4 threads of execution that will affect end-to-end latency if there is jitter present in the system (in reality, there are more, this diagram is for illustrative purposes only!).<br />
<br />
<ol style="text-align: left;">
<li>The thread which is polling the network for inbound traffic</li>
<li>The thread that executes business logic, generating responses</li>
<li>The journaller thread, which writes all in-bound messages to disk</li>
<li>The publisher thread, responsible for sending responses back to the gateway</li>
</ol>
<div>
<br />
To ensure data consistency, the business-logic thread will not process a message until it has been written to the journal (this is covered in more detail in previous posts). So it can be seen that jitter experienced on any of these threads will affect the end-to-end latency experienced by customers.</div>
<div>
<br /></div>
<div>
The aim is to reduce that jitter to an absolute minimum. To do this, we use the Disruptor's <a href="https://github.com/LMAX-Exchange/disruptor/blob/master/src/main/java/com/lmax/disruptor/BusySpinWaitStrategy.java">BusySpinWaitStrategy</a> so that message-passing between publisher and consumer is as 'instantaneous' as allowed by the platform. The Disruptor has different strategies for waiting, and each is suited to different situations. In this case, since we want to reduce latency, busy-spinning is the best choice. It does, however, come with caveats.</div>
<div>
<br /></div>
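<div>For illustration, wiring up a Disruptor with this wait strategy looks something like the sketch below (the event class and handler are illustrative, not code from the exchange):</div>
<pre class="hljs" style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; color: #444444; display: block; overflow-x: auto; padding: 0.5em;">import java.util.concurrent.Executors;
import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;

public final class BusySpinExample
{
    // a mutable event, reused by the ring buffer
    public static final class MessageEvent
    {
        long payload;
    }

    public static void main(final String[] args)
    {
        // Disruptor 3.x constructor taking an explicit wait strategy
        final Disruptor<MessageEvent> disruptor = new Disruptor<>(
            MessageEvent::new,                 // event factory
            1024,                              // ring buffer size (must be a power of two)
            Executors.newCachedThreadPool(),   // where consumer threads are run
            ProducerType.SINGLE,               // a single publishing thread
            new BusySpinWaitStrategy());       // consumers busy-spin rather than block

        final EventHandler<MessageEvent> logic =
            (event, sequence, endOfBatch) -> { /* business logic here */ };
        disruptor.handleEventsWith(logic);
        disruptor.start();
    }
}</pre>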
<div>
If these threads are to be always runnable, then they need to have access to CPU resource at all times. As mentioned before, Linux is a multi-tasking general-purpose operating system, whose default mode is to schedule a wide variety of tasks with different latency requirements. If the operating system decides to run another task on the CPU currently executing one of the busy-spinning threads, then unwanted and unpredictable latency will creep into the system.</div>
<div>
<br /></div>
<div>
<br />
Enter the dark arts of CPU-isolation, thread-pinning and Linux tracing tools.</div>
<div>
<br />
<br /></div>
<h3 style="text-align: left;">
An example</h3>
<div>
<br /></div>
<div>
In order to demonstrate the techniques we used at LMAX Exchange to achieve consistent, low inter-thread latency, I'm going to refer to an example application that can be used to measure such latencies introduced by the host platform.</div>
<div>
<br /></div>
<div>
The application has three threads with low-latency requirements: a 'producer' thread, which reads messages from a datasource and writes them to a Disruptor instance; a 'logic' thread, which performs some arbitrary logic; and a 'journaller' thread, which writes messages to disk. Both logic and journaller threads are consumers of the Disruptor instance, using a busy-spin wait strategy.</div>
<div>
<br /></div>
<div>
The producer thread performs a call to <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span> and writes the result into the message before passing to the Disruptor. The logic thread reads a message from the Disruptor, and immediately calls <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span>. The delta between these two timestamps is the time taken to transit the Disruptor. These deltas are stored in an <a href="https://github.com/HdrHistogram/HdrHistogram">HdrHistogram</a> and reported at application exit.</div>
<div>
<br /></div>
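<div>A minimal sketch of that measurement (the class name is illustrative; the HdrHistogram calls are the standard recording API):</div>
<pre class="hljs" style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; color: #444444; display: block; overflow-x: auto; padding: 0.5em;">import org.HdrHistogram.Histogram;

public final class TransitLatencyRecorder
{
    // track values up to 10 seconds, at 3 significant digits
    private final Histogram histogram = new Histogram(10_000_000_000L, 3);

    // called on the producer thread; the result is written into the message
    public long stamp()
    {
        return System.nanoTime();
    }

    // called on the logic thread as soon as the message is read from the Disruptor
    public void record(final long producerNanos)
    {
        histogram.recordValue(System.nanoTime() - producerNanos);
    }

    // called at application exit
    public void report()
    {
        histogram.outputPercentileDistribution(System.out, 1.0);
    }
}</pre>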
<div>
Given that very little work is being done by the logic thread, it is reasonable to expect that inter-thread latency will be low, and consistent. In reality however, this is not the case.</div>
<div>
<br /></div>
<div>
I'm running these tests on my 4-CPU laptop, so operating system scheduling jitter is magnified; it would be less pronounced on a 64-CPU server-class machine, for instance, but the techniques used to investigate and reduce jitter are effectively the same.</div>
<div>
<br /></div>
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
System-jitter baseline</h3>
<div>
<br /></div>
<div>
Executing the example application for a period of time and inspecting the results shows that there is a large variation in the time taken to transit the Disruptor:</div>
<div>
<br /></div>
<br />
<span style="font-family: "courier new" , "courier" , monospace;">== Accumulator Message Transit Latency (ns) == </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">mean 60879 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">min 76 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">50.00% 168 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">90.00% 256 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.00% 2228239 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.90% 8126495 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.99% 10485823 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.999% 11534399 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.9999% 11534399 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">max 11534399 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">count 3595101
</span><br />
<div>
</div>
<br />
<div>
<br /></div>
<div>
<br /></div>
<div>
The fastest message to get through the Disruptor was 76 nanoseconds, but things rapidly degrade from there: 1 in 100 messages took longer than 2 milliseconds to pass between threads. The longest delay was 11 milliseconds - a difference of several orders of magnitude.<br />
<br />
Clearly something is happening on the system that is negatively affecting latency. Pauses introduced by the runtime (JVM) can be ruled out, as the application is garbage-free, performs warm-up cycles to exercise the JIT, and guaranteed safepoints are disabled. This can be confirmed by enabling safepoint logging, looking at the GC log and stdout output when <span style="font-family: "courier new" , "courier" , monospace;">-XX:+PrintCompilation</span> is enabled.</div>
<div>
<br /></div>
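<div>For reference, the kind of flags used for that confirmation might look like this (JDK 8 era, and illustrative; exact flags vary by JVM version):</div>
<pre class="hljs" style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; color: #444444; display: block; overflow-x: auto; padding: 0.5em;">java -XX:+PrintCompilation -XX:+PrintGCDetails \
     -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 ...</pre>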
<div>
<br />
<h3 style="text-align: left;">
CPU Speed</h3>
</div>
<div>
<br /></div>
<div>
Modern CPUs (especially on laptops) are designed to be power-efficient; this means that the OS will typically try to scale down the clock rate when there is no activity. On Intel CPUs, this is partially handled using power-states, which allow the OS to reduce CPU frequency, meaning less power draw, and less thermal overhead.
<br />
<br />
On current kernels, this is handled by the CPU scaling governor. You can check your current setting by looking in the file:
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
</span><br />
<br />
There is one directory entry in <span style="font-family: "courier new" , "courier" , monospace;">/sys/devices/system/cpu/</span> per available CPU on the machine. On my laptop, this is set to powersave mode. To see the available governors:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
</span><br />
<br />
which tells me that I have two choices:<br />
<br />
<ol style="text-align: left;">
<li><span style="font-family: "courier new" , "courier" , monospace;">performance</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">powersave</span></li>
</ol>
<br />
Before making a change though, let's make sure that powersave is actually causing issues.<br />
<br />
To do this, <a href="https://perf.wiki.kernel.org/index.php/Main_Page" style="box-sizing: border-box; color: #4078c0; text-decoration: none;">perf_events</a> can be used to monitor the CPU's P-state while the application is running:
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">perf record -e "power:cpu_frequency" -a
</span><br />
<br />
This command will sample the cpu_frequency trace point written to by the <a href="http://lxr.free-electrons.com/source/drivers/cpufreq/intel_pstate.c">Intel cpufreq driver</a> on all CPUs. This information comes from an MSR on the chip which holds the FSB speed.
<br />
<br />
Filtering entries to include only those samples taken when java was executing shows some variation in the reported frequency; clearly not ideal for achieving the lowest latency:
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">java 2804 [003] 3327.796741: power:cpu_frequency: state=1500000 cpu_id=3 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">java 2804 [003] 3328.089969: power:cpu_frequency: state=3000000 cpu_id=3 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">java 2804 [003] 3328.139009: power:cpu_frequency: state=2500000 cpu_id=3 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">java 2804 [003] 3328.204063: power:cpu_frequency: state=1000000 cpu_id=3
</span><br />
<br />
<a href="https://github.com/epickrram/perf-workshop/blob/master/src/main/shell/set_cpu_governor.sh">This script</a> can be used to set the scaling governor to performance mode to reduce the variation:
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">sudo bash ./set_cpu_governor.sh performance
</span><br />
<br />
Running the application again with the performance governor enabled produces better results for inter-thread latency. Monitoring with <span style="font-family: "courier new" , "courier" , monospace;">perf</span> shows that the cpu_frequency events are no longer emitted.
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">== Accumulator Message Transit Latency (ns) == </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">mean 23882 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">min 84 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">50.00% 152 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">90.00% 208 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.00% 589827 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.90% 4456479 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.99% 7340063 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.999% 7864351 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.9999% 8126495 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">max 8126495 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">count 3595101
</span><br />
<br />
<br />
Though there is still a max latency of 8ms, it has been reduced from the previous value of 11ms.<br />
<div style="box-sizing: border-box; color: #333333; font-size: 16px; line-height: 25.6px; margin-bottom: 16px;">
<span style="font-family: inherit;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/performance-chart.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="426" src="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/performance-chart.png" width="640" /></a></div>
<div style="box-sizing: border-box; color: #333333; font-size: 16px; line-height: 25.6px; margin-bottom: 16px;">
<span style="font-family: inherit;"><br /></span></div>
</div>
<br />
<br />
<h3 style="text-align: left;">
Process migration</h3>
<div>
<br /></div>
Another likely cause of scheduling jitter is the OS scheduler moving processes around as different tasks become runnable. The important threads in the application are at the mercy of the scheduler, which can, when invoked, decide to run another process on the current CPU. When this happens, the running thread's context will be saved, and it will be shifted back into the scheduler's run-queue (or possibly migrated to another CPU entirely).
<br />
<div>
<br /></div>
To find out whether this is happening to the threads in our application, we can turn to <span style="font-family: "courier new" , "courier" , monospace;">perf</span> again and sample trace events emitted by the scheduler. In this case, sampling the <span style="font-family: "courier new" , "courier" , monospace;">sched_stat_runtime</span> event will show what CPU has been playing host to the application threads.
<br />
<div>
<br /></div>
One row of output from <span style="font-family: "courier new" , "courier" , monospace;">perf script</span> shows that the java thread executed on CPU1 for a duration of 1.000825 milliseconds:
<br />
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;">java 11372 [001] 3055.140623: sched:sched_stat_runtime: comm=java pid=11372 runtime=1000825 [ns] vruntime=81510486145 [ns]
</span><br />
<div>
<br /></div>
A bit of sorting and counting will show exactly how the process was moved around the available CPUs during its lifetime:
<br />
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;">perf script | grep "java 11372" | awk '{print $3}' | sort | uniq -c </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">... </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">16071 [000] </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">10858 [001] </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 5778 [002] </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 7230 [003]
</span><br />
<div>
<br /></div>
<br />
So this thread mostly ran on CPUs 0 and 1, but also spent some time on CPUs 2 and 3. Each move requires a context switch and causes cache invalidation effects. While these are unlikely to be the sole sources of the maximum latency, stopping migration of these processes is a necessary first step towards improving the worst case.
<br />
<br />
<h3 style="text-align: left;">
Thread affinity</h3>
<div>
<br /></div>
Thread affinity can be used to force processes to run on a specific CPU or set of CPUs. This is achieved either by using <span style="font-family: "courier new" , "courier" , monospace;">taskset</span> when launching a program, or by making the <span style="font-family: "courier new" , "courier" , monospace;">sched_setaffinity</span> system call from within an application. Using this technique to stop migration of latency-sensitive threads has a positive effect on the latency jitter experienced in the application:
<br />
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/pinned-thread-chart.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="426" src="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/pinned-thread-chart.png" width="640" /></a></div>
<br />
<br /></div>
<div>
<br /></div>
This result implies that forcing the threads to run on a single CPU can help reduce inter-thread latency. Whether this is down to the scheduler making better decisions about where to run <i>other</i> processes, or simply down to less context switching, is not clear.<br />
<br />
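Affinity can be applied externally via <span style="font-family: "courier new" , "courier" , monospace;">taskset</span>, as described above; for pinning a thread from inside the JVM, one option is the open-source OpenHFT Java-Thread-Affinity library. The sketch below is purely illustrative - it assumes that library is on the classpath, and was not part of the setup used for these measurements:<br />
<br />
<pre style="background: #f0f0f0; display: block; overflow-x: auto; padding: 0.5em;">
import net.openhft.affinity.AffinityLock;

public final class PinnedWorker implements Runnable
{
    @Override
    public void run()
    {
        // acquire a free CPU; the calling thread stays pinned to it
        // until release() is called
        final AffinityLock lock = AffinityLock.acquireLock();
        try
        {
            // latency-sensitive work runs here, free from migration
        }
        finally
        {
            lock.release();
        }
    }
}
</pre>
<br />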
One thing to look out for is that no effort has been made to stop the scheduler from running other tasks on those CPUs. There are still multi-millisecond delays in message passing, and this could be down to other processes being run on the CPUs to which the application threads have been restricted.<br />
<br />
Returning to <span style="font-family: "courier new" , "courier" , monospace;">perf</span> and this time capturing all <span style="font-family: "courier new" , "courier" , monospace;">sched_stat_runtime</span> events for a specific CPU (in this case 1) will show which other processes are being scheduled while the application is running:<br />
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">perf record -e "sched:sched_stat_runtime" -C 1</span><br />
<br />
<br />
Stripping out everything but the process name and counting occurrences in the event trace shows that, while the java application was running most of the time, plenty of other processes were also scheduled during its execution:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">45514 java </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 60 kworker/1:2 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 26 irq/39-DLL0665: </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 24 rngd </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 15 rcu_sched </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 9 gmain </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 8 goa-daemon </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 7 chrome </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 6 ksoftirqd/1 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 5 rtkit-daemon
</span><br />
<div>
<br />
<br />
<h3 style="text-align: left;">
CPU Isolation</h3>
<div>
<br /></div>
At this point, it's time to remove the target CPUs from the OS's scheduling domain. This can be done with the <span style="font-family: "courier new" , "courier" , monospace;">isolcpus</span> boot parameter (i.e. add <span style="font-family: "courier new" , "courier" , monospace;">isolcpus=&lt;cpu-list&gt;</span> to <span style="font-family: "courier new" , "courier" , monospace;">grub.conf</span>), or by using the <span style="font-family: "courier new" , "courier" , monospace;">cset</span> command from the <span style="font-family: "courier new" , "courier" , monospace;">cpuset</span> package.<br />
<br />
Using this method, the scheduler is prevented from running other user-space processes on the CPUs hosting the latency-sensitive application threads. In combination with thread affinity, this should mean that the application threads always have CPU resource available, and are effectively always running.<br />
<br />
The difference in inter-thread latency is dramatic - maximum latency is down to 14 microseconds:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">== Accumulator Message Transit Latency (ns) == </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">mean 144 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">min </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 84 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">50.00% </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 144 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">90.00% </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 160 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.00% </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 208 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.90% </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 512 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.99% </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 2432 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.999% </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 3584 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">99.9999% </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 11776 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">max </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 14848 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">count </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 3595101 </span><br />
<br />
<br />
The difference is so great that it's necessary to use a log scale for the y-axis of the chart.
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/isolcpus-chart-log-scale.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="426" src="https://raw.githubusercontent.com/epickrram/perf-workshop/master/doc/isolcpus-chart-log-scale.png" width="640" /></a></div>
<br />
<br />
Note that the difference will not be so great on a server-class machine with plenty of spare processing power. The effect here is magnified by the fact that the OS only has 4 CPUs to work with (this is my laptop), running a desktop distribution of Linux, so there is much more scheduling pressure than would be present on a server-class machine.<br />
<br />
Using <span style="font-family: "courier new" , "courier" , monospace;">perf</span> once again to check whether other processes are running on the reserved CPUs shows that there is still some contention to deal with:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">81130 java </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 2 ksoftirqd/1 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 43 kworker/1:0 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1 kworker/1:1H </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 2 kworker/3:1 </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1 kworker/3:1H </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 11 swapper </span><br />
<br />
<br />
The processes starting with 'k' are kernel threads that deal with house-keeping tasks on behalf of the OS; 'swapper' is the Linux idle task, which is scheduled whenever there is no other work to be executed on a CPU.
<br />
<br />
<h3 style="text-align: left;">
Conclusion</h3>
<div>
<br /></div>
<div>
CPU isolation and thread affinity are powerful tools that can help reduce runtime jitter introduced by the OS scheduler. Linux tracing tools such as perf_events are invaluable for inspecting the state of running processes when determining sources of jitter. For low-latency applications, orders-of-magnitude reductions in jitter can be achieved by applying these techniques.<br />
<br />
This post has introduced some techniques for observing and fixing system jitter. Examples in the post were generated using the application available in <a href="https://github.com/epickrram/perf-workshop">this github repository</a>, where there is also a walk-through of the steps used to generate the data for this post.</div>
<br />
<h3 style="text-align: left;">
There's more..</h3>
<br />
This post describes the start of the journey towards tuning Linux for low-latency applications taken at LMAX Exchange. Dealing with other causes of runtime jitter is covered in the <a href="http://epickrram.blogspot.co.uk/2015/11/reducing-system-jitter-part-2.html">follow-up post</a>.</div>
<br />
<br />
<br />
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
</div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-4022285614263649092015-09-07T13:07:00.001+01:002020-09-14T19:33:09.167+01:00Runtime Jitter - Zing vs Hotspot<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">This is an update to my last post exploring behaviour in the Oracle/OpenJDK JVM that forces a periodic safepoint under normal conditions.<br />
<br />
After that post was published, <a href="https://twitter.com/giltene">Gil Tene</a> from Azul Systems <a href="http://epickrram.blogspot.co.uk/2015/08/jvm-guaranteed-safepoints.html?showComment=1438964693407#c2542696725323660399">asked</a> how their Zing JVM matched up against Oracle/OpenJDK in terms of runtime jitter.<br />
<br />
At LMAX Exchange, we have been using the Zing JVM for several years, after we found that it removed a large proportion of our latency outliers, namely those caused by garbage collection pauses.<br />
<br />
<h4 style="text-align: left;">
Measuring runtime jitter</h4>
<div>
<br /></div>
<div>
These measurements were initially made to try to demonstrate jitter introduced to an application by the Linux kernel's scheduler. Since the kernel is responsible for allocating CPU time to runnable processes, its scheduling decisions can show up as sources of latency in well-tuned applications.</div>
<div>
<br /></div>
<div>
In this application, a 'producer' thread makes a call to <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span> (which under the hood uses the monotonic clock on Linux) and passes the result into an instance of the <span style="font-family: "courier new" , "courier" , monospace;">Disruptor</span>. On the consuming side, an 'accumulator' thread also calls <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span>, then records the delta (in nanoseconds) into an <span style="font-family: "courier new" , "courier" , monospace;">HdrHistogram</span>.</div>
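<div>
<br /></div>
<div>
A minimal sketch of that measurement loop is shown below. The class and field names are hypothetical, and this is an illustration of the technique rather than the actual test tool:</div>
<div>
<br /></div>
<pre style="background: #f0f0f0; display: block; overflow-x: auto; padding: 0.5em;">
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;
import org.HdrHistogram.Histogram;

public final class TransitLatencyHarness
{
    // event carrying the producer-side timestamp
    static final class TimestampEvent
    {
        long producerNanos;
    }

    public static void main(final String[] args) throws Exception
    {
        final Histogram histogram = new Histogram(3);
        final Disruptor&lt;TimestampEvent> disruptor = new Disruptor&lt;>(
                TimestampEvent::new, 1024, DaemonThreadFactory.INSTANCE);

        // accumulator thread: records the inter-thread delta into an HdrHistogram
        final EventHandler&lt;TimestampEvent> accumulator =
                (event, sequence, endOfBatch) ->
                        histogram.recordValue(System.nanoTime() - event.producerNanos);
        disruptor.handleEventsWith(accumulator);
        disruptor.start();

        // producer thread: timestamps each message as it is published
        final RingBuffer&lt;TimestampEvent> ringBuffer = disruptor.getRingBuffer();
        for (int i = 0; i &lt; 1_000_000; i++)
        {
            final long sequence = ringBuffer.next();
            ringBuffer.get(sequence).producerNanos = System.nanoTime();
            ringBuffer.publish(sequence);
        }

        Thread.sleep(1000L);
        histogram.outputPercentileDistribution(System.out, 1.0);
    }
}
</pre>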
<div>
<br /></div>
<div>
Using this method, we can explore the time taken to pass a message between two threads. In the majority of cases, this will be very quick, but there will be outliers introduced by the runtime (i.e. JVM) and the operating system (i.e. Linux scheduler).</div>
<div>
<br /></div>
<div>
While developing this application, I came across the 100-microsecond jitter introduced by the Oracle/OpenJDK's forced safepoint behaviour. This is discussed in more detail in my previous post.</div>
<div>
<br /></div>
<div>
So how do these two JVMs fare against each other?</div>
<div>
<br /></div>
<h4 style="text-align: left;">
Comparing JVMs</h4>
<div>
<br /></div>
<div>
In the results below, the effect of the forced safepoints is clear, giving a maximum jitter of around 100 microseconds:</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<b>Oracle/OpenJDK with forced safepoints enabled (default)</b>:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">== Accumulator Message Transit Latency (ns) ==</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">mean 269</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">min 168</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">50.00% 216</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">90.00% 464</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.00% 608</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.90% 736</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.99% 960</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.999% 4352</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.9999% 15872</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">max 106496</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">count 3595101</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Disabling the forced safepoints removes the outliers:</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<b>Oracle/OpenJDK with forced safepoints disabled</b>:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">== Accumulator Message Transit Latency (ns) ==</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">mean 385</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">min 152</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">50.00% 352</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">90.00% 464</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.00% 640</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.90% 768</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.99% 864</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.999% 3072</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.9999% 17408</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">max 20480</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">count 3595101</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">Comparing this to Zing's default behaviour:</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;"><b>Zing:</b></span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">== Accumulator Message Transit Latency (ns) ==</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">mean 263</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">min 136</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">50.00% 256</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">90.00% 288</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.00% 448</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.90% 512</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.99% 608</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.999% 3200</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">99.9999% 10240</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">max 13312</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">count 3595101</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
For those who like a visual representation, here's the comparison in chart form:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/percentiles_chart.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="http://img.epickrram.com/blog/percentiles_chart.png" width="640" /></a></div>
<div>
<br /></div>
<div>
And in log-scale:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/percentiles_chart_logy.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="http://img.epickrram.com/blog/percentiles_chart_logy.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<h4 style="text-align: left;">
Conclusion</h4>
<div>
<br /></div>
<div>
Zing doesn't need to force periodic safepoints during normal operation, so assuming that you don't have any other sources of jitter in your program, you'll get a flatter latency profile with out-of-the-box behaviour.</div>
<div>
<br /></div>
<div>
It is possible to restrict the forced safepoint behaviour of the Oracle/OpenJDK, but the consequences of doing so are unclear. Dragons may well be involved.</div>
<div>
<br /></div>
<div>
At LMAX Exchange, we currently run our microbenchmarks on Oracle JDK, and we do suppress the periodic forced safepoint behaviour in order to reduce jitter in the results. So far, there have been no adverse effects, but our microbenchmarks only run for short periods of time.</div>
<div>
<br /></div>
<div>
<br /></div>
<h4 style="text-align: left;">
About the tests</h4>
<div>
<br /></div>
<div>
OracleJDK version:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">java version "1.8.0_60"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Java(TM) SE Runtime Environment (build 1.8.0_60-b27)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)</span></div>
</div>
<div>
<br /></div>
<div>
Zing version:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">java version "1.8.0-zing_15.05.0.0"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Zing Runtime Environment for Java Applications (build 1.8.0-zing_15.05.0.0-b8)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Zing 64-Bit Tiered VM (build 1.8.0-zing_15.05.0.0-b16-product-azlinuxM-X86_64, mixed mode)</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
OracleJDK flags:</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Baseline run:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">-XX:+DisableExplicitGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution -XX:-UseBiasedLocking -Xmx4g -Xms4g</span></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Disabled forced safepoints run:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">-XX:+UnlockDiagnosticVMOptions -XX:GuaranteedSafepointInterval=600000 -XX:+DisableExplicitGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution -XX:-UseBiasedLocking -Xmx4g -Xms4g</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">Zing JVM flags:</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">-XX:-UseMetaTicks -XX:-UseTickProfiler -XX:GenPauselessNewThreads=2 -XX:GenPauselessOldThreads=2 -XX:+ConcurrentDeflation -XX:+UseRdtsc</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">These tests were run on a highly-tuned Linux system, utilising such marvellous techniques as CPU isolation, thread affinity, cache-friendly location, and other magic fairy dust.</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">An upcoming blog post will go into more detail on how to get OS scheduler jitter down to the low tens-of-microseconds.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"><br /></span>
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<span style="font-family: inherit;"><br /></span></div>
<div>
</div>
</div>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-17814109365799539052015-08-05T09:34:00.001+01:002020-09-14T19:33:23.362+01:00JVM guaranteed safepoints<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">I have been working on a small tool to measure the effects of system jitter within a JVM; it is a very simple app that measures inter-thread latencies. The tool's primary purpose is to demonstrate the use of linux performance tools such as perf_events and ftrace in finding causes of latency.<br />
<br />
Before using this tool for a demonstration, I wanted to make sure that it was actually going to behave in the way I intended. During testing, I always seemed to end up with a max inter-thread latency of around 100us. I diligently got stuck into investigating with perf_events, but could never detect any jitter at the system level greater than around 50us.<br />
<br />
Where was this mysterious extra 50us latency coming from? The first assumption, of course, was that I had made a mistake, so I took some time looking at the code and doing extra logging. Eventually, I went the whole hog and just ftraced the entire CPU running the thread that I was measuring. Still no sign of system-level jitter that would explain the latency.<br />
<br />
In exasperation, I enabled GC logging for the application. Previously, I had not done this as I knew that the application did not produce any garbage after its initial setup, and I also excluded the first sets of results to reduce pollution by warm-up effects.<br />
<br />
Lo and behold, the GC log contained the following entries:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">2015-08-05T09:11:51.498+0100: <b>20.204</b>: Total time for which application threads were stopped: <b>0.0000954</b> seconds</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">2015-08-05T09:11:52.498+0100: <b>21.204</b>: Total time for which application threads were stopped: <b>0.0000948</b> seconds</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">2015-08-05T09:11:56.498+0100: <b>25.205</b>: Total time for which application threads were stopped: <b>0.0001147</b> seconds</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">2015-08-05T09:12:06.499+0100: <b>35.206</b>: Total time for which application threads were stopped: <b>0.0001127</b> seconds</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">2015-08-05T09:12:10.499+0100: <b>39.206</b>: Total time for which application threads were stopped: <b>0.0000983</b> seconds</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">2015-08-05T09:12:11.499+0100: <b>40.206</b>: Total time for which application threads were stopped: <b>0.0000900</b> seconds</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">2015-08-05T09:12:12.500+0100: <b>41.206</b>: Total time for which application threads were stopped: <b>0.0001421</b> seconds</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">2015-08-05T09:12:17.500+0100: <b>46.207</b>: Total time for which application threads were stopped: <b>0.0000865</b> seconds</span><br />
<br />
<br />
Note the rather suspicious timing of these events - always ~205ms into the second, and ~100us in magnitude.<br />
<br />
Now, because I used the extremely useful <span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>-XX:+PrintGCApplicationStoppedTime</b></span> flag, the log also contained pauses caused by safepoints. Given that I knew my application was not allocating memory, and therefore should not be causing the collector to run, I assumed that these pauses were down to safepoints.<br />
<br />
The next useful flag in the set of "<i>Hotspot flags you never knew you needed, until you need them</i>" is <span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>-XX:+PrintSafepointStatistics</b></span>.<br />
<br />
With this flag enabled, the output was immediately enlightening:<br />
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>20.204</b>: no vm operation [...</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>21.204</b>: no vm operation [...</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>25.205</b>: no vm operation [...</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>35.205</b>: no vm operation [...</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>39.206</b>: no vm operation [...</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>40.206</b>: no vm operation [...</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>41.206</b>: no vm operation [...</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>46.207</b>: no vm operation [...</span><br />
<div>
<br /></div>
<div>
<br /></div>
The timestamps from the safepoint statistics match up perfectly with the entries in the GC log, so I can be sure of my assumption that the pauses are due to safepointing.<br />
<br />
A little time spent with your favourite search engine will yield the information that 'no vm operation' is the output produced when the runtime forces a safepoint.<br />
<br />
Further searching revealed that the frequency of these forced safepoints can be controlled with the following flags:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><b>-XX:+UnlockDiagnosticVMOptions -XX:GuaranteedSafepointInterval=300000</b></span><br />
<br />
These flags instruct the runtime to guarantee a safepoint only every 300 seconds; since the duration of my test is less than that, there should be no forced safepoints. The default interval can be discovered by running the following command:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">]$ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal 2>&1 | grep Safepoint</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"></span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"> intx GuaranteedSafepointInterval = <b>1000</b> {diagnostic}</span><br />
<div>
<br /></div>
<br />
Running my test tool with these settings removed the 100us jitter, so I'm just left with 50us introduced by the OS and hardware.<br />
<br />
As mentioned in some of the mailing lists I stumbled across, it's probably not a good idea to stop the JVM from performing cleanup that requires a safepoint. For this testing tool however, it helps to remove a source of uncertainty that was dominating the jitter produced by the OS.<br />
<br />
<h3 style="text-align: left;">
Update</h3>
<br />
At LMAX we run a number of microbenchmarks to ensure that our code does not suffer any regressions in performance. By their very nature, these tests tend to have a fair amount of noise in terms of single-shot execution time.<br />
<br />
I applied the previously mentioned flags to the JVM running the tests; the reduction in the max execution time is clearly displayed in the chart below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/safepoint_jitter_microbenchmarks.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="275" src="http://img.epickrram.com/blog/safepoint_jitter_microbenchmarks.png" width="640" /></a></div>
<br />
<br />
So aside from just being a curiosity for those who like digging into JVM internals, these flags could have a real benefit for reducing noise in benchmark tests.<br />
<br />
<i>Tests were performed using Oracle JDK 1.8.0_20.</i><br />
<i><br /></i>
<i><br /></i>
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<i><br /></i>
<i><br /></i></div>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-36003949221822282092015-07-03T13:27:00.001+01:002020-09-14T19:33:37.367+01:00seek() + write() vs pwrite()<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">In my <a href="http://epickrram.blogspot.com/2015/05/improving-journalling-latency.html">last post</a>, I described the benefits of upgrading hardware, operating system and file-system in order to improve the latency of file writes.<br />
<br />
After it was published, <a href="https://twitter.com/mjpt777/status/606057487360512000">Martin asked</a> whether we were using <span style="font-family: "courier new" , "courier" , monospace;">FileChannel.write(ByteBuffer, long)</span>, which calls the <span style="font-family: "courier new" , "courier" , monospace;">pwrite</span> function, in order to make fewer system calls. We are not currently using this method, though we have experimented with it in the past. Before our recent changes to reduce write latency, we probably didn't notice the overhead of the extra system call due to the background noise on the system.<br />
<br />
Since improvements have been made, it is worth re-testing using the <span style="font-family: "courier new" , "courier" , monospace;">pwrite</span> method to see if there is any measurable benefit. We'll get to the numbers later, but first let's just have a look at the difference between the two approaches.<br />
<br />
<b>WARNING</b>: Wall of text approaching. If you are averse to JDK source code, you can skip down to "Enough code already".<br />
<br />
<h4 style="text-align: left;">
Seek...</h4>
<br />
Our standard journalling technique, which we previously discovered to be the best for latency & throughput, is to write using a <span style="font-family: "courier new" , "courier" , monospace;">RandomAccessFile</span>. The API of <span style="font-family: "courier new" , "courier" , monospace;">RandomAccessFile</span> requires that the programmer set the write position on the file, then call the write method with the data that is to be written.<br />
<br />
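In application code, the pattern looks something like this (a simplified sketch with a hypothetical file name, not our production journaller):<br />
<br />
<pre style="background: #f0f0f0; display: block; overflow-x: auto; padding: 0.5em;">
import java.io.IOException;
import java.io.RandomAccessFile;

public final class SeekThenWrite
{
    public static void main(final String[] args) throws IOException
    {
        final byte[] message = "payload".getBytes();
        try (RandomAccessFile journal = new RandomAccessFile("journal.dat", "rw"))
        {
            // two trips across the JNI bridge per message: lseek, then write
            journal.seek(journal.length());
            journal.write(message);
        }
    }
}
</pre>
<br />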
The call starts in <span style="font-family: "courier new" , "courier" , monospace;">RandomAccessFile.seek</span>, which calls the native method <span style="font-family: "courier new" , "courier" , monospace;">RandomAccessFile.seek0</span>:<br />
<br />
<script src="https://gist.github.com/epickrram/53e7de49b4e3977f9fbf.js"></script>
Across the JNI bridge in native code, some bounds checking is performed before a call to the utility function <span style="font-family: "courier new" , "courier" , monospace;">IO_Lseek</span>:<br />
<br />
<script src="https://gist.github.com/epickrram/e299bf98de9375a01e40.js"></script><span style="font-family: "courier new" , "courier" , monospace;">
IO_Lseek</span> is mapped to the <span style="font-family: "courier new" , "courier" , monospace;">lseek64</span> system call:<br />
<br />
<script src="https://gist.github.com/epickrram/21c831fd3da286db86f4.js"></script><span style="font-family: "courier new" , "courier" , monospace;">
lseek64</span> is responsible for updating the file offset. See <span style="font-family: "courier new" , "courier" , monospace;">man lseek</span> for more details.<br />
<br />
<h4 style="text-align: left;">
... then write</h4>
<br />
Once we have the pointer in the correct position, we call the <span style="font-family: "courier new" , "courier" , monospace;">write</span> method. This delegates to the native function <span style="font-family: "courier new" , "courier" , monospace;">RandomAccessFile.writeBytes</span>:<br />
<br />
<script src="https://gist.github.com/epickrram/987e89bab669b1010568.js"></script>
This function then delegates to the utility method <span style="font-family: "courier new" , "courier" , monospace;">IO_Util.writeBytes</span>:<br />
<br />
<script src="https://gist.github.com/epickrram/29914403afd1bd17942d.js"></script>
Now we get to the actual work - if your data is greater than 8k in length, a new buffer is allocated using <span style="font-family: "courier new" , "courier" , monospace;">malloc</span>, otherwise a stack buffer is used. The runtime then copies data from the java byte array to the native buffer, and the result is passed on to <span style="font-family: "courier new" , "courier" , monospace;">IO_Write</span>.<br />
<br />
<br />
<script src="https://gist.github.com/epickrram/6162cdefe29a8fe01bfe.js"></script><span style="font-family: "courier new" , "courier" , monospace;">
IO_Write</span> is mapped to <span style="font-family: "courier new" , "courier" , monospace;">handleWrite</span>, which finally calls the kernel's <span style="font-family: "courier new" , "courier" , monospace;">write</span> function:<br />
<br />
<script src="https://gist.github.com/epickrram/e63d968c31ae5c6f510e.js"></script>
<br />
<h4 style="text-align: left;">
Direct write</h4>
<br />
Performing a direct write with <span style="font-family: "courier new" , "courier" , monospace;">FileChannel.write</span> is quite different by comparison. After a quick check to see that the file is open, it's straight off to the JDK-internal <span style="font-family: "courier new" , "courier" , monospace;">IOUtil</span> class:<br />
<br />
<script src="https://gist.github.com/epickrram/f530936b403794d68737.js"></script>
Here we can see the benefit of using a <span style="font-family: "courier new" , "courier" , monospace;">DirectBuffer</span> (assigned with <span style="font-family: "courier new" , "courier" , monospace;">ByteBuffer.allocateDirect()</span> or <span style="font-family: "courier new" , "courier" , monospace;">FileChannel.map()</span>). If you're using an on-heap <span style="font-family: "courier new" , "courier" , monospace;">ByteBuffer</span>, then the runtime will copy your data to a pooled <span style="font-family: "courier new" , "courier" , monospace;">DirectBuffer</span> before attempting to continue:<br />
<br />
<script src="https://gist.github.com/epickrram/44526194f4657f7e8b5a.js"></script>
Next in the chain is <span style="font-family: "courier new" , "courier" , monospace;">FileDispatcherImpl</span>, an implementation of <span style="font-family: "courier new" , "courier" , monospace;">NativeDispatcher</span>, which calls the native function <span style="font-family: "courier new" , "courier" , monospace;">pwrite0</span>:
<br />
<br />
<script src="https://gist.github.com/epickrram/2fb9b5339361b69d5bdd.js"></script>
This function is simply a wrapper around the native <span style="font-family: "courier new" , "courier" , monospace;">pwrite64</span> call:
<br />
<br />
<script src="https://gist.github.com/epickrram/8bbdb9600cf41614c6ce.js"></script>
See <span style="font-family: "courier new" , "courier" , monospace;">man pwrite64</span> for more details.<br />
<br />
<h4 style="text-align: left;">
Enough code already</h4>
<br />
The two call-chains can be summarised as:
<br />
<br />
<h3 style="text-align: left;">
seek/write</h3>
<br />
<span style="font-family: "courier new" , "courier" , monospace;">RandomAccessFile.seek (java)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> -> RandomAccessFile.seek0 (native)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> -> lseek (syscall)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">RandomAccessFile.write (java)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> -> RandomAccessFile.writeBytes (native)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> -> io_util#writeBytes (native)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> bounds check</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> allocate buffer if payload > 8k</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> copy data to buffer</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> -> io_util#handleWrite (native)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> -> write (syscall)</span><br />
<br />
<br />
<h3 style="text-align: left;">
pwrite</h3>
<br />
<span style="font-family: "courier new" , "courier" , monospace;">FileChannelImpl.write (java)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> -> IOUtil.write (java)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> copy to direct buffer if necessary</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> -> FileDispatcherImpl.pwrite (java)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> -> FileDispatcherImpl.pwrite0 (native)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> -> pwrite (syscall)</span><br />
<br />
So it certainly seems as though the direct write method should be faster - fewer system calls, and less copying of data (assuming that a <span style="font-family: "courier new" , "courier" , monospace;">DirectBuffer</span> is used in the JVM).<br />
<br />
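For comparison, the positional-write pattern looks something like this in application code (again a simplified sketch with hypothetical names) - no seek is required, and using a direct buffer avoids the intermediate copy:<br />
<br />
<pre style="background: #f0f0f0; display: block; overflow-x: auto; padding: 0.5em;">
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public final class PositionalWrite
{
    public static void main(final String[] args) throws IOException
    {
        // a direct buffer avoids the copy into a pooled DirectBuffer
        final ByteBuffer buffer = ByteBuffer.allocateDirect(4096);
        buffer.put("payload".getBytes());
        buffer.flip();

        try (FileChannel channel = FileChannel.open(Paths.get("journal.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE))
        {
            // a single pwrite system call; the channel's position is untouched
            channel.write(buffer, channel.size());
        }
    }
}
</pre>
<br />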
<h4 style="text-align: left;">
Is it faster?</h4>
<div>
<br /></div>
<div>
Benchmarks. They're dangerous. As has been pointed out on <a href="http://shipilev.net/talks/oredev-Nov2013-benchmarking.pdf">numerous</a> <a href="http://shipilev.net/talks/jvmls-July2014-benchmarking.pdf">occasions</a>, any results you get from benchmarks should be taken with a pinch of salt, and will not necessarily represent the measurements you think they do.</div>
<div>
<br /></div>
<div>
Before jumping in with both feet and replacing our journaller implementation, I thought I would try to isolate the change with a small benchmark that performs a similar workload to our exchange, with the ability to swap out the implementation of the write call.</div>
<div>
<br /></div>
<div>
Code is available <a href="https://github.com/epickrram/journalling-benchmark">here</a>.</div>
<div>
<br /></div>
<div>
These tests were run on a Dell PowerEdge R720xd, using a tmpfs file-system mounted with the 'noatime' flag, with the following arguments:<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">java -jar journalling-benchmark-all-0.0.1.jar -i 50 -d /mnt/jnl -f SHORT -t pwrite -w 20</span></div>
<div>
<br />
<span style="font-family: inherit;">CPUs were isolated to reduce scheduling jitter, though I didn't go as far as thread-pinning. I assume that 20 measurement iterations is enough to weed out jitter caused by the scheduler.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Why tmpfs? Since this is a micro-benchmark, I wanted to test just the code path, separated from the underlying storage medium. If there is a significant difference, then this is something to try on a real file-system, under real load.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Since a picture paints a thousand words, and I've already written far too many for one post, let's have a look at some numbers:</span><br />
<span style="font-family: inherit;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/min_write_latency.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="292" src="http://img.epickrram.com/blog/min_write_latency.png" width="640" /></a></div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<div style="text-align: center;">
<i>Min write latency</i></div>
<div style="text-align: left;">
<i><br /></i></div>
<div style="text-align: left;">
Min write latency is ~900ns for pwrite, ~1400ns for seek/write - this is explained by the fact that we do double the work for seek/write (i.e. two JNI calls, two syscalls) compared to pwrite.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/four_nines_write_latency.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="296" src="http://img.epickrram.com/blog/four_nines_write_latency.png" width="640" /></a></div>
<div style="text-align: left;">
<br /></div>
</div>
<div>
<div style="text-align: center;">
<i>Four-nines write latency</i></div>
<div style="text-align: left;">
<i><br /></i></div>
<div style="text-align: left;">
Apart from the obvious warm-up issue in seek/write, it looks as though at the four-nines pwrite is solid at ~30us, while seek/write stabilises at ~50us.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/max_write_latency.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="292" src="http://img.epickrram.com/blog/max_write_latency.png" width="640" /></a></div>
<div style="text-align: center;">
<i>Max write latency</i></div>
<div style="text-align: left;">
<i><br /></i></div>
<div style="text-align: left;">
Maximum write latency is where pwrite really shines - there is far less jitter, with the max at ~50us, whereas seek/write suffers from more variability, up to 100us. Again, this is probably down to the extra JNI/syscall work. Further work would be needed to figure out why the max time for seek/write is more than double that of pwrite in some cases.</div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
<i><br /></i></div>
</div>
<div>
<h4 style="text-align: left;">
Conclusions</h4>
<br />
Using <span style="font-family: "courier new" , "courier" , monospace;">FileChannel.write(ByteBuffer, long)</span> in order to perform file operations <i>should</i> result in better performance, with less jitter.<br />
<br />
The cost of any computation done in the JDK classes for I/O appears to be overshadowed by the JNI/syscall overhead.<br />
<br />
These results were generated using an artificial workload, on a memory-based filesystem, so should be viewed as best-possible results. Results for real-world workloads may vary....<br />
<br />
In the near future, I'll make this change in our performance environment and do the same measurements in order to observe the effects of such a change in a real system. Results will be published when I have them.<br />
<br />
<br />
<br />
<i>Thanks to Mike Barker for reviewing my benchmark code. Any remaining mistakes are wholly my own!</i><br />
<i><br /></i>
<i><br /></i>
<i><br /></i>
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<i><br /></i></div>
</div>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-70320895930952863952015-05-15T16:28:00.002+01:002020-09-14T19:33:54.890+01:00Improving journalling latency<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">For the last few months at LMAX Exchange, we've been working on building out our next generation platform. Every few years we refresh our hardware and upgrade the machines that run our systems, and this time we decided to have a look at upgrading the operating system at the same time.<br />
<br />
When our first generation exchange was built, we were happy with low-millisecond-level mean latencies. After a couple of years of operation, we upgraded to newer hardware, made some significant software changes and ended up with mean end-to-end latencies of around 250 microseconds. With our latest set of changes, we are aiming for sub-100 microsecond mean latency and significantly reduced jitter.<br />
<br />
These changes should stand us in good stead for another year or two, before we repeat the cycle to further improve performance. In order to achieve this goal, we have modified and tuned our hardware, system architecture, operating system and application software.<br />
<br />
In my next few posts, I will be describing our experiences of doing this, and lessons we've learned along the way.<br />
<br />
<br />
<h3 style="text-align: left;">
Recap - LMAX Exchange Architecture</h3>
<div style="text-align: left;">
<br />
For a detailed description of our architecture and how it all fits together, I recommend watching my colleague's talk here: <a href="https://skillsmatter.com/skillscasts/5247-the-lmax-exchange-architecture-high-throughput-low-latency-and-plain-old-java">Lmax Exchange Architecture: High-throughput Low-latency and plain old Java</a>.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
For now, I'll give a high-level overview of message flow in our system.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/LMAX-Macro-arch-DataFlow.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" border="0" height="540" src="http://img.epickrram.com/blog/LMAX-Macro-arch-DataFlow.png" title="LMAX Exchange Architecture Overview" width="640" /></a></div>
<div style="text-align: left;">
<br /></div>
<br />
<br />
<ol style="text-align: left;">
<li>Inputs to our exchange arrive either from market makers who are generally responsible for making prices in the market, or customers who generally take the prices. All market maker traffic is based on the FIX protocol, customer traffic can be either FIX protocol, or XML/JSON over HTTP. The services responsible for access control and protocol translation are referred to as 'Gateways'.</li>
<li>Inbound and outbound traffic is tapped at the edge of our network. This allows us to have an authoritative record of data transfer with our customers, and also provides the ideal place to measure end-to-end latency. </li>
<li>Market maker orders are converted to our internal message protocol, then routed over UDP straight to the matching engine.</li>
<li>Customer orders are translated to internal messages, then routed to the order management and pre-trade risk engine (<b>4a</b>). Assuming that the customer has sufficient funds, their order will be forwarded to the matching engine (<b>4b</b>).</li>
<li>The matching engine and order management engine are what we refer to as 'Core Services'. </li>
<li>These services journal inbound messages to disk (<b>6a</b>), and have an HA pair that receive and acknowledge messages (<b>6b</b>). Using these mechanisms, we protect ourselves from single-node service failure. Once messages have been journalled and replicated, they are passed on to the business-logic.</li>
<li>Responses are published over UDP to the gateways for transmission back to the market maker or customer.</li>
</ol>
This describes the data flow in what we refer to as our latency-sensitive path. In the diagram above, each wheel icon represents a <a href="https://github.com/LMAX-Exchange/disruptor">Disruptor</a> instance, which is used extensively in our system.<br />
<br />
Given this architecture, we necessarily tend to focus our attention on the core services, since these do the most work and actually model the business domain. The gateways are very lightweight, mainly just doing translation work; they also have the nice property of being horizontally scalable, should we need to spread the load across more instances.<br />
<br />
Two of the main costs that we need to address are the time taken to journal messages to disk, and the time taken to synchronously replicate messages out to a secondary. For this reason, it made sense to start by looking at disk journalling performance.<br />
<br />
<br />
<h3 style="text-align: left;">
Journalling Performance</h3>
<br />
For the last few years, we've been running our systems on CentOS 6.4, kernel 2.6.32, journalling messages to an ext3 file-system. The file-system is backed by a battery-backed RAID array, and we perform asynchronous writes - meaning that the data is only guaranteed to be in the operating system's page cache once the write() call has returned. At the time, our testing showed this configuration to be the most performant, given the trade-offs between the maturity of other file-systems and the safety guarantees on offer.<br />
<br />
Testing also showed that from a software point-of-view, using the JDK's <span style="font-family: "courier new" , "courier" , monospace;">RandomAccessFile</span> gave the best performance for writes that always append to the end of a file. Using this technique, as messages arrive at a core service, they are appended to the current journal. When a journal reaches a certain size, the journaller rolls to the next file and continues appending data.<br />
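The journaller itself isn't shown in this post, but the shape of the technique is simple enough to sketch. The following is illustrative only - the class name, file naming and roll size are hypothetical, not our production code - but it captures the append-and-roll behaviour described above:<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<pre><code>import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch only - names and roll size are hypothetical.
public final class AppendOnlyJournaller
{
    private static final long ROLL_SIZE_BYTES = 1_073_741_824L; // ~1GiB, made up

    private final File directory;
    private RandomAccessFile journal;
    private long bytesWritten;
    private int fileIndex;

    public AppendOnlyJournaller(final File directory) throws IOException
    {
        this.directory = directory;
        rollToNextFile();
    }

    // Asynchronous from the disk's point of view: when write() returns, the data
    // is only guaranteed to be in the operating system's page cache.
    public void append(final byte[] message) throws IOException
    {
        journal.write(message);
        bytesWritten += message.length;
        if (bytesWritten >= ROLL_SIZE_BYTES)
        {
            rollToNextFile();
        }
    }

    private void rollToNextFile() throws IOException
    {
        if (journal != null)
        {
            journal.close();
        }
        journal = new RandomAccessFile(new File(directory, "journal-" + fileIndex++ + ".log"), "rw");
        bytesWritten = 0L;
    }
}
</code></pre>
</div>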
<br />
In order to determine what benefit we would get from changing the operating system/file-system/storage hardware, we needed to be able to accurately measure the time taken to journal incoming messages to disk.<br />
<br />
First off, of course, it's necessary to be able to replicate production-like inbound traffic in a performance-test environment; see my previous posts on how you might go about getting to this point.<br />
<br />
<br />
<h3 style="text-align: left;">
Measuring the baseline</h3>
<br />
Once we were happy that we could adequately stress the system-under-test, we found that the best way to measure journalling latency was just to wrap the write call with a timer.<br />
<br />
Our journaller was instrumented with a couple of calls to <span style="font-family: "courier new" , "courier" , monospace;">System.nanoTime()</span>:<br />
<br />
<br />
<script src="https://gist.github.com/epickrram/e4a1e414a320a37a9693.js"></script>
The <span style="font-family: "courier new" , "courier" , monospace;">ValueRecorder</span> component referenced here simply maintains a maximum-seen value and publishes this value to a monitoring endpoint every second or so. Using this small change, we were able to see exactly how long it was taking to perform an asynchronous write to underlying storage.<br />
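If the embedded gist doesn't render for you, the idea is straightforward. Here is a minimal sketch - the <span style="font-family: "courier new" , "courier" , monospace;">ValueRecorder</span> interface shown is a stand-in I've invented for illustration, not our actual component:<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<pre><code>import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.atomic.AtomicLong;

// Stand-in for the monitoring component: tracks the maximum value seen,
// to be read and reset by a monitoring thread every second or so.
final class ValueRecorder
{
    private final AtomicLong maxSeen = new AtomicLong();

    void record(final long value)
    {
        maxSeen.accumulateAndGet(value, Math::max);
    }

    long getAndResetMax()
    {
        return maxSeen.getAndSet(0L);
    }
}

// Wrap the write call with System.nanoTime() to capture per-write latency.
final class InstrumentedJournalWriter
{
    private final RandomAccessFile journal;
    private final ValueRecorder writeLatencyNanos;

    InstrumentedJournalWriter(final RandomAccessFile journal, final ValueRecorder recorder)
    {
        this.journal = journal;
        this.writeLatencyNanos = recorder;
    }

    void append(final byte[] message) throws IOException
    {
        final long startNanos = System.nanoTime();
        journal.write(message);
        writeLatencyNanos.record(System.nanoTime() - startNanos);
    }
}
</code></pre>
</div>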
<br />
Armed with this ability to extract accurate metrics from our journaller, we ran a baseline test to see how the system was currently performing.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/jnl_baseline_2.6.32_ext3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="298" src="http://img.epickrram.com/blog/jnl_baseline_2.6.32_ext3.png" width="640" /></a></div>
<br />
<div style="text-align: center;">
<i>Max write latency in microseconds per-second (kernel 2.6.32/ext3)</i></div>
<div style="text-align: left;">
<i><br /></i></div>
<div style="text-align: left;">
In our existing configuration, we had background noise of 200 - 400 microseconds write latency, with spikes up to 1 millisecond. Clearly, in order to get to consistently low latencies, we needed to address this. Inspecting the data in detail, we can see that the best-case latency for the write call is about 10 microseconds:</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/jnl_baseline_2.6.32_ext3_detail.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://img.epickrram.com/blog/jnl_baseline_2.6.32_ext3_detail.png" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: center;">
<i>Best case write latency of ~10 microseconds</i></div>
<div style="text-align: left;">
<i><br /></i></div>
<div style="text-align: left;">
<br />
<h3 style="text-align: left;">
Measuring improvements</h3>
<br />
Although we knew that we were planning to upgrade to new hardware, when performance testing it is always advisable to change only one thing at a time; otherwise it's impossible to know which single change had a beneficial or detrimental impact. With this methodology in mind, we first upgraded the kernel, then the file-system, on the same hardware, recording the results each time. For brevity's sake, I'll just present the outcome of those tests: we found that the best combination on the old hardware was a more recent kernel (3.17) together with the ext4 file-system in place of ext3. The results of these changes were obvious when we re-ran the previous test.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/jnl_upgrade_3.17_ext4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="298" src="http://img.epickrram.com/blog/jnl_upgrade_3.17_ext4.png" width="640" /></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<i>Max write latency in microseconds per-second (kernel 2.6.32/ext3 vs 3.17/ext4)</i></div>
<div style="text-align: left;">
<i><br /></i></div>
<div style="text-align: left;">
Background noise was now down to around 50 - 100 microseconds, with spikes of around 200 microseconds. Looking in detail again, we can see that the best-case write latency is still around 10 microseconds - suggesting that this is the real cost of a write call plus JNI overhead when the kernel merely performs asynchronous I/O (essentially just a memory write into the page cache).</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/jnl_upgrade_3.17_ext4_detail.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="298" src="http://img.epickrram.com/blog/jnl_upgrade_3.17_ext4_detail.png" width="640" /></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<i>Best case write latency still ~10 microseconds</i></div>
<div style="text-align: left;">
<i><br /></i></div>
<div style="text-align: left;">
Now that we had selected the optimal configuration for the OS/file-system, we tried upgrading the hardware. Again, attempting to change only one thing at a time, we also tried the kernel 2.6.32/ext3 combination on the new hardware, but I will just show the results for kernel 3.17/ext4, which once again yielded the best numbers.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/jnl_new_hardware_3.17_ext4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="296" src="http://img.epickrram.com/blog/jnl_new_hardware_3.17_ext4.png" width="640" /></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<i>Write latency in microseconds after hardware upgrade</i></div>
<div style="text-align: left;">
<i><br /></i></div>
<div style="text-align: left;">
The improvement with the new hardware is actually difficult to see at this scale, such is the reduction in jitter and latency. Looking at the charts with a log scale on the y-axis makes things a little clearer.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/blog/jnl_new_hardware_3.17_ext4_log_scale.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="296" src="http://img.epickrram.com/blog/jnl_new_hardware_3.17_ext4_log_scale.png" width="640" /></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<i>Write latency in microseconds after hardware upgrade (log scale)</i></div>
<div style="text-align: left;">
<i><br /></i></div>
<div style="text-align: left;">
After combining the hardware, operating-system and file-system upgrades, background noise is down to 10 - 20 microseconds, with spikes up to around 300 microseconds. This is a great improvement on the baseline, which had background noise of 200 - 400 microseconds, with spikes up to 1 millisecond. We can also see that the best-case write latency has decreased to 4 - 5 microseconds, about half of what it was on the original configuration.<br />
<br />
<br />
<h3 style="text-align: left;">
Further improvements</h3>
</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
More analysis of these results revealed that the 300 microsecond spikes are caused by the journaller rolling to a new file, rather than the cost of actually doing a write call (the file-rolling in our journaller is orchestrated at a lower level than the instrumentation that we added). This is something that we will be able to easily fix in software, meaning that we should have consistent write latencies of under 20 microseconds.<br />
<br />
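One plausible software fix - and I stress this is a sketch of the general idea, not necessarily the change we will make - is to have a background thread prepare the next journal file in advance, so that rolling becomes a cheap reference swap on the latency-sensitive thread:<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<pre><code>import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch only: a background thread pre-opens the next journal file so that the
// writing thread never pays the file-creation cost inside the hot path.
public final class PreRollingJournaller
{
    private final BlockingQueue<RandomAccessFile> prepared = new ArrayBlockingQueue<>(1);
    private final long rollSizeBytes;
    private RandomAccessFile current;
    private long bytesWritten;

    public PreRollingJournaller(final File directory, final long rollSizeBytes)
        throws InterruptedException
    {
        this.rollSizeBytes = rollSizeBytes;
        final Thread preparer = new Thread(() ->
        {
            int fileIndex = 0;
            try
            {
                while (!Thread.currentThread().isInterrupted())
                {
                    // Blocks until the writer has taken the previously prepared file.
                    prepared.put(new RandomAccessFile(
                        new File(directory, "journal-" + fileIndex++ + ".log"), "rw"));
                }
            }
            catch (final IOException | InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        }, "journal-preparer");
        preparer.setDaemon(true);
        preparer.start();
        current = prepared.take();
    }

    public void append(final byte[] message) throws IOException, InterruptedException
    {
        current.write(message);
        bytesWritten += message.length;
        if (bytesWritten >= rollSizeBytes)
        {
            current.close();
            current = prepared.take(); // cheap swap; the file was opened off the hot path
            bytesWritten = 0L;
        }
    }
}
</code></pre>
</div>
Pre-allocating each file to its full size would likely help further still, but the principle is the same: move the expensive work off the critical path.<br />
<br />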
We have also spent very little time experimenting with kernel tunables related to I/O performance; there may be further gains to be made by working on I/O schedulers and priorities.</div>
<div style="text-align: left;">
<br />
<br /></div>
<h3 style="text-align: left;">
Conclusion</h3>
<div>
<div>
<br />
Upgrading our operating system and hardware made a huge difference to the amount of jitter that we saw in our journalling performance. This was, however, a significant undertaking, and not an opportunity that comes along very often.<br />
<br />
A modern kernel on commodity hardware seems to be capable of write latencies as low as 5 microseconds when asynchronous I/O is used.<br />
<br />
Once again, the importance of being able to replay production-like inputs to our system has proven invaluable in testing and tuning for performance in a repeatable manner. Having this ability means that we are able to try out different settings without impacting the performance of our production environments, and generate faster feedback on these changes.<br />
<br />
<br />
<br />
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<br /></div>
</div>
</div>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-40612266137219421352015-04-24T15:53:00.000+01:002015-04-24T15:53:17.989+01:00JAX Finance talks<div dir="ltr" style="text-align: left;" trbidi="on">
My colleague Sam & I will be talking at JAX Finance next week (28th/29th April).<br />
<br />
I'll be doing a talk with Vijay from Azul on our experiences at LMAX with deploying Zing to production. In the talk, we'll discuss how to go about making such a change in a safe manner, some of the internals of Zing, and lessons learned along the way.<br />
<br />
Sam's talk describes how we achieve high throughput and low latency at LMAX Exchange, and the architecture that we've developed on the way to becoming the UK's fastest-growing tech firm.<br />
<br />
<br />
<br />
<a href="http://jax-finance.com/2015/session/lmax-exchange-architecture-high-throughput-low-latency-plain-old-java/">http://jax-finance.com/2015/session/lmax-exchange-architecture-high-throughput-low-latency-plain-old-java/</a><br />
<br />
<br />
<a href="http://jax-finance.com/2015/session/low-latency-java-real-world-lmax-exchange-zing-jvm/">http://jax-finance.com/2015/session/low-latency-java-real-world-lmax-exchange-zing-jvm/</a><br />
<br />
<br />
If you're attending the conference, please come and speak to us - we'll be happy to talk about work and technology at LMAX Exchange.</div>
Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-89822588709230532662014-08-22T16:38:00.001+01:002020-09-14T19:34:13.586+01:00Performance Testing at LMAX (Part Three)<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">In my last post, I focussed on how to go about analysing your application's inbound traffic in order to create a simulation for performance testing. One part of the analysis was determining how long to leave between each simulated message, and we saw that the majority of messages received by LMAX have less than one millisecond between them.
<br />
<br />
Using this piece of data to create a simulation doesn't really make sense - if our simulator should send messages with zero milliseconds delay between each, it will just sit in a loop sending messages as fast as possible. While this will certainly stress your system, it won't be an accurate representation of what is happening in production.
<br />
<br />
The appearance of zero just points to the fact that we are not looking closely enough at the source data. If milliseconds are not precise enough to measure intervals between messages, then we need to start using microseconds. If you are using something like <a href="http://en.wikipedia.org/wiki/Pcap#libpcap">pcap</a> or an application such as <a href="http://en.wikipedia.org/wiki/Tcpdump">tcpdump</a> to capture production traffic, then this is easy, since each captured packet will be stamped with a microsecond-precision timestamp by default (see <a href="http://www.tcpdump.org/manpages/pcap-tstamp.7.html">pcap-tstamp</a> for timestamp sources).
<br />
<br />
<h3>
Finding the right level of detail</h3>
Looking at inbound message intervals for a single session, but this time at microsecond precision, a different picture emerges.
<br />
<br />
Firstly, a reminder of how the distribution appeared at the millisecond level. Note the log-scale on the Y-axis, indicating that most sequential messages had 0 - 10 milliseconds between them:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/inbound_message_interval_10ms.png" name="interval_millis" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/inbound_message_interval_10ms.png" /></a></div>
<br />
At microsecond precision, using 10 microsecond buckets, almost all datapoints are under 100 microseconds:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/inbound_message_interval_10us.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/inbound_message_interval_10us.png" /></a></div>
<br />
<br />
This is slightly easier to see when the log scale is removed from the Y axis:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/inbound_message_interval_10us_non_log_scale.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/inbound_message_interval_10us_non_log_scale.png" /></a></div>
<br />
<br />
It is clear from these charts that we have a broad range of message intervals under one millisecond, with most occurrences under 100 microseconds. Looking at the raw data confirms this:
<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<code>
</code>
<pre><code>interval (us) count
0 1591
10 611
20 1017
30 1243
40 3890
50 6167
60 3349
70 1655
80 907
90 519
100 374
110 291
120 254
130 193
140 184
150 156
160 145
170 138
180 114
190 111
</code></pre>
<code>
</code>
</div>
<br />
<br />
Using histograms is a useful way to visually confirm that your analysis is using the right level of detail. As long as the peak of the histogram is not at zero, you can be more confident that you have useful data for building a simulation.
<br />
<br />
<b>Using this more detailed view, we know that our model will need to generate sequential messages with an interval of 0 to 80 microseconds <i>most of the time</i>, with a bias towards an interval of around 50 microseconds.</b>
<br />
<br />
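As a sketch of how this measurement might drive load generation (illustrative only - this is not our actual test harness), one can sample inter-message delays directly from the observed bucket counts in the table above:<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<pre><code>import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ThreadLocalRandom;

// Sketch: weighted sampling of inter-message delays from an empirical histogram.
public final class EmpiricalIntervalSampler
{
    // Cumulative count -> bucket start (microseconds).
    private final NavigableMap<Long, Long> cumulativeCounts = new TreeMap<>();
    private long totalCount;

    // e.g. addBucket(50, 6167) for the 50 - 59 microsecond bucket above
    public void addBucket(final long bucketStartMicros, final long count)
    {
        totalCount += count;
        cumulativeCounts.put(totalCount, bucketStartMicros);
    }

    // Each bucket is chosen with probability proportional to its observed count;
    // assumes at least one bucket has been added.
    public long nextIntervalMicros()
    {
        final long pick = ThreadLocalRandom.current().nextLong(totalCount);
        return cumulativeCounts.higherEntry(pick).getValue();
    }
}
</code></pre>
</div>
Fed with the table above, this produces mostly 40 - 60 microsecond gaps, with occasional longer pauses, matching the shape of the measured distribution.<br />
<br />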
<h3>
Microbursts</h3>
Examining the histogram of millisecond message intervals <a href="#interval_millis">above</a>, it's clear that the distribution doesn't follow a smooth curve. The chart suggests that alongside messages that arrive very close together, we also receive a number of messages with an interval on the order of hundreds of milliseconds.
<br />
<br />
Removing the log-scale on the axes, and zooming in on the interesting intervals, it is easier to see that messages also arrive in bursts every 100 - 250 milliseconds:
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/inbound_message_interval_detail.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/inbound_message_interval_detail.png" /></a></div>
<br />
<br />
Are these pauses between messages uniformly spread out over time, or do they follow a repeating pattern? To answer this question, we can build a histogram plotting <i>arrival millisecond within a second</i> - this should help to identify messages that are arriving at the larger intervals.
<br />
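Computing this is cheap once you have microsecond-precision capture timestamps; a minimal sketch, assuming epoch-microsecond inputs:<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<pre><code>// Sketch: bucket each message by the millisecond within its second of arrival,
// given epoch-microsecond timestamps such as those produced by a pcap capture.
public final class ArrivalMillisecondHistogram
{
    public static long[] compute(final long[] epochMicros)
    {
        final long[] histogram = new long[1000]; // one bucket per millisecond of the second
        for (final long timestampMicros : epochMicros)
        {
            histogram[(int) ((timestampMicros / 1000) % 1000)]++;
        }
        return histogram;
    }
}
</code></pre>
</div>
<br />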
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/messages_per_millisecond_session_1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/messages_per_millisecond_session_1.png" /></a></div>
<br />
There are obvious bursts of messages at 0, 250, 500, 750 milliseconds into each second, and to a lesser extent every 100 milliseconds. This tells us that our inbound traffic is quite bursty, and that our simulation will need to replicate this behaviour to be realistic. The previous chart is only for a single session, but if we generate the same chart for all sessions, this behaviour is even more pronounced:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/250ms_burst.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/250ms_burst.png" width="700" /></a></div>
<br />
<div style="text-align: center;">
<i>Messages received at each millisecond over time, showing 250ms bursts</i></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/100ms_burst.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/100ms_burst.png" width="700" /></a></div>
<br />
<div style="text-align: center;">
<i>Messages received at each millisecond over time, showing 100ms bursts</i></div>
<br />
<br />
We found that this pattern of behaviour was probably down to our market-makers' own data sources. Each of them likely receives a market-data feed from one of several primary trading venues, which publish price updates every 250 or 100 milliseconds or so. The bursts we see in inbound traffic are down to our market-makers' algorithms reacting to those inputs.
<br />
<br />
<h3>
Conclusion</h3>
The devil is in the detail - it is easy to make assumptions about how users interact with your system when given only a coarse overview via metrics such as requests per second but, until a proper analysis is carried out, the nature of user traffic patterns cannot be more than guessed at.
<br />
<br />
Going to this level of detail allowed us to make sure that our performance tests were truly realistic, and identified a previously unknown aspect of user behaviour. This in turn gives us confidence that we can continue to plan for future capacity by running production-like loads in our performance test environment at ten times the current production throughput rate.
<br />
<br />
As stated in a previous post, this is an iterative process as part of our continuous delivery pipeline, so we will continue to analyse and monitor user behaviour. There are no doubt many more details that we have not yet discovered.
<br />
<br />
<br />
<br />
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<br /></div>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.comtag:blogger.com,1999:blog-786039184625585734.post-82701009822130105042014-07-01T17:08:00.002+01:002020-09-14T19:34:24.904+01:00Performance Testing at LMAX (Part Two)<div dir="ltr" style="text-align: left;" trbidi="on"><span><a name='more'></a></span><span style="font-size: large;">Hi there, and welcome. This content is still relevant,
but fairly old. If you are interested in keeping up-to-date with similar
articles on profiling, performance testing, and writing performant
code, consider signing up to the <a href="https://foursteps.software/">Four Steps to Faster Software</a> newsletter. Thanks!</span></div><div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-size: large;"><span><!--more--></span> </span> <br /></div><div dir="ltr" style="text-align: left;" trbidi="on"> </div><div dir="ltr" style="text-align: left;" trbidi="on">In this post, I'm going to talk about traffic analysis and load modelling for performance tests. I will describe in some detail how we went about ensuring that we were accurately modelling production traffic patterns when executing our performance tests. I will be focussing on the market-maker order flow into our system, since it produces the most load and, consequently, the most stress on the system.
<br />
<br />
<br />
<h3>
Traffic Analysis</h3>
In my previous post, I explained that our market-makers use the FIX protocol to communicate with us. This is an asynchronous request-response protocol using different message types to perform specific actions. In order to correctly model the load on our system, we need to answer a few important questions for each market-maker session:
<ol>
<li>What types of messages arrive?</li>
<li>How many messages arrive in a given time interval?</li>
<li>What is the interval of time elapsed between messages?</li>
<li>What resources are used by different types of message?</li>
</ol>
First we need to pick an interval of time to analyse, so let's start with about ten minutes' worth of data from a normal day's trading. From my earlier post, we can see that a FIX message is comprised of tag-value pairs:
<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<code>
8=FIX.4.2|9=211|35=D|49=session1|56=LMAX-FIX|34=258790|52=20140509-06:36:12.193|22=8|47=P|21=1|54=1|60=20140509-06:36:12.193|59=0|38=10000|40=2|581=1|11=774876524712244|55=GBP/JPY|48=180415|44=148.72674|10=227|
</code>
</div>
To answer the questions above, we need to look at all inbound messages for a given session. To do this, we will pay attention to the following tags:
<br />
<ul>
<li>49 - SenderCompId - the party sending the message</li>
<li>35 - MsgType - the type of message</li>
<li>52 - SendingTime - when the message was sent by the client</li>
<li>55 - Symbol - the resource to apply the message to (in this case, the financial instrument to be traded)</li>
</ul>
For a detailed breakdown of FIX tags and their meanings, refer to <a href="http://btobits.com/fixopaedia/fixdic42/index.html">Fixopaedia</a>.
<br />
First of all, we will isolate all messages for a given session and resource; time to delve into bash...
<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<code>
pricem@asset-xxxxx:~$ grep -oE "\|49=[^\|]+" message-log.txt | head -1<br />
|49=session1<br />
pricem@asset-xxxxx:~$ grep session1 message-log.txt > session1-messages.txt<br />
pricem@asset-xxxxx:~$ grep -oE "\|55=[^\|]+" session1-messages.txt | head -1<br />
|55=GBP/USD<br />
pricem@asset-xxxxx:~$ grep "GBP/USD" session1-messages.txt > session1-gbpusd-messages.txt<br />
</code>
</div>
We are only interested in inbound requests, since they are the messages that we will be replicating when running performance tests. There are four inbound message types that matter here: NewOrderSingle <b>(35=D)</b>, OrderCancel <b>(35=F)</b>, OrderCancelReplace <b>(35=G)</b> and MassQuote <b>(35=i)</b>.
<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<code>
pricem@asset-xxxxx:~$ grep -E "\|35=(D|F|G|i)" session1-gbpusd-messages.txt > session1-gbpusd-inbound-messages.txt
</code>
</div>
Now that we have an interesting set of messages, we can start answering the questions posed above.
<br />
<i>What type of messages arrive?</i>
<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<code>
</code>
<pre><code>pricem@asset-xxxxx:~$ grep -oE "\|35=[^\|]+" session1-gbpusd-inbound-messages.txt | sort | uniq -c
24 |35=D
19 |35=F
2515 |35=G
</code></pre>
<code>
</code>
</div>
<i>How many messages arrive in a given time interval?</i>
<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<code>
</code>
<pre><code>pricem@asset-xxxxx:~$ grep -oE "\|52=[^\|\.]+" session1-gbpusd-inbound-messages.txt | cut -c1-18 | sort | uniq -c
103 |52=20140701-09:16
360 |52=20140701-09:17
63 |52=20140701-09:18
78 |52=20140701-09:19
90 |52=20140701-09:20
121 |52=20140701-09:21
108 |52=20140701-09:22
657 |52=20140701-09:23
338 |52=20140701-09:24
205 |52=20140701-09:25
108 |52=20140701-09:26
297 |52=20140701-09:27
30 |52=20140701-09:28
</code></pre>
<code>
</code>
</div>
<i>What is the interval of time elapsed between messages?</i>
<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<code>
</code>
<pre><code>pricem@asset-xxxxx:~$ cat session1-gbpusd-inbound-messages.txt | python ./millisecond_interval_histogram.py
interval, count
0, 1830
2, 7
4, 8
8, 26
16, 13
32, 144
64, 140
128, 111
256, 74
512, 49
1024, 59
2048, 42
4096, 29
8192, 13
32768, 11
65536, 1
</code></pre>
<code>
</code>
</div>
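The millisecond_interval_histogram.py script itself isn't included here; the following is a rough re-implementation of the idea in Java - the regex, timestamp format and doubling-bucket scheme are my assumptions, based on the output above:<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<pre><code>import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch of an interval histogram: read FIX messages on stdin, extract
// tag 52 (SendingTime), and count the millisecond gaps between consecutive
// messages in doubling-width buckets.
public final class MillisecondIntervalHistogram
{
    private static final Pattern SENDING_TIME = Pattern.compile("\\|52=([0-9]{8}-[0-9:.]+)");
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMdd-HH:mm:ss.SSS");

    public static void main(final String[] args) throws Exception
    {
        final TreeMap<Long, Long> histogram = new TreeMap<>();
        long previousMillis = -1L;
        try (BufferedReader reader =
            new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8)))
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                final Matcher matcher = SENDING_TIME.matcher(line);
                if (!matcher.find())
                {
                    continue;
                }
                final long millis = LocalDateTime.parse(matcher.group(1), FORMAT)
                    .toInstant(ZoneOffset.UTC).toEpochMilli();
                if (previousMillis >= 0L)
                {
                    histogram.merge(bucketOf(millis - previousMillis), 1L, Long::sum);
                }
                previousMillis = millis;
            }
        }
        System.out.println("interval, count");
        histogram.forEach((bucket, count) -> System.out.println(bucket + ", " + count));
    }

    // Gaps of 0 - 1ms fall into the zero bucket; larger gaps are rounded down
    // to a power of two, matching the bucket labels in the output above.
    private static long bucketOf(final long intervalMillis)
    {
        return intervalMillis <= 1L ? 0L : Long.highestOneBit(intervalMillis);
    }
}
</code></pre>
</div>
<br />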
<i>What resources are used by different types of message?</i>
<br />
<br />
<div style="background: rgb(204, 204, 204) none repeat scroll 0% 0%; padding: 10px;">
<code>
</code>
<pre><code>pricem@asset-xxxxx:~$ grep -oE "\|55=[^\|\.]+" session1-messages.txt | sort | uniq -c | sort -n
72 |55=EUR/DKK
112 |55=USD/HKD
216 |55=AUD/NZD
...
4128 |55=GBP/ZAR
5142 |55=GBP/USD
7926 |55=USD/TRY
</code></pre>
<code>
</code>
</div>
<br />
<br />
<h3>
Drawing conclusions</h3>
With these data, we can begin to characterise the traffic generated by this session:
<ol>
<li>Most messages are OrderCancelReplace</li>
<li>Message arrival rates vary between 30 and 600 messages per minute</li>
<li>There is a gap of zero milliseconds between most messages, with a gap of 32 - 255 milliseconds between a number of other messages</li>
<li>Messages are distributed over a number of instruments, some more popular than others</li>
</ol>
Creating charts to help visualise the data can be useful for comparison later:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/msg-type.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/msg-type.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/msg-rate.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/msg-rate.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/msg-interval-nolog.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/msg-interval-nolog.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/resource.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/resource.png" /></a></div>
Armed with this knowledge, it is possible to create a simulation that will generate traffic closely resembling our source data.
<br />
<br />
<br />
<h3>
Validating the model</h3>
Once your model is up and running in your performance environment, it is vital to measure its effectiveness at simulating production traffic patterns. To do this, simply perform the same analysis on data captured from your performance test environment.
When the analysis is complete, validating the model is easily done by plotting the original numbers against those produced by your model.
<br />
<i>How well are we modelling the type of message received?</i>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/compare-msg-type.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/compare-msg-type.png" /></a></div>
<br />
This is pretty good; no calibration is needed here.<br />
<br />
<i>How well are we modelling the arrival rate of messages?</i>
<br />
This comparison is deliberately omitted, as we run our performance tests at ~10x production load.
<br />
<i>How well are we modelling the interval between messages?</i>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/compare-msg-interval-nolog.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/compare-msg-interval-nolog.png" /></a></div>
<br />
Again, pretty good - the overall shape looks about right. The most important thing is that we're sending in messages fast enough (i.e. both histograms have bumps in roughly the same places).<br />
<br />
<i>How well are we modelling the message distribution over resources?</i>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img.epickrram.com/projects/compare-resource.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img.epickrram.com/projects/compare-resource.png" /></a></div>
<br />
Not so good, but not necessarily cause for concern.<br />
<br />
<b>Disclaimer</b>: All charts are for illustrative purposes only; the model in our performance environment is built from analysing more than ten minutes worth of data, so something like resource distribution that changes over the course of the day will not be well represented in a ten minute sample.
<br />
Whether this chart matters will depend on your system and business domain - in our exchange, the cost of processing an order instruction is the same for all resources, so as long as there is some representative noise, the model is still valid. Running the analysis over a much longer sample period would produce enough data-points to improve the match between these two datasets, if such a step were required.
<br />
<br />
<br />
<h3>
Wrapping up</h3>
When developing a performance-testing strategy, real-world data should be used wherever possible to generate a model that will create representative load on your system. It is important to understand how your users interact with your system - a naive approach based on guesswork is unlikely to resemble reality.<br />
<br />
Using charts to visualise different aspects of data flow is a useful tool in developing a mental image, which can help you reason about the model that will need to be built. It can also be handy as a 'sanity-check' to make sure that your analyses are correct.
<br />
The model should be tested to ensure that it is behaving as expected, and that your performance test harness is close enough to reality to provide meaningful results - after all, it's no good testing the wrong thing.
<br />
<br />
<br />
<br />
<br />
<a class="twitter-follow-button" data-show-count="false" data-show-screen-name="false" data-size="large" href="https://twitter.com/epickrram">Follow @epickrram</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<br />
<br />
<br />
</div>Mark Pricehttp://www.blogger.com/profile/15942816335515363315noreply@blogger.com