Monday, 30 December 2013

JVM Escape Analysis

I recently came across a nice example of Oracle's HotSpot JVM using escape analysis in order to perform stack allocation, rather than heap allocation.

I'm sure that in many codebases across the planet, the following code fragments are familiar territory:
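
A minimal sketch of the kind of fragment meant here. The names Listener, ListenerRepository, addListener and dispatchEvent appear elsewhere in this post, but the method signatures and the field layout are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Assumed listener interface; the onEvent signature is a guess.
interface Listener {
    void onEvent(Object event);
}

// Assumed shape of the class that holds and notifies the listeners.
final class ListenerRepository {
    private final List<Listener> listeners = new ArrayList<>();

    void addListener(final Listener listener) {
        listeners.add(listener);
    }

    void dispatchEvent(final Object event) {
        // The enhanced for-loop asks the list for a new Iterator on every call.
        for (final Listener listener : listeners) {
            listener.onEvent(event);
        }
    }
}
```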

When trying to write garbage-free, or low-garbage code (as we do at LMAX), it is necessary to think about any code that may allocate unnecessary objects. Every invocation of the dispatchEvent method in the code above will cause the creation of an Iterator object (since the enhanced for-loop is just syntactic sugar for calling List.iterator() and then hasNext()/next() on the returned Iterator).
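
To make the hidden allocation visible, here is a rough sketch of the two equivalent forms side by side. The class DesugaredDispatch and both method names are hypothetical, and the Listener signature is an assumption:

```java
import java.util.Iterator;
import java.util.List;

// Assumed listener interface; the onEvent signature is a guess.
interface Listener {
    void onEvent(Object event);
}

// Hypothetical holder class, used only to compare the two loop forms.
final class DesugaredDispatch {
    // What we write: no Iterator is visible in the source.
    static void dispatchForEach(final List<Listener> listeners, final Object event) {
        for (final Listener listener : listeners) {
            listener.onEvent(event);
        }
    }

    // Roughly what the compiler generates: one Iterator per call.
    static void dispatchExplicit(final List<Listener> listeners, final Object event) {
        for (Iterator<Listener> it = listeners.iterator(); it.hasNext(); ) {
            final Listener listener = it.next();
            listener.onEvent(event);
        }
    }
}
```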

Using a byte-code viewer (I use ASM Bytecode Outline) to inspect the dispatchEvent method shows the creation and use of an Iterator object (via List.iterator()):

    ALOAD 1
    INVOKEINTERFACE java/util/List.iterator ()Ljava/util/Iterator;
    ASTORE 3
   FRAME APPEND [java/util/Iterator]
    ALOAD 3
    INVOKEINTERFACE java/util/Iterator.hasNext ()Z
    IFEQ L7
    ALOAD 3
    INVOKEINTERFACE java/util/Iterator.next ()Ljava/lang/Object;
    CHECKCAST epickrram/example/Listener
    ASTORE 4

This would seem like an excellent candidate for escape analysis - if the compiler is smart enough, it should allocate the Iterator object on the stack, and save having to touch the heap.

This can be tested by running a simple test while printing out the GC activity:
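
A sketch of what such a test might look like. The class names EscapeAnalysisTest and CountingListener are hypothetical, and the GC flags named in the comment are my guess at what would produce log lines like those shown in this post on a JDK of that era:

```java
import java.util.ArrayList;
import java.util.List;

// Assumed listener interface; the onEvent signature is a guess.
interface Listener {
    void onEvent(Object event);
}

// Hypothetical listener that only counts the events it receives.
final class CountingListener implements Listener {
    long eventCount;

    @Override
    public void onEvent(final Object event) {
        eventCount++;
    }
}

// Run with escape analysis on or off, for example (JDK 8 and earlier; assumed flags):
//   java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
//        -XX:+PrintGCApplicationStoppedTime -XX:-DoEscapeAnalysis EscapeAnalysisTest
// On JDK 9 and later, -Xlog:gc replaces the Print* GC flags.
final class EscapeAnalysisTest {
    static void dispatchEvent(final List<Listener> listeners, final Object event) {
        // One Iterator per call, unless the JIT manages to eliminate it.
        for (final Listener listener : listeners) {
            listener.onEvent(event);
        }
    }

    static long run(final int iterations) {
        final List<Listener> listeners = new ArrayList<>();
        final CountingListener listener = new CountingListener();
        listeners.add(listener);
        final Object event = new Object(); // reused, so the loop itself allocates nothing

        for (int i = 0; i < iterations; i++) {
            dispatchEvent(listeners, event);
        }
        return listener.eventCount;
    }

    public static void main(final String[] args) {
        System.out.println("Event Count: " + run(10_000_000));
    }
}
```

The event object is created once outside the loop so that any garbage reported by the collector can only come from the Iterator.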

The output with -XX:+DoEscapeAnalysis is:

Event Count: 10000000 

The output with -XX:-DoEscapeAnalysis is:

0.113: [GC [PSYoungGen: 23552K->416K(27456K)] 23552K->416K(90176K), 0.0010690 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
Total time for which application threads were stopped: 0.0012140 seconds
0.451: [GC [PSYoungGen: 94576K->0K(94656K)] 94592K->336K(157376K), 0.0011190 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
Total time for which application threads were stopped: 0.0012220 seconds

Event Count: 10000000

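So we can see that the JVM is very kindly avoiding heap allocation for the Iterator objects when escape analysis is enabled (the default since JDK 1.6). Strictly speaking, HotSpot does this by scalar replacement - the Iterator's fields are kept in registers or stack slots and no object is created at all - rather than by placing a whole object on the stack, but the effect on the garbage collector is the same.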

In the particular example I was looking at, I noticed that we only ever added a single Listener instance, so the length of the listeners list was always one. In this case it is wasteful to construct the new Iterator instance, even if it is allocated on the stack. The list of Listener objects can be replaced by a single reference to a Listener:
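
A minimal sketch of that replacement. SingleListener and setListener are the names used in the comments below; the null check and the method signatures are assumptions:

```java
// Assumed listener interface; the onEvent signature is a guess.
interface Listener {
    void onEvent(Object event);
}

final class SingleListener {
    private Listener listener;

    void setListener(final Listener listener) {
        this.listener = listener;
    }

    void dispatchEvent(final Object event) {
        // No list and no Iterator: the single reference is invoked directly.
        // The null check is an assumption, to tolerate dispatch before a listener is set.
        if (listener != null) {
            listener.onEvent(event);
        }
    }
}
```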

This small optimisation makes an order-of-magnitude difference when compared using a Caliper benchmark. The benchmark has two tests - Access, which calls a single listener instance, and Iteration, which iterates over a list of size one.


  1. SingleListener has a different API than ListenerRepository.

    Maybe it is worth having a SingleListenerRepository that is similar to SingleListener but has an addListener method instead of setListener, and throws an exception if somebody tries to add a second listener?

    Regarding EA: looking at the bytecode is not a good idea - PrintAssembly sheds much more light on it, as the (C2?) JIT does a great deal under the hood.

  2. I meant that EA could inline too much - a simple listener.onEvent(e) is fewer than 5 bytecode instructions and could be inlined instead of being called as a method (which most probably has only one implementation during the benchmark) - and that's why you got such a performance boost.

    But again - all of this is speculation - it is much better to look at the native code via PrintAssembly.


    1. Hi Vladimir, good point - I hadn't considered that JIT could be inlining the method. It's probably worth experimenting with -XX:MaxInlineSize or just disabling inlining with -XX:-Inline to test your hypothesis.

      I'll try to get around to doing that soon, and will publish my findings.