in progress…

May 7, 2009

Fun with micro benchmarks and optimizers

Filed under: Software — Mark Falco @ 2:11 am

I’ve been doing some micro benchmarks comparing the apparent cost of Java vs C++ virtual function calls. While the comparison is interesting, what really threw me for a bit was the results I was getting in Java 1.6. Have a look at the first version of my test:

public class Main
    {
    public static class Test
        {
    	public int getVirtual()
            {
            return m_i;
            }

        public int m_i = 1;
        }

    public static void main(String[] asArg)
	{
    	int cIters = Integer.parseInt(asArg[0]);
    	int cOps   = Integer.parseInt(asArg[1]);

        Test test = new Test();
        for (int i = 0; i < cIters; ++i)
            {
            long ldtStart = System.currentTimeMillis();
            for (int j = 0; j < cOps; ++j)
                {
                test.getVirtual();
                }

            long cMillis = System.currentTimeMillis() - ldtStart;
            System.out.println(cMillis);
            }
        }
    }

And running 10 iterations of 1 billion virtual calls yields the following millisecond timings: java14 Main 10 1000000000 1646 1636 1650 1677 1645 1669 1689 1668 1676 1724 java15 Main 10 1000000000 1775 1752 1698 1688 1759 1732 1758 1709 1670 1733 java16 Main 10 1000000000 8 <— WTF?
1409
1370
1347
1357
1341
1346
1349
1350
1354

Ok, so on 1.6 the first iteration of billion calls took 8ms and each subsequent iteration took nearly 200 times as long, how could that be?

Q> I screwed up my measurements

A> Nope looks ok, double check, still looks ok

Q> Was is the garbage collector

A> Shouldn’t be, other then printing the results the test doesn’t generate any garbage.  I reran with GC logging just to be sure, and no GCs were logged (not surprising).

Q> Did I need to let the test run longer so HotSpot could do its thing?

A> Nope, ran for much longer, results held steady at ~1350ms

Q> Is the optimizer broken or deoptimizing after the first iteration?

A> Sure looks like it

Ok so time to start thinking about how to optimizer is going to change my test.  The first thing to notice is that it could identify that there is no reason to actually call my getVirtual() method, it can see there are no side effects from calling it, and can see that the result is discarded.  So lets modify to do something with the result, just looking at the test loop now.

for (int i = 0; i &amp;lt; cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    int  cTotal   = 0;
    for (int j = 0; j &amp;lt; cOps; ++j)
        {
        cTotal += test.getVirtual();
        }

    long cMillis = System.currentTimeMillis() - ldtStart;
    System.out.println(cMillis + ", " + cTotal);
    }

And the results: java16 Main 10 1000000000 46, 1000000000 1934, 1000000000 1875, 1000000000 1897, 1000000000 1924, 1000000000 1938, 1000000000 1884, 1000000000 1890, 1000000000 1895, 1000000000 1943, 1000000000 A few observations:

  • We can see the getVirtual() method (or least its logic) is getting run
  • the cost of both the fast iteration and the slow iteration increased
  • the performance difference between fast and slow runs is down to ~40

Still what’s the deal, running the test for longer doesn’t yield any additional fast iterations.  So at this point I get a few other people involved, and they work through many of the same suggestions and assumptions I’d listed above.  We also try the following:

  • pull the logic out of main() and put it in a non-static method
  • try running the test loop in parallel on multiple threads
  • try first warming up the JVM by running some random but heavy code prior to running the test
  • try recording the results into an array rather then printing during the test

All of these yield essentially the same results as above.  We do randomly trigger the loss of the fast iteration, but never trigger multiple fast iterations.  So yipee we figured out how to make things go slower.  As a side note it was intersting what would trigger the loss of the fast iteration, which was triggered by recording the results into an array as follows:

long[] acMillis = new long[cIters];
int[] acTotal  = new int[cIters];
for (int i = 0; i &amp;lt; cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    int  cTotal   = 0;
    for (int j = 0; j &amp;lt; cOps; ++j)
        {
        cTotal += test.getVirtual();
        }

    acMillis[i] = System.currentTimeMillis() - ldtStart;
    acTotal [i] = cTotal;
    }

for (int i = 0; i &amp;lt; cIters; ++i)
    {
    System.out.println(acMillis[i] + ", " + acTotal[i]);
    }

java16 Main 10 1000000000 1328, 1000000000 1346, 1000000000 1338, 1000000000 1308, 1000000000 1343, 1000000000 1335, 1000000000 1324, 1000000000 1354, 1000000000 1336, 1000000000 1344, 1000000000 And even odder we could “fix” this and get an initial fast result by recording the time measurements int an Object array rather then a long array. No kidding, dynamic object allocation improved results.

Object[] acMillis = new Object[cIters];
int[] acTotal  = new int[cIters];
for (int i = 0; i &amp;lt; cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    int  cTotal   = 0;
    for (int j = 0; j &amp;lt; cOps; ++j)
        {
        cTotal += test.getVirtual();
        }

    acMillis[i] = new Long(System.currentTimeMillis() - ldtStart);
    acTotal [i] = cTotal;
    }

for (int i = 0; i &amp;lt; cIters; ++i)
    {
    System.out.println(acMillis[i] + ", " + acTotal[i]);
    }

java16 Main 10 1000000000 46, 1000000000 1806, 1000000000 1791, 1000000000 1831, 1000000000 1775, 1000000000 1779, 1000000000 1806, 1000000000 1811, 1000000000 1789, 1000000000 1783, 1000000000 Ok, so clearly this is nuts. But then things start to come into focus, lets go back to a version where the optimizer can optimize out the getVirtual() call, in fact we’ll do it for them.

for (int i = 0; i &amp;lt; cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    for (int j = 0; j &amp;lt; cOps; ++j)
        {
        // cTotal += test.getVirtual();
        }
    long cMillis = System.currentTimeMillis() - ldtStart;
    System.out.println(cMillis);
    }

java16 Main 10 1000000000 6 865 875 859 856 969 875 872 861 877 Ok, so we’ve so removed the thing we were originally trying to test, and performance improves across the board. We get a 40% savings on the slow runs. So we see not too surprisingly that of the time we were measuring nearly half was in the inner ‘for’ loop not in the thing we’d intended to test. And now we the ah ha moment, how does the optimizer look at optimizing the test “infrastructure”, i.e. our main method, and we make this change.

public static void main(String[] asArg)
    {
    int cIters = Integer.parseInt(asArg[0]);
    int cOps   = Integer.parseInt(asArg[1]);

    for (int i = 0; i &amp;lt; 10; ++i)
        {
        runTest(cIters, cOps);
        System.out.println("pass " + i + " complete");
        }
    }

public static void runTest(int cIters, int cOps)
    {
    Test test = new Test();
    for (int i = 0; i &amp;lt; cIters; ++i)
        {
        long ldtStart = System.currentTimeMillis();
        int  cTotal   = 0;
        for (int j = 0; j &amp;lt; cOps; ++j)
            {
            cTotal += test.getVirtual();
            }
        long cMillis = System.currentTimeMillis() - ldtStart;
        System.out.println(cMillis + ", " + cTotal);
        }
    }

java16 Main 10 1000000000 46, 1000000000 1886, 1000000000 1894, 1000000000 1902, 1000000000 1877, 1000000000 1870, 1000000000 1862, 1000000000 1910, 1000000000 1861, 1000000000 1878, 1000000000 pass 0 complete 1869, 1000000000 1899, 1000000000 1897, 1000000000 1866, 1000000000 1882, 1000000000 1855, 1000000000 1895, 1000000000 1890, 1000000000 1903, 1000000000 1880, 1000000000 pass 1 complete 38, 1000000000 41, 1000000000 40, 1000000000 43, 1000000000 39, 1000000000 42, 1000000000 39, 1000000000 42, 1000000000 61, 1000000000 44, 1000000000 pass 2 complete 40, 1000000000 41, 1000000000 41, 1000000000 42, 1000000000 39, 1000000000 39, 1000000000 42, 1000000000 …. Eureka, finally we can get lots sustained fast runs. So it appears that we need to pop out of our test code in order for the optimizer to undo the apparent deoptimization it had made during the second iteration. Note that it was not simply that we’d run more total iterations, because we’d tried that earlier, it is in fact popping out of the test function, which apparently allows an optimized version to be slipped in. This ends up being an important bit to remember to account for in doing these types of micro benchmarks. At this point I’d assumed the “fast” runs had inlined the body of the getVirtual() method into the ‘for’ loop, and that the “slow” runs did call the function. Changing the code to manually inline however don’t back this up, leaving me with more questions. At least the mystery of the broken optimizer are solved though, that is enough for now.

6 Comments »

  1. You might want to read http://wikis.sun.com/display/HotSpotInternals/MicroBenchmarks. One specific issue with HotSpot is that the code in main() itself will not be optimized heavily, so pulling your test code out into a method called by main() make sense. Also, you might find it interesting to test with Japex, which handles VM warm up and timing for you.

    Comment by Patrick Wright — May 27, 2009 @ 4:05 pm

    • Thanks for the feedback. It was interesting that simply pulling the test out into a helper method did not change the behavior, i.e. I still received an initial very fast iteration, followed by endless very slow iterations. It was only when the test was modified to repeatedly call the helper method that it allowed the optimizer apply a good optimization. All in all it makes sense, but it does go to show some of the surprises you can get when doing micro benchmarks. Generally I’m used to them providing overly positive performance estimates, in this case it went just the opposite. I’ll plan on checking out Japex and see what it can do.

      Comment by markandrewfalco — May 27, 2009 @ 4:31 pm

  2. I have two ‘out of thin air’ guesses.

    The first, did you try with the -Xint param to get hotspot out of the picture. Perhaps in this simple case the interpreter is faster, since you are really not doing anything, and the amount of state held in the interpreter discovers this quickly and early returns.

    Version two is that the first time hotspot see’s your loop it completely unrolls into a thousand or so no-op instructions, which then get further optimized into nothingness – then when the loop is seen again, perhaps it says “I do indeed need to run this as a loop”

    Either way you should send this to Cliff Click. He’d figure it out in two ticks.

    Comment by Adam Malter — May 28, 2009 @ 2:15 pm

    • Yep, I’d given both -Xint and -Xbatch a try. -Xint resulted in consistent runs, but all were about 80x slower then my prior slow runs. -Xbatch resulted in the same initial fast iteration followed by slow iterations. I’d left out a number of the intermediate steps I’d taken in trying to track this down. Here are some of the other switches I’d tried, in an attempt to persuade the optimizer to yield additional fast results, all working as effectively as -Xbatch.

      -Xcomp
      -XX:+AggressiveOpts
      -XX:+BackgroundCompilation

      In hindsight it makes sense that the optimizations which I was looking for could not be applied until the method was re-executed, which of course I was not allowing when keeping everything with main(). Though I’m still curious as to how to the drop in performance after the first iteration, apparently the optimizer was willing to take an initial stab at things, i.e. it was able to de-optimize my second (and subsequent) iterations.

      Comment by Mark Falco — May 29, 2009 @ 12:39 am

  3. Be sure to run with -server mode to get reasonable compilation. I would assume that in most the cases above code should be optimized to zero iterations. You can also try out compiling with -Xbatch to see how it changes the results.

    Myself, when running somebody’s microbenchmarks, I’m renaming his main(…) to mainX(…) and then adding my own main(…) calling his in the loop. Normally I’m getting stable results after 2-3 iterations.

    Comment by Artur — May 28, 2009 @ 3:08 pm

    • I suppose I’d left out some important details. The tests were in fact run with -server (my default execution mode), running with -client always produced “slow” results. The intent of my investigation had been to identify why “fast” runs appeared initially and then vanished. In the course of trial and error, I found many changes either to code or JVM switch would allow me to get all “slow” runs, but it was the elusive “fast” runs which I was searching for.

      I definitely agree with your point regarding having main() loop over the real test code, and not itself be part of the test. This little exercise has taught me to always add that going forward.

      Comment by Mark Falco — May 29, 2009 @ 12:27 am


RSS feed for comments on this post. TrackBack URI

Leave a reply to markandrewfalco Cancel reply

Blog at WordPress.com.