
May 7, 2009

Fun with micro benchmarks and optimizers

Filed under: Software — Mark Falco @ 2:11 am

I’ve been doing some micro benchmarks comparing the apparent cost of Java vs C++ virtual function calls. While the comparison is interesting, what really threw me for a bit was the results I was getting in Java 1.6. Have a look at the first version of my test:

public class Main
    {
    public static class Test
        {
        public int getVirtual()
            {
            return m_i;
            }

        public int m_i = 1;
        }

    public static void main(String[] asArg)
        {
        int cIters = Integer.parseInt(asArg[0]);
        int cOps   = Integer.parseInt(asArg[1]);

        Test test = new Test();
        for (int i = 0; i < cIters; ++i)
            {
            long ldtStart = System.currentTimeMillis();
            for (int j = 0; j < cOps; ++j)
                {
                test.getVirtual();
                }

            long cMillis = System.currentTimeMillis() - ldtStart;
            System.out.println(cMillis);
            }
        }
    }

And running 10 iterations of 1 billion virtual calls yields the following millisecond timings:

java14 Main 10 1000000000
1646
1636
1650
1677
1645
1669
1689
1668
1676
1724

java15 Main 10 1000000000
1775
1752
1698
1688
1759
1732
1758
1709
1670
1733

java16 Main 10 1000000000
8 <— WTF?
1409
1370
1347
1357
1341
1346
1349
1350
1354

Ok, so on 1.6 the first iteration of a billion calls took 8ms, and each subsequent iteration took nearly 200 times as long. How could that be?

Q> Did I screw up my measurements?

A> Nope, looks ok; double-checked, still looks ok.

Q> Was it the garbage collector?

A> Shouldn’t be; other than printing the results the test doesn’t generate any garbage.  I reran with GC logging just to be sure, and no GCs were logged (not surprising).
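
For reference, the GC-logging rerun would have looked something like this; -verbose:gc is the standard switch, though the exact flags used originally weren’t noted:

java16 -verbose:gc Main 10 1000000000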

Q> Did I need to let the test run longer so HotSpot could do its thing?

A> Nope; I ran it for much longer and the results held steady at ~1350ms

Q> Is the optimizer broken or deoptimizing after the first iteration?

A> Sure looks like it
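
A handy way to watch for this (not something shown in the runs above, but a standard HotSpot switch) is -XX:+PrintCompilation, which logs each method as HotSpot compiles it and prints “made not entrant” when previously compiled code is deoptimized:

java16 -XX:+PrintCompilation Main 10 1000000000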

Ok, so time to start thinking about how the optimizer is going to change my test.  The first thing to notice is that it could decide there is no reason to actually call my getVirtual() method: it can see there are no side effects from calling it, and that the result is discarded.  So let’s modify the test to do something with the result, looking at just the test loop from now on.

for (int i = 0; i < cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    int  cTotal   = 0;
    for (int j = 0; j < cOps; ++j)
        {
        cTotal += test.getVirtual();
        }

    long cMillis = System.currentTimeMillis() - ldtStart;
    System.out.println(cMillis + ", " + cTotal);
    }

And the results:

java16 Main 10 1000000000
46, 1000000000
1934, 1000000000
1875, 1000000000
1897, 1000000000
1924, 1000000000
1938, 1000000000
1884, 1000000000
1890, 1000000000
1895, 1000000000
1943, 1000000000

A few observations:

  • We can see the getVirtual() method (or at least its logic) is getting run
  • The cost of both the fast iteration and the slow iteration increased
  • The performance difference between the fast and slow runs is down to ~40x

Still, what’s the deal?  Running the test for longer doesn’t yield any additional fast iterations.  So at this point I get a few other people involved, and they work through many of the same suggestions and assumptions I’d listed above.  We also try the following:

  • pull the logic out of main() and put it in a non-static method
  • try running the test loop in parallel on multiple threads
  • try first warming up the JVM by running some random but heavy code prior to running the test (see the sketch after this list)
  • try recording the results into an array rather than printing during the test
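
As an illustration, the warm-up variant looked roughly like this; the original warm-up code wasn’t shown, so the busy-work below is a hypothetical sketch:

public static void warmUp()
    {
    // some random but heavy busy-work, just to give HotSpot something to chew on
    long lTotal = 0;
    for (int i = 0; i < 100000000; ++i)
        {
        lTotal += i % 7;
        }
    System.out.println("warmup total: " + lTotal);
    }

with warmUp() invoked from main() before the timed loop runs.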

All of these yield essentially the same results as above.  We do randomly trigger the loss of the fast iteration, but never trigger multiple fast iterations.  So yippee, we figured out how to make things go slower.  As a side note, it was interesting what would trigger the loss of the fast iteration; one such trigger was recording the results into an array, as follows:

long[] acMillis = new long[cIters];
int[] acTotal  = new int[cIters];
for (int i = 0; i < cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    int  cTotal   = 0;
    for (int j = 0; j < cOps; ++j)
        {
        cTotal += test.getVirtual();
        }

    acMillis[i] = System.currentTimeMillis() - ldtStart;
    acTotal [i] = cTotal;
    }

for (int i = 0; i < cIters; ++i)
    {
    System.out.println(acMillis[i] + ", " + acTotal[i]);
    }

java16 Main 10 1000000000
1328, 1000000000
1346, 1000000000
1338, 1000000000
1308, 1000000000
1343, 1000000000
1335, 1000000000
1324, 1000000000
1354, 1000000000
1336, 1000000000
1344, 1000000000

And even odder, we could “fix” this and get an initial fast result by recording the time measurements into an Object array rather than a long array. No kidding: dynamic object allocation improved the results.

Object[] acMillis = new Object[cIters];
int[] acTotal  = new int[cIters];
for (int i = 0; i < cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    int  cTotal   = 0;
    for (int j = 0; j < cOps; ++j)
        {
        cTotal += test.getVirtual();
        }

    acMillis[i] = new Long(System.currentTimeMillis() - ldtStart);
    acTotal [i] = cTotal;
    }

for (int i = 0; i < cIters; ++i)
    {
    System.out.println(acMillis[i] + ", " + acTotal[i]);
    }

java16 Main 10 1000000000
46, 1000000000
1806, 1000000000
1791, 1000000000
1831, 1000000000
1775, 1000000000
1779, 1000000000
1806, 1000000000
1811, 1000000000
1789, 1000000000
1783, 1000000000

Ok, so clearly this is nuts. But then things start to come into focus.  Let’s go back to a version where the optimizer can optimize out the getVirtual() call; in fact, we’ll do it ourselves.

for (int i = 0; i < cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    for (int j = 0; j < cOps; ++j)
        {
        // cTotal += test.getVirtual();
        }
    long cMillis = System.currentTimeMillis() - ldtStart;
    System.out.println(cMillis);
    }

java16 Main 10 1000000000
6
865
875
859
856
969
875
872
861
877

Ok, so we’ve now removed the thing we were originally trying to test, and performance improves across the board: we get roughly a 40% savings on the slow runs. So we see, not too surprisingly, that of the time we were measuring nearly half was spent in the inner ‘for’ loop, not in the thing we’d intended to test. And now we have the aha moment: how does the optimizer go about optimizing the test “infrastructure”, i.e. our main method?  So we make this change.

public static void main(String[] asArg)
    {
    int cIters = Integer.parseInt(asArg[0]);
    int cOps   = Integer.parseInt(asArg[1]);

    for (int i = 0; i < 10; ++i)
        {
        runTest(cIters, cOps);
        System.out.println("pass " + i + " complete");
        }
    }

public static void runTest(int cIters, int cOps)
    {
    Test test = new Test();
    for (int i = 0; i < cIters; ++i)
        {
        long ldtStart = System.currentTimeMillis();
        int  cTotal   = 0;
        for (int j = 0; j < cOps; ++j)
            {
            cTotal += test.getVirtual();
            }
        long cMillis = System.currentTimeMillis() - ldtStart;
        System.out.println(cMillis + ", " + cTotal);
        }
    }

java16 Main 10 1000000000
46, 1000000000
1886, 1000000000
1894, 1000000000
1902, 1000000000
1877, 1000000000
1870, 1000000000
1862, 1000000000
1910, 1000000000
1861, 1000000000
1878, 1000000000
pass 0 complete
1869, 1000000000
1899, 1000000000
1897, 1000000000
1866, 1000000000
1882, 1000000000
1855, 1000000000
1895, 1000000000
1890, 1000000000
1903, 1000000000
1880, 1000000000
pass 1 complete
38, 1000000000
41, 1000000000
40, 1000000000
43, 1000000000
39, 1000000000
42, 1000000000
39, 1000000000
42, 1000000000
61, 1000000000
44, 1000000000
pass 2 complete
40, 1000000000
41, 1000000000
41, 1000000000
42, 1000000000
39, 1000000000
39, 1000000000
42, 1000000000
….

Eureka, finally we can get lots of sustained fast runs. So it appears that we need to pop out of our test code in order for the optimizer to undo the apparent deoptimization it had made during the second iteration. Note that it was not simply that we’d run more total iterations, because we’d tried that earlier; it is in fact popping out of the test function which apparently allows an optimized version to be slipped in. (In hindsight this smells like HotSpot’s on-stack replacement: code compiled while a method is still running can differ from the version installed for a fresh invocation.) This ends up being an important bit to remember to account for when doing these types of micro benchmarks.

At this point I’d assumed the “fast” runs had inlined the body of the getVirtual() method into the ‘for’ loop, and that the “slow” runs actually called the function. Changing the code to manually inline, however, didn’t back this up, leaving me with more questions. At least the mystery of the seemingly broken optimizer is solved; that is enough for now.
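
For reference, the manually inlined variant of the timing loop looked something like this (a reconstruction, as the original wasn’t shown); it reads the public m_i field directly instead of calling getVirtual():

for (int i = 0; i < cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    int  cTotal   = 0;
    for (int j = 0; j < cOps; ++j)
        {
        // direct field read, no virtual call
        cTotal += test.m_i;
        }
    long cMillis = System.currentTimeMillis() - ldtStart;
    System.out.println(cMillis + ", " + cTotal);
    }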


6 Comments »

  1. You might want to read http://wikis.sun.com/display/HotSpotInternals/MicroBenchmarks. One specific issue with HotSpot is that the code in main() itself will not be optimized heavily, so pulling your test code out into a method called by main() makes sense. Also, you might find it interesting to test with Japex, which handles VM warm-up and timing for you.

    Comment by Patrick Wright — May 27, 2009 @ 4:05 pm

    • Thanks for the feedback. It was interesting that simply pulling the test out into a helper method did not change the behavior, i.e. I still received an initial very fast iteration, followed by endless very slow iterations. It was only when the test was modified to repeatedly call the helper method that it allowed the optimizer to apply a good optimization. All in all it makes sense, but it does go to show some of the surprises you can get when doing micro benchmarks. Generally I’m used to them providing overly positive performance estimates; in this case it went just the opposite. I’ll plan on checking out Japex and see what it can do.

      Comment by markandrewfalco — May 27, 2009 @ 4:31 pm

  2. I have two ‘out of thin air’ guesses.

    The first: did you try with the -Xint param to get HotSpot out of the picture? Perhaps in this simple case the interpreter is faster, since you are really not doing anything, and the amount of state held in the interpreter discovers this quickly and returns early.

    Version two is that the first time HotSpot sees your loop it completely unrolls it into a thousand or so no-op instructions, which then get further optimized into nothingness; then when the loop is seen again, perhaps it says “I do indeed need to run this as a loop”.

    Either way you should send this to Cliff Click. He’d figure it out in two ticks.

    Comment by Adam Malter — May 28, 2009 @ 2:15 pm

    • Yep, I’d given both -Xint and -Xbatch a try. -Xint resulted in consistent runs, but all were about 80x slower than my prior slow runs. -Xbatch resulted in the same initial fast iteration followed by slow iterations. I’d left out a number of the intermediate steps I’d taken in trying to track this down. Here are some of the other switches I’d tried, in an attempt to persuade the optimizer to yield additional fast results, all working about as effectively as -Xbatch.

      -Xcomp
      -XX:+AggressiveOpts
      -XX:+BackgroundCompilation

      In hindsight it makes sense that the optimizations I was looking for could not be applied until the method was re-executed, which of course I was not allowing when keeping everything within main(). Though I’m still curious about the drop in performance after the first iteration; apparently the optimizer was willing to take an initial stab at things, and was then able to de-optimize my second (and subsequent) iterations.

      Comment by Mark Falco — May 29, 2009 @ 12:39 am

  3. Be sure to run with -server mode to get reasonable compilation. I would assume that in most of the cases above the code should be optimized down to zero iterations. You can also try compiling with -Xbatch to see how it changes the results.

    Myself, when running somebody’s microbenchmarks, I rename his main(…) to mainX(…) and then add my own main(…) calling his in a loop. Normally I get stable results after 2-3 iterations.

    Comment by Artur — May 28, 2009 @ 3:08 pm

    • I suppose I’d left out some important details. The tests were in fact run with -server (my default execution mode); running with -client always produced “slow” results. The intent of my investigation had been to identify why “fast” runs appeared initially and then vanished. In the course of trial and error, I found that many changes, either to the code or to JVM switches, would give me all “slow” runs, but it was the elusive “fast” runs I was searching for.

      I definitely agree with your point regarding having main() loop over the real test code, and not itself be part of the test. This little exercise has taught me to always add that going forward.

      Comment by Mark Falco — May 29, 2009 @ 12:27 am

