May 7, 2009

Fun with micro benchmarks and optimizers

Filed under: Software — Mark Falco @ 2:11 am

I’ve been doing some micro benchmarks comparing the apparent cost of Java vs C++ virtual function calls. While the comparison is interesting, what really threw me for a bit was the results I was getting in Java 1.6. Have a look at the first version of my test:

public class Main
    {
    public static class Test
        {
        public int getVirtual()
            {
            return m_i;
            }

        public int m_i = 1;
        }

    public static void main(String[] asArg)
        {
        int cIters = Integer.parseInt(asArg[0]);
        int cOps   = Integer.parseInt(asArg[1]);

        Test test = new Test();
        for (int i = 0; i < cIters; ++i)
            {
            long ldtStart = System.currentTimeMillis();
            for (int j = 0; j < cOps; ++j)
                {
                test.getVirtual();
                }

            long cMillis = System.currentTimeMillis() - ldtStart;
            System.out.println(cMillis);
            }
        }
    }

And running 10 iterations of 1 billion virtual calls yields the following millisecond timings:

java14 Main 10 1000000000
1646
1636
1650
1677
1645
1669
1689
1668
1676
1724

java15 Main 10 1000000000
1775
1752
1698
1688
1759
1732
1758
1709
1670
1733

java16 Main 10 1000000000
8    <— WTF?
1409
1370
1347
1357
1341
1346
1349
1350
1354
Ok, so on 1.6 the first iteration of a billion calls took 8ms, and each subsequent iteration took nearly 200 times as long. How could that be?

Q> Did I screw up my measurements?

A> Nope, looks ok. Double-checked; still looks ok.

Q> Was it the garbage collector?

A> Shouldn't be: other than printing the results, the test doesn't generate any garbage. I reran with GC logging just to be sure, and no GCs were logged (not surprising).

Q> Did I need to let the test run longer so HotSpot could do its thing?

A> Nope. Ran it for much longer; the results held steady at ~1350ms.

Q> Is the optimizer broken, or deoptimizing after the first iteration?

A> Sure looks like it.

Ok, so time to start thinking about how the optimizer is going to change my test. The first thing to notice is that it could identify that there is no reason to actually call my getVirtual() method: it can see there are no side effects from calling it, and it can see that the result is discarded. So let's modify the test to do something with the result; looking at just the test loop now:

for (int i = 0; i < cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    int  cTotal   = 0;
    for (int j = 0; j < cOps; ++j)
        {
        cTotal += test.getVirtual();
        }

    long cMillis = System.currentTimeMillis() - ldtStart;
    System.out.println(cMillis + ", " + cTotal);
    }

And the results:

java16 Main 10 1000000000
46, 1000000000
1934, 1000000000
1875, 1000000000
1897, 1000000000
1924, 1000000000
1938, 1000000000
1884, 1000000000
1890, 1000000000
1895, 1000000000
1943, 1000000000

A few observations:

  • We can see the getVirtual() method (or at least its logic) is getting run
  • the cost of both the fast iteration and the slow iterations increased
  • the performance difference between fast and slow runs is down to a factor of ~40

Still, what's the deal? Running the test for longer doesn't yield any additional fast iterations. So at this point I got a few other people involved, and they worked through many of the same suggestions and assumptions I'd listed above. We also tried the following:

  • pull the logic out of main() and put it in a non-static method
  • try running the test loop in parallel on multiple threads
  • try first warming up the JVM by running some random but heavy code prior to running the test
  • try recording the results into an array rather than printing during the test
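As one concrete example of the variants above, here is a hedged sketch of the warm-up attempt; the class name, the warm-up loop, and its constants are illustrative rather than the exact code we ran. The idea is simply to make the JVM chew on some unrelated work before the timed loop starts:

```java
public class WarmupMain
    {
    public static class Test
        {
        public int getVirtual()
            {
            return m_i;
            }

        public int m_i = 1;
        }

    public static void main(String[] asArg)
        {
        int cIters = Integer.parseInt(asArg[0]);
        int cOps   = Integer.parseInt(asArg[1]);

        // warm up: burn cycles on unrelated work before any timing starts;
        // print the result so the loop cannot be optimized away
        long lBurn = 0;
        for (int i = 0; i < 100000000; ++i)
            {
            lBurn += i * 31;
            }
        System.out.println("warmup: " + lBurn);

        Test test = new Test();
        for (int i = 0; i < cIters; ++i)
            {
            long ldtStart = System.currentTimeMillis();
            int  cTotal   = 0;
            for (int j = 0; j < cOps; ++j)
                {
                cTotal += test.getVirtual();
                }
            long cMillis = System.currentTimeMillis() - ldtStart;
            System.out.println(cMillis + ", " + cTotal);
            }
        }
    }
```

This made no difference for us, presumably because the warm-up work only warms up its own code paths, not the not-yet-compiled test loop.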

All of these yield essentially the same results as above. We did randomly trigger the loss of the fast iteration, but never triggered multiple fast iterations. So, yippee, we figured out how to make things go slower. As a side note, it was interesting what would trigger the loss of the fast iteration: recording the results into an array, as follows:

long[] acMillis = new long[cIters];
int[]  acTotal  = new int[cIters];
for (int i = 0; i < cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    int  cTotal   = 0;
    for (int j = 0; j < cOps; ++j)
        {
        cTotal += test.getVirtual();
        }

    acMillis[i] = System.currentTimeMillis() - ldtStart;
    acTotal [i] = cTotal;
    }

for (int i = 0; i < cIters; ++i)
    {
    System.out.println(acMillis[i] + ", " + acTotal[i]);
    }

java16 Main 10 1000000000
1328, 1000000000
1346, 1000000000
1338, 1000000000
1308, 1000000000
1343, 1000000000
1335, 1000000000
1324, 1000000000
1354, 1000000000
1336, 1000000000
1344, 1000000000

And even odder, we could "fix" this and get an initial fast result by recording the time measurements into an Object array rather than a long array. No kidding: dynamic object allocation improved the results.

Object[] acMillis = new Object[cIters];
int[]    acTotal  = new int[cIters];
for (int i = 0; i < cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    int  cTotal   = 0;
    for (int j = 0; j < cOps; ++j)
        {
        cTotal += test.getVirtual();
        }

    acMillis[i] = new Long(System.currentTimeMillis() - ldtStart);
    acTotal [i] = cTotal;
    }

for (int i = 0; i < cIters; ++i)
    {
    System.out.println(acMillis[i] + ", " + acTotal[i]);
    }

java16 Main 10 1000000000
46, 1000000000
1806, 1000000000
1791, 1000000000
1831, 1000000000
1775, 1000000000
1779, 1000000000
1806, 1000000000
1811, 1000000000
1789, 1000000000
1783, 1000000000

Ok, so clearly this is nuts. But then things start to come into focus. Let's go back to a version where the optimizer can optimize out the getVirtual() call; in fact, we'll do it for it:

for (int i = 0; i < cIters; ++i)
    {
    long ldtStart = System.currentTimeMillis();
    for (int j = 0; j < cOps; ++j)
        {
        // cTotal += test.getVirtual();
        }
    long cMillis = System.currentTimeMillis() - ldtStart;
    System.out.println(cMillis);
    }

java16 Main 10 1000000000
6
865
875
859
856
969
875
872
861
877

Ok, so we've removed the thing we were originally trying to test, and performance improves across the board: we get a 40% savings on the slow runs. So we see, not too surprisingly, that nearly half of the time we were measuring was spent in the inner 'for' loop, not in the thing we'd intended to test. And now comes the "aha" moment: how does the optimizer look at optimizing the test "infrastructure", i.e. our main method? We make this change:

public static void main(String[] asArg)
    {
    int cIters = Integer.parseInt(asArg[0]);
    int cOps   = Integer.parseInt(asArg[1]);

    for (int i = 0; i < 10; ++i)
        {
        runTest(cIters, cOps);
        System.out.println("pass " + i + " complete");
        }
    }

public static void runTest(int cIters, int cOps)
    {
    Test test = new Test();
    for (int i = 0; i < cIters; ++i)
        {
        long ldtStart = System.currentTimeMillis();
        int  cTotal   = 0;
        for (int j = 0; j < cOps; ++j)
            {
            cTotal += test.getVirtual();
            }
        long cMillis = System.currentTimeMillis() - ldtStart;
        System.out.println(cMillis + ", " + cTotal);
        }
    }

java16 Main 10 1000000000
46, 1000000000
1886, 1000000000
1894, 1000000000
1902, 1000000000
1877, 1000000000
1870, 1000000000
1862, 1000000000
1910, 1000000000
1861, 1000000000
1878, 1000000000
pass 0 complete
1869, 1000000000
1899, 1000000000
1897, 1000000000
1866, 1000000000
1882, 1000000000
1855, 1000000000
1895, 1000000000
1890, 1000000000
1903, 1000000000
1880, 1000000000
pass 1 complete
38, 1000000000
41, 1000000000
40, 1000000000
43, 1000000000
39, 1000000000
42, 1000000000
39, 1000000000
42, 1000000000
61, 1000000000
44, 1000000000
pass 2 complete
40, 1000000000
41, 1000000000
41, 1000000000
42, 1000000000
39, 1000000000
39, 1000000000
42, 1000000000
….

Eureka! Finally we can get lots of sustained fast runs. So it appears that we need to pop out of our test code in order for the optimizer to undo the apparent deoptimization it made during the second iteration. Note that it was not simply that we'd run more total iterations, because we'd tried that earlier; it is in fact popping out of the test function that apparently allows an optimized version to be slipped in. This ends up being an important thing to remember to account for when doing these types of micro benchmarks. At this point I'd assumed the "fast" runs had inlined the body of the getVirtual() method into the 'for' loop, and that the "slow" runs really did call the function. Changing the code to manually inline, however, doesn't back this up, leaving me with more questions. At least the mystery of the broken optimizer is solved; that is enough for now.
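For reference, the manual-inlining change mentioned above amounts to replacing the virtual call with a direct field read. A minimal sketch (the class name is illustrative, not the exact code we ran):

```java
public class InlineMain
    {
    public static class Test
        {
        public int m_i = 1;
        }

    public static void main(String[] asArg)
        {
        int cIters = Integer.parseInt(asArg[0]);
        int cOps   = Integer.parseInt(asArg[1]);

        Test test = new Test();
        for (int i = 0; i < cIters; ++i)
            {
            long ldtStart = System.currentTimeMillis();
            int  cTotal   = 0;
            for (int j = 0; j < cOps; ++j)
                {
                // the body of getVirtual(), inlined by hand
                cTotal += test.m_i;
                }
            long cMillis = System.currentTimeMillis() - ldtStart;
            System.out.println(cMillis + ", " + cTotal);
            }
        }
    }
```

If the fast runs were purely a matter of inlining the accessor, this version should always be fast; it wasn't, which is what leaves the remaining questions open.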
