Suppose you have some arbitrary operation whose performance you wish to measure. For instance:

[sourcecode language="java"]
public static double testMethod(double dfl)
    {
    for (int i = 0; i < 100; ++i)
        {
        dfl = Math.log(dfl);
        }
    return dfl;
    }
[/sourcecode]

The content of the method is not important; in fact we should not assume we know the implementation. All we know is that it is likely to complete in well under the system's clock resolution. Typically we'd micro-benchmark this during development by simply calling it a large number of times in a tight loop and computing the average execution time:

[sourcecode language="java"]
public static Result testClassic(int cIters)
    {
    long   ldtStart  = System.currentTimeMillis();
    double dflResult = 123.456;
    for (int i = 0; i < cIters; ++i)
        {
        dflResult = testMethod(dflResult);
        }
    long   cMillis        = System.currentTimeMillis() - ldtStart;
    double dflMillisPerOp = ((double) cMillis) / cIters;
    return new Result(cIters, cMillis, dflMillisPerOp, dflResult);
    }
[/sourcecode]

One of the primary reasons for the loop is that we expect testMethod to return in well under the clock resolution, so without the loop we'd be left with an arguably meaningless value of "0". Instead we forcefully run it long enough that we are sure the clock will have "ticked" many times.

This is all well and good, but what if we weren't in a position to call it in a tight loop? What if our interest was in measuring the cost of periodic calls to this function inside a working application, where we can't simply time a loop over thousands of calls?

At this point you may be asking why you would want the application to measure the cost; isn't that what profilers are for? Well yes, and this isn't meant to replace a profiler, but it can be useful for collecting runtime statistics. For instance in <a href="http://www.oracle.com/technology/products/coherence/index.html">Oracle Coherence</a> we've considered measuring the average serialization time for user data objects. These are not classes we have control over, or the ability to easily profile, but it would be nice to track their serialization cost as it is a crucial part of our overall performance. As Coherence still supports Java 1.4 we are without the higher resolution System.nanoTime(), and so an alternative is needed.

So what do we do with a millisecond resolution clock when we want to measure sub-millisecond operations? Well, we just do the obvious.

[sourcecode language="java"]
// ... other application code ...

long ldtStart = System.currentTimeMillis();
testMethod(dflInput);
cMillisTotal += System.currentTimeMillis() - ldtStart;
++cIters;

// ... other application code ...

// at some point compute the average cost
double dflMillisPerOp = ((double) cMillisTotal) / cIters;
[/sourcecode]

That is it; all we need to do is cross over this code a few thousand times and we’re good. I’d imagine some might be wondering what the use of this is, as the per-operation measurement should yield zero each time. Well, let’s give it a try and see what we get. To simulate the application crossing over the above code we’ll introduce a loop which, between calls into the above code, burns a random amount of CPU, i.e. our fake application logic. We’ll call this the “discrete” approach, as compared to the “classic” approach shown above.

[sourcecode language="java"]
public static Result testDiscrete(int cIters)
    {
    int    nFactor   = rnd.nextInt(10);
    long   ldtStart  = System.currentTimeMillis();
    long   cMillis   = 0;
    double dflResult = 123.456;
    for (int i = 0; i < cIters; ++i)
        {
        long ldtIter = System.currentTimeMillis();
        dflResult = testMethod(dflResult);
        cMillis  += System.currentTimeMillis() - ldtIter;
        burnCPU(nFactor); // simulate lots of other code running
        }
    long   cMillisOuter   = System.currentTimeMillis() - ldtStart;
    double dflMillisPerOp = ((double) cMillis) / cIters;
    return new Result(cIters, cMillisOuter, dflMillisPerOp, dflResult);
    }
[/sourcecode]
The results for testing multiple runs with 100,000 iterations each are as follows:
Classic: Average execution in 0.0092 ms; after measuring 100000 over 920 ms
Discrete: Average execution in 0.00938 ms; after measuring 100000 over 17192 ms
Classic: Average execution in 0.00883 ms; after measuring 100000 over 883 ms
Discrete: Average execution in 0.00898 ms; after measuring 100000 over 908 ms
Classic: Average execution in 0.0089 ms; after measuring 100000 over 890 ms
Discrete: Average execution in 0.00905 ms; after measuring 100000 over 19128 ms
Classic: Average execution in 0.00885 ms; after measuring 100000 over 885 ms
Discrete: Average execution in 0.0089 ms; after measuring 100000 over 5435 ms
Classic: Average execution in 0.00884 ms; after measuring 100000 over 884 ms
Discrete: Average execution in 0.0094 ms; after measuring 100000 over 12287 ms
Classic: Average execution in 0.00887 ms; after measuring 100000 over 887 ms
Discrete: Average execution in 0.00899 ms; after measuring 100000 over 3173 ms
As you can see, both approaches yield surprisingly similar results: the cost of our testMethod is around 0.01 ms. The variability in the total duration of the discrete test is intentional; it is caused by the burnCPU() method, which is configured to burn a random amount of CPU per test run, simulating different frequencies of calls to testMethod().
But how does the discrete approach yield apparently accurate results if it always measures 0 ms per call? Well, I suppose the obvious answer is that it doesn't always measure 0 ms; sometimes it measures 1 ms. The fact that the method always completes in well under a millisecond doesn't mean we don't occasionally cross a clock edge during the measurement and get an even more incorrect 1 ms result. Remember that we aren't in control of the clock, so we can't assume that we are at the beginning of a clock cycle when we record the start time; we could end up recording the start time when we are just a few nanoseconds from the next tick.
Ok, but how does this yield such good results? The answer lies in the question, "What is the probability of us crossing the clock edge?" Using the result from the classic method we can see that the probability of crossing a 1 ms clock boundary is around 1%. This comes straight from our measurement of ~0.01 ms: in the classic case with back-to-back method calls, the clock would tick once for every 100 or so 0.01 ms calls. The cost of testMethod() doesn't change with the discrete approach, and thus the probability of us encountering a clock tick is still 1%. So 99% of our samples result in 0 ms, and 1% result in 1 ms, meaning that the accumulated milliseconds for our 100,000 samples will be around one second, which pretty much lines up with what we get in the classic case.
So this is how we come to the title: we are measuring by relying upon the small probability of crossing a clock edge, and on summing the results from a large number of samples.
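The probability argument can also be checked without a real clock. The following is a minimal sketch and not part of the original test: it assumes each measurement starts at a uniformly random point within a tick, models the clock as a value that only advances in whole ticks, and uses made-up names (ClockEdgeSimulation, simulateAverage).

```java
import java.util.Random;

/**
 * Simulates timing a short operation with a coarse clock. The class and
 * method names are illustrative only, and the uniformly random start
 * phase is an assumption about how calls line up with clock ticks.
 */
final class ClockEdgeSimulation
    {
    /**
     * Return the average measured duration over cSamples simulated calls.
     *
     * @param dflOpMillis    true cost of the operation, in ms
     * @param dflTickMillis  how often the simulated clock advances, in ms
     * @param cSamples       number of simulated measurements
     * @param rnd            source of the random start phases
     */
    static double simulateAverage(double dflOpMillis, double dflTickMillis,
            int cSamples, Random rnd)
        {
        double dflTotal = 0.0;
        for (int i = 0; i < cSamples; ++i)
            {
            // assumed: the call begins at a random point within a tick
            double dflStart = rnd.nextDouble() * dflTickMillis;
            double dflEnd   = dflStart + dflOpMillis;

            // the clock only reports whole ticks, like currentTimeMillis()
            double dflReadStart = Math.floor(dflStart / dflTickMillis) * dflTickMillis;
            double dflReadEnd   = Math.floor(dflEnd / dflTickMillis) * dflTickMillis;

            // each sample contributes either 0 or exactly one tick
            dflTotal += dflReadEnd - dflReadStart;
            }
        return dflTotal / cSamples;
        }

    public static void main(String[] asArg)
        {
        // 0.01 ms operation, 1 ms clock, one million simulated calls
        double dflAvg = simulateAverage(0.01, 1.0, 1000000, new Random());
        System.out.println("Simulated average: " + dflAvg + " ms");
        }
    }
```

Under these assumptions roughly one sample in a hundred reads a full tick and the rest read zero, so the printed average should land close to the true 0.01 ms even though no single reading is ever correct.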
I find it really interesting to watch this work. Even though the math makes sense, it is still surprising to see the results come out with such accuracy. The results are not the product of a carefully crafted test: change the iterations, take out or change the "burn", change the test function, and it all still holds up quite well. It also isn't reliant on this particular clock resolution; for instance, the same thing will work on Windows, where the precision of System.currentTimeMillis() is actually ~16 ms (the resolution is still 1 ms). In such a case the probability of seeing a clock edge becomes 1 in 1600, but the cost of that one is now 16 ms, so the totals and averages will still hold.
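Why the clock granularity drops out of the average can be written down directly. This is a sketch under two assumptions the post doesn't state formally: each measurement starts at a uniformly random point within a clock period, and the samples are independent. The symbols are introduced here for illustration: d is the true cost of the operation, G the clock granularity (with d < G), n the number of samples, and X a single reading.

```latex
\begin{aligned}
P(X = G) &= \frac{d}{G}, \qquad P(X = 0) = 1 - \frac{d}{G} \\
E[X] &= G \cdot \frac{d}{G} = d \\
\operatorname{Var}(X) &= G^2 \cdot \frac{d}{G}\left(1 - \frac{d}{G}\right) = d\,(G - d) \\
\operatorname{SE}(\bar{X}_n) &= \sqrt{\frac{d\,(G - d)}{n}}
\end{aligned}
```

With d = 0.01 ms and n = 100,000 this gives a standard error of roughly 0.0003 ms for G = 1 ms and roughly 0.0013 ms for G = 16 ms. So under this model the average is unbiased at either granularity, but the coarser clock needs more samples to settle to the same precision.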
Now there are a few small caveats:

Of course the biggest caveat of all is that with higher resolution clocks such as Java 1.5’s System.nanoTime() this technique isn’t as necessary, though that doesn’t change the fact that it is still quite interesting.
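For comparison, on Java 1.5 or later the same accumulation can be done in nanoseconds. This is a minimal sketch rather than anything taken from Coherence: the class name NanoAccumulator and its methods are made up for illustration, and testMethod is simply copied from this post.

```java
/**
 * Accumulates per-call timings taken with System.nanoTime() (Java 1.5+).
 * The class and method names are hypothetical, for illustration only.
 */
final class NanoAccumulator
    {
    private long m_cNanosTotal;
    private long m_cIters;

    /** Record one measurement bracketed by two System.nanoTime() reads. */
    void record(long ldtStartNanos, long ldtEndNanos)
        {
        m_cNanosTotal += ldtEndNanos - ldtStartNanos;
        ++m_cIters;
        }

    /** Average cost per call in milliseconds, or 0 if nothing was recorded. */
    double getMillisPerOp()
        {
        return m_cIters == 0 ? 0.0 : (m_cNanosTotal / 1000000.0) / m_cIters;
        }

    /** Number of measurements recorded so far. */
    long getCount()
        {
        return m_cIters;
        }

    /** Same short operation as in the post. */
    static double testMethod(double dfl)
        {
        for (int i = 0; i < 100; ++i)
            {
            dfl = Math.log(dfl);
            }
        return dfl;
        }

    public static void main(String[] asArg)
        {
        NanoAccumulator acc       = new NanoAccumulator();
        double          dflResult = 123.456;
        for (int i = 0; i < 100000; ++i)
            {
            long ldtStart = System.nanoTime();
            dflResult = testMethod(dflResult);
            acc.record(ldtStart, System.nanoTime());
            }
        System.out.println("Average execution in " + acc.getMillisPerOp()
                + " ms (result " + dflResult + ")");
        }
    }
```

Note that System.nanoTime() is only meaningful as a difference between two readings, and each reading has a cost of its own, so the per-call figure here includes some clock overhead.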

And for those that are interested here is the full source:

[sourcecode language="java"]
import java.util.Random;

/**
 * Scratch test
 */
public class Scratch
    {
    public static void main(String[] asArg)
            throws Exception
        {
        int cRuns  = asArg.length > 0 ? Integer.parseInt(asArg[0]) : 100;
        int cIters = asArg.length > 1 ? Integer.parseInt(asArg[1]) : 100000;
        for (int i = 0; i < cRuns; ++i)
            {
            System.out.println("Classic: " + testClassic(cIters));
            System.out.println("Discrete: " + testDiscrete(cIters));
            System.out.println();
            }
        }

    /**
     * Some non-trivial method which is still very short in duration.
     */
    public static double testMethod(double dfl)
        {
        for (int i = 0; i < 100; ++i)
            {
            dfl = Math.log(dfl);
            }
        return dfl;
        }

    /**
     * Perform "classic" performance test measuring how long it takes to run
     * the operation many times, and then computing an average.
     */
    public static Result testClassic(int cIters)
        {
        long   ldtStart  = System.currentTimeMillis();
        double dflResult = 123.456;
        for (int i = 0; i < cIters; ++i)
            {
            dflResult = testMethod(dflResult);
            }
        long   cMillis        = System.currentTimeMillis() - ldtStart;
        double dflMillisPerOp = ((double) cMillis) / cIters;
        return new Result(cIters, cMillis, dflMillisPerOp, dflResult);
        }

    /**
     * Perform "non-classic" performance measurement where timings are recorded
     * for individual executions, and then an average is computed.
     */
    public static Result testDiscrete(int cIters)
        {
        int    nFactor   = rnd.nextInt(10);
        long   ldtStart  = System.currentTimeMillis();
        long   cMillis   = 0;
        double dflResult = 123.456;
        for (int i = 0; i < cIters; ++i)
            {
            long ldtIter = System.currentTimeMillis();
            dflResult = testMethod(dflResult);
            cMillis  += System.currentTimeMillis() - ldtIter;
            burnCPU(nFactor); // simulate lots of other code running
            }
        long   cMillisOuter   = System.currentTimeMillis() - ldtStart;
        double dflMillisPerOp = ((double) cMillis) / cIters;
        return new Result(cIters, cMillisOuter, dflMillisPerOp, dflResult);
        }

    /**
     * Data structure for recording results.
     */
    static class Result
        {
        public Result(int cIters, long cMillisTotal, double dflMillisPerOp,
                double dflResult)
            {
            m_cIters         = cIters;
            m_cMillisTotal   = cMillisTotal;
            m_dflMillisPerOp = dflMillisPerOp;
            m_dflResult      = dflResult;
            }

        public String toString()
            {
            return "Average execution in " + m_dflMillisPerOp +
                   " ms; after measuring " + m_cIters + " over " +
                   m_cMillisTotal + " ms";
            }

        private int    m_cIters;
        private long   m_cMillisTotal;
        private double m_dflMillisPerOp;
        private double m_dflResult;
        }

    /**
     * Helper method to burn a random amount of CPU in order to simulate
     * spreading out the individual measurements.
     */
    public static int burnCPU(int nFactor)
        {
        int r = 0;
        for (int j = 0; j < nFactor * 1000; ++j)
            {
            r = 1 + rnd.nextInt();
            }
        return r;
        }

    static final Random rnd = new Random();
    }
[/sourcecode]
I hope you've found this as interesting as I have.

Thank you for this good and simple to understand article.

It happens I also use this simple measurement with meaningful results.

But System.nanoTime() from Java 1.5 will not save you in all cases: its precision is better than that of System.currentTimeMillis() (16 ms on Windows XP, as you said), but System.nanoTime() has a much higher cost than System.currentTimeMillis(). On my recent Windows XP workstation System.nanoTime() has a cost of 2 microseconds, which is not always negligible.

Comment by Emeric — June 11, 2009 @ 5:46 pm