And vs Modulo performance

From EggeWiki

In school I was taught the modulo operator is slow compared with bitwise operators. This is because often modulo is implemented using divide. See http://en.wikipedia.org/wiki/Modulo_operation. So when I write code which is going to sample a variable, I often use AND instead of modulo. Well, it turns out, I may have fallen victim to premature optimization - at least in Java anyway.

public class ModuloSpeed {


	public void testWarmup() {
		long s = System.currentTimeMillis();
		int sum = 0, count = 0;
		while(System.currentTimeMillis() - s < 10000) {
			sum += count++ & 0xff % 200;
		}

		System.out.println("Warm-up completed (" + sum + ")");
	}
	
	public void testAnd() {
		long s = System.currentTimeMillis();
		int sum = 0, count = 0;
		while(count < 1000000000) {
			sum += count++ & 0xff;
		}

		System.out.println(count + " ands completed in " + (System.currentTimeMillis() - s) + "ms (" + sum + ")");
	}
	
	public void testModulo() {
		long s = System.currentTimeMillis();
		int sum = 0, count = 0;
		while(count < 1000000000) {
			sum += count++ % 200;
		}

		System.out.println(count + " modulos completed in " + (System.currentTimeMillis() - s) + "ms (" + sum + ")");
	}
	
	public static void main(String[] args) {
		ModuloSpeed m = new ModuloSpeed();
		m.testWarmup();
		m.testAnd();
		m.testModulo();
		m.testAnd();
		m.testModulo();
	}
	
}

Running on my desktop (Intel Pentium D CPU 2.8GHz) yielded:

Warm-up completed (-894978952)
Warm-up completed (-868618467)
1000000000 ands completed in 1453ms (-1349018880)
1000000000 modulos completed in 15579ms (715752192)
1000000000 ands completed in 1484ms (-1349018880)
1000000000 modulos completed in 15485ms (715752192)

These tests show AND is about 10 times faster than modulo. I next decided to run the same test on a server (Dual-Core AMD Opteron Processor 2216).

$ java ModuloSpeed
Warm-up completed (930431069)
1000000000 ands completed in 781ms (-1349018880)
1000000000 modulos completed in 3249ms (715752192)
1000000000 ands completed in 766ms (-1349018880)
1000000000 modulos completed in 3289ms (715752192)

$ java -d64 ModuloSpeed
1000000000 ands completed in 814ms (-1349018880)
1000000000 modulos completed in 3666ms (715752192)
1000000000 ands completed in 877ms (-1349018880)
1000000000 modulos completed in 3244ms (715752192)

On the AMD hardware, AND is only five times faster, but still fast. However, as soon as I add another operation, like comparing the system clock, the performance is about 100x worse, and one ends up measuring the performance of the other function. Very few loops in Java are going to run 1 billion operations per second. So while the performance of AND is faster, unless I was writing a high performance routine, I wouldn't concern myself with the difference.