D. J. Bernstein
Computer hardware

x86 speed

Pentium timings

See Agner Fog's Pentium optimization manual [text copy of old version] and Intel's Intel Architecture Optimization Manual (order number 242816).

Pentium Pro timings

See the manuals listed above. Note that Pentium Pro optimization is very different from Pentium optimization.

Pentium MMX timings

The Pentium MMX is essentially the same as the Pentium. Big exceptions: MMX instructions; a 16K L1 data cache; the Pentium-Pro branch-prediction mechanism; and no first-time-in-cache pairing restrictions.

Pentium II timings

The Pentium II is essentially the same as the Pentium Pro. Big exceptions: MMX instructions; a 16K L1 data cache.

Pentium III timings

The Pentium III is essentially the same as the Pentium II. Big exceptions: cache prefetch instructions (welcome to the 1990s, Intel!); SSE instructions. See Intel's Intel Architecture Optimization Reference Manual 245127 for MMX and SSE information. Beware that SSE uses new registers that need to be saved in context switches; SSE code will fail sporadically on older operating systems that do not save them.

Pentium 4 timings

The Pentium 4 has a similar feel to the Pentium III, plus SSE2 instructions. However, the internal architecture is different. Cycle counts are generally much worse than on the Pentium III, often even worse than on the original Pentium.

AMD K6-2 timings

See AMD's Note 21924 (PDF).

AMD Athlon timings

See AMD's Note 22007 (PDF).

The Athlon L1 data cache is only two-way set-associative but is a gigantic 64K. (This is one of the reasons that the Athlon is much faster than the Pentium III.) In one cycle it can handle two 64-bit loads, or one 64-bit load and one 64-bit store, or two 32-bit stores. It has a first-level TLB with 24 entries for 4K pages and 8 entries for large pages, and a second-level four-way TLB with 256 entries for 4K pages.
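To see what "only two-way" means for data layout, here is a rough sketch of how addresses would map to cache sets. The 64K size and two ways come from the text above; the 64-byte line size and the plain modulo indexing are assumptions made for illustration, and the names cache_set and same_set are hypothetical.

```c
#include <stdint.h>

enum {
    CACHE_BYTES = 64 * 1024, /* from the text */
    CACHE_WAYS  = 2,         /* from the text */
    LINE_BYTES  = 64,        /* assumption: line size is not stated in the text */
    NUM_SETS    = CACHE_BYTES / (CACHE_WAYS * LINE_BYTES) /* 512 under these assumptions */
};

/* Set index that a byte address falls into, assuming simple modulo indexing. */
static unsigned cache_set(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Nonzero if two addresses compete for the same set (which holds only two lines). */
static int same_set(uintptr_t a, uintptr_t b)
{
    return cache_set(a) == cache_set(b);
}
```

Under these assumptions, addresses that differ by a multiple of 32K land in the same set, so a loop that walks three or more arrays spaced exactly 32K apart could keep evicting its own data even though the cache is large.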

The Athlon can do an FADD and an FMUL, along with two loads, every cycle, if the code is properly scheduled. (This is another of the reasons that the Athlon is much faster than the Pentium III.) Both FADD and FMUL have latency 4. For example, the code

     f = x[1]; f *= y[4]; r5 += f;
     f = x[1]; f *= y[5]; r6 += f;
     f = x[1]; f *= y[6]; r7 += f;
     f = x[1]; f *= y[7]; r8 += f;
     f = x[2]; f *= y[3]; r5 += f;
     f = x[2]; f *= y[4]; r6 += f;
     ...

takes 1 cycle per line if the 8 instruction bytes in each line (3 for FLD with 8-bit displacement, 3 for FMUL with 8-bit displacement, 2 for FADDP) are aligned to an 8-byte boundary. The same code takes 1.5 cycles per line if the instructions are not aligned. Julian Ruhe suggests padding floating-point instructions with REP prefixes to hit 8-byte boundaries; an Athlon assembler could easily take care of this.
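The pattern above can be written as a C loop. This is only a sketch: the function name accumulate4 and its parameters are hypothetical, and a compiler is free to schedule the operations differently from the hand-written sequence. The point it illustrates is that each of the four accumulators is touched once every four lines, which presumably is what lets an FADD with latency 4 finish before the same accumulator is needed again.

```c
#include <assert.h>
#include <stddef.h>

/* Adds the sum over i < n of x[i] * y[k + m - i] to r[m], for m = 0..3.
   The caller must ensure n >= 1, k >= n - 1, and that y has at least
   k + 4 elements. Hypothetical helper, not taken from the original page. */
void accumulate4(double r[4], const double *x, const double *y,
                 size_t n, size_t k)
{
    double r0 = r[0], r1 = r[1], r2 = r[2], r3 = r[3];
    size_t i;

    assert(n >= 1 && k + 1 >= n);

    for (i = 0; i < n; i++) {
        const double *yi = y + (k - i); /* window of y slides down as i grows */
        double f;
        /* Four independent add chains, one line per accumulator. */
        f = x[i]; f *= yi[0]; r0 += f;
        f = x[i]; f *= yi[1]; r1 += f;
        f = x[i]; f *= yi[2]; r2 += f;
        f = x[i]; f *= yi[3]; r3 += f;
    }

    r[0] = r0; r[1] = r1; r[2] = r2; r[3] = r3;
}
```

With x = {1, 2}, y = {10, 20, 30, 40, 50}, n = 2 and k = 1, the four sums come out as 40, 70, 100 and 130.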

The Athlon does an excellent job of reordering operations. (This is another of the reasons that the Athlon is much faster than the Pentium III.)

Cycle counters

The Pentium line and the Athlon have built-in 64-bit cycle counters, measuring time since boot. To read the cycle counter, use machine-language bytes 15 and 49 (hexadecimal 0F 31, the RDTSC instruction); the low 32 bits of the result are put into EAX and the high 32 bits into EDX.
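A minimal sketch of reading the counter from C, assuming GCC or Clang inline assembly on an x86 target. The function name read_cycles is hypothetical, and the fallback that returns 0 on other targets exists only so that the sketch still compiles there.

```c
#include <stdint.h>

/* Returns the 64-bit cycle counter on x86 with GCC or Clang;
   returns 0 on any other compiler or target (placeholder only). */
static uint64_t read_cycles(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo, hi;
    /* Bytes 15 and 49 are the two-byte RDTSC opcode:
       low half of the counter in EAX, high half in EDX. */
    __asm__ __volatile__(".byte 15, 49" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;
#endif
}
```

To time a piece of code, read the counter before and after and subtract. On CPUs that reorder instructions the reading is not tied to an exact point in the instruction stream, so differences over very short sequences should be treated as approximate.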

Code measurement tools

Intel's VTune Analyzer includes a Pentium simulator and a Pentium II simulator, but it isn't free.

A usable simulator is a tremendous asset for programmers trying to identify bottlenecks in speed-critical code. Every CPU company has simulators for its chips; it amazes me that these simulators aren't released for free.

Other sources of information

The Pentium Compiler Group has a Pentium-optimized version of gcc; their documentation page has some links to x86 chip information. For more links, try Paul Hsieh's page. For an introduction to x86 programming, see Randall Hyde's Art of Assembly Language Programming.