D. J. Bernstein
Computer hardware

PowerPC speed

Processors

MacOS sysctl reports "hw.cputype: 18" for PowerPC, and in particular "hw.cpusubtype: 1" for 601, 2 for 602, 3 or 4 or 5 for 603, 6 or 7 for 604, 8 for 620, 9 for 750, 10 for 7400, 11 for 7450, 100 for 970.
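
The list above is easy to turn into a lookup. Here is a minimal sketch; the function name ppc_subtype_name is made up for illustration, and the table contains only the values listed above, so anything else comes back as NULL.

```c
#include <stddef.h>

/* Hypothetical helper: map an "hw.cpusubtype" value (as reported when
   hw.cputype is 18) to the chip name from the list above.
   Returns NULL for a value that is not in that list. */
static const char *ppc_subtype_name(int subtype)
{
  switch (subtype) {
    case 1: return "601";
    case 2: return "602";
    case 3: case 4: case 5: return "603";
    case 6: case 7: return "604";
    case 8: return "620";
    case 9: return "750";
    case 10: return "7400";
    case 11: return "7450";
    case 100: return "970";
    default: return NULL;
  }
}
```

The subtype number itself still has to come from sysctl; this sketch only covers the translation step.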

Speed

Motorola's processor page includes a 7410 user's manual (PDF) and a 7450 user's manual (PDF).

A dual G4-533, for example, is a dual 7410-533. A G4-733 is a 7450-733.

The 7410 can perform 3 double-precision floating-point operations every 4 cycles, with a latency of 3 cycles. After floating-point operations occur on 3 successive cycles, a floating-point operation on the next cycle will stall. An operation is one of the following:

     r2 = r0 + r1; /* gcc notation: fadd r2,r0,r1 */
     r2 = r0 - r1; /* fsub r2,r0,r1 */
     r2 = r0 * r1; /* fmul r2,r0,r1 */
     r3 = r0 * r1 + r2; /* fmadd r3,r0,r1,r2 */
     r3 = r0 * r1 - r2; /* fmsub r3,r0,r1,r2 */
     r3 = -(r0 * r1 + r2); /* fnmadd r3,r0,r1,r2 */
     r3 = -(r0 * r1 - r2); /* fnmsub r3,r0,r1,r2 */
The 7410 can also load or store one double-precision floating-point register every cycle, with a latency of 2 cycles.

Although the 7410 is nominally an out-of-order chip, it performs all floating-point operations in order, with an extremely short instruction queue. So it's essential to schedule straight-line floating-point code as if this were an in-order chip. Good floating-point code for the 7410 is similar to good floating-point code for the UltraSPARC.

(The UltraSPARC has the advantage of being able to perform an independent multiplication and addition every cycle. The 7410 has the slight advantage that a multiply-add has latency 3/3/3 from the three inputs instead of 6/6/3. Both chips appear to do a reasonably good job with out-of-order stores.)

The 7450 can perform 4 double-precision floating-point operations every 5 cycles, with a latency of 5 cycles. The 7450 can also load or store one double-precision floating-point register every cycle, with a latency of 4 cycles. (The latency for integer loads is 3 cycles.)
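
The throughput figures can be turned into a small cycle model. The sketch below assumes that the floating-point unit accepts a burst of operations on successive cycles and then idles for one cycle. That is how the 7410 stall is described above; for the 7450 it is only my assumption, chosen to match "4 every 5 cycles". The function names are hypothetical.

```c
/* Cycle (counting from 1) on which the n-th of a stream of independent
   floating-point operations issues, ASSUMING the unit accepts `burst`
   operations on successive cycles and then idles for one cycle.
   burst = 3 models the 7410; burst = 4 is an assumed model of the 7450.
   Returns 0 for n = 0 or burst = 0. */
static unsigned long fp_issue_cycle(unsigned long n, unsigned long burst)
{
  if (n == 0 || burst == 0) return 0;
  return n + (n - 1) / burst;   /* one idle cycle after each full burst */
}

/* First cycle on which a dependent operation can use the n-th result. */
static unsigned long fp_result_cycle(unsigned long n, unsigned long burst,
                                     unsigned long latency)
{
  return fp_issue_cycle(n, burst) + latency;
}
```

Under this model, 12 independent operations issue over 15 cycles on the 7410, and 8 issue over 9 cycles on the 7450. The long-run rates are 3/4 and 4/5 of the clock: about 400 million operations per second for a 7410-533, and about 586 million for a 7450-733.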

I'm now scheduling PowerPC floating-point code with at most 3 floating-point operations in a row, with an operation latency of 5 cycles, and with a load latency of 4 cycles. The resulting code should run well on both the 7410 and the 7450.
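
Those three rules are mechanical enough to check by program. The following is a rough sketch of such a checker, not the tool I actually use. It assumes a simplified machine that issues exactly one instruction per cycle, in order, which is cruder than either chip. The type and function names are hypothetical.

```c
#include <stddef.h>

enum kind { K_LOAD, K_FPOP, K_OTHER };  /* K_OTHER: stores, integer work, ... */

struct insn {
  enum kind kind;
  int dst;      /* floating-point register written, or -1 for none */
  int src[3];   /* floating-point registers read, -1 for unused slots */
};

enum { MAX_RUN = 3, OP_LATENCY = 5, LOAD_LATENCY = 4, NREGS = 32 };

/* Returns the index of the first instruction that breaks a rule, or -1 if
   the schedule is acceptable. ASSUMED model: instruction i issues on
   cycle i, one instruction per cycle. Register numbers must be -1 or
   within 0..NREGS-1. */
static long check_schedule(const struct insn *code, size_t n)
{
  long ready[NREGS];   /* first cycle on which each register may be read */
  size_t i;
  int j;
  int run = 0;         /* length of the current run of FP operations */

  for (j = 0; j < NREGS; ++j) ready[j] = 0;

  for (i = 0; i < n; ++i) {
    const struct insn *in = &code[i];

    /* Rule 1: at most MAX_RUN floating-point operations in a row. */
    if (in->kind == K_FPOP) {
      if (++run > MAX_RUN) return (long) i;
    } else {
      run = 0;
    }

    /* Rules 2 and 3: every source must be old enough to read. */
    for (j = 0; j < 3; ++j)
      if (in->src[j] >= 0 && ready[in->src[j]] > (long) i) return (long) i;

    /* Record when this instruction's result becomes readable. */
    if (in->dst >= 0)
      ready[in->dst] = (long) i
        + (in->kind == K_LOAD ? LOAD_LATENCY : OP_LATENCY);
  }
  return -1;
}
```

A load followed immediately by an operation on the loaded register is rejected at the operation. So is the fourth of four back-to-back floating-point operations, even when they are independent.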

Writing high-performance C code

The C tools bundled with MacOS X (gcc 2.95.2, Darwin libraries) are designed to annoy floating-point programmers.

Misalignment of double-precision variables. MacOS X uses 4-byte alignment, not 8-byte alignment, for double. Aargh! This is true even for static variables, where there's no excuse for using anything smaller than natural alignment.

Fortunately, MacOS X uses 8-byte alignment for a struct containing a double, and the MacOS X assembler and linker preserve the alignment. So you can replace

     static double x;
     static double y[2];
with
     static struct {
       double x;
       double y[2];
     } sometimesgccreallyannoysme;
     #define x sometimesgccreallyannoysme.x
     #define y sometimesgccreallyannoysme.y
to align variables properly. Add long long pad; to the end of the struct and you'll even get correct alignment under AIX.
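
It's worth checking at run time that the trick worked on a particular system. Here is a minimal sketch: the struct name fpvars and the helper misalignment are hypothetical, and the code assumes a compiler that provides uintptr_t in stdint.h.

```c
#include <stdint.h>

/* Hypothetical grouping struct, following the pattern shown above. */
static struct {
  double x;
  double y[2];
  long long pad;   /* the extra member suggested above for AIX */
} fpvars;

/* Distance of p from the previous 8-byte boundary.
   0 means p is naturally aligned for a double. */
static unsigned misalignment(const void *p)
{
  return (unsigned) ((uintptr_t) p % 8);
}
```

If misalignment(&fpvars.x) or misalignment(&fpvars.y[1]) is ever nonzero, the compiler or linker has not honored the 8-byte alignment.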

Painfully slow function calls. Apparently the MacOS X calling convention specifies a huge number of callee-save floating-point registers: 18 of them, f14 through f31. A function using all 32 floating-point registers for 200 cycles will waste another 40 cycles calling saveFP and restFP. Aargh!

The compiler should automatically use caller-save for inline-able functions, with interprocedural analysis to see which registers actually have to be saved.

Deferred multiplications. On most architectures, gcc's -O1 optimization level is the do-what-I-told-you-to-do level. On the PowerPC, gcc -O1 moves floating-point multiplications as late in the code as possible, because it's searching for opportunities to use instructions such as fmadd.

Fortunately, gcc's inline asm is powerful enough to define a multiply-these-variables-now macro.
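
Such a macro might look roughly like the sketch below. This is a guess at the general shape, written as an inline function with the hypothetical name mul_now, and not necessarily what I use. The idea is that gcc treats the asm body as opaque, so it cannot fold the multiplication into an fmadd. On anything other than a PowerPC with gcc-style asm, the sketch falls back to an ordinary multiplication with no such guarantee.

```c
/* Hypothetical helper: compute x*y with an explicit fmul instruction.
   The "f" constraints ask gcc for floating-point registers. */
static inline double mul_now(double x, double y)
{
  double z;
#if defined(__GNUC__) && \
    (defined(__ppc__) || defined(__powerpc__) || defined(__PPC__))
  /* Opaque to the optimizer, so it cannot be merged into an fmadd. */
  __asm__ volatile("fmul %0,%1,%2" : "=f"(z) : "f"(x), "f"(y));
#else
  /* Fallback so the sketch compiles elsewhere; the compiler is free to
     reschedule or fuse this multiplication. */
  z = x * y;
#endif
  return z;
}
```

Typical use is z = mul_now(a, b); at the point in the code where the multiplication is supposed to start.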

Cycle counters

Most PowerPC chips have a ``time base'' running at a fraction of the processor cycle speed. The mftb instruction copies the bottom 32 bits of the time base into a register. On my dual 7410-533, for example, the time base increases by 1 every 16 cycles. On an IBM RS64 III-668, the time base increases by 1 every cycle.
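
Reading the time base and converting to cycles can be sketched as follows. The function names are hypothetical, and the cycles-per-tick ratio has to be supplied by the caller (16 on the 7410-533 above, 1 on the RS64 III). The non-PowerPC branch returns 0 and exists only so that the sketch compiles elsewhere.

```c
#include <stdint.h>

/* Low 32 bits of the time base. Meaningful only on a PowerPC with
   gcc-style asm; elsewhere this placeholder returns 0. */
static uint32_t read_timebase(void)
{
#if defined(__GNUC__) && \
    (defined(__ppc__) || defined(__powerpc__) || defined(__PPC__))
  unsigned long t;   /* mftb writes a whole general-purpose register */
  __asm__ volatile("mftb %0" : "=r"(t));
  return (uint32_t) t;
#else
  return 0;
#endif
}

/* Processor cycles between two readings. The 32-bit unsigned subtraction
   gives the right tick count even if the counter wrapped around once. */
static uint64_t timebase_cycles(uint32_t start, uint32_t end,
                                uint32_t cycles_per_tick)
{
  return (uint64_t) (uint32_t) (end - start) * cycles_per_tick;
}
```

With 16 cycles per tick, the result is only accurate to within 16 cycles, so short code sequences have to be timed in a loop.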

The PowerPC 604 and above can use PMC1 as a cycle counter. Something like

     asm("li %0,64;mtspr MMCR0,%0;mfspr %0,PMC1" : "=r"(t))
will tell PMC1 to count cycles and will put the current value of PMC1 into t. Unfortunately, this instruction sequence is usable only by the kernel.

CPU identification

The PowerPC has an mfpvr instruction that puts the CPU version into a register. Unfortunately, this instruction is usable only by the kernel.
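
Code that does get hold of the register value, for example inside the kernel, still has to take it apart. In the PowerPC architecture the upper 16 bits of the processor version register identify the processor version and the lower 16 bits give the revision. A minimal sketch, with hypothetical function names:

```c
#include <stdint.h>

/* Upper 16 bits of the processor version register: processor version. */
static unsigned pvr_version(uint32_t pvr)
{
  return (unsigned) (pvr >> 16);
}

/* Lower 16 bits of the processor version register: revision level. */
static unsigned pvr_revision(uint32_t pvr)
{
  return (unsigned) (pvr & 0xffff);
}
```

For an illustrative input of 0x800C1101, pvr_version gives 0x800C and pvr_revision gives 0x1101.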