Computer hardware

A dual G4-533, for example, is a dual 7410-533. A G4-733 is a 7450-733.

The 7410 can perform 3 double-precision floating-point operations every 4 cycles, with a latency of 3 cycles. After floating-point operations occur on 3 successive cycles, a floating-point operation on the next cycle will stall. An operation is one of the following:

    r2 = r0 + r1;          /* gcc notation: fadd r2,r0,r1 */
    r2 = r0 - r1;          /* fsub r2,r0,r1 */
    r2 = r0 * r1;          /* fmul r2,r0,r1 */
    r3 = r0 * r1 + r2;     /* fmadd r3,r0,r1,r2 */
    r3 = r0 * r1 - r2;     /* fmsub r3,r0,r1,r2 */
    r3 = -(r0 * r1 + r2);  /* fnmadd r3,r0,r1,r2 */
    r3 = -(r0 * r1 - r2);  /* fnmsub r3,r0,r1,r2 */

The 7410 can also load or store one double-precision floating-point register every cycle, with a latency of 2 cycles.

Although the 7410 is nominally an out-of-order chip, it performs all floating-point operations in order, with an extremely short instruction queue. So it's essential to schedule straight-line floating-point code as if this were an in-order chip. Good floating-point code for the 7410 is similar to good floating-point code for the UltraSPARC.

(The UltraSPARC has the advantage of being able to perform an independent multiplication and addition every cycle. The 7410 has the slight advantage that a multiply-add has latency 3/3/3 from the three inputs instead of 6/6/3. Both chips appear to do a reasonably good job with out-of-order stores.)

The 7450 can perform 4 double-precision floating-point operations every 5 cycles, with a latency of 5 cycles. The 7450 can also load or store one double-precision floating-point register every cycle, with a latency of 4 cycles. (The latency for integer loads is 3 cycles.)

I'm now scheduling PowerPC floating-point code with at most 3 floating-point operations in a row, with an operation latency of 5 cycles, and with a load latency of 4 cycles. The resulting code should run well on both the 7410 and the 7450.
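As a rough illustration of what this scheduling rule asks of the source code, here is a minimal C sketch of a dot product split into four independent accumulator chains. The function name `dot4` and the choice of four chains are assumptions made for this example, not something taken from the notes above. The idea is only that the next operation on any one accumulator is several floating-point operations away, which gives the scheduler room to cover a 5-cycle latency. The compiler still decides the final instruction order, and splitting the sum changes the rounding compared with a single accumulator.

```c
#include <stddef.h>

/* Hypothetical helper: dot product using four independent accumulators,
   so that consecutive operations do not depend on each other. */
double dot4(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four independent multiply-add chains per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }

    /* Tail: leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        s0 += x[i] * y[i];

    /* Combine the partial sums once, at the end. */
    return (s0 + s1) + (s2 + s3);
}
```

Whether four chains are enough in practice depends on how the loads and loop overhead are scheduled around them; treat the count as a starting point to measure, not a tuned value.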

**Misalignment of double-precision variables.**
MacOS X uses 4-byte alignment, not 8-byte alignment, for double.
Aargh!
This is true even for static variables,
where there's no excuse for using anything smaller than natural alignment.
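One way to see what alignment a particular compiler and linker actually produce is to print the addresses of a few static doubles modulo 8. The following is a minimal sketch. The helper names `is_aligned8` and `report_alignment` are made up for this example, and it assumes a compiler that provides `uintptr_t` in `<stdint.h>`.

```c
#include <stdint.h>
#include <stdio.h>

static double x;
static double y[2];

/* Hypothetical helper: 1 if the address is a multiple of 8, else 0. */
int is_aligned8(const void *p)
{
    return (uintptr_t)p % 8 == 0;
}

/* Print each variable's address modulo 8 and return how many of the
   two static variables are not 8-byte aligned. */
int report_alignment(void)
{
    int misaligned = 0;

    printf("&x    mod 8 = %u\n", (unsigned)((uintptr_t)&x % 8));
    printf("&y[0] mod 8 = %u\n", (unsigned)((uintptr_t)&y[0] % 8));

    if (!is_aligned8(&x))
        misaligned++;
    if (!is_aligned8(&y[0]))
        misaligned++;
    return misaligned;
}
```

Calling `report_alignment()` from `main` prints the two remainders. A result of 4 for either variable means it straddles the natural 8-byte boundary.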

Fortunately, MacOS X uses 8-byte alignment for a struct containing a double, and the MacOS X assembler and linker preserve the alignment. So you can replace

    static double x;
    static double y[2];

with

    static struct { double x; double y[2]; } sometimesgccreallyannoysme;
    #define x sometimesgccreallyannoysme.x
    #define y sometimesgccreallyannoysme.y

to align variables properly. Add

**Painfully slow function calls.**
Apparently the MacOS X calling convention
specifies a huge number of callee-save floating-point registers,
f14 through f31.
A function using all 32 floating-point registers for 200 cycles
will waste another 40 cycles calling `saveFP` and `restFP`.
Aargh!

The compiler should automatically use caller-save for inline-able functions, with interprocedural analysis to see which registers actually have to be saved.

**Deferred multiplications.**
On most architectures,
gcc's -O1 optimization level is the do-what-I-told-you-to-do level.
On the PowerPC,
gcc -O1 moves floating-point multiplications as late in the code as possible,
because it's searching for opportunities to use
instructions such as `fmadd`.

Fortunately, gcc's inline asm is powerful enough to define a multiply-these-variables-now macro.
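The notes do not show the macro itself, so the following is only a sketch of what such a macro might look like. The name `MULNOW` is hypothetical. On PowerPC it wraps `fmul` in gcc extended asm with the `f` (floating-point register) constraint, so gcc sees an opaque instruction and cannot fold the multiplication into a later `fmadd`. On other targets the sketch falls back to an ordinary C multiplication so that it still compiles.

```c
/* MULNOW is a hypothetical name; the author's actual macro is not shown. */
#if defined(__GNUC__) && (defined(__ppc__) || defined(__powerpc__))
/* Opaque fmul: gcc cannot merge this with a later addition. */
#define MULNOW(out, a, b) \
    __asm__("fmul %0,%1,%2" : "=f"(out) : "f"(a), "f"(b))
#else
/* Fallback for other targets: a plain C multiplication. */
#define MULNOW(out, a, b) ((out) = (a) * (b))
#endif

/* Example use: the product is computed by MULNOW, then added to c. */
double mul_then_add(double a, double b, double c)
{
    double p;

    MULNOW(p, a, b);
    return p + c;
}
```

All three arguments are assumed to be `double` variables. Because the asm is not marked `volatile`, gcc may still reorder it relative to unrelated instructions or drop it when the result is unused; the sketch only stops the multiplication from being combined into a fused multiply-add.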

The PowerPC 604 and above can use PMC1 as a cycle counter. Something like

    asm("li %0,64;mtspr MMCR0,%0;mfspr %0,PMC1" : "=r"(t))

will tell PMC1 to count cycles and will put the current value of PMC1 into