A dual G4-533, for example, is a dual 7410-533. A G4-733 is a 7450-733.
The 7410 can perform 3 double-precision floating-point operations every 4 cycles, with a latency of 3 cycles. After floating-point operations occur on 3 successive cycles, a floating-point operation on the next cycle will stall. An operation is one of the following:
    r2 = r0 + r1;         /* gcc notation: fadd r2,r0,r1 */
    r2 = r0 - r1;         /* fsub r2,r0,r1 */
    r2 = r0 * r1;         /* fmul r2,r0,r1 */
    r3 = r0 * r1 + r2;    /* fmadd r3,r0,r1,r2 */
    r3 = r0 * r1 - r2;    /* fmsub r3,r0,r1,r2 */
    r3 = -(r0 * r1 + r2); /* fnmadd r3,r0,r1,r2 */
    r3 = -(r0 * r1 - r2); /* fnmsub r3,r0,r1,r2 */

The 7410 can also load or store one double-precision floating-point register every cycle, with a latency of 2 cycles.
Although the 7410 is nominally an out-of-order chip, it performs all floating-point operations in order, with an extremely short instruction queue. So it's essential to schedule straight-line floating-point code as if this were an in-order chip. Good floating-point code for the 7410 is similar to good floating-point code for the UltraSPARC.
(The UltraSPARC has the advantage of being able to perform an independent multiplication and addition every cycle. The 7410 has the slight advantage that a multiply-add has latency 3/3/3 from the three inputs instead of 6/6/3. Both chips appear to do a reasonably good job with out-of-order stores.)
The 7450 can perform 4 double-precision floating-point operations every 5 cycles, with a latency of 5 cycles. The 7450 can also load or store one double-precision floating-point register every cycle, with a latency of 4 cycles. (The latency for integer loads is 3 cycles.)
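To make the throughput figures concrete, here is a minimal sketch that counts the cycles needed to issue a run of back-to-back floating-point operations. The function name issue_cycles and its parameters are hypothetical. The model for the 7410 (a one-cycle stall after 3 operations on successive cycles) follows the description above; applying the same shape to the 7450 (a one-cycle stall after 4 operations) is an assumption based only on the 4-in-5 figure.

```c
/* Cycles needed to issue n back-to-back floating-point operations on a
   unit that accepts at most `burst` operations on successive cycles and
   then stalls for one cycle.
   burst = 3 models the 7410 as described in the text.
   burst = 4 is an ASSUMED model of the 7450 (4 operations per 5 cycles). */
static long issue_cycles(long n, long burst)
{
  if (n <= 0 || burst <= 0) return 0;
  /* One stall cycle is inserted before every operation whose index is a
     nonzero multiple of burst, i.e. (n - 1) / burst stalls in total. */
  return n + (n - 1) / burst;
}
```

Under this model, 12 operations in a row take 15 cycles to issue on the 7410, while 12 operations grouped as 3, 3, 3, 3 with a load or store between groups lose nothing to stalls.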
I'm now scheduling PowerPC floating-point code with at most 3 floating-point operations in a row, with an operation latency of 5 cycles, and with a load latency of 4 cycles. The resulting code should run well on both the 7410 and the 7450.
Misalignment of double-precision variables. MacOS X uses 4-byte alignment, not 8-byte alignment, for double. Aargh! This is true even for static variables, where there's no excuse for using anything smaller than natural alignment.
Fortunately, MacOS X uses 8-byte alignment for a struct containing a double, and the MacOS X assembler and linker preserve the alignment. So you can replace
    static double x;
    static double y[2];

with

    static struct { double x; double y[2]; } sometimesgccreallyannoysme;
    #define x sometimesgccreallyannoysme.x
    #define y sometimesgccreallyannoysme.y

to align variables properly. Add long long pad; to the end of the struct and you'll even get correct alignment under AIX.
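One way to check what a particular compiler and linker actually did is to test the addresses at run time. This is a minimal sketch; the helper name is_aligned and the variable names are hypothetical, and the result for the two variables depends on the platform, which is the point of checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the address p is a multiple of n bytes, 0 otherwise. */
static int is_aligned(const void *p, size_t n)
{
  return (uintptr_t) p % n == 0;
}

/* A bare static double and a double wrapped in a struct, for comparison.
   Whether each one lands on an 8-byte boundary is platform-dependent. */
static double bare_double;
static struct { double x; double y[2]; } wrapped_doubles;

/* Returns a 2-bit summary: bit 0 set if bare_double is 8-byte aligned,
   bit 1 set if wrapped_doubles.x is 8-byte aligned. */
static int alignment_report(void)
{
  return is_aligned(&bare_double, 8) | (is_aligned(&wrapped_doubles.x, 8) << 1);
}
```

Printing alignment_report() from a test program shows whether the struct workaround is needed on a given system.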
Painfully slow function calls. Apparently the MacOS X calling convention specifies a huge number of callee-save floating-point registers, f14 through f31. A function using all 32 floating-point registers for 200 cycles will waste another 40 cycles calling saveFP and restFP. Aargh!
The compiler should automatically use caller-save for inline-able functions, with interprocedural analysis to see which registers actually have to be saved.
Deferred multiplications. On most architectures, gcc's -O1 optimization level is the do-what-I-told-you-to-do level. On the PowerPC, gcc -O1 moves floating-point multiplications as late in the code as possible, because it's searching for opportunities to use instructions such as fmadd.
Fortunately, gcc's inline asm is powerful enough to define a multiply-these-variables-now macro.
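Such a macro might look like the sketch below. The name MUL_NOW is hypothetical, and this is not necessarily the macro used for the code described here. It assumes gcc's "f" constraint (a floating-point register on PowerPC) and the standard fmul instruction. Because the multiplication is hidden inside the asm, gcc cannot fold it into a later fmadd. The volatile qualifier is an extra precaution, not a guarantee about exact placement. On non-PowerPC targets the macro falls back to an ordinary multiplication so that the same source still compiles.

```c
/* MUL_NOW(out, a, b): compute out = a * b with an explicit fmul.
   Hypothetical name; a sketch under the assumptions stated above. */
#if defined(__powerpc__) || defined(__ppc__) || defined(__PPC__)
#define MUL_NOW(out, a, b) \
  __asm__ volatile ("fmul %0,%1,%2" : "=f" (out) : "f" (a), "f" (b))
#else
/* Fallback for other architectures: plain C multiplication. */
#define MUL_NOW(out, a, b) ((out) = (a) * (b))
#endif

/* Example use: a dot product of two 2-element vectors where both
   multiplications are issued explicitly before the addition. */
static double dot2(const double *u, const double *v)
{
  double p0, p1;
  MUL_NOW(p0, u[0], v[0]);
  MUL_NOW(p1, u[1], v[1]);
  return p0 + p1;
}
```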
The PowerPC 604 and above can use PMC1 as a cycle counter. Something like
    asm("li %0,64;mtspr MMCR0,%0;mfspr %0,PMC1" : "=r"(t))

will tell PMC1 to count cycles and will put the current value of PMC1 into t. Unfortunately, this instruction sequence is usable only by the kernel.