Does a free swap turn a stack into a traditional register set?
D. J. Bernstein
2005.02.10

The x86 architecture supports a stack of eight 80-bit floating-point
values. There are x86 machine instructions to push a new value on top of
the stack, add any value in the stack to the top of the stack, add the
top of the stack to any value in the stack, etc.

What do you do if you need to add the third-to-top element on the stack
to the seventh-to-top element on the stack? You swap the third element
with the top, and then add it to the seventh; or swap the seventh
element with the top, and add the third to it. There's an x86 machine
instruction, FXCH, that swaps any stack position with the top.
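
To make the swap-then-add pattern concrete, here is a toy model of the
stack in Python. This is only a sketch: the class FloatStack and its
method names are my own inventions for illustration, position 0 stands
for the top of the stack, and pushes, pops, and 80-bit arithmetic are
not modelled.

```python
class FloatStack:
    """Toy model of an 8-entry register stack; index 0 is the top."""

    def __init__(self, values):
        assert len(values) <= 8
        self.slots = list(values)

    def fxch(self, i):
        # Swap stack position i with the top (position 0).
        self.slots[0], self.slots[i] = self.slots[i], self.slots[0]

    def add_to_top(self, i):
        # top = top + slot i
        self.slots[0] += self.slots[i]

    def add_top_to(self, i):
        # slot i = slot i + top
        self.slots[i] += self.slots[0]


st = FloatStack([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
# Add the third-from-top (index 2) to the seventh-from-top (index 6).
st.fxch(2)        # the third element is now on top
st.add_top_to(6)  # seventh += top
st.fxch(2)        # optional: put the third element back where it was
print(st.slots)
```

The last swap is there only to show that the layout can be restored;
the addition itself needs just one swap.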

This is a familiar programming model from higher-level stack-based
languages: dc, Forth, the HP RPN calculator language, etc. As a set of
machine instructions, however, it has come under criticism, for two
different reasons.

The first criticism is that swapping is slow. Some computations are
happy to work almost exclusively with the top of the stack, but others
aren't. Compressing top-of-stack operations might produce a speedup by
reducing code size, but this speedup is outweighed by the slowdown from
constant swaps.

Intel responded by allowing a free swap between (essentially) any two
Pentium floating-point operations. ``The fxch instruction can be
executed in parallel with the commonly used FP instructions, which lets
the code generator or programmer treat the floating-point stack as a
regular register set without any performance degradation,'' Intel wrote
many years ago.

The second criticism is that, even with a free swap before each
operation, a stack still isn't as good as a traditional register set:
sometimes a stack requires a bunch of extra swaps.

This criticism usually comes from compiler writers who are trying to
generate x86 code. They use a traditional register allocator, as they
learned in school, for a simulated machine with 8 registers that don't
move; then they try to convert the resulting code into x86 code. The
simplest conversion strategy, namely inserting a swap before any
operation that uses two non-top values, is fine for a sequence of
operations with no labels and no jumps (a ``basic block''), but a
backwards jump may require additional swaps to reshuffle the stack.
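
Here is a rough Python sketch of that simplest conversion strategy. The
function name convert_block, the register names r0..r7, and the
two-operand form (dst, src) meaning dst += src are assumptions made for
this illustration; the rule "swap the destination to the top" is one
arbitrary way to pick the swap.

```python
def convert_block(ops, layout):
    """Naive conversion of register code to stack code.

    layout[p] is the register currently held at stack position p
    (position 0 is the top). Each op (dst, src) means dst += src.
    """
    layout = list(layout)
    out = []
    for dst, src in ops:
        d, s = layout.index(dst), layout.index(src)
        if d != 0 and s != 0:
            # Neither operand is on top: swap the destination to the top.
            layout[0], layout[d] = layout[d], layout[0]
            out.append(f"swap {dst} to top")
        out.append(f"{dst} += {src}")
    return out, layout


start = [f"r{i}" for i in range(8)]
code, end = convert_block([("r3", "r5"), ("r6", "r2")], start)
print(code)
print(end)
```

Each operation gets at most one swap, so the basic block itself costs
nothing extra. The final layout differs from the starting layout,
though, so a jump back to the top of this block would need reshuffling.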

One response is that there are obvious ways to improve the conversion
strategy. There's often a choice of swaps before an operation; by
combining these choices one can usually eliminate the end-of-basic-block
reshuffling.
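
As a small illustration of combining choices, the following Python
sketch tries every combination of "swap the destination" and "swap the
source" and keeps the combinations that leave the stack in its starting
order. Exhaustive search is my own stand-in here, not a recommended
algorithm, and the names run_choices and loop_safe_choices and the
three-register example are hypothetical.

```python
from itertools import product


def run_choices(ops, start, choices):
    """Apply ops (dst, src), meaning dst += src, to a stack layout.

    layout[0] is the top. When neither operand is on top, the matching
    entry of choices ("dst" or "src") says which operand to swap up.
    """
    layout = list(start)
    swaps = 0
    for (dst, src), pick in zip(ops, choices):
        if layout[0] not in (dst, src):
            name = dst if pick == "dst" else src
            p = layout.index(name)
            layout[0], layout[p] = layout[p], layout[0]
            swaps += 1
    return layout, swaps


def loop_safe_choices(ops, start):
    # Keep the choice sequences whose final layout equals the starting
    # layout, so that a backwards jump needs no reshuffling.
    good = []
    for choices in product(("dst", "src"), repeat=len(ops)):
        layout, _ = run_choices(ops, start, choices)
        if layout == list(start):
            good.append(choices)
    return good


start = ["r0", "r1", "r2"]
ops = [("r1", "r2"), ("r2", "r0")]
always_dst, _ = run_choices(ops, start, ("dst", "dst"))
good = loop_safe_choices(ops, start)
print(always_dst, good)
```

In this tiny example, always swapping the destination ends with the
stack out of order, while swapping the destination first and the source
second brings it back to where it started.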

I've recently realized that there's another response. Everyone keeps
talking about a one-free-swap stack as worse than a traditional register
set, the only question being how much worse. But maybe a one-free-swap
stack is _better_ than a traditional register set!

Here's an example. Try applying a traditional 8-register allocator to
the following loop:

   ...
   do {
     F += B;
     G += B;
     H += B;
     X = A + B;
     A = B + C;
     B = C + X;
     C += A;
     D += B;
     E += B;
   } while (...);
   ...

Let's start from the moment after the X assignment. The values assigned
to B, C, D, E, F, G, H, X are all ``live''---they will be used later---
so they need separate registers. Let's say they're in registers 0, 1, 2,
3, 4, 5, 6, 7.

The assignment to A then ``kills'' B, but every other variable is still
live, so we don't have any choice but to put A into register 0.

Next, the assignment to B kills X, but every other variable is still
live, so we don't have any choice but to put B into register 7.

Similarly, the assignments to C, D, E, F, G, H are forced to leave C, D,
E, F, G, H in registers 1, 2, 3, 4, 5, 6.

Finally, the assignment to X kills A, but every other variable is still
live, so we don't have any choice but to put X into register 0.

Now we've reached our starting point again---with the registers in the
wrong order. The traditional register allocator has to insert extra
code: a swap of B and X (typically via memory), or an unrolled loop. In
the traditional language introduced by Chaitin, the ``conflict graph''
of these variables is the complete graph on 9 vertices, so there's no
way to ``8-color'' it, even though only 8 values are live at any moment.
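
The liveness and conflict-graph claims can be checked mechanically. The
Python sketch below encodes the loop body, computes live sets around
the loop, and builds the conflict graph by the usual rule that a newly
assigned value conflicts with everything live just after the
assignment. It assumes that only the backwards jump matters (the loop
exit is ignored); the function names are mine.

```python
from itertools import combinations

# Loop body as (assigned variable, variables read), in program order.
BODY = [
    ("F", "FB"), ("G", "GB"), ("H", "HB"),
    ("X", "AB"), ("A", "BC"), ("B", "CX"),
    ("C", "CA"), ("D", "DB"), ("E", "EB"),
]


def live_out_sets(body):
    """Backward liveness around the loop, iterated to a fixed point.

    Assumption: only the back edge is modelled, not the loop exit.
    """
    n = len(body)
    live_in = [set() for _ in range(n)]
    changed = True
    while changed:
        changed = False
        for i in reversed(range(n)):
            d, uses = body[i]
            out = live_in[(i + 1) % n]
            new_in = set(uses) | (out - {d})
            if new_in != live_in[i]:
                live_in[i], changed = new_in, True
    return [live_in[(i + 1) % n] for i in range(n)]


def conflict_edges(body, live_out):
    # A newly assigned value conflicts with every other value that is
    # live just after the assignment.
    edges = set()
    for (d, _), out in zip(body, live_out):
        edges |= {frozenset((d, v)) for v in out if v != d}
    return edges


live_out = live_out_sets(BODY)
edges = conflict_edges(BODY, live_out)
variables = sorted({d for d, _ in BODY})
max_live = max(len(s) for s in live_out)
is_complete = edges == {frozenset(p) for p in combinations(variables, 2)}
print(max_live, len(variables), len(edges), is_complete)
```

The sketch finds at most 8 live values at any point, but 36 conflict
edges among the 9 variables, which is every possible pair.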

With a one-free-swap stack, on the other hand, it's easy to handle the
same loop without any extra shuffling:

   Stack: top -> B C D E F G H A <- bottom.    F += B
   Stack: top -> B C D E F G H A <- bottom.    G += B
   Stack: top -> B C D E F G H A <- bottom.    H += B
   Stack: top -> B C D E F G H A <- bottom.    swap A to top
   Stack: top -> A C D E F G H B <- bottom.    X = A + B
   Stack: top -> X C D E F G H B <- bottom.    swap B to top
   Stack: top -> B C D E F G H X <- bottom.    A = B + C
   Stack: top -> A C D E F G H X <- bottom.    swap X to top
   Stack: top -> X C D E F G H A <- bottom.    B = C + X
   Stack: top -> B C D E F G H A <- bottom.    swap C to top
   Stack: top -> C B D E F G H A <- bottom.    C += A
   Stack: top -> C B D E F G H A <- bottom.    swap B to top
   Stack: top -> B C D E F G H A <- bottom.    D += B
   Stack: top -> B C D E F G H A <- bottom.    E += B
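
The listing above can be replayed in a few lines of Python to check two
things: that the stack ends each iteration in the order it started in,
and that the values agree with a direct evaluation of the loop body.
This is a sketch with made-up helper names (swap, add_top_to, top_plus)
and integer inputs; it relies on addition being commutative, as the
listing does.

```python
def run_schedule(values):
    """Replay one loop iteration on a simulated stack.

    values maps the names A..H to numbers; X is a temporary.
    Returns (final stack order, final values by name).
    """
    names = list("BCDEFGHA")            # names[0] is the top of the stack
    slots = [values[n] for n in names]

    def swap(name):                     # swap the slot holding name with the top
        i = names.index(name)
        names[0], names[i] = names[i], names[0]
        slots[0], slots[i] = slots[i], slots[0]

    def add_top_to(name):               # name += top
        slots[names.index(name)] += slots[0]

    def top_plus(name, result):         # top = top + name; the top becomes result
        slots[0] += slots[names.index(name)]
        names[0] = result

    for n in "FGH":
        add_top_to(n)                   # F += B, G += B, H += B
    swap("A")
    top_plus("B", "X")                  # X = A + B
    swap("B")
    top_plus("C", "A")                  # A = B + C
    swap("X")
    top_plus("C", "B")                  # B = C + X
    swap("C")
    top_plus("A", "C")                  # C += A
    swap("B")
    add_top_to("D")                     # D += B
    add_top_to("E")                     # E += B
    return names, dict(zip(names, slots))


def reference(v):
    # Direct evaluation of the loop body, for comparison.
    A, B, C, D, E, F, G, H = (v[n] for n in "ABCDEFGH")
    F += B
    G += B
    H += B
    X = A + B
    A = B + C
    B = C + X
    C += A
    D += B
    E += B
    return dict(A=A, B=B, C=C, D=D, E=E, F=F, G=G, H=H)


start = {n: i + 1 for i, n in enumerate("ABCDEFGH")}
layout, out = run_schedule(start)
print(layout, out == reference(start))
```

No two swaps are adjacent, so each swap can be paired with the
operation that follows it.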

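So, in this example, the one-free-swap stack is actually _faster_ than
the traditional register set: every swap is a free swap attached to the
operation that follows it, and the stack ends each iteration in the same
order it started in, so no extra code is needed.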

Everyone working on traditional register allocation knows that it's
sometimes impossible to fit code into 8 traditional registers, even
though the code has at most 8 live values. Nobody seems to have realized
that some of the same examples _can_ fit into an 8-element one-free-swap
stack. Of course, you can't see this if you make the mistake of starting
with traditional 8-register allocation; you'll throw away all of these
examples before you generate stack code!

Yes, there are cases where a one-free-swap stack is worse than a
traditional register set. One of the worst cases is a jump that kills
several values simultaneously, forcing several stack pops. But there are
also cases where a one-free-swap stack is better than a traditional
register set.

I suggest that compiler writers stop viewing the x86 stack as a twisted
version of a traditional 8-register set. Any code that uses at most 8
live values should be translated directly into stack code. Assigning
traditional registers to the values is a bad idea.