Does a free swap turn a stack into a traditional register set?
D. J. Bernstein
2005.02.10

The x86 architecture supports a stack of eight 80-bit floating-point values. There are x86 machine instructions to push a new value on top of the stack, add any value in the stack to the top of the stack, add the top of the stack to any value in the stack, etc. What do you do if you need to add the third-to-top element on the stack to the seventh-to-top element on the stack? You swap the third element with the top, and then add it to the seventh; or swap the seventh element with the top, and add the third to it. There's an x86 machine instruction, FXCH, that swaps any stack position with the top.

This is a familiar programming model from higher-level stack-based languages: dc, Forth, the HP RPN calculator language, etc. As a set of machine instructions, however, it has come under criticism, for two different reasons.

The first criticism is that swapping is slow. Some computations are happy to work almost exclusively with the top of the stack, but others aren't. Compressing top-of-stack operations might produce a speedup by reducing code size, but this speedup is outweighed by the slowdown from constant swaps.

Intel responded by allowing a free swap between (essentially) any two Pentium floating-point operations. ``The fxch instruction can be executed in parallel with the commonly used FP instructions, which lets the code generator or programmer treat the floating-point stack as a regular register set without any performance degradation,'' Intel wrote many years ago.

The second criticism is that, even with a free swap before each operation, a stack still isn't as good as a traditional register set: sometimes a stack requires a bunch of extra swaps. This criticism usually comes from compiler writers who are trying to generate x86 code.
They use a traditional register allocator, as they learned in school, for a simulated machine with 8 registers that don't move; then they try to convert the resulting code into x86 code. The simplest conversion strategy, namely inserting a swap before any operation that uses two non-top values, is fine for a sequence of operations with no labels and no jumps (a ``basic block''), but a backwards jump may require additional swaps to reshuffle the stack.

One response is that there are obvious ways to improve the conversion strategy. There's often a choice of swaps before an operation; by combining these choices one can usually eliminate the end-of-basic-block reshuffling.

I've recently realized that there's another response. Everyone keeps talking about a one-free-swap stack as worse than a traditional register set, the only question being how much worse. But maybe a one-free-swap stack is _better_ than a traditional register set!

Here's an example. Try applying a traditional 8-register allocator to the following loop:

     ...
     do {
       F += B;
       G += B;
       H += B;
       X = A + B;
       A = B + C;
       B = C + X;
       C += A;
       D += B;
       E += B;
     } while (...);
     ...

Let's start from the moment after the X assignment. The values assigned to B, C, D, E, F, G, H, X are all ``live''---they will be used later---so they need separate registers. Let's say they're in registers 0, 1, 2, 3, 4, 5, 6, 7. The assignment to A then ``kills'' B, but every other variable is still live, so we don't have any choice but to put A into register 0. Next, the assignment to B kills X, but every other variable is still live, so we don't have any choice but to put B into register 7. Similarly, the assignments to C, D, E, F, G, H are forced to leave C, D, E, F, G, H in registers 1, 2, 3, 4, 5, 6. Finally, the assignment to X kills A, but every other variable is still live, so we don't have any choice but to put X into register 0. Now we've reached our starting point again---with the registers in the wrong order.
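The forced choices in this walk-through can be checked mechanically. What follows is a minimal sketch in Python, not anything from a real compiler: the names LOOP, START_REGS, live_after and allocate_one_iteration are made up for illustration, each `+=` is rewritten as an ordinary three-operand assignment, and liveness is computed around the loop only, ignoring whatever the elided code after the loop might use. It places each newly assigned value into whichever register holds nothing that is still live, and records whether there was ever more than one candidate.

```python
# Loop body as (destination, operands), in program order.
# "F += B" is written as F = F + B.
LOOP = [
    ("F", ("F", "B")), ("G", ("G", "B")), ("H", ("H", "B")),
    ("X", ("A", "B")), ("A", ("B", "C")), ("B", ("C", "X")),
    ("C", ("C", "A")), ("D", ("D", "B")), ("E", ("E", "B")),
]

# Register contents just after "X = A + B" (statement index 3), as in the text.
START_REGS = {"B": 0, "C": 1, "D": 2, "E": 3, "F": 4, "G": 5, "H": 6, "X": 7}


def live_after(loop):
    """Backward liveness around the cyclic loop body, iterated to a fixed point.

    Assumption: only uses inside the loop count; the loop exit is ignored.
    """
    n = len(loop)
    live = [set() for _ in range(n)]  # live[i]: values live just after statement i
    changed = True
    while changed:
        changed = False
        for i in reversed(range(n)):
            nxt = (i + 1) % n                     # next statement, wrapping around
            dst, ops = loop[nxt]
            new = (live[nxt] - {dst}) | set(ops)  # values live just before nxt
            if new != live[i]:
                live[i], changed = new, True
    return live


def allocate_one_iteration(loop, start_index, reg_of, num_regs=8):
    """Walk once around the loop, starting after statement start_index.

    Each destination goes into a register that holds no value still live
    after the statement. Returns the final variable-to-register map and a
    flag saying whether every step had exactly one such register.
    """
    live = live_after(loop)
    reg_of = dict(reg_of)
    forced = True
    n = len(loop)
    for k in range(1, n + 1):
        i = (start_index + k) % n
        dst, _ = loop[i]
        keep = live[i] - {dst}                    # values that must not be overwritten
        free = sorted(set(range(num_regs)) - {reg_of[v] for v in keep})
        if len(free) != 1:
            forced = False
        reg_of = {v: reg_of[v] for v in keep}
        reg_of[dst] = free[0]
    return reg_of, forced


END_REGS, FORCED = allocate_one_iteration(LOOP, 3, START_REGS)
```

Under those assumptions, FORCED comes out true, and END_REGS agrees with START_REGS except that B and X have traded registers 0 and 7.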
The traditional register allocator has to insert extra code: a swap of B and X (typically via memory), or an unrolled loop. In traditional language introduced by Chaitin, the ``conflict graph'' of these variables is the complete graph on 9 vertices, so there's no way to ``8-color'' it, even though only 8 values are live at any moment.

With a one-free-swap stack, on the other hand, it's easy to handle the same loop without any extra shuffling:

     Stack: top -> B C D E F G H A <- bottom.
   F += B
     Stack: top -> B C D E F G H A <- bottom.
   G += B
     Stack: top -> B C D E F G H A <- bottom.
   H += B
     Stack: top -> B C D E F G H A <- bottom.
   swap A to top
     Stack: top -> A C D E F G H B <- bottom.
   X = A + B
     Stack: top -> X C D E F G H B <- bottom.
   swap B to top
     Stack: top -> B C D E F G H X <- bottom.
   A = B + C
     Stack: top -> A C D E F G H X <- bottom.
   swap X to top
     Stack: top -> X C D E F G H A <- bottom.
   B = C + X
     Stack: top -> B C D E F G H A <- bottom.
   swap C to top
     Stack: top -> C B D E F G H A <- bottom.
   C += A
     Stack: top -> C B D E F G H A <- bottom.
   swap B to top
     Stack: top -> B C D E F G H A <- bottom.
   D += B
     Stack: top -> B C D E F G H A <- bottom.
   E += B

So, in this example, the one-free-swap stack is actually _faster_ than the traditional register set.

Everyone working on traditional register allocation knows that it's sometimes impossible to fit code into 8 traditional registers, even though the code has at most 8 live values. Nobody seems to have realized that some of the same examples _can_ fit into an 8-element one-free-swap stack. Of course, you can't see this if you make the mistake of starting with traditional 8-register allocation; you'll throw away all of these examples before you generate stack code!

Yes, there are cases where a one-free-swap stack is worse than a traditional register set. One of the worst cases is a jump that kills several values simultaneously, forcing several stack pops.
But there are also cases where a one-free-swap stack is better than a traditional register set.

I suggest that compiler writers stop viewing the x86 stack as a twisted version of a traditional 8-register set. Any code that uses at most 8 live values should be translated directly into stack code. Assigning traditional registers to the values is a bad idea.
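For readers who want to check the stack schedule for the example loop, here is a small Python sketch that replays it. It is only an illustrative model, not x86 code: the operation names "swap", "to_top" and "from_top" are made up here to stand for FXCH, "add a stack value to the top" and "add the top to a stack value", and the starting values 1 through 8 are arbitrary. It checks three things: no operation is preceded by more than one swap, the stack layout at the end of the loop body equals the layout at the start, and the values agree with running the loop body directly.

```python
START = ["B", "C", "D", "E", "F", "G", "H", "A"]  # index 0 is the top of the stack

# Schedule transcribed from the stack trace for the example loop.
#   ("swap", v)            exchange the top with the slot holding v
#   ("to_top", src, name)  top += slot holding src; the result is called name
#   ("from_top", dst)      slot holding dst += top
SCHEDULE = [
    ("from_top", "F"), ("from_top", "G"), ("from_top", "H"),
    ("swap", "A"), ("to_top", "B", "X"),   # X = A + B
    ("swap", "B"), ("to_top", "C", "A"),   # A = B + C
    ("swap", "X"), ("to_top", "C", "B"),   # B = C + X
    ("swap", "C"), ("to_top", "A", "C"),   # C += A
    ("swap", "B"), ("from_top", "D"), ("from_top", "E"),
]


def run_stack(names, values, schedule):
    """Replay the schedule on a stack of names and a parallel stack of values."""
    names, values = list(names), list(values)
    for op in schedule:
        if op[0] == "swap":
            i = names.index(op[1])
            names[0], names[i] = names[i], names[0]
            values[0], values[i] = values[i], values[0]
        elif op[0] == "to_top":
            values[0] += values[names.index(op[1])]
            names[0] = op[2]               # the top now holds the new variable
        else:                              # "from_top"
            values[names.index(op[1])] += values[0]
    return names, values


def run_loop(v):
    """One pass of the loop body, evaluated directly on a dict of values."""
    v = dict(v)
    v["F"] += v["B"]
    v["G"] += v["B"]
    v["H"] += v["B"]
    v["X"] = v["A"] + v["B"]
    v["A"] = v["B"] + v["C"]
    v["B"] = v["C"] + v["X"]
    v["C"] += v["A"]
    v["D"] += v["B"]
    v["E"] += v["B"]
    return v


def at_most_one_swap_per_op(schedule):
    """True if no two swaps are adjacent and the schedule does not end in a swap."""
    kinds = [op[0] for op in schedule] + ["end"]
    return all(not (a == "swap" and b in ("swap", "end"))
               for a, b in zip(kinds, kinds[1:]))


INITIAL = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "H": 8}
END_NAMES, END_VALUES = run_stack(START, [INITIAL[n] for n in START], SCHEDULE)
EXPECTED = run_loop(INITIAL)
```

In this model END_NAMES equals START, so the next iteration can begin with no reshuffling, and each entry of END_VALUES matches the corresponding variable in EXPECTED.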