Maybe the computation you're actually interested in is, say, the SHAvite-3 hash function with 256-bit output. It's still a good idea to start with the add-1000-integers problem, a simple problem that allows you to learn fundamental issues of latency and throughput before tackling more difficult problems.
Maybe the CPU you're actually interested in is, say, your laptop's 1.4GHz Intel Core 2 Duo U9400. It's still a good idea to start with the PowerPC, for three reasons. First, the PowerPC has a simpler architecture than the Core 2, i.e., a simpler set of instructions that it understands; you'll quickly be able to learn all the relevant features of the instruction set, and then focus on broader optimization issues. Second, this PowerPC has a much simpler microarchitecture than the Core 2, i.e., a much simpler description of how quickly it carries out instructions. Third, other lab participants will also be starting with this PowerPC; you'll be able to discuss your PowerPC speeds with them.
You should already be comfortable with writing and running C code from a command line on a remote Linux server, as illustrated by the following example. Log in to daedalus.crypto.rites.uic.edu:
ssh daedalus.crypto.rites.uic.edu

On daedalus (i.e., within this ssh daedalus session), create a hello.c file with the following contents:
#include <stdio.h>

int main()
{
  printf("hello world\n");
  return 0;
}

You should already know how to do this with a command such as vi hello.c. Now compile hello.c, creating hello:
gcc -o hello hello.c

Finally, run hello:
./hello
Create a sum1000.c file with the following contents:

int sum1000(int *x)
{
  int result = 0;
  int i;
  for (i = 0;i < 1000;++i) result += x[i];
  return result;
}

This machine's C compiler defines int as a 32-bit integer, i.e., an integer between -2147483648 and 2147483647, with arithmetic modulo 2^32.
Create an accuracy.c as a double-check for sum1000.c:
#include <stdio.h>
#include <stdlib.h>

int x[1000];

int main()
{
  int r;
  int expected = 0;
  int result;
  int i;
  for (i = 0;i < 1000;++i) {
    r = random();
    expected += r;
    x[i] = r;
  }
  result = sum1000(x);
  if (result == expected) return 0;
  printf("%d != expected %d\n",result,expected);
  return 100;
}
Create a Makefile that compiles sum1000.c together with accuracy.c:
CC=gcc -O1

default: accuracy

accuracy: accuracy.c sum1000.c
	$(CC) -o accuracy accuracy.c sum1000.c

On the $(CC) line, make sure to type a tab rather than 8 spaces.
Compile and run:
make
./accuracy

The program finishes immediately, indicating success. If the double-check had failed then the program would have printed the difference.
The time() function. Create a speed.c file with the following contents:

#include <time.h>
#include <stdio.h>

int x[1000];
long long t[20];

int main()
{
  int i;
  for (i = 0;i < 20;++i) {
    t[i] = time(0);
    sum1000(x);
  }
  for (i = 1;i < 20;++i) printf("%lld ",t[i] - t[i - 1]);
  printf("\n");
  return 0;
}

This program has 19 successive calls to the sum1000 function, surrounded by 20 timestamps returned by the time() function from the standard C library. This program prints the 19 differences between successive timestamps.
Extend the Makefile to compile both accuracy.c and speed.c:
CC=gcc -O1

default: accuracy speed

accuracy: accuracy.c sum1000.c
	$(CC) -o accuracy accuracy.c sum1000.c

speed: speed.c sum1000.c
	$(CC) -o speed speed.c sum1000.c
Compile and run:
make && ./accuracy && ./speed

The && runs ./accuracy only if make succeeds, and runs ./speed only if ./accuracy succeeds.
Whoops, all the outputs are 0! The problem here is that time() has very low precision, and therefore almost always very low accuracy. It returns the number of seconds since the UNIX epoch, rounded down to an integer.
The clock() function. Change speed.c to use clock() instead of time():
#include <time.h>
#include <stdio.h>

int x[1000];
long long t[20];

int main()
{
  int i;
  for (i = 0;i < 20;++i) {
    t[i] = clock();
    sum1000(x);
  }
  for (i = 1;i < 20;++i) printf("%lf ",(t[i] - t[i - 1]) * 1.0 / CLOCKS_PER_SEC);
  printf("\n");
  return 0;
}
Recompile and run:
make && ./accuracy && ./speed
Whoops, all the outputs are still 0! The problem here is that clock() still has very low precision, namely 0.01 seconds.
The gettimeofday() function. Change speed.c to use gettimeofday():
#include <sys/time.h>
#include <stdio.h>

long long microseconds(void)
{
  struct timeval tv;
  gettimeofday(&tv,0);
  return tv.tv_sec * 1000000LL + tv.tv_usec;
}

int x[1000];
long long t[20];

int main()
{
  int i;
  for (i = 0;i < 20;++i) {
    t[i] = microseconds();
    sum1000(x);
  }
  for (i = 1;i < 20;++i) printf("%lld ",t[i] - t[i - 1]);
  printf("\n");
  return 0;
}
Recompile and run:
make && ./accuracy && ./speed
Now the timings are nonzero:
27 4 3 3 4 4 4 3 3 4 4 3 4 4 3 3 4 4 4

The first call took about 27 microseconds, and the subsequent calls took about 3 or 4 microseconds. Recall that this PowerPC runs at 600MHz, i.e., 600 million cycles per second, so about 3 or 4 microseconds means about 1800 or 2400 cycles.
The clock_gettime() function. Change speed.c to use clock_gettime():
#include <time.h>
#include <stdio.h>

long long nanoseconds(void)
{
  struct timespec t;
  clock_gettime(CLOCK_MONOTONIC,&t);
  return t.tv_sec * 1000000000LL + t.tv_nsec;
}

int x[1000];
long long t[20];

int main()
{
  int i;
  for (i = 0;i < 20;++i) {
    t[i] = nanoseconds();
    sum1000(x);
  }
  for (i = 1;i < 20;++i) printf("%lld ",t[i] - t[i - 1]);
  printf("\n");
  return 0;
}
Change the Makefile to compile speed with -lrt, the library that contains the clock_gettime() function:
CC=gcc -O1

default: accuracy speed

accuracy: accuracy.c sum1000.c
	$(CC) -o accuracy accuracy.c sum1000.c

speed: speed.c sum1000.c
	$(CC) -o speed speed.c sum1000.c -lrt
Recompile and run:
make && ./accuracy && ./speed
The timings seem to be just the gettimeofday() timings multiplied by 1000:
23000 4000 4000 4000 4000 3000 3000 4000 4000 3000 4000 4000 3000 3000 4000 4000 4000 3000 3000

Running the program several times can produce variable results. Here is another run:
23000 5000 3000 3000 4000 4000 3000 4000 4000 3000 3000 81000 4000 4000 3000 4000 3000 4000 3000

Notice the 81000 in the middle: a computation that did not finish until 81000 nanoseconds later. This is not an inaccuracy in clock_gettime(); it is a real delay, perhaps caused by a network packet distracting the operating system for nearly 80000 nanoseconds.
The PowerPC time base. Change speed.c to use the PowerPC "time base" as follows:
#include <time.h>
#include <stdio.h>

long long cycles(void)
{
  unsigned long high;
  unsigned long low;
  unsigned long newhigh;
  unsigned long long result;
  asm volatile(
    "7:mftbu %0;mftb %1;mftbu %2;cmpw %0,%2;bne 7b"
    : "=r" (high), "=r" (low), "=r" (newhigh)
  );
  result = high;
  result <<= 32;
  result |= low;
  return result * 24;
}

int x[1000];
long long t[20];

int main()
{
  int i;
  for (i = 0;i < 20;++i) {
    t[i] = cycles();
    sum1000(x);
  }
  for (i = 1;i < 20;++i) printf("%lld ",t[i] - t[i - 1]);
  printf("\n");
  return 0;
}

This "time base" is a hardware feature of the CPU: a 64-bit counter that increases by 1 every 24 cycles. Beware that the 24 is specific to the PowerPC 750CXe; other PowerPCs have different time-base frequencies.
Recompile and run:
make && ./accuracy && ./speed
Now the results are much more stable:
15984 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3000

These results show that sum1000()+cycles() usually takes 3024 cycles, plus or minus 24 cycles for the 24-cycle precision of the time base; i.e., the loop takes about 3 cycles per array element.
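If you move to a different PowerPC and aren't sure of its time-base frequency, one way to estimate it is to compare the raw time base against gettimeofday() over a known interval. Here is a minimal sketch (not part of the lab) that reuses the mftb assembly and the microseconds() function shown earlier; the use of sleep(1) and the assumed 600MHz clock in the final line are arbitrary choices that you would adapt to your machine:

#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

/* Read the raw 64-bit time base, without any conversion to cycles. */
long long timebase(void)
{
  unsigned long high;
  unsigned long low;
  unsigned long newhigh;
  unsigned long long result;
  asm volatile(
    "7:mftbu %0;mftb %1;mftbu %2;cmpw %0,%2;bne 7b"
    : "=r" (high), "=r" (low), "=r" (newhigh)
  );
  result = high;
  result <<= 32;
  result |= low;
  return result;
}

long long microseconds(void)
{
  struct timeval tv;
  gettimeofday(&tv,0);
  return tv.tv_sec * 1000000LL + tv.tv_usec;
}

int main()
{
  long long tb0, tb1, us0, us1;
  double tbpersecond;
  tb0 = timebase(); us0 = microseconds();
  sleep(1);
  tb1 = timebase(); us1 = microseconds();
  tbpersecond = (tb1 - tb0) * 1000000.0 / (us1 - us0);
  /* On the 600MHz PowerPC 750CXe this should print roughly 25000000,
     i.e., 600000000/24 time-base ticks per second. */
  printf("time-base ticks per second: %.0f\n",tbpersecond);
  printf("cycles per tick at 600MHz: %.2f\n",600000000.0 / tbpersecond);
  return 0;
}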
General notes on accurate timing. These examples illustrate the fact that cycle counters, such as the PowerPC time base, are much more accurate than typical timing mechanisms.
These examples also illustrate the fact that calling a function usually does not take constant time, even if the function seems to be performing a constant amount of work. In particular, the first call to this function is much slower than the rest. (Why do you think this is the case?) This is a very common phenomenon, and an important phenomenon, since many functions are not called in loops. Functions can also be interrupted by the operating system at any moment, often for a very long time; see the 81000 example above. There are many other important sources of variability in function timing.
Many authors report average times (e.g., average cycle counts) without any hint of the variance in the times. This is a horrifyingly bad practice. An average over a small number of samples is very often corrupted by interruptions. An average over a large number of samples is less random but still includes, and is often heavily influenced by, the interruptions: for example, a long-term average time T will turn into approximately 2T if the computer happens to be running something else during the measurement. Authors often try to reduce interruptions ("the computer was otherwise idle"), and perhaps they succeed, but there is no way to be sure about this when the only measurement reported is an average.
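A simple way out of this trap is to report the distribution of the measurements rather than a single average. Here is a minimal sketch (not part of the lab) along those lines. It assumes a cycles() function like the one in speed.c above and the sum1000() function being measured, compiled into separate files without their own main(); it takes 100 measurements, sorts them, and prints the minimum, quartiles, median, and maximum, so that interruptions show up as outliers instead of silently inflating an average:

#include <stdio.h>
#include <stdlib.h>

extern long long cycles(void); /* the time-base cycle counter shown above */
extern int sum1000(int *);     /* the function being measured */

int x[1000];
long long t[101];
long long d[100];

int compare(const void *a,const void *b)
{
  long long da = *(const long long *) a;
  long long db = *(const long long *) b;
  if (da < db) return -1;
  if (da > db) return 1;
  return 0;
}

int main()
{
  int i;
  for (i = 0;i < 100;++i) {
    t[i] = cycles();
    sum1000(x);
  }
  t[100] = cycles();
  for (i = 0;i < 100;++i) d[i] = t[i + 1] - t[i];
  qsort(d,100,sizeof d[0],compare);
  printf("min %lld  25%% %lld  median %lld  75%% %lld  max %lld\n",
    d[0],d[24],d[49],d[74],d[99]);
  return 0;
}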
Copying from RAM to a register is called a load. Change sum1000.c to perform this load explicitly:
int sum1000(int *x)
{
  register int result = 0;
  register int i;
  for (i = 0;i < 1000;++i) {
    register int xi = x[i];
    result += xi;
  }
  return result;
}

C doesn't guarantee that register variables will actually occupy CPU registers, but in this case the compiler turns out to be smart enough to put xi and i and result into CPU registers.
Recompile and run:
make && ./accuracy && ./speed

Again the loop takes 3 cycles per iteration:
15960 3048 3024 3000 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024

Separating this one C line into two didn't make any difference in performance, but it's a first step towards understanding how the CPU is actually handling this code.
Pointer arithmetic. From the CPU's perspective, bytes in memory have addresses 0, 1, 2, 3, 4, 5, ..., and x is one of these addresses. Let's say x is address 100. The first array element x[0] consists of the 4 bytes located at addresses 100, 101, 102, 103; the second array element x[1] consists of the 4 bytes located at addresses 104, 105, 106, 107; and in general the array element x[i] consists of the 4 bytes located at addresses 100+4*i, 101+4*i, 102+4*i, 103+4*i. To figure out the address of x[i] the CPU has to compute 4*i and add that to x.
Change sum1000.c to view x the way the CPU does, as an integer address:
int sum1000(int x)
{
  register int result = 0;
  register int i;
  for (i = 0;i < 1000;++i) {
    register int xi = *(int *) (x+4*i);
    result += xi;
  }
  return result;
}

The int *x definition has changed to int x, and x[i] has changed to *(int *) (x+4*i). Here x+4*i is the address 4*i bytes after x, and *(int *) refers to the 4 bytes at this address.
Limitations on pointer arithmetic. The PowerPC instruction set doesn't actually have an instruction
register int xi = *(int *) (x+4*i);

but it does have an instruction
register int xi = *(int *) (x+i);
Change sum1000.c to keep track of 4*i instead of i:
int sum1000(int x)
{
  register int result = 0;
  register int i4;
  for (i4 = 0;i4 < 4000;i4 += 4) {
    register int xi = *(int *) (x+i4);
    result += xi;
  }
  return result;
}
Conditional branches. Change sum1000.c to use goto instead of for:
int sum1000(int x)
{
  register int result = 0;
  register int i4 = 0;
  register int xi;
  startofloop:
  xi = *(int *) (x+i4);
  result += xi;
  i4 += 4;
  if (i4 < 4000) goto startofloop;
  return result;
}
The PowerPC instruction set doesn't actually have an instruction
if (i4 < 4000) goto startofloop;

but it does have an instruction that's essentially
if (i4 >= 0) goto startofloop;

if i4 has just been modified. To take advantage of this, change sum1000.c to decrease i4 from 3996 to 0 rather than increasing it from 0 to 3996:
int sum1000(int x)
{
  register int result = 0;
  register int i4 = 3996;
  register int xi;
  startofloop:
  xi = *(int *) (x+i4);
  result += xi;
  i4 -= 4;
  if (i4 >= 0) goto startofloop;
  return result;
}
The qhasm language. The next step is to take direct control of the instructions by writing the function in qhasm, which is translated into assembly language. Create a new file sum1000.q with the following contents:
int32 x
int32 sum
int32 result
int32 i4
int32 xi

input x
output result

enter sum1000

sum = 0
i4 = 3996

startofloop:

xi = *(int32 *) (x + i4)
sum += xi
<? i4 -= 4
goto startofloop if !<

result = sum

leave

The int32 lines define 32-bit register variables. The input and output lines define function parameters and return values. The enter and leave lines define the start and end of the function. The <? i4 -= 4 first subtracts 4 from i4 and then sets a CPU flag, specifically the PowerPC's less-than flag, if the new value of i4 is smaller than 0 (as a signed 32-bit integer). The goto startofloop if !< branches back to startofloop if the flag is not set, i.e., if the new value of i4 is greater than or equal to 0. The final result = sum might seem silly, but the obvious alternative (simply saying output sum and skipping result) doesn't work: on the PowerPC, the first input and the first output are always placed in the same register, so they can't be used simultaneously, but this code uses x and sum simultaneously.
Change Makefile to use sum1000.q instead of sum1000.c:
CC=gcc -O1

default: accuracy speed

accuracy: accuracy.c sum1000.o
	$(CC) -o accuracy accuracy.c sum1000.o

speed: speed.c sum1000.o
	$(CC) -o speed speed.c sum1000.o -lrt

sum1000.o: sum1000.s
	$(CC) -c sum1000.s

sum1000.s: sum1000.q
	qhasm-ppc32-linux < sum1000.q > sum1000.s
Recompile and run:
make && ./accuracy && ./speed
Suddenly the loop takes only 2 cycles per iteration:
16272 2040 2088 2016 2040 2016 2112 2016 2040 2016 2040 2016 2040 2016 2040 2016 2040 2016 2040

This isn't great performance yet: we could have achieved the same speed by playing with the compiler options. But now we're in a position to start serious optimization.
The IBM PowerPC 750CX/CXe/CXr user manual says that "As many as two instructions can be dispatched per clock." Dispatching 4 instructions takes at least 2 cycles, so to reduce cycles/iteration below 2 we have to reduce instructions/iteration below 4.
Each iteration has 2 unavoidable instructions (the load and the addition) and 2 overhead instructions (the subtraction and the branch). Reducing the number of overhead instructions is a simple matter of unrolling:
int32 x
int32 sum
int32 result
int32 i4
int32 xi
int32 xminus4
int32 xminus8
int32 xminus12

input x
output result

enter sum1000

sum = 0
i4 = 3996
xminus4 = x - 4
xminus8 = x - 8
xminus12 = x - 12

startofloop:

xi = *(int32 *) (x + i4)
sum += xi
xi = *(int32 *) (xminus4 + i4)
sum += xi
xi = *(int32 *) (xminus8 + i4)
sum += xi
xi = *(int32 *) (xminus12 + i4)
sum += xi
<? i4 -= 16
goto startofloop if !<

result = sum

leave

This reduces the number of cycles/iteration to 1.5:
15312 1560 1584 1512 1536 1536 1608 1536 1512 1536 1536 1512 1536 1536 1536 1512 1536 1536 1536

But this is not as much improvement as one might hope. The number of instructions/iteration is now only 2.5, so if 2 instructions were dispatched per cycle then the number of cycles/iteration would be only 1.25, not 1.5. Why do the 10 instructions in the inner loop take 6 cycles instead of 5?
The PowerPC 750CXe has a single "load unit" that handles all loads, and two integer units that handle integer additions, subtractions, etc. All of these units can operate simultaneously, handling one load and two integer operations in a single cycle, but with some restrictions. All integer operations are carried out in the original order of the integer instructions: an integer instruction can be handled in the same cycle as an earlier integer instruction, but never in an earlier cycle. Similarly, all loads are carried out in the original order of the load instructions. The integer latency is 1: the output of an integer instruction cannot be used until 1 cycle later. The load latency is 2: the output of a load cannot be used until 2 cycles later.
Consider what happens if the load from x+i4 starts on cycle 1000. The load from xminus4+i4 cannot start before cycle 1001. The load from xminus8+i4 cannot start before cycle 1002. The load from xminus12+i4 cannot start before cycle 1003. The xi produced by that load cannot be added to sum before cycle 1005. The subsequent integer operation i4-=16 cannot occur before cycle 1005. The next load from x+i4 cannot start before cycle 1006. That's 6 cycles for each trip around the loop, i.e., 6 cycles for 4 array elements, i.e., the 1.5 cycles/iteration measured above.
Breaking this chain is a simple matter of moving i4-=16 up one instruction, so that it can start in an earlier cycle:
int32 x
int32 sum
int32 result
int32 i4
int32 xi
int32 xminus4
int32 xminus8
int32 xminus12

input x
output result

enter sum1000

sum = 0
i4 = 3996
xminus4 = x - 4
xminus8 = x - 8
xminus12 = x - 12

startofloop:

xi = *(int32 *) (x + i4)
sum += xi
xi = *(int32 *) (xminus4 + i4)
sum += xi
xi = *(int32 *) (xminus8 + i4)
sum += xi
xi = *(int32 *) (xminus12 + i4)
<? i4 -= 16
sum += xi
goto startofloop if !<

result = sum

leave

Each iteration now takes just 1.25 cycles:
15408 1296 1344 1272 1272 1272 1344 1296 1272 1272 1272 1296 1272 1272 1272 1296 1272 1272 1272
Further unrolling brings the cost down to just 1.125 cycles/iteration: the loop body below has 18 instructions (8 loads, 8 additions, a subtraction, and a branch) covering 8 array elements, and 18 instructions dispatched 2 per cycle means at best 9 cycles for 8 array elements.
int32 x
int32 sum
int32 result
int32 i4
int32 xi
int32 xminus4
int32 xminus8
int32 xminus12
int32 xminus16
int32 xminus20
int32 xminus24
int32 xminus28

input x
output result

enter sum1000

sum = 0
i4 = 3996
xminus4 = x - 4
xminus8 = x - 8
xminus12 = x - 12
xminus16 = x - 16
xminus20 = x - 20
xminus24 = x - 24
xminus28 = x - 28

startofloop:

xi = *(int32 *) (x + i4)
sum += xi
xi = *(int32 *) (xminus4 + i4)
sum += xi
xi = *(int32 *) (xminus8 + i4)
sum += xi
xi = *(int32 *) (xminus12 + i4)
sum += xi
xi = *(int32 *) (xminus16 + i4)
sum += xi
xi = *(int32 *) (xminus20 + i4)
sum += xi
xi = *(int32 *) (xminus24 + i4)
sum += xi
xi = *(int32 *) (xminus28 + i4)
<? i4 -= 32
sum += xi
goto startofloop if !<

result = sum

leave
Some people boldly claim that optimizing compilers have eliminated any need for assembly language. For comparison, here is a run of the original sum1000.c compiled with

gcc -O3 -mcpu=750 -fschedule-insns2 -funroll-loops

which is a fairly big step backwards in performance compared to the assembly-language code:
14496 1752 1680 1728 1848 1704 1680 1752 1584 1584 1584 1584 1608 1584 1584 1584 1584 1584 1608
Some people make an equally bold claim in the opposite direction: that portable C code, no matter how well done, can never match the performance of assembly language. Can you write portable C code that runs close to 1 cycle/iteration? The question isn't whether this is as easy as writing in assembly language; the question is whether it's possible.
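For what it's worth, here is one possible starting point for that experiment (a sketch, not a claimed answer to the question): portable C with the loop unrolled by 4 and two independent accumulators, mimicking the instruction-level overlap that the hand-scheduled assembly exploited. Whether a particular compiler turns this into something close to 1 cycle/iteration is exactly what you would have to measure:

/* Portable C: same result as sum1000.c, but unrolled by 4 with two
   independent accumulators so that the additions do not form a single
   dependency chain. */
int sum1000(int *x)
{
  int sum0 = 0;
  int sum1 = 0;
  int i;
  for (i = 0;i < 1000;i += 4) {
    sum0 += x[i];
    sum1 += x[i + 1];
    sum0 += x[i + 2];
    sum1 += x[i + 3];
  }
  return sum0 + sum1;
}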
There are many different assemblers, and all of them are capable of producing the same performance, close to 1 cycle/iteration. How easy is it to write the code using each of these tools? This lab focused on qhasm for ease of use, but perhaps there are competitors.
Once you understand the add-1000-integers example on the PowerPC 750CXe, try exploring the same problem on other CPUs. For each CPU, try to figure out a lower bound on the number of cycles needed, and then see how close you can come to the lower bound.