Maybe the computation you're actually interested in is, say, the SHAvite-3 hash function with 256-bit output. It's still a good idea to start with the add-1000-integers problem, a simple problem that allows you to learn fundamental issues of latency and throughput before tackling more difficult problems.
Maybe the CPU you're actually interested in is, say, your laptop's 1.4GHz Intel Core 2 Duo U9400. It's still a good idea to start with the PowerPC, for three reasons. First, the PowerPC has a simpler architecture than the Core 2, i.e., a simpler set of instructions that it understands; you'll quickly be able to learn all the relevant features of the instruction set, and then focus on broader optimization issues. Second, this PowerPC has a much simpler microarchitecture than the Core 2, i.e., a much simpler description of how quickly it carries out instructions. Third, other lab participants will also be starting with this PowerPC; you'll be able to discuss your PowerPC speeds with them.
You should already be comfortable with writing and running C code from a command line on a remote Linux server, as illustrated by the following example. Log in to daedalus.crypto.rites.uic.edu:
ssh daedalus.crypto.rites.uic.edu

On daedalus (i.e., within this ssh daedalus session), create a hello.c file with the following contents:
#include <stdio.h>

int main()
{
  printf("hello world\n");
  return 0;
}

You should already know how to do this with a command such as vi hello.c. Now compile hello.c, creating hello:
gcc -o hello hello.c

Finally, run hello:
./hello
Create a sum1000.c file with the following contents:

int sum1000(int *x)
{
  int result = 0;
  int i;
  for (i = 0;i < 1000;++i) result += x[i];
  return result;
}

This machine's C compiler defines int as a 32-bit integer, i.e., an integer between -2147483648 and 2147483647, with arithmetic modulo 2^32.
Create an accuracy.c as a double-check for sum1000.c:
#include <stdio.h>
#include <stdlib.h>

int x[1000];

int main()
{
  int r;
  int expected = 0;
  int result;
  int i;
  for (i = 0;i < 1000;++i) {
    r = random();
    expected += r;
    x[i] = r;
  }
  result = sum1000(x);
  if (result == expected) return 0;
  printf("%d != expected %d\n",result,expected);
  return 100;
}
Create a Makefile that compiles sum1000.c together with accuracy.c:
CC=gcc -O1

default: accuracy

accuracy: accuracy.c sum1000.c
	$(CC) -o accuracy accuracy.c sum1000.c

On the $(CC) line, make sure to type a tab rather than 8 spaces.
Compile and run:
make
./accuracy

The program finishes immediately, indicating success. If the double-check had failed then the program would have printed the difference.
The time() function. Create a speed.c file with the following contents:

#include <time.h>
#include <stdio.h>

int x[1000];
long long t[20];

int main()
{
  int i;
  for (i = 0;i < 20;++i) {
    t[i] = time(0);
    sum1000(x);
  }
  for (i = 1;i < 20;++i) printf("%lld ",t[i] - t[i - 1]);
  printf("\n");
  return 0;
}

This program has 19 successive calls to the sum1000 function, surrounded by 20 timestamps returned by the time() function from the standard C library. This program prints the 19 differences between successive timestamps.
Extend the Makefile to compile both accuracy.c and speed.c:
CC=gcc -O1

default: accuracy speed

accuracy: accuracy.c sum1000.c
	$(CC) -o accuracy accuracy.c sum1000.c

speed: speed.c sum1000.c
	$(CC) -o speed speed.c sum1000.c
Compile and run:
make && ./accuracy && ./speed

The && runs ./accuracy only if make succeeds, and runs ./speed only if ./accuracy succeeds.
Whoops, all the outputs are 0! The problem here is that time() has very low precision, and therefore almost always very low accuracy. It returns the number of seconds since the UNIX epoch, rounded down to an integer.
The clock() function. Change speed.c to use clock() instead of time():
#include <time.h>
#include <stdio.h>

int x[1000];
long long t[20];

int main()
{
  int i;
  for (i = 0;i < 20;++i) {
    t[i] = clock();
    sum1000(x);
  }
  for (i = 1;i < 20;++i) printf("%lf ",(t[i] - t[i - 1]) * 1.0 / CLOCKS_PER_SEC);
  printf("\n");
  return 0;
}
Recompile and run:
make && ./accuracy && ./speed
Whoops, all the outputs are still 0! The problem here is that clock() still has very low precision, namely 0.01 seconds.
The gettimeofday() function. Change speed.c to use gettimeofday():
#include <sys/time.h>
#include <stdio.h>

long long microseconds(void)
{
  struct timeval tv;
  gettimeofday(&tv,0);
  return tv.tv_sec * 1000000LL + tv.tv_usec;
}

int x[1000];
long long t[20];

int main()
{
  int i;
  for (i = 0;i < 20;++i) {
    t[i] = microseconds();
    sum1000(x);
  }
  for (i = 1;i < 20;++i) printf("%lld ",t[i] - t[i - 1]);
  printf("\n");
  return 0;
}
Recompile and run:
make && ./accuracy && ./speed
Now the timings are nonzero:
27 4 3 3 4 4 4 3 3 4 4 3 4 4 3 3 4 4 4

The first call took about 27 microseconds, and the subsequent calls took about 3 or 4 microseconds. Recall that this PowerPC runs at 600MHz, i.e., 600 million cycles per second, so about 3 or 4 microseconds means about 1800 or 2400 cycles.
The clock_gettime() function. Change speed.c to use clock_gettime():
#include <time.h>
#include <stdio.h>

long long nanoseconds(void)
{
  struct timespec t;
  clock_gettime(CLOCK_MONOTONIC,&t);
  return t.tv_sec * 1000000000LL + t.tv_nsec;
}

int x[1000];
long long t[20];

int main()
{
  int i;
  for (i = 0;i < 20;++i) {
    t[i] = nanoseconds();
    sum1000(x);
  }
  for (i = 1;i < 20;++i) printf("%lld ",t[i] - t[i - 1]);
  printf("\n");
  return 0;
}
Change the Makefile to compile speed with -lrt, the library that contains the clock_gettime() function:
CC=gcc -O1

default: accuracy speed

accuracy: accuracy.c sum1000.c
	$(CC) -o accuracy accuracy.c sum1000.c

speed: speed.c sum1000.c
	$(CC) -o speed speed.c sum1000.c -lrt
Recompile and run:
make && ./accuracy && ./speed
The timings seem to be just the gettimeofday() timings multiplied by 1000:
23000 4000 4000 4000 4000 3000 3000 4000 4000 3000 4000 4000 3000 3000 4000 4000 4000 3000 3000

Running the program several times can produce variable results. Here is another run:
23000 5000 3000 3000 4000 4000 3000 4000 4000 3000 3000 81000 4000 4000 3000 4000 3000 4000 3000

Notice the 81000 in the middle: a computation that did not finish until 81000 nanoseconds later. This is not an inaccuracy in clock_gettime(); it is a real delay, perhaps caused by a network packet distracting the operating system for nearly 80000 nanoseconds.
The PowerPC time base. Change speed.c to use the PowerPC "time base" as follows:
#include <time.h>
#include <stdio.h>

long long cycles(void)
{
  unsigned long high;
  unsigned long low;
  unsigned long newhigh;
  unsigned long long result;
  asm volatile(
    "7:mftbu %0;mftb %1;mftbu %2;cmpw %0,%2;bne 7b"
    : "=r" (high), "=r" (low), "=r" (newhigh)
  );
  result = high;
  result <<= 32;
  result |= low;
  return result * 24;
}

int x[1000];
long long t[20];

int main()
{
  int i;
  for (i = 0;i < 20;++i) {
    t[i] = cycles();
    sum1000(x);
  }
  for (i = 1;i < 20;++i) printf("%lld ",t[i] - t[i - 1]);
  printf("\n");
  return 0;
}

This "time base" is a hardware feature of the CPU: a 64-bit counter that increases by 1 every 24 cycles. Beware that the 24 is specific to the PowerPC 750CXe; other PowerPCs have different time-base frequencies.
Recompile and run:
make && ./accuracy && ./speed
Now the results are much more stable:
15984 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3000

These results show that sum1000()+cycles() usually takes 3024 cycles, plus or minus 24 cycles for the 24-cycle precision of the time base; i.e., the loop takes about 3 cycles per array element.
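If you move to a different PowerPC and aren't sure of its time-base frequency, one way to estimate it is to compare the raw time base against gettimeofday() over a known interval. Here is a minimal sketch (not part of the lab) that reuses the mftb assembly and the microseconds() function shown earlier; the use of sleep(1) and the assumed 600MHz clock in the final line are arbitrary choices that you would adapt to your machine:

#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

/* Read the raw 64-bit time base, without any conversion to cycles. */
long long timebase(void)
{
  unsigned long high;
  unsigned long low;
  unsigned long newhigh;
  unsigned long long result;
  asm volatile(
    "7:mftbu %0;mftb %1;mftbu %2;cmpw %0,%2;bne 7b"
    : "=r" (high), "=r" (low), "=r" (newhigh)
  );
  result = high;
  result <<= 32;
  result |= low;
  return result;
}

long long microseconds(void)
{
  struct timeval tv;
  gettimeofday(&tv,0);
  return tv.tv_sec * 1000000LL + tv.tv_usec;
}

int main()
{
  long long tb0, tb1, us0, us1;
  double tbpersecond;
  tb0 = timebase(); us0 = microseconds();
  sleep(1);
  tb1 = timebase(); us1 = microseconds();
  tbpersecond = (tb1 - tb0) * 1000000.0 / (us1 - us0);
  /* On the 600MHz PowerPC 750CXe this should print roughly 25000000,
     i.e., 600000000/24 time-base ticks per second. */
  printf("time-base ticks per second: %.0f\n",tbpersecond);
  printf("cycles per tick at 600MHz: %.2f\n",600000000.0 / tbpersecond);
  return 0;
}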
General notes on accurate timing. These examples illustrate the fact that cycle counters, such as the PowerPC time base, are much more accurate than typical timing mechanisms.
These examples also illustrate the fact that calling a function usually does not take constant time, even if the function seems to be performing a constant amount of work. In particular, the first call to this function is much slower than the rest. (Why do you think this is the case?) This is a very common phenomenon, and an important phenomenon, since many functions are not called in loops. Functions can also be interrupted by the operating system at any moment, often for a very long time; see the 81000 example above. There are many other important sources of variability in function timing.
Many authors report average times (e.g., average cycle counts) without any hint of the variance in the times. This is a horrifyingly bad practice. An average over a small number of samples is very often corrupted by interruptions. An average over a large number of samples is less random but still includes, and is often heavily influenced by, the interruptions: for example, a long-term average time T will turn into approximately 2T if the computer happens to be running something else during the measurement. Authors often try to reduce interruptions ("the computer was otherwise idle"), and perhaps they succeed, but there is no way to be sure about this when the only measurement reported is an average.
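A simple way out of this trap is to report the distribution of the measurements rather than a single average. Here is a minimal sketch (not part of the lab) along those lines. It assumes a cycles() function like the one in speed.c above and the sum1000() function being measured, compiled into separate files without their own main(); it takes 100 measurements, sorts them, and prints the minimum, quartiles, median, and maximum, so that interruptions show up as outliers instead of silently inflating an average:

#include <stdio.h>
#include <stdlib.h>

extern long long cycles(void); /* the time-base cycle counter shown above */
extern int sum1000(int *);     /* the function being measured */

int x[1000];
long long t[101];
long long d[100];

int compare(const void *a,const void *b)
{
  long long da = *(const long long *) a;
  long long db = *(const long long *) b;
  if (da < db) return -1;
  if (da > db) return 1;
  return 0;
}

int main()
{
  int i;
  for (i = 0;i < 100;++i) {
    t[i] = cycles();
    sum1000(x);
  }
  t[100] = cycles();
  for (i = 0;i < 100;++i) d[i] = t[i + 1] - t[i];
  qsort(d,100,sizeof d[0],compare);
  printf("min %lld  25%% %lld  median %lld  75%% %lld  max %lld\n",
    d[0],d[24],d[49],d[74],d[99]);
  return 0;
}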
Copying from RAM to a register is called a load. Change sum1000.c to perform this load explicitly:
int sum1000(int *x)
{
  register int result = 0;
  register int i;
  for (i = 0;i < 1000;++i) {
    register int xi = x[i];
    result += xi;
  }
  return result;
}

C doesn't guarantee that register variables will actually occupy CPU registers, but in this case the compiler turns out to be smart enough to put xi and i and result into CPU registers.
Recompile and run:
make && ./accuracy && ./speed

Again the loop takes 3 cycles per iteration:
15960 3048 3024 3000 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024 3024

Separating this one C line into two didn't make any difference in performance, but it's a first step towards understanding how the CPU is actually handling this code.
Pointer arithmetic. From the CPU's perspective, bytes in memory have addresses 0, 1, 2, 3, 4, 5, ..., and x is one of these addresses. Let's say x is address 100. The first array element x[0] consists of the 4 bytes located at addresses 100, 101, 102, 103; the second array element x[1] consists of the 4 bytes located at addresses 104, 105, 106, 107; and in general the array element x[i] consists of the 4 bytes located at addresses 100+4*i, 101+4*i, 102+4*i, 103+4*i. To figure out the address of x[i] the CPU has to compute 4*i and add that to x.
Change sum1000.c to view x the way the CPU does, as an integer address:
int sum1000(int x)
{
  register int result = 0;
  register int i;
  for (i = 0;i < 1000;++i) {
    register int xi = *(int *) (x+4*i);
    result += xi;
  }
  return result;
}

The int *x definition has changed to int x, and x[i] has changed to *(int *) (x+4*i). Here x+4*i is the address 4*i bytes after x, and *(int *) refers to the 4 bytes at this address.
Limitations on pointer arithmetic. The PowerPC instruction set doesn't actually have an instruction
register int xi = *(int *) (x+4*i);

but it does have an instruction
register int xi = *(int *) (x+i);
Change sum1000.c to keep track of 4*i instead of i:
int sum1000(int x)
{
  register int result = 0;
  register int i4;
  for (i4 = 0;i4 < 4000;i4 += 4) {
    register int xi = *(int *) (x+i4);
    result += xi;
  }
  return result;
}
Conditional branches. Change sum1000.c to use goto instead of for:
int sum1000(int x)
{
  register int result = 0;
  register int i4 = 0;
  register int xi;
  startofloop:
  xi = *(int *) (x+i4);
  result += xi;
  i4 += 4;
  if (i4 < 4000) goto startofloop;
  return result;
}
The PowerPC instruction set doesn't actually have an instruction
if (i4 < 4000) goto startofloop;

but it does have an instruction that's essentially
if (i4 >= 0) goto startofloop;

if i4 has just been modified. To take advantage of this, change sum1000.c to decrease i4 from 3996 to 0 rather than increasing it from 0 to 3996:
int sum1000(int x)
{
  register int result = 0;
  register int i4 = 3996;
  register int xi;
  startofloop:
  xi = *(int *) (x+i4);
  result += xi;
  i4 -= 4;
  if (i4 >= 0) goto startofloop;
  return result;
}
The qhasm language. The next step is to take direct control of the instructions by writing the function in qhasm, which is translated into assembly language. Create a new file sum1000.q with the following contents:
int32 x
int32 sum
int32 result
int32 i4
int32 xi

input x
output result

enter sum1000

sum = 0
i4 = 3996

startofloop:

xi = *(int32 *) (x + i4)
sum += xi
<? i4 -= 4
goto startofloop if !<

result = sum

leave

The int32 lines define 32-bit register variables. The input and output lines define function parameters and return values. The enter and leave lines define the start and end of the function. The <? i4 -= 4 first subtracts 4 from i4 and then sets a CPU flag, specifically the PowerPC's less-than flag, if the new value of i4 is smaller than 0 (as a signed 32-bit integer). The goto startofloop if !< branches back to startofloop if the flag is not set, i.e., if the new value of i4 is greater than or equal to 0. The final result = sum might seem silly, but the obvious alternative (simply saying output sum and skipping result) doesn't work: on the PowerPC, the first input and the first output are always placed in the same register, so they can't be used simultaneously, but this code uses x and sum simultaneously.
Change Makefile to use sum1000.q instead of sum1000.c:
CC=gcc -O1

default: accuracy speed

accuracy: accuracy.c sum1000.o
	$(CC) -o accuracy accuracy.c sum1000.o

speed: speed.c sum1000.o
	$(CC) -o speed speed.c sum1000.o -lrt

sum1000.o: sum1000.s
	$(CC) -c sum1000.s

sum1000.s: sum1000.q
	qhasm-ppc32-linux < sum1000.q > sum1000.s
Recompile and run:
make && ./accuracy && ./speed
Suddenly the loop takes only 2 cycles per iteration:
16272 2040 2088 2016 2040 2016 2112 2016 2040 2016 2040 2016 2040 2016 2040 2016 2040 2016 2040

This isn't great performance yet: we could have achieved the same speed by playing with the compiler options. But now we're in a position to start serious optimization.
The IBM PowerPC 750CX/CXe/CXr user manual says that "As many as two instructions can be dispatched per clock." Dispatching 4 instructions takes at least 2 cycles, so to reduce cycles/iteration below 2 we have to reduce instructions/iteration below 4.
Each iteration has 2 unavoidable instructions (the load and the addition) and 2 overhead instructions (the subtraction and the branch). Reducing the number of overhead instructions is a simple matter of unrolling:
int32 x
int32 sum
int32 result
int32 i4
int32 xi
int32 xminus4
int32 xminus8
int32 xminus12

input x
output result

enter sum1000

sum = 0
i4 = 3996
xminus4 = x - 4
xminus8 = x - 8
xminus12 = x - 12

startofloop:

xi = *(int32 *) (x + i4)
sum += xi
xi = *(int32 *) (xminus4 + i4)
sum += xi
xi = *(int32 *) (xminus8 + i4)
sum += xi
xi = *(int32 *) (xminus12 + i4)
sum += xi
<? i4 -= 16
goto startofloop if !<

result = sum

leave

This reduces the number of cycles/iteration to 1.5:
15312 1560 1584 1512 1536 1536 1608 1536 1512 1536 1536 1512 1536 1536 1536 1512 1536 1536 1536

But this is not as much improvement as one might hope. The number of instructions/iteration is now only 2.5, so if 2 instructions were dispatched per cycle then the number of cycles/iteration would be only 1.25, not 1.5. Why do the 10 instructions in the inner loop take 6 cycles instead of 5?
The PowerPC 750CXe has a single "load unit" that handles all loads, and two integer units that handle integer additions, subtractions, etc. All of these units can operate simultaneously, handling one load and two integer operations in a single cycle, but with some restrictions. All integer operations are carried out in the original order of the integer instructions: an integer instruction can be handled in the same cycle as an earlier integer instruction, but never in an earlier cycle. Similarly, all loads are carried out in the original order of the load instructions. The integer latency is 1: the output of an integer instruction cannot be used until 1 cycle later. The load latency is 2: the output of a load cannot be used until 2 cycles later.
Consider what happens if the load from x+i4 starts on cycle 1000. The load from xminus4+i4 cannot start before cycle 1001. The load from xminus8+i4 cannot start before cycle 1002. The load from xminus12+i4 cannot start before cycle 1003. The xi produced by that load cannot be added to sum before cycle 1005. The subsequent integer operation i4-=16 cannot occur before cycle 1005. The next load from x+i4 cannot start before cycle 1006. That's 6 cycles for each trip around the loop, i.e., 6 cycles for 4 array elements, i.e., the 1.5 cycles/iteration measured above.
Breaking this chain is a simple matter of moving i4-=16 up one instruction, so that it can start in an earlier cycle:
int32 x
int32 sum
int32 result
int32 i4
int32 xi
int32 xminus4
int32 xminus8
int32 xminus12

input x
output result

enter sum1000

sum = 0
i4 = 3996
xminus4 = x - 4
xminus8 = x - 8
xminus12 = x - 12

startofloop:

xi = *(int32 *) (x + i4)
sum += xi
xi = *(int32 *) (xminus4 + i4)
sum += xi
xi = *(int32 *) (xminus8 + i4)
sum += xi
xi = *(int32 *) (xminus12 + i4)
<? i4 -= 16
sum += xi
goto startofloop if !<

result = sum

leave

Each iteration now takes just 1.25 cycles:
15408 1296 1344 1272 1272 1272 1344 1296 1272 1272 1272 1296 1272 1272 1272 1296 1272 1272 1272
Further unrolling brings the cost down to just 1.125 cycles/iteration: the loop body below has 18 instructions (8 loads, 8 additions, a subtraction, and a branch) covering 8 array elements, and 18 instructions dispatched 2 per cycle means at best 9 cycles for 8 array elements.
int32 x
int32 sum
int32 result
int32 i4
int32 xi
int32 xminus4
int32 xminus8
int32 xminus12
int32 xminus16
int32 xminus20
int32 xminus24
int32 xminus28

input x
output result

enter sum1000

sum = 0
i4 = 3996
xminus4 = x - 4
xminus8 = x - 8
xminus12 = x - 12
xminus16 = x - 16
xminus20 = x - 20
xminus24 = x - 24
xminus28 = x - 28

startofloop:

xi = *(int32 *) (x + i4)
sum += xi
xi = *(int32 *) (xminus4 + i4)
sum += xi
xi = *(int32 *) (xminus8 + i4)
sum += xi
xi = *(int32 *) (xminus12 + i4)
sum += xi
xi = *(int32 *) (xminus16 + i4)
sum += xi
xi = *(int32 *) (xminus20 + i4)
sum += xi
xi = *(int32 *) (xminus24 + i4)
sum += xi
xi = *(int32 *) (xminus28 + i4)
<? i4 -= 32
sum += xi
goto startofloop if !<

result = sum

leave
Some people boldly claim that optimizing compilers have eliminated any need for assembly language. For comparison, here is a run of the original sum1000.c compiled with

gcc -O3 -mcpu=750 -fschedule-insns2 -funroll-loops

which is a fairly big step backwards in performance compared to the assembly-language code:
14496 1752 1680 1728 1848 1704 1680 1752 1584 1584 1584 1584 1608 1584 1584 1584 1584 1584 1608
Some people make an equally bold claim in the opposite direction: that portable C code, no matter how well done, can never match the performance of assembly language. Can you write portable C code that runs close to 1 cycle/iteration? The question isn't whether this is as easy as writing in assembly language; the question is whether it's possible.
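For what it's worth, here is one possible starting point for that experiment (a sketch, not a claimed answer to the question): portable C with the loop unrolled by 4 and two independent accumulators, mimicking the instruction-level overlap that the hand-scheduled assembly exploited. Whether a particular compiler turns this into something close to 1 cycle/iteration is exactly what you would have to measure:

/* Portable C: same result as sum1000.c, but unrolled by 4 with two
   independent accumulators so that the additions do not form a single
   dependency chain. */
int sum1000(int *x)
{
  int sum0 = 0;
  int sum1 = 0;
  int i;
  for (i = 0;i < 1000;i += 4) {
    sum0 += x[i];
    sum1 += x[i + 1];
    sum0 += x[i + 2];
    sum1 += x[i + 3];
  }
  return sum0 + sum1;
}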
There are many different assemblers, and all of them are capable of producing the same performance, close to 1 cycle/iteration. How easy is it to write the code using each of these tools? This lab focused on qhasm for ease of use, but perhaps there are competitors.
Once you understand the add-1000-integers example on the PowerPC 750CXe, try exploring the same problem on other CPUs. For each CPU, try to figure out a lower bound on the number of cycles needed, and then see how close you can come to the lower bound.