### **CPU-specific optimization** Example of a target CPU core: ARM Cortex-M4F core inside LM4F120H5QR microcontroller in Stellaris LM4F120 Launchpad. Example of a function that we want to optimize: adding 1000 integers mod $2^{32}$ . Reference implementation: ``` int sum(int *x) { int result = 0; int i; for (i = 0;i < 1000;++i) result += x[i]; return result; }</pre> ``` 2 ecific optimization of a target CPU core: ortex-M4F core inside 0H5QR microcontroller ris LM4F120 Launchpad. Example of a function that we want to optimize: adding 1000 integers mod $2^{32}$ . Reference implementation: ``` int sum(int *x) { int result = 0; int i; for (i = 0;i < 1000;++i) result += x[i]; return result; }</pre> ``` Counting ``` static *cons ``` = (VO • • int rest int bef Output : Change Τ <u>nization</u> et CPU core: core inside icrocontroller 20 Launchpad. ``` Example of a function that we want to optimize: adding 1000 integers mod 2^{32}. ``` Reference implementation: ``` int sum(int *x) { int result = 0; int i; for (i = 0;i < 1000;++i) result += x[i]; return result; }</pre> ``` # Counting cycles: ``` static volatile *const DWT_CYC = (void *) 0xE ``` ``` int beforesum = int result = sum int aftersum = * UARTprintf("sum result, aftersu ``` Output shows 801 Change 1000 to 50 ``` re: e oller ``` npad. ``` Example of a function that we want to optimize: adding 1000 integers mod 2^{32}. ``` Reference implementation: ``` int sum(int *x) { int result = 0; int i; for (i = 0;i < 1000;++i) result += x[i]; return result; }</pre> ``` # Counting cycles: ``` static volatile unsigned *const DWT_CYCCNT = (void *) 0xE0001004; ``` ``` int beforesum = *DWT_CYCC int result = sum(x); int aftersum = *DWT_CYCCN UARTprintf("sum %d %d\n", result,aftersum-befores ``` Output shows 8012 cycles. Change 1000 to 500: 4012. ``` 2 ``` Example of a function that we want to optimize: adding 1000 integers mod $2^{32}$ . Reference implementation: ``` int sum(int *x) { int result = 0; int i; for (i = 0;i < 1000;++i) result += x[i]; return result; }</pre> ``` ``` Counting cycles: static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004; int beforesum = *DWT_CYCCNT; int result = sum(x); int aftersum = *DWT_CYCCNT; UARTprintf("sum %d %d\n", result, aftersum-beforesum); Output shows 8012 cycles. ``` Change 1000 to 500: 4012. "Okay, 8 Um, are really th ``` of a function want to optimize: .000 integers mod 2^{32}. ce implementation: (int *x) esult = 0; i = 0; i < 1000; ++i) ult += x[i]; n result; ``` ``` Counting cycles: static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004; int beforesum = *DWT_CYCCNT; int result = sum(x); int aftersum = *DWT_CYCCNT; UARTprintf("sum %d %d\n", result, aftersum-beforesum); Output shows 8012 cycles. Change 1000 to 500: 4012. ``` ``` tion ptimize: ers mod 2^{32}. entation: 1000;++i) i]; ``` ``` Counting cycles: static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004; int beforesum = *DWT_CYCCNT; int result = sum(x); int aftersum = *DWT_CYCCNT; UARTprintf("sum %d %d\n", result, aftersum-beforesum); Output shows 8012 cycles. Change 1000 to 500: 4012. ``` "Okay, 8 cycles per Um, are microconreally this slow at ``` 2 ``` ``` Counting cycles: static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004; int beforesum = *DWT_CYCCNT; int result = sum(x); int aftersum = *DWT_CYCCNT; UARTprintf("sum %d %d\n", result, aftersum-beforesum); Output shows 8012 cycles. ``` Change 1000 to 500: 4012. "Okay, 8 cycles per addition Um, are microcontrollers really this slow at addition?" ## Counting cycles: ``` static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004; ... ``` ``` int beforesum = *DWT_CYCCNT; int result = sum(x); int aftersum = *DWT_CYCCNT; UARTprintf("sum %d %d\n", result,aftersum-beforesum); ``` Output shows 8012 cycles. Change 1000 to 500: 4012. "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" ## Counting cycles: ``` static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004; ... int beforesum = *DWT_CYCCNT; int result = sum(x); int aftersum = *DWT_CYCCNT; ``` Output shows 8012 cycles. Change 1000 to 500: 4012. UARTprintf("sum %d %d\n", result, aftersum-beforesum); "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" #### Bad approach: Apply random "optimizations" (and tweak compiler options) until you get bored/frustrated. Keep the fastest results. ``` 3 ``` ``` Counting cycles: ``` ``` static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004; int beforesum = *DWT_CYCCNT; int result = sum(x); int aftersum = *DWT_CYCCNT; UARTprintf("sum %d %d\n", result, aftersum-beforesum); Output shows 8012 cycles. Change 1000 to 500: 4012. ``` "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" Bad approach: Apply random "optimizations" (and tweak compiler options) until you get bored/frustrated. Keep the fastest results. Try -Os: 8012 cycles. ``` 3 ``` # Counting cycles: ``` static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004; int beforesum = *DWT_CYCCNT; int result = sum(x); int aftersum = *DWT_CYCCNT; UARTprintf("sum %d %d\n", result, aftersum-beforesum); Output shows 8012 cycles. Change 1000 to 500: 4012. ``` "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" #### Bad approach: Apply random "optimizations" (and tweak compiler options) until you get bored/frustrated. Keep the fastest results. Try -Os: 8012 cycles. Try -01: 8012 cycles. ``` 3 ``` ``` Counting cycles: ``` ``` static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004; int beforesum = *DWT_CYCCNT; int result = sum(x); int aftersum = *DWT_CYCCNT; UARTprintf("sum %d %d\n", result, aftersum-beforesum); Output shows 8012 cycles. Change 1000 to 500: 4012. ``` "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" #### Bad approach: Apply random "optimizations" (and tweak compiler options) until you get bored/frustrated. Keep the fastest results. Try -Os: 8012 cycles. Try -01: 8012 cycles. Try -02: 8012 cycles. ``` 3 ``` # Counting cycles: ``` static volatile unsigned int *const DWT_CYCCNT = (void *) 0xE0001004; int beforesum = *DWT_CYCCNT; int result = sum(x); int aftersum = *DWT_CYCCNT; UARTprintf("sum %d %d\n", result, aftersum-beforesum); Output shows 8012 cycles. Change 1000 to 500: 4012. ``` "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" #### Bad approach: Apply random "optimizations" (and tweak compiler options) until you get bored/frustrated. Keep the fastest results. Try -Os: 8012 cycles. Try -01: 8012 cycles. Try -02: 8012 cycles. Try -03: 8012 cycles. ``` Try mov int sum for ( retur } ``` int r int i res ``` "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" Bad approach: Apply random "optimizations" (and tweak compiler options) until you get bored/frustrated. Keep the fastest results. Try -Os: 8012 cycles. Try -01: 8012 cycles. Try -02: 8012 cycles. Try -03: 8012 cycles. ``` g cycles: t DWT\_CYCCNT ult = sum(x); id \*) 0xE0001004; volatile unsigned int oresum = \*DWT\_CYCCNT; ersum = \*DWT\_CYCCNT; t,aftersum-beforesum); ntf("sum %d %d\n", shows 8012 cycles. 1000 to 500: 4012. ``` 3 ``` ``` unsigned int CNT 0001004; *DWT_CYCCNT; (x); DWT_CYCCNT; %d %d\n", m-beforesum); 2 cycles. ``` 00: 4012. ``` "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" Bad approach: Apply random "optimizations" (and tweak compiler options) until you get bored/frustrated. Keep the fastest results. Try -Os: 8012 cycles. Try -01: 8012 cycles. Try -02: 8012 cycles. ``` Try -03: 8012 cycles. ``` Try moving the po int sum(int *x) int result = 0 int i; for (i = 0;i < result += *x return result; ``` ``` 3 ``` int NT; T; um); "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" Bad approach: Apply random "optimizations" (and tweak compiler options) until you get bored/frustrated. Keep the fastest results. Try -Os: 8012 cycles. Try -01: 8012 cycles. Try -02: 8012 cycles. Try -03: 8012 cycles. Try moving the pointer: ``` int sum(int *x) { int result = 0; int i; for (i = 0;i < 1000;++i result += *x++; return result; }</pre> ``` "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" ### Bad approach: Apply random "optimizations" (and tweak compiler options) until you get bored/frustrated. Keep the fastest results. Try -Os: 8012 cycles. Try -01: 8012 cycles. Try -02: 8012 cycles. Try -03: 8012 cycles. ``` Try moving the pointer: int sum(int *x) { int result = 0; int i; for (i = 0; i < 1000; ++i) result += *x++; return result; } ``` "Okay, 8 cycles per addition. Um, are microcontrollers really this slow at addition?" ### Bad approach: Apply random "optimizations" (and tweak compiler options) until you get bored/frustrated. Keep the fastest results. Try -Os: 8012 cycles. Try -01: 8012 cycles. Try -02: 8012 cycles. Try -03: 8012 cycles. ``` Try moving the pointer: int sum(int *x) { int result = 0; int i; for (i = 0; i < 1000; ++i) result += *x++; return result; } 8010 cycles. ``` ``` 5 3 cycles per addition. Try moving the pointer: Try cour microcontrollers int sum(int *x) int sum is slow at addition?" roach: int result = 0; indom "optimizations" int i; eak compiler options) for (i = 0; i < 1000; ++i) get bored/frustrated. result += *x++; e fastest results. return result; : 8012 cycles. 8012 cycles. 8010 cycles. 8012 cycles. : 8012 cycles. ``` int r int i for ( retur res ``` er addition. trollers addition?" timizations" ler options) d/frustrated. esults. cles. cles. cles. ``` cles. ``` Try moving the pointer: int sum(int *x) { int result = 0; int i; for (i = 0; i < 1000; ++i) result += *x++; return result; 8010 cycles. ``` ``` Try counting down int sum(int *x) { int result = 0 int i; for (i = 1000; result += *x return result; ``` ``` 4 Try moving the pointer: int sum(int *x) { int result = 0; int i; for (i = 0; i < 1000; ++i) result += *x++; return result; } 8010 cycles. ``` ``` Try counting down: int sum(int *x) { int result = 0; int i; for (i = 1000;i > 0;---i result += *x++; return result; ``` ### Try moving the pointer: ``` int sum(int *x) { int result = 0; int i; for (i = 0;i < 1000;++i) result += *x++; return result; }</pre> ``` 8010 cycles. ``` Try counting down: ``` ``` int sum(int *x) { int result = 0; int i; for (i = 1000;i > 0;--i) result += *x++; return result; } ``` } 8010 cycles. ``` Try moving the pointer: ``` ``` int sum(int *x) { int result = 0; int i; for (i = 0;i < 1000;++i) result += *x++; return result; }</pre> ``` 8010 cycles. ``` Try counting down: int sum(int *x) { int result = 0; int i; for (i = 1000;i > 0;--i) result += *x++; return result; ``` ``` 5 6 Try counting down: ing the pointer: Try usin (int *x) int sum(int *x) int sum { esult = 0; int result = 0; int r int i; int * for (i = 1000; i > 0; --i) i = 0; i < 1000; ++i) while ult += *x++; result += *x++; res n result; return result; retur 8010 cycles. cles. ``` ``` Try counting down: ointer: int sum(int *x) { int result = 0; int i; 1000;++i) for (i = 1000; i > 0; --i) result += *x++; ++; return result; 8010 cycles. ``` ``` Try using an end p int sum(int *x) int result = 0 int *y = x + 1 while (x != y) result += *x return result; ``` ``` Try counting down: int sum(int *x) { ``` for (i = 1000; i > 0; --i) result += \*x++; int result = 0; return result; int i; 8010 cycles. } ``` Try using an end pointer: int sum(int *x) { int result = 0; int *y = x + 1000; while (x != y) result += *x++; return result; ``` # Try counting down: ``` int sum(int *x) { int result = 0; int i; for (i = 1000;i > 0;--i) result += *x++; return result; } ``` 8010 cycles. ``` int sum(int *x) { int result = 0; int *y = x + 1000; while (x != y) result += *x++; return result; } ``` Try using an end pointer: ``` . ``` # Try counting down: 8010 cycles. ``` int sum(int *x) { int result = 0; int i; for (i = 1000;i > 0;--i) result += *x++; return result; } ``` ``` Try using an end pointer: int sum(int *x) { int result = 0; int *y = x + 1000; while (x != y) result += *x++; ``` return result; 8010 cycles. } ``` Try using an end pointer: Back to nting down: (int *x) int sum(int *x) int sum { esult = 0; int result = 0; int r int *y = x + 1000; int i i = 1000; i > 0; --i) while (x != y) for ( ult += *x++; result += *x++; n result; return result; retur 8010 cycles. cles. ``` res res ``` 6 i > 0; --i) ++; ``` ``` Try using an end pointer: int sum(int *x) { int result = 0; int *y = x + 1000; while (x != y) result += *x++; return result; 8010 cycles. ``` ``` Back to original. int sum(int *x) int result = 0 int i; for (i = 0;i < result += x[ result += x[ return result; ``` ``` Try using an end pointer: int sum(int *x) { int result = 0; ``` int \*y = x + 1000; result += \*x++; while (x != y) return result; 8010 cycles. } ``` int sum(int *x) { int result = 0; int i; for (i = 0; i < 1000; i + result += x[i]; result += x[i + 1]; return result; ``` Back to original. Try unrolli ``` int sum(int *x) int result = 0; int *y = x + 1000; while (x != y) result += *x++; return result; 8010 cycles. ``` ``` Back to original. Try unrolling: ``` ``` int sum(int *x) int result = 0; int i; for (i = 0;i < 1000;i += 2) { result += x[i]; result += x[i + 1]; return result; } ``` ``` int sum(int *x) int result = 0; int *y = x + 1000; while (x != y) result += *x++; return result; 8010 cycles. ``` ``` Back to original. Try unrolling: int sum(int *x) int result = 0; int i; for (i = 0; i < 1000; i += 2) { result += x[i]; result += x[i + 1]; return result; ``` 5016 cycles. } ``` g an end pointer: Back to original. Try unrolling: int sum(int *x) (int *x) { esult = 0; int result = 0; y = x + 1000; int i; for (i = 0; i < 1000; i += 2) { (x != y) ult += *x++; result += x[i]; result += x[i + 1]; n result; return result; cles. 5016 cycles. ``` ``` int sum { int r int i for ( res res res res res retur ``` ``` oointer: 000; ++; ``` ``` Back to original. Try unrolling: int sum(int *x) { int result = 0; int i; for (i = 0; i < 1000; i += 2) { result += x[i]; result += x[i + 1]; return result; 5016 cycles. ``` ``` int sum(int *x) int result = 0 int i; for (i = 0;i < result += x[ result += x[ result += x[ result += x[ result += x[ return result; ``` ``` Back to original. Try unrolling: int sum(int *x) ``` for (i = 0;i < 1000;i += 2) { int result = 0; return result; result += x[i]; result += x[i + 1]; int i; 5016 cycles. { } ``` int sum(int *x) { int result = 0; int i; for (i = 0; i < 1000; i + result += x[i]; result += x[i + 1]; result += x[i + 2]; result += x[i + 3]; result += x[i + 4]; return result; ``` int sum(int \*x) ``` Back to original. Try unrolling: ``` ``` int sum(int *x) int result = 0; int i; for (i = 0; i < 1000; i += 2) { result += x[i]; result += x[i + 1]; return result; 5016 cycles. ``` ``` int result = 0; int i; for (i = 0; i < 1000; i += 5) { result += x[i]; result += x[i + 1]; result += x[i + 2]; result += x[i + 3]; result += x[i + 4]; } return result; ``` ``` original. Try unrolling: 4016 cyc int sum(int *x) { (int *x) int result = 0; int i; esult = 0; for (i = 0; i < 1000; i += 5) { result += x[i]; i = 0; i < 1000; i += 2) { result += x[i + 1]; ult += x[i]; result += x[i + 2]; ult += x[i + 1]; result += x[i + 3]; result += x[i + 4]; n result; return result; cles. ``` ``` Try unrolling: int sum(int *x) { int result = 0; int i; for (i = 0;i < 1000;i += 5) { result += x[i]; 1000; i += 2) { result += x[i + 1]; i]; result += x[i + 2]; i + 1]; result += x[i + 3]; result += x[i + 4]; } return result; ``` 4016 cycles. Are v ``` 8 int sum(int *x) { int result = 0; int i; for (i = 0; i < 1000; i += 5) { result += x[i]; result += x[i + 1]; result += x[i + 2]; result += x[i + 3]; result += x[i + 4]; } return result; ``` ng: 4016 cycles. Are we done no ``` 9 ``` ``` int sum(int *x) int result = 0; int i; for (i = 0; i < 1000; i += 5) { result += x[i]; result += x[i + 1]; result += x[i + 2]; result += x[i + 3]; result += x[i + 4]; return result; ``` ``` int sum(int *x) int result = 0; int i; for (i = 0; i < 1000; i += 5) { result += x[i]; result += x[i + 1]; result += x[i + 2]; result += x[i + 3]; result += x[i + 4]; return result; ``` Most random "optimizations" that we tried seem useless. Can spend time trying more. Does frustration level tell us that we're close to optimal? ``` int sum(int *x) int result = 0; int i; for (i = 0; i < 1000; i += 5) { result += x[i]; result += x[i + 1]; result += x[i + 2]; result += x[i + 3]; result += x[i + 4]; return result; ``` Most random "optimizations" that we tried seem useless. Can spend time trying more. Does frustration level tell us that we're close to optimal? Good approach: Figure out lower bound for cycles spent on arithmetic etc. Understand gap between lower bound and observed time. Let's try this approach. ``` (int *x) esult = 0; i = 0; i < 1000; i += 5) { ult += x[i]; ult += x[i + 1]; ult += x[i + 2]; ult += x[i + 3]; ult += x[i + 4]; n result; ``` 10 4016 cycles. Are we done now? **Technica** Most random "optimizations" that we tried seem useless. Can spend time trying more. Does frustration level tell us that we're close to optimal? Good approach: Figure out lower bound for cycles spent on arithmetic etc. Understand gap between lower bound and observed time. Let's try this approach. Rely on M4F = 1Manual Find "A architect Points to Architec "implem which de e.g., "A First ma ADD tal i + 4]; 4016 cycles. Are we done now? Most random "optimizations" that we tried seem useless. Can spend time trying more. Does frustration level tell us that we're close to optimal? Good approach: Figure out lower bound for cycles spent on arithmetic etc. Understand gap between lower bound and observed time. Let's try this approach. Find "ARM Cortex Technical Reference Rely on Wikipedia M4F = M4 + float Manual says that "implements the A architecture profile Points to the "AR Architecture Refer which defines institute.g., "ADD" for 3 First manual says ADD takes just 1 = 5) { 4016 cycles. Are we done now? Most random "optimizations" that we tried seem useless. Can spend time trying more. Does frustration level tell us that we're close to optimal? Good approach: Figure out lower bound for cycles spent on arithmetic etc. Understand gap between lower bound and observed time. Let's try this approach. Find "ARM Cortex-M4 Proof Technical Reference Manual Rely on Wikipedia comment M4F = M4 + floating-point Manual says that Cortex-M4 "implements the ARMv7E-Narchitecture profile". Points to the "ARMv7-M Architecture Reference Man which defines instructions: e.g., "ADD" for 32-bit addit First manual says that ADD takes just 1 cycle. 10 Most random "optimizations" that we tried seem useless. Can spend time trying more. Does frustration level tell us that we're close to optimal? # Good approach: Figure out lower bound for cycles spent on arithmetic etc. Understand gap between lower bound and observed time. Let's try this approach. Find "ARM Cortex-M4 Processor Technical Reference Manual". Rely on Wikipedia comment that M4F = M4 + floating-point unit. Manual says that Cortex-M4 "implements the ARMv7E-M architecture profile". Points to the "ARMv7-M Architecture Reference Manual", which defines instructions: e.g., "ADD" for 32-bit addition. First manual says that ADD takes just 1 cycle. ndom "optimizations" tried seem useless. nd time trying more. stration level tell us re close to optimal? proach: ut lower bound for pent on arithmetic etc. and gap between und and observed time. this approach. Find "ARM Cortex-M4 Processor Technical Reference Manual". Rely on Wikipedia comment that M4F = M4 + floating-point unit. Manual says that Cortex-M4 "implements the ARMv7E-M architecture profile". Points to the "ARMv7-M Architecture Reference Manual", which defines instructions: e.g., "ADD" for 32-bit addition. First manual says that ADD takes just 1 cycle. "integer has 16 in special-pand" pro Each ele be "load Basic loa Manual a note a Then mo instructi address then it s timizations" useless. ying more. evel tell us optimal? ithmetic etc. etween observed time. oach. Find "ARM Cortex-M4 Processor Technical Reference Manual". Rely on Wikipedia comment that M4F = M4 + floating-point unit. Manual says that Cortex-M4 "implements the ARMv7E-M architecture profile". Points to the "ARMv7-M Architecture Reference Manual", which defines instructions: e.g., "ADD" for 32-bit addition. First manual says that ADD takes just 1 cycle. Inputs and output "integer registers" has 16 integer reg special-purpose "s and "program cou Each element of x be "loaded" into a Basic load instruct Manual says 2 cyc a note about "pipe Then more explan instruction is also address not based then it saves 1 cyc 11 ow? . tc. me. Find "ARM Cortex-M4 Processor Technical Reference Manual". Rely on Wikipedia comment that M4F = M4 + floating-point unit. Manual says that Cortex-M4 "implements the ARMv7E-M architecture profile". Points to the "ARMv7-M Architecture Reference Manual", which defines instructions: e.g., "ADD" for 32-bit addition. First manual says that ADD takes just 1 cycle. Inputs and output of ADD a "integer registers". ARMv7-has 16 integer registers, incl special-purpose "stack point and "program counter". Each element of x array nee be "loaded" into a register. Basic load instruction: LDR Manual says 2 cycles but ad a note about "pipelining". Then more explanation: if n instruction is also LDR (with address not based on first L then it saves 1 cycle. Find "ARM Cortex-M4 Processor Technical Reference Manual". Rely on Wikipedia comment that M4F = M4 + floating-point unit. Manual says that Cortex-M4 "implements the ARMv7E-M architecture profile". Points to the "ARMv7-M Architecture Reference Manual", which defines instructions: e.g., "ADD" for 32-bit addition. First manual says that ADD takes just 1 cycle. Inputs and output of ADD are "integer registers". ARMv7-M has 16 integer registers, including special-purpose "stack pointer" and "program counter". Each element of x array needs to be "loaded" into a register. Basic load instruction: LDR. Manual says 2 cycles but adds a note about "pipelining". Then more explanation: if next instruction is also LDR (with address not based on first LDR) then it saves 1 cycle. RM Cortex-M4 Processor al Reference Manual". Wikipedia comment that M4 + floating-point unit. says that Cortex-M4 ents the ARMv7E-M ture profile". the "ARMv7-M ture Reference Manual", efines instructions: DD" for 32-bit addition. nual says that kes just 1 cycle. Inputs and output of ADD are "integer registers". ARMv7-M has 16 integer registers, including special-purpose "stack pointer" and "program counter". Each element of x array needs to be "loaded" into a register. Basic load instruction: LDR. Manual says 2 cycles but adds a note about "pipelining". Then more explanation: if next instruction is also LDR (with address not based on first LDR) then it saves 1 cycle. n consectation takes on ("more in pelinear Can ach in other but noth 2n + 1 on cycles Lower b Why obsomer non-constant costs of x-M4 Processor ce Manual". comment that ting-point unit. Cortex-M4 ARMv7E-M e". Mv7-M rence Manual", ructions: 2-bit addition. that cycle. Inputs and output of ADD are "integer registers". ARMv7-M has 16 integer registers, including special-purpose "stack pointer" and "program counter". Each element of x array needs to be "loaded" into a register. Basic load instruction: LDR. Manual says 2 cycles but adds a note about "pipelining". Then more explanation: if next instruction is also LDR (with address not based on first LDR) then it saves 1 cycle. n consecutive LDF takes only n+1 c ("more multiple L pipelined together" Can achieve this s in other ways (LD but nothing seems Lower bound for n2n + 1 cycles, incles n cycles of arithmeters Why observed tim non-consecutive L costs of manipulat cessor that unit. / ual", ion. Inputs and output of ADD are "integer registers". ARMv7-M has 16 integer registers, including special-purpose "stack pointer" and "program counter". Each element of x array needs to be "loaded" into a register. Basic load instruction: LDR. Manual says 2 cycles but adds a note about "pipelining". Then more explanation: if next instruction is also LDR (with address not based on first LDR) then it saves 1 cycle. n consecutive LDRs takes only n+1 cycles ("more multiple LDRs can be pipelined together"). Can achieve this speed in other ways (LDRD, LDM but nothing seems faster. Lower bound for n LDR + n2n + 1 cycles, including n cycles of arithmetic. Why observed time is higher non-consecutive LDRs; costs of manipulating i. Inputs and output of ADD are "integer registers". ARMv7-M has 16 integer registers, including special-purpose "stack pointer" and "program counter". Each element of x array needs to be "loaded" into a register. Basic load instruction: LDR. Manual says 2 cycles but adds a note about "pipelining". Then more explanation: if next instruction is also LDR (with address not based on first LDR) then it saves 1 cycle. n consecutive LDRs takes only n + 1 cycles ("more multiple LDRs can be pipelined together"). Can achieve this speed in other ways (LDRD, LDM) but nothing seems faster. Lower bound for n LDR + n ADD: 2n + 1 cycles, including n cycles of arithmetic. Why observed time is higher: non-consecutive LDRs; costs of manipulating i. 12 nd output of ADD are registers". ARMv7-M nteger registers, including ourpose "stack pointer" ogram counter". ment of x array needs to led" into a register. ad instruction: LDR. says 2 cycles but adds bout "pipelining". ore explanation: if next on is also LDR (with not based on first LDR) aves 1 cycle. *n* consecutive LDRs takes only n+1 cycles ("more multiple LDRs can be pipelined together"). Can achieve this speed in other ways (LDRD, LDM) but nothing seems faster. Lower bound for n LDR + n ADD: 2n + 1 cycles, including n cycles of arithmetic. Why observed time is higher: non-consecutive LDRs; costs of manipulating i. y = x +p = x 2281 cyc result : loop: xi9 = xi8 = xi7 = xi6 = xi5 = xi4 = xi3 = xi2 = of ADD are . ARMv7-M isters, including tack pointer" nter". array needs to register. tion: LDR. Teles but adds elining". ation: if next LDR (with on first LDR) n consecutive LDRs takes only n+1 cycles ("more multiple LDRs can be pipelined together"). Can achieve this speed in other ways (LDRD, LDM) but nothing seems faster. Lower bound for n LDR + n ADD: 2n + 1 cycles, including n cycles of arithmetic. Why observed time is higher: non-consecutive LDRs; costs of manipulating i. 2281 cycles using $$y = x + 4000$$ $p = x$ $result = 0$ loop: xi3 = \*(uint32) xi2 = \*(uint32) 13 are M uding er" ds to . ds ext า DR) n consecutive LDRs takes only n+1 cycles ("more multiple LDRs can be pipelined together"). Can achieve this speed in other ways (LDRD, LDM) but nothing seems faster. Lower bound for n LDR + n ADD: 2n + 1 cycles, including n cycles of arithmetic. Why observed time is higher: non-consecutive LDRs; costs of manipulating i. 2281 cycles using ldr.w: $$y = x + 4000$$ $p = x$ $result = 0$ loop: n consecutive LDRs takes only n+1 cycles ("more multiple LDRs can be pipelined together"). Can achieve this speed in other ways (LDRD, LDM) but nothing seems faster. Lower bound for n LDR + n ADD: 2n + 1 cycles, including n cycles of arithmetic. Why observed time is higher: non-consecutive LDRs; costs of manipulating i. 2281 cycles using ldr.w: $$y = x + 4000$$ $p = x$ $result = 0$ loop: ``` Sutive LDRs by n+1 cycles multiple LDRs can be done together"). ``` ieve this speed ways (LDRD, LDM) ing seems faster. ound for n LDR + n ADD: cycles, including of arithmetic. served time is higher: secutive LDRs; manipulating i. 2281 cycles using ldr.w: $$y = x + 4000$$ $p = x$ $result = 0$ loop: xi1 = xi0 = result resul<sup>r</sup> resul resul resul resul resul resul xi9 = xi8 = xi7 = ``` Rs ycles DRs can be ``` peed RD, LDM) faster. $$n LDR + n ADD$$ : uding uaing etic. e is higher: DRs; ing i. 2281 cycles using ldr.w: $$y = x + 4000$$ $p = x$ $result = 0$ loop: ``` xi9 = *(uint32 *) (p + 76) xi8 = *(uint32 *) (p + 72) xi7 = *(uint32 *) (p + 68) xi6 = *(uint32 *) (p + 64) xi5 = *(uint32 *) (p + 60) xi4 = *(uint32 *) (p + 56) xi3 = *(uint32 *) (p + 52) xi2 = *(uint32 *) (p + 48) ``` ``` xi1 = *(uint32) xi0 = *(uint32) result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = *(uint32) xi8 = *(uint32) xi7 = *(uint32) ``` # 2281 cycles using ldr.w: $$y = x + 4000$$ $p = x$ result = 0 #### loop: ADD: • xi9 = \*(uint32 \*) (p + 76) ``` xi1 = *(uint32 *) (p + xi0 = *(uint32 *) (p + result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = *(uint32 *) (p + xi8 = *(uint32 *) (p + xi7 = *(uint32 *) (p + ``` # 2281 cycles using ldr.w: $$y = x + 4000$$ $p = x$ $result = 0$ ### loop: ``` xi1 = *(uint32 *) (p + 44) xi0 = *(uint32 *) (p + 40) result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = *(uint32 *) (p + 36) xi8 = *(uint32 *) (p + 32) xi7 = *(uint32 *) (p + 28) ``` cles using ldr.w: 4000 = 0 resul<sup>r</sup> xi6 = xi5 = xi4 = xi3 = xi2 = xi1 = xi0 = resul resul resul resul resul resul xi1 = \*(uint32 \*) (p + 44)xi0 = \*(uint32 \*) (p + 40) result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = \*(uint32 \*) (p + 36) xi8 = \*(uint32 \*) (p + 32) xi7 = \*(uint32 \*) (p + 28) \*(uint32 \*) (p + 76) \*(uint32 \*) (p + 72) \*(uint32 \*) (p + 68) \*(uint32 \*) (p + 64) \*(uint32 \*) (p + 60) \*(uint32 \*) (p + 56) \*(uint32 \*) (p + 52) \*(uint32 \*) (p + 48) xi6 = \*(uint32) xi5 = \*(uint32) xi4 = \*(uint32) xi3 = \*(uint32) xi2 = \*(uint32) xi1 = \*(uint32) xi0 = \*(uint32) result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 ldr.w: \*) (p + 76) \*) (p + 72) \*) (p + 68) \*) (p + 64) \*) (p + 60) \*) (p + 56) \*) (p + 52) \*) (p + 48) 76) 72) 68) 64) 60) 56) 52) 48) ``` xi1 = *(uint32 *) (p + 44) xi0 = *(uint32 *) (p + 40) result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = *(uint32 *) (p + 36) xi8 = *(uint32 *) (p + 32) xi7 = *(uint32 *) (p + 28) ``` xi6 = \*(uint32 \*) (p +xi5 = \*(uint32 \*) (p +xi4 = \*(uint32 \*) (p +xi3 = \*(uint32 \*) (p +xi2 = \*(uint32 \*) (p +xi1 = \*(uint32 \*) (p +xi0 = \*(uint32 \*) p; presult += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 xi1 = \*(uint32 \*) (p + 44) xi0 = \*(uint32 \*) (p + 40) result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = \*(uint32 \*) (p + 36) xi8 = \*(uint32 \*) (p + 32) xi7 = \*(uint32 \*) (p + 28) xi6 = \*(uint32 \*) (p + 24) xi5 = \*(uint32 \*) (p + 20) xi4 = \*(uint32 \*) (p + 16) xi3 = \*(uint32 \*) (p + 12) xi2 = \*(uint32 \*) (p + 8) xi1 = \*(uint32 \*) (p + 4) xi0 = \*(uint32 \*) p; p += 160 result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 resul resul xi9 = xi8 = xi7 = xi6 = xi5 = xi4 = xi3 = xi2 = xi1 = xi0 = resul resul resul ``` *(uint32 *) (p + 44) xi6 = *(uint32 *) (p + 24) *(uint32 *) (p + 40) xi5 = *(uint32 *) (p + 20) xi4 = *(uint32 *) (p + 16) t += xi9 xi3 = *(uint32 *) (p + 12) t += xi8 xi2 = *(uint32 *) (p + 8) t += xi7 xi1 = *(uint32 *) (p + 4) t += xi6 xi0 = *(uint32 *) p; p += 160 t += xi5 t += xi4 result += xi9 t += xi3 result += xi8 t += xi2 result += xi7 t += xi1 result += xi6 t += xi0 result += xi5 result += xi4 *(uint32 *) (p + 36) *(uint32 *) (p + 32) result += xi3 *(uint32 *) (p + 28) result += xi2 ``` - \*) (p + 44) - \*) (p + 40) \*) (p + 36) \*) (p + 32) - xi6 = \*(uint32 \*) (p + 24) - xi5 = \*(uint32 \*) (p + 20) - xi4 = \*(uint32 \*) (p + 16) - xi3 = \*(uint32 \*) (p + 12) - xi2 = \*(uint32 \*) (p + 8) - xi1 = \*(uint32 \*) (p + 4) - xi0 = \*(uint32 \*) p; p += 160 result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 - result += xi4 - result += xi3 - \*) (p + 28) result += xi2 result += xi0 result += xi1 xi9 = \*(uint32) xi8 = \*(uint32) xi7 = \*(uint32) xi6 = \*(uint32) xi5 = \*(uint32) xi4 = \*(uint32) xi3 = \*(uint32) xi2 = \*(uint32) xi1 = \*(uint32) xi0 = \*(uint32) result += xi9 result += xi8 result += xi7 44) 40) 36) 32) 28) xi4 = \*(uint32 \*) (p + 16)xi3 = \*(uint32 \*) (p + 12)xi2 = \*(uint32 \*) (p + 8)xi1 = \*(uint32 \*) (p + 4)xi0 = \*(uint32 \*) p; p += 160result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi0 xi9 = \*(uint32 \*) (p xi8 = \*(uint32 \*) (p xi7 = \*(uint32 \*) (p xi6 = \*(uint32 \*) (p xi5 = \*(uint32 \*) (p xi4 = \*(uint32 \*) (p xi3 = \*(uint32 \*) (p xi2 = \*(uint32 \*) (p xi1 = \*(uint32 \*) (p xi0 = \*(uint32 \*) (p result += xi9 result += xi8 result += xi7 result += xi1 ``` xi6 = *(uint32 *) (p + 24) xi5 = *(uint32 *) (p + 20) xi4 = *(uint32 *) (p + 16) xi3 = *(uint32 *) (p + 12) xi2 = *(uint32 *) (p + 8) xi1 = *(uint32 *) (p + 4) xi0 = *(uint32 *) p; p += 160 result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 ``` ``` result += xi1 result += xi0 xi9 = *(uint32 *) (p - 4) xi8 = *(uint32 *) (p - 8) xi7 = *(uint32 *) (p - 12) xi6 = *(uint32 *) (p - 16) xi5 = *(uint32 *) (p - 20) xi4 = *(uint32 *) (p - 24) xi3 = *(uint32 *) (p - 28) xi2 = *(uint32 *) (p - 32) xi1 = *(uint32 *) (p - 36) xi0 = *(uint32 *) (p - 40) result += xi9 result += xi8 result += xi7 ``` resul resul resul resul resul resul resul xi9 = xi8 = xi7 = xi6 = xi5 = xi4 = xi3 = xi2 = 16 | result += xi1 | | | |-------------------|----------|--| | result += xi0 | | | | xi9 = *(uint32 *) | (p - 4) | | | xi8 = *(uint32 *) | (p - 8) | | | xi7 = *(uint32 *) | (p - 12) | | | xi6 = *(uint32 *) | (p - 16) | | | xi5 = *(uint32 *) | (p - 20) | | | xi4 = *(uint32 *) | (p - 24) | | | xi3 = *(uint32 *) | (p - 28) | | | xi2 = *(uint32 *) | (p - 32) | | | xi1 = *(uint32 *) | (p - 36) | | | xi0 = *(uint32 *) | (p - 40) | | | result += xi9 | | | | result += xi8 | | | | result += xi7 | | | | | | | ``` *) (p + 24) ``` $$*) (p + 20)$$ $$*) (p + 16)$$ $$*) (p + 12)$$ $$*) (p + 8)$$ $$*) (p + 4)$$ $$*)$$ p; p += 160 result += xi1 result += xi0 $$xi9 = *(uint32 *) (p - 4)$$ $$xi8 = *(uint32 *) (p - 8)$$ $$xi7 = *(uint32 *) (p - 12)$$ $$xi6 = *(uint32 *) (p - 16)$$ $$xi5 = *(uint32 *) (p - 20)$$ $$xi4 = *(uint32 *) (p - 24)$$ $$xi3 = *(uint32 *) (p - 28)$$ $$xi2 = *(uint32 *) (p - 32)$$ $$xi1 = *(uint32 *) (p - 36)$$ $$xi0 = *(uint32 *) (p - 40)$$ result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = \*(uint32) xi8 = \*(uint32) xi7 = \*(uint32) xi6 = \*(uint32) xi5 = \*(uint32) xi4 = \*(uint32) xi3 = \*(uint32) xi2 = \*(uint32) ``` 16 17 24) result += xi1 result += xi6 20) result += xi0 result += xi5 xi9 = *(uint32 *) (p - 4) 16) result += xi4 xi8 = *(uint32 *) (p - 8) 12) result += xi3 8) xi7 = *(uint32 *) (p - 12) result += xi2 xi6 = *(uint32 *) (p - 16) result += xi1 xi5 = *(uint32 *) (p - 20) += 160 result += xi0 xi9 = *(uint32 *) (p - xi4 = *(uint32 *) (p - 24) xi3 = *(uint32 *) (p - 28) xi8 = *(uint32 *) (p - xi2 = *(uint32 *) (p - 32) xi7 = *(uint32 *) (p - xi1 = *(uint32 *) (p - 36) xi6 = *(uint32 *) (p - xi0 = *(uint32 *) (p - 40) xi5 = *(uint32 *) (p - result += xi9 xi4 = *(uint32 *) (p - xi3 = *(uint32 *) (p - result += xi8 xi2 = *(uint32 *) (p - result += xi7 ``` ``` result += xi1 result += xi0 xi9 = *(uint32 *) (p - 4) xi8 = *(uint32 *) (p - 8) xi7 = *(uint32 *) (p - 12) xi6 = *(uint32 *) (p - 16) xi5 = *(uint32 *) (p - 20) xi4 = *(uint32 *) (p - 24) xi3 = *(uint32 *) (p - 28) xi2 = *(uint32 *) (p - 32) xi1 = *(uint32 *) (p - 36) xi0 = *(uint32 *) (p - 40) result += xi9 result += xi8 ``` result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = \*(uint32 \*) (p - 44)xi8 = \*(uint32 \*) (p - 48)xi7 = \*(uint32 \*) (p - 52)xi6 = \*(uint32 \*) (p - 56)xi5 = \*(uint32 \*) (p - 60)xi4 = \*(uint32 \*) (p - 64)xi3 = \*(uint32 \*) (p - 68)xi2 = \*(uint32 \*) (p - 72) t += xi1 t += xi0 t += xi9 t += xi8 t += xi7 \*(uint32 \*) (p - 4) \*(uint32 \*) (p - 8) \*(uint32 \*) (p - 12) \*(uint32 \*) (p - 16) \*(uint32 \*) (p - 20) \*(uint32 \*) (p - 24) \*(uint32 \*) (p - 28) \*(uint32 \*) (p - 32) \*(uint32 \*) (p - 36) \*(uint32 \*) (p - 40) result += xi6 ``` result += xi5 *) (p - 4) result += xi4 *) (p - 8) result += xi3 *) (p - 12) result += xi2 *) (p - 16) result += xi1 *) (p - 20) result += xi0 *) (p - 24) xi9 = *(uint32 *) (p - 44) *) (p - 28) xi8 = *(uint32 *) (p - 48) *) (p - 32) xi7 = *(uint32 *) (p - 52) xi6 = *(uint32 *) (p - 56) *) (p - 36) *) (p - 40) xi5 = *(uint32 *) (p - 60) xi4 = *(uint32 *) (p - 64) xi3 = *(uint32 *) (p - 68) xi2 = *(uint32 *) (p - 72) ``` ``` xi1 = *(uint32) xi0 = *(uint32) result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 ``` goto loop if != | | result += xi6 | |-----|----------------------------| | | result += xi5 | | 4) | result += xi4 | | 8) | result += xi3 | | 12) | result += xi2 | | 16) | result += xi1 | | 20) | result += xi0 | | 24) | xi9 = *(uint32 *) (p - 44) | | 28) | xi8 = *(uint32 *) (p - 48) | | 32) | xi7 = *(uint32 *) (p - 52) | | 36) | xi6 = *(uint32 *) (p - 56) | | 40) | xi5 = *(uint32 *) (p - 60) | | | xi4 = *(uint32 *) (p - 64) | | | xi3 = *(uint32 *) (p - 68) | | | xi2 = *(uint32 *) (p - 72) | | | | ``` xi1 = *(uint32 *) (p - xi0 = *(uint32 *) (p - result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 =? p - y goto loop if != ``` result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 xi9 = \*(uint32 \*) (p - 44) xi8 = \*(uint32 \*) (p - 48) xi7 = \*(uint32 \*) (p - 52) xi6 = \*(uint32 \*) (p - 56) xi5 = \*(uint32 \*) (p - 60) xi4 = \*(uint32 \*) (p - 64) xi3 = \*(uint32 \*) (p - 68) xi2 = \*(uint32 \*) (p - 72) xi1 = \*(uint32 \*) (p - 76) xi0 = \*(uint32 \*) (p - 80) result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 =? p - y goto loop if != ``` t += xi6 t += xi5 t += xi4 t += xi3 t += xi2 t += xi1 t += xi0 *(uint32 *) (p - 44) *(uint32 *) (p - 48) *(uint32 *) (p - 52) *(uint32 *) (p - 56) *(uint32 *) (p - 60) *(uint32 *) (p - 64) *(uint32 *) (p - 68) *(uint32 *) (p - 72) ``` ``` xi1 = *(uint32 *) (p - 76) xi0 = *(uint32 *) (p - 80) result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 =? p - y goto loop if != ``` Wikiped even per optimizing performa \*) (p - 44) \*) (p - 48) \*) (p - 52) \*) (p - 56) \*) (p - 60) \*) (p - 64) \*) (p - 68) \*) (p - 72) xi1 = \*(uint32 \*) (p - 76)xi0 = \*(uint32 \*) (p - 80)result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 =? p - y goto loop if != Wikipedia: "By the even performance optimizing compiled performance of humans and the even humans and humans and humans and humans are humans and humans and humans and humans are humans and humans and humans and humans are humans and humans and humans are humans and humans and humans are humans and humans and humans are humans and humans and humans are humans and humans and humans are humans and humans are 44) 48) 52) 56) 60) 64) 68) 72) xi1 = \*(uint32 \*) (p - 76)xi0 = \*(uint32 \*) (p - 80)result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 =? p - y goto loop if != Wikipedia: "By the late 199 even performance sensitive of optimizing compilers exceed performance of human expe ``` xi1 = *(uint32 *) (p - 76) xi0 = *(uint32 *) (p - 80) result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 ``` Wikipedia: "By the late 1990s for even performance sensitive code, optimizing compilers exceeded the performance of human experts." xi1 = \*(uint32 \*) (p - 76)xi0 = \*(uint32 \*) (p - 80)result += xi9 result += xi8 result += xi7 result += xi6 result += xi5 result += xi4 result += xi3 result += xi2 result += xi1 result += xi0 =? p - y goto loop if != Wikipedia: "By the late 1990s for even performance sensitive code, optimizing compilers exceeded the performance of human experts." Reality: The fastest software today relies on human experts understanding the CPU. Cannot trust compiler to optimize instruction selection. Cannot trust compiler to optimize instruction scheduling. Cannot trust compiler to optimize register allocation. \*(uint32 \*) (p - 76) \*(uint32 \*) (p - 80) t += xi9 t += xi8 t += xi7 t += xi6 t += xi5 t += xi4 t += xi3 t += xi2 t += xi1 t += xi0 =? p - y op if != Wikipedia: "By the late 1990s for even performance sensitive code, optimizing compilers exceeded the performance of human experts." Reality: The fastest software today relies on human experts understanding the CPU. Cannot trust compiler to optimize instruction selection. Cannot trust compiler to optimize instruction scheduling. Cannot trust compiler to optimize register allocation. The big CPUs ar farther a from nai \*) (p - 76) \*) (p - 80) Wikipedia: "By the late 1990s for even performance sensitive code, optimizing compilers exceeded the performance of human experts." Reality: The fastest software today relies on human experts understanding the CPU. Cannot trust compiler to optimize instruction selection. Cannot trust compiler to optimize instruction scheduling. Cannot trust compiler to optimize register allocation. The big picture CPUs are evolving farther and farther from naive models р - у 19 76) 80) Wikipedia: "By the late 1990s for even performance sensitive code, optimizing compilers exceeded the performance of human experts." Reality: The fastest software today relies on human experts understanding the CPU. Cannot trust compiler to optimize instruction selection. Cannot trust compiler to optimize instruction scheduling. Cannot trust compiler to optimize register allocation. #### The big picture 20 CPUs are evolving farther and farther away from naive models of CPUs. Wikipedia: "By the late 1990s for even performance sensitive code, optimizing compilers exceeded the performance of human experts." Reality: The fastest software today relies on human experts understanding the CPU. Cannot trust compiler to optimize instruction selection. Cannot trust compiler to optimize instruction scheduling. Cannot trust compiler to optimize register allocation. ### The big picture CPUs are evolving farther and farther away from naive models of CPUs. Wikipedia: "By the late 1990s for even performance sensitive code, optimizing compilers exceeded the performance of human experts." Reality: The fastest software today relies on human experts understanding the CPU. Cannot trust compiler to optimize instruction selection. Cannot trust compiler to optimize instruction scheduling. Cannot trust compiler to optimize register allocation. ### The big picture CPUs are evolving farther and farther away from naive models of CPUs. Minor optimization challenges: - Pipelining. - Superscalar processing. Major optimization challenges: - Vectorization. - Many threads; many cores. - The memory hierarchy; the ring; the mesh. - Larger-scale parallelism. - Larger-scale networking. The fastest software lies on human experts anding the CPU. trust compiler to instruction selection. trust compiler to instruction scheduling. trust compiler to register allocation. ### The big picture CPUs are evolving farther and farther away from naive models of CPUs. Minor optimization challenges: - Pipelining. - Superscalar processing. Major optimization challenges: - Vectorization. - Many threads; many cores. - The memory hierarchy; the ring; the mesh. - Larger-scale parallelism. - Larger-scale networking. $h_0$ Gates $\pi$ product of integer st software man experts CPU. oiler to on selection. oiler to on scheduling. oiler to Illocation. ## The big picture CPUs are evolving farther and farther away from naive models of CPUs. Minor optimization challenges: - Pipelining. - Superscalar processing. Major optimization challenges: - Vectorization. - Many threads; many cores. - The memory hierarchy; the ring; the mesh. - Larger-scale parallelism. - Larger-scale networking. ### CPU design in a n Gates $\pi: a, b \mapsto 1$ product $h_0 + 2h_1$ of integers $f_0 + 2f_1$ ## The big picture CPUs are evolving farther and farther away from naive models of CPUs. Minor optimization challenges: - Pipelining. - Superscalar processing. Major optimization challenges: - Vectorization. - Many threads; many cores. - The memory hierarchy; the ring; the mesh. - Larger-scale parallelism. - Larger-scale networking. ### CPU design in a nutshell Gates $\pi$ : $a, b \mapsto 1 - ab$ comproduct $h_0 + 2h_1 + 4h_2 + 8h_1$ of integers $f_0 + 2f_1$ , $g_0 + 2g_2$ #### The big picture CPUs are evolving farther and farther away from naive models of CPUs. #### Minor optimization challenges: - Pipelining. - Superscalar processing. ### Major optimization challenges: - Vectorization. - Many threads; many cores. - The memory hierarchy; the ring; the mesh. - Larger-scale parallelism. - Larger-scale networking. #### CPU design in a nutshell Gates $\pi$ : $a, b \mapsto 1 - ab$ computing product $h_0 + 2h_1 + 4h_2 + 8h_3$ of integers $f_0 + 2f_1$ , $g_0 + 2g_1$ . ### picture e evolving and farther away ve models of CPUs. otimization challenges: ning. scalar processing. otimization challenges: rization. threads; many cores. nemory hierarchy; g; the mesh. -scale parallelism. -scale networking. ### CPU design in a nutshell Gates $\pi$ : $a, b \mapsto 1 - ab$ computing product $h_0 + 2h_1 + 4h_2 + 8h_3$ of integers $f_0 + 2f_1$ , $g_0 + 2g_1$ . Electricity percolate If $f_0$ , $f_1$ , at then $h_0$ , a few m $f_0$ $g_0$ $h_1$ $h_0$ $h_3$ $h_2$ Gates $\pi: a, b \mapsto 1 - ab$ computing product $h_0 + 2h_1 + 4h_2 + 8h_3$ of integers $f_0 + 2f_1, g_0 + 2g_1$ . away of CPUs. n challenges: essing. n challenges: nany cores. rarchy; sh. allelism. working. Electricity takes ti percolate through If $f_0$ , $f_1$ , $g_0$ , $g_1$ are then $h_0$ , $h_1$ , $h_2$ , $h_3$ a few moments lat $f_1$ $f_0$ go $\overline{\wedge}$ $h_0$ $h_1$ $h_2$ *h*<sub>3</sub> Gates $\pi$ : $a, b \mapsto 1 - ab$ computing product $h_0 + 2h_1 + 4h_2 + 8h_3$ of integers $f_0 + 2f_1$ , $g_0 + 2g_1$ . Electricity takes time to percolate through wires and If $f_0$ , $f_1$ , $g_0$ , $g_1$ are stable then $h_0$ , $h_1$ , $h_2$ , $h_3$ are stable a few moments later. es: es: . . Gates $\pi$ : $a, b \mapsto 1 - ab$ computing product $h_0 + 2h_1 + 4h_2 + 8h_3$ of integers $f_0 + 2f_1$ , $g_0 + 2g_1$ . Electricity takes time to percolate through wires and gates. If $f_0$ , $f_1$ , $g_0$ , $g_1$ are stable then $h_0$ , $h_1$ , $h_2$ , $h_3$ are stable a few moments later. Gates $\pi$ : $a, b \mapsto 1 - ab$ computing product $h_0 + 2h_1 + 4h_2 + 8h_3$ of integers $f_0 + 2f_1$ , $g_0 + 2g_1$ . Electricity takes time to percolate through wires and gates. If $f_0$ , $f_1$ , $g_0$ , $g_1$ are stable then $h_0$ , $h_1$ , $h_2$ , $h_3$ are stable a few moments later. Build circuit with more gates to multiply (e.g.) 32-bit integers: (Details omitted.) ### sign in a nutshell : $a, b \mapsto 1 - ab$ computing $h_0 + 2h_1 + 4h_2 + 8h_3$ ers $f_0 + 2f_1, g_0 + 2g_1$ . Electricity takes time to percolate through wires and gates. If $f_0$ , $f_1$ , $g_0$ , $g_1$ are stable then $h_0$ , $h_1$ , $h_2$ , $h_3$ are stable a few moments later. Build circuit with more gates to multiply (e.g.) 32-bit integers: (Details omitted.) Build cir 32-bit in given 4and 32-k reg #### utshell - ab computing $+ 4h_2 + 8h_3$ $1, g_0 + 2g_1$ . Electricity takes time to percolate through wires and gates. If $f_0$ , $f_1$ , $g_0$ , $g_1$ are stable then $h_0$ , $h_1$ , $h_2$ , $h_3$ are stable a few moments later. Build circuit with more gates to multiply (e.g.) 32-bit integers: (Details omitted.) Build circuit to co 32-bit integer $r_i$ given 4-bit integer and 32-bit integers # register read Electricity takes time to percolate through wires and gates. If $f_0$ , $f_1$ , $g_0$ , $g_1$ are stable then $h_0$ , $h_1$ , $h_2$ , $h_3$ are stable a few moments later. Build circuit with more gates to multiply (e.g.) 32-bit integers: (Details omitted.) Build circuit to compute 32-bit integer $r_i$ given 4-bit integer i and 32-bit integers $r_0, r_1, \ldots$ # register read puting h<sub>3</sub> | • Electricity takes time to percolate through wires and gates. If $f_0$ , $f_1$ , $g_0$ , $g_1$ are stable then $h_0$ , $h_1$ , $h_2$ , $h_3$ are stable a few moments later. Build circuit with more gates to multiply (e.g.) 32-bit integers: (Details omitted.) Build circuit to compute 32-bit integer $r_i$ given 4-bit integer i and 32-bit integers $r_0, r_1, \ldots, r_{15}$ : register read 23 Electricity takes time to percolate through wires and gates. If $f_0$ , $f_1$ , $g_0$ , $g_1$ are stable then $h_0$ , $h_1$ , $h_2$ , $h_3$ are stable a few moments later. Build circuit with more gates to multiply (e.g.) 32-bit integers: (Details omitted.) Build circuit to compute 32-bit integer $r_i$ given 4-bit integer i and 32-bit integers $r_0, r_1, \ldots, r_{15}$ : ## register read Build circuit for "register write": $r_0, \ldots, r_{15}, s, i \mapsto r'_0, \ldots, r'_{15}$ where $r'_i = r_j$ except $r'_i = s$ . 23 Electricity takes time to percolate through wires and gates. If $f_0$ , $f_1$ , $g_0$ , $g_1$ are stable then $h_0$ , $h_1$ , $h_2$ , $h_3$ are stable a few moments later. Build circuit with more gates to multiply (e.g.) 32-bit integers: (Details omitted.) Build circuit to compute 32-bit integer $r_i$ given 4-bit integer i and 32-bit integers $r_0, r_1, \ldots, r_{15}$ : ## register read Build circuit for "register write": $r_0, \ldots, r_{15}, s, i \mapsto r'_0, \ldots, r'_{15}$ where $r'_j = r_j$ except $r'_i = s$ . Build circuit for addition. Etc. ty takes time to e through wires and gates. $g_0$ , $g_1$ are stable $h_1$ , $h_2$ , $h_3$ are stable oments later. rcuit with more gates ply (e.g.) 32-bit integers: omitted.) Build circuit to compute 32-bit integer $r_i$ given 4-bit integer i and 32-bit integers $r_0, r_1, \ldots, r_{15}$ : # register read Build circuit for "register write": $$r_0, ..., r_{15}, s, i \mapsto r'_0, ..., r'_{15}$$ where $r'_i = r_j$ except $r'_i = s$ . Build circuit for addition. Etc. $r_0, \ldots, r_\ell$ where $r'_\ell$ regis rea me to wires and gates. stable are stable ter. more gates 32-bit integers: Build circuit to compute 32-bit integer $r_i$ given 4-bit integer i and 32-bit integers $r_0, r_1, \ldots, r_{15}$ : # register read Build circuit for "register write": $r_0, \ldots, r_{15}, s, i \mapsto r'_0, \ldots, r'_{15}$ where $r'_j = r_j$ except $r'_i = s$ . Build circuit for addition. Etc. $r_0, \ldots, r_{15}, i, j, k \vdash$ where $r'_\ell = r_\ell$ exce register reg registe write gates. Build circuit to compute 32-bit integer $r_i$ given 4-bit integer i and 32-bit integers $r_0, r_1, \ldots, r_{15}$ : gers: ## register read Build circuit for "register write": $r_0, \ldots, r_{15}, s, i \mapsto r'_0, \ldots, r'_{15}$ where $r'_j = r_j$ except $r'_i = s$ . Build circuit for addition. Etc. $r_0, \ldots, r_{15}, i, j, k \mapsto r'_0, \ldots, r_{15}$ where $r'_{\ell} = r_{\ell}$ except $r'_i = r_j$ register register read read register write Build circuit to compute 32-bit integer $r_i$ given 4-bit integer i and 32-bit integers $r_0, r_1, \ldots, r_{15}$ : ## register read Build circuit for "register write": $r_0, \ldots, r_{15}, s, i \mapsto r'_0, \ldots, r'_{15}$ where $r'_j = r_j$ except $r'_i = s$ . Build circuit for addition. Etc. $r_0, ..., r_{15}, i, j, k \mapsto r'_0, ..., r'_{15}$ where $r'_{\ell} = r_{\ell}$ except $r'_i = r_j r_k$ : register register read read register write recuit to compute teger $r_i$ bit integer i bit integers $r_0, r_1, \ldots, r_{15}$ : # ister cuit for "register write": $r_1, s, i \mapsto r'_0, \dots, r'_{15}$ $r_i = r_j$ except $r'_i = s$ . cuit for addition. Etc. $r_0, \ldots, r_{15}, i, j, k \mapsto r'_0, \ldots, r'_{15}$ where $r'_{\ell} = r_{\ell}$ except $r'_i = r_j r_k$ : register register read read register write Add mo replace ( $("\times", i, j)$ $$("+", i,$$ mpute $$r_0, r_1, \dots, r_{15}$$ : register write": $r'_0, \ldots, r'_{15}$ pt $r'_i = s$ . ddition. Etc. $r_0, ..., r_{15}, i, j, k \mapsto r'_0, ..., r'_{15}$ where $r'_{\ell} = r_{\ell}$ except $r'_i = r_j r_k$ : register register read read Add more flexibility More arithmetic: replace (i, j, k) with $("\times", i, j, k)$ and ("+", i, j, k) and r $r_0, ..., r_{15}, i, j, k \mapsto r'_0, ..., r'_{15}$ where $r'_{\ell} = r_{\ell}$ except $r'_i = r_j r_k$ : , *r*<sub>15</sub>: register register read read ite": tc. register write Add more flexibility. More arithmetic: replace (i, j, k) with $("\times", i, j, k)$ and ("+", i, j, k) and more option $r_0, \ldots, r_{15}, i, j, k \mapsto r'_0, \ldots, r'_{15}$ where $r'_{\ell} = r_{\ell}$ except $r'_i = r_j r_k$ : register register read read register write Add more flexibility. More arithmetic: replace (i, j, k) with $("\times", i, j, k)$ and ("+", i, j, k) and more options. 25 $r_0, \ldots, r_{15}, i, j, k \mapsto r'_0, \ldots, r'_{15}$ where $r'_{\ell} = r_{\ell}$ except $r'_i = r_j r_k$ : register register read read register write Add more flexibility. More arithmetic: replace (i, j, k) with $("\times", i, j, k)$ and ("+", i, j, k) and more options. More (but slower) storage: "load" from and "store" to larger "RAM" arrays. $r_0, \ldots, r_{15}, i, j, k \mapsto r'_0, \ldots, r'_{15}$ where $r'_{\ell} = r_{\ell}$ except $r'_i = r_j r_k$ : register register read read register write Add more flexibility. More arithmetic: replace (i, j, k) with $("\times", i, j, k)$ and ("+", i, j, k) and more options. More (but slower) storage: "load" from and "store" to larger "RAM" arrays. "Instruction fetch": $p \mapsto o_p, i_p, j_p, k_p, p'.$ $r_0, \ldots, r_{15}, i, j, k \mapsto r'_0, \ldots, r'_{15}$ where $r'_{\ell} = r_{\ell}$ except $r'_i = r_j r_k$ : register register read read register write Add more flexibility. More arithmetic: replace (i, j, k) with $("\times", i, j, k)$ and ("+", i, j, k) and more options. More (but slower) storage: "load" from and "store" to larger "RAM" arrays. "Instruction fetch": $p \mapsto o_p, i_p, j_p, k_p, p'.$ "Instruction decode": decompression of compressed format for $o_p$ , $i_p$ , $j_p$ , $k_p$ , p'. $r_1$ , $i, j, k \mapsto r'_0, \dots, r'_{15}$ = $r_\ell$ except $r'_i = r_j r_k$ : ## ter register d read egister write Add more flexibility. More arithmetic: replace (i, j, k) with $("\times", i, j, k)$ and ("+", i, j, k) and more options. More (but slower) storage: "load" from and "store" to larger "RAM" arrays. "Instruction fetch": $p \mapsto o_p, i_p, j_p, k_p, p'$ . "Instruction decode": decompression of compressed format for $o_p$ , $i_p$ , $j_p$ , $k_p$ , p'. Build "fl storing ( Hook (p Hook ou into the At each flip-flops with the Clock new for elect all the w from flip ### gister ead Add more flexibility. More arithmetic: replace (i, j, k) with $("\times", i, j, k)$ and ("+", i, j, k) and more options. More (but slower) storage: "load" from and "store" to larger "RAM" arrays. "Instruction fetch": $p \mapsto o_p, i_p, j_p, k_p, p'.$ "Instruction decode": decompression of compressed format for $o_p$ , $i_p$ , $j_p$ , $k_p$ , p'. Build "flip-flops" storing $(p, r_0, \dots,$ Hook $(p, r_0, \ldots, r_1)$ flip-flops into circu Hook outputs (p', into the same flip- At each "clock tic flip-flops are overv with the outputs. Clock needs to be for electricity to peall the way through from flip-flops to f Add more flexibility. More arithmetic: replace (i, j, k) with $("\times", i, j, k)$ and ("+", i, j, k) and more options. More (but slower) storage: "load" from and "store" to larger "RAM" arrays. "Instruction fetch": $p\mapsto o_p, i_p, j_p, k_p, p'.$ "Instruction decode": decompression of compressed format for $o_p$ , $i_p$ , $j_p$ , $k_p$ , p'. Build "flip-flops" storing $(p, r_0, \dots, r_{15})$ . Hook $(p, r_0, \dots, r_{15})$ flip-flops into circuit inputs. Hook outputs $(p', r'_0, \dots, r'_{15})$ into the same flip-flops. At each "clock tick", flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circufrom flip-flops to flip-flops. Add more flexibility. More arithmetic: replace (i, j, k) with $("\times", i, j, k)$ and ("+", i, j, k) and more options. More (but slower) storage: "load" from and "store" to larger "RAM" arrays. "Instruction fetch": $p \mapsto o_p, i_p, j_p, k_p, p'.$ "Instruction decode": decompression of compressed format for $o_p$ , $i_p$ , $j_p$ , $k_p$ , p'. Build "flip-flops" storing $(p, r_0, \dots, r_{15})$ . Hook $(p, r_0, \dots, r_{15})$ flip-flops into circuit inputs. Hook outputs $(p', r'_0, \ldots, r'_{15})$ into the same flip-flops. At each "clock tick", flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops. re flexibility. thmetic: (i, j, k) with j, k) and j, k) and more options. ut slower) storage: rom and "store" to RAM" arrays. tion fetch": $i_p, j_p, k_p, p'$ . tion decode": ression of compressed or $o_p$ , $i_p$ , $j_p$ , $k_p$ , p'. Build "flip-flops" storing $(p, r_0, ..., r_{15})$ . Hook $(p, r_0, ..., r_{15})$ flip-flops into circuit inputs. Hook outputs $(p', r'_0, \ldots, r'_{15})$ into the same flip-flops. At each "clock tick", flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops. reac Further e.g., log y. h nore options. storage: 'store" to ays. . n' le": compressed $_{p}$ , $k_{p}$ , p'. Build "flip-flops" storing $(p, r_0, \dots, r_{15})$ . Hook $(p, r_0, \dots, r_{15})$ flip-flops into circuit inputs. Hook outputs $(p', r'_0, \ldots, r'_{15})$ into the same flip-flops. At each "clock tick", flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops. Now have semi-fle flip-flops insn fetch insn decode register register read read Further flexibility in e.g., logic instruct register write ns. 27 Build "flip-flops" storing $(p, r_0, \dots, r_{15})$ . Hook $(p, r_0, \dots, r_{15})$ flip-flops into circuit inputs. Hook outputs $(p', r'_0, \ldots, r'_{15})$ into the same flip-flops. At each "clock tick", flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops. Now have semi-flexible CPU Further flexibility is useful: e.g., logic instructions. 27 Build "flip-flops" storing $(p, r_0, \dots, r_{15})$ . Hook $(p, r_0, \dots, r_{15})$ flip-flops into circuit inputs. Hook outputs $(p', r'_0, \ldots, r'_{15})$ into the same flip-flops. At each "clock tick", flip-flops are overwritten with the outputs. Clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops. Now have semi-flexible CPU: Further flexibility is useful: e.g., logic instructions. $$(p, r_0, \ldots, r_{15}).$$ $$(r_0, r_0, \ldots, r_{15})$$ s into circuit inputs. itputs $$(p', r'_0, \dots, r'_{15})$$ same flip-flops. "clock tick", are overwritten outputs. eeds to be slow enough ricity to percolate ay through the circuit, -flops to flip-flops. Now have semi-flexible CPU: Further flexibility is useful: e.g., logic instructions. fli fli .5) uit inputs. $$r'_0, \ldots, r'_{15}$$ ) flops. k", vritten slow enough ercolate h the circuit, lip-flops. Now have semi-flexible CPU: Further flexibility is useful: e.g., logic instructions. ıgh uit, #### Now have semi-flexible CPU: Further flexibility is useful: e.g., logic instructions. "Pipelining" allows faster cle stage stage stage stage stage Now have semi-flexible CPU: flip-flops insn fetch insn decode register register read read register write Further flexibility is useful: e.g., logic instructions. "Pipelining" allows faster clock: ve semi-flexible CPU: p-flops insn fetch insn ecode erregister read egister write flexibility is useful: ic instructions. Goal: Stone tick Instructi reads ne feeds p' After ne instructi uncompi while ins reads an Some ex Also ext preserve e.g., sta xible CPU: "Pipelining" allows faster clock: flip-flops insn stage 1 fetch flip-flops insn stage 2 decode flip-flops register register stage 3 read read flip-flops stage 4 flip-flops register stage 5 write Goal: Stage *n* har one tick after stage. Instruction fetch reads next instruction feeds p' back, sent After next clock tile instruction decode uncompresses this while instruction for reads another inst Some extra flip-flo Also extra area to preserve instructio e.g., stall on read- s useful: ions. #### "Pipelining" allows faster clock: Goal: Stage n handles instruous tick after stage n-1. Instruction fetch reads next instruction, feeds p' back, sends instruct After next clock tick, instruction decode uncompresses this instructio while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantice.g., stall on read-after-write "Pipelining" allows faster clock: flip-flops insn stage 1 fetch flip-flops insn stage 2 decode flip-flops register register stage 3 read read flip-flops stage 4 flip-flops register stage 5 write Goal: Stage n handles instruction one tick after stage n-1. Instruction fetch reads next instruction, feeds p' back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write. "Superso registerre read Goal: Stage *n* handles instruction one tick after stage n-1. Instruction fetch 29 reads next instruction, feeds p' back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write. stage 1 stage 2 stage 3 Goal: Stage *n* handles instruction one tick after stage n-1. Instruction fetch reads next instruction, feeds p' back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write. "Superscalar" prod insi fetc insı read writ write flip-flops insn fetch flip-flops insn decode deco flip-flops register register regist read read flip-flops flip-flops register regis stage 4 stage 5 ock: e 1 e 2 e 3 e 4 2 5 Goal: Stage n handles instruction one tick after stage n-1. Instruction fetch reads next instruction, feeds p' back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write. "Superscalar" processing: Goal: Stage n handles instruction one tick after stage n-1. Instruction fetch reads next instruction, feeds p' back, sends instruction. After next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction. Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write. "Superscalar" processing: tage n handles instruction after stage n-1. on fetch xt instruction, back, sends instruction. on decode resses this instruction, struction fetch other instruction. tra flip-flop area. ra area to instruction semantics: ll on read-after-write. "Superscalar" processing: | | flip- | | | | | | |-----------------|------------------|-------------------|------------------|--|--|--| | | insn<br>fetch | | | | | | | | flip- | | | | | | | | insn<br>decode | insn<br>decode | | | | | | | flip-flops | | | | | | | egister<br>read | register<br>read | register<br>read | register<br>read | | | | | flip-flops | | | | | | | | | | flops | | | | | | | | | | | | | | | | register<br>write | | | | | Expand into *n*-ve ARM "N Intel "A GPUs ha "Vector tion, ds instruction. ck, instruction, etch ruction. p area. n semantics: after-write. "Superscalar" processing: | | flip- | | | | | |-------------------|---------------|---------------|----------|--|--| | | insn<br>fetch | insn<br>fetch | | | | | | flip- | flops | | | | | | insn insn | | | | | | | decode decode | | | | | | | flip- | flops | | | | | register | register | register | register | | | | read | read read | | read | | | | | | | | | | | | | | | | | | | | | | | | | flip-flops | | | | | | | register register | | | | | | | | write | write | | | | "Vector" processing Expand each 32-b into *n*-vector of 32 ARM "NEON" has Intel "AVX2" has Intel "AVX-512" h GPUs have larger uction "Superscalar" processing: insn insn fetch flip-flops insn insn insn decode decode flip-flops register register register read read read read read flip-flops register register write write ion. n, CS: 2. "Vector" processing: Expand each 32-bit integer into n-vector of 32-bit integer ARM "NEON" has n=4; Intel "AVX2" has n=8; Intel "AVX-512" has n=16 GPUs have larger n. "Superscalar" processing: | | | flip-flops | | | | | |------------------|------------|------------------|-------------|------------------|-------------|------| | | | | | in:<br>fet | | | | | | flip-flops | | | | | | | | insn insn decode | | | | | | | flip-flops | | | | 5 | | | register<br>read | | register<br>read | | register<br>read | | ster | | flip-flops | | | | | | | | | | | | | | | | flip-flops | | | | | • | | | | | regi<br>wr | ster<br>ite | regi<br>wr | ster<br>ite | | "Vector" processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM "NEON" has n=4; Intel "AVX2" has n=8; Intel "AVX-512" has n=16; GPUs have larger n. "Superscalar" processing: | | | flip-flops | | | | | | |------------|------------------|------------|---------------------------|--|------------------|---|------------| | | | | insn insn<br>fetch fetch | | | | | | | | flip-flops | | | | | | | | | | insn insn<br>ecode decode | | | | | | | flip-flops | | | | 5 | | | | | register<br>read | | register<br>read | | register<br>read | | ster<br>ad | | flip-flops | | | | | | | | | | | | | | | | | | flip-flops | | | | | | ı | | | | | | ster<br>ite | | | | | "Vector" processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM "NEON" has n=4; Intel "AVX2" has n=8; Intel "AVX-512" has n=16; GPUs have larger n. $n \times$ speedup if $n \times$ arithmetic circuits, $n \times$ read/write circuits. Benefit: Amortizes insn circuits. "Superscalar" processing: | | flip- | | | | | |-----------------|--------------------------------------------------|----------------------------|------------------|--|--| | | insn<br>fetch | insn<br>fetch | | | | | | flip- | flops | | | | | | insn<br>decode | insn insn<br>decode decode | | | | | | flip-flops | | | | | | registe<br>read | register<br>read | register<br>read | register<br>read | | | | flip-flops | | | | | | | | <del> </del> | | | | | | | | | | | | | | register register<br>write write | | | | | "Vector" processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM "NEON" has n=4; Intel "AVX2" has n=8; Intel "AVX-512" has n=16; GPUs have larger n. $n \times$ speedup if $n \times$ arithmetic circuits, $n \times$ read/write circuits. Benefit: Amortizes inso circuits. Huge effect on higher-level algorithms and data structures. calar" processing: flip-flops insn insn fetch fetch flip-flops insn insn ecode decode flip-flops egister register register read read read flip-flops flip-flops egisterregister write write "Vector" processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM "NEON" has n=4; Intel "AVX2" has n=8; Intel "AVX-512" has n=16; GPUs have larger n. $n \times$ speedup if $n \times$ arithmetic circuits, $n \times$ read/write circuits. Benefit: Amortizes inso circuits. Huge effect on higher-level algorithms and data structures. <u>Network</u> How exp Input: a Each nu represen Output: in increa represen same mi h de terregister read "Vector" processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM "NEON" has n=4; Intel "AVX2" has n=8; Intel "AVX-512" has n=16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures. Network on chip: How expensive is s Input: array of n represented in binary Output: array of *i* in increasing order represented in bina same multiset as i "Vector" processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM "NEON" has n=4; Intel "AVX2" has n=8; Intel "AVX-512" has n=16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures. Network on chip: the mesh How expensive is sorting? Input: array of n numbers. Each number in $\{1, 2, \ldots, n\}$ represented in binary. Output: array of *n* numbers in increasing order, represented in binary; same multiset as input. "Vector" processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM "NEON" has n=4; Intel "AVX2" has n=8; Intel "AVX-512" has n=16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures. Network on chip: the mesh How expensive is sorting? Input: array of n numbers. Each number in $\{1, 2, ..., n^2\}$ , represented in binary. Output: array of *n* numbers, in increasing order, represented in binary; same multiset as input. 32 "Vector" processing: Expand each 32-bit integer into n-vector of 32-bit integers. ARM "NEON" has n=4; Intel "AVX2" has n=8; Intel "AVX-512" has n=16; GPUs have larger n. n× speedup if n× arithmetic circuits, n× read/write circuits. Benefit: Amortizes insn circuits. Huge effect on higher-level algorithms and data structures. Network on chip: the mesh How expensive is sorting? Input: array of n numbers. Each number in $\{1, 2, ..., n^2\}$ , represented in binary. Output: array of *n* numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area $n^{1+o(1)}$ . For simplicity assume $n = 4^k$ . each 32-bit integer ector of 32-bit integers. JEON" has n = 4; VX2'' has n = 8; VX-512" has n = 16; ave larger n. dup if metic circuits, /write circuits. Amortizes insn circuits. fect on higher-level ms and data structures. Network on chip: the mesh How expensive is sorting? Input: array of n numbers. Each number in $\{1, 2, ..., n^2\}$ , represented in binary. Output: array of *n* numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area $n^{1+o(1)}$ . For simplicity assume $n = 4^k$ . Spread a square n each of with nea ng: it integer 2-bit integers. n = 4; n = 8; n = 16; uits, cuits. s insn circuits. ta structures. Network on chip: the mesh How expensive is sorting? Input: array of n numbers. Each number in $\{1, 2, ..., n^2\}$ , represented in binary. Output: array of *n* numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area $n^{1+o(1)}$ . For simplicity assume $n = 4^k$ . Spread array acros square mesh of n each of area $n^{o(1)}$ with near-neighbor #### Network on chip: the mesh How expensive is sorting? Input: array of n numbers. Each number in $\{1, 2, ..., n^2\}$ , represented in binary. Output: array of *n* numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area $n^{1+o(1)}$ . For simplicity assume $n = 4^k$ . 7 ers. uits. res. How expensive is sorting? Input: array of n numbers. Each number in $\{1, 2, ..., n^2\}$ , represented in binary. Output: array of *n* numbers, in increasing order, represented in binary; same multiset as input. Metric: seconds used by circuit of area $n^{1+o(1)}$ . For simplicity assume $n = 4^k$ . Spread array across square mesh of n small cells, each of area $n^{o(1)}$ , with near-neighbor wiring: on chip: the mesh ensive is sorting? rray of *n* numbers. mber in $\{1, 2, ..., n^2\}$ , ted in binary. array of *n* numbers, sing order, ted in binary; ultiset as input. seconds used by 1+o(1) f area $n^{1+o(1)}$ . olicity assume $n = 4^k$ . Spread array across square mesh of n small cells, each of area $n^{o(1)}$ , with near-neighbor wiring: Sort row in $n^{0.5+6}$ - Sort e 3 1 4 1 3 1 - Sort a 1 3 1 1 1 3 - Repeaequals #### the mesh sorting? numbers. $$1, 2, \ldots, n^2$$ , ary. numbers, 7 ary; nput. sed by o(1) me $n=4^k$ . Spread array across square mesh of n small cells, each of area $n^{o(1)}$ , with near-neighbor wiring: Sort row of $n^{0.5}$ coin $n^{0.5+o(1)}$ second - Sort each pair in 3 1 4 1 5 9 2 6 1 3 1 4 5 9 2 6 - Sort alternate parts 1 3 1 4 5 9 2 6 1 1 3 4 5 2 9 6 - Repeat until nur equals row lengt Spread array across square mesh of n small cells, each of area $n^{o(1)}$ , with near-neighbor wiring: Sort row of $n^{0.5}$ cells in $n^{0.5+o(1)}$ seconds: • Sort each pair in parallel. $$\frac{3\ 1}{1\ 3\ 1\ 4\ 5\ 9} \, \frac{2\ 6}{2\ 6} \mapsto$$ Sort alternate pairs in para Repeat until number of st equals row length. Spread array across square mesh of n small cells, each of area $n^{o(1)}$ , with near-neighbor wiring: Sort row of $n^{0.5}$ cells in $n^{0.5+o(1)}$ seconds: • Sort each pair in parallel. $$\frac{3\ 1}{1\ 3\ 1\ 4\ 5\ 9} \, \frac{2\ 6}{2\ 6} \mapsto$$ • Sort alternate pairs in parallel. Repeat until number of steps equals row length. 34 Spread array across square mesh of n small cells, each of area $n^{o(1)}$ , with near-neighbor wiring: Sort row of $n^{0.5}$ cells in $n^{0.5+o(1)}$ seconds: • Sort each pair in parallel. $$\frac{3\ 1}{1\ 3\ 1\ 4\ 5\ 9} \, \frac{2\ 6}{2\ 6} \mapsto$$ Sort alternate pairs in parallel. Repeat until number of steps equals row length. Sort *each* row, in parallel, in a *total* of $n^{0.5+o(1)}$ seconds. array across nesh of n small cells, area $n^{o(1)}$ , r-neighbor wiring: Sort row of $n^{0.5}$ cells in $n^{0.5+o(1)}$ seconds: Sort each pair in parallel. $$\frac{3}{1}, \frac{1}{4}, \frac{1}{5}, \frac{5}{9}, \frac{9}{2}, \frac{6}{6} \mapsto \frac{1}{3}, \frac{3}{1}, \frac{1}{4}, \frac{5}{5}, \frac{9}{2}, \frac{2}{6}$$ • Sort alternate pairs in parallel. Repeat until number of steps equals row length. Sort *each* row, in parallel, in a *total* of $n^{0.5+o(1)}$ seconds. Sort all in $n^{0.5+6}$ - Recursin para - Sort ea - Sort e. - Sort e - Sort e With prolemants of that this 35 small cells, r wiring: Sort row of $n^{0.5}$ cells in $n^{0.5+o(1)}$ seconds: • Sort each pair in parallel. $$\frac{3\ 1}{1\ 3\ 1\ 4\ 5\ 9} \, \frac{2\ 6}{2\ 6} \mapsto$$ Sort alternate pairs in parallel. Repeat until number of steps equals row length. Sort *each* row, in parallel, in a *total* of $n^{0.5+o(1)}$ seconds. Sort all n cells in $n^{0.5+o(1)}$ second - Recursively sort in parallel, if n > - Sort each colum - Sort each row in - Sort each colum - Sort each row in With proper choice left-to-right/right-for each row, can that this sorts who Sort row of $n^{0.5}$ cells in $n^{0.5+o(1)}$ seconds: • Sort each pair in parallel. $$\frac{3}{1}, \frac{1}{4}, \frac{1}{5}, \frac{5}{9}, \frac{9}{2}, \frac{6}{6} \mapsto \frac{1}{3}, \frac{3}{1}, \frac{1}{4}, \frac{5}{5}, \frac{9}{2}, \frac{2}{6}$$ Sort alternate pairs in parallel. Repeat until number of steps equals row length. Sort *each* row, in parallel, in a *total* of $n^{0.5+o(1)}$ seconds. Sort all n cells in $n^{0.5+o(1)}$ seconds: - Recursively sort quadrants in parallel, if n > 1. - Sort each column in parall - Sort each row in parallel. - Sort each column in parall - Sort each row in parallel. With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. Sort row of $n^{0.5}$ cells in $n^{0.5+o(1)}$ seconds: • Sort each pair in parallel. $$\frac{3\ 1}{1\ 3\ 1\ 4\ 5\ 9} \, \frac{2\ 6}{2\ 6} \mapsto$$ • Sort alternate pairs in parallel. Repeat until number of steps equals row length. Sort *each* row, in parallel, in a *total* of $n^{0.5+o(1)}$ seconds. Sort all n cells in $n^{0.5+o(1)}$ seconds: - Recursively sort quadrants in parallel, if n > 1. - Sort each column in parallel. - Sort each row in parallel. - Sort each column in parallel. - Sort each row in parallel. With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. of $n^{0.5}$ cells $o^{(1)}$ seconds: ach pair in parallel. $$\underline{1} \ \underline{5} \ \underline{9} \ \underline{2} \ \underline{6} \ \mapsto$$ 4 5 9 2 6 Iternate pairs in parallel. $$\underline{45} \ \underline{92} \ 6 \mapsto$$ 4 5 2 9 6 t until number of steps row length. h row, in parallel, of $n^{0.5+o(1)}$ seconds. Sort all *n* cells in $n^{0.5+o(1)}$ seconds: - Recursively sort quadrants in parallel, if n > 1. - Sort each column in parallel. - Sort each row in parallel. - Sort each column in parallel. - Sort each row in parallel. With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. | For | ex | ar | |------|----|----| | this | 8 | × | ells ds: parallel. $\mapsto$ airs in parallel. $\mapsto$ mber of steps h. parallel, o(1) seconds. Sort all n cells in $n^{0.5+o(1)}$ seconds: - Recursively sort quadrants in parallel, if n > 1. - Sort each column in parallel. - Sort each row in parallel. - Sort each column in parallel. - Sort each row in parallel. With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. For example, assume this $8 \times 8$ array is | 3 | 1 | 4 | 1 | 5 | 9 | |----------|---|---|---|---|---| | 5 | 3 | 5 | 8 | 9 | 7 | | 2 | 3 | 8 | 4 | 6 | 2 | | 3 | 3 | 8 | 3 | 2 | 7 | | 0 | 2 | 8 | 8 | 4 | 1 | | 1 | 6 | 9 | 3 | 9 | 9 | | 5 | 1 | 0 | 5 | 8 | 2 | | <b>—</b> | 4 | | 4 | 4 | _ | 36 Sort all n cells in $n^{0.5+o(1)}$ seconds: - Recursively sort quadrants in parallel, if n > 1. - Sort each column in parallel. - Sort each row in parallel. - Sort each column in parallel. - Sort each row in parallel. With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. For example, assume that this $8 \times 8$ array is in cells: | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 | |---|---|---|---|---|---|---|---| | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 | | 2 | 3 | 8 | 4 | 6 | 2 | 6 | 4 | | 3 | 3 | 8 | 3 | 2 | 7 | 9 | 5 | | 0 | 2 | 8 | 8 | 4 | 1 | 9 | 7 | | 1 | 6 | 9 | 3 | 9 | 9 | 3 | 7 | | 5 | 1 | 0 | 5 | 8 | 2 | 0 | 9 | | 7 | 4 | 9 | 4 | 4 | 5 | 9 | 2 | allel. eps ds Sort all n cells in $n^{0.5+o(1)}$ seconds: - Recursively sort quadrants in parallel, if n > 1. - Sort each column in parallel. - Sort each row in parallel. - Sort each column in parallel. - Sort each row in parallel. With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array. For example, assume that this $8 \times 8$ array is in cells: | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 | |---|---|---|---|---|---|---|---| | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 | | 2 | 3 | 8 | 4 | 6 | 2 | 6 | 4 | | 3 | 3 | 8 | 3 | 2 | 7 | 9 | 5 | | 0 | 2 | 8 | 8 | 4 | 1 | 9 | 7 | | 1 | 6 | 9 | 3 | 9 | 9 | 3 | 7 | | 5 | 1 | 0 | 5 | 8 | 2 | 0 | 9 | | 7 | 4 | 9 | 4 | 4 | 5 | 9 | 2 | n cells $o^{(1)}$ seconds: sively sort quadrants allel, if n > 1. ach column in parallel. ach row in parallel. ach column in parallel. ach row in parallel. oper choice of ght/right-to-left row, can prove sorts whole array. For example, assume that this $8 \times 8$ array is in cells: | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 | |---|---|---|---|---|---|---|---| | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 | | 2 | 3 | 8 | 4 | 6 | 2 | 6 | 4 | | 3 | 3 | 8 | 3 | 2 | 7 | 9 | 5 | | 0 | 2 | 8 | 8 | 4 | 1 | 9 | 7 | | 1 | 6 | 9 | 3 | 9 | 9 | 3 | 7 | | 5 | 1 | 0 | 5 | 8 | 2 | 0 | 9 | | 7 | 4 | 9 | 4 | 4 | 5 | 9 | 2 | | Rec | ursiv | |-----|---------------------| | top | $\longrightarrow$ , | | 1 | | |---|--| | 3 | | | 3 | | | 5 | | | 1 | | | 4 | | | 7 | | quadrants > 1. n in parallel. parallel. n in parallel. parallel. e of to-left prove ole array. For example, assume that this $8 \times 8$ array is in cells: | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 | |---|---|---|---|---|---|---|---| | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 | | 2 | 3 | 8 | 4 | 6 | 2 | 6 | 4 | | 3 | 3 | 8 | 3 | 2 | 7 | 9 | 5 | | 0 | 2 | 8 | 8 | 4 | 1 | 9 | 7 | | 1 | 6 | 9 | 3 | 9 | 9 | 3 | 7 | | 5 | 1 | 0 | 5 | 8 | 2 | 0 | 9 | | 7 | 4 | 9 | 4 | 4 | 5 | 9 | 2 | Recursively sort question $\rightarrow$ , bottom $\leftarrow$ | 1 | 1 | 2 | 3 | 2 | 2 | |---|--------------------------------------|---|---|---|---| | 3 | 3 | 3 | 3 | 4 | 5 | | 3 | 4 | 4 | 5 | 6 | 6 | | 5 | 1<br>3<br>4<br>8<br>1<br>4<br>6<br>9 | 8 | 8 | 9 | 9 | | 1 | 1 | 0 | 0 | 2 | 2 | | 4 | 4 | 3 | 2 | 5 | 4 | | 7 | 6 | 5 | 5 | 9 | 8 | | | | | | | | For example, assume that this $8 \times 8$ array is in cells: | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 | |---|---|---|---|---|---|---|---| | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 | | 2 | 3 | 8 | 4 | 6 | 2 | 6 | 4 | | 3 | 3 | 8 | 3 | 2 | 7 | 9 | 5 | | 0 | 2 | 8 | 8 | 4 | 1 | 9 | 7 | | 1 | 6 | 9 | 3 | 9 | 9 | 3 | 7 | | 5 | 1 | 0 | 5 | 8 | 2 | 0 | 9 | | 7 | 4 | 9 | 4 | 4 | 5 | 9 | 2 | Recursively sort quadrants, top $\rightarrow$ , bottom $\leftarrow$ : | 1 | 1 | 2 | 3 | 2 | 2 | 2 | 3 | |---|---|---|---|---|---|---|---| | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 6 | | 3 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 5 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | | 4 | 4 | 3 | 2 | 5 | 4 | 4 | 3 | | 7 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | For example, assume that this $8 \times 8$ array is in cells: Recursively sort quadrants, top $\rightarrow$ , bottom $\leftarrow$ : | 1 | 1 | 2 | 3 | 2 | 2 | 2 | 3 | |---|---|---|---|---|---|---|---| | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 6 | | 3 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 5 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | | 4 | 4 | 3 | 2 | 5 | 4 | 4 | 3 | | 7 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | nple, assume that 8 array is in cells: ``` 1 5 9 2 6 8 9 7 9 3 8 4 6 2 6 4 8 3 2 7 9 5 8 4 1 9 7 9 3 9 9 3 7 9 4 4 5 9 2 ``` Recursively sort quadrants, top $\rightarrow$ , bottom $\leftarrow$ : | 1 | 1 | 2 | 3 | 2 | 2 | 2 | 3 | |---|---|---|---|---|---------------------------------------------------|---|---| | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 6 | | 3 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 5 | 8 | 8 | 8 | 9 | <ul><li>2</li><li>5</li><li>6</li><li>9</li></ul> | 9 | 9 | | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | | 4 | 4 | 3 | 2 | 5 | 4 | 4 | 3 | | 7 | 6 | 5 | 5 | 9 | 2<br>4<br>8<br>9 | 7 | 7 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | Sort eac in paralle | 1 | 1 | |---|---| | 1 | 1 | | 3 | 3 | | 3 | 4 | | 4 | 4 | | 5 | 6 | | 7 | 8 | | | | me that in cells: Recursively sort quadrants, top $\rightarrow$ , bottom $\leftarrow$ : | 1 | 1 | 2 | 3 | 2 | 2 | 2 | 3 | |---|---|---|-------------|---|---|---|---| | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 6 | | 3 | 4 | 4 | 3 5 | 6 | 6 | 7 | 7 | | 5 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 1 | 1 | 0 | 0<br>2<br>5 | 2 | 2 | 1 | 0 | | 4 | 4 | 3 | 2 | 5 | 4 | 4 | 3 | | 7 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | | 1 | 1 | 0 | 0 | 2 | 2 | |---|---|---|---|---|---| | 1 | 1 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 4 | 4 | | 3 | 4 | 3 | 3 | 5 | 5 | | 4 | 4 | 4 | 5 | 6 | 6 | | 5 | 6 | 5 | 5 | 9 | 8 | | 7 | 8 | 8 | 8 | 9 | 9 | | 9 | 9 | 8 | 8 | 9 | 9 | Recursively sort quadrants, top $\rightarrow$ , bottom $\leftarrow$ : | 1 | 1 | 2 | 3 | 2 | 2 | 2 | 3 | |---|---|---|---|---|---|---|---| | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 6 | | 3 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 5 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | | 4 | 4 | 3 | 2 | 5 | 4 | 4 | 3 | | 7 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | |---|---|---|---|---|---|---|---| | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 3 | | 3 | 4 | 3 | 3 | 5 | 5 | 5 | 6 | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 5 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | Recursively sort quadrants, top $\rightarrow$ , bottom $\leftarrow$ : | 1 | 1 | 2 | 3 | 2 | 2 | 2 | 3 | |---|---|---|---|---|-----|---|-------------| | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 6 | | 3 | 4 | 4 | 5 | 6 | 6 | 7 | 3<br>6<br>7 | | 5 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 1 | 1 | 0 | 0 | 2 | 2 4 | 1 | 0 | | 4 | 4 | 3 | 2 | 5 | 4 | 4 | 3 | | 7 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | |---|---|---|---|---|---|---|---| | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 3 | | 3 | 4 | 3 | 3 | 5 | 5 | 5 | 6 | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 5 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | ely sort quadrants, bottom ←: | 2 | 3 | 2 | 2 | 2 | 3 | |---|---|---|---|---|---| | 3 | 3 | 4 | 5 | 5 | 6 | | ŀ | 5 | 6 | 6 | 7 | 7 | | 3 | 8 | 9 | 9 | 9 | 9 | | ) | 0 | 2 | 2 | 1 | 0 | | 3 | 2 | 5 | 4 | 4 | 3 | | ) | 5 | 9 | 8 | 7 | 7 | | 3 | 8 | 9 | 9 | 9 | 9 | #### Sort each column in parallel: | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | |---|---|---|---|---|---|---|---| | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 3 | | 3 | 4 | 3 | 3 | 5 | 5 | 5 | 6 | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 5 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | Sort eac alternate | 0 | 0 | C | |---|---------------|-----| | 3 | 2 | 2 | | 3 | 3 | (1) | | 6 | 5 | ( T | | 4 | 4 | 4 | | 9 | 8 | 7 | | 7 | 8 | 8 | | | $\overline{}$ | | | 2 | 3 | |---|---| | 5 | 6 | | 7 | 7 | | 9 | 9 | | 1 | 0 | | 4 | 3 | | 7 | 7 | | 9 | 9 | Sort each column in parallel: | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | |---|---|---|---|---|---|---|---| | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 3 | | 3 | 4 | 3 | 3 | 5 | 5 | 5 | 6 | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 5 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | Sort each row in particular alternately $\leftarrow$ , $\rightarrow$ : | 0 | 0 | 0 | 1 | 1 | 1 | |---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 4 | | 6 | 5 | 5 | 5 | 4 | 3 | | 4 | 4 | 4 | 5 | 6 | 6 | | 9 | 8 | 7 | 7 | 6 | 5 | | 7 | 8 | 8 | 8 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | Sort each column in parallel: | _ | | | | | | | | | |---|---|---|---|---|---|---|---|---| | | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | | | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | | | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 3 | | | 3 | 4 | 3 | 3 | 5 | 5 | 5 | 6 | | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | | 5 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | | | | | | | | | | | Sort each row in parallel, alternately $\leftarrow$ , $\rightarrow$ : | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | |---|---|---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 4 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 9 | 8 | 7 | 7 | 6 | 5 | 5 | 5 | | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 8 | # Sort each column in parallel: | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | |---|---|---|---|---|---|---|---| | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 3 | | 3 | 4 | 3 | 3 | 5 | 5 | 5 | 6 | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 5 | 6 | 5 | 5 | 9 | 8 | 7 | 7 | | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 9 | 9 | 8 | 8 | 9 | 9 | 9 | 9 | Sort each row in parallel, alternately $\leftarrow$ , $\rightarrow$ : | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | |---|---|---|--------|---|---|-------------------------------|-------------------------------| | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 4 | 3 | 3 | 3 | | | | | | | | | | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 9 | 8 | 7 | 5<br>7 | 6 | 5 | <ul><li>7</li><li>5</li></ul> | <ul><li>7</li><li>5</li></ul> | | | | | | | | | | h column | ) | 0 | 2 | 2 | 1 | 0 | |---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 4 | 4 | 4 | 3 | | 3 | 3 | 5 | 5 | 5 | 6 | | ŀ | 5 | 6 | 6 | 7 | 7 | | • | 5 | 9 | 8 | 7 | 7 | | 3 | 8 | 9 | 9 | 9 | 9 | | 3 | 8 | 9 | 9 | 9 | 9 | Sort each row in parallel, alternately $\leftarrow$ , $\rightarrow$ : 39 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | |---|---|---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 4 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 9 | 8 | 7 | 7 | 6 | 5 | 5 | 5 | | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 8 | Sort eac in parall | | 0 | 0 | | |---|---|---|--| | | 3 | 2 | | | • | 3 | 3 | | | • | 4 | 4 | | | | 6 | 5 | | | | 7 | 8 | | | | 9 | 8 | | | | | | | Sort each row in parallel, alternately $\leftarrow$ , $\rightarrow$ : | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | |---|---|---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 4 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 9 | 8 | 7 | 7 | 6 | 5 | 5 | 5 | | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 8 | | 0 | 0 | 0 | 1 | 1 | 1 | |---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 4 | 4 | | 6 | 5 | 5 | 5 | 6 | 5 | | 7 | 8 | 7 | 7 | 6 | 6 | | 9 | 8 | 8 | 8 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | Sort each row in parallel, alternately $\leftarrow$ , $\rightarrow$ : | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | |---|---|--------|--------|---|-------------------------------|--------|--------| | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 4 | 3 | 3 | 3 | | | | | | | | | | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 9 | 8 | 4<br>7 | 5<br>7 | 6 | <ul><li>6</li><li>5</li></ul> | 7<br>5 | 7<br>5 | | | | | | | | | | | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 4 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 6 | 5 | 5 | 5 | | 7 | 8 | 7 | 7 | 6 | 6 | 7 | 7 | | 9 | 8 | 8 | 8 | 9 | 9 | 8 | 8 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | Sort each row in parallel, alternately $\leftarrow$ , $\rightarrow$ : | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | |---|---|---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 4 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 6 | 6 | 7 | 7 | | 9 | 8 | 7 | 7 | 6 | 5 | 5 | 5 | | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 8 | | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 4 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 6 | 5 | 5 | 5 | | 7 | 8 | 7 | 7 | 6 | 6 | 7 | 7 | | 9 | 8 | 8 | 8 | 9 | 9 | 8 | 8 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | | ) | 1 | 1 | 1 | 2 | 2 | |---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 1 | 1 | | 3 | 3 | 3 | 4 | 4 | 4 | | ) | 5 | 4 | 3 | 3 | 3 | | ŀ | 5 | 6 | 6 | 7 | 7 | | 7 | 7 | 6 | 5 | 5 | 5 | | 3 | 8 | 9 | 9 | 9 | 9 | | ) | 9 | 9 | 9 | 8 | 8 | Sort each column in parallel: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 4 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 6 | 5 | 5 | 5 | | 7 | 8 | 7 | 7 | 6 | 6 | 7 | 7 | | 9 | 8 | 8 | 8 | 9 | 9 | 8 | 8 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | Sort each $\leftarrow$ or $\rightarrow$ | 0 | 0 | | |---|---|---| | 2 | 2 | 2 | | 3 | 3 | | | 4 | 4 | | | 5 | 5 | 5 | | 6 | 6 | 7 | | 8 | 8 | 3 | | | | | arallel, ### Sort each column in parallel: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 4 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 6 | 5 | 5 | 5 | | 7 | 8 | 7 | 7 | 6 | 6 | 7 | 7 | | 9 | 8 | 8 | 8 | 9 | 9 | 8 | 8 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | Sort each row in positive $\leftarrow$ or $\rightarrow$ as desired | 0 | 0 | 0 | 1 | 1 | 1 | |---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | | 5 | 5 | 5 | 5 | 5 | 5 | | 6 | 6 | 7 | 7 | 7 | 7 | | 8 | 8 | 8 | 8 | 8 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 41 # Sort each column in parallel: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 4 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 6 | 5 | 5 | 5 | | 7 | 8 | 7 | 7 | 6 | 6 | 7 | 7 | | 9 | 8 | 8 | 8 | 9 | 9 | 8 | 8 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | Sort each row in parallel, $\leftarrow$ or $\rightarrow$ as desired: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 8 | | 8 | 8 | 8 | 8 | 8 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | # Sort each column in parallel: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 5 | 4 | 4 | 4 | 4 | | 6 | 5 | 5 | 5 | 6 | 5 | 5 | 5 | | 7 | 8 | 7 | 7 | 6 | 6 | 7 | 7 | | 9 | 8 | 8 | 8 | 9 | 9 | 8 | 8 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | Sort each row in parallel, $\leftarrow$ or $\rightarrow$ as desired: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 8 | | 8 | 8 | 8 | 8 | 8 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | h column el: | ) | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | | ŀ | 5 | 4 | 4 | 4 | 4 | | • | 5 | 6 | 5 | 5 | 5 | | 7 | 7 | 6 | 6 | 7 | 7 | | 3 | 8 | 9 | 9 | 8 | 8 | | ) | 9 | 9 | 9 | 9 | 9 | Sort each row in parallel, $\leftarrow$ or $\rightarrow$ as desired: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 8 | | 8 | 8 | 8 | 8 | 8 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | Chips ar towards parallelis GPUs: p Old Xeo New Xeo Sort each row in parallel, $\leftarrow$ or $\rightarrow$ as desired: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 8 | | 8 | 8 | 8 | 8 | 8 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | Chips are in fact entowards having the parallelism and co GPUs: parallel + Old Xeon Phi: pai New Xeon Phi: pa Sort each row in parallel, $\leftarrow$ or $\rightarrow$ as desired: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 8 | | 8 | 8 | 8 | 8 | 8 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | Chips are in fact evolving towards having this much parallelism and communicat GPUs: parallel + global RA Old Xeon Phi: parallel + rin New Xeon Phi: parallel + m Sort each row in parallel, $\leftarrow$ or $\rightarrow$ as desired: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 8 | | 8 | 8 | 8 | 8 | 8 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | Chips are in fact evolving towards having this much parallelism and communication. GPUs: parallel + global RAM. Old Xeon Phi: parallel + ring. New Xeon Phi: parallel + mesh. 43 Sort each row in parallel, $\leftarrow$ or $\rightarrow$ as desired: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 8 | | 8 | 8 | 8 | 8 | 8 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 42 Chips are in fact evolving towards having this much parallelism and communication. GPUs: parallel + global RAM. Old Xeon Phi: parallel + ring. New Xeon Phi: parallel + mesh. Algorithm designers don't even get the right exponent without taking this into account. Sort each row in parallel, $\leftarrow$ or $\rightarrow$ as desired: | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |---|---|---|---|---|---|---|---| | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 8 | | 8 | 8 | 8 | 8 | 8 | 9 | 9 | 9 | | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | Chips are in fact evolving towards having this much parallelism and communication. GPUs: parallel + global RAM. Old Xeon Phi: parallel + ring. New Xeon Phi: parallel + mesh. Algorithm designers don't even get the right exponent without taking this into account. Shock waves from subroutines into high-level algorithm design.