Is branch prediction important for performance? #### Daniel J. Bernstein Spectre paper: "Modern processors use branch prediction and speculative execution to maximize performance." Wikipedia: "Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86." The article cited by Wikipedia says: "Branch predictor (BP) is an essential component in modern processors since high BP accuracy can improve performance and reduce energy by decreasing the number of instructions executed on wrong-path." Is branch prediction important for performance? #### Daniel J. Bernstein Spectre paper: "Modern processors use branch prediction and speculative execution to maximize performance." Wikipedia: "Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86." The article cited by Wikipedia says: "Branch predictor (BP) is an essential component in modern processors since high BP accuracy can improve performance and reduce energy by decreasing the number of instructions executed on wrong-path." — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. Is branch prediction important for performance? #### Daniel J. Bernstein Spectre paper: "Modern processors use branch prediction and speculative execution to maximize performance." Wikipedia: "Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86." The article cited by Wikipedia says: "Branch predictor (BP) is an essential component in modern processors since high BP accuracy can improve performance and reduce energy by decreasing the number of instructions executed on wrong-path." — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. n prediction nt for performance? . Bernstein paper: "Modern rs use branch prediction culative execution to e performance." ia: "Branch predictors ritical role in achieving ective performance modern pipelined cessor architectures x86." The article cited by Wikipedia says: "Branch predictor (BP) is an essential component in modern processors since high BP accuracy can improve performance and reduce energy by decreasing the number of instructions executed on wrong-path." — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. The CP Cycle 1: fetch decode register execute register ormance? Modern nch prediction ecution to ance." ch predictors in achieving ormance ipelined chitectures The article cited by Wikipedia says: "Branch predictor (BP) is an essential component in modern processors since high BP accuracy can improve performance and reduce energy by decreasing the number of instructions executed on wrong-path." Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. # The CPU pipeline # Cycle 1: fetch decode register read execute register write 1 tion ors ng The article cited by Wikipedia says: "Branch predictor (BP) is an essential component in modern processors since high BP accuracy can improve performance and reduce energy by decreasing the number of instructions executed on wrong-path." Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline #### Cycle 1: | fetch | | |----------------|--| | decode | | | register read | | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline #### Cycle 1: | fetch | a=b+c | |----------------|-------| | decode | | | register read | | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline #### Cycle 2: | fetch | | |----------------|-------| | decode | a=b+c | | register read | | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline #### Cycle 3: | fetch | | |----------------|-------| | decode | | | register read | a=b+c | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. # The CPU pipeline #### Cycle 4: | fetch | | |----------------|-------| | decode | | | register read | | | execute | a=b+c | | register write | | Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. # The CPU pipeline ### Cycle 5: | fetch | | |----------------|-------| | decode | | | register read | | | execute | | | register write | a=b+c | 1 instruction finishes in 5 cycles. Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline Another program, cycle 1: | fetch | a=b+c | |----------------|-------| | decode | | | register read | | | execute | | | register write | | Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline #### Cycle 2: | fetch | d=e+f | |----------------|-------| | decode | a=b+c | | register read | | | execute | | | register write | | Second instruction is fetched; first instruction is decoded. Hardware units operate in parallel. Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline #### Cycle 3: | fetch | g=h-i | |----------------|-------| | decode | d=e+f | | register read | a=b+c | | execute | | | register write | | Third instruction is fetched; second instruction is decoded; first instruction does register read. — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline ### Cycle 4: | fetch | j=k+1 | |----------------|-------| | decode | g=h-i | | register read | d=e+f | | execute | a=b+c | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. # The CPU pipeline ### Cycle 5: | fetch | m=n-o | |----------------|-------| | decode | j=k+1 | | register read | g=h-i | | execute | d=e+f | | register write | a=b+c | Program continues this way. Throughput: 1 instruction/cycle. Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline Another program, cycle 1: | fetch | a=b+c | |----------------|-------| | decode | | | register read | | | execute | | | register write | | Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. # The CPU pipeline #### Cycle 2: | fetch | d=a-e | |----------------|-------| | decode | a=b+c | | register read | | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline ### Cycle 3: | fetch | • • • | |----------------|-------| | decode | d=a-e | | register read | a=b+c | | execute | | | register write | | Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. # The CPU pipeline #### Cycle 4: | fetch | • • • | |----------------|-------| | decode | • • • | | register read | d=a-e | | execute | a=b+c | | register write | | Register-read unit is idle, waiting for a to be ready. — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline #### Cycle 5: | fetch | • • • | |----------------|-------| | decode | • • • | | register read | d=a-e | | execute | | | register write | a=b+c | Execute unit is idle. Typical CPUs design pipelines to eliminate this slowdown: fast-forward a to next operation. Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline Another program, cycle 1: | fetch | a=b+c | |----------------|-------| | decode | | | register read | | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline #### Cycle 2: | fetch | d=e+f | |----------------|-------| | decode | a=b+c | | register read | | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline ### Cycle 3: | fetch | g=h-i | |----------------|-------| | decode | d=e+f | | register read | a=b+c | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline #### Cycle 4: | fetch | if(g<0) | |----------------|---------| | decode | g=h-i | | register read | d=e+f | | execute | a=b+c | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline ### Cycle 5: | fetch | | |----------------|---------| | decode | if(g<0) | | register read | g=h-i | | execute | d=e+f | | register write | a=b+c | Without branch prediction, fetch unit doesn't know which instruction to fetch now! Waiting for if to write "instruction pointer" register. — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline ### Cycle 6: | fetch | | |----------------|---------| | decode | | | register read | if(g<0) | | execute | g=h-i | | register write | d=e+f | Fetch is still waiting. Typical CPUs: longer pipelines; longer delays than this picture. (Assume no hyperthreading.) Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. # The CPU pipeline Cycle 5, speculative execution: | fetch | g=-g | |----------------|---------| | decode | if(g<0) | | register read | g=h-i | | execute | d=e+f | | register write | a=b+c | Branch predictor guesses which instruction to fetch. More work to undo everything if guess turns out to be wrong, but usually guess is correct. Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline Better program, cycle 1: | fetch | <0? g=h-i | |----------------|-----------| | decode | | | register read | | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline ### Cycle 2: | fetch | a=b+c | |----------------|-----------| | decode | <0? g=h-i | | register read | | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline ### Cycle 3: | fetch | d=e+f | |----------------|-----------| | decode | a=b+c | | register read | <0? g=h-i | | execute | | | register write | | — Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline ### Cycle 4: | fetch | j=k+1 | |----------------|-----------| | decode | d=e+f | | register read | a=b+c | | execute | <0? g=h-i | | register write | | Omitting branch prediction reduces energy even more. Eliminates all wrong-path instructions. Also eliminates cost of prediction+speculation. The real question is **latency**. ### The CPU pipeline ### Cycle 5: | fetch | if(?) | |----------------|-----------| | decode | j=k+1 | | register read | d=e+f | | execute | a=b+c | | register write | <0? g=h-i | Fast-forward flag to fetch unit. Branch prediction has zero benefit if programs compute branch conditions *P* cycles in advance, where *P* is pipeline length. cle cited by Wikipedia Branch predictor (BP) is tial component in modern rs since high BP accuracy rove performance and nergy by decreasing ber of instructions d on wrong-path." ting branch prediction energy even more. es all wrong-path ons. Also eliminates prediction+speculation. question is **latency**. # The CPU pipeline # Cycle 5: | fetch | if(?) | |----------------|-----------| | decode | j=k+1 | | register read | d=e+f | | execute | a=b+c | | register write | <0? g=h-i | Fast-forward flag to fetch unit. Branch prediction has zero benefit if programs compute branch conditions P cycles in advance, where *P* is pipeline length. CPUs to applying to large Massivel Why sho branch o rmance and decreasing tructions g-path." ch prediction en more. ng-path eliminates ⊢speculation. is **latency**. # The CPU pipeline # Cycle 5: | fetch | if(?) | |----------------|-----------| | decode | j=k+1 | | register read | d=e+f | | execute | a=b+c | | register write | <0? g=h-i | Fast-forward flag to fetch unit. Branch prediction has zero benefit **if programs compute branch conditions** *P* **cycles in advance**, where *P* is pipeline length. CPUs today spend applying simple control large volumes of Massively parallelia Why shouldn't probranch conditions • lia P) is nodern curacy on on. # The CPU pipeline # Cycle 5: | fetch | if(?) | |----------------|-----------| | decode | j=k+1 | | register read | d=e+f | | execute | a=b+c | | register write | <0? g=h-i | Fast-forward flag to fetch unit. Branch prediction has zero benefit **if programs compute branch conditions** *P* **cycles in advance**, where *P* is pipeline length. CPUs today spend almost a applying simple computation to large volumes of data. Massively parallelizable. Why shouldn't programs conbranch conditions in advance ### 3 ### The CPU pipeline ### Cycle 5: | fetch | if(?) | |----------------|-----------| | decode | j=k+1 | | register read | d=e+f | | execute | a=b+c | | register write | <0? g=h-i | Fast-forward flag to fetch unit. Branch prediction has zero benefit **if programs compute branch conditions** *P* **cycles in advance**, where *P* is pipeline length. CPUs today spend almost all time applying simple computations to large volumes of data. Massively parallelizable. Why shouldn't programs compute branch conditions in advance? # The CPU pipeline # Cycle 5: | fetch | if(?) | |----------------|-----------| | decode | j=k+1 | | register read | d=e+f | | execute | a=b+c | | register write | <0? g=h-i | Fast-forward flag to fetch unit. Branch prediction has zero benefit if programs compute branch conditions P cycles in advance, where P is pipeline length. CPUs today spend almost all time applying simple computations to large volumes of data. Massively parallelizable. 3 Why shouldn't programs compute branch conditions in advance? Most cases are handled by simple instruction scheduling. # The CPU pipeline # Cycle 5: | fetch | if(?) | |----------------|-----------| | decode | j=k+1 | | register read | d=e+f | | execute | a=b+c | | register write | <0? g=h-i | Fast-forward flag to fetch unit. Branch prediction has zero benefit **if programs compute branch conditions** *P* **cycles in advance**, where *P* is pipeline length. CPUs today spend almost all time applying simple computations to large volumes of data. Massively parallelizable. Why shouldn't programs compute branch conditions in advance? Most cases are handled by simple instruction scheduling. Insn-set extensions for more cases: "branch-relevant" priority bit; multiple flags; loop counter. (Count down early in pipeline.) Inner loops I've studied don't need more complicated patterns. | | if(?) | |-------|-----------| | | j=k+1 | | read | d=e+f | | | a=b+c | | write | <0? g=h-i | ward flag to fetch unit. prediction has zero benefit ams compute branch ons P cycles in advance, is pipeline length. CPUs today spend almost all time applying simple computations to large volumes of data. Massively parallelizable. Why shouldn't programs compute branch conditions in advance? Most cases are handled by simple instruction scheduling. Insn-set extensions for more cases: "branch-relevant" priority bit; multiple flags; loop counter. (Count down early in pipeline.) Inner loops I've studied don't need more complicated patterns. How did itself tha importai | | if(?) | |-----|-------| | | j=k+1 | | | d=e+f | | | a=b+c | | <0? | g=h-i | has zero benefit bute branch les in advance, e length. CPUs today spend almost all time applying simple computations to large volumes of data. Massively parallelizable. Why shouldn't programs compute branch conditions in advance? Most cases are handled by simple instruction scheduling. Insn-set extensions for more cases: "branch-relevant" priority bit; multiple flags; loop counter. (Count down early in pipeline.) Inner loops I've studied don't need more complicated patterns. How did the commitself that branch important for perf if(?) j=k+l d=e+f a=b+c g=h-i nit. penefit **ch** ance, CPUs today spend almost all time applying simple computations to large volumes of data. Massively parallelizable. Why shouldn't programs compute branch conditions in advance? Most cases are handled by simple instruction scheduling. Insn-set extensions for more cases: "branch-relevant" priority bit; multiple flags; loop counter. (Count down early in pipeline.) Inner loops I've studied don't need more complicated patterns. How did the community coritself that branch prediction important for performance? CPUs today spend almost all time applying simple computations to large volumes of data. Massively parallelizable. Why shouldn't programs compute branch conditions in advance? Most cases are handled by simple instruction scheduling. Insn-set extensions for more cases: "branch-relevant" priority bit; multiple flags; loop counter. (Count down early in pipeline.) Inner loops I've studied don't need more complicated patterns. How did the community convince itself that branch prediction is important for performance? CPUs today spend almost all time applying simple computations to large volumes of data. Massively parallelizable. Why shouldn't programs compute branch conditions in advance? Most cases are handled by simple instruction scheduling. Insn-set extensions for more cases: "branch-relevant" priority bit; multiple flags; loop counter. (Count down early in pipeline.) Inner loops I've studied don't need more complicated patterns. How did the community convince itself that branch prediction is important for performance? 1980s insn sets, CPU costs $\rightarrow$ 1990s compilers, applications, data volumes, compiled code $\rightarrow$ 1990s/2000s hype (e.g., "Since programs typically encounter branches every 4-6 instructions, inaccurate branch prediction causes a severe performance degradation in highly superscalar or deeply pipelined designs") → 2000s/2010s beliefs. day spend almost all time simple computations volumes of data. y parallelizable. ouldn't programs compute conditions in advance? ses are handled by nstruction scheduling. extensions for more cases: relevant" priority bit; flags; loop counter. down early in pipeline.) ops I've studied don't re complicated patterns. How did the community convince itself that branch prediction is important for performance? 1980s insn sets, CPU costs $\rightarrow$ 1990s compilers, applications, data volumes, compiled code $\rightarrow$ 1990s/2000s hype (e.g., "Since programs typically encounter branches every 4-6 instructions, inaccurate branch prediction causes a severe performance degradation in highly superscalar or deeply pipelined designs") → 2000s/2010s beliefs. The fund Can a w with wel remove ' for brane l almost all time mputations of data. in advance? ndled by scheduling. for more cases: priority bit; p counter. in pipeline.) udied don't cated patterns. How did the community convince itself that branch prediction is important for performance? 1980s insn sets, CPU costs $\rightarrow$ 1990s compilers, applications, data volumes, compiled code $\rightarrow$ 1990s/2000s hype (e.g., "Since programs typically encounter branches every 4-6 instructions, inaccurate branch prediction causes a severe performance degradation in highly superscalar or deeply pipelined designs") → 2000s/2010s beliefs. The fundamental of Can a well-designed with well-designed remove the speed for branch predicti ll time npute e? **5**. cases: t; ie.) erns. How did the community convince itself that branch prediction is important for performance? 1980s insn sets, CPU costs $\rightarrow$ 1990s compilers, applications, data volumes, compiled code $\rightarrow$ 1990s/2000s hype (e.g., "Since programs typically encounter branches every 4-6 instructions, inaccurate branch prediction causes a severe performance degradation in highly superscalar or deeply pipelined designs") $\rightarrow$ 2000s/2010s beliefs. The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? 1980s insn sets, CPU costs ightarrow1990s compilers, applications, data volumes, compiled code $\rightarrow$ 1990s/2000s hype (e.g., "Since programs typically encounter branches every 4-6 instructions, inaccurate branch prediction causes a severe performance degradation in highly superscalar or deeply pipelined designs") $\rightarrow$ 2000s/2010s beliefs. The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? 1980s insn sets, CPU costs $\rightarrow$ 1990s compilers, applications, data volumes, compiled code $\rightarrow$ 1990s/2000s hype (e.g., "Since programs typically encounter branches every 4–6 instructions, inaccurate branch prediction causes a severe performance degradation in highly superscalar or deeply pipelined designs") $\rightarrow$ 2000s/2010s beliefs. The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? "We need to look at current insn sets." 1980s insn sets, CPU costs $\rightarrow$ 1990s compilers, applications, data volumes, compiled code $\rightarrow$ 1990s/2000s hype (e.g., "Since programs typically encounter branches every 4-6 instructions, inaccurate branch prediction causes a severe performance degradation in highly superscalar or deeply pipelined designs") → 2000s/2010s beliefs. The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? "We need to look at current insn sets." — Yes, interesting short-term question. Not my question in this talk. 1980s insn sets, CPU costs $\rightarrow$ 1990s compilers, applications, data volumes, compiled code $\rightarrow$ 1990s/2000s hype (e.g., "Since programs typically encounter branches every 4-6 instructions, inaccurate branch prediction causes a severe performance degradation in highly superscalar or deeply pipelined designs") → 2000s/2010s beliefs. The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? "We need to look at current insn sets." — Yes, interesting short-term question. Not my question in this talk. "We need to look at badly written software." 5 How did the community convince itself that branch prediction is important for performance? 1980s insn sets, CPU costs $\rightarrow$ 1990s compilers, applications, data volumes, compiled code $\rightarrow$ 1990s/2000s hype (e.g., "Since programs typically encounter branches every 4-6 instructions, inaccurate branch prediction causes a severe performance degradation in highly superscalar or deeply pipelined designs") → 2000s/2010s beliefs. The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? "We need to look at current insn sets." — Yes, interesting short-term question. Not my question in this talk. "We need to look at badly written software." — No. Obsolete view of performance. Need well-designed software for good speed already today. the community convince at branch prediction is not for performance? sn sets, CPU costs $\rightarrow$ empilers, applications, umes, compiled code $\rightarrow$ 000s hype (e.g., "Since s typically encounter s every 4–6 instructions, te branch prediction severe performance tion in highly superscalar y pipelined designs") → 010s beliefs. The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? "We need to look at current insn sets." — Yes, interesting short-term question. Not my question in this talk. "We need to look at badly written software." — No. Obsolete view of performance. Need well-designed software for good speed already today. "Fundar compute for these Look at, Inspect nunity convince prediction is ormance? $PU costs \rightarrow$ ipplications, npiled code $\rightarrow$ (e.g., "Since encounter 5 instructions, prediction rformance hly superscalar d designs") ightarrow fs. The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? "We need to look at current insn sets." — Yes, interesting short-term question. Not my question in this talk. "We need to look at badly written software." — No. Obsolete view of performance. Need well-designed software for good speed already today. "Fundamentally, y compute branches for these important Look at, e.g., int Inspect data, bran 5 calar The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? "We need to look at current insn sets." — Yes, interesting short-term question. Not my question in this talk. "We need to look at badly written software." — No. Obsolete view of performance. Need well-designed software for good speed already today. "Fundamentally, you cannot compute branches in advance for these important computations Look at, e.g., int32[n] head inspect data, branch, repeat The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? "We need to look at current insn sets." — Yes, interesting short-term question. Not my question in this talk. "We need to look at badly written software." — No. Obsolete view of performance. Need well-designed software for good speed already today. "Fundamentally, you cannot compute branches in advance for these important computations. Look at, e.g., int32[n] heapsort. Inspect data, branch, repeat." The fundamental question: Can a well-designed insn set with well-designed software remove the speed incentive for branch prediction? "We need to look at current insn sets." — Yes, interesting short-term question. Not my question in this talk. "We need to look at badly written software." — No. Obsolete view of performance. Need well-designed software for good speed already today. "Fundamentally, you cannot compute branches in advance for these important computations. Look at, e.g., int32[n] heapsort. Inspect data, branch, repeat." — The current speed records for int32[n] sorting on Intel CPUs are held by sorting networks! Data-independent branches defined purely by n. Performance, parallelizability, predictability have clear connections. sorting.cr.yp.to: software + verification tools.