Understanding branch prediction efficiencyThe inner workings of Spectre (v2)Performance of “conditional call” on amd64How to efficiently count the number of keys/properties of an object in JavaScript?Efficiency of Java “Double Brace Initialization”?Efficiency of purely functional programmingIntel x86 0x2E/0x3E Prefix Branch Prediction actually used?d is less efficient than [0-9]Can branch prediction cause illegal instruction?Branch target prediction in conjunction with branch prediction?Intel CPUs Instruction Queue provides static branch prediction?Branch prediction and performanceBranch prediction & speculative fetch mitigation

Output a Super Mario Image

I was promised a work PC but still awaiting approval 3 months later so using my own laptop - Is it fair to ask employer for laptop insurance?

I asked for a graduate student position from a professor. He replied "welcome". What does that mean?

How are aircraft depainted?

Does an oscilloscope subtract voltages as phasors?

Telling my mother that I have anorexia without panicking her

2000s space film where an alien species has almost wiped out the human race in a war

Confirm the ending of a string

Is low emotional intelligence associated with right-wing and prejudiced attitudes?

Is a suit against a University Dorm for changing policies on a whim likely to succeed (USA)?

Is there a real-world mythological counterpart to WoW's "kill your gods for power" theme?

Are space camera sensors usually round, or square?

Is there a reliable way to hide/convey a message in vocal expressions (speech, song,...)

What is the CR of a Metallic Dragon that used Change Shape?

Why is the Digital 0 not 0V in computer systems?

What hard drive connector is this?

Finding the number of digits of a given integer.

Write a function that returns an iterable object of all valid points 4-directionally adjacent to (x, y)

If I want an interpretable model, are there methods other than Linear Regression?

Can a warforged druid use composite plating?

Some Prime Peerage

Is plasma membrane permeable to sucrose

Difference in using Lightning Component <lighting:badge/> and Normal DOM with slds <span class="slds-badge"></span>? Which is Better and Why?

Should you only use colons and periods in dialogues?



Understanding branch prediction efficiency


The inner workings of Spectre (v2)Performance of “conditional call” on amd64How to efficiently count the number of keys/properties of an object in JavaScript?Efficiency of Java “Double Brace Initialization”?Efficiency of purely functional programmingIntel x86 0x2E/0x3E Prefix Branch Prediction actually used?d is less efficient than [0-9]Can branch prediction cause illegal instruction?Branch target prediction in conjunction with branch prediction?Intel CPUs Instruction Queue provides static branch prediction?Branch prediction and performanceBranch prediction & speculative fetch mitigation






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








1















I tried to measure branch prediction cost, I created a little program.



It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.



Now, I've expected, that if 1<<N is sufficiently large (like >100), then the branch predictor will not be effective (as it has to predict >100 random numbers). However, these are the results (on a 5820k machine), as N grows, the program becomes slower:



N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9


For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.



My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?



(Some more explanation: the code executes 2^32 branches, no matter of N. So I expected, that the code runs the same speed, no matter of N, because the branch cannot be predicted at all. But it seems that if buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)



Here's the code:



#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = 314159165, 27182818 ;
// volatile uint64_t init[2] = 0, 0 ;
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2])
uint64_t s1 = s[0];
uint64_t s0 = s[1];
uint64_t result = s0 + s1;
s[0] = s0;
s1 ^= s1 << 23;
s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
return result;


int main()
uint64_t s[2];
s[0] = init[0];
s[1] = init[1];

uint64_t sum = 0;

#if 1
const int N = 16;

unsigned char buffer[1<<N];
for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++)
for (int j=0; j<1<<N; j++)
if (buffer[j])
sum += one;



#else
for (uint64_t i=0; i<uint64_t(1)<<32; i++)
if (next(s)&1)
sum += one;



#endif
std::cout<<sum<<"n";



(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)



Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):



 401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
401274: 74 07 je 40127d <main+0xad>
401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
401282: 74 07 je 40128b <main+0xbb>
401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
40128b: 48 83 c1 02 add rcx,0x2
40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
401296: 75 d8 jne 401270 <main+0xa0>









share|improve this question





















  • 1





    Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

    – Hadi Brais
    Mar 28 at 12:35











  • I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

    – Hadi Brais
    Mar 28 at 12:40







  • 1





    In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

    – Brendan
    Mar 31 at 13:56






  • 1





    @Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

    – geza
    Mar 31 at 14:06







  • 1





    @Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

    – geza
    Apr 1 at 9:48


















1















I tried to measure branch prediction cost, I created a little program.



It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.



Now, I've expected, that if 1<<N is sufficiently large (like >100), then the branch predictor will not be effective (as it has to predict >100 random numbers). However, these are the results (on a 5820k machine), as N grows, the program becomes slower:



N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9


For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.



My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?



(Some more explanation: the code executes 2^32 branches, no matter of N. So I expected, that the code runs the same speed, no matter of N, because the branch cannot be predicted at all. But it seems that if buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)



Here's the code:



#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = 314159165, 27182818 ;
// volatile uint64_t init[2] = 0, 0 ;
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2])
uint64_t s1 = s[0];
uint64_t s0 = s[1];
uint64_t result = s0 + s1;
s[0] = s0;
s1 ^= s1 << 23;
s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
return result;


int main()
uint64_t s[2];
s[0] = init[0];
s[1] = init[1];

uint64_t sum = 0;

#if 1
const int N = 16;

unsigned char buffer[1<<N];
for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++)
for (int j=0; j<1<<N; j++)
if (buffer[j])
sum += one;



#else
for (uint64_t i=0; i<uint64_t(1)<<32; i++)
if (next(s)&1)
sum += one;



#endif
std::cout<<sum<<"n";



(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)



Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):



 401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
401274: 74 07 je 40127d <main+0xad>
401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
401282: 74 07 je 40128b <main+0xbb>
401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
40128b: 48 83 c1 02 add rcx,0x2
40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
401296: 75 d8 jne 401270 <main+0xa0>









share|improve this question





















  • 1





    Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

    – Hadi Brais
    Mar 28 at 12:35











  • I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

    – Hadi Brais
    Mar 28 at 12:40







  • 1





    In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

    – Brendan
    Mar 31 at 13:56






  • 1





    @Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

    – geza
    Mar 31 at 14:06







  • 1





    @Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

    – geza
    Apr 1 at 9:48














1












1








1








I tried to measure branch prediction cost, I created a little program.



It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.



Now, I've expected, that if 1<<N is sufficiently large (like >100), then the branch predictor will not be effective (as it has to predict >100 random numbers). However, these are the results (on a 5820k machine), as N grows, the program becomes slower:



N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9


For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.



My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?



(Some more explanation: the code executes 2^32 branches, no matter of N. So I expected, that the code runs the same speed, no matter of N, because the branch cannot be predicted at all. But it seems that if buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)



Here's the code:



#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = 314159165, 27182818 ;
// volatile uint64_t init[2] = 0, 0 ;
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2])
uint64_t s1 = s[0];
uint64_t s0 = s[1];
uint64_t result = s0 + s1;
s[0] = s0;
s1 ^= s1 << 23;
s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
return result;


int main()
uint64_t s[2];
s[0] = init[0];
s[1] = init[1];

uint64_t sum = 0;

#if 1
const int N = 16;

unsigned char buffer[1<<N];
for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++)
for (int j=0; j<1<<N; j++)
if (buffer[j])
sum += one;



#else
for (uint64_t i=0; i<uint64_t(1)<<32; i++)
if (next(s)&1)
sum += one;



#endif
std::cout<<sum<<"n";



(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)



Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):



 401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
401274: 74 07 je 40127d <main+0xad>
401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
401282: 74 07 je 40128b <main+0xbb>
401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
40128b: 48 83 c1 02 add rcx,0x2
40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
401296: 75 d8 jne 401270 <main+0xa0>









share|improve this question
















I tried to measure branch prediction cost, I created a little program.



It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.



Now, I've expected, that if 1<<N is sufficiently large (like >100), then the branch predictor will not be effective (as it has to predict >100 random numbers). However, these are the results (on a 5820k machine), as N grows, the program becomes slower:



N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9


For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.



My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?



(Some more explanation: the code executes 2^32 branches, no matter of N. So I expected, that the code runs the same speed, no matter of N, because the branch cannot be predicted at all. But it seems that if buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)



Here's the code:



#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = 314159165, 27182818 ;
// volatile uint64_t init[2] = 0, 0 ;
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2])
uint64_t s1 = s[0];
uint64_t s0 = s[1];
uint64_t result = s0 + s1;
s[0] = s0;
s1 ^= s1 << 23;
s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
return result;


int main()
uint64_t s[2];
s[0] = init[0];
s[1] = init[1];

uint64_t sum = 0;

#if 1
const int N = 16;

unsigned char buffer[1<<N];
for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++)
for (int j=0; j<1<<N; j++)
if (buffer[j])
sum += one;



#else
for (uint64_t i=0; i<uint64_t(1)<<32; i++)
if (next(s)&1)
sum += one;



#endif
std::cout<<sum<<"n";



(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)



Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):



 401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
401274: 74 07 je 40127d <main+0xad>
401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
401282: 74 07 je 40128b <main+0xbb>
401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
40128b: 48 83 c1 02 add rcx,0x2
40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
401296: 75 d8 jne 401270 <main+0xa0>






performance x86 x86-64 cpu-architecture branch-prediction






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Mar 29 at 5:18









Peter Cordes

157k23 gold badges252 silver badges403 bronze badges




157k23 gold badges252 silver badges403 bronze badges










asked Mar 28 at 10:30









gezageza

16.5k3 gold badges40 silver badges96 bronze badges




16.5k3 gold badges40 silver badges96 bronze badges










  • 1





    Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

    – Hadi Brais
    Mar 28 at 12:35











  • I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

    – Hadi Brais
    Mar 28 at 12:40







  • 1





    In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

    – Brendan
    Mar 31 at 13:56






  • 1





    @Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

    – geza
    Mar 31 at 14:06







  • 1





    @Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

    – geza
    Apr 1 at 9:48













  • 1





    Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

    – Hadi Brais
    Mar 28 at 12:35











  • I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

    – Hadi Brais
    Mar 28 at 12:40







  • 1





    In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

    – Brendan
    Mar 31 at 13:56






  • 1





    @Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

    – geza
    Mar 31 at 14:06







  • 1





    @Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

    – geza
    Apr 1 at 9:48








1




1





Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

– Hadi Brais
Mar 28 at 12:35





Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

– Hadi Brais
Mar 28 at 12:35













I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

– Hadi Brais
Mar 28 at 12:40






I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

– Hadi Brais
Mar 28 at 12:40





1




1





In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

– Brendan
Mar 31 at 13:56





In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

– Brendan
Mar 31 at 13:56




1




1





@Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

– geza
Mar 31 at 14:06






@Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

– geza
Mar 31 at 14:06





1




1





@Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

– geza
Apr 1 at 9:48






@Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

– geza
Apr 1 at 9:48













1 Answer
1






active

oldest

votes


















1
















Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:



N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9

Note: branch-misses (%) is calculated for 2^32 branches


As you can see, when N<=12, branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses starts to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.



The time taken can be approximated by looking at the time and branch-misses (%) column: I've added the last column, approx-time. I've calculated it by this: 2.2+(12.9-2.2)*branch-misses %/100. As you can see, approx-time equals to time (not considering rounding error). So this effect can be explained perfectly by branch prediction.



The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):



(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.





share|improve this answer






















  • 1





    Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

    – Hadi Brais
    Mar 31 at 11:38











  • @HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

    – geza
    Mar 31 at 11:49












  • Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

    – Hadi Brais
    Mar 31 at 13:05











  • ...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

    – Hadi Brais
    Mar 31 at 13:05











  • @HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

    – geza
    Mar 31 at 13:22










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);














draft saved

draft discarded
















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55395350%2funderstanding-branch-prediction-efficiency%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1
















Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:



N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9

Note: branch-misses (%) is calculated for 2^32 branches


As you can see, when N<=12, branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses starts to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.



The time taken can be approximated by looking at the time and branch-misses (%) column: I've added the last column, approx-time. I've calculated it by this: 2.2+(12.9-2.2)*branch-misses %/100. As you can see, approx-time equals to time (not considering rounding error). So this effect can be explained perfectly by branch prediction.



The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):



(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.





share|improve this answer






















  • 1





    Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

    – Hadi Brais
    Mar 31 at 11:38











  • @HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

    – geza
    Mar 31 at 11:49












  • Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

    – Hadi Brais
    Mar 31 at 13:05











  • ...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

    – Hadi Brais
    Mar 31 at 13:05











  • @HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

    – geza
    Mar 31 at 13:22















1
















Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:



N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9

Note: branch-misses (%) is calculated for 2^32 branches


As you can see, when N<=12, branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses starts to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.



The time taken can be approximated by looking at the time and branch-misses (%) column: I've added the last column, approx-time. I've calculated it by this: 2.2+(12.9-2.2)*branch-misses %/100. As you can see, approx-time equals to time (not considering rounding error). So this effect can be explained perfectly by branch prediction.



The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):



(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.





share|improve this answer






















  • 1





    Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

    – Hadi Brais
    Mar 31 at 11:38











  • @HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

    – geza
    Mar 31 at 11:49












  • Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

    – Hadi Brais
    Mar 31 at 13:05











  • ...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

    – Hadi Brais
    Mar 31 at 13:05











  • @HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

    – geza
    Mar 31 at 13:22













1














1










1









Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:



N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9

Note: branch-misses (%) is calculated for 2^32 branches


As you can see, when N<=12, branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses starts to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.



The time taken can be approximated by looking at the time and branch-misses (%) column: I've added the last column, approx-time. I've calculated it by this: 2.2+(12.9-2.2)*branch-misses %/100. As you can see, approx-time equals to time (not considering rounding error). So this effect can be explained perfectly by branch prediction.



The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):



(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.





share|improve this answer















Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:



N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9

Note: branch-misses (%) is calculated for 2^32 branches


As you can see, when N<=12, branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses starts to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.



The time taken can be approximated by looking at the time and branch-misses (%) column: I've added the last column, approx-time. I've calculated it by this: 2.2+(12.9-2.2)*branch-misses %/100. As you can see, approx-time equals to time (not considering rounding error). So this effect can be explained perfectly by branch prediction.



The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):



(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.






share|improve this answer














share|improve this answer



share|improve this answer








edited Mar 31 at 11:51

























answered Mar 31 at 10:43









gezageza

16.5k3 gold badges40 silver badges96 bronze badges




16.5k3 gold badges40 silver badges96 bronze badges










  • 1





    Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

    – Hadi Brais
    Mar 31 at 11:38











  • @HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

    – geza
    Mar 31 at 11:49












  • Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

    – Hadi Brais
    Mar 31 at 13:05











  • ...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

    – Hadi Brais
    Mar 31 at 13:05











  • @HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

    – geza
    Mar 31 at 13:22












  • 1





    Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

    – Hadi Brais
    Mar 31 at 11:38











  • @HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

    – geza
    Mar 31 at 11:49












  • Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

    – Hadi Brais
    Mar 31 at 13:05











  • ...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

    – Hadi Brais
    Mar 31 at 13:05











  • @HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

    – geza
    Mar 31 at 13:22







1




1





Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

– Hadi Brais
Mar 31 at 11:38





Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

– Hadi Brais
Mar 31 at 11:38













@HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

– geza
Mar 31 at 11:49






@HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

– geza
Mar 31 at 11:49














Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

– Hadi Brais
Mar 31 at 13:05





Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

– Hadi Brais
Mar 31 at 13:05













...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

– Hadi Brais
Mar 31 at 13:05





...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

– Hadi Brais
Mar 31 at 13:05













@HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

– geza
Mar 31 at 13:22





@HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

– geza
Mar 31 at 13:22








Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.







Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.




















draft saved

draft discarded















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55395350%2funderstanding-branch-prediction-efficiency%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Kamusi Yaliyomo Aina za kamusi | Muundo wa kamusi | Faida za kamusi | Dhima ya picha katika kamusi | Marejeo | Tazama pia | Viungo vya nje | UrambazajiKuhusu kamusiGo-SwahiliWiki-KamusiKamusi ya Kiswahili na Kiingerezakuihariri na kuongeza habari

Swift 4 - func physicsWorld not invoked on collision? The Next CEO of Stack OverflowHow to call Objective-C code from Swift#ifdef replacement in the Swift language@selector() in Swift?#pragma mark in Swift?Swift for loop: for index, element in array?dispatch_after - GCD in Swift?Swift Beta performance: sorting arraysSplit a String into an array in Swift?The use of Swift 3 @objc inference in Swift 4 mode is deprecated?How to optimize UITableViewCell, because my UITableView lags

Access current req object everywhere in Node.js ExpressWhy are global variables considered bad practice? (node.js)Using req & res across functionsHow do I get the path to the current script with Node.js?What is Node.js' Connect, Express and “middleware”?Node.js w/ express error handling in callbackHow to access the GET parameters after “?” in Express?Modify Node.js req object parametersAccess “app” variable inside of ExpressJS/ConnectJS middleware?Node.js Express app - request objectAngular Http Module considered middleware?Session variables in ExpressJSAdd properties to the req object in expressjs with Typescript