Understanding branch prediction efficiencyThe inner workings of Spectre (v2)Performance of “conditional call” on amd64How to efficiently count the number of keys/properties of an object in JavaScript?Efficiency of Java “Double Brace Initialization”?Efficiency of purely functional programmingIntel x86 0x2E/0x3E Prefix Branch Prediction actually used?d is less efficient than [0-9]Can branch prediction cause illegal instruction?Branch target prediction in conjunction with branch prediction?Intel CPUs Instruction Queue provides static branch prediction?Branch prediction and performanceBranch prediction & speculative fetch mitigation

Output a Super Mario Image

I was promised a work PC but still awaiting approval 3 months later so using my own laptop - Is it fair to ask employer for laptop insurance?

I asked for a graduate student position from a professor. He replied "welcome". What does that mean?

How are aircraft depainted?

Does an oscilloscope subtract voltages as phasors?

Telling my mother that I have anorexia without panicking her

2000s space film where an alien species has almost wiped out the human race in a war

Confirm the ending of a string

Is low emotional intelligence associated with right-wing and prejudiced attitudes?

Is a suit against a University Dorm for changing policies on a whim likely to succeed (USA)?

Is there a real-world mythological counterpart to WoW's "kill your gods for power" theme?

Are space camera sensors usually round, or square?

Is there a reliable way to hide/convey a message in vocal expressions (speech, song,...)

What is the CR of a Metallic Dragon that used Change Shape?

Why is the Digital 0 not 0V in computer systems?

What hard drive connector is this?

Finding the number of digits of a given integer.

Write a function that returns an iterable object of all valid points 4-directionally adjacent to (x, y)

If I want an interpretable model, are there methods other than Linear Regression?

Can a warforged druid use composite plating?

Some Prime Peerage

Is plasma membrane permeable to sucrose

Difference in using Lightning Component <lighting:badge/> and Normal DOM with slds <span class="slds-badge"></span>? Which is Better and Why?

Should you only use colons and periods in dialogues?

Understanding branch prediction efficiency

The inner workings of Spectre (v2)Performance of “conditional call” on amd64How to efficiently count the number of keys/properties of an object in JavaScript?Efficiency of Java “Double Brace Initialization”?Efficiency of purely functional programmingIntel x86 0x2E/0x3E Prefix Branch Prediction actually used?d is less efficient than [0-9]Can branch prediction cause illegal instruction?Branch target prediction in conjunction with branch prediction?Intel CPUs Instruction Queue provides static branch prediction?Branch prediction and performanceBranch prediction & speculative fetch mitigation

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;

I tried to measure branch prediction cost, I created a little program.

It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.

Now, I've expected, that if 1<<N is sufficiently large (like >100), then the branch predictor will not be effective (as it has to predict >100 random numbers). However, these are the results (on a 5820k machine), as N grows, the program becomes slower:

N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9

For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.

My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?

(Some more explanation: the code executes 2^32 branches, no matter of N. So I expected, that the code runs the same speed, no matter of N, because the branch cannot be predicted at all. But it seems that if buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)

Here's the code:

#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = 314159165, 27182818 ;
// volatile uint64_t init[2] = 0, 0 ;
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2]) 
 uint64_t s1 = s[0];
 uint64_t s0 = s[1];
 uint64_t result = s0 + s1;
 s[0] = s0;
 s1 ^= s1 << 23;
 s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
 return result;


int main() 
 uint64_t s[2];
 s[0] = init[0];
 s[1] = init[1];

 uint64_t sum = 0;

#if 1
 const int N = 16;

 unsigned char buffer[1<<N];
 for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

 for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) 
 for (int j=0; j<1<<N; j++) 
 if (buffer[j]) 
 sum += one;
 
 
 
#else
 for (uint64_t i=0; i<uint64_t(1)<<32; i++) 
 if (next(s)&1) 
 sum += one;
 
 

#endif
 std::cout<<sum<<"n";

(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)

Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):

 401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
 401274: 74 07 je 40127d <main+0xad>
 401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
 40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
 401282: 74 07 je 40128b <main+0xbb>
 401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
 40128b: 48 83 c1 02 add rcx,0x2
 40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
 401296: 75 d8 jne 401270 <main+0xa0>

edited Mar 29 at 5:18

Peter Cordes

157k23 gold badges252 silver badges403 bronze badges

asked Mar 28 at 10:30

geza

16.5k3 gold badges40 silver badges96 bronze badges

1

Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

– Hadi Brais
Mar 28 at 12:35

I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

– Hadi Brais
Mar 28 at 12:40

1

In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

– Brendan
Mar 31 at 13:56

1

@Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

– geza
Mar 31 at 14:06

1

@Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

– geza
Apr 1 at 9:48

|
show 3 more comments

I tried to measure branch prediction cost, I created a little program.

It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.

N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9

For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.

My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?

Here's the code:

#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = 314159165, 27182818 ;
// volatile uint64_t init[2] = 0, 0 ;
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2]) 
 uint64_t s1 = s[0];
 uint64_t s0 = s[1];
 uint64_t result = s0 + s1;
 s[0] = s0;
 s1 ^= s1 << 23;
 s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
 return result;


int main() 
 uint64_t s[2];
 s[0] = init[0];
 s[1] = init[1];

 uint64_t sum = 0;

#if 1
 const int N = 16;

 unsigned char buffer[1<<N];
 for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

 for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) 
 for (int j=0; j<1<<N; j++) 
 if (buffer[j]) 
 sum += one;
 
 
 
#else
 for (uint64_t i=0; i<uint64_t(1)<<32; i++) 
 if (next(s)&1) 
 sum += one;
 
 

#endif
 std::cout<<sum<<"n";

(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)

Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):

 401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
 401274: 74 07 je 40127d <main+0xad>
 401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
 40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
 401282: 74 07 je 40128b <main+0xbb>
 401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
 40128b: 48 83 c1 02 add rcx,0x2
 40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
 401296: 75 d8 jne 401270 <main+0xa0>

edited Mar 29 at 5:18

Peter Cordes

157k23 gold badges252 silver badges403 bronze badges

asked Mar 28 at 10:30

geza

16.5k3 gold badges40 silver badges96 bronze badges

1

Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

– Hadi Brais
Mar 28 at 12:35

I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

– Hadi Brais
Mar 28 at 12:40

1

In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

– Brendan
Mar 31 at 13:56

1

@Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

– geza
Mar 31 at 14:06

1

@Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

– geza
Apr 1 at 9:48

|
show 3 more comments

I tried to measure branch prediction cost, I created a little program.

It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.

N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9

For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.

My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?

Here's the code:

#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = 314159165, 27182818 ;
// volatile uint64_t init[2] = 0, 0 ;
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2]) 
 uint64_t s1 = s[0];
 uint64_t s0 = s[1];
 uint64_t result = s0 + s1;
 s[0] = s0;
 s1 ^= s1 << 23;
 s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
 return result;


int main() 
 uint64_t s[2];
 s[0] = init[0];
 s[1] = init[1];

 uint64_t sum = 0;

#if 1
 const int N = 16;

 unsigned char buffer[1<<N];
 for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

 for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) 
 for (int j=0; j<1<<N; j++) 
 if (buffer[j]) 
 sum += one;
 
 
 
#else
 for (uint64_t i=0; i<uint64_t(1)<<32; i++) 
 if (next(s)&1) 
 sum += one;
 
 

#endif
 std::cout<<sum<<"n";

(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)

Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):

 401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
 401274: 74 07 je 40127d <main+0xad>
 401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
 40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
 401282: 74 07 je 40128b <main+0xbb>
 401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
 40128b: 48 83 c1 02 add rcx,0x2
 40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
 401296: 75 d8 jne 401270 <main+0xa0>

edited Mar 29 at 5:18

Peter Cordes

157k23 gold badges252 silver badges403 bronze badges

asked Mar 28 at 10:30

geza

16.5k3 gold badges40 silver badges96 bronze badges

I tried to measure branch prediction cost, I created a little program.

It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.

N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9

For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.

My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?

Here's the code:

#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = 314159165, 27182818 ;
// volatile uint64_t init[2] = 0, 0 ;
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2]) 
 uint64_t s1 = s[0];
 uint64_t s0 = s[1];
 uint64_t result = s0 + s1;
 s[0] = s0;
 s1 ^= s1 << 23;
 s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
 return result;


int main() 
 uint64_t s[2];
 s[0] = init[0];
 s[1] = init[1];

 uint64_t sum = 0;

#if 1
 const int N = 16;

 unsigned char buffer[1<<N];
 for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

 for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) 
 for (int j=0; j<1<<N; j++) 
 if (buffer[j]) 
 sum += one;
 
 
 
#else
 for (uint64_t i=0; i<uint64_t(1)<<32; i++) 
 if (next(s)&1) 
 sum += one;
 
 

#endif
 std::cout<<sum<<"n";

(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)

Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):

 401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
 401274: 74 07 je 40127d <main+0xad>
 401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
 40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
 401282: 74 07 je 40128b <main+0xbb>
 401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
 40128b: 48 83 c1 02 add rcx,0x2
 40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
 401296: 75 d8 jne 401270 <main+0xa0>

performance x86 x86-64 cpu-architecture branch-prediction

edited Mar 29 at 5:18

Peter Cordes

157k23 gold badges252 silver badges403 bronze badges

asked Mar 28 at 10:30

geza

16.5k3 gold badges40 silver badges96 bronze badges

edited Mar 29 at 5:18

Peter Cordes

157k23 gold badges252 silver badges403 bronze badges

asked Mar 28 at 10:30

geza

16.5k3 gold badges40 silver badges96 bronze badges

edited Mar 29 at 5:18

Peter Cordes

157k23 gold badges252 silver badges403 bronze badges

edited Mar 29 at 5:18

Peter Cordes

157k23 gold badges252 silver badges403 bronze badges

edited Mar 29 at 5:18

Peter Cordes

157k23 gold badges252 silver badges403 bronze badges

asked Mar 28 at 10:30

geza

16.5k3 gold badges40 silver badges96 bronze badges

asked Mar 28 at 10:30

geza

16.5k3 gold badges40 silver badges96 bronze badges

asked Mar 28 at 10:30

geza

16.5k3 gold badges40 silver badges96 bronze badges

1

Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

– Hadi Brais
Mar 28 at 12:35

I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

– Hadi Brais
Mar 28 at 12:40

1

In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

– Brendan
Mar 31 at 13:56

1

@Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

– geza
Mar 31 at 14:06

1

@Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

– geza
Apr 1 at 9:48

|
show 3 more comments

1

Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

– Hadi Brais
Mar 28 at 12:35

I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

– Hadi Brais
Mar 28 at 12:40

1

In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

– Brendan
Mar 31 at 13:56

1

@Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

– geza
Mar 31 at 14:06

1

@Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

– geza
Apr 1 at 9:48

Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history.

– Hadi Brais
Mar 28 at 12:35

I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16.

– Hadi Brais
Mar 28 at 12:40

In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation.

– Brendan
Mar 31 at 13:56

@Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out perf stat. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more.

– geza
Mar 31 at 14:06

@Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not.

– geza
Apr 1 at 9:48

|
show 3 more comments

1 Answer
1

active

oldest

votes

Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:

N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9

Note: branch-misses (%) is calculated for 2^32 branches

As you can see, when N<=12, branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses starts to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.

The time taken can be approximated by looking at the time and branch-misses (%) column: I've added the last column, approx-time. I've calculated it by this: 2.2+(12.9-2.2)*branch-misses %/100. As you can see, approx-time equals to time (not considering rounding error). So this effect can be explained perfectly by branch prediction.

The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):

(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.

edited Mar 31 at 11:51

answered Mar 31 at 10:43

geza

16.5k3 gold badges40 silver badges96 bronze badges

1

Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

– Hadi Brais
Mar 31 at 11:38

@HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

– geza
Mar 31 at 11:49

Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

– Hadi Brais
Mar 31 at 13:05

...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

– Hadi Brais
Mar 31 at 13:05

@HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

– geza
Mar 31 at 13:22

|
show 10 more comments

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55395350%2funderstanding-branch-prediction-efficiency%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:

N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9

Note: branch-misses (%) is calculated for 2^32 branches

The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):

(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.

edited Mar 31 at 11:51

answered Mar 31 at 10:43

geza

16.5k3 gold badges40 silver badges96 bronze badges

1

Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

– Hadi Brais
Mar 31 at 11:38

@HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

– geza
Mar 31 at 11:49

Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

– Hadi Brais
Mar 31 at 13:05

...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

– Hadi Brais
Mar 31 at 13:05

@HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

– geza
Mar 31 at 13:22

|
show 10 more comments

Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:

N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9

Note: branch-misses (%) is calculated for 2^32 branches

The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):

(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.

edited Mar 31 at 11:51

answered Mar 31 at 10:43

geza

16.5k3 gold badges40 silver badges96 bronze badges

1

Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

– Hadi Brais
Mar 31 at 11:38

@HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

– geza
Mar 31 at 11:49

Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

– Hadi Brais
Mar 31 at 13:05

...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

– Hadi Brais
Mar 31 at 13:05

@HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

– geza
Mar 31 at 13:22

|
show 10 more comments

Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:

N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9

Note: branch-misses (%) is calculated for 2^32 branches

The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):

(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.

edited Mar 31 at 11:51

answered Mar 31 at 10:43

geza

16.5k3 gold badges40 silver badges96 bronze badges

Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:

N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9

Note: branch-misses (%) is calculated for 2^32 branches

The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):

(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.

edited Mar 31 at 11:51

answered Mar 31 at 10:43

geza

16.5k3 gold badges40 silver badges96 bronze badges

edited Mar 31 at 11:51

answered Mar 31 at 10:43

geza

16.5k3 gold badges40 silver badges96 bronze badges

answered Mar 31 at 10:43

geza

16.5k3 gold badges40 silver badges96 bronze badges

answered Mar 31 at 10:43

geza

16.5k3 gold badges40 silver badges96 bronze badges

1

Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

– Hadi Brais
Mar 31 at 11:38

@HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

– geza
Mar 31 at 11:49

Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

– Hadi Brais
Mar 31 at 13:05

...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

– Hadi Brais
Mar 31 at 13:05

@HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

– geza
Mar 31 at 13:22

|
show 10 more comments

1

Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

– Hadi Brais
Mar 31 at 11:38

@HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

– geza
Mar 31 at 11:49

Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

– Hadi Brais
Mar 31 at 13:05

...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

– Hadi Brais
Mar 31 at 13:05

@HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

– geza
Mar 31 at 13:22

Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.

– Hadi Brais
Mar 31 at 11:38

@HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only.

– geza
Mar 31 at 11:49

Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition if (buffer[j]) is always true (easily predicted)...

– Hadi Brais
Mar 31 at 13:05

...This allows to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).

– Hadi Brais
Mar 31 at 13:05

@HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074).

– geza
Mar 31 at 13:22

|
show 10 more comments

Got a question that you can’t ask on public Stack Overflow? Learn more about sharing private information with Stack Overflow for Teams.

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Styjun

1 Answer
1

Your Answer

Post as a guest

1 Answer
1

1 Answer
1

Post as a guest

Popular posts from this blog

Kamusi Yaliyomo Aina za kamusi | Muundo wa kamusi | Faida za kamusi | Dhima ya picha katika kamusi | Marejeo | Tazama pia | Viungo vya nje | UrambazajiKuhusu kamusiGo-SwahiliWiki-KamusiKamusi ya Kiswahili na Kiingerezakuihariri na kuongeza habari

은진 송씨 목차 역사 본관 분파 인물 조선 왕실과의 인척 관계 집성촌 항렬자 인구 같이 보기 각주 둘러보기 메뉴은진 송씨세종실록 149권, 지리지 충청도 공주목 은진현

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

1 Answer 1

1 Answer 1

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Kamusi Yaliyomo Aina za kamusi | Muundo wa kamusi | Faida za kamusi | Dhima ya picha katika kamusi | Marejeo | Tazama pia | Viungo vya nje | UrambazajiKuhusu kamusiGo-SwahiliWiki-KamusiKamusi ya Kiswahili na Kiingerezakuihariri na kuongeza habari

은진 송씨 목차 역사 본관 분파 인물 조선 왕실과의 인척 관계 집성촌 항렬자 인구 같이 보기 각주 둘러보기 메뉴은진 송씨세종실록 149권, 지리지 충청도 공주목 은진현

1 Answer
1

1 Answer
1

1 Answer
1