Reducing The Costs Of Comparisons and Branches

Combining Branches
Let's pretend for the moment that well predicted branches have zero cost and that all the code we need is ready in the instruction cache... If this were true, would we still want to eliminate branches? Yes.Not only are there the general overhead and scheduling costs to branches, but we cannot depend on the compiler to emit optimal-enough code when dealing with conditional expressions. Let's take a look at a simple example:
uint32_t test_1( uint32_t arg0, uint32_t arg1 ) { if ( ( ( arg0 & 0x00040000 ) != 0 ) || ( ( arg1 & 0x00080000 ) != 0 ) ) { return (arg0); } return (arg1); }
Pretty straightforward. We're testing a couple of bits in the arguments to decide which argument we want to return. I want to point out that this example is particularly easy to handle because the C short-circuiting rules do not apply here. The compiler is free to do both tests on either side of the OR because (it can be trivially found) that there are no side-effects. Let's look at the compiler output (I've added some comments):
test_1: andis. 9,3,4 # tmp9 = arg0 & ( 0x0004 << 16 ) # CR[0] = tmp9 <=> 0 rlwinm 9,4,13,31,31 # tmp9 = tmp0 mr 0,3 # tmp0 = arg0 stwu 1,-16(1) # sp[-16] = sp # sp = sp-16 mr 3,4 # result = arg1 cmpwi 7,9,0 # CR[7] = tmp0 <=> 0 # TEST_ARG0: bne- 0,.L3 # if_unlikely ( CR[0] != 0 ) goto COPY_ARG0 # i.e. ( arg0 & 0x00040000 ) != 0 ) # TEST_ARG1: beq- 7,.L1 # if_unlikely ( CR[7] == 0 ) goto DONE # i.e. ( arg1 & 0x00080000 ) == 0 ) .L3: # COPY_ARG0: mr 3,0 # result = tmp0 .L1: # DONE: addi 1,1,16 # sp = sp + 16 blr # RETURN (result)
Aside: Note that the stack pointer (sp) is being pushed and popped although the stack isn't being used in this function. Presumably, a function this small would be inlined so that would not occur. BTW, the -fomit-frame-pointer flag is ignored, if you're thinking that would help... I'm going to remove the stack push/pop code from any compiler output from this point on, unless the stack is actually being used. We'll assume the functions are inlined and that if they aren't you are aware of this cost.
This version of GCC always saves the stack pointer to the stack, whether or not you are using it. Inline small functions.
First notice the static branch prediction: (Syntax note: "-" added to a branch instruction indicates it is unlikely. Conversely, "+" indicates a likely branch.) It's unlikely that we will return arg0 because of the first test, but it is also unlikely that we will return arg1 due to the second test. Which means that it is unlikely this function will be fast. i.e.
if ( ( arg0 & 0x00040000 ) != 0 ) TEST_ARG0 is UNLIKELY, return(arg0) else if ( ( arg1 & 0x00080000 ) == 0 ) TEST_ARG1 is UNLIKELY, return(arg1) return (arg0)
The compiler has decided that the most likely scenario is that the first test fails and the second test succeeds. Was that what you expected? The and with arg0 is pretty interesting too:
andis. 9,3,4 # tmp9 = arg0 & ( 0x0004 << 16 ) # CR[0] = tmp9 <=> 0
"andis." means and with 16 bit immediate and shift, then update CR[0] -- The dot at the end of an instruction means "then update CR[0]". It's that dot that we should be concerned about. The CR is a very limited resource (32 bits = 8 x 4 bits) shared by all the execution units.
When multiple forms were found (most cases), sequences that did not set the Carry bit, had more parallelism, had fewer register operands, or generalized to 64-bit implementations were favored. A clever compiler might want to consider multiple sequences to minimize resource conflicts.
Prefer instructions that do not affect the Condition Register (CR).
Note the register moves:
mr 0,3 ... mr 3,4 ... mr 3,0
A lot of moves ("mr" means move register) among registers is usually a good indication of some kind of optimization barrior. Seeing three moves in such a short instruction stream is a good hint that something needs fixing.
Look for too many register moves in your code. By themselves they aren't that expensive, but they are often a good litmus test for slow code.
What we want: the whole test should be unlikely Following the logic that branches should be assumed not taken, we want to make it as simple as possible for the compiler to understand what we want. We can do this fairly easily by combining our comparison operations and only testing a single value. Here's the change:
uint32_t test_2( uint32_t arg0, uint32_t arg1 ) { uint32_t arg0_test = arg0 & 0x00040000; uint32_t arg1_test = arg1 & 0x00080000; uint32_t test_comb = arg0_test | arg1_test; if ( test_comb != 0 ) { return (arg0); } return (arg1); }
There is absolutely no possible way for the compiler to generate more than one branch in this scenario. We are only testing one value. The rest is simple bit operations. Can't be anything else. Really. Are you convinced yet? :) Take a look at the output:
test_3: rlwinm 9,3,0,13,13 # arg0_test = arg0 & 0x00040000 rlwinm 0,4,0,12,12 # arg1_test = arg1 & 0x00080000 or. 11,9,0 # test_comb = arg0_test | arg1_test # CR[0] = test_comb <=> 0 # TEST_TEST_COMB: bne- 0,.L7 # if_unlikely ( CR[0] != 0 ) goto DONE # i.e. ( test_comb != 0 ) mr 3,4 # result = arg1 .L7: # DONE: blr # RETURN (result)
Pretty much what you expected right?
bne- 0,.L7 # if_unlikely ( CR[0] != 0 ) goto DONE
The branch not-taken is unlikely. Everything is looking up. But did you see the "or." coming? That's right, the compiler's trying to be tricky and use the result of the or to determine the test value instead of doing a separate test. Again, prefer instruction sequences that do not modify or read the Condition Register.
Combine your comparisons into a single value before you branch. This makes the compiler's job easier and your code will be faster.
How can we reduce the dependencies on the Condition Register? Let me give you a hint. Here's the answer.
uint32_t test_3( uint32_t arg0, uint32_t arg1 ) { uint32_t arg0_test = arg0 & 0x00040000; uint32_t arg1_test = arg1 & 0x00080000; uint32_t test_comb = arg0_test | arg1_test; uint32_t mask_comb = (uint32_t)(((int32_t)( test_comb | -test_comb ))>>31); uint32_t result = ( arg0 & mask_comb ) | ( arg1 & (~mask_comb) ); return (result); }
And the final output:
test_4: rlwinm 8,3,0,13,13 rlwinm 10,4,0,12,12 or 7,8,10 neg 6,7 srawi 0,6,31 and 9,3,0 andc 3,4,0 or 3,9,3 blr