Combining Branches
Let's pretend for the moment that well predicted branches have zero
cost and that all the code we need is ready in the instruction cache...
If this were true, would we still want to eliminate branches?
Yes.Not only are there the general overhead and scheduling costs
to branches, but we cannot depend on the compiler to emit
optimal-enough code when dealing with conditional expressions. Let's
take a look at a simple example:
uint32_t test_1( uint32_t arg0, uint32_t arg1 )
{
if ( ( ( arg0 & 0x00040000 ) != 0 ) || ( ( arg1 & 0x00080000 ) != 0 ) )
{
return (arg0);
}
return (arg1);
}
Pretty straightforward. We're testing a couple of bits in the arguments
to decide which argument we want to return.
I want to point out that this example is particularly easy to handle
because the C short-circuiting rules do not apply here. The compiler is
free to do both tests on either side of the OR because (it can be
trivially found) that there are no side-effects.
Let's look at the compiler output (I've added some comments):
test_1:
andis. 9,3,4 # tmp9 = arg0 & ( 0x0004 << 16 )
# CR[0] = tmp9 <=> 0
rlwinm 9,4,13,31,31 # tmp9 = tmp0
mr 0,3 # tmp0 = arg0
stwu 1,-16(1) # sp[-16] = sp
# sp = sp-16
mr 3,4 # result = arg1
cmpwi 7,9,0 # CR[7] = tmp0 <=> 0
# TEST_ARG0:
bne- 0,.L3 # if_unlikely ( CR[0] != 0 ) goto COPY_ARG0
# i.e. ( arg0 & 0x00040000 ) != 0 )
# TEST_ARG1:
beq- 7,.L1 # if_unlikely ( CR[7] == 0 ) goto DONE
# i.e. ( arg1 & 0x00080000 ) == 0 )
.L3: # COPY_ARG0:
mr 3,0 # result = tmp0
.L1: # DONE:
addi 1,1,16 # sp = sp + 16
blr # RETURN (result)
Aside: Note that the stack pointer (sp) is being pushed
and popped although the stack isn't being used in this function.
Presumably, a function this small would be inlined so that would not occur. BTW, the -fomit-frame-pointer
flag is ignored, if you're thinking that would help... I'm going to
remove the stack push/pop code from any compiler output from this point
on, unless the stack is actually being used. We'll assume the functions
are inlined and that if they aren't you are aware of this cost.
This version of GCC always saves the stack pointer to the stack, whether or not you are using it. Inline small functions.
First notice the static branch prediction:
(Syntax note: "-" added to a branch instruction indicates it is unlikely. Conversely, "+" indicates a likely branch.)
It's unlikely that we will return arg0 because of the first test, but it is also unlikely that we will return arg1 due to the second test. Which means that it is unlikely this function will be fast.
i.e.
if ( ( arg0 & 0x00040000 ) != 0 ) TEST_ARG0 is UNLIKELY, return(arg0)
else if ( ( arg1 & 0x00080000 ) == 0 ) TEST_ARG1 is UNLIKELY, return(arg1)
return (arg0)
The compiler has decided that the most likely scenario is that the first test fails and the second test succeeds. Was that what you expected?
The and with arg0 is pretty interesting too:
andis. 9,3,4 # tmp9 = arg0 & ( 0x0004 << 16 )
# CR[0] = tmp9 <=> 0
"andis." means and with 16 bit immediate and shift, then update CR[0]
-- The dot at the end of an instruction means "then update CR[0]". It's
that dot that we should be concerned about. The CR is a very limited
resource (32 bits = 8 x 4 bits) shared by all the execution units. When multiple forms were found (most cases),
sequences that did not set the Carry bit, had more parallelism, had
fewer register operands, or generalized to 64-bit implementations were
favored. A clever compiler might want to consider multiple sequences to
minimize resource conflicts.
Prefer instructions that do not affect the Condition Register (CR).
Note the register moves:
mr 0,3
...
mr 3,4
...
mr 3,0
A lot of moves ("mr" means move register) among
registers is usually a good indication of some kind of optimization
barrior. Seeing three moves in such a short instruction stream is a
good hint that something needs fixing.
Look for too many register moves in your
code. By themselves they aren't that expensive, but they are often a
good litmus test for slow code.
What we want: the whole test should be unlikely
Following the logic that branches should be assumed not taken,
we want to make it as simple as possible for the compiler to understand
what we want. We can do this fairly easily by combining our comparison
operations and only testing a single value. Here's the change:
uint32_t test_2( uint32_t arg0, uint32_t arg1 )
{
uint32_t arg0_test = arg0 & 0x00040000;
uint32_t arg1_test = arg1 & 0x00080000;
uint32_t test_comb = arg0_test | arg1_test;
if ( test_comb != 0 )
{
return (arg0);
}
return (arg1);
}
There is absolutely no possible way for the compiler to generate
more than one branch in this scenario. We are only testing one value.
The rest is simple bit operations. Can't be anything else. Really. Are
you convinced yet? :) Take a look at the output:
test_3:
rlwinm 9,3,0,13,13 # arg0_test = arg0 & 0x00040000
rlwinm 0,4,0,12,12 # arg1_test = arg1 & 0x00080000
or. 11,9,0 # test_comb = arg0_test | arg1_test
# CR[0] = test_comb <=> 0
# TEST_TEST_COMB:
bne- 0,.L7 # if_unlikely ( CR[0] != 0 ) goto DONE
# i.e. ( test_comb != 0 )
mr 3,4 # result = arg1
.L7: # DONE:
blr # RETURN (result)
Pretty much what you expected right?
bne- 0,.L7 # if_unlikely ( CR[0] != 0 ) goto DONE
The branch not-taken is unlikely. Everything is looking up.
But did you see the "or." coming? That's right, the compiler's trying to be tricky and use the result of the or
to determine the test value instead of doing a separate test. Again,
prefer instruction sequences that do not modify or read the Condition
Register.
Combine your comparisons into a single
value before you branch. This makes the compiler's job easier and your
code will be faster.
How can we reduce the dependencies on the Condition Register?
Let me give you a hint. Here's the answer.
uint32_t test_3( uint32_t arg0, uint32_t arg1 )
{
uint32_t arg0_test = arg0 & 0x00040000;
uint32_t arg1_test = arg1 & 0x00080000;
uint32_t test_comb = arg0_test | arg1_test;
uint32_t mask_comb = (uint32_t)(((int32_t)( test_comb | -test_comb ))>>31);
uint32_t result = ( arg0 & mask_comb ) | ( arg1 & (~mask_comb) );
return (result);
}
And the final output:
test_4:
rlwinm 8,3,0,13,13
rlwinm 10,4,0,12,12
or 7,8,10
neg 6,7
srawi 0,6,31
and 9,3,0
andc 3,4,0
or 3,9,3
blr