More Techniques for Eliminating Branches

Use masks to eliminate integer branches
uint32_t test_3( uint32_t arg0, uint32_t arg1 )
{
uint32_t arg0_test = arg0 & 0x00040000;
uint32_t arg1_test = arg1 & 0x00080000;

if (arg0_test | arg1_test )
{
return (arg0);
}

return (arg1);
}
Can be converted to:
uint32_t test_3( uint32_t arg0, uint32_t arg1 )
{
uint32_t arg0_test = arg0 & 0x00040000;
uint32_t arg1_test = arg1 & 0x00080000;
uint32_t test_comb = arg0_test | arg1_test;
The important thing to recognize is that any negative number has its most significant bit (MSB) set, and that is the only bit we care about. Because the ANDed values are always positive, negating them will always give a negative number, unless the value is zero; the negation of zero is still exactly zero.
  uint32_t test_comb_sign_msb = -test_comb;
The cast to int32_t is important. It forces an arithmetic shift right (SRA), not a logical shift right (SRL). An SRA fills the gap on the left caused by the shift with the MSB of the integer (and is used with signed values); an SRL fills the gap with zero (and is used with unsigned values). Because the sign is stored in the MSB and we are shifting that bit all the way to the right, the entire value ends up filled with the sign bit. In other words, we are saturating the value with the sign bit.

The cast back into an unsigned value is just to be sure that there are no unwanted sign extensions. In this particular case it wouldn't really make a difference, but it's a good habit.
  uint32_t mask_comb          = (uint32_t)(((int32_t)test_comb_sign_msb)>>31);
If all the bits of mask_comb are one, then ( arg0 & mask_comb ) must be arg0 and ( arg1 & (~mask_comb) ) must be zero.
If all the bits of mask_comb are zero, then ( arg0 & mask_comb ) must be zero and ( arg1 & (~mask_comb) ) must be arg1.
 
uint32_t result = ( arg0 & mask_comb ) | ( arg1 & (~mask_comb) );

return (result);
}
Use signed values for math. Use unsigned values for bit manipulation, unless a signed instruction (such as the arithmetic shift above) is needed. Remember to cast back to an unsigned type at the end.
The PPU has two instructions that are very handy for dealing with masks:
andc (AND with complement) ( c = a & ~b )
orc (OR with complement) ( c = a | ~b )

These make the masked select extremely quick. And now that the result has been selected based on the mask, it can simply be returned.

Looking at the generated code:
test_3:
rlwinm 8,3,0,13,13 # arg0_test = arg0 & 0x00040000
rlwinm 10,4,0,12,12 # arg1_test = arg1 & 0x00080000
or 7,8,10 # test_comb = arg0_test | arg1_test
neg 6,7 # test_comb_neg = -test_comb
srawi 0,6,31 # mask_comb = test_comb_neg SRA 31
and 9,3,0 # arg0_result = arg0 & mask_comb
andc 3,4,0 # arg1_result = arg1 & ~mask_comb
or 3,9,3 # result = arg0_result | arg1_result
blr # RETURN (result)
There are two main things to notice:
  • No branches
  • No comparison operations, thus no dependencies on CR at all
Because there are no optimization barriers in this code, when it is inlined into another section of code the compiler will be able to schedule it very well. With the branches and comparison instructions, the compiler is much more limited in how the instructions can be scheduled and must also deal with any contention for the condition registers.
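The same negate-and-shift idiom can be factored into small reusable helpers. Below is a minimal sketch (the helper names are illustrative, not from the listing above); the extra OR with the negated value also covers inputs whose MSB is already set, which the masked values in test_3 never have:

#include <stdint.h>

/* All ones (0xFFFFFFFF) when value is non-zero, all zeros when value is zero. */
static inline uint32_t mask_if_nonzero( uint32_t value )
{
    uint32_t any_bit = value | ((uint32_t)0 - value);   /* MSB set for any non-zero value */
    return (uint32_t)(((int32_t)any_bit) >> 31);        /* SRA smears the MSB across the word */
}

/* Branch-free select: a when mask is all ones, b when mask is all zeros. */
static inline uint32_t select_by_mask( uint32_t mask, uint32_t a, uint32_t b )
{
    return ( a & mask ) | ( b & (~mask) );
}

uint32_t test_3_with_helpers( uint32_t arg0, uint32_t arg1 )
{
    uint32_t test_comb = ( arg0 & 0x00040000 ) | ( arg1 & 0x00080000 );
    uint32_t mask_comb = mask_if_nonzero( test_comb );

    return select_by_mask( mask_comb, arg0, arg1 );
}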

Use the fsel instruction to eliminate floating point branches
There is an additional FPU instruction available on the PPU that will selectively move between a choice of two floating-point registers based on the value of a third: fsel.

fsel result, test, arg0, arg1 
Which is logically equivalent to:
result = ( test >= 0.0 ) ? arg0 : arg1;
EXAMPLE (1):
float test_4( float arg0, float arg1 )
{
const float result0 = arg0;
const float result1 = arg1;
const float arg0_test = arg0 - 1.0;
const float arg1_test = arg1 - 0.5;
const float arg0_result = ( arg0_test >= 0.0f ) ? result1 : result0;
const float arg1_result = ( arg1_test >= 0.0f ) ? result0 : arg0_result;

return ( arg1_result );
}
.LC5:
.long 1065353216
.LC6:
.long 1056964608
.LC7:
.long 0
test:
lis 7,.LC5@ha
fmr 11,1
la 6,.LC5@l(7)
stwu 1,-32(1)
lis 5,.LC6@ha
lfs 5,0(6)
fsubs 12,1,5
la 4,.LC6@l(5)
stfs 2,16(1)
lis 3,.LC7@ha
lfs 3,0(4)
la 9,.LC7@l(3)
fsubs 2,2,3
lwz 0,16(1)
lfs 13,0(9)
mtctr 0
fcmpu 7,12,13
fcmpu 6,2,13
bge- 7,.L16
stfs 1,16(1)
lwz 8,16(1)
mtctr 8
.L16:
bge- 6,.L19
mfctr 10
stw 10,16(1)
lfs 11,16(1)
.L19:
fmr 1,11
addi 1,1,32
blr
Use instead:
static inline float fsel_gez( float test, float arg0, float arg1 )
{
float result;

__asm__ ( "fsel %0, %1, %2, %3": "=f"(result): "f"(test), "f"(arg0), "f"(arg1) );

return (result);
}

float test( float arg0, float arg1 )
{
const float result0 = arg0;
const float result1 = arg1;
const float arg0_test = arg0 - 1.0;
const float arg1_test = arg1 - 0.5;
const float arg0_result = fsel_gez( arg0_test, result1, result0 );
const float arg1_result = fsel_gez( arg1_test, result0, arg0_result );

return ( arg1_result );
}
When a less-than-or-equal comparison is inverted into the greater-than-or-equal-to-zero comparison that fsel provides, a test value exactly equal to zero will select the wrong result. For many circumstances this is acceptable. An epsilon value may be added to the test in order to handle the equal-to-zero case within a certain range.

The results from this are significantly better:
.LC5:
.long 1065353216
.LC6:
.long 1056964608
test:
lis 5,.LC5@ha
lis 3,.LC6@ha
la 4,.LC5@l(5)
la 9,.LC6@l(3)
stwu 1,-16(1)
lfs 5,0(4)
addi 1,1,16
lfs 13,0(9)
fsubs 4,1,5
fsubs 3,2,13
fsel 0, 4, 2, 1
fsel 2, 3, 1, 0
fmr 1,2
blr
EXAMPLE (2):

To calculate the minimum of three floats.
Although this function can be reduced, it serves as an example of using the result of fsel to generate a mask (0.0 or 1.0) which can then be used in calculations.
float f_min_rgb( float r, float g, float b )
{
float d_rg = g - r; /* >= 0 when ( r <= g ) */
float d_rb = b - r; /* >= 0 when ( r <= b ) */
float d_gb = b - g; /* >= 0 when ( g <= b ) */
float m_rg = fsel_gez( d_rg, 1.0f, 0.0f ); /* ( r <= g ) */
float m_rb = fsel_gez( d_rb, 1.0f, 0.0f ); /* ( r <= b ) */
float m_gb = fsel_gez( d_gb, 1.0f, 0.0f ); /* ( g <= b ) */
float m_gr = 1.0f - m_rg; /* ( g <= r ) */
float m_br = 1.0f - m_rb; /* ( b <= r ) */
float m_bg = 1.0f - m_gb; /* ( b <= g ) */
float r_gb = m_rg * m_rb; /* (( r <= g ) && ( r <= b )) */
float g_rb = m_gr * m_gb; /* (( g <= r ) && ( g <= b )) */
float b_rg = m_br * m_bg; /* (( b <= r ) && ( b <= g )) */
float gb_r = 1.0f - r_gb; /* (( g < r ) || ( b < r )) */
float rb_g = 1.0f - g_rb; /* (( r < g ) || ( b < g )) */
float result_r = r * r_gb; /* (( r <= g ) && ( r <= b )) ? r : 0.0 */
float result_g = g * g_rb * gb_r; /* (( g < r ) && ( g <= b )) ? g : 0.0 */
float result_b = b * b_rg * rb_g; /* (( b < r ) && ( b < g )) ? b : 0.0 */
float result = result_r + result_g + result_b;

return (result);
}
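As noted above, the three-way minimum itself can be reduced. A minimal sketch of that reduction using nested fsel selects, built on the fsel_gez wrapper from the previous example (fsel_min is an illustrative name):

/* Branch-free minimum of two floats: ( b - a ) >= 0 means a is the smaller (or equal). */
static inline float fsel_min( float a, float b )
{
    return fsel_gez( b - a, a, b );
}

float f_min_rgb_reduced( float r, float g, float b )
{
    return fsel_min( r, fsel_min( g, b ) );
}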
Use fast versions of floor() and ceil() to eliminate floating point branches
A float-to-integer conversion can be used to build branch-free floor() and ceil(), and the converted result can also serve as a mask in step functions.
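As one possible illustration (a sketch, with fast_floor as an assumed name), a branch-free floor() for values within int32 range can be built from the truncating float-to-int conversion plus an fsel-based step adjustment:

/* Branch-free floor() for values representable in an int32_t. */
static inline float fast_floor( float x )
{
    const float truncated = (float)(int32_t)x;               /* rounds toward zero */
    /* For negative values with a fractional part, truncation rounded up; step down by one. */
    return truncated + fsel_gez( x - truncated, 0.0f, -1.0f );
}

A matching ceil() follows the same pattern with the adjustment reversed: truncated + fsel_gez( truncated - x, 0.0f, 1.0f ).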
Predicated Reads and Writes
Predicated read example:
if ( foo.a > b )
{
c += foo.a;
}
In this case (a predicated read), create a local zero variable and select between it and the value to be read, to the effect of:
mask   = test_gt( foo.a, b );
add_c = select( mask, foo.a, zero );
c += add_c;
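Here test_gt and select are pseudocode. A minimal concrete sketch for signed 32-bit values, reusing the mask-and-select idiom from the integer section (struct Foo and the helper names are illustrative; the 64-bit subtraction simply avoids overflow in the comparison):

#include <stdint.h>

struct Foo { int32_t a; int32_t c; };    /* illustrative stand-in for foo */

/* All ones when a > b, all zeros otherwise (sign of b - a, smeared by an SRA). */
static inline uint32_t mask_gt( int32_t a, int32_t b )
{
    return (uint32_t)( ( (int64_t)b - (int64_t)a ) >> 63 );
}

void predicated_read( const struct Foo *foo, int32_t b, int32_t *c )
{
    uint32_t mask  = mask_gt( foo->a, b );
    int32_t  add_c = (int32_t)( (uint32_t)foo->a & mask );  /* foo->a when the test passes, 0 otherwise */

    *c += add_c;
}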
Predicated write example:
if ( a > b )
{
foo.c += a;
}
In this case (a predicated write), keep a junk variable on the stack and select the store address. The stack is very likely to be in-cache, so the write to the L1 will cost less than even a few percentage points of mispredicted branches. (There is no PPU or SPU equivalent to the P4 cmov instruction.)

out_junk  = &junk;
out_foo_c = &foo.c;
mask = test_gt( a, b );
out_addr = select( mask, out_foo_c, out_junk );
*out_addr += a;
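A minimal concrete sketch of this address-select version, reusing mask_gt and the illustrative struct Foo from the predicated-read sketch (junk is the on-stack discard target):

void predicated_write( struct Foo *foo, int32_t a, int32_t b )
{
    int32_t   junk      = 0;                                             /* on-stack target for the discarded write */
    uintptr_t mask      = (uintptr_t)(intptr_t)(int32_t)mask_gt( a, b ); /* sign-extend the mask to pointer width */
    uintptr_t addr_c    = (uintptr_t)&foo->c;
    uintptr_t addr_junk = (uintptr_t)&junk;
    int32_t  *out_addr  = (int32_t *)( ( addr_c & mask ) | ( addr_junk & (~mask) ) );

    *out_addr += a;   /* foo->c += a when ( a > b ); otherwise the add lands harmlessly in junk */
}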