Use masks to eliminate integer branches
uint32_t test_3( uint32_t arg0, uint32_t arg1 )
{
    uint32_t arg0_test = arg0 & 0x00040000;
    uint32_t arg1_test = arg1 & 0x00080000;
    if ( arg0_test | arg1_test )
    {
        return (arg0);
    }
    return (arg1);
}

Can be converted to:
uint32_t test_3( uint32_t arg0, uint32_t arg1 )
{
    uint32_t arg0_test = arg0 & 0x00040000;
    uint32_t arg1_test = arg1 & 0x00080000;
    uint32_t test_comb = arg0_test | arg1_test;

The important thing to recognize is that any negative number will have the most significant bit (MSB) set, and that's the only bit we care about. Because the and'ed values are always positive, negating them will always give a negative number, unless that number is zero. For zero, the negation is still exactly zero.

    uint32_t test_comb_sign_msb = -test_comb;

The cast to int32_t below is important. It forces an arithmetic shift right (SRA), not a logical shift right (SRL). An SRA will use the MSB of the integer to fill the gap on the left caused by the shift (and is used with signed values); an SRL will fill the gap with zero (and is used with unsigned values). Because we know the sign is stored in the MSB, and because we are shifting the MSB all the way to the right, we know that the entire value will be filled with the sign bit. In other words, we are saturating the value with the sign bit.

The cast back into an unsigned value is just to be sure that there are no unwanted sign extensions. In this particular case it wouldn't really make a difference, but it's a good habit.

    uint32_t mask_comb = (uint32_t)(((int32_t)test_comb_sign_msb)>>31);

If all the bits of mask_comb are one, then ( arg0 & mask_comb ) must be arg0 and ( arg1 & (~mask_comb) ) must be zero. If all the bits of mask_comb are zero, then ( arg0 & mask_comb ) must be zero and ( arg1 & (~mask_comb) ) must be arg1.

    uint32_t result = ( arg0 & mask_comb ) | ( arg1 & (~mask_comb) );
    return (result);
}
Use signed values for math. Use unsigned values for bit manipulation, unless there is a need for a signed instruction. Remember to cast back to unsigned at the end.
The PPU has two instructions that are very handy for dealing with masks:

andc (and with Complement) ( c = a & ~b )
orc  (or with Complement)  ( c = a | ~b )

These make this operation extremely quick. And now that the result has been selected based on the mask, it can simply be returned.
Looking at the generated code:

test_3:
rlwinm 8,3,0,13,13  # arg0_test = arg0 & 0x00040000
rlwinm 10,4,0,12,12 # arg1_test = arg1 & 0x00080000
or 7,8,10           # test_comb = arg0_test | arg1_test
neg 6,7             # test_comb_sign_msb = -test_comb
srawi 0,6,31        # mask_comb = test_comb_sign_msb >> 31 (SRA)
and 9,3,0           # arg0_result = arg0 & mask_comb
andc 3,4,0          # arg1_result = arg1 & ~mask_comb
or 3,9,3            # result = arg0_result | arg1_result
blr                 # return (result)

There are two main things to notice:
- No branches
- No comparison operations, thus no dependencies on the condition register (CR) at all
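The same sign-to-mask trick extends to other integer selections. As a further illustration (a sketch of mine, not part of the original example; it assumes 32-bit ints, an arithmetic shift right on signed values, and that a - b does not overflow), a branch-free signed minimum:

#include <stdint.h>

static inline int32_t min_i32( int32_t a, int32_t b )
{
    int32_t diff = a - b;           /* negative when ( a < b )                  */
    int32_t mask = diff >> 31;      /* SRA saturates the sign: all ones or zero */
    return ( b + ( diff & mask ) ); /* b + (a - b) = a when ( a < b ), else b   */
}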
Use the fsel instruction to eliminate floating point branches
There's an additional FPU instruction available on the PPU that will selectively move between a choice of two floating point registers based on the value of a third: fsel.

fsel result, test, arg0, arg1

Which is logically equivalent to:

result = ( test >= 0.0 ) ? arg0 : arg1;
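For experimenting off-target, a reference version can be written in portable C (a sketch; the name fsel_gez_ref is mine, and an ordinary compiler will generally emit a compare and a branch here rather than fsel):

static inline float fsel_gez_ref( float test, float arg0, float arg1 )
{
    /* Same selection as fsel: arg0 when ( test >= 0.0 ), otherwise arg1. */
    return ( test >= 0.0f ) ? arg0 : arg1;
}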
EXAMPLE (1):

float test_4( float arg0, float arg1 )
{
    const float result0 = arg0;
    const float result1 = arg1;
    const float arg0_test = arg0 - 1.0f;
    const float arg1_test = arg1 - 0.5f;
    const float arg0_result = ( arg0_test >= 0.0f ) ? result1 : result0;
    const float arg1_result = ( arg1_test >= 0.0f ) ? result0 : arg0_result;
    return ( arg1_result );
}
This generates:

.LC5:
.long 1065353216
.LC6:
.long 1056964608
.LC7:
.long 0
test_4:
lis 7,.LC5@ha
fmr 11,1
la 6,.LC5@l(7)
stwu 1,-32(1)
lis 5,.LC6@ha
lfs 5,0(6)
fsubs 12,1,5
la 4,.LC6@l(5)
stfs 2,16(1)
lis 3,.LC7@ha
lfs 3,0(4)
la 9,.LC7@l(3)
fsubs 2,2,3
lwz 0,16(1)
lfs 13,0(9)
mtctr 0
fcmpu 7,12,13
fcmpu 6,2,13
bge- 7,.L16
stfs 1,16(1)
lwz 8,16(1)
mtctr 8
.L16:
bge- 6,.L19
mfctr 10
stw 10,16(1)
lfs 11,16(1)
.L19:
fmr 1,11
addi 1,1,32
blr
Use instead:

static inline float fsel_gez( float test, float arg0, float arg1 )
{
    float result;
    __asm__ ( "fsel %0, %1, %2, %3": "=f"(result): "f"(test), "f"(arg0), "f"(arg1) );
    return (result);
}

Note that the less-than-equal comparison has been inverted to a greater-than-equal comparison, which means that equal-to-zero will give the incorrect result. For many circumstances this is acceptable. An epsilon value may be added to the comparison in order to select on equal-to within a certain range.

float test( float arg0, float arg1 )
{
    const float result0 = arg0;
    const float result1 = arg1;
    const float arg0_test = arg0 - 1.0f;
    const float arg1_test = arg1 - 0.5f;
    const float arg0_result = fsel_gez( arg0_test, result1, result0 );
    const float arg1_result = fsel_gez( arg1_test, result0, arg0_result );
    return ( arg1_result );
}
The results from this are significantly better:
.LC5:
.long 1065353216
.LC6:
.long 1056964608
test:
lis 5,.LC5@ha
lis 3,.LC6@ha
la 4,.LC5@l(5)
la 9,.LC6@l(3)
stwu 1,-16(1)
lfs 5,0(4)
addi 1,1,16
lfs 13,0(9)
fsubs 4,1,5
fsubs 3,2,13
fsel 0, 4, 2, 1
fsel 2, 3, 1, 0
fmr 1,2
blr
EXAMPLE (2): To calculate the minimum of three floats.

Although this function can be reduced, it serves as an example of using the result of fsel to generate a mask (0.0 or 1.0) which can then be used in calculations. (A reduced form is sketched after the function.)
float f_min_rgb( float r, float g, float b )
{
    float d_rg = g - r;                        /* >= 0 when ( r <= g ) */
    float d_rb = b - r;                        /* >= 0 when ( r <= b ) */
    float d_gb = b - g;                        /* >= 0 when ( g <= b ) */
    float m_rg = fsel_gez( d_rg, 1.0f, 0.0f ); /* ( r <= g ) */
    float m_rb = fsel_gez( d_rb, 1.0f, 0.0f ); /* ( r <= b ) */
    float m_gb = fsel_gez( d_gb, 1.0f, 0.0f ); /* ( g <= b ) */
    float m_gr = 1.0f - m_rg;                  /* ( g <  r ) */
    float m_br = 1.0f - m_rb;                  /* ( b <  r ) */
    float m_bg = 1.0f - m_gb;                  /* ( b <  g ) */
    float r_gb = m_rg * m_rb;                  /* (( r <= g ) && ( r <= b )) */
    float g_rb = m_gr * m_gb;                  /* (( g <  r ) && ( g <= b )) */
    float b_rg = m_br * m_bg;                  /* (( b <  r ) && ( b <  g )) */
    float gb_r = 1.0f - r_gb;                  /* (( g <  r ) || ( b <  r )) */
    float rb_g = 1.0f - g_rb;                  /* (( r <= g ) || ( b <  g )) */
    float result_r = r * r_gb;                 /* (( r <= g ) && ( r <= b )) ? r : 0.0 */
    float result_g = g * g_rb * gb_r;          /* (( g <  r ) && ( g <= b )) ? g : 0.0 */
    float result_b = b * b_rg * rb_g;          /* (( b <  r ) && ( b <  g )) ? b : 0.0 */
    float result = result_r + result_g + result_b;
    return (result);
}
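For reference, the reduction mentioned above could look like this (a sketch of mine using nested selects instead of masks; fsel_gez is the wrapper defined earlier, and the name f_min_rgb_reduced is hypothetical):

static inline float f_min_rgb_reduced( float r, float g, float b )
{
    const float min_rg = fsel_gez( g - r, r, g ); /* r when ( r <= g ), else g */
    return fsel_gez( b - min_rg, min_rg, b );     /* min_rg when ( min_rg <= b ), else b */
}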
Use fast versions of floor() and ceil() to eliminate floating point branches
TODO: Describe an example of using float->int conversion as a mask for step functions. A sketch of branch-free floor() and ceil() built this way follows.
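A minimal sketch of what such versions might look like (my code, built on the fsel_gez wrapper above; it assumes the input fits in an int32_t, since the truncating float-to-int conversion provides the step):

static inline float fast_floorf( float x )
{
    const float t = (float)(int32_t)x;     /* truncate toward zero            */
    return fsel_gez( x - t, t, t - 1.0f ); /* step down by one when ( x < t ) */
}

static inline float fast_ceilf( float x )
{
    const float t = (float)(int32_t)x;     /* truncate toward zero            */
    return fsel_gez( t - x, t, t + 1.0f ); /* step up by one when ( x > t )   */
}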
Predicated Reads and Writes

Predicated read example:
if ( foo.a > b )
{
    c += foo.a;
}

In this case (predicated read), create a local zero variable, to the effect of:

mask  = test_gt( foo.a, b );
add_c = select( mask, foo.a, zero );
c    += add_c;

Predicated write example:
if ( a > b )
{
    foo.c += a;
}

In this case, have a junk variable on the stack. The stack is very likely to be in cache, so the write to L1 will be cheaper than the cost of even a few percentage points of mispredicted branches. (There is no PPU or SPU equivalent to the P4's cmov instruction.)

out_junk  = &junk;
out_foo_c = &foo.c;
mask      = test_gt( a, b );
out_addr  = select( mask, out_foo_c, out_junk );
*out_addr += a;

A sketch of the test_gt and select helpers assumed by this pseudocode follows.
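test_gt and select are left undefined in the pseudocode above. A minimal sketch of 32-bit versions (the names match the pseudocode; the range caveat is mine), built from the sign-mask technique in the first section:

#include <stdint.h>

/* All ones when ( a > b ), else zero. Valid while both values fit in 31
 * bits, so the subtraction below cannot overflow the signed range. */
static inline uint32_t test_gt( uint32_t a, uint32_t b )
{
    return (uint32_t)( ( (int32_t)( b - a ) ) >> 31 );
}

/* mask all ones -> a; mask all zeros -> b */
static inline uint32_t select( uint32_t mask, uint32_t a, uint32_t b )
{
    return ( a & mask ) | ( b & (~mask) );
}

The pointer select in the predicated write works the same way once the two addresses are cast to an integer of pointer width (uintptr_t).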