# Using masks to accelerate integer code

Let's go back to the last example:
uint32_t test_3( uint32_t arg0, uint32_t arg1 ) { uint32_t arg0_test = arg0 & 0x00040000; uint32_t arg1_test = arg1 & 0x00080000; uint32_t test_comb = arg0_test | arg1_test; uint32_t mask_comb = (uint32_t)(((int32_t)( test_comb | -test_comb ))>>31); uint32_t result = ( arg0 & mask_comb ) | ( arg1 & (~mask_comb) ); return (result); }
Here's the interesting bit:
uint32_t mask_comb = (uint32_t)(((int32_t)( test_comb | -test_comb ))>>31);
What we're doing here is creating an integer mask based on the results from combined test. An integer mask is similar to a predicate value except that instead of true being defined as one, true is defined as "all bits on", which in two's complement is negative one. The code above is logically equivalent to:
uint32_t mask_comb = ( test_comb != 0 ) ? 0xffffffff : 0x00000000;
How does it work?
uint32_t test_comb_sign_msb = test_comb | -test_comb;
The important thing to recognize is that any negative number will have the most significant bit (MSB) set, and that's the only bit we care about. So for any value of test_comb that is not zero, test_comb OR negative test_comb will result in the MSB being set. However, for zero (since there is no negative zero in integer code) zero OR negative zero is still exactly zero.
uint32_t test_comb_saturate_sign = ((int32_t)test_comb_sign_msb) >> 31;
The cast to int32_t is important. We are telling the compiler that we want a arithmetic shift right (SRA), not a logical shift right (SRL). An SRA will use the MSB of the integer to fill the gap on the left caused by the shift (and is used with signed values). A SRL will fill the gap with zero (and is used with unsigned values). And because we know the sign is stored in the MSB and because we are shifting the MSB all the way to the right we know that the entire value will be filled with the sign bit. In other words, we are saturating the value with the sign bit.
The cast back into an unsigned value is just to be sure that we don't get any unwanted sign extensions. In this particular case it wouldn't really make a difference, but it's a good habit.
Use signed values for math. Use unsigned values for bit manipulation, unless you have a reason to need a signed intruction - just remember to cast it back when you're done.
Okay. We have all the bits set to zero, or all set to one. Now what?
uint32_t result = ( arg0 & mask_comb ) | ( arg1 & (~mask_comb) );
What we're saying here is: if all the bits of mask_comb are one, then ( arg0 & mask_comb ) must be arg0 and ( arg1 & (~mask_comb) ) must be zero ... if all the bits of mask_comb are zero, then ( arg0 & mask_comb ) must be zero and ( arg1 & (~mask_comb) ) must be arg1 You might think that ( arg1 & (~mask_comb) ) would map to two instructions: TMP = complement of mask_comb RESULT = arg1 and TMP But the PPU has two instructions that are very handy for dealing with masks: andc: And with Complement ( c = a & ~b ) orc: Or with Complement ( c = a | ~b ) Which makes this operation extremely quick. And now that the result has been selected based on the mask, it can simply be returned. Let look at the compiler output again:
test_3: rlwinm 8,3,0,13,13 # arg0_test = arg0 << 13 rlwinm 10,4,0,12,12 # arg1_test = arg1 << 12 or 7,8,10 # test_comb = arg0_test | arg1_test neg 6,7 # test_comb_neg = -test_comb srawi 0,6,31 # mask_comb = test_comb_neg SRA 31 and 9,3,0 # arg0_result = arg0 & mask_comb andc 3,4,0 # arg1_result = arg1 & ~mask_comb or 3,9,3 # result = arg0_result | arg1_result blr # RETURN (result)
What do we see here? [list] 1. No branches 2. No comparison operations, thus no dependencies on CR at all [/list] And because there are no optimization barriers in this code, when it is inlined into another section of code the compiler will be able to schedule it very well. With the branches and comparison instructions, the compiler is much more limited on how the instructions can be scheduled and must also deal with any contention for the condition registers. But doesn't that look a bit... messy? Sure. Let's clean it up a bit then by defining a couple of handy inline functions:
// Compare Not Zero for 32 bit integer values static inline uint32_t cmpnz_u32( const uint32_t arg ) { /* Set MSB if arg is positive */ const uint32_t arg_neg = -arg; /* Set MSB if arg is not zero */ const uint32_t arg_snz = arg | arg_neg; /* Set all bits to MSB */ const uint32_t arg_snz_sat = (uint32_t)( ((int32_t)arg_snz) >> 31 ); return ( arg_snz_sat ); } // Select between 32 bit integer values from mask static inline uint32_t sel_u32( const uint32_t mask, const uint32_t arg0, const uint32_t arg1 ) { /* Set to arg0 only if mask is set */ const uint32_t arg0_result = arg0 & mask; /* Set to arg1 only if mask is clear */ const uint32_t arg1_result = arg1 & (~mask); /* Only one result can be non-zero. */ const uint32_t result = arg0_result | arg1_result; return ( result ); }
And look how that cleans up the code...
uint32_t test_4( uint32_t arg0, uint32_t arg1 ) { uint32_t arg0_test = arg0 & 0x00040000; uint32_t arg1_test = arg1 & 0x00080000; uint32_t test_comb = arg0_test | arg1_test; uint32_t mask_comb = cmpnz_u32( test_comb ); uint32_t result = sel_u32( mask_comb, arg0, arg1 ); return (result); }
But does that generate the same results? Yes. Identical results:
rlwinm 8,3,0,13,13 rlwinm 10,4,0,12,12 or 7,8,10 neg 6,7 srawi 0,6,31 and 9,3,0 andc 3,4,0 or 3,9,3 blr
Replacing integer branches with masks sometimes involves gymnastics with bitwise transformations. Here is a table of transformations intended to be helpful as reference:
-------------------------------------- x = 0 ------------- xor x, p, p -------------------------------------- x = -1 ------------- orc x, p, p -------------------------------------- x = p ------------- or x, p, p -------------------------------------- x = p ------------- and x, p, p -------------------------------------- x = ~p ------------- nor x, p, p -------------------------------------- x = ~p ------------- nand x, p, p -------------------------------------- x = p = p & -1 -------------- ----------------- or x, p, p orc t0, p, p and x, p, t0 -------------------------------------- x = 0 = p & 0 -------------- ----------------- xor x, p, p andi x, p, 0 -------------------------------------- x = p | p = p & p -------------- ----------------- or x, p, p and x, p, p -------------------------------------- x = p | p = p | ( p & q ) -------------- ----------------- or x, p, p and t0, p, q or x, p, t0 -------------------------------------- x = ~(p|p) = ~(p&p) -------------- ----------------- nor x, p, p nand x, p, p -------------------------------------- x = ~(p|p) = ~((p|p) & (p|p)) -------------- ----------------- nor x, p, p or t0, p, p nand x, t0, t0 -------------------------------------- x = ~(p|q) = ~((p|q) & (p|q)) -------------- ----------------- nor x, p, q or t0, p, q nand x, t0, t0 -------------------------------------- x = ~(p|q) = (~p) & (~q) -------------- ----------------- nor x, p, q nand t0, p, p andc x, t0, q -------------------------------------- x = ~(p^q) = ~((p^q) & (p^q)) -------------- ----------------- eqv x, p, q xor t0, p, q nand x, t0, t0 -------------------------------------- x = ~(p^q) = ~(p^q) -------------- ----------------- xor t0, p, q eqv x, p, q nor x, t0, t0 -------------------------------------- x = p & q = ~(p^q) & q; -------------- ----------------- and x, p, q xor t0, p, q andc x, q, t0 -------------------------------------- x = ~(p&q) = ~(p&q) -------------- ----------------- and t0, p, q nand x, p, q nor x, t0, t0 -------------------------------------- x = ~(p&q) = (p^q) & q -------------- ----------------- nand x, p, q xor t0, p, q and x, t0, q -------------------------------------- x = ~(p&q) = (~p) | (~q) -------------- ----------------- nand x, p, q nor t0, p, p orc x, t0, q -------------------------------------- x = p ^ q = (p|q) & ~(p&q) -------------- ----------------- xor x, p, q or t0, p, q nand t1, p, q and x, t0, t1 -------------------------------------- x = p & (q&r) = (p&q) & r -------------- ----------------- and t0, q, r and t0, p, q and x, p, t0 and x, t0, r -------------------------------------- x = p | (q|r) = (p|q) | r -------------- ----------------- or t0, q, r or t0, p, q or x, p, t0 or x, t0, r -------------------------------------- x = p & (q|r) = (p&q) | (p&r) -------------- ----------------- or t0, q, r and t0, p, q and x, p, t0 and t1, p, r or x, t0, t1 -------------------------------------- x = p | (q&r) = (p|q) & (p|r) -------------- ----------------- and t0, q, r or t0, p, q or x, p, t0 or t1, p, r and x, t0, t1 --------------------------------------
Other References
Here are some additional references that might prove useful:
Hosting for CellPerformance graciously provided by...  Linkedin Mike Acton Twitter @mike_acton Facebook Mike Acton Mike Acton's shared items in Google Reader