CellPerformance
All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.
Suggestions? Comments? Questions?

Send email to Mike Acton
Articles
Cross-compiling for PS3 Linux
In this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

Unaligned scalar load and store on the SPU
An example of unaligned loads and stores on the SPU. The key to this problem is remembering that the SPU does not have a scalar instruction set and cannot access local memory in anything except 16-byte quadwords.

atan2 on SPU
A branch-free implementation of atan2 for vector floats on the SPU.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is to serve as an example of developing some relatively complex operations completely without branches: a software implementation of half-precision floating-point numbers.

Better Performance Through Branch Elimination
An introduction to branch penalties: Why it's a good idea to avoid branchy code.

Box Overlap
A look at a function to test for overlap between 3D boxes, and how to optimize it for the CBE.

A 4x4 Matrix Inverse
A case study on how to convert scalar code into SIMD code for the PPU and SPU, using the matrix inverse as an example.

Avoiding Microcoded Instructions On The PPU
Executing instructions from microcode can wreak havoc on inner loop performance. Find out which instructions are microcoded and how to avoid them.

Choosing to Avoid Branches: A Small Altivec Example
An example of why fewer instructions don't always mean faster code.

More Techniques for Eliminating Branches
Some additional examples for eliminating integer and floating-point branches.

Programming with Branches, Patterns and Tips
GCC follows some straightforward rules that are useful to know when programming with branches.

Benefits to Branch Elimination
The fundamental principle behind branch elimination is that expressing a value as a simple function of its inputs (a single basic block) is often more efficient than selecting a result through a change in control flow (branching).

Background on Branching
Background on how branches operate on the PPU and SPU.

Links
No Insider Info!
Although discussions on applying the Cell processor to game development are welcome here, do not ask for insider information related to Sony's Playstation 3.

The details of the hardware and development are covered by a non-disclosure agreement and under no conditions will confidential information be permitted on this site.

Playstation 3 developers are welcome to participate in the discussions, but be aware that this is a publicly accessible site and information not available to the general public may not be disclosed.

Keep it clean so that we can continue to build on the community of Cell developers both inside and outside video game development.

Thank you for your cooperation,
Mike.
Legal
Content Copyright © 2006 by Mike Acton. All Rights Reserved.

This site uses the Movable Type 3.2 content engine.

This site uses the phpBB bulletin board engine Copyright © 2001, 2005 phpBB Group.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.

PowerPC is a trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds in the U.S. and other countries.

Macintosh and Mac are registered trademarks of Apple Computer, Inc.

All other trademarks are the property of their respective owners.
Using Masks To Accelerate Integer Code
Mike Acton
April 17, 2006
Let's go back to the last example from Part 2:
uint32_t test_3( uint32_t arg0, uint32_t arg1 )
{
    uint32_t arg0_test = arg0 & 0x00040000;
    uint32_t arg1_test = arg1 & 0x00080000;
    uint32_t test_comb = arg0_test | arg1_test;
    uint32_t mask_comb = (uint32_t)(((int32_t)( test_comb | -test_comb ))>>31);
    uint32_t result    = ( arg0 & mask_comb ) | ( arg1 & (~mask_comb) );

    return (result);
}
Here's the interesting bit:
uint32_t mask_comb = (uint32_t)(((int32_t)( test_comb | -test_comb ))>>31);
What we're doing here is creating an integer mask based on the result of the combined test. An integer mask is similar to a predicate value except that instead of true being defined as one, true is defined as "all bits on", which in two's complement is negative one. The code above is logically equivalent to:
uint32_t mask_comb = ( test_comb != 0 ) ? 0xffffffff : 0x00000000;
How does it work?
uint32_t test_comb_sign_msb = test_comb | -test_comb;
The important thing to recognize is that any negative number will have the most significant bit (MSB) set, and that's the only bit we care about. So for any value of test_comb that is not zero, test_comb OR negative test_comb will result in the MSB being set. However, for zero (since there is no negative zero in integer code) zero OR negative zero is still exactly zero.
uint32_t test_comb_saturate_sign = ((int32_t)test_comb_sign_msb) >> 31;
The cast to int32_t is important. We are telling the compiler that we want an arithmetic shift right (SRA), not a logical shift right (SRL). An SRA will use the MSB of the integer to fill the gap on the left caused by the shift (and is used with signed values). An SRL will fill the gap with zero (and is used with unsigned values). And because we know the sign is stored in the MSB, and because we are shifting the MSB all the way to the right, we know that the entire value will be filled with the sign bit. In other words, we are saturating the value with the sign bit.
uint32_t mask_comb = (uint32_t)test_comb_saturate_sign;
The cast back into an unsigned value is just to be sure that we don't get any unwanted sign extensions. In this particular case it wouldn't really make a difference, but it's a good habit.
Use signed values for math. Use unsigned values for bit manipulation, unless you have a reason to need a signed instruction - just remember to cast it back when you're done.
Okay. We have all the bits set to zero, or all set to one. Now what?
uint32_t result = ( arg0 & mask_comb ) | ( arg1 & (~mask_comb) );
What we're saying here is:

If all the bits of mask_comb are one, then ( arg0 & mask_comb ) must be arg0 and ( arg1 & (~mask_comb) ) must be zero.
If all the bits of mask_comb are zero, then ( arg0 & mask_comb ) must be zero and ( arg1 & (~mask_comb) ) must be arg1.

You might think that ( arg1 & (~mask_comb) ) would map to two instructions:

    TMP    = complement of mask_comb
    RESULT = arg1 and TMP

But the PPU has two instructions that are very handy for dealing with masks:

    andc: And with Complement ( c = a & ~b )
    orc:  Or with Complement  ( c = a | ~b )

These make this operation extremely quick. And now that the result has been selected based on the mask, it can simply be returned. Let's look at the compiler output again:
test_3:
    rlwinm 8,3,0,13,13    # arg0_test     = arg0 & 0x00040000
    rlwinm 10,4,0,12,12   # arg1_test     = arg1 & 0x00080000
    or     7,8,10         # test_comb     = arg0_test | arg1_test
    neg    6,7            # test_comb_neg = -test_comb
    srawi  0,6,31         # mask_comb     = test_comb_neg SRA 31
    and    9,3,0          # arg0_result   = arg0 & mask_comb
    andc   3,4,0          # arg1_result   = arg1 & ~mask_comb
    or     3,9,3          # result        = arg0_result | arg1_result
    blr                   # RETURN (result)

(Note that the compiler dropped the ( test_comb | -test_comb ) OR: it can prove test_comb is a small non-negative value, so the sign bit of -test_comb alone distinguishes zero from non-zero.)
What do we see here?

1. No branches.
2. No comparison operations, and thus no dependencies on the condition register (CR) at all.

And because there are no optimization barriers in this code, when it is inlined into another section of code the compiler will be able to schedule it very well. With branches and comparison instructions, the compiler is much more limited in how the instructions can be scheduled and must also deal with any contention for the condition registers. But doesn't that look a bit... messy? Sure. Let's clean it up a bit by defining a couple of handy inline functions:
// Compare Not Zero for 32 bit integer values
static inline uint32_t cmpnz_u32( const uint32_t arg )
{
    /* Set MSB if arg is positive */
    const uint32_t arg_neg     = -arg;
    /* Set MSB if arg is not zero */
    const uint32_t arg_snz     = arg | arg_neg;
    /* Set all bits to MSB */
    const uint32_t arg_snz_sat = (uint32_t)( ((int32_t)arg_snz) >> 31 );

    return ( arg_snz_sat );
}

// Select between 32 bit integer values from mask
static inline uint32_t sel_u32( const uint32_t mask, const uint32_t arg0, const uint32_t arg1 )
{
    /* Set to arg0 only if mask is set */
    const uint32_t arg0_result = arg0 & mask;
    /* Set to arg1 only if mask is clear */
    const uint32_t arg1_result = arg1 & (~mask);
    /* Only one result can be non-zero. */
    const uint32_t result      = arg0_result | arg1_result;

    return ( result );
}
And look how that cleans up the code...
uint32_t test_4( uint32_t arg0, uint32_t arg1 )
{
    uint32_t arg0_test = arg0 & 0x00040000;
    uint32_t arg1_test = arg1 & 0x00080000;
    uint32_t test_comb = arg0_test | arg1_test;
    uint32_t mask_comb = cmpnz_u32( test_comb );
    uint32_t result    = sel_u32( mask_comb, arg0, arg1 );

    return (result);
}
But does that generate the same results? Yes. Identical results:
    rlwinm 8,3,0,13,13
    rlwinm 10,4,0,12,12
    or     7,8,10
    neg    6,7
    srawi  0,6,31
    and    9,3,0
    andc   3,4,0
    or     3,9,3
    blr
Helpful Identities
Replacing integer branches with masks sometimes involves gymnastics with bitwise transformations. Here is a table of transformations intended as a helpful reference:
--------------------------------------
x = 0
  xor x, p, p
--------------------------------------
x = -1
  orc x, p, p
--------------------------------------
x = p
  or x, p, p
--------------------------------------
x = p
  and x, p, p
--------------------------------------
x = ~p
  nor x, p, p
--------------------------------------
x = ~p
  nand x, p, p
--------------------------------------
x = p  =  p & -1
  or x, p, p        orc t0, p, p
                    and x, p, t0
--------------------------------------
x = 0  =  p & 0
  xor x, p, p       andi x, p, 0
--------------------------------------
x = p | p  =  p & p
  or x, p, p        and x, p, p
--------------------------------------
x = p | p  =  p | ( p & q )
  or x, p, p        and t0, p, q
                    or  x, p, t0
--------------------------------------
x = ~(p|p)  =  ~(p&p)
  nor x, p, p       nand x, p, p
--------------------------------------
x = ~(p|p)  =  ~((p|p) & (p|p))
  nor x, p, p       or   t0, p, p
                    nand x, t0, t0
--------------------------------------
x = ~(p|q)  =  ~((p|q) & (p|q))
  nor x, p, q       or   t0, p, q
                    nand x, t0, t0
--------------------------------------
x = ~(p|q)  =  (~p) & (~q)
  nor x, p, q       nand t0, p, p
                    andc x, t0, q
--------------------------------------
x = ~(p^q)  =  ~((p^q) & (p^q))
  eqv x, p, q       xor  t0, p, q
                    nand x, t0, t0
--------------------------------------
x = ~(p^q)  =  ~(p^q)
  xor t0, p, q      eqv x, p, q
  nor x, t0, t0
--------------------------------------
x = p & q  =  ~(p^q) & q
  and x, p, q       xor  t0, p, q
                    andc x, q, t0
--------------------------------------
x = ~(p&q)  =  ~(p&q)
  and t0, p, q      nand x, p, q
  nor x, t0, t0
--------------------------------------
x = ~(p&q) & q  =  (p^q) & q
  nand t0, p, q     xor t0, p, q
  and  x, t0, q     and x, t0, q
--------------------------------------
x = ~(p&q)  =  (~p) | (~q)
  nand x, p, q      nor t0, p, p
                    orc x, t0, q
--------------------------------------
x = p ^ q  =  (p|q) & ~(p&q)
  xor x, p, q       or   t0, p, q
                    nand t1, p, q
                    and  x, t0, t1
--------------------------------------
x = p & (q&r)  =  (p&q) & r
  and t0, q, r      and t0, p, q
  and x, p, t0      and x, t0, r
--------------------------------------
x = p | (q|r)  =  (p|q) | r
  or t0, q, r       or t0, p, q
  or x, p, t0       or x, t0, r
--------------------------------------
x = p & (q|r)  =  (p&q) | (p&r)
  or  t0, q, r      and t0, p, q
  and x, p, t0      and t1, p, r
                    or  x, t0, t1
--------------------------------------
x = p | (q&r)  =  (p|q) & (p|r)
  and t0, q, r      or  t0, p, q
  or  x, p, t0      or  t1, p, r
                    and x, t0, t1
--------------------------------------
About the author
Mike Acton is a Senior Architect working on PS3/Cell research at Highmoon Studios (Vivendi-Universal Games) who occasionally refers to himself in the third person. Most recently, Mike was the Lead Special Effects Programmer for Darkwatch at Highmoon Studios (previously Sammy Studios) in Carlsbad, CA. Previously Mike has held Lead Programming, Playstation 2 Engine Development and Research roles with Yuke's, VBlank and BlueSky/Titus. He worked for Sony Interactive Studios in PSX development.

Mike has made regular appearances as a speaker at SCEA development conferences and has spoken at GDC. Mike Acton is not a framework-happy C++ programmer. He actually likes C. And assembly. In his spare time he develops hardware on FPGAs in VHDL.

He prefers vi.

Also check out the non-Cell BE articles by Mike Acton.