CellPerformance
All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.
Suggestions? Comments? Questions?

Send email to Mike Acton
Articles
Cross-compiling for PS3 Linux
n this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

Unaligned scalar load and store on the SPU
An example of unaligned loads and stores on the SPU. The solution to this problem is to remember that the SPU does not have a scalar instruction set or access local memory in anything except 16 bytes quadwords.

atan2 on SPU
A branch-free implementation of atan2 vector floats for the SPU.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.

Better Performance Through Branch Elimination
An introduction to branch penalties: Why it's a good idea to avoid branchy code.

Box Overlap
A look at a function to test for overlap between 3D boxes, and how to optimize it for the CBE.

A 4x4 Matrix Inverse
Study case about how to convert scalar code indo SIMD code for PPU and SPU using the matrix inverse as example.

Avoiding Microcoded Instructions On The PPU
Executing instructions from microcode can wreck havok on inner loop performance. Find out which instructions are microcoded and how to avoid them.

Choosing to Avoid Branches: A Small Altivec Example
An example of why less instructions doesn't always equal faster code.

More Techniques for Eliminating Branches
Some additional examples for eliminating integer and floating-point branches.

Programming with Branches, Patterns and Tips
GCC follows some straightforward rules that are useful to know when programming with branches.

Benefits to Branch Elimination
The fundamental principal behind branch elimination is that expressing a value as a simple function of its inputs (a single basic block) is often more efficient than selecting a result through a change in control flow (branching).

Background on Branching
A background in understanding how branches operate on the PPU and SPU.

Links
No Insider Info!
Although discussions on applying the Cell processor to game development are welcome here, do not ask for insider information related to Sony's Playstation 3.

The details of the hardware and development are covered by a non-disclosure agreement and under no conditions will confidential information be permitted on this site.

Playstation 3 developers are welcome to participate in the discussions but be aware that this is a publicly accessable site and information not available to the general public may not be disclosed.

Keep it clean so that we can continue to build on the community of Cell developers both inside and outside video game development.

Thank you for your cooperation,
Mike.
Legal
Content Copyright © 2006 by Mike Acton. All Rights Reserved.

This site uses the Movable Type 3.2 content engine.

This site uses the phpBB bulletin board engine Copyright © 2001, 2005 phpBB Group.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc

PowerPC is a trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds in the U.S. and other countries.

Macintosh, and Mac are registered trademarks of Apple Computer, Inc

All other trademarks are the property of their respective owners.
Reducing The Costs Of Comparisons and Branches
Mike Acton
April 17, 2006
Combining Branches
Let's pretend for the moment that well predicted branches have zero cost and that all the code we need is ready in the instruction cache... If this were true, would we still want to eliminate branches? Yes. Not only are there the general overhead and scheduling costs to branches, but we cannot depend on the compiler to emit optimal-enough code when dealing with conditional expressions. Let's take a look at a simple example:
uint32_t test_1( uint32_t arg0, uint32_t arg1 ) { if ( ( ( arg0 & 0x00040000 ) != 0 ) || ( ( arg1 & 0x00080000 ) != 0 ) ) { return (arg0); } return (arg1); }
Pretty straightforward. We're testing a couple of bits in the arguments to decide which argument we want to return. I want to point out that this example is particularly easy to handle because the C short-circuiting rules do not apply here. The compiler is free to do both tests on either side of the OR because (it can be trivially found) that there are no side-effects. Let's look at the compiler output (I've added some comments):
test_1: andis. 9,3,4 # tmp9 = arg0 & ( 0x0004 << 16 ) # CR[0] = tmp9 <=> 0 rlwinm 9,4,13,31,31 # tmp9 = tmp0 mr 0,3 # tmp0 = arg0 stwu 1,-16(1) # sp[-16] = sp # sp = sp-16 mr 3,4 # result = arg1 cmpwi 7,9,0 # CR[7] = tmp0 <=> 0 # TEST_ARG0: bne- 0,.L3 # if_unlikely ( CR[0] != 0 ) goto COPY_ARG0 # i.e. ( arg0 & 0x00040000 ) != 0 ) # TEST_ARG1: beq- 7,.L1 # if_unlikely ( CR[7] == 0 ) goto DONE # i.e. ( arg1 & 0x00080000 ) == 0 ) .L3: # COPY_ARG0: mr 3,0 # result = tmp0 .L1: # DONE: addi 1,1,16 # sp = sp + 16 blr # RETURN (result)
Aside: Note that the stack pointer (sp) is being pushed and popped although the stack isn't being used in this function. Presumably, a function this small would be inlined so that would not occur. BTW, the -fomit-frame-pointer flag is ignored, if you're thinking that would help... I'm going to remove the stack push/pop code from any compiler output from this point on, unless the stack is actually being used. We'll assume the functions are inlined and that if they aren't you are aware of this cost.
This version of GCC always saves the stack pointer to the stack, whether or not you are using it. Inline small functions.
First notice the static branch prediction: (Syntax note: "-" added to a branch instruction indicates it is unlikely. Conversely, "+" indicates a likely branch.) It's unlikely that we will return arg0 because of the first test, but it is also unlikely that we will return arg1 due to the second test. Which means that it is unlikely this function will be fast. i.e.
if ( ( arg0 & 0x00040000 ) != 0 ) TEST_ARG0 is UNLIKELY, return(arg0) else if ( ( arg1 & 0x00080000 ) == 0 ) TEST_ARG1 is UNLIKELY, return(arg1) return (arg0)
The compiler has decided that the most likely scenario is that the first test fails and the second test succeeds. Was that what you expected? The and with arg0 is pretty interesting too:
andis. 9,3,4 # tmp9 = arg0 & ( 0x0004 << 16 ) # CR[0] = tmp9 <=> 0
"andis." means and with 16 bit immediate and shift, then update CR[0] -- The dot at the end of an instruction means "then update CR[0]". It's that dot that we should be concerned about. The CR is a very limited resource (32 bits = 8 x 4 bits) shared by all the execution units.
When multiple forms were found (most cases), sequences that did not set the Carry bit, had more parallelism, had fewer register operands, or generalized to 64-bit implementations were favored. A clever compiler might want to consider multiple sequences to minimize resource conflicts.
Prefer instructions that do not affect the Condition Register (CR).
Note the register moves:
mr 0,3 ... mr 3,4 ... mr 3,0
A lot of moves ("mr" means move register) among registers is usually a good indication of some kind of optimization barrior. Seeing three moves in such a short instruction stream is a good hint that something needs fixing.
Look for too many register moves in your code. By themselves they aren't that expensive, but they are often a good litmus test for slow code.
What we want: the whole test should be unlikely Following the logic that branches should be assumed not taken, we want to make it as simple as possible for the compiler to understand what we want. We can do this fairly easily by combining our comparison operations and only testing a single value. Here's the change:
uint32_t test_2( uint32_t arg0, uint32_t arg1 ) { uint32_t arg0_test = arg0 & 0x00040000; uint32_t arg1_test = arg1 & 0x00080000; uint32_t test_comb = arg0_test | arg1_test; if ( test_comb != 0 ) { return (arg0); } return (arg1); }
There is absolutely no possible way for the compiler to generate more than one branch in this scenario. We are only testing one value. The rest is simple bit operations. Can't be anything else. Really. Are you convinced yet? :) Take a look at the output:
test_3: rlwinm 9,3,0,13,13 # arg0_test = arg0 & 0x00040000 rlwinm 0,4,0,12,12 # arg1_test = arg1 & 0x00080000 or. 11,9,0 # test_comb = arg0_test | arg1_test # CR[0] = test_comb <=> 0 # TEST_TEST_COMB: bne- 0,.L7 # if_unlikely ( CR[0] != 0 ) goto DONE # i.e. ( test_comb != 0 ) mr 3,4 # result = arg1 .L7: # DONE: blr # RETURN (result)
Pretty much what you expected right?
bne- 0,.L7 # if_unlikely ( CR[0] != 0 ) goto DONE
The branch not-taken is unlikely. Everything is looking up. But did you see the "or." coming? That's right, the compiler's trying to be tricky and use the result of the or to determine the test value instead of doing a separate test. Again, prefer instruction sequences that do not modify or read the Condition Register.
Combine your comparisons into a single value before you branch. This makes the compiler's job easier and your code will be faster.
How can we reduce the dependencies on the Condition Register? Let me give you a hint. Here's the answer.
uint32_t test_3( uint32_t arg0, uint32_t arg1 ) { uint32_t arg0_test = arg0 & 0x00040000; uint32_t arg1_test = arg1 & 0x00080000; uint32_t test_comb = arg0_test | arg1_test; uint32_t mask_comb = (uint32_t)(((int32_t)( test_comb | -test_comb ))>>31); uint32_t result = ( arg0 & mask_comb ) | ( arg1 & (~mask_comb) ); return (result); }
And the final output:
test_4: rlwinm 8,3,0,13,13 rlwinm 10,4,0,12,12 or 7,8,10 neg 6,7 srawi 0,6,31 and 9,3,0 andc 3,4,0 or 3,9,3 blr
We'll go over how that works and why it works so much better in Part 3...