CellPerformance
All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.
Suggestions? Comments? Questions?

Send email to Mike Acton
Articles
Cross-compiling for PS3 Linux
n this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

Unaligned scalar load and store on the SPU
An example of unaligned loads and stores on the SPU. The solution to this problem is to remember that the SPU does not have a scalar instruction set or access local memory in anything except 16 bytes quadwords.

atan2 on SPU
A branch-free implementation of atan2 vector floats for the SPU.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.

Better Performance Through Branch Elimination
An introduction to branch penalties: Why it's a good idea to avoid branchy code.

Box Overlap
A look at a function to test for overlap between 3D boxes, and how to optimize it for the CBE.

A 4x4 Matrix Inverse
Study case about how to convert scalar code indo SIMD code for PPU and SPU using the matrix inverse as example.

Avoiding Microcoded Instructions On The PPU
Executing instructions from microcode can wreck havok on inner loop performance. Find out which instructions are microcoded and how to avoid them.

Choosing to Avoid Branches: A Small Altivec Example
An example of why less instructions doesn't always equal faster code.

More Techniques for Eliminating Branches
Some additional examples for eliminating integer and floating-point branches.

Programming with Branches, Patterns and Tips
GCC follows some straightforward rules that are useful to know when programming with branches.

Benefits to Branch Elimination
The fundamental principal behind branch elimination is that expressing a value as a simple function of its inputs (a single basic block) is often more efficient than selecting a result through a change in control flow (branching).

Background on Branching
A background in understanding how branches operate on the PPU and SPU.

Links
No Insider Info!
Although discussions on applying the Cell processor to game development are welcome here, do not ask for insider information related to Sony's Playstation 3.

The details of the hardware and development are covered by a non-disclosure agreement and under no conditions will confidential information be permitted on this site.

Playstation 3 developers are welcome to participate in the discussions but be aware that this is a publicly accessable site and information not available to the general public may not be disclosed.

Keep it clean so that we can continue to build on the community of Cell developers both inside and outside video game development.

Thank you for your cooperation,
Mike.
Legal
Content Copyright © 2006 by Mike Acton. All Rights Reserved.

This site uses the Movable Type 3.2 content engine.

This site uses the phpBB bulletin board engine Copyright © 2001, 2005 phpBB Group.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc

PowerPC is a trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds in the U.S. and other countries.

Macintosh, and Mac are registered trademarks of Apple Computer, Inc

All other trademarks are the property of their respective owners.
Choosing to Avoid Branches: A Small Altivec Example
Mike Acton
April 20, 2006
Balancing speed and instruction count
I came across an interesting bit of Altivec code [simdtech.org] on the Altivec mailing list [simdtech.org] today and I thought it would make a good example of how speed is not simply a matter of instruction count on the PPU and SPU.

The function (modified slightly for this example):
vector unsigned short test( vector unsigned short a, vector unsigned short b ) { const vector unsigned short mask = (vector unsigned short)vec_cmplt( a, b ); if (vec_all_ge(a, b)) { return (a); } return (mask); }
Ingoring the function stack push/pop, we would have a very few instructions. Something like:
vcmpgtuh. v0, v3, v2 mfcr r0 rlwinm r0, r0, 27, 1 vcmpgtuh v0, v3, v2 cmpwi cr7, r0, 0 bne+ cr7, L2 vor v2, v0, v0 L2:
Ben Weiss [shellandslate.com] wrote this snippet to demonstrate that GCC does not optimize out redundant calls when Altivec instructions are called both with and without the CR-modify bit set. It's worth noting that this code (in the general case) will be a performance loss. Not only will there very likely be performance issues in these instructions themselves, but it will also affect the surrounding code by creating an optimization barrier.

Read on to find out more about the issues and examine the alternative...
Tracing the redundant instruction
A problem with the emitted assembly above is that the "vcmpgtuh" is completely redundant. The "vcmpgtuh." already calculates the needed result. So why is it emitted?

The predicate (CR modifying) Altivec instructions are mapped to their builtin descriptions in rs6000.c (gcc/config/rs6000)
static const struct builtin_description_predicates bdesc_altivec_preds[] = { { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf, "*vcmpbfp.", "__builtin_altivec_vcmpbfp_p", ALTIVEC_BUILTIN_VCMPBFP_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf, "*vcmpeqfp.", "__builtin_altivec_vcmpeqfp_p", ALTIVEC_BUILTIN_VCMPEQFP_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf, "*vcmpgefp.", "__builtin_altivec_vcmpgefp_p", ALTIVEC_BUILTIN_VCMPGEFP_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf, "*vcmpgtfp.", "__builtin_altivec_vcmpgtfp_p", ALTIVEC_BUILTIN_VCMPGTFP_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4si, "*vcmpequw.", "__builtin_altivec_vcmpequw_p", ALTIVEC_BUILTIN_VCMPEQUW_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4si, "*vcmpgtsw.", "__builtin_altivec_vcmpgtsw_p", ALTIVEC_BUILTIN_VCMPGTSW_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4si, "*vcmpgtuw.", "__builtin_altivec_vcmpgtuw_p", ALTIVEC_BUILTIN_VCMPGTUW_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v8hi, "*vcmpgtuh.", "__builtin_altivec_vcmpgtuh_p", ALTIVEC_BUILTIN_VCMPGTUH_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v8hi, "*vcmpgtsh.", "__builtin_altivec_vcmpgtsh_p", ALTIVEC_BUILTIN_VCMPGTSH_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v8hi, "*vcmpequh.", "__builtin_altivec_vcmpequh_p", ALTIVEC_BUILTIN_VCMPEQUH_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v16qi, "*vcmpequb.", "__builtin_altivec_vcmpequb_p", ALTIVEC_BUILTIN_VCMPEQUB_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v16qi, "*vcmpgtsb.", "__builtin_altivec_vcmpgtsb_p", ALTIVEC_BUILTIN_VCMPGTSB_P }, { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v16qi, "*vcmpgtub.", "__builtin_altivec_vcmpgtub_p", ALTIVEC_BUILTIN_VCMPGTUB_P } };
Then in altivec.h the predicate intrinsics e.g. vec_all_gte() directly call the builtins. There doesn't appear to be any RTL references to them at all. So the scheduler has no way of knowing the two instructions are related.

For additional validation, examing the GCC debug output (*.34.sched2)
;; ====================================================== ;; -- basic block 0 from 61 to 17 -- after reload ;; ====================================================== ;; 0--> 61 %0=vrsave :nothing ;; 0--> 63 %12=%0|0xffffffffc0000000 :fxu_sim_cell ;; 2--> 62 [%1-0x4]=%0 :lsu_cell ;; 2--> 64 {vrsave=unspec/v[%12,vrsave] 30;clo:fxu_sim_cell ;; 4--> 14 {%6=unspec[%3,%2,`*vcmpgtuh.'] 175;:vus_cell ;; 6--> 25 %1=%2 :vus_cell ;; 8--> 16 %6=cmp(%6==0x0,0x0) :bru_cr_cell ;; 8--> 10 %0=unspec[%3,%2] 58 :vus_cell ;; 12--> 17 pc={(%6!=0x0)?L38:pc} :bru_cell
To enable debugging output in GCC, add the -dall compilation flag. For more information on the GCC debugging options see: Options for Debugging Your Program or GCC [gnu.org]

*.34.sched2 represents the final pass of the compiler and it would seem from this line
;; 4--> 14 {%6=unspec[%3,%2,`*vcmpgtuh.'] 175;:vus_cell
that "vcmpgtuh." is pretty much a text replacement.

Based on this, don't expect any of the Altivec instructions with the CR modify bit set to fold with the plain version under any conditions.
In addition to the various other reasons for avoiding the CR modifying Altivec instructions, this is a good reason to:
Avoid CR-modifying Altivec instructions (vec_any_*, vec_all_*)
What are the other problems?
On closer examination, these few lines of code have quite a few potential performance issues.
vcmpgtuh. v0, v3, v2
The "vcmpgtuh." instruction modifies the Condition Register (CR). This could cause problems because the Vector Execution Unit (VXU) must contend with the Floating Point Unit (FPU) and the Fixed Point Execution Unit (FXU) for access to the 32 bit CR. It would be resonable to expect that under fairly common conditions, this would cause a pipeline bubble.

Another disadvantage to modifying the CR is the added difficulty in instruction scheduling. The CR is a very scarce resource and the options the compiler has for moving the code (to maximize throughput) is now more reliant on the availability of the 4 bits of the CR needed to store this result than the actual instruction throughput or latency.

mfcr r0 rlwinm r0, r0, 27, 1
In order to free the Altivec modified field in the condition register, the value is moved to a fixed-point register (mfcr). The CR value must also be shifted in order to create a predicate value. In this case, the "vcmpgtuh." doesn't just lock the condition register, but also takes resources from the fixed-point unit.
bne+ cr7, L2
The first problem with introducing this branch instruction is that it creates an optimization barrier for the compiler. Although sometimes possible, it is very difficult to lift code from inside a branch target and schedule it with code in another block. And GCC is especially weak at this particular optimization.

vor v2,v0,v0
The "vor" instruction simply takes the result in the default case and moves it to the return register. By itself, there is very little lost, but in this context the instruction must wait the full latency of the compare instruction in the best case, and for a full pipeline flush in the worst case, in order to execute. And because the branch introduces an optimization barrier, there is very little oportunity to schedule instructions during that period.

What's the alternative?
Simply put - remove the branch.

Here is one possible implementation:
const vector unsigned int zero = vec_xor( a, a ); const vector unsigned short mask_lt = vec_cmplt( a, b ); const vector unsigned short mask_e0 = vec_sld( mask_lt, mask_lt, 8 ); const vector unsigned short mask_e1 = vec_sld( mask_lt, mask_lt, 12 ); const vector unsigned short mask_e2 = vec_and( mask_lt, mask_e0 ); const vector unsigned int mask_e3 = (vector unsigned int)vec_and( mask_e2, mask_e1 ); const vector unsigned short mask_all_ge = (vector unsigned short)vec_cmpeq( mask_e3, zero ); const vector unsigned short result = vec_sel( a, mask_lt, mask_all_ge );
Which generated the following results for ppu-gcc (3.4.1) (-O3)
vspltish v0, 0 vcmpgtuh v3, v3, v2 vor v13, v2, v2 vsldoi v2, v3, v3, 8 vsldoi v1, v3, v3, 12 vand v2, v3, v2 vand v2, v2, v1 vcmpequw v2, v2, v0 vsel v2, v13, v3, v2
Examine the difference in complexity between the original and branch-free version of this function:
Original:
Sched2, original code
Branch-free:
Sched2, branch-free code
Result
Although the branch-free code emits a slightly longer sequence, the final complexity is lower (notice that there is only a simple, single path - which was, of course, the point.)
About the author
Mike Acton is a Senior Architect working on PS3/Cell research at Highmoon Studios (Vivendi-Universal Games) who occasionally refers to himself in the third person. Most recently, Mike was the Lead Special Effects Programmer for Darkwatch at Highmoon Studios (previously Sammy Studios) in Carlsbad, CA. Previously Mike has held Lead Programming, Playstation 2 Engine Development and Research roles with Yuke's, VBlank and BlueSky/Titus. He worked for Sony Interactive Studios in PSX development.

Mike has made regular appearances as a speaker at SCEA develpment conferences and has spoken at GDC. Mike Acton is not a framework-happy C++ programmer. He actually likes C. And assembly. In his spare time he develops hardware on FPGAs in VHDL.

He prefers vi.

Also check out the non Cell BE articles by Mike Acton.