CellPerformance
All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.
Suggestions? Comments? Questions?

Send email to Mike Acton
Articles
Cross-compiling for PS3 Linux
n this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

Unaligned scalar load and store on the SPU
An example of unaligned loads and stores on the SPU. The solution to this problem is to remember that the SPU does not have a scalar instruction set or access local memory in anything except 16 bytes quadwords.

atan2 on SPU
A branch-free implementation of atan2 vector floats for the SPU.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.

Better Performance Through Branch Elimination
An introduction to branch penalties: Why it's a good idea to avoid branchy code.

Box Overlap
A look at a function to test for overlap between 3D boxes, and how to optimize it for the CBE.

A 4x4 Matrix Inverse
Study case about how to convert scalar code indo SIMD code for PPU and SPU using the matrix inverse as example.

Avoiding Microcoded Instructions On The PPU
Executing instructions from microcode can wreck havok on inner loop performance. Find out which instructions are microcoded and how to avoid them.

Choosing to Avoid Branches: A Small Altivec Example
An example of why less instructions doesn't always equal faster code.

More Techniques for Eliminating Branches
Some additional examples for eliminating integer and floating-point branches.

Programming with Branches, Patterns and Tips
GCC follows some straightforward rules that are useful to know when programming with branches.

Benefits to Branch Elimination
The fundamental principal behind branch elimination is that expressing a value as a simple function of its inputs (a single basic block) is often more efficient than selecting a result through a change in control flow (branching).

Background on Branching
A background in understanding how branches operate on the PPU and SPU.

Links
No Insider Info!
Although discussions on applying the Cell processor to game development are welcome here, do not ask for insider information related to Sony's Playstation 3.

The details of the hardware and development are covered by a non-disclosure agreement and under no conditions will confidential information be permitted on this site.

Playstation 3 developers are welcome to participate in the discussions but be aware that this is a publicly accessable site and information not available to the general public may not be disclosed.

Keep it clean so that we can continue to build on the community of Cell developers both inside and outside video game development.

Thank you for your cooperation,
Mike.
Legal
Content Copyright © 2006 by Mike Acton. All Rights Reserved.

This site uses the Movable Type 3.2 content engine.

This site uses the phpBB bulletin board engine Copyright © 2001, 2005 phpBB Group.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc

PowerPC is a trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds in the U.S. and other countries.

Macintosh, and Mac are registered trademarks of Apple Computer, Inc

All other trademarks are the property of their respective owners.
More Techniques for Eliminating Branches
Mike Acton
April 11, 2006
Better Performance Through Branch Elimination
Part 1: Introduction
Part 2: Background on Branching
Part 3: Benefits to Branch Elimination
Part 4: Programming with Branches, Patterns and Tips
Part 5: More Techniques for Eliminating Branches

Additional Examples:

Increment And Decrement Wrapping Values
Occasionally you have a set of values that you want to wrap around as you increment and decrement them. But the straightfoward implementation can have a big impact on processors where comparisons and branches are expensive (e.g. PowerPC). This article presents a straightforward branch-free implementation of these functions.

Choosing to Avoid Branches: A Small Altivec Example
An example of why less instructions doesn't always equal faster code.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.

Use masks to eliminate integer branches
uint32_t test_3( uint32_t arg0, uint32_t arg1 )
{
  uint32_t arg0_test = arg0 & 0x00040000;
  uint32_t arg1_test = arg1 & 0x00080000;  

  if (arg0_test | arg1_test )
  {
    return (arg0);
  }

  return (arg1);
}
Can be converted to:
uint32_t test_3( uint32_t arg0, uint32_t arg1 )
{
  uint32_t arg0_test          = arg0 & 0x00040000;
  uint32_t arg1_test          = arg1 & 0x00080000;  
  uint32_t test_comb          = arg0_test | arg1_test;
The important thing to recognize is that any negative number will have the most significant bit (MSB) set, and that's the only bit we care about. Because the and'ed values are always positive, negating them will always give a negative number, unless that number is zero. For zero, the negative is still exactly zero.
  uint32_t test_comb_sign_msb = -test_comb;
The cast to int32_t is important. This forces an arithmetic shift right (SRA), not a logical shift right (SRL). An SRA will use the MSB of the integer to fill the gap on the left caused by the shift (and is used with signed values). A SRL will fill the gap with zero (and is used with unsigned values). And because we know the sign is stored in the MSB and because we are shifting the MSB all the way to the right we know that the entire value will be filled with the sign bit. In other words, we are saturating the value with the sign bit.

The cast back into an unsigned value is just to be sure that there are no unwanted sign extensions. In this particular case it wouldn't really make a difference, but it's a good habit.
  uint32_t mask_comb          = (uint32_t)(((int32_t)test_comb_sign_msb)>>31);
If all the bits of mask_comb are one, then ( arg0 & mask_comb ) must be arg0 and ( arg1 & (~mask_comb) ) must be zero.
If all the bits of mask_comb are zero, then ( arg0 & mask_comb ) must be zero and ( arg1 & (~mask_comb) ) must be arg1.
 
  uint32_t result             = ( arg0 & mask_comb ) | ( arg1 & (~mask_comb) );

  return (result);
}
Use signed values for math. Use unsigned values for bit manipulation, unless there is a need for a signed intruction. Remember to cast it back at the end.
The PPU has two instructions that are very handy for dealing with masks:
andc (and with Complement) ( c = a & ~b )
orc  (or with Complement)  ( c = a | ~b )
Which makes this operation extremely quick. And now that the result has been selected based on the mask, it can simply be returned.

Looking at the generated code:
test_3:
    rlwinm 8,3,0,13,13      # arg0_test     = arg0 << 13
    rlwinm 10,4,0,12,12     # arg1_test     = arg1 << 12
    or 7,8,10               # test_comb     = arg0_test | arg1_test
    neg 6,7                 # test_comb_neg = -test_comb
    srawi 0,6,31            # mask_comb     = test_comb_neg SRA 31
    and 9,3,0               # arg0_result   = arg0 & mask_comb
    andc 3,4,0              # arg1_result   = arg1 & ~mask_comb
    or 3,9,3                # result        = arg0_result | arg1_result
    blr                     # RETURN (result)
There are two main things to notice:
  • No branches
  • No comparison operations, thus no dependencies on CR at all
Because there are no optimization barriers in this code, when it is inlined into another section of code the compiler will be able to schedule it very well. With the branches and comparison instructions, the compiler is much more limited on how the instructions can be scheduled and must also deal with any contention for the condition registers.

Use the fsel instruction to eliminate floating point branches
There's additional FPU instruction available on the PPU that will selectively move between a choice of two floating point registers based on the value of a third. fsel.

fsel result, test, arg0, arg1 
Which is logically equivalent to:
result = ( test >= 0.0 ) ? arg1 : arg0; 
EXAMPLE (1):
float test_4( float arg0, float arg1 )
{
  const float result0     = arg0;
  const float result1     = arg1;
  const float arg0_test   = arg0 - 1.0;
  const float arg1_test   = arg1 - 0.5;
  const float arg0_result = ( arg0_test >= 0.0f ) ? result1 : result0;
  const float arg1_result = ( arg1_test >= 0.0f ) ? result0 : arg0_result;
  
  return ( arg1_result );
}
.LC5:
    .long    1065353216
.LC6:
    .long    1056964608
.LC7:
    .long    0
test:
    lis 7,.LC5@ha 
    fmr 11,1   
    la 6,.LC5@l(7) 
    stwu 1,-32(1)  
    lis 5,.LC6@ha  
    lfs 5,0(6)   
    fsubs 12,1,5  
    la 4,.LC6@l(5)
    stfs 2,16(1) 
    lis 3,.LC7@ha
    lfs 3,0(4)   
    la 9,.LC7@l(3)
    fsubs 2,2,3 
    lwz 0,16(1)
    lfs 13,0(9) 
    mtctr 0     
    fcmpu 7,12,13
    fcmpu 6,2,13 
    bge- 7,.L16 
    stfs 1,16(1)
    lwz 8,16(1)
    mtctr 8    
.L16:
    bge- 6,.L19 
    mfctr 10   
    stw 10,16(1)
    lfs 11,16(1)
.L19:
    fmr 1,11
    addi 1,1,32
    blr
Use instead:
static inline float fsel_gez( float test, float arg0, float arg1 )
{
  float result;

  __asm__ ( "fsel %0, %1, %2, %3": "=f"(result): "f"(test), "f"(arg0), "f"(arg1) );

  return (result);
}

float test( float arg0, float arg1 )
{
  const float result0     = arg0;
  const float result1     = arg1;
  const float arg0_test   = arg0 - 1.0;
  const float arg1_test   = arg1 - 0.5;
  const float arg0_result = fsel_gez( arg0_test, result1, result0     );
  const float arg1_result = fsel_gez( arg1_test, result0, arg0_result );
  
  return ( arg1_result );
}
The less-than-equal comparison has been inverted to a greater-than-equal comparison. Which means that equal-to-zero will give the incorrect result. For many circumstances this is acceptable. An epsilon value may be added to the comparison in order to select on equal-to within a certain range.

The results from this are significantly better:
.LC5:
    .long    1065353216
.LC6:
    .long    1056964608
test:
    lis 5,.LC5@ha 
    lis 3,.LC6@ha 
    la 4,.LC5@l(5)
    la 9,.LC6@l(3)
    stwu 1,-16(1) 
    lfs 5,0(4)
    addi 1,1,16
    lfs 13,0(9)
    fsubs 4,1,5 
    fsubs 3,2,13
    fsel 0, 4, 2, 1
    fsel 2, 3, 1, 0
    fmr 1,2 
    blr 
EXAMPLE (2):

To calculate the minimum of three floats.
Although this function can be reduced, it serves as an example of using the result of fsel to generate a mask (0.0 or 1.0) which can then be used in calculations.
float f_min_rgb( float r, float g, float b )
{
  float d_rg     = g - r;                            /* >= 0 when ( r <= g )                 */
  float d_rb     = b - r;                            /* >= 0 when ( r <= b )                 */
  float d_gb     = b - g;                            /* >= 0 when ( g <= b )                 */
  float m_rg     = fsel_gez( d_rg, 1.0f, 0.0f );     /* ( r <= g )                           */
  float m_rb     = fsel_gez( d_rb, 1.0f, 0.0f );     /* ( r <= b )                           */
  float m_gb     = fsel_gez( d_gb, 1.0f, 0.0f );     /* ( g <= b )                           */
  float m_gr     = 1.0f - m_rg;                      /* ( g <= r )                           */
  float m_br     = 1.0f - m_rb;                      /* ( b <= r )                           */
  float m_bg     = 1.0f - m_gb;                      /* ( b <= g )                           */
  float r_gb     = m_rg * m_rb;                      /* (( r <= g ) && ( r <= b ))           */
  float g_rb     = m_gr * m_gb;                      /* (( g <= r ) && ( g <= b ))           */
  float b_rg     = m_br * m_bg;                      /* (( b <= r ) && ( b <= g ))           */
  float gb_r     = 1.0f - r_gb;                      /* (( g <= r ) && ( b <= r ))           */
  float rb_g     = 1.0f - g_rb;                      /* (( r <= g ) && ( b <= g ))           */
  float result_r = r * r_gb;                         /* (( r <= g ) && ( r <= b )) ? r : 0.0 */
  float result_g = g * g_rb * gb_r;                  /* (( g < r )  && ( g <= b )) ? g : 0.0 */
  float result_b = b * b_rg * rb_g;                  /* (( b < r )  && ( b < g ))  ? b : 0.0 */
  float result   = result_r + result_g + result_b;

  return (result);
}
Use fast versions of floor() and ceil() to eliminate floating point branches
Describe example of using float->int conversion as a mask for step functions
Predicated Reads and Writes
Predicated read example:
if ( foo.a > b )
{
  c += foo.a;
}
In this case (predicated read), create a local zero variable to the effect of
mask   = test_gt( foo.a, b );
add_c  = select( mask, foo.a, zero );
c     += add_c;
Predicated write example:
if ( a > b )
{
  foo.c += a;
}
In this case have a junk variable on the stack. The stack is very likely to be on-cache, so the write to L1 will be faster than the cost of even a few percentage points of mis-predicted branches: (There is no PPU or SPU equivalent to the P4 cmove instruction)

out_junk   = &junk;
out_foo_c  = &foo.c
mask       = test_gt( a, b );
out_addr   = select( mask, out_foo_c, out_junk );
*out_addr += a;
About the authors
Mike Acton is a Senior Architect working on PS3/Cell research at Highmoon Studios (Vivendi-Universal Games). Most recently, Mike was the Lead Special Effects Programmer for Darkwatch at Highmoon Studios (previously Sammy Studios) in Carlsbad, CA. Previously Mike has held Lead Programming, Playstation 2 Engine Development and Research roles with Yuke's, VBlank and BlueSky/Titus. He worked for Sony Interactive Studios in PSX development.

Mike has made regular appearances as a speaker at SCEA develpment conferences and has spoken at GDC. Mike Acton is not a framework-happy C++ programmer. He actually likes C. And assembly. In his spare time he develops hardware on FPGAs in VHDL.


André de Leiradella started with computers in 1981 with a ZX-81 compatible with a built-in BASIC interpreter. Two years later he was already using assembly to code faster programs in this platform. Today André is a Support Analyst working at Xerox, and uses he's spare time to code pet projects that goes from using script languages to control a VNC server and fixed point expression compilers for J2ME, to a full Java bytecode interpreter.

André has a Computer Science degree and a M.Sc. degree in Artificial Intelligence and has a fairly good knowledge of many areas of Computer Science, managing to dig deep on any specific subject when needed.

May Also Interest You
Understanding Strict Aliasing (Mike Acton)
Strict aliasing has been part of C programming for the better part of the last decade but a thorough understanding of the details of this feature is still clouded in mystery for many programmers. Examine detailed examples and some perculiarities of GCC's implementation.

Demystifying The Restrict Keyword (Mike Acton)
Optimizing data access is a critical part of good performance. Read on to find out how to use the restrict keyword to open up a whole class of optimizations that were previously impossible for a C compiler.

Avoiding Microcoded Instructions On The PPU (Mike Acton)
Executing instructions from microcode can wreck havok on inner loop performance. Find out which instructions are microcoded and how to avoid them.

FullMoon [geocites.com] (André de Leiradella)
This plugin is an attempt to make object plugin coding for Moray [stmuc.com] easier, by allowing authors to code them using a script language called Lua. Lua is a procedural language with some OOP, meaning it's very easy to learn because it's similar to C and Pascal and it allows a very powerful design of the object interface.

APF Project [geocities.com] (André de Leiradella)
This library is targetted at programs that need to find paths in small mazes. Diagonal movements are allowed.