CellPerformance
All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.
Suggestions? Comments? Questions?

Send email to Mike Acton
Articles
Cross-compiling for PS3 Linux
n this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

Unaligned scalar load and store on the SPU
An example of unaligned loads and stores on the SPU. The solution to this problem is to remember that the SPU does not have a scalar instruction set or access local memory in anything except 16 bytes quadwords.

atan2 on SPU
A branch-free implementation of atan2 vector floats for the SPU.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.

Better Performance Through Branch Elimination
An introduction to branch penalties: Why it's a good idea to avoid branchy code.

Box Overlap
A look at a function to test for overlap between 3D boxes, and how to optimize it for the CBE.

A 4x4 Matrix Inverse
Study case about how to convert scalar code indo SIMD code for PPU and SPU using the matrix inverse as example.

Avoiding Microcoded Instructions On The PPU
Executing instructions from microcode can wreck havok on inner loop performance. Find out which instructions are microcoded and how to avoid them.

Choosing to Avoid Branches: A Small Altivec Example
An example of why less instructions doesn't always equal faster code.

More Techniques for Eliminating Branches
Some additional examples for eliminating integer and floating-point branches.

Programming with Branches, Patterns and Tips
GCC follows some straightforward rules that are useful to know when programming with branches.

Benefits to Branch Elimination
The fundamental principal behind branch elimination is that expressing a value as a simple function of its inputs (a single basic block) is often more efficient than selecting a result through a change in control flow (branching).

Background on Branching
A background in understanding how branches operate on the PPU and SPU.

Links
No Insider Info!
Although discussions on applying the Cell processor to game development are welcome here, do not ask for insider information related to Sony's Playstation 3.

The details of the hardware and development are covered by a non-disclosure agreement and under no conditions will confidential information be permitted on this site.

Playstation 3 developers are welcome to participate in the discussions but be aware that this is a publicly accessable site and information not available to the general public may not be disclosed.

Keep it clean so that we can continue to build on the community of Cell developers both inside and outside video game development.

Thank you for your cooperation,
Mike.
Legal
Content Copyright © 2006 by Mike Acton. All Rights Reserved.

This site uses the Movable Type 3.2 content engine.

This site uses the phpBB bulletin board engine Copyright © 2001, 2005 phpBB Group.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc

PowerPC is a trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds in the U.S. and other countries.

Macintosh, and Mac are registered trademarks of Apple Computer, Inc

All other trademarks are the property of their respective owners.
Benefits to Branch Elimination
Mike Acton
April 11, 2006
Better Performance Through Branch Elimination
Part 1: Introduction
Part 2: Background on Branching
Part 3: Benefits to Branch Elimination
Part 4: Programming with Branches, Patterns and Tips
Part 5: More Techniques for Eliminating Branches

Additional Examples:

Increment And Decrement Wrapping Values
Occasionally you have a set of values that you want to wrap around as you increment and decrement them. But the straightfoward implementation can have a big impact on processors where comparisons and branches are expensive (e.g. PowerPC). This article presents a straightforward branch-free implementation of these functions.

Choosing to Avoid Branches: A Small Altivec Example
An example of why less instructions doesn't always equal faster code.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.

Background
The fundamental principal behind branch elimination is that expressing a value as a simple function of its inputs (a single basic block) is often more efficient than selecting a result through a change in control flow (branching).

Take for example, this simple step function:
if ( x < x1 )
{
  a = a0;
}
else if ( x < x2 )
{
  a = a1;
}
else if ( x < x3 )
{
  a = a2;
}
else if ( x < x4 )
{
  a = a3;
}
else
{
  a = a4;
}

This function can also be expressed by the following branch-free expression, where x is a unsigned 32 bit integer in the range 0 < x < 0x7fffffff
uint32_t x_lt_x1  = (uint32_t)((int32_t)(x - x1) >> 31);
uint32_t x_lt_x2  = (uint32_t)((int32_t)(x - x2) >> 31);
uint32_t x_lt_x3  = (uint32_t)((int32_t)(x - x3) >> 31);
uint32_t x_lt_x4  = (uint32_t)((int32_t)(x - x4) >> 31);
uint32_t result_0 = (x_lt_x4 & a3 ) | (~x_lt_x4 & a4);
uint32_t result_1 = (x_lt_x3 & a2 ) | (~x_lt_x3 & result_0);
uint32_t result_2 = (x_lt_x2 & a1 ) | (~x_lt_x2 & result_1);
uint32_t result   = (x_lt_x1 & a0 ) | (~x_lt_x1 & result_2);

Mispredicted branches that continue execution at the wrong address will cause all the instructions that have been executed up to the time the right prediction is found to be canceled, causing a delay in program execution.

Both the PPU and the eight (8) on-board SPUs are in-order, superscalar, RISC platforms designed for one thing - throughput, or executing the maximum number of instructions in the minimum amount of time. In the best case both architectures can issue two (2) instructions per clock cycle. But in order to approach the maximum throughput in performance critical code on either architecture, two things must be true:
  • There must be something to execute on each cycle.
  • There must be enough instructions between the start of an instruction and the use of its result to cover the latency.
Because the PPU and SPU have simple (or no) branch predictors, and because they are both in-order architectures, in a significant amount of cases it will be faster to evaluate both sides of a branch then select the correct result at the end. Both SIMD instruction sets have support for partial predication (select instructions). The PPU's FPU also has support for partial predication (the fsel instruction). But the PPU's FXU (fixed-point execution unit, i.e. integer unit) does not. In that case software predication is used (e.g. integer masks)

Here is a summary of the best reasons to consider writing branch-free code (i.e. branch elimination):


And in consideration of the one main objection: Harder to understand

Better Code Scheduling
Regardless of how good the branches are predicted, because branches negatively impact how code is scheduled, there is often benefit to reducing them.

Most of the work that the compiler's code scheduler will do will be on basic blocks. A basic block is any series of instructions that can be executed unconditionally. Within a block, the compiler will be able to schedule instructions and manage the registers very well. However it's between basic blocks that things get complicated.

Every branch divides the code into three basic blocks.
For example:
A: do some stuff ... 
if ( some_condition ) 
{ 
  B: do some stuff ... 
} 
C: do some stuff ... 

The scheduler must decide:
  • Where to put A,B and C (since they do not need to be contiguous).
  • How registers will be allocated inside of A, B and C
  • Can some_condition be predicited statically?
  • etc.

However, by eliminating the branch altogether the compiler's job is made much easier and better code generation will (often) result. There are no guarantees, of course -- in many situations a branch is exactly what you want.

Less branches means easier to schedule loads and stores together. Although GCC has a limited ability to move loads outside of a branch, it's very conservative. For the PPU and SPU, the most effective pattern is (for all objects in a block) LOAD -> UPDATE -> STORE. Use the large register files and pipeline the loads early. For an example of the effective use of this pattern, see the Box Overlap example.

The advantage of consecutive loads is by pushing the loads as far in advance as possible (which is done by loading to the limit of the available registers) the maximum amount of time can be exploited for data to be loaded from the cache.
Easier to profile
Less branches means more predictable performance. Fundamentally, all branch prediction schemes are concerned with average performance. By contrast, once data and instruction cache hits are accounted for, branch-free code has constant, fixed performance. This allows programmers who are optimizing a section of code based on its performance profile with only a small section of either time or data available, to effectively ignore the impact varying data may have on performance. This property, in turn, makes any evaluation of the performance of other systems (e.g. cache hits) more accurate and valuable to the programmer.
Easier to test
Less branches means less complexity to test. Approaching full code coverage becomes much easier with every branch removed. When complex data is mixed with branchy code, there is an explosion of scenarios which must be tested, either by programmer-devised or automatically-generated tests. Each element of the data must be tested for each branch it may encounter and every branch must be tested to ensure that it is evaluating the correct data. In branch-free sections of code, the tests can be limited to concentrate on the data since every element of data will follow exactly the same path.

Additionally, tests are more valuable in branch-free sections of code, since every test must effectively make every possible decision correctly in order to have the correct result. Thus, passing tests on branch-free code are a more accurate and reliable indicator of the validity of the code when compared to testing branchy code, where it can never be gauranteed, simply by looking at the tests, whether or not all the code has even been evaluated once.
Easier to optimize
Easier to organize instruction layer
Expect better (more predictable) code from the compiler. Beyond the scheduling impact, the traces are much simpler and most variables will be write-once (const). Many of the largest failures in optimization are closer to the front-end where the compiler is re-organizing code. Simple traces with write-once variables don't need to be re-organized.
More predictable gains through inlining
Easier to move to SIMD
Effective use of SIMD instructions depends on keeping the pipeline full as long as possible. This generally requires branch elimination. Branch-free scalar code is significantly easier to move to Altivec on the PPU, or the SPU instruction set than branchy code.
Improved accuracy of remaining dynamic branch predictions
Dynamic branch prediction it has its limits, like everything else. Branch prediction is based on taking certain bits of the branch address and looking them up in a branch history table. The table has values that range from very-unlikely to very-likely and the which set of instructions that are fetched are a function of that value. Then, depending on the actual result, the value in the table is adjusted for the next time. Branch history tables are very small though and in all likelyhood there are many branches that will end up using the same entry in the table making it more and more inaccurate.
Harder to understand (?)
Predicated results, or branch-free code, may impact readability somewhat. Programmers are acustomed to reading and writing branchy code. However, much of this can made clearer through well named variables and predicate functions/macros.
About the authors
Mike Acton is a Senior Architect working on PS3/Cell research at Highmoon Studios (Vivendi-Universal Games). Most recently, Mike was the Lead Special Effects Programmer for Darkwatch at Highmoon Studios (previously Sammy Studios) in Carlsbad, CA. Previously Mike has held Lead Programming, Playstation 2 Engine Development and Research roles with Yuke's, VBlank and BlueSky/Titus. He worked for Sony Interactive Studios in PSX development.

Mike has made regular appearances as a speaker at SCEA develpment conferences and has spoken at GDC. Mike Acton is not a framework-happy C++ programmer. He actually likes C. And assembly. In his spare time he develops hardware on FPGAs in VHDL.


André de Leiradella started with computers in 1981 with a ZX-81 compatible with a built-in BASIC interpreter. Two years later he was already using assembly to code faster programs in this platform. Today André is a Support Analyst working at Xerox, and uses he's spare time to code pet projects that goes from using script languages to control a VNC server and fixed point expression compilers for J2ME, to a full Java bytecode interpreter.

André has a Computer Science degree and a M.Sc. degree in Artificial Intelligence and has a fairly good knowledge of many areas of Computer Science, managing to dig deep on any specific subject when needed.

May Also Interest You
Performance and Good Data Design
Some simple rules of thumb that programmers can follow to create a solid pipeline from the content creators to the screen and speakers.

Demystifying The Restrict Keyword (Mike Acton)
Optimizing data access is a critical part of good performance. Read on to find out how to use the restrict keyword to open up a whole class of optimizations that were previously impossible for a C compiler.

Avoiding Microcoded Instructions On The PPU (Mike Acton)
Executing instructions from microcode can wreck havok on inner loop performance. Find out which instructions are microcoded and how to avoid them.

luareader [geocities.com] (André de Leiradella)
A library that allows Lua files to be executed from a number of sources.

APF Project [geocities.com] (André de Leiradella)
This library is targetted at programs that need to find paths in small mazes. Diagonal movements are allowed.