CellPerformance
All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.
Suggestions? Comments? Questions?

Send email to Mike Acton
Articles
Cross-compiling for PS3 Linux
n this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

Unaligned scalar load and store on the SPU
An example of unaligned loads and stores on the SPU. The solution to this problem is to remember that the SPU does not have a scalar instruction set or access local memory in anything except 16 bytes quadwords.

atan2 on SPU
A branch-free implementation of atan2 vector floats for the SPU.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.

Better Performance Through Branch Elimination
An introduction to branch penalties: Why it's a good idea to avoid branchy code.

Box Overlap
A look at a function to test for overlap between 3D boxes, and how to optimize it for the CBE.

A 4x4 Matrix Inverse
Study case about how to convert scalar code indo SIMD code for PPU and SPU using the matrix inverse as example.

Avoiding Microcoded Instructions On The PPU
Executing instructions from microcode can wreck havok on inner loop performance. Find out which instructions are microcoded and how to avoid them.

Choosing to Avoid Branches: A Small Altivec Example
An example of why less instructions doesn't always equal faster code.

More Techniques for Eliminating Branches
Some additional examples for eliminating integer and floating-point branches.

Programming with Branches, Patterns and Tips
GCC follows some straightforward rules that are useful to know when programming with branches.

Benefits to Branch Elimination
The fundamental principal behind branch elimination is that expressing a value as a simple function of its inputs (a single basic block) is often more efficient than selecting a result through a change in control flow (branching).

Background on Branching
A background in understanding how branches operate on the PPU and SPU.

Links
No Insider Info!
Although discussions on applying the Cell processor to game development are welcome here, do not ask for insider information related to Sony's Playstation 3.

The details of the hardware and development are covered by a non-disclosure agreement and under no conditions will confidential information be permitted on this site.

Playstation 3 developers are welcome to participate in the discussions but be aware that this is a publicly accessable site and information not available to the general public may not be disclosed.

Keep it clean so that we can continue to build on the community of Cell developers both inside and outside video game development.

Thank you for your cooperation,
Mike.
Legal
Content Copyright © 2006 by Mike Acton. All Rights Reserved.

This site uses the Movable Type 3.2 content engine.

This site uses the phpBB bulletin board engine Copyright © 2001, 2005 phpBB Group.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc

PowerPC is a trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds in the U.S. and other countries.

Macintosh, and Mac are registered trademarks of Apple Computer, Inc

All other trademarks are the property of their respective owners.
Increment And Decrement Wrapping Values
Mike Acton
July 11, 2006
Small code, big impact
Occasionally you have a set of values that you want to wrap around as you increment and decrement them. For example, in a GUI where the user keys right or left and you want to wrap around the menu.

A typical implementation:
static inline int wrap_inc( int value, int min, int max ) { return ( value == max ) ? min : value + 1; } static inline int wrap_dec( int value, int min, int max ) { return ( value == min ) ? max : value - 1; }

But on processors (such as the PowerPC) where compare and branch is very costly these small one-liners can have a significant impact on performance when used in critical code. They also make optimization more difficult for the compiler for the surrounding code. For more information read: Better Performance Through Branch Elimination

Breakdown
Store the desired result:
const type result_inc = val + 1;
This value may overflow if val == INT(SIZE)_MAX, but in that case the correct value will still be selected.

Get the different between the max (or min) value and the current value:
const type max_diff = max - val;
It's only important if this value is zero or not zero. If it's zero, we know we are at the max (or min) value; otherwise we can increment (or decrement).

Create a mask based on the difference:
const type max_diff_nz = (type)( (stype)( max_diff | -max_diff ) >> bit_mask );
i.e.
max_diff_nz = ( max_diff != 0 ) ? (type)-1 : (type)0;
(Remember that -1 is all bits on in two's complement.)

Complement the mask:
const type max_diff_eqz = ~max_diff_nz;

Select the correct result based on the masks:
const type result = ( result_inc & max_diff_nz ) | ( min & max_diff_eqz );
Only one of the two values can possibly be selected.
i.e.
result = ( val == max ) ? min : val + 1;


Final Code
// // wrap_int.h // #ifndef WRAP_INT_H #define WRAP_INT_H // // Increment wrapping value // // val = { ( val == max ), min // = { otherwise, val + 1 // // uint8_t wrap_inc_u8 ( const uint8_t val, const uint8_t min, const uint8_t max ); // uint16_t wrap_inc_u16( const uint16_t val, const uint16_t min, const uint16_t max ); // uint32_t wrap_inc_u32( const uint32_t val, const uint32_t min, const uint32_t max ); // uint64_t wrap_inc_u64( const uint64_t val, const uint64_t min, const uint64_t max ); // int8_t wrap_inc_s8 ( const int8_t val, const int8_t min, const int8_t max ); // int16_t wrap_inc_s16( const int16_t val, const int16_t min, const int16_t max ); // int32_t wrap_inc_s32( const int32_t val, const int32_t min, const int32_t max ); // int64_t wrap_inc_s64( const int64_t val, const int64_t min, const int64_t max ); #define DECL_WRAP_INC( type_name, type, stype, bit_mask ) \ static inline type wrap_inc_##type_name( const type val, const type min, const type max ) \ { \ const type result_inc = val + 1; \ const type max_diff = max - val; \ const type max_diff_nz = (type)( (stype)( max_diff | -max_diff ) >> bit_mask ); \ const type max_diff_eqz = ~max_diff_nz; \ const type result = ( result_inc & max_diff_nz ) | ( min & max_diff_eqz ); \ \ return (result); \ } DECL_WRAP_INC( u8, uint8_t, int8_t, 7 ); DECL_WRAP_INC( u16, uint16_t, int16_t, 15 ); DECL_WRAP_INC( u32, uint32_t, int32_t, 31 ); DECL_WRAP_INC( u64, uint64_t, int64_t, 63 ); DECL_WRAP_INC( s8, int8_t, int8_t, 7 ); DECL_WRAP_INC( s16, int16_t, int16_t, 15 ); DECL_WRAP_INC( s32, int32_t, int32_t, 31 ); DECL_WRAP_INC( s64, int64_t, int64_t, 63 ); // // Decrementing wrapping value // // val = { ( val == min ), max // = { otherwise, val - 1 // // uint8_t wrap_dec_u8 ( const uint8_t val, const uint8_t min, const uint8_t max ); // uint16_t wrap_dec_u16( const uint16_t val, const uint16_t min, const uint16_t max ); // uint32_t wrap_dec_u32( const uint32_t val, const uint32_t min, const uint32_t max ); // uint64_t wrap_dec_u64( const uint64_t val, const uint64_t min, const uint64_t max ); // int8_t wrap_dec_s8 ( const int8_t val, const int8_t min, const int8_t max ); // int16_t wrap_dec_s16( const int16_t val, const int16_t min, const int16_t max ); // int32_t wrap_dec_s32( const int32_t val, const int32_t min, const int32_t max ); // int64_t wrap_dec_s64( const int64_t val, const int64_t min, const int64_t max ); #define DECL_WRAP_DEC( type_name, type, stype, bit_mask ) \ static inline type wrap_dec_##type_name( const type val, const type min, const type max ) \ { \ const type result_dec = val - 1; \ const type min_diff = min - val; \ const type min_diff_nz = (type)( (stype)( min_diff | -min_diff ) >> bit_mask ); \ const type min_diff_eqz = ~min_diff_nz; \ const type result = ( result_dec & min_diff_nz ) | ( max & min_diff_eqz ); \ \ return (result); \ } DECL_WRAP_DEC( u8, uint8_t, int8_t, 7 ); DECL_WRAP_DEC( u16, uint16_t, int16_t, 15 ); DECL_WRAP_DEC( u32, uint32_t, int32_t, 31 ); DECL_WRAP_DEC( u64, uint64_t, int64_t, 63 ); DECL_WRAP_DEC( s8, int8_t, int8_t, 7 ); DECL_WRAP_DEC( s16, int16_t, int16_t, 15 ); DECL_WRAP_DEC( s32, int32_t, int32_t, 31 ); DECL_WRAP_DEC( s64, int64_t, int64_t, 63 ); #endif /* #ifndef WRAP_INT_H */