CellPerformance
All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.
Legal
Content Copyright © 2006 by Mike Acton. All Rights Reserved.

VECTOR UNSIGNED SHORT
Mike Acton
April 26, 2006
Format
Eight 16-bit unsigned integer values packed into a single 128-bit vector, stored in big-endian format.
Elements are referred to by index, from low address to high address.
| Byte offset   | 0x00 | 0x02 | 0x04 | 0x06 | 0x08 | 0x0a | 0x0c | 0x0e |
| Element index | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    |
Initialization (PPU and SPU)
The vector can be initialized in a manner similar to initializing an array.
vector unsigned short a = { 0x1122, 0x2233, 0x4455, 0x6677, 0x8899, 0xaabb, 0xccdd, 0xeeff };
The vector can also be initialized using a vector cast.
vector unsigned short a = (vector unsigned short)( 0x1122, 0x2233, 0x4455, 0x6677, 0x8899, 0xaabb, 0xccdd, 0xeeff );

Extracting a scalar component (PPU)
On the PPU, there are two valid methods for moving the elements of the vector into a scalar register (in C99).
  • Cast through a (char*):
    unsigned short vec_ushort_extract( vector unsigned short v, int index )
    {
        return ((unsigned short*)(char*)&v)[ index ];
    }
    Note that accessing the vector through a direct cast from (vector unsigned short*) to (unsigned short*) is not permitted under the C strict-aliasing rules. Those rules make an exception for character types: any object may be accessed through a (char*).

  • Cast through a union. This method is explicitly permitted under the C strict-aliasing rules.
    #include <altivec.h>
    #include <stdint.h>
    #include <stdio.h>

    /* View the same 16 bytes either as a vector or as eight scalar elements. */
    typedef union VEC_USHORT VEC_USHORT;
    union VEC_USHORT
    {
        vector unsigned short v;
        unsigned short        e[8];
    };

    int main( void )
    {
        vector unsigned short a = { 0x1122, 0x2233, 0x4455, 0x6677, 0x8899, 0xaabb, 0xccdd, 0xeeff };
        VEC_USHORT A = { .v = a };  /* store through .v, read back through .e */

        printf("a = 0x%04x, 0x%04x, 0x%04x, 0x%04x, 0x%04x, 0x%04x, 0x%04x, 0x%04x\n"
               ,A.e[0], A.e[1], A.e[2], A.e[3], A.e[4], A.e[5], A.e[6], A.e[7] );
        return (0);
    }
Be aware that both methods carry an access penalty: the vector must be written to the stack and then loaded into a fixed-point register. Execution may stall until that load is complete.

Extracting a scalar component (SPU)
On the SPU, the elements of the vector can be moved into a scalar variable by using the spu_extract() function.
Note that there are no scalar registers on the SPU. This process effectively allows the compiler to move the selected vector component to a component of another vector register of its own choosing.
#include <spu_intrinsics.h>
#include <stdint.h>
#include <stdio.h>

int main( void )
{
    vector unsigned short a = (vector unsigned short)( 0x1122, 0x2233, 0x4455, 0x6677, 0x8899, 0xaabb, 0xccdd, 0xeeff );

    unsigned short e0 = spu_extract( a, 0 );
    unsigned short e1 = spu_extract( a, 1 );
    unsigned short e2 = spu_extract( a, 2 );
    unsigned short e3 = spu_extract( a, 3 );
    unsigned short e4 = spu_extract( a, 4 );
    unsigned short e5 = spu_extract( a, 5 );
    unsigned short e6 = spu_extract( a, 6 );
    unsigned short e7 = spu_extract( a, 7 );

    printf("a = 0x%04x, 0x%04x, 0x%04x, 0x%04x\n", e0, e1, e2, e3 );
    printf("  = 0x%04x, 0x%04x, 0x%04x, 0x%04x\n", e4, e5, e6, e7 );
    return (0);
}