CellPerformance
All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.
Suggestions? Comments? Questions?

Send email to Mike Acton
Articles
Cross-compiling for PS3 Linux
In this article, I detail the basic steps I used to get started building on a host PC and running on the PS3.

Unaligned scalar load and store on the SPU
An example of unaligned loads and stores on the SPU. The solution to this problem is to remember that the SPU does not have a scalar instruction set and cannot access local memory in anything except 16-byte quadwords.

atan2 on SPU
A branch-free implementation of atan2 for vector floats on the SPU.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is to serve as an example of developing some relatively complex operations completely without branches: a software implementation of half-precision floating-point numbers.

Better Performance Through Branch Elimination
An introduction to branch penalties: Why it's a good idea to avoid branchy code.

Box Overlap
A look at a function to test for overlap between 3D boxes, and how to optimize it for the CBE.

A 4x4 Matrix Inverse
A case study on converting scalar code into SIMD code for the PPU and SPU, using the matrix inverse as an example.

Avoiding Microcoded Instructions On The PPU
Executing instructions from microcode can wreak havoc on inner-loop performance. Find out which instructions are microcoded and how to avoid them.

Choosing to Avoid Branches: A Small Altivec Example
An example of why fewer instructions don't always equal faster code.

More Techniques for Eliminating Branches
Some additional examples for eliminating integer and floating-point branches.

Programming with Branches, Patterns and Tips
GCC follows some straightforward rules that are useful to know when programming with branches.

Benefits to Branch Elimination
The fundamental principle behind branch elimination is that expressing a value as a simple function of its inputs (a single basic block) is often more efficient than selecting a result through a change in control flow (branching).
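For instance (a minimal illustration of the idea, not taken from the article itself), a two-way select can be computed in a single basic block:

/* Branchy version: likely compiles to a compare and a jump. */
int select_branchy( int cond, int a, int b )
{
    if ( cond ) return a;
    return b;
}

/* Branch-free version: build an all-ones/all-zeros mask from the
   condition and blend the two inputs. A single basic block. */
int select_branch_free( int cond, int a, int b )
{
    const int mask = -( cond != 0 ); /* 0xffffffff if cond, else 0x0 */
    return ( a & mask ) | ( b & ~mask );
}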

Background on Branching
Background on how branches operate on the PPU and SPU.

Links
No Insider Info!
Although discussions on applying the Cell processor to game development are welcome here, do not ask for insider information related to Sony's Playstation 3.

The details of the hardware and development are covered by a non-disclosure agreement and under no conditions will confidential information be permitted on this site.

Playstation 3 developers are welcome to participate in the discussions, but be aware that this is a publicly accessible site and information not available to the general public may not be disclosed.

Keep it clean so that we can continue to build the community of Cell developers, both inside and outside video game development.

Thank you for your cooperation,
Mike.
Legal
Content Copyright © 2006 by Mike Acton. All Rights Reserved.

This site uses the Movable Type 3.2 content engine.

This site uses the phpBB bulletin board engine Copyright © 2001, 2005 phpBB Group.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.

PowerPC is a trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds in the U.S. and other countries.

Macintosh and Mac are registered trademarks of Apple Computer, Inc.

All other trademarks are the property of their respective owners.
Unaligned scalar load and store on the SPU
Mike Acton
September 15, 2006
Albert Noll, a student at UC Irvine, is working on an interesting project. According to him:
"I am currently working on a Java virtual machine runtime environment which hides the heterogeneity of the Cell architecture. Conventional Java code can be executed and benefit from the numerous execution units the Cell architecture offers. I am doing some Java benchmarks (Java Grande) to test the efficiency of the implementation, but I still have some problems achieving really good results."

One of the problems Albert has encountered recently is in loading and storing scalar doubles to the java stack. He recently posed a question on the Cell Broadband Engine Architecture forum [ibm.com]:
"I have the following problem: I want to load a double value from an array (represents stack of the application) which is of type unsigned int. The two 32-bit values, which represent the double value have been casted to a double before, so the bits are set according to the double representation of the value."

The solution to this problem is to remember that the SPU does not have a scalar instruction set and cannot access local memory in anything except 16-byte quadwords. The ability to compile scalar code on the SPU is something of a convenience, but it doesn't come without a penalty.
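To see that penalty concretely, here is a sketch (mine, not actual compiler output) of the kind of read-modify-write sequence the compiler has to generate for even a simple aligned scalar store:

#include <stdint.h>
#include <spu_intrinsics.h>

// Illustrative sketch only: storing a single aligned int still forces
// a read-modify-write of the entire 16-byte quadword containing it.
void scalar_store_int( int* const arg0, int arg1 )
{
    qword addr    = si_from_uint( (uintptr_t)arg0 );
    qword line    = si_lqd( addr, 0x00 );             // load the containing quadword
    qword value   = si_from_int( arg1 );              // value sits in the preferred slot
    qword pattern = si_cwd( addr, 0x00 );             // insertion controls for this word
    qword merged  = si_shufb( value, line, pattern ); // merge the value into the line
    si_stqd( merged, addr, 0x00 );                    // write the whole quadword back
}

Every scalar store costs a load, a control-word generation, a shuffle and a store; an unaligned double is worse still, as the functions below show.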

The first step, before considering performance, is to be able to load and store the unaligned double values correctly.

Instead of:
{
    // [ostack is unsigned int]
    double tmp[2];
    memcpy( tmp, ostack - 4, 16 );
    double res = tmp[0] * tmp[1];
    memcpy( ostack - 4, &res, 8 );
    ostack -= 2;
}
Simplify a bit:
{
    const double arg0   = unaligned_load_double( ostack - 4 );
    const double arg1   = unaligned_load_double( ostack - 2 );
    const double result = arg0 * arg1;
    unaligned_store_double( ostack - 4, result );
    ostack -= 2;
}
After Albert gets his basic scalar code working, the next two steps to optimizing this access would probably be:
  • Instead of loading and storing each individual element, keep a "cache" of a couple of quadwords from the stack that are being worked with and keep using them until the data must be flushed (stored) or a new line must be loaded (see the sketch after this list).
  • Start moving away from the scalar implementation completely. Both the SPU compiler and the SPU itself are most effective when working with vectors.
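As a rough sketch of the first suggestion (a hypothetical helper, not part of Albert's code), the cached line and its aligned base address travel together, and the line is only written back when an access leaves it:

#include <stdint.h>
#include <spu_intrinsics.h>

// Hypothetical single-line cache for the interpreter stack. Only one
// line is shown to keep the sketch short; a real version would keep at
// least two, since an unaligned double can straddle a quadword boundary.
typedef struct
{
    qword     line;  // cached copy of one 16-byte line of the stack
    uintptr_t base;  // aligned address the cached line came from
    int       dirty; // nonzero if the cached copy has been modified
} stack_cache;

// Make sure the cache holds the line containing addr, flushing the
// previous line first if it was modified.
static void cache_fill( stack_cache* const cache, const uintptr_t addr )
{
    const uintptr_t base = addr & ~(uintptr_t)0x0f;
    if ( base != cache->base )
    {
        if ( cache->dirty )
        {
            si_stqd( cache->line, si_from_uint( cache->base ), 0x00 );
        }
        cache->line  = si_lqd( si_from_uint( base ), 0x00 );
        cache->base  = base;
        cache->dirty = 0;
    }
}

Reads and writes then operate on cache->line with shuffles, and the quadword only touches local store when the access pattern actually moves to a new line.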

Have a look at the unaligned load and store functions below, and hopefully it will be clear why scalar values necessarily complicate things on the SPU and ultimately lead to poorer performance.
double unaligned_load_double( const char* const arg0 );
void unaligned_store_double( char* const arg0, double arg1 );

Or download the files directly:
unaligned_double.c
unaligned_double.h
vsl.c
vsl.h

If you have a question you think Mike Acton or one of the other members of CellPerformance can help you with, feel free to email or post in our forums.

unaligned_double.h
#ifndef UNALIGNED_DOUBLE_H
#define UNALIGNED_DOUBLE_H

double unaligned_load_double( const char* const arg0 );
void   unaligned_store_double( char* const arg0, double arg1 );

#endif

unaligned_double.c
#include "unaligned_double.h"
#include "vsl.h"
#include <stdint.h>
#include <spu_intrinsics.h>

// Return the double stored at the unaligned address at (arg0)
double
unaligned_load_double( const char* const arg0 )
{
    uintptr_t source_addr_u = (uintptr_t)arg0;
    qword     source_addr   = si_from_uint( source_addr_u );

    // Unaligned load requires two reads since the double may be
    // stored across two aligned quadwords. These loads will read
    // starting at the quadword aligned address of source_addr
    // i.e. will ignore the low 4 bits of source_addr

    qword source_lo = si_lqd( source_addr, 0x00 );
    qword source_hi = si_lqd( source_addr, 0x10 );

    // Get low 4 bits of source address. This is the offset from the
    // aligned address.

    qword shift_offset = si_andi( source_addr, 0x0f );

    // Create a shuffle pattern which selects the appropriate
    // bytes of the double from the two quadwords.
    //
    // The value of shift offset is stored in byte[3] of shift_offset,
    // 1. Generate the pattern 0x03030303_03030303_03030303_03030303
    // 2. Use shuffle to generate a qword with each byte filled with
    //    the shift offset.

    qword lo_byte_pattern      = si_ilh( 0x0303 );
    qword shift_offset_pattern = si_shufb( shift_offset, shift_offset, lo_byte_pattern );

    // Add shift_offset_pattern to a vector for shift left (which is
    // just a byte vector with bytes incrementing from 0-15). This will
    // generate a shuffle pattern which will be used to store the
    // unaligned double in the preferred slot of the result.

    qword vector_shift_left = (qword)_vsl;
    qword shift_pattern     = si_a( shift_offset_pattern, vector_shift_left );

    // Move double into preferred slot and extract.

    qword  result_preferred = si_shufb( source_lo, source_hi, shift_pattern );
    double result           = si_to_double( result_preferred );

    return (result);
}

// Write the double (arg1) at the unaligned address at (arg0)
void
unaligned_store_double( char* const arg0, double arg1 )
{
    uintptr_t source_addr_u = (uintptr_t)arg0;
    qword     source_addr   = si_from_uint( source_addr_u );
    qword     value         = si_from_double( arg1 );

    // Unaligned store requires two reads since the double may be
    // stored across two aligned quadwords. The two source lines
    // will be loaded, modified then stored.

    qword source_lo = si_lqd( source_addr, 0x00 );
    qword source_hi = si_lqd( source_addr, 0x10 );

    // Get low 4 bits of source address. This is the offset from the
    // aligned address.

    qword shift_offset = si_andi( source_addr, 0x0f );

    // Create a shuffle pattern which selects the appropriate
    // bytes of the double from the low quadword.
    //
    // The value of shift offset is stored in byte[3] of shift_offset,
    // 1. Generate the pattern 0x03030303_03030303_03030303_03030303
    // 2. Use shuffle to generate a qword with each byte filled with
    //    the shift offset.

    qword lo_byte_pattern         = si_ilh( 0x0303 );
    qword shift_offset_pattern_lo = si_shufb( shift_offset, shift_offset, lo_byte_pattern );

    // Create a shuffle pattern which selects the appropriate
    // bytes of the double from the high quadword
    // high offset = (16 bytes - low offset)

    qword shift_adjust_hi         = si_ilh( 0x1010 );
    qword shift_offset_pattern_hi = si_sf( shift_adjust_hi, shift_offset_pattern_lo );

    // Subtract shift offset pattern from a vector for shift left (which is
    // just a byte vector with bytes incrementing from 0-15). This will
    // generate a shuffle pattern which will be used to store the unaligned
    // double at the unaligned locations used in the first source line.

    qword vector_shift_left = (qword)_vsl;
    qword shift_pattern_lo  = si_sf( shift_offset_pattern_lo, vector_shift_left );
    qword shift_pattern_hi  = si_sf( shift_offset_pattern_hi, vector_shift_left );

    // Mask the bits that will be unmodified in the source lines.
    // Any shuffle pattern outside the range [0x00,0x07] is not being used
    // by the value, so that will be kept in the source lines.

    qword source_bits_mask_lo = si_clgtbi( shift_pattern_lo, 0x07 );
    qword source_bits_mask_hi = si_clgtbi( shift_pattern_hi, 0x07 );
    qword value_lo            = si_shufb( value, value, shift_pattern_lo );
    qword value_hi            = si_shufb( value, value, shift_pattern_hi );

    // Clear space in source lines to store the value

    qword prepped_source_lo = si_and( source_lo, source_bits_mask_lo );
    qword prepped_source_hi = si_and( source_hi, source_bits_mask_hi );

    // Clear everything unwanted from the value

    qword prepped_value_lo = si_andc( value_lo, source_bits_mask_lo );
    qword prepped_value_hi = si_andc( value_hi, source_bits_mask_hi );

    // Combine the source lines and the value lines

    qword result_lo = si_or( prepped_source_lo, prepped_value_lo );
    qword result_hi = si_or( prepped_source_hi, prepped_value_hi );

    // Store the result

    si_stqd( result_lo, source_addr, 0x00 );
    si_stqd( result_hi, source_addr, 0x10 );
}

vsl.h
#ifndef VSL_H
#define VSL_H

extern const vector unsigned char _vsl;

#endif

vsl.c
// Vector for shift left
const vector unsigned char _vsl = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
                                   ,0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };
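For anyone who wants to try the functions out, a minimal test harness along these lines should work (my sketch, assuming an SPU toolchain such as spu-gcc; it is not part of the original listings):

#include <stdio.h>
#include "unaligned_double.h"

int main( void )
{
    // 32 bytes of aligned backing store. Offset 12 makes the double
    // straddle two quadwords, which exercises both source lines.
    char buffer[32] __attribute__((aligned(16))) = { 0 };

    unaligned_store_double( buffer + 12, 3.25 );
    double readback = unaligned_load_double( buffer + 12 );

    printf( "readback = %f\n", readback ); // expect 3.25
    return 0;
}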