Box Overlap

Interactive 3D applications frequently need to check whether one geometric object overlaps another. In this article, we'll look at a function that tests for overlap between 3D boxes, and we'll show how to optimize this function for the Cell Broadband Engine (CBE).
Before Optimization
Let's start with the example below, which is similar to solid-2.5.4/src/complex/DT_CBox.h in the SOLID library from Dtecta, but doesn't need to include a bunch of other stuff.
#include <math.h>
#include <stdint.h>

struct Vector3
{
    float m_co[3];

    Vector3() {}

    Vector3(const float& x, const float& y, const float& z)
    {
        m_co[0] = x;
        m_co[1] = y;
        m_co[2] = z;
    }

    float&       operator[]( int i )       { return m_co[i]; }
    const float& operator[]( int i ) const { return m_co[i]; }

    Vector3& operator-=(const Vector3& v)
    {
        this->m_co[0] -= v.m_co[0];
        this->m_co[1] -= v.m_co[1];
        this->m_co[2] -= v.m_co[2];

        return (*this);
    }

    Vector3& operator+=(const Vector3& v)
    {
        this->m_co[0] += v.m_co[0];
        this->m_co[1] += v.m_co[1];
        this->m_co[2] += v.m_co[2];

        return (*this);
    }
};

struct Box
{
    Vector3 m_center;
    Vector3 m_extent;

    Box() {}

    Box(const Vector3& center, const Vector3& extent)
        : m_center(center),
          m_extent(extent)
    {}

    bool overlaps(const Box& b) const
    {
        return ::fabs(m_center[0] - b.m_center[0]) <= m_extent[0] + b.m_extent[0] &&
               ::fabs(m_center[1] - b.m_center[1]) <= m_extent[1] + b.m_extent[1] &&
               ::fabs(m_center[2] - b.m_center[2]) <= m_extent[2] + b.m_extent[2];
    }
};

bool
test_overlap( const Box& a, const Box& b )
{
    return a.overlaps( b );
}
We'll be looking at the compiler output for the test_overlap() function. I made the data public, since it makes the optimization pass much simpler. I'm not going to debate what makes "better" C++ code. We're just talking about what's faster here.

Nothing is obviously wrong (i.e. slow) with this code. Everything is inline, so we shouldn't expect any unneeded jumps for such small functions. There's only one reference to each element, so we shouldn't expect any extra loads.

But if we look at the compiler output, we see that almost every operation stalls waiting for the operands to load.
_Z12test_overlapRK3BoxS1_:
stwu 1,-16(1)
lfs 3,0(3) -- LOAD(3)
li 0,0
lfs 11,0(4) -- LOAD(11)
fsubs 1,3,11 -- WAIT FOR LOAD(3), LOAD(11)
lfs 2,16(3) -- LOAD(2)
lfs 0,16(4) -- LOAD(0)
fadds 12,2,0 -- WAIT FOR LOAD(2), LOAD(0)
fabs 13,1
fcmpu 7,13,12
bgt- 7,.L2
lfs 9,4(3) -- LOAD(9)
lfs 10,4(4) -- LOAD(10)
fsubs 6,9,10 -- WAIT FOR LOAD(9), LOAD(10)
lfs 7,20(3) -- LOAD(7)
lfs 8,20(4) -- LOAD(8)
fadds 4,7,8 -- WAIT FOR LOAD(7), LOAD(8)
fabs 5,6
fcmpu 0,5,4
bgt- 0,.L2
lfs 0,8(3) -- LOAD(0)
lfs 11,8(4) -- LOAD(11)
fsubs 1,0,11 -- WAIT FOR LOAD(0), LOAD(11)
lfs 2,24(3) -- LOAD(2)
lfs 3,24(4) -- LOAD(3)
fadds 13,2,3 -- WAIT FOR LOAD(2), LOAD(3)
fabs 12,1
fcmpu 1,12,13
bgt- 1,.L2
li 0,1
.L2:
mr 3,0
addi 1,1,16
blr
The compiler has built the dependency graph around the branches. You might think we benefit by branching out immediately when we find a case that fails, since we skip the subsequent loads. But this turns out to be a bad idea for the following reasons:

1. Operands that are adjacent in memory probably lie on the same cache line. The PPE is dual-threaded, so the longer the delay between loads of adjacent operands, the greater the chance of the other thread (or an interrupt) flushing the cache line.
2. The compiler has used "bgt-", meaning the branches are statically predicted unlikely. It doesn't make much sense for the compiler to hide loads behind unlikely branches.
Separating Loads From Calculations
What we want to do is queue up the loads as deep as we can before we start doing any calculations.
Don't use class or struct fields (or array elements) directly in calculations. Always follow this pattern:
1. Load everything you need into local variables of native types.
2. Do all your calculations.
3. Store your final result, while trying to avoid branches.
Here's the second version of the overlaps() method:
    bool overlaps(const Box& b) const
    {
        //
        // LOADS
        //

        const float a_c0 = m_center[0];
        const float a_c1 = m_center[1];
        const float a_c2 = m_center[2];
        const float a_e0 = m_extent[0];
        const float a_e1 = m_extent[1];
        const float a_e2 = m_extent[2];
        const float b_c0 = b.m_center[0];
        const float b_c1 = b.m_center[1];
        const float b_c2 = b.m_center[2];
        const float b_e0 = b.m_extent[0];
        const float b_e1 = b.m_extent[1];
        const float b_e2 = b.m_extent[2];

        //
        // CALCULATIONS
        //

        const float delta_c0     = a_c0 - b_c0;
        const float delta_c1     = a_c1 - b_c1;
        const float delta_c2     = a_c2 - b_c2;
        const float abs_delta_c0 = ::fabs( delta_c0 );
        const float abs_delta_c1 = ::fabs( delta_c1 );
        const float abs_delta_c2 = ::fabs( delta_c2 );
        const float sum_e0       = a_e0 + b_e0;
        const float sum_e1       = a_e1 + b_e1;
        const float sum_e2       = a_e2 + b_e2;

        //
        // COMPARES AND BRANCHES
        //

        const bool in_0   = abs_delta_c0 <= sum_e0;
        const bool in_1   = abs_delta_c1 <= sum_e1;
        const bool in_2   = abs_delta_c2 <= sum_e2;
        const bool result = in_0 && in_1 && in_2;

        return (result);
    }
The results are not much better at this point. The compiler reorders things, doing each subtraction as soon as the needed operands are loaded.
_Z12test_overlapRK3BoxS1_:
stwu 1,-16(1)
lfs 9,4(3)
li 9,0
lfs 0,4(4)
fsubs 2,9,0
lfs 8,0(3)
lfs 1,0(4)
fsubs 5,8,1
lfs 10,8(3)
lfs 12,8(4)
fsubs 1,10,12
lfs 3,16(3)
lfs 13,16(4)
fadds 0,3,13
fabs 9,2
lfs 11,12(4)
lfs 6,12(3)
fadds 12,6,11
lfs 4,20(3)
lfs 7,20(4)
fabs 8,5
fadds 11,4,7
fabs 10,1
fcmpu 7,9,0
crnot 30,29
mfcr 0
rlwinm 0,0,31,1
fcmpu 1,8,12
fcmpu 6,10,11
crnot 26,25
cmpwi 7,0,0
mfcr 0
rlwinm 0,0,27,1
bgt- 1,.L14
cmpwi 6,0,0
beq- 7,.L14
beq- 6,.L14
li 9,1
.L14:
mr 3,9
addi 1,1,16
blr
We'll use a trick to prevent the compiler from mixing loads and calculations. First we define this macro:
#define GCC_SPLIT_BLOCK __asm__ ("");
An empty inline assembly statement doesn't add any code, but it splits the basic block, forcing the compiler to schedule the code on either side separately. We'll add this macro after the loads but before the calculations. We'll also add it just after the calculations so it's easier to see what's happening, but this second split isn't really important for optimization.

Here's the third version of the overlaps() method:
    bool overlaps(const Box& b) const
    {
        //
        // LOADS
        //

        const float a_c0 = m_center[0];
        const float a_c1 = m_center[1];
        const float a_c2 = m_center[2];
        const float a_e0 = m_extent[0];
        const float a_e1 = m_extent[1];
        const float a_e2 = m_extent[2];
        const float b_c0 = b.m_center[0];
        const float b_c1 = b.m_center[1];
        const float b_c2 = b.m_center[2];
        const float b_e0 = b.m_extent[0];
        const float b_e1 = b.m_extent[1];
        const float b_e2 = b.m_extent[2];

        GCC_SPLIT_BLOCK

        //
        // CALCULATIONS
        //

        const float delta_c0     = a_c0 - b_c0;
        const float delta_c1     = a_c1 - b_c1;
        const float delta_c2     = a_c2 - b_c2;
        const float abs_delta_c0 = ::fabs( delta_c0 );
        const float abs_delta_c1 = ::fabs( delta_c1 );
        const float abs_delta_c2 = ::fabs( delta_c2 );
        const float sum_e0       = a_e0 + b_e0;
        const float sum_e1       = a_e1 + b_e1;
        const float sum_e2       = a_e2 + b_e2;

        GCC_SPLIT_BLOCK

        //
        // COMPARES AND BRANCHES
        //

        const bool in_0   = abs_delta_c0 <= sum_e0;
        const bool in_1   = abs_delta_c1 <= sum_e1;
        const bool in_2   = abs_delta_c2 <= sum_e2;
        const bool result = in_0 && in_1 && in_2;

        return (result);
    }
The new output clearly shows that the code was scheduled on either side of the splits.
_Z12test_overlapRK3BoxS1_:
//
// PUSH STACK
//

stwu 1,-16(1)

//
// LOADS
//

lfs 4,20(3)
lfs 3,20(4)
lfs 1,0(3)
lfs 13,4(3)
lfs 12,8(3)
lfs 11,12(3)
lfs 10,16(3)
lfs 9,0(4)
lfs 8,4(4)
lfs 7,8(4)
lfs 6,12(4)
lfs 5,16(4)

//
// CALCULATIONS
//

fsubs 0,1,9
fsubs 2,13,8
fsubs 1,12,7
fadds 11,11,6
fadds 10,10,5
fadds 4,4,3
fabs 0,0
fabs 13,2
fabs 12,1

//
// COMPARES AND BRANCHES
//

fcmpu 7,13,10
li 3,0
crnot 30,29
fcmpu 1,0,11
mfcr 0
rlwinm 0,0,31,1
fcmpu 6,12,4
crnot 26,25
cmpwi 7,0,0
mfcr 0
rlwinm 0,0,27,1
bgt- 1,.L14
cmpwi 6,0,0
beq- 7,.L14
beq- 6,.L14
li 3,1

//
// POP STACK AND RETURN
//
.L14:
addi 1,1,16
blr
We still have 3 branches and a lot of compares. Let's see what we can do about that.
Removing Branches
On the CBE (as with most pipelined architectures), it's good to reduce or eliminate branches where possible. In this case, we can use the fsel instruction to replace a compare and branch. This is an optional PowerPC instruction, but the PPU implements it. Unfortunately, the compiler doesn't generate fsel for the PPU, so we'll have to emit it ourselves with inline assembly:
static inline float ppc_fsels( const float fra, const float frc, const float frb )
{
    float frt;

    // From: http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixassem/alangref/fsel.htm
    // The double-precision floating-point operand in floating-point register (FPR) FRA
    // is compared with the value zero. If the value in FRA is greater than or equal to
    // zero, floating-point register FRT is set to the contents of floating-point
    // register FRC. If the value in FRA is less than zero or is a NaN, floating-point
    // register FRT is set to the contents of floating-point register FRB. The comparison
    // ignores the sign of zero; both +0 and -0 are equal to zero.
    //
    // i.e. frt = ( fra >= 0.0 ) ? frc : frb;
    //
    __asm__( "fsel %0, %1, %2, %3" : "=f"(frt) : "f"(fra), "f"(frc), "f"(frb) );

    return (frt);
}
Now let's focus on the compares and branches portion of the method:
      const bool  in_0     = abs_delta_c0 <= sum_e0;
      const bool  in_1     = abs_delta_c1 <= sum_e1;
      const bool  in_2     = abs_delta_c2 <= sum_e2;
      const bool  result   = in_0 && in_1 && in_2;
This code can be rewritten as follows:
      const float  overlap_0 = sum_e0 - abs_delta_c0;
      const float  overlap_1 = sum_e1 - abs_delta_c1;
      const float  overlap_2 = sum_e2 - abs_delta_c2;
      const double temp_01   = ( overlap_1 >= 0.0f ) ? overlap_0 : overlap_1;
      const double temp_012  = ( overlap_2 >= 0.0f ) ? temp_01   : overlap_2;
      const bool   result    = temp_012 >= 0.0f;
The calculations of temp_01 and temp_012 can be expressed using fsel:
      const float  overlap_0 = sum_e0 - abs_delta_c0;
      const float  overlap_1 = sum_e1 - abs_delta_c1;
      const float  overlap_2 = sum_e2 - abs_delta_c2;
      const double temp_01   = ppc_fsels( overlap_1, overlap_0, overlap_1 );
      const double temp_012  = ppc_fsels( overlap_2, temp_01, overlap_2 );
      const bool   result    = temp_012 >= 0.0f;
Now take a look at the constant value 0.0f in the last statement above. Keep in mind the PowerPC has no instruction to move an immediate value into a floating point register, so each constant appearing in an expression means an additional load from memory. That's why it's a good idea to restructure expressions if possible to reduce or eliminate constants.

Here we can't easily avoid comparing with zero, but we can get rid of the load by doing something unconventional. We can replace 0.0f with a parameter named zero, which in this case will be passed in via FPR1. Then it's up to the caller to find an optimal way to provide the value 0.0f. For example, if the calling function has plenty of register variables available, 0.0f can be loaded into one of them near the top. Alternatively, the constant can be placed somewhere in memory where it will be on the same cache line as some other data that's needed anyway.

You might think we could construct the constant 0.0f cheaply by subtracting any float value (e.g., a_c0) from itself. But this doesn't work if the value is NaN, because you end up with NaN instead of 0.0f.
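Here's a sketch of what such a caller might look like, using the three-argument test_overlap introduced below (any_overlap is purely illustrative):
static bool any_overlap( const Box* boxes, int count, const Box& query )
{
    const float zero = 0.0f;   // one constant load, hoisted out of the loop

    bool hit = false;

    for ( int i = 0; i < count; ++i )
    {
        // Bitwise | instead of || avoids a branch per test.
        hit = hit | test_overlap( boxes[i], query, zero );
    }

    return (hit);
}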

Anyway, let's also change the GCC_SPLIT_BLOCK macro so that we can inject comments into the asm output (to make it easier to track down our changes).
#define GCC_SPLIT_BLOCK(str)  __asm__( "//\n\t// " str "\n\t//\n" );
Here's the fourth version of the overlaps() method:
    bool overlaps(const Box& b, float zero) const
    {
        GCC_SPLIT_BLOCK("LOADS")

        const float a_c0 = m_center[0];
        const float a_c1 = m_center[1];
        const float a_c2 = m_center[2];
        const float a_e0 = m_extent[0];
        const float a_e1 = m_extent[1];
        const float a_e2 = m_extent[2];
        const float b_c0 = b.m_center[0];
        const float b_c1 = b.m_center[1];
        const float b_c2 = b.m_center[2];
        const float b_e0 = b.m_extent[0];
        const float b_e1 = b.m_extent[1];
        const float b_e2 = b.m_extent[2];

        GCC_SPLIT_BLOCK("CALCULATIONS")

        const float delta_c0     = a_c0 - b_c0;
        const float delta_c1     = a_c1 - b_c1;
        const float delta_c2     = a_c2 - b_c2;
        const float abs_delta_c0 = ::fabs( delta_c0 );
        const float abs_delta_c1 = ::fabs( delta_c1 );
        const float abs_delta_c2 = ::fabs( delta_c2 );
        const float sum_e0       = a_e0 + b_e0;
        const float sum_e1       = a_e1 + b_e1;
        const float sum_e2       = a_e2 + b_e2;
        const float overlap_0    = sum_e0 - abs_delta_c0;
        const float overlap_1    = sum_e1 - abs_delta_c1;
        const float overlap_2    = sum_e2 - abs_delta_c2;

        GCC_SPLIT_BLOCK("SELECT RESULT")

        const double temp_01  = ppc_fsels( overlap_1, overlap_0, overlap_1 );
        const double temp_012 = ppc_fsels( overlap_2, temp_01, overlap_2 );
        const bool   result   = temp_012 >= zero;

        return (result);
    }
We'll also change the test_overlap function to add zero as a parameter:
bool
test_overlap( const Box& a, const Box& b, float zero )
{
    return a.overlaps( b, zero );
}
The output shows that we have reduced the cost for the comparisons significantly:
_Z12test_overlapRK3BoxS1_f:
stwu 1,-16(1)

//
// LOADS
//

lfs 0,20(3)
lfs 3,20(4)
lfs 2,0(3)
lfs 10,4(3)
lfs 9,8(3)
lfs 12,12(3)
lfs 13,16(3)
lfs 8,0(4)
lfs 7,4(4)
lfs 6,8(4)
lfs 5,12(4)
lfs 4,16(4)

//
// CALCULATIONS
//

fsubs 11,2,8
fsubs 2,10,7
fsubs 8,9,6
fadds 7,12,5
fadds 6,13,4
fadds 5,0,3
fabs 11,11
fabs 10,2
fabs 9,8
fsubs 12,7,11
fsubs 4,6,10
fsubs 3,5,9

//
// SELECT RESULT
//

fsel 13, 4, 12, 4
addi 1,1,16
fsel 2, 3, 13, 3
fmr 0,2
fcmpu 7,1,0
cror 30,28,30
mfcr 3
rlwinm 3,3,31,1
blr
Right now our main problem is that we have 12 loads and we're not doing enough work to make up for that. Next we'll look at how to reduce loads.
Moving to VMX/Altivec
Always look for ways to reduce loads and stores. It's one of the most effective techniques for improving performance.
We're going to use the VXU (Altivec unit), which operates on 128-bit (16-byte) operands. A typical operand is a vector of 4 float values, of which we'll use 3. The compiler recognizes a set of vector data types and vector intrinsics.

Here are some of the main advantages to using Altivec:

More available registers - The VXU has its own set of 32 registers. General purpose code will eat up most of your fixed-point registers, making it more likely you'll need to keep dumping data on the stack.

Mixed integer and floating point - Mixing integer and floating-point code, or converting between the two, is very expensive with scalar operations, because the only way to move data between the FXU (fixed-point execution unit) and the FPU is through memory (typically the stack). This often creates a load-hit-store hazard, which stalls the processor until the stored value can be reloaded. On the VXU you can freely use vector integer instructions on vector floating-point values without penalty. There are also conversion instructions for your convenience.

Much higher throughput - This is really the whole point of a SIMD instruction set. One instruction operates on 128-bit-wide registers, so each instruction gets much more work done, and the individual instructions are also very fast.

Saturated arithmetic instructions - Saturated instructions are operations that cannot overflow or underflow. Any calculated value greater than the maximum value for the type of the vector component (8, 16 or 32 bits) is clamped to the maximum, and conversely for the minimum. This is extremely handy for any kind of fixed-point math (see the sketch just after this list).

Bit manipulation on all types (permute, shift, rotate) - There is a large set of instructions for bit manipulation which you can apply to all the vector types. The permute instruction is a special instruction that lets you shuffle around the bytes in a vector. By itself, this instruction makes Altivec a win.
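Here's that saturation sketch. The brighten helper is hypothetical (it has nothing to do with the overlap test); it just shows vec_adds clamping instead of wrapping:
static inline vector unsigned char brighten( vector unsigned char pixels )
{
    // vec_adds saturates: 250 + 8 yields 255, not 2.
    return vec_adds( pixels, vec_splat_u8( 8 ) );
}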

For our current application (testing for overlap), I'm just going to remove Vector3 completely and opt for using the vector types directly. If I did have some reason to hide the vector types (cross-platform code?), I would completely remove the following methods:
    Vector3(const float& x, const float& y, const float& z)
    {
        m_co[0] = x;
        m_co[1] = y;
        m_co[2] = z;
    }

    float&       operator[]( int i )       { return m_co[i]; }
    const float& operator[]( int i ) const { return m_co[i]; }
There's no fast way to implement these methods. They're the very antithesis of working with SIMD instructions.

Anyway, the conversion to Altivec is quite straightforward in this case. The fourth element (w) must be masked out or else initialized to zero in each vector.

The functions beginning with vec_ are vector intrinsics. Note that vec_all_ge returns an int, not a vector type value. Specifically, it returns 1 if all elements of the first vector argument are greater than or equal to the corresponding elements of the second vector argument.

Here, we don't need to pass in zero as a parameter, because we can easily build a zero vector using vec_splat_u8. I've also used int instead of bool as the return type of overlaps and test_overlap. That way, a calling function that needs to test multiple boxes can use bitwise logical operators (& and |) to avoid branches.
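For example, a caller could combine two tests with no short-circuit branches at all (a sketch, using the overlaps() defined below):
int overlaps_both( const Box& a, const Box& b, const Box& c )
{
    // Bitwise & evaluates both tests and combines them branch-free.
    return a.overlaps( b ) & a.overlaps( c );
}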

Here's the fifth version of the code:
#include <altivec.h>

#define GCC_SPLIT_BLOCK(str) __asm__( "//\n\t// " str "\n\t//\n" );

struct Box
{
    vector float m_v[2];

    enum
    {
        m_centerOffset = 0x00,
        m_extentOffset = 0x10
    };

    Box() {}

    Box(const vector float& center, const vector float& extent)
    {
        vec_st( center, m_centerOffset, (vector float*)m_v );
        vec_st( extent, m_extentOffset, (vector float*)m_v );
    }

    int overlaps(const Box& b) const
    {
        GCC_SPLIT_BLOCK("LOADS")
        const vector float zero = (vector float)vec_splat_u8( 0x00 );
        const vector float a_c  = vec_ld( m_centerOffset, (vector float*)m_v );
        const vector float a_e  = vec_ld( m_extentOffset, (vector float*)m_v );
        const vector float b_c  = vec_ld( m_centerOffset, (vector float*)b.m_v );
        const vector float b_e  = vec_ld( m_extentOffset, (vector float*)b.m_v );
        GCC_SPLIT_BLOCK("CALCULATE RESULT")
        const vector float delta_c     = vec_sub( a_c, b_c );
        const vector float abs_delta_c = vec_abs( delta_c );
        const vector float sum_e       = vec_add( a_e, b_e );
        const vector float overlap     = vec_sub( sum_e, abs_delta_c );
        const int          result      = vec_all_ge( overlap, zero );

        return (result);
    }
};

int
test_overlap( const Box& a, const Box& b )
{
    return a.overlaps( b );
}
This straightforward translation to vector types reduces the number of loads from 12 to 4:
_Z12test_overlapRK3BoxS1_:
stwu 1,-16(1)

//
// LOADS
//

li 0,16
vspltisb 11,0
lvx 12,4,0
lvx 1,3,0
lvx 0,0,3
lvx 13,0,4

//
// CALCULATE RESULT
//

vsubfp 0,0,13
addi 1,1,16
vaddfp 1,1,12
vspltisw 13,-1
vslw 12,13,13
vandc 0,0,12
vsubfp 1,1,0
vcmpgefp. 11,1,11
mfcr 3
rlwinm 3,3,25,1
blr
This is fine for doing a single overlap test. But what if we need to perform a great many tests for overlap? That's where Altivec really shines.
Doing Four Overlap Tests At Once
We'll be declaring a struct box4 representing 4 boxes. The following uniform vector layout will be used, where K is a box4 and J is the corresponding array of 4 Box objects:
K.center_x = { J[0].m_center[0], J[1].m_center[0], J[2].m_center[0], J[3].m_center[0] }
K.center_y = { J[0].m_center[1], J[1].m_center[1], J[2].m_center[1], J[3].m_center[1] }
K.center_z = { J[0].m_center[2], J[1].m_center[2], J[2].m_center[2], J[3].m_center[2] }
K.extent_x = { J[0].m_extent[0], J[1].m_extent[0], J[2].m_extent[0], J[3].m_extent[0] }
K.extent_y = { J[0].m_extent[1], J[1].m_extent[1], J[2].m_extent[1], J[3].m_extent[1] }
K.extent_z = { J[0].m_extent[2], J[1].m_extent[2], J[2].m_extent[2], J[3].m_extent[2] }
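Here's one hypothetical way to fill this layout from scalar data (box4 itself is defined in the code below). A production version would transpose Box data in-register using vec_mergeh/vec_mergel, but the indexing below makes the layout explicit:
static void box4_fill( box4* const k,
                       const float cx[4], const float cy[4], const float cz[4],
                       const float ex[4], const float ey[4], const float ez[4] )
{
    float* const out = (float*)k;   // 6 vectors = 24 consecutive floats

    for ( int n = 0; n < 4; ++n )
    {
        out[ 0 + n] = cx[n];   // center_x
        out[ 4 + n] = cy[n];   // center_y
        out[ 8 + n] = cz[n];   // center_z
        out[12 + n] = ex[n];   // extent_x
        out[16 + n] = ey[n];   // extent_y
        out[20 + n] = ez[n];   // extent_z
    }
}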
The new function box4_overlaps accepts two box4 pointers as parameters, and returns a signed int vector of overlap test results. Specifically, the Nth element of the return vector will be -1 if the Nth element of the first box4 overlaps the Nth element of the second box4. It will be 0 otherwise.

Once again we use vec_splat_u8 to build a zero vector, so we don't need zero passed in as a parameter.

Here's the sixth version of the code:
#include <altivec.h>
#include <stdint.h>

typedef struct box4 box4;

struct box4
{
    vector float center_x;
    vector float center_y;
    vector float center_z;
    vector float extent_x;
    vector float extent_y;
    vector float extent_z;
};

vector signed int
box4_overlaps( box4* const a, box4* const b )
{
    const vector float zero = (vector float)vec_splat_u8( 0x00 );
    const vector float acx  = vec_ld( 0x00, &a->center_x );
    const vector float acy  = vec_ld( 0x00, &a->center_y );
    const vector float acz  = vec_ld( 0x00, &a->center_z );
    const vector float aex  = vec_ld( 0x00, &a->extent_x );
    const vector float aey  = vec_ld( 0x00, &a->extent_y );
    const vector float aez  = vec_ld( 0x00, &a->extent_z );
    const vector float bcx  = vec_ld( 0x00, &b->center_x );
    const vector float bcy  = vec_ld( 0x00, &b->center_y );
    const vector float bcz  = vec_ld( 0x00, &b->center_z );
    const vector float bex  = vec_ld( 0x00, &b->extent_x );
    const vector float bey  = vec_ld( 0x00, &b->extent_y );
    const vector float bez  = vec_ld( 0x00, &b->extent_z );
    const vector float dx   = vec_sub( acx, bcx );
    const vector float dy   = vec_sub( acy, bcy );
    const vector float dz   = vec_sub( acz, bcz );
    const vector float abs_dx = vec_abs( dx );
    const vector float abs_dy = vec_abs( dy );
    const vector float abs_dz = vec_abs( dz );
    const vector float sum_ex = vec_add( aex, bex );
    const vector float sum_ey = vec_add( aey, bey );
    const vector float sum_ez = vec_add( aez, bez );
    const vector float overlap_x = vec_sub( sum_ex, abs_dx );
    const vector float overlap_y = vec_sub( sum_ey, abs_dy );
    const vector float overlap_z = vec_sub( sum_ez, abs_dz );
    const vector signed int result_x   = vec_cmpge( overlap_x, zero );
    const vector signed int result_y   = vec_cmpge( overlap_y, zero );
    const vector signed int result_z   = vec_cmpge( overlap_z, zero );
    const vector signed int result_xy  = vec_and( result_x, result_y );
    const vector signed int result_xyz = vec_and( result_xy, result_z );

    return (result_xyz);
}
The compiler output shows 12 loads, but we're now doing 4 overlap tests instead of 1 with those same 12 loads, so we've nearly quadrupled the per-test performance compared to the fourth version.
box4_overlaps:
addi 12,3,16
addi 5,4,16
stwu 1,-16(1)
lvx 1,0,4
addi 9,3,32
lvx 11,0,12
addi 11,3,48
lvx 7,0,5
addi 10,3,64
lvx 0,0,3
vsubfp 2,11,7
vsubfp 0,0,1
addi 8,4,32
addi 7,4,48
lvx 7,0,10
addi 6,4,64
lvx 9,0,9
lvx 10,0,6
addi 3,3,80
lvx 8,0,11
addi 4,4,80
lvx 13,0,8
addi 1,1,16
lvx 12,0,7
vsubfp 1,9,13
vaddfp 9,8,12
lvx 13,0,4
vaddfp 8,7,10
lvx 12,0,3
vaddfp 12,12,13
vspltisw 10,-1
vslw 7,10,10
vandc 0,0,7
vspltisw 13,-1
vslw 10,13,13
vandc 11,2,10
vspltisw 13,-1
vslw 10,13,13
vandc 2,1,10
vsubfp 10,9,0
vsubfp 9,8,11
vsubfp 1,12,2
vspltisb 7,0
vcmpgefp 8,10,7
vcmpgefp 11,9,7
vcmpgefp 2,1,7
vand 0,8,11
vand 2,0,2
blr
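Callers can stay branch-free as well. For example, the lane masks can be folded into a single scalar with a predicate intrinsic (any_pair_overlaps is a hypothetical sketch):
int any_pair_overlaps( box4* const a, box4* const b )
{
    // vec_any_ne returns 1 if any lane of the mask is nonzero (i.e. -1).
    return vec_any_ne( box4_overlaps( a, b ), vec_splat_s32( 0 ) );
}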
Synergistic Processor Unit
The CBE has eight Synergistic Processor Units (SPUs) that are designed for computation-intensive tasks. Suppose we want our application to run on an SPU. How can we adapt the overlap test function for the SPU environment?

We'll use the same box4 structure as in the last example. The SPU compiler recognizes vector intrinsics (beginning with si_) that are similar to those of the VXU, but not identical. Here are some of the differences that have a direct bearing on the problem at hand:

1. The return from a vector comparison is a vector unsigned int, instead of a vector signed int. A value of true is represented as 0xffffffff rather than -1 (the same bits, read as unsigned).
2. The SPU has no instruction for absolute value. We can calculate the absolute value of a vector float operand via a sequence of two instructions: si_shli (shift left immediate) and si_rotmi (rotate and mask immediate). The term rotate is misleading here; in effect, si_rotmi(v, -n) is a logical shift right of each element of v by n bits (see the scalar sketch after this list).
3. The SPU can't directly test whether one vector float operand is greater than or equal to another. So we'll use si_fcgt (vector float greater than) with operands reversed to perform a "less than" test. Then we'll invert the result with si_nor.
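For each 32-bit element x, that instruction pair computes (x << 1) >> 1 on the raw bits, which simply clears the sign bit. Here's the same trick in scalar form (a sketch, assuming IEEE-754 single-precision floats):
static inline float abs_via_shifts( float x )
{
    union { float f; unsigned int u; } v = { x };

    v.u = ( v.u << 1 ) >> 1;   // drop the sign bit, refill with zero

    return (v.f);
}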

The data type qword means a quadword (128 bits = 16 bytes) with unspecified structure. It could be a vector float, or a vector unsigned int, or some other vector type. It could even be a scalar kept in the first 32 bits, with the remaining 96 bits unused. For example, the first parameter of si_lqd is a qword with an address in the first 32 bits and the remaining 96 bits unused. The function si_from_uint casts an unsigned int to a qword. It doesn't generate any actual machine instructions.

Here's the seventh and final version of the code:
#include <spu_intrinsics.h>

typedef struct box4 box4;

struct box4
{
    vector float center_x;
    vector float center_y;
    vector float center_z;
    vector float extent_x;
    vector float extent_y;
    vector float extent_z;
};

vector unsigned int
box4_overlaps( box4* const a, box4* const b )
{
    const qword zero   = si_il( 0 );
    const qword a_addr = si_from_uint( (unsigned int) a );
    const qword b_addr = si_from_uint( (unsigned int) b );
    const qword acx    = si_lqd( a_addr, 0x00 );
    const qword acy    = si_lqd( a_addr, 0x10 );
    const qword acz    = si_lqd( a_addr, 0x20 );
    const qword aex    = si_lqd( a_addr, 0x30 );
    const qword aey    = si_lqd( a_addr, 0x40 );
    const qword aez    = si_lqd( a_addr, 0x50 );
    const qword bcx    = si_lqd( b_addr, 0x00 );
    const qword bcy    = si_lqd( b_addr, 0x10 );
    const qword bcz    = si_lqd( b_addr, 0x20 );
    const qword bex    = si_lqd( b_addr, 0x30 );
    const qword bey    = si_lqd( b_addr, 0x40 );
    const qword bez    = si_lqd( b_addr, 0x50 );
    const qword dx     = si_fs( acx, bcx );
    const qword dy     = si_fs( acy, bcy );
    const qword dz     = si_fs( acz, bcz );
    const qword uns_dx = si_shli( dx, 1 );
    const qword uns_dy = si_shli( dy, 1 );
    const qword uns_dz = si_shli( dz, 1 );
    const qword abs_dx = si_rotmi( uns_dx, -1 );
    const qword abs_dy = si_rotmi( uns_dy, -1 );
    const qword abs_dz = si_rotmi( uns_dz, -1 );
    const qword sum_ex = si_fa( aex, bex );
    const qword sum_ey = si_fa( aey, bey );
    const qword sum_ez = si_fa( aez, bez );
    const qword overlap_x = si_fs( sum_ex, abs_dx );
    const qword overlap_y = si_fs( sum_ey, abs_dy );
    const qword overlap_z = si_fs( sum_ez, abs_dz );
    // Each result_* lane is all ones where the boxes do NOT overlap on that axis.
    const qword result_x   = si_fcgt( zero, overlap_x );
    const qword result_y   = si_fcgt( zero, overlap_y );
    const qword result_z   = si_fcgt( zero, overlap_z );
    // A pair fails if it fails on any axis, so combine the failures with OR,
    // then invert: the final lanes are all ones exactly where the boxes overlap.
    const qword result_xy  = si_or( result_x, result_y );
    const qword result_xyz = si_or( result_xy, result_z );
    const qword inv_result = si_nor( result_xyz, result_xyz );

    return (vector unsigned int)(inv_result);
}
The SPU compiler output shows that there's practically a one-to-one correspondence between C statements and machine instructions. The compiler has done some reordering, but that shouldn't be a problem here.
box4_overlaps:
hbr .L2,$lr
lnop
il $14,0
lqd $34,16($3)
lqd $35,16($4)
lqd $32,0($3)
lqd $33,0($4)
lqd $30,32($3)
lqd $31,32($4)
lqd $27,64($3)
fs $29,$34,$35
lqd $28,64($4)
nop $127
lqd $24,48($3)
fs $26,$32,$33
lqd $25,48($4)
lqd $20,80($3)
fs $23,$30,$31
lnop
lnop
shli $22,$29,1
lqd $21,80($4)
fa $16,$27,$28
shli $19,$26,1
fa $13,$24,$25
shli $18,$23,1
rotmi $17,$22,-1
fa $11,$20,$21
rotmi $15,$19,-1
rotmi $12,$18,-1
fs $10,$16,$17
fs $8,$13,$15
fs $7,$11,$12
fcgt $6,$14,$10
fcgt $5,$14,$8
fcgt $9,$14,$7
or $4,$6,$5
or $3,$4,$9
nor $2,$3,$3
ori $3,$2,0
nop $127
.L2:
bi $lr
Additional Reading
Basic Altivec references:
Altivec Instruction Cross Reference, Apple (HTML)
Altivec Programming Environments Manual, Freescale (PDF)
Altivec Programmer's Interface Manual, Freescale (PDF)

Useful Altivec introductions and tutorials:
Understanding SIMD, Apple (HTML)
Altivec Tutorial, Apple (HTML)
Altivec Tutorial, Ian Ollmann (PDF)
Practical Altivec Strategies, Ian Ollmann (PDF)
Unrolling Altivec, Peter Seebach (HTML)
AltiVec Revealed, Tom Thompson

Basic SPU references:
SPU C/C++ Language Extensions (PDF)
SPU Instruction Set Architecture (PDF)