This article describes how to convert scalar code into SIMD code for the PPU and the SPU, using the matrix inverse as an example.
Most of the time in video games, programmers do not compute a standard matrix inverse; it is too expensive. Instead, to invert a matrix, they consider it orthonormal and simply do a 3x3 transpose of the rotation part, with a dot product per component for the translation. Sometimes, though, the full inverse algorithm is necessary.
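For reference, this common shortcut can be sketched as follows (the struct layout and the function name are illustrative, not taken from the article's code):

typedef struct { float m[4][4]; } mat4;   /* column-major: m[col][row] */

/* Fast inverse of an orthonormal transform: transpose the 3x3 rotation
   and rebuild the translation with one dot product per component. */
void inverse_orthonormal(mat4 *out, const mat4 *in)
{
    int r, c;
    for (r = 0; r < 3; r++)
        for (c = 0; c < 3; c++)
            out->m[c][r] = in->m[r][c];        /* transpose the rotation */

    for (r = 0; r < 3; r++)                    /* InvT = -(Transpose(R) * T) */
        out->m[3][r] = -(in->m[r][0] * in->m[3][0] +
                         in->m[r][1] * in->m[3][1] +
                         in->m[r][2] * in->m[3][2]);

    out->m[0][3] = out->m[1][3] = out->m[2][3] = 0.0f;
    out->m[3][3] = 1.0f;
}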
The main goal is to perform the inversion as fast as possible; this is why the code should use SIMD instructions as much as possible.
SIMD processing exploits data-level parallelism. Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel.
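For example, a single AltiVec multiply-add processes four pairs of floats at once (an illustrative snippet, not part of the article's code):

vector float a    = { 1.0f, 2.0f, 3.0f, 4.0f };
vector float b    = { 5.0f, 6.0f, 7.0f, 8.0f };
vector float zero = { 0.0f, 0.0f, 0.0f, 0.0f };
vector float r    = vec_madd(a, b, zero);   /* r = { 5, 12, 21, 32 } */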
The number of branches should also be kept to a strict minimum; any extra branch will slow down the final solution. The first step is to choose the most suitable algorithm in order to reach these objectives.
Different algorithms exist to inverse a matrix:
The Gauss-Jordan elimination:
The Gauss-Jordan elimination is a method to find the inverse matrix solving a system of linear equations.
A good explanation of how this algorithm works can be found in the book "Numerical Recipes in C" [library.cornell.edu], chapter 2.1.
For a visual demonstration using a java applet see: Gauss-Jordan Elimination [cse.uiuc.edu].
In this algorithm, the choice of a good pivot is a critical part. To find it, all the floating point values of a specific column need to be compared with each other, one by one. This, by definition, doesn't map well to SIMD code.
While performing the algorithm, some multiplications are done between columns (e.g., to apply the pivot) and some other operations between rows (e.g., to apply the multiplier to the rest of the matrix). This requires extra code to swap rows and columns in order to use SIMD instructions.
Inversion using LU decomposition: The description of the inverse calculation can be found in "Numerical Recipes in C" [library.cornell.edu] chapter 2.3.
Inversion by Partitioning: To inverse a matrix A (size N) by partitioning, the matrix is partitioned into:

A = | A0 A1 |
    | A2 A3 |

with A0 and A3 square matrices of respective sizes s0 and s3, following the rule: s0 + s3 = N.

The inverse is:

InvA = | B0 B1 |
       | B2 B3 |

with:

B0 = Inv(A0 - A1 * InvA3 * A2)
B1 = - B0 * (A1 * InvA3)
B2 = - (InvA3 * A2) * B0
B3 = InvA3 + B2 * (A1 * InvA3)

More information can be found in "Numerical Recipes in C" [library.cornell.edu] chapter 2.7.
The issue described above is also present here; besides, the goal is to work on four floats at a time and not only two.
Using the inverse formula ( (1/det(M)) * Transpose(Cofactor(M))): Check the article about Matrix Inverse [mathworld.wolfram.com] for more information about this formula.
This is the algorithm which will be used to invert the matrix. Each step presents a very good factorization ratio; it is possible to group the operations in order to replace them with SIMD instructions.
The most critical part of this algorithm is the calculation of all the cofactors. This part also has two great advantages for our objectives: it is 100% calculation, which allows writing code without any branching, and all the cofactor values are computed the same way, so they can be computed in parallel and independently of each other. This is a perfect place to use SIMD instructions.
This article will start with a basic implementation of the inverse formula using scalar instructions. Then this code will be modified to prepare the SIMD version. The first SIMD version will be done for the PPU; the final one will be a conversion using the SPU intrinsic instruction set.
InvM = (1/det(M)) * Transpose(Cofactor(M))

which can also be written:

InvM = (1/det(M)) * Adjoint(M)
with Adjoint(M) = Transpose(Cofactor(M))

For the scalar version, the matrix is defined as follows:
typedef struct s_vector
{
float row[4];
} s_vector;
typedef struct s_matrix
{
s_vector cols[4];
} s_matrix;
The first version of the code does a standard implementation of the formula. The inverse function calls the cofactor function which computes and returns the cofactor matrix.
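Under these definitions, the overall flow can be sketched like this (all the function names below are hypothetical stand-ins, not the actual contents of inverse_v1.h):

void inverse(s_matrix *output, const s_matrix *source)
{
    s_matrix cofactor_matrix;
    float det;

    cofactor(&cofactor_matrix, source);              /* cofactor of every element */
    det = determinant(source, &cofactor_matrix);     /* expansion along a column  */
    transpose(&cofactor_matrix);                     /* gives the adjoint         */
    scale(output, &cofactor_matrix, 1.0f / det);     /* InvM = Adjoint(M)/det(M)  */
}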
Definition 2 - If A is a square matrix then the cofactor of a(i,j), denoted by C(i,j), is the number ((-1)^(i+j))*M(i,j).
- Choose any row, say row i, then,
- det(A) = a(i,1)C(i,1) + a(i,2)C(i,2) + ... + a(i,n)C(i,n)
- Choose any column, say column j, then,
- det(A) = a(1,j)C(1,j) + a(2,j)C(2,j) + ... + a(n,j)C(n,j)
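As a sketch, the expansion along column 0 translates directly to the structures defined above (the function name and the precomputed cofactor matrix are assumptions for illustration):

float determinant_by_column0(const s_matrix *source, const s_matrix *cofactor)
{
    float det = 0.0f;
    int row;
    /* det(A) = a(1,1)C(1,1) + a(2,1)C(2,1) + a(3,1)C(3,1) + a(4,1)C(4,1) */
    for (row = 0; row < 4; row++)
        det += source->cols[0].row[row] * cofactor->cols[0].row[row];
    return det;
}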
The full code is available here: inverse_v1.h
/* Version 1: scale every element of the matrix by a factor. */
for ( col = 0 ; col < 4 ; col++ )
{
    for ( row = 0; row < 4; row++ )
    {
        output->cols[col].row[row] = source->cols[col].row[row] * factor;
    }
}
The second reason for this refactoring is to locate the SIMD blocks. Unrolling the multiplication is straightforward (see the sketch below), and the same changes can be applied to the transpose and the determinant functions.
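For instance, the unrolled version of the loop above might look like this, each group of four lines mapping to one future SIMD multiply:

output->cols[0].row[0] = source->cols[0].row[0] * factor;
output->cols[0].row[1] = source->cols[0].row[1] * factor;
output->cols[0].row[2] = source->cols[0].row[2] * factor;
output->cols[0].row[3] = source->cols[0].row[3] * factor;
/* ... and the same pattern for columns 1, 2 and 3 */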
The following chapter will detail the code of the cofactor matrix. The second scalar version can be found here: inverse_v2.h
The new cofactor code is:
cofactor_column_v2(output->cols[0].row, source, col);
cofactor_column_v2(output->cols[1].row, source, col);
cofactor_column_v2(output->cols[2].row, source, col);
cofactor_column_v2(output->cols[3].row, source, col);

Inside cofactor_column_v2, the rows are grouped together to give a better view of what needs to be done to convert this code into SIMD:
const float r0_pos_part1 = mat->cols[col0].row[1] *
                           mat->cols[col1].row[2] *
                           mat->cols[col2].row[3];
const float r1_pos_part1 = mat->cols[col0].row[2] *
                           mat->cols[col1].row[3] *
                           mat->cols[col2].row[0];
const float r2_pos_part1 = mat->cols[col0].row[3] *
                           mat->cols[col1].row[0] *
                           mat->cols[col2].row[1];
const float r3_pos_part1 = mat->cols[col0].row[0] *
                           mat->cols[col1].row[1] *
                           mat->cols[col2].row[2];

The row indices clearly show a relation between them. By writing r0_pos_part1 as follows:
r[0]_pos_part1 = mat->cols[c0].row[r0] *
                 mat->cols[c1].row[r1] *
                 mat->cols[c2].row[r2]

the next rows can be written like this:

r[N]_pos_part1 = mat->cols[c0].row[(r0+N)&3] *
                 mat->cols[c1].row[(r1+N)&3] *
                 mat->cols[c2].row[(r2+N)&3]

The same relation is present in all the positive and negative parts of the calculation. Basically, in order to calculate the different parts of the 3x3 determinants for a given column, the three other columns need to be multiplied together after being rotated by a specific amount.
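To make the relation concrete, all four positive parts could be produced by a single loop over N (a hypothetical rewrite, for illustration only):

int n;
float pos_part1[4];
for (n = 0; n < 4; n++)
    pos_part1[n] = mat->cols[col0].row[(1 + n) & 3] *
                   mat->cols[col1].row[(2 + n) & 3] *
                   mat->cols[col2].row[(3 + n) & 3];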
Those 3x3 determinants, also called the minors of the matrix, need to have their signs adjusted.
Following the idea of converting the code using SIMD instructions, two variables have been created:
static const unsigned int znzn[] = { 0x00000000, 0x80000000, 0x00000000, 0x80000000 };
static const unsigned int nznz[] = { 0x80000000, 0x00000000, 0x80000000, 0x00000000 };

They contain the two possible sign masks for a whole column. When the column number is even, znzn is the mask to select; in the other case, nznz is the one to choose.
To select the correct variable, the basic way (and probably the most commonly used nowadays) would be an 'if'. As indicated at the beginning of this article, the 'if' statement is something to avoid as much as possible: it generates branches. The solution is to use a mask (col_mask) and to do a selection with it:
/* col_mask is all ones when the column number is odd, all zeros when it
   is even: bit 0 of col is moved to the sign bit, then an arithmetic
   shift right propagates it through the whole word. */
const unsigned int col_mask = (const unsigned int)(((const int)((col & 1) << 31)) >> 31);
const unsigned int u_znzn = (const unsigned int)(&znzn[0]);
const unsigned int u_nznz = (const unsigned int)(&nznz[0]);
union
{
    unsigned int u;
    unsigned int *p;
} mask;
/* Branch-free select: nznz when the column is odd, znzn when it is even. */
mask.u = (u_nznz&col_mask)|(u_znzn&~col_mask);
The union is here to respect the strict aliasing rule.
Once the correct pointer is selected, the final calculation is simple:
r0_cofactor.i ^= mask.p[0];
r1_cofactor.i ^= mask.p[1];
r2_cofactor.i ^= mask.p[2];
r3_cofactor.i ^= mask.p[3];

The next step is the conversion of this scalar version for the PPU, using the AltiVec instruction set.
typedef struct s_matrix
{
    vector float cols[4];
} s_matrix;

In version 2, the rows were grouped together; this showed the idea of rotating their indices to do the calculation, and this idea is used in the SIMD code. Different rotations of each column are required, and those rotations are computed first. The variable names are defined as cXuY, where X is the column number and Y the number of floats the column is rotated up by:
/* With c0 = (x, y, z, w): */
const vector float c0u1 = vec_sld(c0, c0, 4);    /* (y, z, w, x) */
const vector float c0u2 = vec_sld(c0, c0, 8);    /* (z, w, x, y) */
const vector float c0u3 = vec_sld(c0, c0, 12);   /* (w, x, y, z) */
....
The calculation of each cofactor is based on the determinant of the 3x3 matrix created by removing the cofactor's column and row from the source matrix. That is why, for the first column, the multiplication is done in reverse order (i.e., the third and fourth columns are multiplied together before doing the operation with the second one). This way, the result is available to compute the cofactors of the second column.
With the third column, the multiplication is done in the initial order (multiplying the first and second columns together first) to share the results with the fourth column. Note: in the source code, the fourth column has been computed before the third one, for convenience only (to avoid the mistakes of working with columns 0, 1, and 2 instead of 0, 1, and 3). The final result is identical.
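As a sketch (assuming the rotated columns c1u1, c2u2, c3u3 and a zero vector, as in the article's code), the first positive part of column 0's cofactors is the element-wise product of the three other columns, each rotated up by its position; the intermediate product is the one shared with the neighboring column:

const vector float m_c2u2_c3u3 = vec_madd(c2u2, c3u3, zero);         /* shared product   */
const vector float pos_part1   = vec_madd(c1u1, m_c2u2_c3u3, zero);  /* column 0, part 1 */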
The same masking operation for the sign bit is done using SIMD instructions.
In order to calculate the adjoint matrix, the transpose code has also been converted. The unrolled version wasn't really helpful here; knowledge of the AltiVec instructions was required, especially the ones which manipulate data: vec_mergeh and vec_mergel.
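The classic merge-based 4x4 transpose can be sketched as follows (the variable names are assumed, with cN holding column N of the matrix):

const vector float tmp0 = vec_mergeh(c0, c2);      /* c00 c20 c01 c21 */
const vector float tmp1 = vec_mergeh(c1, c3);      /* c10 c30 c11 c31 */
const vector float tmp2 = vec_mergel(c0, c2);      /* c02 c22 c03 c23 */
const vector float tmp3 = vec_mergel(c1, c3);      /* c12 c32 c13 c33 */
const vector float row0 = vec_mergeh(tmp0, tmp1);  /* c00 c10 c20 c30 */
const vector float row1 = vec_mergel(tmp0, tmp1);  /* c01 c11 c21 c31 */
const vector float row2 = vec_mergeh(tmp2, tmp3);  /* c02 c12 c22 c32 */
const vector float row3 = vec_mergel(tmp2, tmp3);  /* c03 c13 c23 c33 */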
The determinant function now returns a vector float in which each element is nearly equal (nearly, because of floating point precision) to the determinant of the matrix. The algorithm is the same as before: a multiplication is computed between a row or a column and the corresponding values in the cofactor matrix, then all the values are added together.
The multiplication of each value is a simple SIMD instruction; unfortunately, no instruction exists to dispatch the sum of all the values of a vector. The solution is to rotate the result vector and add it to itself, twice, as follows:

  ( A    B    C    D   )
+ ( C    D    A    B   )
=========================
  ( A+C  B+D  A+C  B+D )
+ ( B+D  A+C  B+D  A+C )
=========================
= the vector with the determinant stored in each element.
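In AltiVec, this reduction can be sketched as follows (the variable names are assumed; zero is a vector of 0.0f):

vector float det = vec_madd(c0, cofactor_c0, zero);   /* per-element products  */
det = vec_add(det, vec_sld(det, det, 8));             /* (A+C, B+D, A+C, B+D)  */
det = vec_add(det, vec_sld(det, det, 4));             /* A+B+C+D in every slot */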
The code of this first PPU version of the inverse matrix can be found here: inverse_v3.h
The AltiVec instruction set doesn't include any instruction to define arbitrary constants, so every declared constant has to be loaded from memory. The following lines:
const vector unsigned int u_znzn = { 0x00000000, 0x80000000, 0x00000000, 0x80000000 };
const vector unsigned int u_nznz = { 0x80000000, 0x00000000, 0x80000000, 0x00000000 };

will generate the following code (using ppu-gcc from mambo):

ld 4, .LC18@toc(2)
ld 11, .LC16@toc(2)

If one operation is known for being slow, it is memory access. To avoid loading the constant values, they are built on the fly as follows:
const vector unsigned int u_zero = (vector unsigned int)zero;
const vector unsigned int u_two = vec_splat_u32(2);
const vector unsigned int u_fifteen = vec_splat_u32(15);
const vector unsigned int u_2shift15 = vec_sl(u_two, u_fifteen);
const vector unsigned int u_signmask = vec_sl(u_2shift15, u_fifteen);
const vector unsigned int u_nznz = vec_mergeh(u_signmask, u_zero);
const vector unsigned int u_znzn = vec_mergeh(u_zero, u_signmask);

u_signmask, after its initialization, contains the vector { 0x80000000, 0x80000000, 0x80000000, 0x80000000 }. The use of vec_mergeh then inserts zero values between its elements to finalize the constant values.
Another part of the code can also be improved by using shift instructions instead of multiplications:
const vector float m_c2u1_c3u2 = vec_madd(c2u1, c3u2, zero);

can be replaced by:

const vector float m_c2u1_c3u2 = vec_sld(m_c2u2_c3u3, m_c2u2_c3u3, 12);

On the PPU, vec_madd and vec_sld aren't executed on the same pipeline; they might now be executed in parallel.
The optimized version of the previous PPU code can be found here: inverse_v4.h
The final version of the inverse matrix for PPU where the whole code has been placed in a single function can be downloaded here: inverse_v5.h
Some altivec instructions have a direct equivalent as SPU intrinsic instructions:
- vec_madd is replaced by either spu_madd or simply by spu_mul when the third parameter is zero.
- vec_xor, vec_re, and vec_sub respectively become spu_xor, spu_re, and spu_sub.

Others have no direct equivalent and are emulated with spu_shuffle and an appropriate pattern:

- vec_sld
- vec_mergeh
- vec_mergel
To replace the sld function, three patterns are required (a sketch follows the list):
- YZWX to replace: sld(v, v, 4)
- ZWXY to replace: sld(v, v, 8)
- WXYZ to replace: sld(v, v, 12)
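On the SPU, these three patterns can be expressed as spu_shuffle masks (a sketch; the pattern names are assumptions, and spu_shuffle selects bytes from the concatenation of its first two arguments):

const vec_uchar16 pat_YZWX = {  4, 5, 6, 7,  8, 9,10,11, 12,13,14,15,  0, 1, 2, 3 };
const vec_uchar16 pat_ZWXY = {  8, 9,10,11, 12,13,14,15,  0, 1, 2, 3,  4, 5, 6, 7 };
const vec_uchar16 pat_WXYZ = { 12,13,14,15,  0, 1, 2, 3,  4, 5, 6, 7,  8, 9,10,11 };

const vec_float4 c0u1 = spu_shuffle(c0, c0, pat_YZWX);   /* equivalent of vec_sld(c0, c0, 4) */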
The SPU version of the inverse matrix can be found here: inverse_v6.h
- Avoid the algorithms which deal with special cases (like the Gauss-Jordan elimination).
- Start with a simple scalar implementation.
- Unroll the loops and group the code which can be executed in parallel and which follows the same patterns (like in the cofactor function).
- Get used to the data manipulation instructions (vec_mergeh, vec_sld, spu_shuffle...).
- Look at the generated assembly code.
- Prefer building the PPU data on the fly instead of loading it from memory.
In the past years, Cedric was mostly working on AI. He also did some high and low level optimization of PS2 code to help his game reach a correct frame rate.
The last game he worked on is Darkwatch for Highmoon Studios.
Previously, he was lead AI programmer on Street Racing Syndicate at Eutechnyx.