A 4x4 Matrix Inverse

GUEST ARTICLE! Cédric Lallain is a Frenchman who has been working with me on Cell/PS3 research at Highmoon Studios in Carlsbad, CA. I hope that this is only the first of many contributions to the community by Cédric. Welcome aboard! -- Mike.
Inverting a matrix on the PPU and on the SPU using SIMD instructions.

This article will show how to convert scalar code to SIMD code for the PPU and the SPU, using the matrix inverse as an example.

Most of the time in video games, programmers do not compute a standard matrix inverse. It is too expensive. Instead, they assume the matrix is orthonormal: they simply transpose the 3x3 rotation part and rebuild the translation with dot products. Sometimes, however, the full inverse algorithm is necessary.
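For reference, here is a minimal sketch of that cheap path (not from the article; hypothetical function name, assuming a column-major 4x4 matrix whose upper-left 3x3 part is orthonormal):

    /* Hedged sketch: the inverse of an orthonormal transform is the transpose
       of the 3x3 rotation part, plus a translation rebuilt with dot products
       against the original axes.                                             */
    void inverse_orthonormal(float out[4][4], const float m[4][4])
    {
        int r, c;

        for (c = 0; c < 3; c++)              /* transpose the rotation part */
            for (r = 0; r < 3; r++)
                out[c][r] = m[r][c];

        for (r = 0; r < 3; r++)              /* t' = -(R^T * t)             */
            out[3][r] = -(m[3][0] * m[r][0] +
                          m[3][1] * m[r][1] +
                          m[3][2] * m[r][2]);

        out[0][3] = out[1][3] = out[2][3] = 0.0f;
        out[3][3] = 1.0f;
    }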

The main goal is to be able to do it as fast as possible. This is why the code should use SIMD instructions as much as possible.

A vector is an instruction operand containing a set of data elements packed into a one-dimensional array. The elements can be fixed-point or floating-point values. Most Vector/SIMD Multimedia Extension and SPU instructions operate on vector operands. Vectors are also called Single-Instruction, Multiple-Data (SIMD) operands, or packed operands.
SIMD processing exploits data-level parallelism. Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel.
[Chapter 2.5.1 in the released pdf by IBM: Cell Broadband Engine Programming Handbook [ibm.com]].
Each SPE is a 128-bit RISC processor specialized for data-rich, compute-intensive SIMD and scalar applications.
[Chapter 3 in the released pdf by IBM: Cell Broadband Engine Programming Handbook [ibm.com]].
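As a trivial illustration (not from the article), a single altivec instruction operates on four floats at once:

    vector float a = { 1.0f, 2.0f, 3.0f, 4.0f };
    vector float b = { 5.0f, 6.0f, 7.0f, 8.0f };
    vector float c = vec_add(a, b);   /* { 6.0f, 8.0f, 10.0f, 12.0f } in one instruction */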


Also, the number of branches should be kept to a strict minimum. Any extra branch will slow down the final solution. The first step is to choose the most suitable algorithm in order to reach these objectives. Different algorithms exist to invert a matrix:

The Gauss-Jordan elimination: The Gauss-Jordan elimination is a method to find the inverse matrix by solving a system of linear equations. A good explanation of how this algorithm works can be found in the book "Numerical Recipes in C" [library.cornell.edu], chapter 2.1.
For a visual demonstration using a Java applet, see: Gauss-Jordan Elimination [cse.uiuc.edu]. In this algorithm, the choice of a good pivot is a critical part. To find it, all floating-point values of a specific column need to be compared with each other, one by one. This, by definition, doesn't map very well to SIMD code.
While performing the algorithm, some multiplications are done between columns (e.g.: to apply the pivot) and some other operations between rows (e.g.: to apply the multiplier to the rest of the matrix). This requires extra code to swap rows and columns in order to use SIMD instructions.

Inversion using LU decomposition: The description of the inverse calculation can be found in "Numerical Recipes in C" [library.cornell.edu] chapter 2.3.

In linear algebra, a block LU decomposition is a decomposition of a block matrix into a lower block triangular matrix L and an upper block triangular matrix U. This decomposition is used in numerical analysis to reduce the complexity of the block matrix formula.
This algorithm would probably be very useful if the size of the matrix were 8x8. In the 4x4 case, it requires doing the calculation only two floating-point values at a time, whereas a vector type contains four.

Inversion by Partitioning: To invert a matrix A (of size N) by partitioning, the matrix is partitioned into:

        | A0  A1 |
    A = |        |
        | A2  A3 |

with A0 and A3 square matrices of respective sizes s0 and s3 following the rule: s0 + s3 = N.

The inverse is

           | B0  B1 |
    InvA = |        |
           | B2  B3 |

with:

    B0 = Inv(A0 - A1 * InvA3 * A2)
    B1 = -B0 * (A1 * InvA3)
    B2 = -(InvA3 * A2) * B0
    B3 = InvA3 + B2 * (A1 * InvA3)
More information can be found in "Numerical Recipes in C" [library.cornell.edu] chapter 2.7

The issue mentioned above is also present here; the goal is to work on four floating-point values at a time, not only two.

Using the inverse formula ( (1/det(M)) * Transpose(Cofactor(M))): Check the article about Matrix Inverse [mathworld.wolfram.com] for more information about this formula.

This is the algorithm which will be used to invert the matrix. Each step presents a very good factorization ratio; it's possible to group the operations in order to replace them with SIMD instructions.
The most critical part of this algorithm is the calculation of all the cofactors. This part also has two great advantages for our objectives. It's 100% calculation, which allows writing code without branching. And all cofactor values are computed the same way, so they can be computed in parallel and independently of each other. This is a perfect place to use SIMD instructions.

This article will start with a basic implementation of the inverse formula using scalar instructions. Then this code will be modified to prepare the SIMD version. The first SIMD version will be done for the PPU. The final one will be a conversion using the SPU intrinsic instruction set.

A 4x4 matrix inverse
The general formula is:
   InvM = (1/det(M)) * Transpose(Cofactor(M))
which can also be written:
   InvM = (1/det(M)) * Adjoint(M) with
Adjoint(M) = Transpose(Cofactor(M))
For the scalar version, the matrix is defined as follows:
    typedef struct s_vector
    {
        float row[4];
    } s_vector;

    typedef struct s_matrix
    {
        s_vector cols[4];
    } s_matrix;

The first version of the code does a standard implementation of the formula. The inverse function calls the cofactor function which computes and returns the cofactor matrix.

Definition 1 - If A is a square matrix then the minor of a(i,j), denoted by M(i,j), is the determinant of the submatrix that results from removing the ith row and jth column of A.
Definition 2 - If A is a square matrix then the cofactor of a(i,j), denoted by C(i,j), is the number ((-1)^(i+j))*M(i,j).
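As a small illustration of these definitions (not from the article, using the s_matrix layout above where m->cols[j].row[i] is the element a(i,j)), the first cofactor of a 4x4 matrix could be computed like this:

    /* The minor M(0,0) is the 3x3 determinant of rows 1..3 and columns 1..3,
       and C(0,0) = ((-1)^(0+0)) * M(0,0) = +M(0,0).                          */
    const float c00 = m->cols[1].row[1] * (m->cols[2].row[2] * m->cols[3].row[3] -
                                           m->cols[3].row[2] * m->cols[2].row[3])
                    - m->cols[2].row[1] * (m->cols[1].row[2] * m->cols[3].row[3] -
                                           m->cols[3].row[2] * m->cols[1].row[3])
                    + m->cols[3].row[1] * (m->cols[1].row[2] * m->cols[2].row[3] -
                                           m->cols[2].row[2] * m->cols[1].row[3]);
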
Once the cofactor matrix is computed, the result is used to calculate the determinant and also the adjoint matrix.
Theorem 1 - If A is an n x n matrix:
  • Choose any row, say row i, then,
    • det(A) = a(i,1)C(i,1) + a(i,2)C(i,2) + ... + a(i,n)C(i,n)
  • Choose any column, say column j, then,
    • det(A) = a(1,j)C(1,j) + a(2,j)C(2,j) + ... + a(n,j)C(n,j)
The adjoint of A is the transpose of the matrix of cofactors and is denoted by adj(A).
From there, the inverse matrix is just a division of the adjoint matrix by the determinant.
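A rough sketch of that flow (hypothetical helper names; the actual implementation is in inverse_v1.h linked below):

    /* Hedged sketch of the overall flow, not the article's exact code.
       cofactor(), transpose() and multiply() are hypothetical helpers.   */
    void inverse_v1(s_matrix *output, const s_matrix *source)
    {
        s_matrix cof;
        float    det = 0.0f;
        int      row;

        cofactor(&cof, source);                    /* cofactor matrix            */

        for (row = 0; row < 4; row++)              /* Theorem 1, along column 0  */
            det += source->cols[0].row[row] * cof.cols[0].row[row];

        transpose(&cof);                           /* cofactors -> adjoint       */
        multiply(output, &cof, 1.0f / det);        /* divide by the determinant  */
    }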

The full code is available here: inverse_v1.h

Toward the SIMD
Even for the scalar code, it's better to unroll the loops; this gives the compiler more options for optimization. It also gets rid of the branches. This rule is especially true for small loops with few iterations, like:
    for ( col = 0 ; col < 4 ; col++ )
    {
        for ( row = 0; row < 4; row++ )
        {
            output->cols[col].row[row] = source->cols[col].row[row] * factor;
        }
    }

The second reason to do this refactoring is to locate the SIMD blocks. Unrolling the multiplication is straightforward. The same changes can be applied to the transpose and the determinant functions.
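For example, the scaling loop above might be unrolled into something like this (a sketch, not necessarily the exact code of inverse_v2.h):

    output->cols[0].row[0] = source->cols[0].row[0] * factor;
    output->cols[0].row[1] = source->cols[0].row[1] * factor;
    output->cols[0].row[2] = source->cols[0].row[2] * factor;
    output->cols[0].row[3] = source->cols[0].row[3] * factor;

    /* ... and so on for columns 1, 2 and 3: sixteen independent         */
    /* multiplications, each group of four being an obvious SIMD block.  */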

The following chapter will detail the code of the cofactor matrix. The second scalar version can be found here: inverse_v2.h

The case of the cofactor matrix
To avoid too much confusion, in the second scalar version, the new helper function is now called 'cofactor_column_v2' instead of 'cofactor_ij_v1'. It takes care of a whole column of cofactors and not just one at a time.

The new cofactor code is:

    cofactor_column_v2(output->cols[0].row, source, 0);
    cofactor_column_v2(output->cols[1].row, source, 1);
    cofactor_column_v2(output->cols[2].row, source, 2);
    cofactor_column_v2(output->cols[3].row, source, 3);
Inside cofactor_column_v2, the rows are grouped together to get a better view of what needs to be done to convert this into SIMD code:
    const float r0_pos_part1 = mat->cols[col0].row[1] *
                               mat->cols[col1].row[2] *
                               mat->cols[col2].row[3];

    const float r1_pos_part1 = mat->cols[col0].row[2] *
                               mat->cols[col1].row[3] *
                               mat->cols[col2].row[0];

    const float r2_pos_part1 = mat->cols[col0].row[3] *
                               mat->cols[col1].row[0] *
                               mat->cols[col2].row[1];

    const float r3_pos_part1 = mat->cols[col0].row[0] *
                               mat->cols[col1].row[1] *
                               mat->cols[col2].row[2];
The row indices clearly show a relation between them. By writing r0_pos_part1 as follows:
    r[0]_pos_part1 = mat->cols[c0].row[r0] *
                     mat->cols[c1].row[r1] *
                     mat->cols[c2].row[r2]
the next rows can be written like this:
    r[N]_pos_part1 = mat->cols[c0].row[(r0+N)&3] *
                     mat->cols[c1].row[(r1+N)&3] *
                     mat->cols[c2].row[(r2+N)&3]
The same relation is present in all positive and negative parts of the calculation. Basically, in order to calculate the different parts of the 3x3 determinants for a given column, the three other columns need to be multiplied together after being rotated by a specific amount.

Those 3x3 determinants, also called minors of the matrix, need to have their signs adjusted.

Following the idea of converting the code using SIMD instructions, two variables have been created:

    static const unsigned int znzn[] = { 0x00000000, 0x80000000, 0x00000000, 0x80000000 };
    static const unsigned int nznz[] = { 0x80000000, 0x00000000, 0x80000000, 0x00000000 };
They contain the two possible sign masks for a whole column. When the column number is odd, nznz is the mask to select; otherwise, znzn is the one to choose.

To select the correct variable, the basic way (and probably still the most commonly used one) would be an 'if'. As indicated at the beginning of this article, the 'if' statement is something to avoid as much as possible: it generates branches. The solution is to use a mask (col_mask) and to do the selection with it:

    const unsigned int col_mask = (const unsigned int)(((const int)((col & 1) << 31)) >> 31);
    const unsigned int u_znzn   = (const unsigned int)(&znzn[0]);
    const unsigned int u_nznz   = (const unsigned int)(&nznz[0]);
    union
    {
        unsigned int  u;
        unsigned int *p;
    } mask;
    mask.u = (u_nznz & col_mask) | (u_znzn & ~col_mask);

The union is here to avoid breaking the strict aliasing rule when converting the integer value back into a pointer.

Once the correct pointer is selected, the final calculation is simple:

    r0_cofactor.i ^= mask.p[0];
    r1_cofactor.i ^= mask.p[1];
    r2_cofactor.i ^= mask.p[2];
    r3_cofactor.i ^= mask.p[3];
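The rX_cofactor variables used here are presumably float/integer unions, so that the sign bit of the float can be flipped with an integer XOR. A minimal sketch of such a type (hypothetical name; the actual definition is in inverse_v2.h):

    typedef union u_float_bits
    {
        float        f;   /* the cofactor value                   */
        unsigned int i;   /* the same 32 bits, exposed for the XOR */
    } u_float_bits;
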
The next step is the conversion of this scalar version for the PPU using the altivec instruction set.
Altivec version
A new definition for the matrix type is required:
    typedef struct s_matrix
    {
        vector float cols[4];
    } s_matrix;
In version 2, the rows were grouped together. This showed the idea of rotating their indices to do the calculation, and that idea is reused in the SIMD code. Different rotations of each column are required, and those rotations are computed first. The variable names are defined as follows: cXuY, where X is the column number and Y the number of floats the column is rotated up by:
    const vector float c0u1 = vec_sld(c0, c0,  4);
    const vector float c0u2 = vec_sld(c0, c0,  8);
    const vector float c0u3 = vec_sld(c0, c0, 12);
    ...
The order in which the calculations are done is really important to minimize the number of operations.

The calculation of each cofactor is based on the determinant of the 3x3 matrix created by removing the cofactor's column and row from the source matrix. That is why, for the first column, the multiplication is done in reverse order (i.e.: the third and fourth columns are multiplied together before doing the operation using the second one). This way, the result is available to compute the cofactors of the second column.
With the third column, the multiplication is done in the initial order (multiplying the first and second columns together first) to share the results with the fourth column.
Note: in the source code, the fourth column has been computed before the third one, for convenience only (to avoid mistakes working with columns 0, 1, and 2 instead of 0, 1, and 3). The final result is identical.
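As a hedged illustration of how the grouped scalar products map to SIMD (following the cXuY naming above, for the first cofactor column; this is a sketch, the actual code is in inverse_v3.h):

    /* For the first cofactor column (source columns 1, 2 and 3), element N of
       pos_part1 is c1[(1+N)&3] * c2[(2+N)&3] * c3[(3+N)&3], i.e. simply
       c1u1 * c2u2 * c3u3 computed four lanes at a time.
       'zero' is a vector float of zeros, as used elsewhere in the code.      */
    const vector float m_c2u2_c3u3  = vec_madd(c2u2, c3u3, zero);        /* third * fourth column  */
    const vector float pos_part1_c0 = vec_madd(c1u1, m_c2u2_c3u3, zero); /* times the second column */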

The same masking operation for the sign bit is done using SIMD instructions.

In order to calculate the adjoint matrix, the transpose code has also been converted. The unrolled version wasn't really helpful here. Knowledge of the altivec instructions was required, especially the ones which manipulate the data: vec_mergeh and vec_mergel.
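One possible way to build the transpose with those two instructions (a sketch assuming c0..c3 hold the four columns; the article's actual code is in inverse_v3.h):

    /* vec_mergeh(a, b) = { a0, b0, a1, b1 }; vec_mergel(a, b) = { a2, b2, a3, b3 } */
    const vector float t0 = vec_mergeh(c0, c2);  /* x0 x2 y0 y2 */
    const vector float t1 = vec_mergeh(c1, c3);  /* x1 x3 y1 y3 */
    const vector float t2 = vec_mergel(c0, c2);  /* z0 z2 w0 w2 */
    const vector float t3 = vec_mergel(c1, c3);  /* z1 z3 w1 w3 */
    const vector float r0 = vec_mergeh(t0, t1);  /* x0 x1 x2 x3 */
    const vector float r1 = vec_mergel(t0, t1);  /* y0 y1 y2 y3 */
    const vector float r2 = vec_mergeh(t2, t3);  /* z0 z1 z2 z3 */
    const vector float r3 = vec_mergel(t2, t3);  /* w0 w1 w2 w3 */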

The determinant function now returns a vector float in which each element is nearly equal (nearly, because of floating-point precision) to the determinant of the matrix.
The algorithm is the same as before. A multiplication is computed between a row or a column and the corresponding values in the cofactor matrix, and all the values are added together.
The multiplication of the values is a single SIMD instruction; unfortunately, no instruction exists to dispatch the sum of all the values of a vector into every element. The solution is to rotate the result vector twice and add it to itself as follows:

      (  A    B    C    D  )
    + (  C    D    A    B  )
    =========================
      ( A+C  B+D  C+A  D+B )
    + ( B+D  C+A  D+B  A+C )
    =========================
    = the vector with the determinant stored in each element.
It's important to know that the values are not necessarily the same across the vector: because of the order of the calculations and the limited precision of floating point, they can be slightly different. A vec_splat can be applied to the vector to force them to be identical.
The final multiplication (by one over the determinant) can be easily performed, considering the function already has a vector filled with the determinant.
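A hedged sketch of this rotate-and-add sum and of the final scale (assuming 'prod' holds the four products A, B, C, D, 'adj_c0' is one adjoint column and 'zero' is a zero vector; not the article's exact code):

    const vector float sum2    = vec_add(prod, vec_sld(prod, prod, 8)); /* A+C  B+D  C+A  D+B      */
    const vector float det     = vec_add(sum2, vec_sld(sum2, sum2, 4)); /* A+B+C+D in every element */
    const vector float inv_det = vec_re(det);                           /* reciprocal estimate      */
    const vector float out_c0  = vec_madd(adj_c0, inv_det, zero);       /* scale an adjoint column  */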

The code of this first PPU version of the inverse matrix can be found here: inverse_v3.h

Altivec optimization
Once the code is working on PPU, the next step is the optimization.

The altivec instruction set doesn't include any instruction to define arbitrary constants; every constant has to be constructed or loaded from memory. The following lines:

    const vector unsigned int u_znzn = { 0x00000000, 0x80000000, 0x00000000, 0x80000000 };
    const vector unsigned int u_nznz = { 0x80000000, 0x00000000, 0x80000000, 0x00000000 };
will generate the following code (using ppu-gcc from mambo):
    ld 4, .LC18@toc(2)
    ld 11, .LC16@toc(2)
If one operation is known to be slow, it is memory access. To avoid loading the constant values, they are built on the fly as follows:
    const vector unsigned int u_zero     = (vector unsigned int)zero;
    const vector unsigned int u_two      = vec_splat_u32(2);
    const vector unsigned int u_fifteen  = vec_splat_u32(15);
    const vector unsigned int u_2shift15 = vec_sl(u_two, u_fifteen);
    const vector unsigned int u_signmask = vec_sl(u_2shift15, u_fifteen);
    const vector unsigned int u_nznz     = vec_mergeh(u_signmask, u_zero);
    const vector unsigned int u_znzn     = vec_mergeh(u_zero, u_signmask);
u_signmask, after its initialization, contains the vector { 0x80000000, 0x80000000, 0x80000000, 0x80000000 }. The use of vec_mergeh interleaves zero values into it to finalize the constant values.

Another part of the code can also be improved by using shift instructions instead of multiplications:

    const vector float m_c2u1_c3u2 = vec_madd(c2u1, c3u2, zero);
can be replaced by:
    const vector float m_c2u1_c3u2 = vec_sld(m_c2u2_c3u3, m_c2u2_c3u3, 12);
On the PPU, vec_madd and vec_sld are not executed on the same pipeline, so they might now be executed in parallel.
[cf: Appendix A.3.2 in the released pdf by IBM: Cell Broadband Engine Programming Handbook [ibm.com]].

The optimized version of the previous PPU code can be found here: inverse_v4.h

The final version of the inverse matrix for PPU where the whole code has been placed in a single function can be downloaded here: inverse_v5.h

SPU version
The SPU is a very powerful processor with a rich set of intrinsic instructions, and even if some altivec functions don't have a direct equivalent, they can be replaced using the intrinsic set.

Some altivec instructions have a direct equivalent as SPU intrinsic instructions:

  • vec_madd is replaced by either spu_madd or simply by spu_mul when the third parameter is zero.
  • vec_xor, vec_re, vec_sub respectively become spu_xor, spu_re, spu_sub
Some others require a workaround:
  • vec_sld
  • vec_mergeh
  • vec_mergel
For the PPU, the need to build the constant values was clearly present. On the SPU, on the contrary, the loading time is almost nothing; the SPU even has instructions to extract and insert values and to create constants on the fly without going through memory.
[Table B-1 in the Appendix B.1.2 in the released pdf by IBM: Cell Broadband Engine Programming Handbook [ibm.com]]
There is no need to worry too much about creating constant values. Constants can simply replace the PPU calculation for nznz and znzn. The instruction spu_shuffle, with associated constant patterns, will be used to replace vec_mergeh, vec_mergel and vec_sld. Five different shuffling patterns have to be created. To explain them, the four floats of the first vector will be designated by the letters X, Y, Z, and W. The four floats of the second vector will be A, B, C, D.

To replace the sld function, three patterns are required:

  • YZWX to replace: sld(v, v, 4)
  • ZWXY to replace: sld(v, v, 8)
  • WXYZ to replace: sld(v, v, 12)
To replace vec_mergeh and vec_mergel, only two other patterns are defined: XAYB and ZCWD.
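As an illustration (a sketch, not the article's exact constants), such patterns can be written as byte-selection vectors, where bytes 0x00-0x0F come from the first operand and 0x10-0x1F from the second:

    /* Rotate one float up: YZWX, used in place of vec_sld(v, v, 4). */
    const vector unsigned char pat_YZWX =
        { 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B,
          0x0C, 0x0D, 0x0E, 0x0F, 0x00, 0x01, 0x02, 0x03 };

    /* Interleave the high halves: XAYB, used in place of vec_mergeh. */
    const vector unsigned char pat_XAYB =
        { 0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13,
          0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17 };

    const vector float vu1 = spu_shuffle(v, v, pat_YZWX);   /* v rotated up by one float */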

The SPU version of the inverse matrix can be found here: inverse_v6.h

Summary
  • Avoid the algorithms which deal with special cases (like the Gauss-Jordan elimination).
  • Start with a simple scalar implementation.
  • Unroll the loops and group the code which can be executed in parallel and which follows the same patterns (like in the cofactor function).
  • Get used to the data manipulation instructions (vec_mergeh, vec_sld, spu_shuffle...).
  • Look at the generated assembly code.
  • Prefer to build the PPU data on the fly instead of loading it from memory.
About The Author
Cedric Lallain is a Senior Programmer working on PS3/Cell research at Highmoon Studios (Vivendi Games).
In past years, Cedric mostly worked on AI. He also optimized (high- and low-level optimization) some PS2 code to help his game reach a correct frame rate.
The last game he worked on is Darkwatch for Highmoon Studios.
Previously he was lead AI programmer on Street Racing Syndicate at Eutechnyx.