Reducing Floating Point Branches
Mike Acton
April 17, 2006
Background
There are three distinct instruction sets for carrying out floating point operations on the CBE. On the PPU, scalar floating point (32 bit and 64 bit) instructions are executed on the FPU (Floating Point Unit) and vector floating point (4 x 32 bit) instructions are executed on the VXU (Vector Execution Unit). The SPU can also execute vector floating point (4 x 32 bit and 2 x 64 bit) instructions.
Most floating point instructions on both the PPU and SPU have single cycle throughput, which allows an additional floating point instruction to be issued on the following cycle. However, many floating point instructions have very long latencies (the number of cycles before the result can be used), and the latencies vary among the instructions.
For example (source: Cell Broadband Engine Programming Handbook [ibm.com]):
|-----------|------------------------------------------------------|-------------|------------------|
| UNIT      | DESCRIPTION                                          | INSTRUCTION | LATENCY (CYCLES) |
|-----------|------------------------------------------------------|-------------|------------------|
| PPU (FPU) | scalar 32 bit floating point multiply-add            | fmadd       | 10               |
| PPU (FPU) | scalar 32 bit floating point IEEE divide             | fdivs       | 74               |
| PPU (VXU) | vector 4 x 32 bit floating point multiply-add        | vmaddfp     | 12               |
| PPU (VXU) | vector 4 x 32 bit floating point reciprocal estimate | vrefp       | 14               |
| SPU       | vector 4 x 32 bit floating point multiply-add        | fma         | 6                |
| SPU       | vector 4 x 32 bit floating point reciprocal estimate | frest       | 4                |
|-----------|------------------------------------------------------|-------------|------------------|
In order to take advantage of the single cycle throughput, it is critical to fill the latency cycles with useful work. The two most significant barriers to filling the latency cycles are memory access and branching. This article will concentrate on minimizing branches. For background on why branches represent such a significant optimization barrier, see: Better Performance Through Branch Elimination
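To make "filling the latency cycles" concrete, here is a minimal sketch (not from the original article) of the kind of restructuring that helps: a sum of products written with four independent accumulators, so a new multiply-add can be issued each cycle while the previous ones are still in flight instead of stalling on a single accumulator for the full latency of the instruction. (The reassociation of the additions is only acceptable because, as noted below, we are not concerned with strict IEEE accuracy.)
/* Assumes count is a multiple of four. The four accumulators are
   independent, so a new multiply-add can issue every cycle instead of
   waiting out the full latency of the previous result. */
float dot_product( const float* a, const float* b, int count )
{
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    float acc2 = 0.0f;
    float acc3 = 0.0f;

    for ( int i = 0; i < count; i += 4 )
    {
        acc0 += a[i+0] * b[i+0];
        acc1 += a[i+1] * b[i+1];
        acc2 += a[i+2] * b[i+2];
        acc3 += a[i+3] * b[i+3];
    }

    return ( acc0 + acc1 + acc2 + acc3 );
}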
Floating point values
Let's look at the classic example of floating point branch elimination: fabs
There is, in fact, an absolute value instruction on the PPU. This example is only for illustration purposes.
Signed floating point numbers are extremely simple: the top bit is the sign bit, and nothing else changes. For example, 1.0f is 0x3f800000 and -1.0f is 0xbf800000.
So typically instead of doing something like this:
float abs_f32( float arg )
{
    return ( ( arg < 0.0f ) ? -arg : arg );
}
A programmer will "optimize" out the branch by simply masking out the sign bit. Like this fairly common bit of code:
float abs_f32_1( float arg )
{
    uint32_t frep     = *(uint32_t*)(float*)&arg;
    uint32_t frep_uns = frep & 0x7fffffff;
    float    result   = *(float*)(uint32_t*)&frep_uns;
    return ( result );
}
So what's wrong with that? Read on to find out...
In the examples below -ffast-math has been added to the compiler flags. (This tutorial is not concerned with problems that need IEEE accuracy.)
What's wrong with the code above?
Although it may in fact work, it's illegal. The strict aliasing rules say that you can't cast (and dereference) pointers to unrelated types, so the cast from (float*) to (uint32_t*) is not allowed and will result in this warning in GCC:
warning: dereferencing type-punned pointer will break strict-aliasing rules
There's a good reason for this rule. The strict aliasing rules, together with the restrict keyword introduced in C99, allow the compiler to optimize load and store order. Disabling these optimizations will reduce performance significantly.
Work with the strict aliasing rules and the restrict keyword. Don't work around them or disable them. See: Demystifying The Restrict Keyword
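For example, here's a minimal sketch (not from the article) of the kind of loop these rules let the compiler optimize. Because dst and src are declared restrict, the compiler knows the store through dst cannot alias the loads through src, so it is free to reorder and batch the loads and stores:
/* dst and src are declared non-overlapping (restrict), so the compiler
   does not have to assume that the store to dst[i] changes src[i+1]. */
void scale_all( float* restrict dst, const float* restrict src, float scale, int count )
{
    for ( int i = 0; i < count; ++i )
    {
        dst[i] = src[i] * scale;
    }
}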
There are two solutions to the aliasing problem in the code above:
First, the C99 standard specifically states that (char*) types are related to every pointer type, so:
float abs_f32( float arg )
{
    uint32_t frep     = *(uint32_t*)(char*)(float*)&arg;
    uint32_t frep_uns = frep & 0x7fffffff;
    float    result   = *(float*)(uint8_t*)(char*)&frep_uns;
    return ( result );
}
Second, the C99 standard specifically allows (and recommends) "casting" between unrelated types through unions:
union FREP32
{
    float    f32;
    uint32_t u32;
};

float abs_f32( float arg )
{
    union FREP32 result = { .f32 = arg };
    result.u32 = result.u32 & 0x7fffffff;
    return ( result.f32 );
}
Note this line:
union FREP32 result = { .f32 = arg };
C99 allows the initialization of fields in a structure or union by name (a designated initializer). Note that C++98 does not allow that, so you'll have to write something like the following if you want to remain compatible with C++:
union FREP32 result;
result.f32 = arg;
However, both options are still going to generate terrible code for the PPU. Let's have a look:
abs_f32:
stwu 1,-32(1) # sp[-32] = sp
# sp = sp - 32
stfs 1,16(1) # sp[16] = arg
lwz 0,16(1) # frep = sp[16]
rlwinm 9,0,0,1,31 # frep_uns = frep & 0x7fffffff;
stw 9,16(1) # sp[16] = frep_uns
lfs 1,16(1) # result = sp[16]
addi 1,1,32 # sp = sp + 32
blr # RETURN (result)
The main problem is that there are two load-hit-store events. Store the original floating point value to the stack and load it into a fixed point register:
stfs 1,16(1) # sp[16] = arg
lwz 0,16(1) # frep = sp[16]
Store the fixed point result and load it into the floating point return register:
stw 9,16(1) # sp[16] = frep_uns
lfs 1,16(1) # result = sp[16]
In each case, the only thing the processor can do is sit around and wait for the stored value to become available in memory before it can be loaded into the other register set. This is extremely slow. Not a little slower. Cripplingly slow.
Floating point instructions don't work with immediate floating point values, so those constants must be loaded separately into FPU registers. There are two methods to accomplish this:
- The value can be built using integer instructions on the FXU, saved to memory then loaded into the FPU.
- The value can be pre-stored in memory (as part of the executable) and loaded directly into the FPU.
Both of these methods involve memory, so anytime you use a floating point constant you risk a cache miss, just for starters. The main disadvantages of the first method are the extra store and the potential for a load-hit-store hazard. The value must be completely committed to the cache before it can be loaded into the FPU, which will cause the FPU load instruction to stall until the value is ready.
Avoid editing floating point values with integer instructions. Don't cast floating point values to integers. Look at the floating point instruction set for alternatives.
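For the abs example above, the simplest alternative is to use the standard library call and let the compiler generate the hardware instruction. This is a sketch, assuming the compiler maps fabsf onto the PPU's floating point absolute value instruction (GCC typically does at normal optimization levels):
#include <math.h>

/* No bit manipulation, no trip through the integer registers. */
float abs_f32_fpu( float arg )
{
    return ( fabsf( arg ) );
}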
Another Example
Let's look at another simple function involving floating point comparisons and branches:
float test_1( float arg0, float arg1 )
{
    if ( ( arg0 >= 1.0 ) || ( arg1 <= 0.5 ) )
    {
        return (arg0);
    }
    return (arg1);
}
Note the constants:
- 1.0 as a 32 bit float is 0x3f800000
- 0.5 as a 32 bit float is 0x3f000000
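If you want to check those encodings for yourself, the union technique from above works. A small standalone sketch (illustrative, not part of the article's examples):
#include <stdio.h>
#include <inttypes.h>

union FREP32
{
    float    f32;
    uint32_t u32;
};

int main( void )
{
    union FREP32 one  = { .f32 = 1.0f };
    union FREP32 half = { .f32 = 0.5f };

    printf( "1.0f = 0x%08" PRIx32 "\n", one.u32  );   /* prints 0x3f800000 */
    printf( "0.5f = 0x%08" PRIx32 "\n", half.u32 );   /* prints 0x3f000000 */

    return ( 0 );
}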
.LC0:
.long 1065353216 # 0x3f800000
test_1:
lis 3,.LC0@ha # f_one_addr = LC0 << 16
fmr 13,1 # FTMP0 = arg0
la 9,.LC0@l(3) # f_one_addr = f_one_addr | LC0
stwu 1,-32(1) # sp[-32] = sp
# sp = sp - 32
lfs 0,0(9) # f_one = f_one_addr[0]
fcmpu 7,1,0 # CR[7] = fcmp( arg0, f_one )
cror 30,29,30 # CR[7] = CR[7].GT | CR[7].EQ
fmr 1,2 # FTMP1 = arg1
beq- 7,.L3 # if_unlikely ( CR[7] == 0 )
lis 0,0x3f00 # f_pt_five_bits = 0x3f00 << 16
stw 0,16(1) # sp[16] = f_pt_five_bits
lfs 3,16(1) # f_pt_five = sp[16]
fcmpu 0,2,3 # CR[0] = fcmp( arg1, f_pt_five )
cror 2,0,2 # CR[0] = CR[0].LT | CR[0].EQ
beq- 0,.L3 # if_unlikely ( CR[0] == 0 ) GOTO RETURN_ARG0
# RETURN_ARG1:
addi 1,1,32 # sp = sp + 32
blr # RETURN (result)
.L3: # RETURN_ARG0:
fmr 1,13 # result = FTMP0
addi 1,1,32 # sp = sp + 32
blr # RETURN (result)
Combining tests?
Consider combining floating point comparisons: when a floating point operation requires use of the CR (Condition Register), it requires exclusive use, so it will effectively block the other units from using compare operations at the same time.
float test_2( float arg0, float arg1 )
{
    const int arg0_test = ( arg0 >= 1.0 );
    const int arg1_test = ( arg1 <= 0.5 );
    const int comb_test = arg0_test || arg1_test;
    if ( comb_test )
    {
        return (arg0);
    }
    return (arg1);
}
If you look at the compiler output from this example (below), you'll see that the branches are reduced from two to one, as expected. However, there is a cost involved. The CR is moved into a fixed point register and manipulated directly, and the final branch is actually done on a fixed point result. This isn't too bad except for the use of the microcoded "or." instruction.
.LC1:
.long 1065353216
.LC2:
.long 1056964608
test_2:
lis 5,.LC1@ha
lis 3,.LC2@ha
la 4,.LC1@l(5)
la 11,.LC2@l(3)
stwu 1,-16(1)
lfs 13,0(11)
lfs 0,0(4)
fcmpu 7,1,0
fcmpu 6,2,13
crnot 30,28
crnot 26,25
mfcr 0
rlwinm 9,0,31,1
rlwinm 0,0,27,1
or. 11,9,0
bne- 0,.L9
fmr 1,2
.L9:
addi 1,1,16
blr
Aside: Is there a benefit to using the ? : syntax?
This question comes up pretty often.
The answer is: No.
Look at the following example:
float test( float arg0, float arg1 )
{
    const int arg0_test = ( arg0 >= 1.0 );
    const int arg1_test = ( arg1 <= 0.5 );
    const int comb_test = arg0_test || arg1_test;
    const float result = comb_test ? arg0 : arg1;
    return ( result );
}
The compiler output is the same as the previous version:
.LC3:
.long 1065353216
.LC4:
.long 1056964608
test:
lis 5,.LC3@ha
lis 3,.LC4@ha
la 4,.LC3@l(5)
la 11,.LC4@l(3)
stwu 1,-16(1)
lfs 13,0(11)
lfs 0,0(4)
fcmpu 7,1,0
fcmpu 6,2,13
crnot 30,28
crnot 26,25
mfcr 0
rlwinm 9,0,31,1
rlwinm 0,0,27,1
or. 11,9,0
bne- 0,.L13
fmr 1,2
.L13:
addi 1,1,16
blr
Other options?
There's an additional FPU instruction available on the PPU that will selectively move between a choice of two floating point registers based on the value of a third: fsel.
fsel result, test, arg0, arg1
Which is logically equivalent to:
result = ( test >= 0.0 ) ? arg0 : arg1;
Let's rewrite the function (or at least get pretty close; we'll talk about the differences below) to give the compiler the best chance of using fsel:
float test_4( float arg0, float arg1 )
{
    const float result0 = arg0;
    const float result1 = arg1;
    const float arg0_test = arg0 - 1.0;
    const float arg1_test = arg1 - 0.5;
    const float arg0_result = ( arg0_test >= 0.0f ) ? result1 : result0;
    const float arg1_result = ( arg1_test >= 0.0f ) ? result0 : arg0_result;
    return ( arg1_result );
}
But you'll notice in the compiler output here, there is no use of fsel.
.LC5:
.long 1065353216
.LC6:
.long 1056964608
.LC7:
.long 0
test:
lis 7,.LC5@ha
fmr 11,1
la 6,.LC5@l(7)
stwu 1,-32(1)
lis 5,.LC6@ha
lfs 5,0(6)
fsubs 12,1,5
la 4,.LC6@l(5)
stfs 2,16(1)
lis 3,.LC7@ha
lfs 3,0(4)
la 9,.LC7@l(3)
fsubs 2,2,3
lwz 0,16(1)
lfs 13,0(9)
mtctr 0
fcmpu 7,12,13
fcmpu 6,2,13
bge- 7,.L16
stfs 1,16(1)
lwz 8,16(1)
mtctr 8
.L16:
bge- 6,.L19
mfctr 10
stw 10,16(1)
lfs 11,16(1)
.L19:
fmr 1,11
addi 1,1,32
blr
So... we're going to have to call it ourselves. Note that we've also inverted the less-than-or-equal comparison into a greater-than-or-equal comparison, which means that a test value exactly equal to zero will give the incorrect result. You'll have to decide based on your circumstances whether this is acceptable. You can also add an epsilon value to the comparison if you know you want equal-to within a certain range.
static inline float fsel_gez( float test, float arg0, float arg1 )
{
    float result;
    __asm__ ( "fsel %0, %1, %2, %3": "=f"(result): "f"(test), "f"(arg0), "f"(arg1) );
    return (result);
}

float test( float arg0, float arg1 )
{
    const float result0 = arg0;
    const float result1 = arg1;
    const float arg0_test = arg0 - 1.0;
    const float arg1_test = arg1 - 0.5;
    const float arg0_result = fsel_gez( arg0_test, result1, result0 );
    const float arg1_result = fsel_gez( arg1_test, result0, arg0_result );
    return ( arg1_result );
}
The results from this are significantly better.
.LC5:
.long 1065353216
.LC6:
.long 1056964608
test:
lis 5,.LC5@ha
lis 3,.LC6@ha
la 4,.LC5@l(5)
la 9,.LC6@l(3)
stwu 1,-16(1)
lfs 5,0(4)
addi 1,1,16
lfs 13,0(9)
fsubs 4,1,5
fsubs 3,2,13
fsel 0, 4, 2, 1
fsel 2, 3, 1, 0
fmr 1,2
blr
No branches. The main problem remaining is the load of the constants.
Push floating point constants up the code hierarchy. You may be able to reuse constants and the best chance for that is in the calling function.
float test_6( float arg0, float arg0_comp, float arg1, float arg1_comp )
{
    const float result0 = arg0;
    const float result1 = arg1;
    const float arg0_test = arg0 - arg0_comp;
    const float arg1_test = arg1 - arg1_comp;
    const float arg0_result = fsel_gez( arg0_test, result1, result0 );
    const float arg1_result = fsel_lz( arg1_test, arg0_result, result0 );
    return ( arg1_result );
}
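The helper fsel_lz used here isn't defined earlier in the article. A minimal sketch of what it could look like, assuming it selects its first value argument when the test is less than zero; it is just fsel_gez with the two value operands swapped, which matches the compiler output below:
/* Assumed definition: returns arg0 when test < 0.0, otherwise arg1.
   Same fsel instruction as fsel_gez, with the value operands swapped. */
static inline float fsel_lz( float test, float arg0, float arg1 )
{
    float result;
    __asm__ ( "fsel %0, %1, %2, %3": "=f"(result): "f"(test), "f"(arg1), "f"(arg0) );
    return (result);
}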
Here's the compiler output for test_6. Wow! What a difference from the original. Fast and branch free.
test:
fsubs 5,1,2
fsubs 0,3,4
fsel 2, 5, 3, 1
fsel 3, 0, 1, 2
fmr 1,3
blr
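To make that last guideline concrete, here is an illustrative caller (not from the article; the names and loop are hypothetical) that hoists the 1.0f and 0.5f constants out so they are loaded once and reused for every call to test_6:
/* The constants are loaded into registers once here, instead of being
   reloaded from memory inside every comparison. */
void test_6_all( float* results, const float* args0, const float* args1, int count )
{
    const float one  = 1.0f;
    const float half = 0.5f;

    for ( int i = 0; i < count; ++i )
    {
        results[i] = test_6( args0[i], one, args1[i], half );
    }
}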
Summary
To recap the guidelines from this article:
- Work with strict aliasing and the restrict keyword. Don't work around them or disable them.
- Avoid editing floating point values with integer instructions, and don't cast floating point values to integers.
- Consider combining floating point comparisons to reduce the number of branches.
- Use fsel (by hand, if the compiler won't) to select between floating point values without branching.
- Push floating point constants up the code hierarchy so they can be reused by the calling function.
Further Reading
The IEEE 754 standard is well-established as both the de facto and official standard for floating point numbers. Although not all processors and software environments adhere strictly to the standard, it is highly regarded, and generally only the deviations from the standard are documented. These documents provide a solid background for understanding the floating point format:
- Numerical Computation Guide [sun.com]
- Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic [berkeley.edu]
- Why do we need a floating-point arithmetic standard? [berkeley.edu]
- Comparing floating point numbers [cygnus-software.com]
About the author
Mike Acton is a Senior Architect working on PS3/Cell research at Highmoon Studios (Vivendi-Universal Games) who occasionally refers to himself in the third person. Most recently, Mike was the Lead Special Effects Programmer for Darkwatch at Highmoon Studios (previously Sammy Studios) in Carlsbad, CA. Previously Mike has held Lead Programming, Playstation 2 Engine Development and Research roles with Yuke's, VBlank and BlueSky/Titus. He worked for Sony Interactive Studios in PSX development. Mike has made regular appearances as a speaker at SCEA development conferences and has spoken at GDC. Mike Acton is not a framework-happy C++ programmer. He actually likes C. And assembly. In his spare time he develops hardware on FPGAs in VHDL.
He prefers vi.
Also check out the non-Cell BE articles by Mike Acton.