Reducing Floating Point Branches
Mike Acton
April 17, 2006
Background
There are three distinct instruction sets for carrying out floating point operations on the CBE. On the PPU, scalar floating point (32 bit and 64 bit) instructions are executed on the FPU (Floating Point Unit) and vector floating point (4 x 32 bit) instructions are executed on the VXU (Vector Execution Unit). The SPU can also execute vector floating point (4 x 32 bit and 2 x 64 bit) instructions.
Most floating point instructions on both the PPU and SPU have single cycle throughput, which allows an additional floating point instruction to be issued on the following cycle. However, many floating point instructions have very long latencies (the number of cycles before the result can be used), and the latencies vary among the instructions.
For example (source: Cell Broadband Engine Programming Handbook [ibm.com]):
|-----------|------------------------------------------------------|-------------|------------------|
| UNIT      | DESCRIPTION                                          | INSTRUCTION | LATENCY (CYCLES) |
|-----------|------------------------------------------------------|-------------|------------------|
| PPU (FPU) | scalar 32 bit floating point multiply-add            | fmadd       | 10               |
| PPU (FPU) | scalar 32 bit floating point IEEE divide             | fdivs       | 74               |
| PPU (VXU) | vector 4 x 32 bit floating point multiply-add        | vmaddfp     | 12               |
| PPU (VXU) | vector 4 x 32 bit floating point reciprocal estimate | vrefp       | 14               |
| SPU       | vector 4 x 32 bit floating point multiply-add        | fma         | 6                |
| SPU       | vector 4 x 32 bit floating point reciprocal estimate | frest       | 4                |
|-----------|------------------------------------------------------|-------------|------------------|
In order to take advantage of the single cycle throughput, it is critical to fill the latency cycles with useful work. The two most significant barriers to filling the latency cycles are memory access and branching. This article will concentrate on minimizing branches. For background on why branches represent such a significant optimization barrier, see: Better Performance Through Branch Elimination
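To make "filling the latency cycles" concrete, here is a minimal sketch (not from the original article) of the kind of restructuring that helps: a sum of products written with four independent accumulators, so a new multiply-add can be issued each cycle while the previous ones are still in flight instead of stalling on a single accumulator for the full latency of the instruction. (The reassociation of the additions is only acceptable because, as noted below, we are not concerned with strict IEEE accuracy.)
/* Assumes count is a multiple of four. The four accumulators are
   independent, so a new multiply-add can issue every cycle instead of
   waiting out the full latency of the previous result. */
float dot_product( const float* a, const float* b, int count )
{
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    float acc2 = 0.0f;
    float acc3 = 0.0f;

    for ( int i = 0; i < count; i += 4 )
    {
        acc0 += a[i+0] * b[i+0];
        acc1 += a[i+1] * b[i+1];
        acc2 += a[i+2] * b[i+2];
        acc3 += a[i+3] * b[i+3];
    }

    return ( acc0 + acc1 + acc2 + acc3 );
}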
Floating point values
Let's look at the classic example of floating point branch elimination: fabs
There is, in fact, an absolute value instruction on the PPU. This example is only for illustration purposes.
Signed floating point numbers are extremely simple: the top bit is the sign bit, and nothing else changes. For example, 1.0f is 0x3f800000 and -1.0f is 0xbf800000.
So typically instead of doing something like this:
float abs_f32( float arg )
{
    return ( ( arg < 0.0f ) ? -arg : arg );
}
A programmer will "optimize" out the branch by simply masking out the sign bit. Like this fairly common bit of code:
float abs_f32_1( float arg )
{
    uint32_t frep     = *(uint32_t*)(float*)&arg;
    uint32_t frep_uns = frep & 0x7fffffff;
    float    result   = *(float*)(uint32_t*)&frep_uns;
    return ( result );
}
So what's wrong with that? Read on to find out...
In the examples below -ffast-math has been added to the compiler flags. (This tutorial is not concerned with problems that need IEEE accuracy.)
What's wrong with the code above?
Although it may in fact work, it's illegal. The strict aliasing rules say that you can't cast (and dereference) pointers to unrelated types, so the cast from (float*) to (uint32_t*) is not allowed and will result in this warning in GCC:
warning: dereferencing type-punned pointer will break strict-aliasing rules
There's a good reason for this rule. The strict aliasing rules, together with the restrict keyword introduced in C99, allow the compiler to optimize load and store order. Disabling these optimizations will reduce performance significantly.
Work with the strict aliasing rules and the restrict keyword. Don't work around them or disable them. See: Demystifying The Restrict Keyword
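For example, here's a minimal sketch (not from the article) of the kind of loop these rules let the compiler optimize. Because dst and src are declared restrict, the compiler knows the store through dst cannot alias the loads through src, so it is free to reorder and batch the loads and stores:
/* dst and src are declared non-overlapping (restrict), so the compiler
   does not have to assume that the store to dst[i] changes src[i+1]. */
void scale_all( float* restrict dst, const float* restrict src, float scale, int count )
{
    for ( int i = 0; i < count; ++i )
    {
        dst[i] = src[i] * scale;
    }
}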
There are two solutions to the aliasing problem in the code above:
First, the C99 standard specifically states that (char*) types are related to every pointer type, so:
float abs_f32( float arg )
{
    uint32_t frep     = *(uint32_t*)(char*)(float*)&arg;
    uint32_t frep_uns = frep & 0x7fffffff;
    float    result   = *(float*)(uint8_t*)(char*)&frep_uns;
    return ( result );
}
Second, the C99 standard specifically allows (and recommends) "casting" between unrelated types through unions:
union FREP32
{
    float    f32;
    uint32_t u32;
};

float abs_f32( float arg )
{
    union FREP32 result = { .f32 = arg };
    result.u32 = result.u32 & 0x7fffffff;
    return ( result.f32 );
}
Note this line:
union FREP32 result = { .f32 = arg };
C99 allows the initialization of fields in a structure or union by name (a designated initializer). Note that C++98 does not allow that, so you'll have to write something like the following if you want to remain compatible with C++:
union FREP32 result;
result.f32 = arg;
However, both options are still going to generate terrible code for the PPU. Let's have a look:
abs_f32:
stwu 1,-32(1) # sp[-32] = sp
# sp = sp - 32
stfs 1,16(1) # sp[16] = arg
lwz 0,16(1) # frep = sp[16]
rlwinm 9,0,0,1,31 # frep_uns = frep & 0x7fffffff;
stw 9,16(1) # sp[16] = frep_uns
lfs 1,16(1) # result = sp[16]
addi 1,1,32 # sp = sp + 32
blr # RETURN (result)
The main problem is that there are two load-hit-store events. Store the original floating point value to the stack and load it into a fixed point register:
stfs 1,16(1) # sp[16] = arg
lwz 0,16(1) # frep = sp[16]
Store the fixed point result and load it into the floating point return register:
stw 9,16(1) # sp[16] = frep_uns
lfs 1,16(1) # result = sp[16]
In each case, the only thing the processor can do is sit around and wait for the stored value to become available in memory before it can be loaded into the other register set. This is extremely slow. Not a little slower. Cripplingly slow.
Floating point instructions don't work with immediate floating point values, so those constants must be loaded separately into FPU registers. There are two methods to accomplish this:
- The value can be built using integer instructions on the FXU, saved to memory then loaded into the FPU.
- The value can be pre-stored in memory (as part of the executable) and loaded directly into the FPU.
Both of these methods involve memory, so anytime you use a floating point constant you risk a cache miss, just for starters. The main disadvantages of the first method are the extra store and the potential for a load-hit-store hazard. The value must be completely committed to the cache before it can be loaded into the FPU, which will cause the FPU load instruction to stall until the value is ready.
Avoid editing floating point values with integer instructions. Don't cast floating point values to integers. Look at the floating point instruction set for alternatives.
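For the abs example above, the simplest alternative is to use the standard library call and let the compiler generate the hardware instruction. This is a sketch, assuming the compiler maps fabsf onto the PPU's floating point absolute value instruction (GCC typically does at normal optimization levels):
#include <math.h>

/* No bit manipulation, no trip through the integer registers. */
float abs_f32_fpu( float arg )
{
    return ( fabsf( arg ) );
}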
Another Example
Let's look at another simple function involving floating point comparisons and branches:
float test_1( float arg0, float arg1 )
{
    if ( ( arg0 >= 1.0 ) || ( arg1 <= 0.5 ) )
    {
        return (arg0);
    }
    return (arg1);
}
Note the constants:
- 1.0 as a 32 bit float is 0x3f800000
- 0.5 as a 32 bit float is 0x3f000000
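If you want to check those encodings for yourself, the union technique from above works. A small standalone sketch (illustrative, not part of the article's examples):
#include <stdio.h>
#include <inttypes.h>

union FREP32
{
    float    f32;
    uint32_t u32;
};

int main( void )
{
    union FREP32 one  = { .f32 = 1.0f };
    union FREP32 half = { .f32 = 0.5f };

    printf( "1.0f = 0x%08" PRIx32 "\n", one.u32  );   /* prints 0x3f800000 */
    printf( "0.5f = 0x%08" PRIx32 "\n", half.u32 );   /* prints 0x3f000000 */

    return ( 0 );
}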
.LC0:
.long 1065353216 # 0x3f800000
test_1:
lis 3,.LC0@ha # f_one_addr = LC0 << 16
fmr 13,1 # FTMP0 = arg0
la 9,.LC0@l(3) # f_one_addr = f_one_addr | LC0
stwu 1,-32(1) # sp[-32] = sp
# sp = sp - 32
lfs 0,0(9) # f_one = f_one_addr[0]
fcmpu 7,1,0 # CR[7] = fcmp( arg0, f_one )
cror 30,29,30 # CR[7] = CR[7].GT | CR[7].EQ
fmr 1,2 # FTMP1 = arg1
beq- 7,.L3 # if_unlikely ( CR[7] == 0 )
lis 0,0x3f00 # f_pt_five_bits = 0x3f00 << 16
stw 0,16(1) # sp[16] = f_pt_five_bits
lfs 3,16(1) # f_pt_five = sp[16]
fcmpu 0,2,3 # CR[0] = fcmp( arg1, f_pt_five )
cror 2,0,2 # CR[0] = CR[0].LT | CR[0].EQ
beq- 0,.L3 # if_unlikely ( CR[0] == 0 ) GOTO RETURN_ARG0
# RETURN_ARG1:
addi 1,1,32 # sp = sp + 32
blr # RETURN (result)
.L3: # RETURN_ARG0:
fmr 1,13 # result = FTMP0
addi 1,1,32 # sp = sp + 32
blr # RETURN (result)
Combining tests?
Consider combining floating point comparisons: when a floating point operation requires use of the CR (Condition Register), it requires exclusive use, so it will effectively block the other units from using compare operations at the same time.
float test_2( float arg0, float arg1 )
{
    const int arg0_test = ( arg0 >= 1.0 );
    const int arg1_test = ( arg1 <= 0.5 );
    const int comb_test = arg0_test || arg1_test;
    if ( comb_test )
    {
        return (arg0);
    }
    return (arg1);
}
If you look at the compiler output from this example (below), you'll see that the branches are reduced from two to one, as expected. However, there is a cost involved. The CR is moved into a fixed point register and manipulated directly, and the final branch is actually done on a fixed point result. This isn't too bad except for the use of the microcoded "or." instruction.
.LC1:
.long 1065353216
.LC2:
.long 1056964608
test_2:
lis 5,.LC1@ha
lis 3,.LC2@ha
la 4,.LC1@l(5)
la 11,.LC2@l(3)
stwu 1,-16(1)
lfs 13,0(11)
lfs 0,0(4)
fcmpu 7,1,0
fcmpu 6,2,13
crnot 30,28
crnot 26,25
mfcr 0
rlwinm 9,0,31,1
rlwinm 0,0,27,1
or. 11,9,0
bne- 0,.L9
fmr 1,2
.L9:
addi 1,1,16
blr
Aside: Is there a benefit to using the ? : syntax?
This question comes up pretty often.
The answer is: No.
Look at the following example:
float test( float arg0, float arg1 )
{
    const int arg0_test = ( arg0 >= 1.0 );
    const int arg1_test = ( arg1 <= 0.5 );
    const int comb_test = arg0_test || arg1_test;
    const float result = comb_test ? arg0 : arg1;
    return ( result );
}
The compiler output is the same as the previous version:
.LC3:
.long 1065353216
.LC4:
.long 1056964608
test:
lis 5,.LC3@ha
lis 3,.LC4@ha
la 4,.LC3@l(5)
la 11,.LC4@l(3)
stwu 1,-16(1)
lfs 13,0(11)
lfs 0,0(4)
fcmpu 7,1,0
fcmpu 6,2,13
crnot 30,28
crnot 26,25
mfcr 0
rlwinm 9,0,31,1
rlwinm 0,0,27,1
or. 11,9,0
bne- 0,.L13
fmr 1,2
.L13:
addi 1,1,16
blr
Other options?
There's an additional FPU instruction available on the PPU that will selectively move between a choice of two floating point registers based on the value of a third: fsel.
fsel result, test, arg0, arg1
Which is logically equivalent to:
result = ( test >= 0.0 ) ? arg0 : arg1;
Let's rewrite the function (or at least get pretty close; we'll talk about the differences below) to give the compiler the best chance of using fsel:
float test_4( float arg0, float arg1 )
{
    const float result0 = arg0;
    const float result1 = arg1;
    const float arg0_test = arg0 - 1.0;
    const float arg1_test = arg1 - 0.5;
    const float arg0_result = ( arg0_test >= 0.0f ) ? result1 : result0;
    const float arg1_result = ( arg1_test >= 0.0f ) ? result0 : arg0_result;
    return ( arg1_result );
}
But you'll notice in the compiler output here, there is no use of fsel.
.LC5:
.long 1065353216
.LC6:
.long 1056964608
.LC7:
.long 0
test:
lis 7,.LC5@ha
fmr 11,1
la 6,.LC5@l(7)
stwu 1,-32(1)
lis 5,.LC6@ha
lfs 5,0(6)
fsubs 12,1,5
la 4,.LC6@l(5)
stfs 2,16(1)
lis 3,.LC7@ha
lfs 3,0(4)
la 9,.LC7@l(3)
fsubs 2,2,3
lwz 0,16(1)
lfs 13,0(9)
mtctr 0
fcmpu 7,12,13
fcmpu 6,2,13
bge- 7,.L16
stfs 1,16(1)
lwz 8,16(1)
mtctr 8
.L16:
bge- 6,.L19
mfctr 10
stw 10,16(1)
lfs 11,16(1)
.L19:
fmr 1,11
addi 1,1,32
blr
So... we're going to have to call it ourselves. Note that we've also inverted the less-than-or-equal comparison into a greater-than-or-equal comparison, which means that a test value exactly equal to zero will give the incorrect result. You'll have to decide based on your circumstances whether this is acceptable. You can also add an epsilon value to the comparison if you know you want equal-to within a certain range.
static inline float fsel_gez( float test, float arg0, float arg1 )
{
    float result;
    __asm__ ( "fsel %0, %1, %2, %3": "=f"(result): "f"(test), "f"(arg0), "f"(arg1) );
    return (result);
}

float test( float arg0, float arg1 )
{
    const float result0 = arg0;
    const float result1 = arg1;
    const float arg0_test = arg0 - 1.0;
    const float arg1_test = arg1 - 0.5;
    const float arg0_result = fsel_gez( arg0_test, result1, result0 );
    const float arg1_result = fsel_gez( arg1_test, result0, arg0_result );
    return ( arg1_result );
}
The results from this are significantly better.
.LC5:
.long 1065353216
.LC6:
.long 1056964608
test:
lis 5,.LC5@ha
lis 3,.LC6@ha
la 4,.LC5@l(5)
la 9,.LC6@l(3)
stwu 1,-16(1)
lfs 5,0(4)
addi 1,1,16
lfs 13,0(9)
fsubs 4,1,5
fsubs 3,2,13
fsel 0, 4, 2, 1
fsel 2, 3, 1, 0
fmr 1,2
blr
No branches. The main problem remaining is the load of the constants.
Push floating point constants up the code hierarchy. You may be able to reuse constants and the best chance for that is in the calling function.
float test_6( float arg0, float arg0_comp, float arg1, float arg1_comp )
{
    const float result0 = arg0;
    const float result1 = arg1;
    const float arg0_test = arg0 - arg0_comp;
    const float arg1_test = arg1 - arg1_comp;
    const float arg0_result = fsel_gez( arg0_test, result1, result0 );
    const float arg1_result = fsel_lz( arg1_test, arg0_result, result0 );
    return ( arg1_result );
}
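The helper fsel_lz used here isn't defined earlier in the article. A minimal sketch of what it could look like, assuming it selects its first value argument when the test is less than zero; it is just fsel_gez with the two value operands swapped, which matches the compiler output below:
/* Assumed definition: returns arg0 when test < 0.0, otherwise arg1.
   Same fsel instruction as fsel_gez, with the value operands swapped. */
static inline float fsel_lz( float test, float arg0, float arg1 )
{
    float result;
    __asm__ ( "fsel %0, %1, %2, %3": "=f"(result): "f"(test), "f"(arg1), "f"(arg0) );
    return (result);
}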
Here's the compiler output for test_6. Wow! What a difference from the original. Fast and branch free.
test:
fsubs 5,1,2
fsubs 0,3,4
fsel 2, 5, 3, 1
fsel 3, 0, 1, 2
fmr 1,3
blr
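To make that last guideline concrete, here is an illustrative caller (not from the article; the names and loop are hypothetical) that hoists the 1.0f and 0.5f constants out so they are loaded once and reused for every call to test_6:
/* The constants are loaded into registers once here, instead of being
   reloaded from memory inside every comparison. */
void test_6_all( float* results, const float* args0, const float* args1, int count )
{
    const float one  = 1.0f;
    const float half = 0.5f;

    for ( int i = 0; i < count; ++i )
    {
        results[i] = test_6( args0[i], one, args1[i], half );
    }
}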
Summary
To recap the guidelines from this article:
- Work with strict aliasing and the restrict keyword. Don't work around them or disable them.
- Avoid editing floating point values with integer instructions, and don't cast floating point values to integers.
- Consider combining floating point comparisons to reduce the number of branches.
- Use fsel (by hand, if the compiler won't) to select between floating point values without branching.
- Push floating point constants up the code hierarchy so they can be reused by the calling function.
Further Reading
The IEEE 754 standard is well-established as both the de facto and official standard for floating point numbers. Although not all processors and software environments adhere strictly to the standard, it is highly regarded, and generally only the deviations from the standard are documented. These documents provide a solid background for understanding the floating point format:
- Numerical Computation Guide [sun.com]
- Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic [berkeley.edu]
- Why do we need a floating-point arithmetic standard? [berkeley.edu]
- Comparing floating point numbers [cygnus-software.com]
About the author
Mike Acton is a Senior Architect working on PS3/Cell research at Highmoon Studios (Vivendi-Universal Games) who occasionally refers to himself in the third person. Most recently, Mike was the Lead Special Effects Programmer for Darkwatch at Highmoon Studios (previously Sammy Studios) in Carlsbad, CA. Previously Mike has held Lead Programming, Playstation 2 Engine Development and Research roles with Yuke's, VBlank and BlueSky/Titus. He worked for Sony Interactive Studios in PSX development. Mike has made regular appearances as a speaker at SCEA development conferences and has spoken at GDC. Mike Acton is not a framework-happy C++ programmer. He actually likes C. And assembly. In his spare time he develops hardware on FPGAs in VHDL.
He prefers vi.
Also check out the non-Cell BE articles by Mike Acton.