Choosing to Avoid Branches: A Small Altivec Example

Balancing speed and instruction count
I came across an interesting bit of Altivec code on the Altivec mailing list today, and I thought it would make a good example of how speed is not simply a matter of instruction count on the PPU and SPU.

The function (modified slightly for this example):
vector unsigned short test( vector unsigned short a, vector unsigned short b )
{
    const vector unsigned short mask = (vector unsigned short)vec_cmplt( a, b );
    if (vec_all_ge(a, b))
    {
        return (a);
    }
    return (mask);
}
Ignoring the function stack push/pop, this compiles to only a few instructions. Something like:
vcmpgtuh. v0, v3, v2
mfcr      r0
rlwinm    r0, r0, 27, 1
vcmpgtuh  v0, v3, v2
cmpwi     cr7, r0, 0
bne+      cr7, L2
vor       v2, v0, v0
L2:
Ben Weiss wrote this snippet to demonstrate that GCC does not optimize out redundant calls when Altivec instructions are used both with and without the CR-modify bit set. It's worth noting that this code will, in the general case, be a performance loss: not only are there very likely performance issues in these instructions themselves, but it also affects the surrounding code by creating an optimization barrier.

Read on to find out more about the issues and examine the alternative...
Tracing the redundant instruction
A problem with the emitted assembly above is that the second "vcmpgtuh" is completely redundant: the "vcmpgtuh." already calculates the needed result. So why is it emitted?

The predicate (CR-modifying) Altivec instructions are mapped to their builtin descriptions in rs6000.c (gcc/config/rs6000):
static const struct builtin_description_predicates bdesc_altivec_preds[] =
{
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf,  "*vcmpbfp.",  "__builtin_altivec_vcmpbfp_p",  ALTIVEC_BUILTIN_VCMPBFP_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf,  "*vcmpeqfp.", "__builtin_altivec_vcmpeqfp_p", ALTIVEC_BUILTIN_VCMPEQFP_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf,  "*vcmpgefp.", "__builtin_altivec_vcmpgefp_p", ALTIVEC_BUILTIN_VCMPGEFP_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf,  "*vcmpgtfp.", "__builtin_altivec_vcmpgtfp_p", ALTIVEC_BUILTIN_VCMPGTFP_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4si,  "*vcmpequw.", "__builtin_altivec_vcmpequw_p", ALTIVEC_BUILTIN_VCMPEQUW_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4si,  "*vcmpgtsw.", "__builtin_altivec_vcmpgtsw_p", ALTIVEC_BUILTIN_VCMPGTSW_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4si,  "*vcmpgtuw.", "__builtin_altivec_vcmpgtuw_p", ALTIVEC_BUILTIN_VCMPGTUW_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v8hi,  "*vcmpgtuh.", "__builtin_altivec_vcmpgtuh_p", ALTIVEC_BUILTIN_VCMPGTUH_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v8hi,  "*vcmpgtsh.", "__builtin_altivec_vcmpgtsh_p", ALTIVEC_BUILTIN_VCMPGTSH_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v8hi,  "*vcmpequh.", "__builtin_altivec_vcmpequh_p", ALTIVEC_BUILTIN_VCMPEQUH_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v16qi, "*vcmpequb.", "__builtin_altivec_vcmpequb_p", ALTIVEC_BUILTIN_VCMPEQUB_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v16qi, "*vcmpgtsb.", "__builtin_altivec_vcmpgtsb_p", ALTIVEC_BUILTIN_VCMPGTSB_P },
  { MASK_ALTIVEC, CODE_FOR_altivec_predicate_v16qi, "*vcmpgtub.", "__builtin_altivec_vcmpgtub_p", ALTIVEC_BUILTIN_VCMPGTUB_P }
};
Then in altivec.h the predicate intrinsics, e.g. vec_all_ge(), directly call the builtins. There don't appear to be any RTL references to them at all, so the scheduler has no way of knowing the two instructions are related.

For additional validation, examine the GCC debug output (*.34.sched2):
;; ======================================================
;; -- basic block 0 from 61 to 17 -- after reload
;; ======================================================
;;  0--> 61 %0=vrsave                           :nothing
;;  0--> 63 %12=%0|0xffffffffc0000000           :fxu_sim_cell
;;  2--> 62 [%1-0x4]=%0                         :lsu_cell
;;  2--> 64 {vrsave=unspec/v[%12,vrsave] 30;clo :fxu_sim_cell
;;  4--> 14 {%6=unspec[%3,%2,`*vcmpgtuh.'] 175; :vus_cell
;;  6--> 25 %1=%2                               :vus_cell
;;  8--> 16 %6=cmp(%6==0x0,0x0)                 :bru_cr_cell
;;  8--> 10 %0=unspec[%3,%2] 58                 :vus_cell
;; 12--> 17 pc={(%6!=0x0)?L38:pc}               :bru_cell
To enable debugging output in GCC, add the -dall compilation flag. For more information on the GCC debugging options, see "Options for Debugging Your Program or GCC" in the GCC manual.

The *.34.sched2 file represents the final scheduling pass of the compiler, and it would seem from this line
;; 4--> 14 {%6=unspec[%3,%2,`*vcmpgtuh.'] 175;:vus_cell
that "vcmpgtuh." is pretty much a text replacement.

Based on this, don't expect any of the Altivec instructions with the CR modify bit set to fold with the plain version under any conditions.
In addition to the various other reasons for avoiding the CR modifying Altivec instructions, this is a good reason to:
Avoid CR-modifying Altivec instructions (vec_any_*, vec_all_*)
What are the other problems?
On closer examination, these few lines of code have quite a few potential performance issues.
vcmpgtuh. v0, v3, v2
The "vcmpgtuh." instruction modifies the Condition Register (CR). This can cause problems because the Vector Execution Unit (VXU) must contend with the Floating Point Unit (FPU) and the Fixed Point Execution Unit (FXU) for access to the 32-bit CR. It would be reasonable to expect that, under fairly common conditions, this contention causes a pipeline bubble.

Another disadvantage of modifying the CR is the added difficulty in instruction scheduling. The CR is a very scarce resource, and the options the compiler has for moving the code (to maximize throughput) are now more reliant on the availability of the four CR bits needed to store this result than on the actual instruction throughput or latency.

mfcr   r0
rlwinm r0, r0, 27, 1
In order to free the Altivec-modified field in the condition register, the value is moved to a fixed-point register (mfcr). The CR value must also be shifted and masked to create a predicate value. So the "vcmpgtuh." doesn't just tie up the condition register; it also takes resources from the fixed-point unit.
bne+ cr7, L2
The first problem with introducing this branch instruction is that it creates an optimization barrier for the compiler. Although sometimes possible, it is very difficult to lift code from inside a branch target and schedule it with code in another block, and GCC is especially weak at this particular optimization.

vor v2, v0, v0
The "vor" instruction simply takes the result in the default case and moves it into the return register. By itself, very little is lost, but in this context the instruction must wait the full latency of the compare instruction in the best case, and for a full pipeline flush in the worst case, before it can execute. And because the branch introduces an optimization barrier, there is very little opportunity to schedule other instructions during that period.

What's the alternative?
Simply put: remove the branch.

Here is one possible implementation:
const vector unsigned int   zero        = (vector unsigned int)vec_xor( a, a );
const vector unsigned short mask_lt     = (vector unsigned short)vec_cmplt( a, b );
const vector unsigned short mask_e0     = vec_sld( mask_lt, mask_lt, 8 );
const vector unsigned short mask_e1     = vec_sld( mask_lt, mask_lt, 12 );
const vector unsigned short mask_e2     = vec_and( mask_lt, mask_e0 );
const vector unsigned int   mask_e3     = (vector unsigned int)vec_and( mask_e2, mask_e1 );
const vector unsigned short mask_all_ge = (vector unsigned short)vec_cmpeq( mask_e3, zero );
const vector unsigned short result      = vec_sel( a, mask_lt, mask_all_ge );
This generated the following result with ppu-gcc 3.4.1 (-O3):
vspltish v0, 0
vcmpgtuh v3, v3, v2
vor      v13, v2, v2
vsldoi   v2, v3, v3, 8
vsldoi   v1, v3, v3, 12
vand     v2, v3, v2
vand     v2, v2, v1
vcmpequw v2, v2, v0
vsel     v2, v13, v3, v2
Examine the difference in complexity between the original and branch-free versions of this function:

Original version: [control-flow diagram]

Branch-Free version: [control-flow diagram]

Although the branch-free code emits a slightly longer sequence, the final complexity is lower: notice that there is only a single, simple path, which was, of course, the point.