Balancing speed and instruction count
I came across an interesting bit of Altivec code [simdtech.org] on the Altivec mailing list [simdtech.org] today and I thought it would make a good example of how speed is not simply a matter of instruction count on the PPU and SPU.

The function (modified slightly for this example):
vector unsigned short
test( vector unsigned short a, vector unsigned short b )
{
    const vector unsigned short mask = (vector unsigned short)vec_cmplt( a, b );
    if ( vec_all_ge( a, b ) )
    {
        return (a);
    }
    return (mask);
}
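As a plain-C point of reference (a sketch added here for illustration, not part of the original snippet): when every element of a is greater than or equal to the corresponding element of b, the function returns a unchanged; otherwise it returns the per-element less-than mask. A scalar model of those semantics:

#include <stdint.h>

/* Scalar reference model of test() above (illustrative only):
   if every lane satisfies a >= b, return a; otherwise return the
   0xFFFF/0x0000 mask of the lanes where a < b. */
static void test_scalar( const uint16_t a[8], const uint16_t b[8],
                         uint16_t out[8] )
{
    uint16_t mask[8];
    int      all_ge = 1;

    for ( int i = 0; i < 8; ++i )
    {
        mask[i] = ( a[i] < b[i] ) ? 0xFFFFu : 0x0000u;   /* vec_cmplt  */
        if ( mask[i] )
            all_ge = 0;                                   /* vec_all_ge */
    }

    for ( int i = 0; i < 8; ++i )
        out[i] = all_ge ? a[i] : mask[i];
}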
Ignoring the function stack push/pop, we are left with only a few instructions. Something like:
vcmpgtuh. v0, v3, v2    # v0 = (b > a) per element (the vec_cmplt mask); also sets CR6
mfcr      r0            # copy the condition register into r0
rlwinm    r0, r0, 27, 1 # isolate the vec_all_ge predicate bit from the CR6 field
vcmpgtuh  v0, v3, v2    # redundant: the same compare again, without the CR update
cmpwi     cr7, r0, 0    # test the extracted predicate
bne+      cr7, L2       # all a >= b: skip the copy and return a (already in v2)
vor       v2, v0, v0    # otherwise return the mask
L2:
Ben Weiss [shellandslate.com]
wrote this snippet to demonstrate that GCC does not optimize out
redundant calls when Altivec instructions are called both with and
without the CR-modify bit set. It's worth noting that this code (in the
general case) will be a performance loss. Not only will there very
likely be performance issues in these instructions themselves, but it
will also affect the surrounding code by creating an optimization
barrier.

Read on to find out more about the issues and examine the alternative...
Tracing the redundant instruction
A problem with the emitted assembly above is that the "vcmpgtuh" is
completely redundant. The "vcmpgtuh." already calculates the needed
result. So why is it emitted?

The predicate (CR-modifying) Altivec instructions are mapped to their builtin descriptions in rs6000.c (gcc/config/rs6000):
static const struct builtin_description_predicates bdesc_altivec_preds[] =
{
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf, "*vcmpbfp.", "__builtin_altivec_vcmpbfp_p", ALTIVEC_BUILTIN_VCMPBFP_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf, "*vcmpeqfp.", "__builtin_altivec_vcmpeqfp_p", ALTIVEC_BUILTIN_VCMPEQFP_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf, "*vcmpgefp.", "__builtin_altivec_vcmpgefp_p", ALTIVEC_BUILTIN_VCMPGEFP_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4sf, "*vcmpgtfp.", "__builtin_altivec_vcmpgtfp_p", ALTIVEC_BUILTIN_VCMPGTFP_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4si, "*vcmpequw.", "__builtin_altivec_vcmpequw_p", ALTIVEC_BUILTIN_VCMPEQUW_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4si, "*vcmpgtsw.", "__builtin_altivec_vcmpgtsw_p", ALTIVEC_BUILTIN_VCMPGTSW_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v4si, "*vcmpgtuw.", "__builtin_altivec_vcmpgtuw_p", ALTIVEC_BUILTIN_VCMPGTUW_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v8hi, "*vcmpgtuh.", "__builtin_altivec_vcmpgtuh_p", ALTIVEC_BUILTIN_VCMPGTUH_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v8hi, "*vcmpgtsh.", "__builtin_altivec_vcmpgtsh_p", ALTIVEC_BUILTIN_VCMPGTSH_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v8hi, "*vcmpequh.", "__builtin_altivec_vcmpequh_p", ALTIVEC_BUILTIN_VCMPEQUH_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v16qi, "*vcmpequb.", "__builtin_altivec_vcmpequb_p", ALTIVEC_BUILTIN_VCMPEQUB_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v16qi, "*vcmpgtsb.", "__builtin_altivec_vcmpgtsb_p", ALTIVEC_BUILTIN_VCMPGTSB_P },
{ MASK_ALTIVEC, CODE_FOR_altivec_predicate_v16qi, "*vcmpgtub.", "__builtin_altivec_vcmpgtub_p", ALTIVEC_BUILTIN_VCMPGTUB_P }
};
Then in altivec.h the predicate intrinsics, e.g. vec_all_ge(), directly call the builtins. There don't appear to be any RTL references to them at all, so the scheduler has no way of knowing that the two instructions are related.
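To make the shape of that mapping concrete, here is a rough, hypothetical paraphrase (a sketch added here, not the verbatim header text; the selector constant is a placeholder for the real __CR6_* value) of how a predicate intrinsic such as vec_all_ge() expands straight into one of the builtins listed above:

/* Hypothetical paraphrase of the altivec.h mapping (not the real header):
   the predicate intrinsic becomes a direct call to the *_p builtin,
   passing a small integer that selects which CR6 bit to test.
   Casts on the vector arguments are omitted for clarity. */
#define CR6_SELECTOR_ALL_GE  0   /* placeholder for the real __CR6_* constant */

#define my_vec_all_ge( a, b ) \
    __builtin_altivec_vcmpgtuh_p( CR6_SELECTOR_ALL_GE, (b), (a) )

Because the builtin is opaque at the RTL level, the separate vec_cmplt() in the source keeps its own vcmpgtuh and nothing ties the two compares together.

For additional validation, examine the GCC debug output (*.34.sched2):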
;; ======================================================
;; -- basic block 0 from 61 to 17 -- after reload
;; ======================================================
;; 0--> 61 %0=vrsave :nothing
;; 0--> 63 %12=%0|0xffffffffc0000000 :fxu_sim_cell
;; 2--> 62 [%1-0x4]=%0 :lsu_cell
;; 2--> 64 {vrsave=unspec/v[%12,vrsave] 30;clo:fxu_sim_cell
;; 4--> 14 {%6=unspec[%3,%2,`*vcmpgtuh.'] 175;:vus_cell
;; 6--> 25 %1=%2 :vus_cell
;; 8--> 16 %6=cmp(%6==0x0,0x0) :bru_cr_cell
;; 8--> 10 %0=unspec[%3,%2] 58 :vus_cell
;; 12--> 17 pc={(%6!=0x0)?L38:pc} :bru_cell
To enable debugging output in GCC, add the -dall compilation flag. For more information on the GCC debugging options see: Options for Debugging Your Program or GCC [gnu.org]
*.34.sched2 represents the final pass of the compiler and it would seem from this line
;; 4--> 14 {%6=unspec[%3,%2,`*vcmpgtuh.'] 175;:vus_cell
that "vcmpgtuh." is pretty much a text replacement.
Based on this, don't expect any of the Altivec instructions with the CR modify bit set to fold with the plain version under any conditions.
In addition to the various other reasons for avoiding the CR modifying Altivec instructions, this is a good reason to:
Avoid CR-modifying Altivec instructions (vec_any_*, vec_all_*)
What are the other problems?
On closer examination, these few lines of code have quite a few potential performance issues.
vcmpgtuh. v0, v3, v2
The "vcmpgtuh." instruction modifies the Condition Register (CR). This
could cause problems because the Vector Execution Unit (VXU) must
contend with the Floating Point Unit (FPU) and the Fixed Point
Execution Unit (FXU) for access to the 32-bit CR. It would be reasonable
to expect that, under fairly common conditions, this would cause a
pipeline bubble.

Another disadvantage to modifying the CR is the added difficulty in instruction scheduling. The CR is a very scarce resource, and the options the compiler has for moving the code (to maximize throughput) are now more reliant on the availability of the 4 bits of the CR needed to store this result than on the actual instruction throughput or latency.
mfcr r0
rlwinm r0, r0, 27, 1
In order to free the Altivec-modified field in the condition register,
the value is moved to a fixed-point register (mfcr). The CR value must
also be shifted in order to create a predicate value. In this case, the
"vcmpgtuh." doesn't just lock the condition register, but also takes
resources from the fixed-point unit.
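Conceptually (a plain-C sketch added here, assuming the standard PowerPC numbering in which CR6 occupies CR bits 24-27), the mfcr/rlwinm pair does nothing more than pull one bit of the CR6 field out into a general-purpose register:

#include <stdint.h>

/* Illustrative only: a model of what mfcr + rlwinm accomplish.
   'cr' is the full 32-bit condition register as copied by mfcr;
   'bit' selects one of the four CR6 bits (0..3), where bit 0 means
   "all elements compared true" and bit 2 means "all elements compared
   false" for the record-form vector compares. */
static int extract_cr6_bit( uint32_t cr, int bit )
{
    /* CR bits are numbered 0..31 from the most-significant end,
       so CR6 occupies bits 24..27 of the word returned by mfcr. */
    return (int)( ( cr >> ( 31 - ( 24 + bit ) ) ) & 1u );
}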
bne+ cr7, L2
The first problem with introducing this branch instruction is that it
creates an optimization barrier for the compiler. Although sometimes
possible, it is very difficult to lift code from inside a branch target
and schedule it with code in another block. And GCC is especially weak
at this particular optimization.
vor v2, v0, v0
The "vor" instruction simply takes the result in the default case and
moves it to the return register. By itself, there is very little lost,
but in this context the instruction must wait the full latency of the
compare instruction in the best case, and for a full pipeline flush in
the worst case, in order to execute. And because the branch introduces
an optimization barrier, there is very little opportunity to schedule
instructions during that period.

What's the alternative?
Simply put - remove the branch.

Here is one possible implementation:
const vector unsigned int   zero        = (vector unsigned int)vec_xor( a, a );
const vector unsigned short mask_lt     = (vector unsigned short)vec_cmplt( a, b );
const vector unsigned short mask_e0     = vec_sld( mask_lt, mask_lt, 8 );
const vector unsigned short mask_e1     = vec_sld( mask_lt, mask_lt, 12 );
const vector unsigned short mask_e2     = vec_and( mask_lt, mask_e0 );
const vector unsigned int   mask_e3     = (vector unsigned int)vec_and( mask_e2, mask_e1 );
const vector unsigned short mask_all_ge = (vector unsigned short)vec_cmpeq( mask_e3, zero );
const vector unsigned short result      = vec_sel( a, mask_lt, mask_all_ge );
Which generated the following results for ppu-gcc (3.4.1) (-O3):
vspltish v0, 0            # zero
vcmpgtuh v3, v3, v2       # mask_lt = (b > a) per element
vor      v13, v2, v2      # copy a
vsldoi   v2, v3, v3, 8    # mask_e0
vsldoi   v1, v3, v3, 12   # mask_e1
vand     v2, v3, v2       # mask_e2 = mask_lt & mask_e0
vand     v2, v2, v1       # mask_e3 = mask_e2 & mask_e1
vcmpequw v2, v2, v0       # mask_all_ge = (mask_e3 == 0) per word
vsel     v2, v13, v3, v2  # result = vec_sel( a, mask_lt, mask_all_ge )
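For readers without PowerPC hardware handy, here is a lane-by-lane scalar model of that sequence (a sketch added here, assuming big-endian lane order; it is meant for tracing the data flow, not for performance):

#include <stdint.h>

/* Scalar model of the branch-free sequence above (illustrative only). */
static void branch_free_model( const uint16_t a[8], const uint16_t b[8],
                               uint16_t out[8] )
{
    uint16_t mask_lt[8], mask_e0[8], mask_e1[8], mask_e2[8], mask_e3[8];
    uint16_t mask_all_ge[8];

    for ( int i = 0; i < 8; ++i )
        mask_lt[i] = ( a[i] < b[i] ) ? 0xFFFFu : 0x0000u;     /* vec_cmplt        */

    for ( int i = 0; i < 8; ++i )
    {
        mask_e0[i] = mask_lt[( i + 4 ) % 8];                  /* vec_sld ..., 8   */
        mask_e1[i] = mask_lt[( i + 6 ) % 8];                  /* vec_sld ..., 12  */
        mask_e2[i] = mask_lt[i] & mask_e0[i];                 /* vec_and          */
        mask_e3[i] = mask_e2[i] & mask_e1[i];                 /* vec_and          */
    }

    for ( int w = 0; w < 4; ++w )                             /* vec_cmpeq, per 32-bit word */
    {
        const int word_is_zero = ( mask_e3[2*w] == 0 ) && ( mask_e3[2*w + 1] == 0 );
        mask_all_ge[2*w] = mask_all_ge[2*w + 1] = word_is_zero ? 0xFFFFu : 0x0000u;
    }

    for ( int i = 0; i < 8; ++i )                             /* vec_sel          */
        out[i] = (uint16_t)( ( a[i] & ~mask_all_ge[i] ) | ( mask_lt[i] & mask_all_ge[i] ) );
}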
Examine the difference in complexity between the original and branch-free version of this function:
Original version: [diagram]
Branch-Free version: [diagram]
Result
Although the branch-free code emits a slightly longer sequence, the final complexity is lower (notice that there is only a single, simple path - which was, of course, the point).