A Practical GCC Trick To Use During Optimization

Splitting a basic block (by force)

Warning: This is a trick to use during optimization. It is not documented, nor is it guaranteed to work across multiple platforms or different revisions of the compiler. Many programmers will say that non-portable code like this should not be used in production, but the practical benefit is large enough that it has to be mentioned.

One of the first things the compiler's scheduler does is organize the code into blocks that contain no branches, out-of-line function calls, or other optimization barriers. These basic blocks are then scheduled relative to each other based on their dependencies. Toward the end of scheduling, individual instructions are scheduled within each basic block. Finally, instructions may be moved across block boundaries.

Inline assembly is most definitely an optimization barrier, and every block of inline assembly is treated as an independent basic block. This is why assembly should be inlined one instruction at a time: it gives the compiler the maximum number of options for scheduling.

Why do I mention this?

Well, it turns out that an empty inline assembly statement still has the side effect of splitting the basic block, without actually adding any instructions.
#define GCC_SPLIT_BLOCK __asm__ ("");
So in this case we want to split the block after the loads but before the calculations. The compiler will then schedule the two blocks separately and afterwards try to merge them, but there will be nothing to merge since each block already meets all the constraints. I'll also split off the compares-and-branches section at the end so it's easier to see what's happening (and so easier to optimize). There would be no problem with letting the compiler mix the instructions of the FPU calculations and the compares and branches, though. The new version:
bool overlaps(const Box& b) const
{
    //
    // LOADS
    //
    const float a_c0 = m_center[0];
    const float a_c1 = m_center[1];
    const float a_c2 = m_center[2];

    const float a_e0 = m_extent[0];
    const float a_e1 = m_extent[1];
    const float a_e2 = m_extent[2];

    const float b_c0 = b.m_center[0];
    const float b_c1 = b.m_center[1];
    const float b_c2 = b.m_center[2];

    const float b_e0 = b.m_extent[0];
    const float b_e1 = b.m_extent[1];
    const float b_e2 = b.m_extent[2];

    GCC_SPLIT_BLOCK

    //
    // CALCULATIONS
    //
    const float delta_c0 = a_c0 - b_c0;
    const float delta_c1 = a_c1 - b_c1;
    const float delta_c2 = a_c2 - b_c2;

    const float abs_delta_c0 = ::fabs( delta_c0 );
    const float abs_delta_c1 = ::fabs( delta_c1 );
    const float abs_delta_c2 = ::fabs( delta_c2 );

    const float sum_e0 = a_e0 + b_e0;
    const float sum_e1 = a_e1 + b_e1;
    const float sum_e2 = a_e2 + b_e2;

    GCC_SPLIT_BLOCK

    //
    // COMPARES AND BRANCHES
    //
    const bool in_0 = abs_delta_c0 <= sum_e0;
    const bool in_1 = abs_delta_c1 <= sum_e1;
    const bool in_2 = abs_delta_c2 <= sum_e2;

    const bool result = in_0 && in_1 && in_2;

    return (result);
}
The final output is exactly what we were expecting. It's obvious from the output that the code was scheduled separately on either side of the splits. The new output:
_Z12test_overlapRK3BoxS1_:
    //
    // PUSH STACK
    //
    stwu 1,-16(1)
    //
    // LOADS
    //
    lfs 4,20(3)
    lfs 3,20(4)
    lfs 1,0(3)
    lfs 13,4(3)
    lfs 12,8(3)
    lfs 11,12(3)
    lfs 10,16(3)
    lfs 9,0(4)
    lfs 8,4(4)
    lfs 7,8(4)
    lfs 6,12(4)
    lfs 5,16(4)
    //
    // CALCULATIONS
    //
    fsubs 0,1,9
    fsubs 2,13,8
    fsubs 1,12,7
    fadds 11,11,6
    fadds 10,10,5
    fadds 4,4,3
    fabs 0,0
    fabs 13,2
    fabs 12,1
    //
    // COMPARES AND BRANCHES
    //
    fcmpu 7,13,10
    li 3,0
    crnot 30,29
    fcmpu 1,0,11
    mfcr 0
    rlwinm 0,0,31,1
    fcmpu 6,12,4
    crnot 26,25
    cmpwi 7,0,0
    mfcr 0
    rlwinm 0,0,27,1
    bgt- 1,.L14
    cmpwi 6,0,0
    beq- 7,.L14
    beq- 6,.L14
    li 3,1
    //
    // POP STACK AND RETURN
    //
.L14:
    addi 1,1,16
    blr