Background on Branching

Branching, both conditional and unconditional, slows most implementations. Even an unconditional branch or a correctly predicted taken branch may cause a delay if the target instruction is not in the fetch buffer or the cache. It is therefore best to use branch instructions carefully and to devise algorithms that reduce branching. Many operations that normally use branches may be performed either with fewer or no branches.
-- From IBM's The PowerPC Compiler Writer's Guide 3.1.5

Branches represent a significant part of both performance critical and general purpose code - as a general rule of thumb, 20% of the instructions in typical code are branches. In inner loops and other code sections which demand the highest performance may benefit from a multifold increase in performance by eliminating, or reducing, branches.

This series of articles will present the types of delays that branches may cause in program execution and some programming patterns that help avoid those delays.
Branch Paths
Branch instructions divide the execution path in two paths, the target path and the fall-through path. The address of the first instruction on each path is referred to as target address and fall-through address, respectively. The target address is where the instruction points to, much like the label in a C goto statement. The fall-through address represents the instruction just after the branch.

...
if ( a != b )
{
fall_through_address:
...
}
target_address:
...
...
beq cr0, target_address
fall_through_address:
...
target_address:
...

There are two potential performance issues related to branch instructions, which are based on the decisions the hardware must make in order to execute the correct instruction stream:
  • Deciding which target address will be taken. For indirect branches where the target address is not known until runtime, either the architecture will wait for the address to become known, then start loading the instruction stream (incurring a stall), or a form of target address prediction will be used, which will begin loading queuing the stream at the predicted address and revert any executed code if the prediction turns out to be incorrect.
  • Deciding which branch will be taken. For conditional branches, there are both static and dynamic methods of predicting whether excecution will continue through the target address or the fall-through address. This decision isn't as simple to make as it may appear at a first glance. Modern processors are able to execute more than one instruction concurrently and what the processor chooses to do will have a great impact on overall execution performance.

Branch Types
Branch instructions can be either unconditional or conditional.

Unconditional branch, or jump, instructions, as the name would imply, are always taken and cause the instruction queue to begin loading instructions either at the address which is embedded in the instruction itself (called a direct branch), or at the address previously stored in a register by the programmer or compiler (called an indirect, or computed, branch).
goto my_label; <-- Unconditional, direct branch via goto (*my_callback)(); <-- Unconditional, indirect branch via function pointer
On both the PPU and SPU, an unconditional branch may also jump to the link register, which is not a special type of register, but is one register within the register file which by convention (i.e. the ABI) has been decided to be the register in which next instruction of the parent on the callstack is stored. The link register is normally maintained by the compiler, but may be used directly by the programmer in assembly.
return; <-- Unconditional, indirect branch via link register

Conditional branch instructions are only taken if a particular condition is met, typically the result of some comparison (e.g. equal-to-zero, less-than, etc.). Conditional branches generally correspond to if, for, while and do statements in the C language.

On the PPU, the state of the condition is held in the Condition Register (CR) which is then used to determine the which of the target or fall-through streams to begin queuing. The CR is modified by either special CR-modifying versions of instructions which compare the result of the expression to zero and store the camparison result, or more directly by special comparison instructions and CR logic instructions.

On the SPU, there is no special condition register. Comparison instructions store a value (zero or non-zero) into a target register specified by the programmer or compiler. This register is then used by various branch-on-zero instructions to queue the appropriate instruction stream.
Like unconditional branches, conditional branches may be either direct or indirect. However, the vast majority of conditional branches will correspond to local conditional statements or loops and in which the target address is known at compile time (i.e. direct).

On the PPU, there is additionally a special register caled the Count Register (CTR) which is designed specifically for simplifying the prediction of local loops. The state of the CTR is held in the Condition Register (CR) and conditional branches can test bits in the CR, for CTR being equal to or different from zero to decide whether or not to jump. A PPU compiler will often map for
loops with static iteration counts to use the CTR.

What follows is a table of the conditions which may be used to affect the branch condition result on the PPU depending on where the target address is stored.
Conditional branch type Available conditions
Predefined target address CR bits
CTR=0
CTR≠0
CTR=0 and CR bits
CTR≠0 and CR bits
Always(*1)
Target address in LR CR bits
CTR=0
CTR≠0
CTR=0 and CR bits
CTR≠0 and CR bits
Always(*1)
Target address in CTR CR bits
Always (*1)
*1 In those cases execution will always continue in the target address.


In summary, unconditional, direct branches are generated by the compiler for the goto, break and continue C statements. Indirect branches are compiled for calls to function pointers and C++ virtual functions, for subroutine returns, and for switch statements that use lookup tables or address tables. Conditional branches are compiled for if, while, do-while, for statements and the ? ternary operator.

if ( a > b )         /* Conditional branch */
{
c = a;
goto next_value; /* Unconditional branch */
}

(*callback)(c, b); /* Computed branch */

next_value:
c = b;
Branch Resolution
A branch instruction can depend on resources in order to make the processor continue execution on the right location. A branch instruction that has all needed resources available by the time it's processed is called resolved. Unconditional, direct branches are always resolved (statically, at compile time) because they have no dependencies.

All other branches are called unresolved. They depend on the value in a register (e.g. LR or CTR) to determine the target address or on a condition (e.g. CTR or CR) to determine if then branch will or will not be taken. Unresolved branches correspond to conditional or indirect branch operations. Those branches must be resolved (the target address or the condition - or both - must be computed) before they are dispatched to be processed. If an unresolved branch is processed, some prediction of the address where the program will continue execution must take place in order not to stall, or minimize the stall, of the execution pipeline.

PPU Branch Prediction
When a branch is to be processed but is not yet resolved, the processor must do one of:
  • Sit still and wait until the resources needed by the branch instruction are available and so the address where execution will continue is known;
  • Continue execution simultaneously on both possible addresses - the target and the fall-through addresses - until the correct address is known. In this case, all the processing made on the wrong path must be undone; or
  • Predict the address on which execution will continue and go to that path until the correct address is known. In this case, if the predicted address is found wrong all the processing must be undone back to the branch instruction, and the branch instruction is re-evaluated.
The first option will always cause execution to stop. The second option requires special hardware in the processor to execute the two different paths simultaneously and requires intensive register renaming. The PPU uses the third option. It is important therefore to understand how prediction works in order to write more efficient code.

Branch prediction is the task of guessing the address where a branch that is unresolved by the time it's processed will jump to. The PPU has two branch prediction modes: static and dynamic.
1. Static branch prediction does not change the prediction as the program runs. The compiler makes a guess about whether to predict taken or not taken and sets a bit in the instruction word. When the instruction runs, the hardware will always branch the same way according to the value of the special bit.

2. Dynamic branch prediction changes the prediction as the program runs. The hardware, not the compiler, must guess which way to branch each time the instruction is executed.
-- From: http://lgjohn.okstate.edu/6253/lectures_old/brpredict.pdf

Static prediction comes in two flavors: prediction of the likelihood of the branch to be taken and prediction of the target address at which execution will continue. The prediction of the likelihood of the branch to be taken is related to conditions that depend on CR and on CTR. A branch which depends on CR and is to be processed but the results of the instruction that updates CR are not yet available will cause a stall, so the condition of the branch must be predicted. The same thing happens with CTR.

The prediction of the address at which execution will continue is related to target addresses that are in LR or in CTR. A branch which depends on LR and where its value is not yet available will cause a stall, so the continuation address must be predicted. The same thing happens with CTR.

The prediction of the likelihood of a branch to be taken is given by two bits a and t in the BO field of the branch instructions. It cannot be changed while the program is executing, that's why the prediction is called static.

a t Hint
0 0 No hint is given
0 1 Reserved
1 0 The branch is very likely not to be taken and execution will continue at the fall-through address
1 1 The branch is very likely to be taken and execution will continue at the target address

Dynamic prediction is performed by lookups in the Branch History Table (BHT) and with the use of the Link Stack. Those structures are used to predict both the likelihood of a branch being taken and the address where execution will continue. There is no way the programmer can interfere with dynamic branch predictions as it's done without the help of the software, so this article will not dive into its details.

SPU Branch Prediction
The SPU doesn't have dynamic branch prediction hardware. The rationale behind this is that that hardware would take increase both energy and chip area requirements.

The SPU static branch prediction isn't encoded in branch instructions, as it is at the PPU. Instead, the SPU has three special instructions that provide hints of the continuation address of the branches.

Instruction Opcode Description
Hint for branch (r-form) hbr branch_addr, reg Hint that the branch instruction at branch_addr will branch to the address contained in reg.
Hint for branch (a-form) hbra branch_addr, predicted_addr Hint that the branch instruction at branch_addr will branch to predicted_addr.
Hint for branch relative hbrr branch_addr, offset Hint that the branch instruction at branch_addr will branch to the address of the hbrr instruction plus the signed value offset (relative address)

When a hint instruction is found, its arguments are saved in the Branch Target Buffer (BTB). Then, every time a branch instructions enters the issue stage of the execution pipeline, its address is compared to the branch instruction address saved in the BTB. If they match, the SPU uses the predicted address saved in the BTB along with the branch address.

If a branch instruction which is not present at the BTB enters the issue state, the SPU predicts that execution will continue at the fall-through path. In other words, those branches behave like PPU branches where a and t bits are set to 1 and 0, respectively. Branches that are present at the BTB behave like PPU branches where a and t bits are both set to 1.

Hints saved at the BTB remains valid until invalidated by a sync instruction or replaced by another hint. Currently, the BTB has only one entry, so only one branch hint will be active at a time, but many branch prediction strategies can be coded in software by the use of the hint for branch instructions.

The address where a hint instruction is placed in relation to the branch instruction for which it provides a hint has consequences to the processing of the hint. In short, hints too close to the branch will have no effect, hints not far enough of the branch will be predicted, but a stall will occur. All other hints will provide predictions without stalling execution.