Better Performance Through Branch Elimination
Mike Acton
July 11, 2006
July 11, 2006
Update! (11 July 06) Major Revision. With much help from André de Leiradella, these are improved working drafts of the series on branch elimination. Now included are a more detailed background on branches and many more examples!
Introduction
Second only to poor data access patterns, branches can have a big negative impact in the performance of a program. Methods for reducing branch penalties, such as both dynamic and static (software-assisted) branch prediction hardware, despite their successes, are increasingly less effective as the length of the instruction pipelines increase, particularly with in-order architectures where execution must be stalled when hardware prediction fails.
Branching, both conditional and unconditional, slows most implementations. Even an unconditional branch or a correctly predicted taken branch may cause a delay if the target instruction is not in the fetch buffer or the cache. It is therefore best to use branch instructions carefully and to devise algorithms that reduce branching. Many operations that normally use branches may be performed either with fewer or no branches.
-- From IBM's The PowerPC Compiler Writer's Guide 3.1.5
Branches represent a significant part of both performance critical and general purpose code - as a general rule of thumb, 20% of the instructions in typical code are branches. In inner loops and other code sections which demand the highest performance may benefit from a multifold increase in performance by eliminating, or reducing, branches.
This series of articles will present the types of delays that branches may cause in program execution and some programming patterns that help avoid those delays.
Part 1: Introduction
Part 2: Background on Branching
Part 3: Benefits to Branch Elimination
Part 4: Programming with Branches, Patterns and Tips
Part 5: More Techniques for Eliminating Branches
Additional Examples:
Increment And Decrement Wrapping Values
Occasionally you have a set of values that you want to wrap around as you increment and decrement them. But the straightfoward implementation can have a big impact on processors where comparisons and branches are expensive (e.g. PowerPC). This article presents a straightforward branch-free implementation of these functions.
Choosing to Avoid Branches: A Small Altivec Example
An example of why less instructions doesn't always equal faster code.
Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.
If this article interests you, I recommend highly Henry S. Warren's book Hacker's Delight and his associated website, Hacker's Delight. This book is a must-have for every programmer.