CellPerformance
All things related to getting the best performance from your Cell Broadband Engine™ (CBE) processor.
Suggestions? Comments? Questions?

Send email to Mike Acton
Articles
Cross-compiling for PS3 Linux
n this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

Unaligned scalar load and store on the SPU
An example of unaligned loads and stores on the SPU. The solution to this problem is to remember that the SPU does not have a scalar instruction set or access local memory in anything except 16 bytes quadwords.

atan2 on SPU
A branch-free implementation of atan2 vector floats for the SPU.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.

Better Performance Through Branch Elimination
An introduction to branch penalties: Why it's a good idea to avoid branchy code.

Box Overlap
A look at a function to test for overlap between 3D boxes, and how to optimize it for the CBE.

A 4x4 Matrix Inverse
Study case about how to convert scalar code indo SIMD code for PPU and SPU using the matrix inverse as example.

Avoiding Microcoded Instructions On The PPU
Executing instructions from microcode can wreck havok on inner loop performance. Find out which instructions are microcoded and how to avoid them.

Choosing to Avoid Branches: A Small Altivec Example
An example of why less instructions doesn't always equal faster code.

More Techniques for Eliminating Branches
Some additional examples for eliminating integer and floating-point branches.

Programming with Branches, Patterns and Tips
GCC follows some straightforward rules that are useful to know when programming with branches.

Benefits to Branch Elimination
The fundamental principal behind branch elimination is that expressing a value as a simple function of its inputs (a single basic block) is often more efficient than selecting a result through a change in control flow (branching).

Background on Branching
A background in understanding how branches operate on the PPU and SPU.

Links
No Insider Info!
Although discussions on applying the Cell processor to game development are welcome here, do not ask for insider information related to Sony's Playstation 3.

The details of the hardware and development are covered by a non-disclosure agreement and under no conditions will confidential information be permitted on this site.

Playstation 3 developers are welcome to participate in the discussions but be aware that this is a publicly accessable site and information not available to the general public may not be disclosed.

Keep it clean so that we can continue to build on the community of Cell developers both inside and outside video game development.

Thank you for your cooperation,
Mike.
Legal
Content Copyright © 2006 by Mike Acton. All Rights Reserved.

This site uses the Movable Type 3.2 content engine.

This site uses the phpBB bulletin board engine Copyright © 2001, 2005 phpBB Group.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc

PowerPC is a trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds in the U.S. and other countries.

Macintosh, and Mac are registered trademarks of Apple Computer, Inc

All other trademarks are the property of their respective owners.
Avoiding Microcoded Instructions On The PPU
Mike Acton
April 28, 2006
What are microcoded instructions?
Microcode is a special instruction set that is (usually) only available to the hardware. On the PPU (PowerPC Unit), small microprograms made up of microcode are stored in ROM and executed in the place of those PowerPC instructions that were too costly to implement directly in hardware or do not fit into the pipeline design very well. The size of a microprogram is measured in microwords.

The PowerPC instructions for which a microprogram is executed are often called microcoded instructions.

Microcoded instructions may be conditionally executed or unconditionally executed. Unconditionally executed microcoded instructions always execute the microprogram. Conditionally executed microcoded instructions will only execute the microprogram when the values of the register operands are exceptional in some way. Microcoded instructions are a special case of normal instructions and conditionally executed microcoded instructions are a special case of those.
Why avoid microcoded instructions?
The G5 core implements several instructions in microcode. These instructions cause a pipeline bubble during decode. The most commonly used microcoded instructions are load and store multiple -- lmw and stmw. These are often generated by the compiler to save space when saving and restoring registers on the stack. You can force GCC to avoid these instructions by specifying -mnomultiple. Indexed forms and/or algebraic forms of updating load and stores are also executed as microcode. You can force GCC to avoid these instructions by specifying -mno-update.

Like the G5, the PPU contains microcoded instructions. Microcoded instructions are implemented in order to maintain compatibility with the PowerPC standard (a processor can only be called a PowerPC processor if it adheres to the standard [ibm.com].) When one of these instructions is decoded, the current pipeline is flushed, the microded program is then fetched from ROM and executed as a single atomic unit. The process of flushing the pipeline, fetching the microcode and executing the program takes quite a long time compared to other instructions. Additionally, because the instruction must be executed atomically in order to remain as transparent to the user as possible, any resources needed by the microcode program must be locked.

;; micr insns will stall at least 7 cycles to get the first instr from ROM, micro instructions are not dual issued.
The minimum seven (7) cycle stall for microcoded instructions is derived from the fixed stages of the microcode section of the instruction pipeline. Microcoded stages are inserted after the last instruction buffer stage and before the first instruction decode stage. The actual penalty is determined by the complexity and length of the instruction.

For more information on the PPU pipeline stages see: Introduction to the Cell multiprocessor [ibm.com]

The details on which instructions are microcoded and the associated penalties are specific to each PowerPC device and are outlined in the User's Guide for the individual processor. For example, see the IBM PowerPC 970FX RISC Microprocessor User's Guide [ibm.com] paying particular attention to Section 6.3.3 Instruction Decode, Cracking, and Microcode.

The PPU User's Guide has not been released publically. So how is a programmer to know which instructions are microcoded and how to avoid them?

Read on to find out.
UPDATE: 11 MAY 2006

On May 10, 2006 IBM released the Cell Broadband Engine Programming Handbook [ibm.com]. Section A.1.3.1 (Unconditionally Microcoded Instructions) has a detailed list of those instructions which are always microcoded, including latency information and microword count. Before this document was released there were no public documents which described in detail, the penalties for using microcoded instructions. This article has been updated to reflect those details.

From the document:

Note: A minimum of 11 cycles are required before the first instruction is received from the microcode ROM, so microcoded instructions should be avoided if possible.

Most microcoded instructions are decoded into two or three simple PowerPC instructions, and they can be avoided in most cases. The microcoded instructions are typically decomposed into an integer and a load or store operation, with a dependency between them. Although most microcoded PowerPC instructions are decoded into only a few simple instructions, it is important to keep in mind that there are typically dependencies between the internal operations of the microcode, which generate stalls at the issue stage. Replacing the microcoded instructions with PowerPC instructions not only avoids stalling but also gives more latitude in scheduling instructions to avoid stalls, as well as potentially improving multithreaded performance.
Microcoded instruction scheduling
Like many of the specific details of the processor, any good compiler needs to understand (and take advantage of) the predicted latency and throughput information on each instruction. So a programmer need look no further than the CBE GCC source code [bsc.es] for a list of microcoded instructions.

Here's an example extry from the rs6000.md file (which is used by the cell-ppu target) which flags the first instruction in the replacement ("rldicl.") as being microcoded.
(define_insn "" [(set (match_operand:CC 0 "cc_reg_operand" "=x,?y") (compare:CC (zero_extend:DI (match_operand:QI 1 "gpc_reg_operand" "r,r")) (const_int 0))) (clobber (match_scratch:DI 2 "=r,r"))] "TARGET_64BIT" "@ rldicl. %2,%1,0,56 #" [(set_attr "type" "compare") (set_attr "microcode" "mc,*") (set_attr "length" "4,8")])
The above snippet is written in RTL [gnu.org], which stands for Register Transfer Language and is used to describe the processor specific assembly output in GCC. Assembly-level transformations, such as peephole optimizations, are also described in RTL. For an introduction to RTL see: Using and Porting the GNU Compiler Collection (GCC) [gnu.org] and Porting GCC For Dummies [axis.se]

Here is a partial list of the microcoded instructions explicitly flagged in the same file:
and. andi. andil. andis. andiu. doz*. lhau lhaux lm lmw lsi lswi mr. mullw. muls. neg. nor. or. rldic*. rlinm. rlwinm. s*i. s*wi. sf. sl sle. sli. slw slwi. sr sre. srw stm stmw stsi stswi subf. subfc.

For the definitive list of microcoded instructions see the Cell Broadband Engine Programming Handbook [ibm.com]. The corresponding instructions have been added to the sections below.

Avoiding microcoded instructions
Microcoded instructions, such as load/store multiple, were designed to save space in compiled code and offer no performance advantage over using multiple instructions. Because of the way these instructions are handled inside the processor, they might have a greater latency and take longer to execute than a sequence of individual instructions that produces the same results. Some compilers (gcc for example) have options that prevent generation of these instructions.

Fortunately, there is a GCC flag that will warn the programmer if a known microcoded instruction is emitted. Simply add -mwarn-microcode to your compilation flags.

This flag is defined in rs6000.h:
{"warn-microcode", &rs6000_warn_microcode_switch, \ N_("Emitting warning of microcode") }, \ {"no-warn-microcode", &rs6000_warn_microcode_switch, "" }, \
And is processesed in rs6000.c:
/* Handle -m(no-)warn-microcode similarly. */ if (rs6000_warn_microcode_switch) { const char *base = rs6000_warn_microcode_switch; while (base[-1] != 'm') base--; if (*rs6000_warn_microcode_switch != '\0') error ("invalid option `%s'", base); rs6000_warn_microcode = (base[0] != 'n'); }
And is used in final.c:
#ifdef RS6000_GENERATE_MICROCODE /* 0 - notmicrocode, 1 - conditional microcode, 2 - microcode */ if (rs6000_warn_microcode) { if (get_attr_microcode(insn) == 2) pedwarn ("emitting microcode insn %s\t[%s] #%d",template, insn_data[INSN_CODE(insn)].name,INSN_UID(insn)); else if (get_attr_microcode(insn) == 1) pedwarn ("emitting conditional microcode insn %s\t[%s] #%d",template, insn_data[INSN_CODE(insn)].name,INSN_UID(insn)); } #endif
The other PowerPC specific compilation flags can also be found in these files.
The compiler source is the best source for information on compiler flags and processor specific options. Some flags do not make it into the help output.
Note that -mwarn-microcode is not in the gcc help list of flags.

How does this affect code in practice? From the above list, there are three main classes of microcoded instructions to watch out for.
Avoid multiple load/store instructions
These instructions are handy to load or store a small contiguous area of memory. However, it will always be faster to simply load each individual value into a register. GCC will not emit these instructions if the -mno-multiple flag is passed to the compiler.

List of microcoded load and store instructions, including load/store multiple. From: Cell Broadband Engine Programming Handbook [ibm.com]
|--------------------------------------------------------------------------------------------------------------------| | Unconditionally Microcoded Loads and Stores | |--------------------------------------------------------------------------------------------------------------------| | | | A microcode load or store operation can access an 8-bit byte or a 32-bit word, indicated as "by byte" or | | "by word" respectively. | | | |--------------------------------------------------------------------------------------------------------------------| | INSTRUCTION | CLASS | LATENCY | MICROWORD SIZE | COMMENT | |---------------------|-----------------|----------|-----------------------------|-----------------------------------| | lha | load algebraic | 11 | 7 | Handled by byte. | | lhau | load algebraic | 11 | 8 | Handled by byte. | | lhaux | load algebraic | 11 | 8 | Handled by byte. | | lhax | load algebraic | 11 | 8 | Handled by byte. | | lmw | load multiple | 11 |(2 + 1 × words) | This instruction is broken down | | | | | | into a series of load words. | | lswi | load string / | 10 | By word: | Optimized instruction[1] | | | optimized | | (1 × words + 2 × bytes) | | | | | | By byte: | | | | | | (2 × bytes) | | | lswx | load string / | By word: | By word: | Optimized instruction[1] | | | optimized | 10 | 4 + (1 × words + 2 × bytes) | | | | | By byte: | By byte: | | | | | 7 | 4 + (2 × bytes) | | | | | | | | | | | | | | | lwa | load algebraic | 11 | 13 | Handled by byte. | | lwaux | load algebraic | 11 | 12 | Handled by byte. | | lwax | load algebraic | 11 | 12 | Handled by byte. | | stmw | store multiple | 11 | (2 + 1 × words) | Broken into a series of store | | | | | | words. | | stswi | store string / | 10 | By word: | | | | optimized | | (1 × words + 2 × bytes) | Optimized instruction[1] | | | | | By byte: | | | | | | (2 × bytes) | | | stswx | store string / | 7 | By word: | Optimized instruction[1] | | | optimized | | 4 + (1 × words + 2 × bytes) | | | | | | By byte: | | | | | | 4 + (2 × bytes) | | | | | | | | |--------------------------------------------------------------------------------------------------------------------| [1] The instruction is first broken down into a series of load-word instructions (odd bytes are handled by byte). If this does not cause an alignment exception, then the instruction is complete. If an alignment exception occurs, the first attempt is flushed. When the instruction is returned to microcode it is then handled a byte at a time. Odd bytes, if any, are defined as the remainder of string_count / 4. For store instructions, it is a series of store words.
Avoid Condition Register recording integer instructions
Many of the integer functions, when the Condition Register (CR) modify bit is set (denoted by a "dot" at the end of the instruction), are microcoded. With this bit set, fixed-point instructions will automatically set the first field (field zero) in the Condition Register with the value's compare-with-zero result. For example, if the result of the "or." instruction is greater than zero, the GT bit will be set in CR[0].

In general, this makes branching on integer expressions more expensive and an effort should be made to eliminate them.

For information on eliminating integer (fixed point) branches see:

Better Performance Through Branch Elimination
An introduction to branch penalties: Why it's a good idea to avoid branchy code.

Reducing The Costs Of Comparisons and Branches
Reducing comparisons by combining branches and judicious use of branch hints..

Using Masks To Accelerate Integer Code
A branch-free technique for selecting between multiple integer values with no comparison operations and no dependencies on the Condition Register (CR). Using this method will improve pipelining on inlined code significantly.


List of CR recording microcoded instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]
|-------------------------------------------------| | Unconditionally Microcoded Instructions | | (CR recording) | | | | Record instructions are all handled the | | same way. The “root” instruction is issued | | followed by the cmpi_x instruction. | | | | The nonrecord form used in the microcode | | sequence is only available to microcode. | | | |-------------------------------------------------| | INSTRUCTION | LATENCY | MICROWORD SIZE | |---------------------|---------|-----------------| | and. | 11 | 2 | | andc. | 11 | 2 | | andi. | 11 | 2 | | andis. | 11 | 2 | | nand. | 11 | 2 | | nor. | 11 | 2 | | nego. | 11 | 2 | | or. | 11 | 2 | | orc. | 11 | 2 | | xor. | 11 | 2 | | cntlzd. | 11 | 2 | | cntlzw. | 11 | 2 | | divd. | 11 | 2 | | divdo. | 11 | 2 | | divdu. | 11 | 2 | | divduo. | 11 | 2 | | divw. | 11 | 2 | | divwo. | 11 | 2 | | divwu. | 11 | 2 | | divwuo. | 11 | 2 | | eqv. | 11 | 2 | | extsb. | 11 | 2 | | extsh. | 11 | 2 | | extsw. | 11 | 2 | | mulhd. | 11 | 2 | | mulhdu. | 11 | 2 | | mulhw. | 11 | 2 | | mulhwu. | 11 | 2 | | mulld. | 11 | 2 | | mulldo. | 11 | 2 | | mullw. | 11 | 2 | | mullwo. | 11 | 2 | | rldcl. | 11 | 5 | | rldcr. | 11 | 5 | | rldic. | 11 | 2 | | rldicl. | 11 | 2 | | rldicr. | 11 | 2 | | rldimi. | 11 | 2 | | rlwimi. | 11 | 2 | | rlwinm. | 11 | 2 | | rlwnm. | 11 | 5 | | sld. | 11 | 5 | | slw. | 11 | 5 | | srad. | 11 | 5 | | sradi. | 11 | 2 | | sraw. | 11 | 5 | | srawi. | 11 | 2 | | srd. | 11 | 5 | | srw. | 11 | 5 | |---------------------|---------|-----------------|

Avoid indirect shift and rotate instructions
This is the simpliest case to find, but the hardest to eliminate:
int64_t right_shift64( int64_t a, int64_t sa ) { return ( a >> sa ); }
This code will emit this deceptively simple function:
.right_shift_64: srad 3,3,4 blr
And the following warning (if -mwarn-microcode is enabled):
test.c: In function `right_shift_64': test.c:7: warning: emitting microcode insn srad%I2 %0,%1,%H2 [*ashrdi3_internal1] #20

The best option for eliminating indirect shift instructions is to know the range of possible shift amounts and create an alternate branch-free expression that selects between those choices.

List of indirect shift and rotate microcoded instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]
|--------------------------------------------------| | Unconditionally Microcoded Instructions | | (Shift and Rotate) | | | | All indirect shift and rotate instructions | | are handled using the same technique. First | | the mt_shr is issued, followed by two noops | | for delay, followed by the root instruction | | (that is, rldcl_sh). | | | |--------------------------------------------------| | INSTRUCTION | LATENCY | MICROWORD SIZE | |---------------------|---------|------------------| | rldcl | 11 | 4 | | rldcr | 11 | 4 | | rlwnm | 11 | 4 | | sld | 11 | 4 | | slw | 11 | 4 | | srad | 11 | 4 | | sraw | 11 | 4 | | srd | 11 | 4 | | srw | 11 | 4 | |--------------------------------------------------|

Non-Pipelined, Complex Instructions
In addition to microcoded instructions there is another class of low performance instructions worth mentioning: the complex pipeline instructions. These instructions are are not microcoded (i.e. the resources already local to the execution pipeline can be used directly), however they are complex enough that special handling is required. In order for these instructions to be executed the instruction pipeline must be evacuated (i.e. flushed). Therefore the throughput of these instructions will be equal to the latency - They will be slow.

List of the non-pipelined instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]
|----------------------------------------------------| | Non-Microcoded, Non-Pipelined Integer Instructions | |---------|----------|-------------------------------| | Instr. | Pipeline | Latency (cycles) | |---------|----------|-------------------------------| | mulli | FXU | 6 | | mullw | FXU | 9 | | mulhw | FXU | 9 | | mulhwu | FXU | 9 | | mullwo | FXU | 9 | | mulld | FXU | 15 | | mulhd | FXU | 15 | | mulhdu | FXU | 15 | | mulldo | FXU | 15 | | divd | FXU | 10-70 | | divdu | FXU | 10-70 | | divdo | FXU | 10-70 | | divduo | FXU | 10-70 | | divw | FXU | 10-38 | | divwu | FXU | 10-38 | | divwo | FXU | 10-38 | | divwuo | FXU | 10-38 | |---------|----------|-------------------------------| Note on divide instructions: The fixed-point divide is a variable latency operation that calculates RA and RB for word or doubleword and signed or unsigned fixedpoint (integer) operands. Division is defined by the following equation: dividend = (quotient x divisor) + r where: 0 = r < |divisor|, when dividend = 0 and -|divisor| < r = 0, when dividend < 0 Overflow is set when an attempt is made to compute either the least negative integer divided by negative one or any integer divided by zero. The performance is determined by the number of bits required to represent the result. PPU cycles equal: ((1 setup) + (ceil ((rb leading digits - ra leading digits)/2) + 1 iterations) + (1 fixup)) × 2 word minimum = 10, maximum = 38 cycles doubleword minimum = 10, maximum = 70 cycles Overflow cases will complete in 10 cycles

There is no method to detect complex instructions emitted by the GCC compiler. Avoid integer multiplies and divides.

Good luck with that! ;)

Summary
  • Keep an eye out for microcoded instructions: use -mwarn-microcode in GCC.
  • Don't use multiple load/store instructions: use -mno-multiple in GCC.
  • Avoid CR recording integer instructions
  • Avoid indirect shift and rotate instructions
  • Avoid integer multiply and divide instructions
Eliminating microcoded and other non-pipelined instructions is sometimes difficult and not always desireable (for example, when code size is the determining factor in performance.) However, it is important to know the penalty and make an informed choice. And as always, if you are optimizing at this level, be sure to double-check your results with a real profile on real hardware.
About the author
Mike Acton is a Senior Architect working on PS3/Cell research at Highmoon Studios (Vivendi-Universal Games) who occasionally refers to himself in the third person. Most recently, Mike was the Lead Special Effects Programmer for Darkwatch at Highmoon Studios (previously Sammy Studios) in Carlsbad, CA. Previously Mike has held Lead Programming, Playstation 2 Engine Development and Research roles with Yuke's, VBlank and BlueSky/Titus. He worked for Sony Interactive Studios in PSX development.

Mike has made regular appearances as a speaker at SCEA develpment conferences and has spoken at GDC. Mike Acton is not a framework-happy C++ programmer. He actually likes C. And assembly. In his spare time he develops hardware on FPGAs in VHDL.

He prefers vi.

Also check out the non Cell BE articles by Mike Acton.