Avoiding Microcoded Instructions On The PPU

What are microcoded instructions?

Microcode is a special instruction set that is (usually) only available to the hardware. On the PPU (PowerPC Unit), small microprograms made up of microcode are stored in ROM and executed in the place of those PowerPC instructions that were too costly to implement directly in hardware or do not fit into the pipeline design very well. The size of a microprogram is measured in microwords.

The PowerPC instructions for which a microprogram is executed are often called microcoded instructions.

Microcoded instructions may be conditionally executed or unconditionally executed. Unconditionally executed microcoded instructions always execute the microprogram. Conditionally executed microcoded instructions will only execute the microprogram when the values of the register operands are exceptional in some way. Microcoded instructions are a special case of normal instructions and conditionally executed microcoded instructions are a special case of those.

Why avoid microcoded instructions?

The G5 core implements several instructions in microcode. These instructions cause a pipeline bubble during decode. The most commonly used microcoded instructions are load and store multiple -- lmw and stmw. These are often generated by the compiler to save space when saving and restoring registers on the stack. You can force GCC to avoid these instructions by specifying -mnomultiple. Indexed forms and/or algebraic forms of updating load and stores are also executed as microcode. You can force GCC to avoid these instructions by specifying -mno-update.

-- From G5 Performance Primer [apple.com]

Like the G5, the PPU contains microcoded instructions. Microcoded instructions are implemented in order to maintain compatibility with the PowerPC standard (a processor can only be called a PowerPC processor if it adheres to the standard [ibm.com].) When one of these instructions is decoded, the current pipeline is flushed, the microded program is then fetched from ROM and executed as a single atomic unit. The process of flushing the pipeline, fetching the microcode and executing the program takes quite a long time compared to other instructions. Additionally, because the instruction must be executed atomically in order to remain as transparent to the user as possible, any resources needed by the microcode program must be locked.

;; micr insns will stall at least 7 cycles to get the first instr from ROM, micro instructions are not dual issued.

-- From cellpu.md (CBE Toolchain 2.3 source code [bsc.es])

The minimum seven (7) cycle stall for microcoded instructions is derived from the fixed stages of the microcode section of the instruction pipeline. Microcoded stages are inserted after the last instruction buffer stage and before the first instruction decode stage. The actual penalty is determined by the complexity and length of the instruction.

For more information on the PPU pipeline stages see: Introduction to the Cell multiprocessor [ibm.com]

The details on which instructions are microcoded and the associated penalties are specific to each PowerPC device and are outlined in the User's Guide for the individual processor. For example, see the IBM PowerPC 970FX RISC Microprocessor User's Guide [ibm.com] paying particular attention to Section 6.3.3 Instruction Decode, Cracking, and Microcode.

The PPU User's Guide has not been released publically. So how is a programmer to know which instructions are microcoded and how to avoid them?

Read on to find out.

UPDATE: 11 MAY 2006

On May 10, 2006 IBM released the Cell Broadband Engine Programming Handbook [ibm.com]. Section A.1.3.1 (Unconditionally Microcoded Instructions) has a detailed list of those instructions which are always microcoded, including latency information and microword count. Before this document was released there were no public documents which described in detail, the penalties for using microcoded instructions. This article has been updated to reflect those details.

From the document:

Note: A minimum of 11 cycles are required before the first instruction is received from the microcode ROM, so microcoded instructions should be avoided if possible.

Most microcoded instructions are decoded into two or three simple PowerPC instructions, and they can be avoided in most cases. The microcoded instructions are typically decomposed into an integer and a load or store operation, with a dependency between them. Although most microcoded PowerPC instructions are decoded into only a few simple instructions, it is important to keep in mind that there are typically dependencies between the internal operations of the microcode, which generate stalls at the issue stage. Replacing the microcoded instructions with PowerPC instructions not only avoids stalling but also gives more latitude in scheduling instructions to avoid stalls, as well as potentially improving multithreaded performance.

Microcoded instruction scheduling

Like many of the specific details of the processor, any good compiler needs to understand (and take advantage of) the predicted latency and throughput information on each instruction. So a programmer need look no further than the CBE GCC source code [bsc.es] for a list of microcoded instructions.

Here's an example extry from the rs6000.md file (which is used by the cell-ppu target) which flags the first instruction in the replacement ("rldicl.") as being microcoded.

(define_insn "" [(set (match_operand:CC 0 "cc_reg_operand" "=x,?y") (compare:CC (zero_extend:DI (match_operand:QI 1 "gpc_reg_operand" "r,r")) (const_int 0))) (clobber (match_scratch:DI 2 "=r,r"))] "TARGET_64BIT" "@ rldicl. %2,%1,0,56 #" [(set_attr "type" "compare") (set_attr "microcode" "mc,*") (set_attr "length" "4,8")])

The above snippet is written in RTL [gnu.org], which stands for Register Transfer Language and is used to describe the processor specific assembly output in GCC. Assembly-level transformations, such as peephole optimizations, are also described in RTL. For an introduction to RTL see: Using and Porting the GNU Compiler Collection (GCC) [gnu.org] and Porting GCC For Dummies [axis.se]

Here is a partial list of the microcoded instructions explicitly flagged in the same file:

and. andi. andil. andis. andiu. doz*. lhau lhaux lm lmw lsi lswi mr. mullw. muls. neg. nor. or. rldic*. rlinm. rlwinm. s*i. s*wi. sf. sl sle. sli. slw slwi. sr sre. srw stm stmw stsi stswi subf. subfc.

For the definitive list of microcoded instructions see the Cell Broadband Engine Programming Handbook [ibm.com]. The corresponding instructions have been added to the sections below.

Avoiding microcoded instructions

Microcoded instructions, such as load/store multiple, were designed to save space in compiled code and offer no performance advantage over using multiple instructions. Because of the way these instructions are handled inside the processor, they might have a greater latency and take longer to execute than a sequence of individual instructions that produces the same results. Some compilers (gcc for example) have options that prevent generation of these instructions.

-- From PowerPC processor tips: Improve PowerPC 970FX performance [ibm.com]

Fortunately, there is a GCC flag that will warn the programmer if a known microcoded instruction is emitted. Simply add -mwarn-microcode to your compilation flags.

This flag is defined in rs6000.h:

{"warn-microcode", &rs6000_warn_microcode_switch, \ N_("Emitting warning of microcode") }, \ {"no-warn-microcode", &rs6000_warn_microcode_switch, "" }, \

And is processesed in rs6000.c:

/* Handle -m(no-)warn-microcode similarly. */ if (rs6000_warn_microcode_switch) { const char *base = rs6000_warn_microcode_switch; while (base[-1] != 'm') base--; if (*rs6000_warn_microcode_switch != '\0') error ("invalid option `%s'", base); rs6000_warn_microcode = (base[0] != 'n'); }

And is used in final.c:

#ifdef RS6000_GENERATE_MICROCODE /* 0 - notmicrocode, 1 - conditional microcode, 2 - microcode */ if (rs6000_warn_microcode) { if (get_attr_microcode(insn) == 2) pedwarn ("emitting microcode insn %s\t[%s] #%d",template, insn_data[INSN_CODE(insn)].name,INSN_UID(insn)); else if (get_attr_microcode(insn) == 1) pedwarn ("emitting conditional microcode insn %s\t[%s] #%d",template, insn_data[INSN_CODE(insn)].name,INSN_UID(insn)); } #endif

The other PowerPC specific compilation flags can also be found in these files.

The compiler source is the best source for information on compiler flags and processor specific options. Some flags do not make it into the help output.

Note that -mwarn-microcode is not in the gcc help list of flags.

How does this affect code in practice? From the above list, there are three main classes of microcoded instructions to watch out for.

Avoid multiple load/store instructions

These instructions are handy to load or store a small contiguous area of memory. However, it will always be faster to simply load each individual value into a register. GCC will not emit these instructions if the -mno-multiple flag is passed to the compiler.

List of microcoded load and store instructions, including load/store multiple. From: Cell Broadband Engine Programming Handbook [ibm.com]

|--------------------------------------------------------------------------------------------------------------------| | Unconditionally Microcoded Loads and Stores | |--------------------------------------------------------------------------------------------------------------------| | | | A microcode load or store operation can access an 8-bit byte or a 32-bit word, indicated as "by byte" or | | "by word" respectively. | | | |--------------------------------------------------------------------------------------------------------------------| | INSTRUCTION | CLASS | LATENCY | MICROWORD SIZE | COMMENT | |---------------------|-----------------|----------|-----------------------------|-----------------------------------| | lha | load algebraic | 11 | 7 | Handled by byte. | | lhau | load algebraic | 11 | 8 | Handled by byte. | | lhaux | load algebraic | 11 | 8 | Handled by byte. | | lhax | load algebraic | 11 | 8 | Handled by byte. | | lmw | load multiple | 11 |(2 + 1 × words) | This instruction is broken down | | | | | | into a series of load words. | | lswi | load string / | 10 | By word: | Optimized instruction[1] | | | optimized | | (1 × words + 2 × bytes) | | | | | | By byte: | | | | | | (2 × bytes) | | | lswx | load string / | By word: | By word: | Optimized instruction[1] | | | optimized | 10 | 4 + (1 × words + 2 × bytes) | | | | | By byte: | By byte: | | | | | 7 | 4 + (2 × bytes) | | | | | | | | | | | | | | | lwa | load algebraic | 11 | 13 | Handled by byte. | | lwaux | load algebraic | 11 | 12 | Handled by byte. | | lwax | load algebraic | 11 | 12 | Handled by byte. | | stmw | store multiple | 11 | (2 + 1 × words) | Broken into a series of store | | | | | | words. | | stswi | store string / | 10 | By word: | | | | optimized | | (1 × words + 2 × bytes) | Optimized instruction[1] | | | | | By byte: | | | | | | (2 × bytes) | | | stswx | store string / | 7 | By word: | Optimized instruction[1] | | | optimized | | 4 + (1 × words + 2 × bytes) | | | | | | By byte: | | | | | | 4 + (2 × bytes) | | | | | | | | |--------------------------------------------------------------------------------------------------------------------| [1] The instruction is first broken down into a series of load-word instructions (odd bytes are handled by byte). If this does not cause an alignment exception, then the instruction is complete. If an alignment exception occurs, the first attempt is flushed. When the instruction is returned to microcode it is then handled a byte at a time. Odd bytes, if any, are defined as the remainder of string_count / 4. For store instructions, it is a series of store words.

Avoid Condition Register recording integer instructions

Many of the integer functions, when the Condition Register (CR) modify bit is set (denoted by a "dot" at the end of the instruction), are microcoded. With this bit set, fixed-point instructions will automatically set the first field (field zero) in the Condition Register with the value's compare-with-zero result. For example, if the result of the "or." instruction is greater than zero, the GT bit will be set in CR[0].

In general, this makes branching on integer expressions more expensive and an effort should be made to eliminate them.

List of CR recording microcoded instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]

|-------------------------------------------------| | Unconditionally Microcoded Instructions | | (CR recording) | | | | Record instructions are all handled the | | same way. The "root" instruction is issued | | followed by the cmpi_x instruction. | | | | The nonrecord form used in the microcode | | sequence is only available to microcode. | | | |-------------------------------------------------| | INSTRUCTION | LATENCY | MICROWORD SIZE | |---------------------|---------|-----------------| | and. | 11 | 2 | | andc. | 11 | 2 | | andi. | 11 | 2 | | andis. | 11 | 2 | | nand. | 11 | 2 | | nor. | 11 | 2 | | nego. | 11 | 2 | | or. | 11 | 2 | | orc. | 11 | 2 | | xor. | 11 | 2 | | cntlzd. | 11 | 2 | | cntlzw. | 11 | 2 | | divd. | 11 | 2 | | divdo. | 11 | 2 | | divdu. | 11 | 2 | | divduo. | 11 | 2 | | divw. | 11 | 2 | | divwo. | 11 | 2 | | divwu. | 11 | 2 | | divwuo. | 11 | 2 | | eqv. | 11 | 2 | | extsb. | 11 | 2 | | extsh. | 11 | 2 | | extsw. | 11 | 2 | | mulhd. | 11 | 2 | | mulhdu. | 11 | 2 | | mulhw. | 11 | 2 | | mulhwu. | 11 | 2 | | mulld. | 11 | 2 | | mulldo. | 11 | 2 | | mullw. | 11 | 2 | | mullwo. | 11 | 2 | | rldcl. | 11 | 5 | | rldcr. | 11 | 5 | | rldic. | 11 | 2 | | rldicl. | 11 | 2 | | rldicr. | 11 | 2 | | rldimi. | 11 | 2 | | rlwimi. | 11 | 2 | | rlwinm. | 11 | 2 | | rlwnm. | 11 | 5 | | sld. | 11 | 5 | | slw. | 11 | 5 | | srad. | 11 | 5 | | sradi. | 11 | 2 | | sraw. | 11 | 5 | | srawi. | 11 | 2 | | srd. | 11 | 5 | | srw. | 11 | 5 | |---------------------|---------|-----------------|

Avoid indirect shift and rotate instructions

This is the simpliest case to find, but the hardest to eliminate:

int64_t right_shift64( int64_t a, int64_t sa ) { return ( a >> sa ); }

This code will emit this deceptively simple function:

.right_shift_64: srad 3,3,4 blr

And the following warning (if -mwarn-microcode is enabled):

test.c: In function `right_shift_64': test.c:7: warning: emitting microcode insn srad%I2 %0,%1,%H2 [*ashrdi3_internal1] #20

The best option for eliminating indirect shift instructions is to know the range of possible shift amounts and create an alternate branch-free expression that selects between those choices.

List of indirect shift and rotate microcoded instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]

|--------------------------------------------------| | Unconditionally Microcoded Instructions | | (Shift and Rotate) | | | | All indirect shift and rotate instructions | | are handled using the same technique. First | | the mt_shr is issued, followed by two noops | | for delay, followed by the root instruction | | (that is, rldcl_sh). | | | |--------------------------------------------------| | INSTRUCTION | LATENCY | MICROWORD SIZE | |---------------------|---------|------------------| | rldcl | 11 | 4 | | rldcr | 11 | 4 | | rlwnm | 11 | 4 | | sld | 11 | 4 | | slw | 11 | 4 | | srad | 11 | 4 | | sraw | 11 | 4 | | srd | 11 | 4 | | srw | 11 | 4 | |--------------------------------------------------|

Non-Pipelined, Complex Instructions

In addition to microcoded instructions there is another class of low performance instructions worth mentioning: the complex pipeline instructions. These instructions are are not microcoded (i.e. the resources already local to the execution pipeline can be used directly), however they are complex enough that special handling is required. In order for these instructions to be executed the instruction pipeline must be evacuated (i.e. flushed). Therefore the throughput of these instructions will be equal to the latency - They will be slow.

List of the non-pipelined instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]

|----------------------------------------------------| | Non-Microcoded, Non-Pipelined Integer Instructions | |---------|----------|-------------------------------| | Instr. | Pipeline | Latency (cycles) | |---------|----------|-------------------------------| | mulli | FXU | 6 | | mullw | FXU | 9 | | mulhw | FXU | 9 | | mulhwu | FXU | 9 | | mullwo | FXU | 9 | | mulld | FXU | 15 | | mulhd | FXU | 15 | | mulhdu | FXU | 15 | | mulldo | FXU | 15 | | divd | FXU | 10-70 | | divdu | FXU | 10-70 | | divdo | FXU | 10-70 | | divduo | FXU | 10-70 | | divw | FXU | 10-38 | | divwu | FXU | 10-38 | | divwo | FXU | 10-38 | | divwuo | FXU | 10-38 | |---------|----------|-------------------------------| Note on divide instructions: The fixed-point divide is a variable latency operation that calculates RA and RB for word or doubleword and signed or unsigned fixedpoint (integer) operands. Division is defined by the following equation: dividend = (quotient x divisor) + r where: 0 = r < |divisor|, when dividend = 0 and -|divisor| < r = 0, when dividend < 0 Overflow is set when an attempt is made to compute either the least negative integer divided by negative one or any integer divided by zero. The performance is determined by the number of bits required to represent the result. PPU cycles equal: ((1 setup) + (ceil ((rb leading digits - ra leading digits)/2) + 1 iterations) + (1 fixup)) × 2 word minimum = 10, maximum = 38 cycles doubleword minimum = 10, maximum = 70 cycles Overflow cases will complete in 10 cycles

There is no method to detect complex instructions emitted by the GCC compiler. Avoid integer multiplies and divides.

Good luck with that! ;)

Summary

Keep an eye out for microcoded instructions: use -mwarn-microcode in GCC.
Don't use multiple load/store instructions: use -mno-multiple in GCC.
Avoid CR recording integer instructions
Avoid indirect shift and rotate instructions
Avoid integer multiply and divide instructions

Eliminating microcoded and other non-pipelined instructions is sometimes difficult and not always desireable (for example, when code size is the determining factor in performance.) However, it is important to know the penalty and make an informed choice. And as always, if you are optimizing at this level, be sure to double-check your results with a real profile on real hardware.