What are microcoded instructions?
Microcode is a special instruction set that is (usually) only available
to the hardware. On the PPU (PowerPC Unit), small microprograms made up
of microcode are stored in ROM and executed in the place of those
PowerPC instructions that were too costly to implement directly in
hardware or do not fit into the pipeline design very well. The size of
a microprogram is measured in microwords. The PowerPC instructions for which a microprogram is executed are often called microcoded instructions.
Microcoded instructions may be conditionally executed or unconditionally executed. Unconditionally executed microcoded instructions always execute the microprogram. Conditionally executed microcoded instructions will only execute the microprogram when the values of the register operands are exceptional in some way. Microcoded instructions are a special case of normal instructions and conditionally executed microcoded instructions are a special case of those.
Why avoid microcoded instructions?
The G5 core implements several instructions in
microcode. These instructions cause a pipeline bubble during decode.
The most commonly used microcoded instructions are load and store
multiple -- lmw and stmw. These are often generated by the compiler to
save space when saving and restoring registers on the stack. You can
force GCC to avoid these instructions by specifying -mnomultiple.
Indexed forms and/or algebraic forms of updating load and stores are
also executed as microcode. You can force GCC to avoid these
instructions by specifying -mno-update.
Like the G5, the PPU contains microcoded instructions. Microcoded instructions are implemented in order to maintain compatibility with the PowerPC standard (a processor can only be called a PowerPC processor if it adheres to the standard [ibm.com].) When one of these instructions is decoded, the current pipeline is flushed, the microded program is then fetched from ROM and executed as a single atomic unit. The process of flushing the pipeline, fetching the microcode and executing the program takes quite a long time compared to other instructions. Additionally, because the instruction must be executed atomically in order to remain as transparent to the user as possible, any resources needed by the microcode program must be locked.
;; micr insns will stall at least 7 cycles to get the first instr from ROM, micro instructions are not dual issued.
The minimum seven (7) cycle stall for microcoded instructions is
derived from the fixed stages of the microcode section of the
instruction pipeline. Microcoded stages are inserted after the last
instruction buffer stage and before the first instruction decode stage.
The actual penalty is determined by the complexity and length of the
instruction.
For more information on the PPU pipeline stages see: Introduction to the Cell multiprocessor [ibm.com]
For more information on the PPU pipeline stages see: Introduction to the Cell multiprocessor [ibm.com]
The details on which instructions are microcoded and the associated penalties are specific to each PowerPC device and are outlined in the User's Guide for the individual processor. For example, see the IBM PowerPC 970FX RISC Microprocessor User's Guide [ibm.com] paying particular attention to Section 6.3.3 Instruction Decode, Cracking, and Microcode.
The PPU User's Guide has not been released publically. So how is a programmer to know which instructions are microcoded and how to avoid them?
Read on to find out.
UPDATE: 11 MAY 2006
On May 10, 2006 IBM released the Cell Broadband Engine Programming Handbook [ibm.com]. Section A.1.3.1 (Unconditionally Microcoded Instructions) has a detailed list of those instructions which are always microcoded, including latency information and microword count. Before this document was released there were no public documents which described in detail, the penalties for using microcoded instructions. This article has been updated to reflect those details.
From the document:
On May 10, 2006 IBM released the Cell Broadband Engine Programming Handbook [ibm.com]. Section A.1.3.1 (Unconditionally Microcoded Instructions) has a detailed list of those instructions which are always microcoded, including latency information and microword count. Before this document was released there were no public documents which described in detail, the penalties for using microcoded instructions. This article has been updated to reflect those details.
From the document:
Note: A minimum of 11 cycles are required before the first
instruction is received from the microcode ROM, so microcoded
instructions should be avoided if possible.
Most microcoded instructions are decoded into two or three simple PowerPC instructions, and they can be avoided in most cases. The microcoded instructions are typically decomposed into an integer and a load or store operation, with a dependency between them. Although most microcoded PowerPC instructions are decoded into only a few simple instructions, it is important to keep in mind that there are typically dependencies between the internal operations of the microcode, which generate stalls at the issue stage. Replacing the microcoded instructions with PowerPC instructions not only avoids stalling but also gives more latitude in scheduling instructions to avoid stalls, as well as potentially improving multithreaded performance.
Most microcoded instructions are decoded into two or three simple PowerPC instructions, and they can be avoided in most cases. The microcoded instructions are typically decomposed into an integer and a load or store operation, with a dependency between them. Although most microcoded PowerPC instructions are decoded into only a few simple instructions, it is important to keep in mind that there are typically dependencies between the internal operations of the microcode, which generate stalls at the issue stage. Replacing the microcoded instructions with PowerPC instructions not only avoids stalling but also gives more latitude in scheduling instructions to avoid stalls, as well as potentially improving multithreaded performance.
Microcoded instruction scheduling
Like many of the specific details of the processor, any good compiler
needs to understand (and take advantage of) the predicted latency and
throughput information on each instruction. So a programmer need look
no further than the CBE GCC source code [bsc.es] for a list of microcoded instructions.Here's an example extry from the rs6000.md file (which is used by the cell-ppu target) which flags the first instruction in the replacement ("rldicl.") as being microcoded.
(define_insn ""
[(set (match_operand:CC 0 "cc_reg_operand" "=x,?y")
(compare:CC (zero_extend:DI (match_operand:QI 1 "gpc_reg_operand" "r,r"))
(const_int 0)))
(clobber (match_scratch:DI 2 "=r,r"))]
"TARGET_64BIT"
"@
rldicl. %2,%1,0,56
#"
[(set_attr "type" "compare")
(set_attr "microcode" "mc,*")
(set_attr "length" "4,8")])
The above snippet is written in RTL [gnu.org], which stands for Register Transfer Language
and is used to describe the processor specific assembly output in GCC.
Assembly-level transformations, such as peephole optimizations, are
also described in RTL. For an introduction to RTL see: Using and Porting the GNU Compiler Collection (GCC) [gnu.org] and Porting GCC For Dummies [axis.se]
Here is a partial list of the microcoded instructions explicitly flagged in the same file:
and. andi. andil. andis.
andiu. doz*. lhau lhaux
lm lmw lsi lswi
mr. mullw. muls. neg.
nor. or. rldic*. rlinm.
rlwinm. s*i. s*wi. sf.
sl sle. sli. slw
slwi. sr sre. srw
stm stmw stsi stswi
subf. subfc.
For the definitive list of microcoded instructions see the Cell Broadband Engine Programming Handbook [ibm.com]. The corresponding instructions have been added to the sections below.
Avoiding microcoded instructions
Microcoded instructions, such as load/store multiple, were designed to
save space in compiled code and offer no performance advantage over
using multiple instructions. Because of the way these instructions are
handled inside the processor, they might have a greater latency and
take longer to execute than a sequence of individual instructions that
produces the same results. Some compilers (gcc for example) have
options that prevent generation of these instructions.
Fortunately, there is a GCC flag that will warn the programmer if a known microcoded instruction is emitted. Simply add -mwarn-microcode to your compilation flags.
This flag is defined in rs6000.h:
{"warn-microcode", &rs6000_warn_microcode_switch, \
N_("Emitting warning of microcode") }, \
{"no-warn-microcode", &rs6000_warn_microcode_switch, "" }, \
And is processesed in rs6000.c:
/* Handle -m(no-)warn-microcode similarly. */
if (rs6000_warn_microcode_switch)
{
const char *base = rs6000_warn_microcode_switch;
while (base[-1] != 'm') base--;
if (*rs6000_warn_microcode_switch != '\0')
error ("invalid option `%s'", base);
rs6000_warn_microcode = (base[0] != 'n');
}
And is used in final.c:
#ifdef RS6000_GENERATE_MICROCODE /* 0 - notmicrocode, 1 - conditional
microcode, 2 - microcode */ if (rs6000_warn_microcode) { if
(get_attr_microcode(insn) == 2) pedwarn ("emitting microcode insn
%s\t[%s] #%d",template,
insn_data[INSN_CODE(insn)].name,INSN_UID(insn)); else if
(get_attr_microcode(insn) == 1) pedwarn ("emitting conditional
microcode insn %s\t[%s] #%d",template,
insn_data[INSN_CODE(insn)].name,INSN_UID(insn)); }
#endif
The other PowerPC specific compilation flags can also be found in these files.
The compiler source is the best source for
information on compiler flags and processor specific options. Some
flags do not make it into the help output.
Note that -mwarn-microcode is not in the gcc help list of flags.How does this affect code in practice? From the above list, there are three main classes of microcoded instructions to watch out for.
Avoid multiple load/store instructions
These
instructions are handy to load or store a small contiguous area of
memory. However, it will always be faster to simply load each
individual value into a register. GCC will not emit these instructions
if the -mno-multiple flag is passed to the compiler.List of microcoded load and store instructions, including load/store multiple. From: Cell Broadband Engine Programming Handbook [ibm.com]
|--------------------------------------------------------------------------------------------------------------------|
| Unconditionally Microcoded Loads and Stores |
|--------------------------------------------------------------------------------------------------------------------|
| |
| A microcode load or store operation can access an 8-bit byte or a 32-bit word, indicated as "by byte" or |
| "by word" respectively. |
| |
|--------------------------------------------------------------------------------------------------------------------|
| INSTRUCTION | CLASS | LATENCY | MICROWORD SIZE | COMMENT |
|---------------------|-----------------|----------|-----------------------------|-----------------------------------|
| lha | load algebraic | 11 | 7 | Handled by byte. |
| lhau | load algebraic | 11 | 8 | Handled by byte. |
| lhaux | load algebraic | 11 | 8 | Handled by byte. |
| lhax | load algebraic | 11 | 8 | Handled by byte. |
| lmw | load multiple | 11 |(2 + 1 × words) | This instruction is broken down |
| | | | | into a series of load words. |
| lswi | load string / | 10 | By word: | Optimized instruction[1] |
| | optimized | | (1 × words + 2 × bytes) | |
| | | | By byte: | |
| | | | (2 × bytes) | |
| lswx | load string / | By word: | By word: | Optimized instruction[1] |
| | optimized | 10 | 4 + (1 × words + 2 × bytes) | |
| | | By byte: | By byte: | |
| | | 7 | 4 + (2 × bytes) | |
| | | | | |
| | | | | |
| lwa | load algebraic | 11 | 13 | Handled by byte. |
| lwaux | load algebraic | 11 | 12 | Handled by byte. |
| lwax | load algebraic | 11 | 12 | Handled by byte. |
| stmw | store multiple | 11 | (2 + 1 × words) | Broken into a series of store |
| | | | | words. |
| stswi | store string / | 10 | By word: | |
| | optimized | | (1 × words + 2 × bytes) | Optimized instruction[1] |
| | | | By byte: | |
| | | | (2 × bytes) | |
| stswx | store string / | 7 | By word: | Optimized instruction[1] |
| | optimized | | 4 + (1 × words + 2 × bytes) | |
| | | | By byte: | |
| | | | 4 + (2 × bytes) | |
| | | | | |
|--------------------------------------------------------------------------------------------------------------------|
[1] The instruction is first broken down into a series of load-word instructions
(odd bytes are handled by byte). If this does not cause an alignment exception,
then the instruction is complete. If an alignment exception occurs, the first
attempt is flushed. When the instruction is returned to microcode it is then
handled a byte at a time. Odd bytes, if any, are defined as the remainder of
string_count / 4. For store instructions, it is a series of store words.
Avoid Condition Register recording integer instructions
Many of the integer functions, when the Condition Register (CR) modify
bit is set (denoted by a "dot" at the end of the instruction), are
microcoded. With this bit set, fixed-point instructions will
automatically set the first field (field zero) in the Condition
Register with the value's compare-with-zero result. For example, if the
result of the "or." instruction is greater than zero, the GT bit will
be set in CR[0].In general, this makes branching on integer expressions more expensive and an effort should be made to eliminate them.
List of CR recording microcoded instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]
|-------------------------------------------------|
| Unconditionally Microcoded Instructions |
| (CR recording) |
| |
| Record instructions are all handled the |
| same way. The "root" instruction is issued |
| followed by the cmpi_x instruction. |
| |
| The nonrecord form used in the microcode |
| sequence is only available to microcode. |
| |
|-------------------------------------------------|
| INSTRUCTION | LATENCY | MICROWORD SIZE |
|---------------------|---------|-----------------|
| and. | 11 | 2 |
| andc. | 11 | 2 |
| andi. | 11 | 2 |
| andis. | 11 | 2 |
| nand. | 11 | 2 |
| nor. | 11 | 2 |
| nego. | 11 | 2 |
| or. | 11 | 2 |
| orc. | 11 | 2 |
| xor. | 11 | 2 |
| cntlzd. | 11 | 2 |
| cntlzw. | 11 | 2 |
| divd. | 11 | 2 |
| divdo. | 11 | 2 |
| divdu. | 11 | 2 |
| divduo. | 11 | 2 |
| divw. | 11 | 2 |
| divwo. | 11 | 2 |
| divwu. | 11 | 2 |
| divwuo. | 11 | 2 |
| eqv. | 11 | 2 |
| extsb. | 11 | 2 |
| extsh. | 11 | 2 |
| extsw. | 11 | 2 |
| mulhd. | 11 | 2 |
| mulhdu. | 11 | 2 |
| mulhw. | 11 | 2 |
| mulhwu. | 11 | 2 |
| mulld. | 11 | 2 |
| mulldo. | 11 | 2 |
| mullw. | 11 | 2 |
| mullwo. | 11 | 2 |
| rldcl. | 11 | 5 |
| rldcr. | 11 | 5 |
| rldic. | 11 | 2 |
| rldicl. | 11 | 2 |
| rldicr. | 11 | 2 |
| rldimi. | 11 | 2 |
| rlwimi. | 11 | 2 |
| rlwinm. | 11 | 2 |
| rlwnm. | 11 | 5 |
| sld. | 11 | 5 |
| slw. | 11 | 5 |
| srad. | 11 | 5 |
| sradi. | 11 | 2 |
| sraw. | 11 | 5 |
| srawi. | 11 | 2 |
| srd. | 11 | 5 |
| srw. | 11 | 5 |
|---------------------|---------|-----------------|
Avoid indirect shift and rotate instructions
This is the simpliest case to find, but the hardest to eliminate:
int64_t right_shift64( int64_t a, int64_t sa )
{
return ( a >> sa );
}
This code will emit this deceptively simple function:
.right_shift_64:
srad 3,3,4
blr
And the following warning (if -mwarn-microcode is enabled):
test.c: In function `right_shift_64':
test.c:7: warning: emitting microcode insn srad%I2 %0,%1,%H2 [*ashrdi3_internal1] #20
The best option for eliminating indirect shift instructions is to know the range of possible shift amounts and create an alternate branch-free expression that selects between those choices.
List of indirect shift and rotate microcoded instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]
|--------------------------------------------------|
| Unconditionally Microcoded Instructions |
| (Shift and Rotate) |
| |
| All indirect shift and rotate instructions |
| are handled using the same technique. First |
| the mt_shr is issued, followed by two noops |
| for delay, followed by the root instruction |
| (that is, rldcl_sh). |
| |
|--------------------------------------------------|
| INSTRUCTION | LATENCY | MICROWORD SIZE |
|---------------------|---------|------------------|
| rldcl | 11 | 4 |
| rldcr | 11 | 4 |
| rlwnm | 11 | 4 |
| sld | 11 | 4 |
| slw | 11 | 4 |
| srad | 11 | 4 |
| sraw | 11 | 4 |
| srd | 11 | 4 |
| srw | 11 | 4 |
|--------------------------------------------------|
Non-Pipelined, Complex Instructions
In addition to microcoded instructions there is another class of low
performance instructions worth mentioning: the complex pipeline
instructions. These instructions are are not microcoded (i.e. the
resources already local to the execution pipeline can be used
directly), however they are complex enough that special handling is
required. In order for these instructions to be executed the
instruction pipeline must be evacuated (i.e. flushed). Therefore the
throughput of these instructions will be equal to the latency - They
will be slow.List of the non-pipelined instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]
|----------------------------------------------------|
| Non-Microcoded, Non-Pipelined Integer Instructions |
|---------|----------|-------------------------------|
| Instr. | Pipeline | Latency (cycles) |
|---------|----------|-------------------------------|
| mulli | FXU | 6 |
| mullw | FXU | 9 |
| mulhw | FXU | 9 |
| mulhwu | FXU | 9 |
| mullwo | FXU | 9 |
| mulld | FXU | 15 |
| mulhd | FXU | 15 |
| mulhdu | FXU | 15 |
| mulldo | FXU | 15 |
| divd | FXU | 10-70 |
| divdu | FXU | 10-70 |
| divdo | FXU | 10-70 |
| divduo | FXU | 10-70 |
| divw | FXU | 10-38 |
| divwu | FXU | 10-38 |
| divwo | FXU | 10-38 |
| divwuo | FXU | 10-38 |
|---------|----------|-------------------------------|
Note on divide instructions:
The fixed-point divide is a variable latency operation
that calculates RA and RB for word or doubleword and
signed or unsigned fixedpoint (integer) operands.
Division is defined by the following equation:
dividend = (quotient x divisor) + r
where: 0 = r < |divisor|,
when dividend = 0 and -|divisor| < r = 0,
when dividend < 0
Overflow is set when an attempt is made to compute
either the least negative integer divided by negative
one or any integer divided by zero.
The performance is determined by the number of
bits required to represent the result.
PPU cycles equal:
((1 setup)
+ (ceil ((rb leading digits - ra leading digits)/2)
+ 1 iterations)
+ (1 fixup)) × 2
word minimum = 10,
maximum = 38 cycles
doubleword minimum = 10,
maximum = 70 cycles
Overflow cases will complete in 10 cycles
There is no method to detect complex instructions emitted by the GCC compiler. Avoid integer multiplies and divides.
Good luck with that! ;)
Good luck with that! ;)
Summary
- Keep an eye out for microcoded instructions: use -mwarn-microcode in GCC.
- Don't use multiple load/store instructions: use -mno-multiple in GCC.
- Avoid CR recording integer instructions
- Avoid indirect shift and rotate instructions
- Avoid integer multiply and divide instructions
Eliminating microcoded and other non-pipelined instructions is
sometimes difficult and not always desireable (for example, when code
size is the determining factor in performance.) However, it is
important to know the penalty and make an informed choice. And as
always, if you are optimizing at this level, be sure to double-check
your results with a real profile on real hardware.