CellPerformance

Roundup: Recent sketches on concurrency, data design and performance.

2009-08-07T07:43:24Z

Recently I've been doing some presentations as well as just general sketches of some things I've been thinking about regarding optimization, concurrency and data design. I've been posting them on Twitter to gather feedback from my pals there. A couple have caused a little controversy, but remember that all of them are given in the simple spirit of sharing ideas among peers. And don't forget it's all in good fun!

Three Big Lies

2008-03-15T04:44:50Z

This is a repost of a blog entry I wrote for the Insomniac R&D site (Three Big Lies). It's representative of what I believe are some of the fundamental problems in the culture of software development in general, and games in particular. There are some fundamental truths that seem to be often forgotten. For example, that the point of any program is simply to transform data from one form into another and nothing else. When one "solution" that ignores the real core problems of development is adopted, and others are built on top of it over time, we're left with systems that are over-designed, perform poorly and simply do not accomplish what they were intended to in the first place - and certainly not well. I continue to suggest that we all take a step back from what we're doing and the methods we're using to solve problems and try to remember what the real issues are.

One of the things we talked about this year at GDC was what we called the "Three Big Lies of Software Development." How much programmers buy into these "lies" has a pretty profound effect on the design (and performance!) of an engine, or any high-performance embedded system for that matter.

(Lie #1) Software is a platform

I blame the universities for this one. Academics like to remove as many variables from a problem as possible and try to solve things under "ideal" or completely general conditions. It's like the old physicist jokes that go "We have made several simplifying assumptions... first, let each horse be a perfect rolling sphere..."

The reality is software is not a platform. You can't idealize the hardware. And the constants in the "Big-O notation" that are so often ignored are often the parts that actually matter in reality (memory performance, for example). You can't judge code in a vacuum. Hardware impacts data design. Data design impacts code choices. If you forget that, you have something that might work, but you aren't going to know if it's going to work well on the platform you're working with, with the data you actually have.
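As a rough illustration of the point about ignored constants (this sketch is not from the original talk, and the function names are made up for the example), the two functions below do exactly the same amount of arithmetic and have the same Big-O cost. The only difference is the order in which they touch memory. On most cached hardware the sequential version tends to be noticeably faster for large grids, but that is something to measure on the platform you actually have.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a rows x cols grid stored row by row in one contiguous array.
   The inner loop walks memory sequentially. */
uint64_t sum_row_major(const uint32_t *grid, size_t rows, size_t cols)
{
    uint64_t sum = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += grid[r * cols + c];
    return sum;
}

/* Same data, same result, same Big-O. The inner loop now jumps
   `cols` elements between accesses instead of reading neighbors. */
uint64_t sum_col_major(const uint32_t *grid, size_t rows, size_t cols)
{
    uint64_t sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += grid[r * cols + c];
    return sum;
}
```

Timing both on a grid much larger than the cache is a quick way to see how much the "constant" is worth on a given machine.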

(Lie #2) Code should be designed around a model of the world

There is no value in code being some kind of model or map of an imaginary world. I don't know why this one is so compelling for some programmers, but it is extremely popular. If there's a rocket in the game, rest assured that there is a "Rocket" class (assuming the code is C++) which contains data for exactly one rocket and does rockety stuff. With no regard at all for what data transformation is really being done, or for the layout of the data. Or for that matter, without the basic understanding that where there's one thing, there's probably more than one.

Though there are a lot of performance penalties for this kind of design, the most significant one is that it doesn't scale. At all. One hundred rockets cost one hundred times as much as one rocket. And it's extremely likely they cost even more than that! Even to a non-programmer, that shouldn't make any sense. Economy of scale. If you have more of something, it should get cheaper, not more expensive. And the way to do that is to design the data properly and group things by similar transformations.
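One way to picture "group things by similar transformations" is to keep all the rockets together and store each field in its own array, so one pass transforms every rocket. The sketch below is only an illustration under that assumption: `rocket_batch` and `rocket_batch_integrate` are hypothetical names, and the post does not prescribe this particular layout.

```c
#include <stddef.h>

/* Hypothetical layout: every rocket lives in the same batch,
   with one contiguous array per field. */
typedef struct {
    float  *pos_x, *pos_y, *pos_z;
    float  *vel_x, *vel_y, *vel_z;
    size_t  count;
} rocket_batch;

/* One transformation applied to all rockets at once:
   position += velocity * dt. */
void rocket_batch_integrate(rocket_batch *rockets, float dt)
{
    for (size_t i = 0; i < rockets->count; i++) {
        rockets->pos_x[i] += rockets->vel_x[i] * dt;
        rockets->pos_y[i] += rockets->vel_y[i] * dt;
        rockets->pos_z[i] += rockets->vel_z[i] * dt;
    }
}
```

Because the loop reads and writes plain arrays, the per-rocket overhead (calls, pointer chasing, scattered memory) is paid once per batch rather than once per rocket, which is where the economy of scale would come from.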

(Lie #3) Code is more important than data

This is the biggest lie of all. Programmers have spent untold billions of man-years writing about code, how to write it faster, better, prettier, etc. and at the end of the day, it's not that significant. Code is ephemeral and has no real intrinsic value. The algorithms certainly do, sure. But the code itself isn't worth all this time (and shelf space! - have you seen how many books there are on UML diagrams?). The code, the performance and the features hinge on one thing - the data. Bad data equals slow and crappy application. Writing a good engine means first and foremost, understanding the data.

Utility: match

2007-04-08T05:51:54Z

Update! I fixed up all the greater-than and less-than symbols in this entry. It didn't make much sense before. I always forget to change those up in the HTML.

I'm just sharing a little utility I use all the time called match.

Usage: ./match [-h]  

For each line in  print the index to the 
first matching line in .

[-h] Print results in 32 bit hexidecimal (default is decimal)

Note: The max line width supported is 4095 characters.
Note: Maximum number of lines supported is (2^32)

If I have a source file of data represented as text (as I often do because it's often easier for me to read binary dumps in a text editor than a special "hex editor"), I use match to create a table of indices to unique lines (often these correspond to 128 bits since that's the size of an SPU register).

I commonly use it like so (given I have a file called "source_file")

sort source_file | uniq > uniq_file
match source_file uniq_file

Now I have a handy table of indices!
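The core of what match does can be sketched in a few lines of C. This is not the contents of match.c, only a minimal in-memory illustration of the idea: the function names are invented for the example, indices are assumed to be zero-based, and -1 is an assumed marker for "no match".

```c
#include <stddef.h>
#include <string.h>

/* Return the index of the first entry in `table` equal to `line`,
   or -1 if there is none. A linear search is enough for a sketch. */
long first_match_index(const char *line,
                       const char *const *table, size_t table_count)
{
    for (size_t i = 0; i < table_count; i++)
        if (strcmp(line, table[i]) == 0)
            return (long)i;
    return -1;
}

/* For each source line, record the index of its first match in the table. */
void build_index_table(const char *const *source, size_t source_count,
                       const char *const *table, size_t table_count,
                       long *indices)
{
    for (size_t i = 0; i < source_count; i++)
        indices[i] = first_match_index(source[i], table, table_count);
}
```

With the lines of uniq_file as `table` and the lines of source_file as `source`, `indices` holds one table index per source line.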

Download: match.c

Handy PS3 Linux Framebuffer Utilities

2007-03-31T07:02:25Z

While the documentation within Sony's vsync example should be enough to get you started with writing to the framebuffer, here are a couple of handy functions to test the framebuffer settings, open the virtual terminal and get access to the framebuffer.

Open the virtual terminal:
cp_vt.h
cp_vt.c

Open the framebuffer:
cp_fb.h
cp_fb.c

Dump framebuffer info:
fb_info.c

Files should be compiled with:

ppu-gcc -std=c99 -pedantic -W -Wall -O3

fb_info

fb_info dumps the current settings for the framebuffer setup on the PS3.

For example - for 480i the output should look something like this:

FBIOGET_VBLANK:
  flags:
    FB_VBLANK_VBLANKING   : FALSE
    FB_VBLANK_HBLANKING   : FALSE
    FB_VBLANK_HAVE_VBLANK : FALSE
    FB_VBLANK_HAVE_HBLANK : FALSE
    FB_VBLANK_HAVE_COUNT  : FALSE
    FB_VBLANK_HAVE_VCOUNT : FALSE
    FB_VBLANK_HAVE_HCOUNT : FALSE
    FB_VBLANK_VSYNCING    : FALSE
    FB_VBLANK_HAVE_VSYNC  : TRUE
  count  : 0
  vcount : 1
  hcount : 0
-------------------------------------
FBIOGET_FSCREENINFO:
  id          : "PS3 FB"
  smem_start  : 0x00000000
  smem_len    : 18874368
  type        : FB_TYPE_PACKED_PIXELS (0)
  type_aux    : N/A
  visual      : FB_VISUAL_TRUECOLOR (2)
  xpanstep    : 1
  ypanstep    : 1
  ywrapstep   : 1
  line_length : 2880
  mmio_start  : 0x00000000
  mmio_len    : 0
  accel       : FB_ACCEL_NONE (0)
-------------------------------------
PS3FB_IOCTL_SCREENINFO:
    xres        : 720
    yres        : 480
    xoff        : 72
    yoff        : 48
    num_frames  : 2
-------------------------------------

Using cp_vt and cp_fb

These functions are very simple to use. The user running them should have read/write access to the framebuffer (/dev/fb0) and the main console (/dev/console).

{
    cp_vt vt;
    cp_fb fb;

    cp_vt_open_graphics(&vt);
    cp_fb_open(&fb);

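    uint32_t frame_ndx = 0;

    // Example values so the snippet is complete: one white pixel at the top left
    const uint32_t x     = 0;
    const uint32_t y     = 0;
    const uint32_t rgb24 = 0x00ffffff;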

    while (1)
    {
        uint32_t* const restrict frame_top = (uint32_t*)fb.draw_addr[ frame_ndx ];

        // Write pixel to the frame buffer ...
        // x and y are image position
        // rgb24 is 32bit pixel value (where top 8 bits are unused)

        frame_top[ ( y * fb.stride ) + x ] = rgb24;

        // At the vsync, the previous frame is finished sending to the CRT
        cp_fb_wait_vsync( &fb );

        // Send the frame just drawn to the CRT by the next vblank
        cp_fb_flip( &fb, frame_ndx );

        frame_ndx  = frame_ndx ^ 0x01;
    }

    cp_vt_close(&vt);
    cp_fb_close(&fb);
}

A more complete example: fb_test.c
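The arithmetic used in the loop above can be pulled out into small helpers. This is a sketch under a few assumptions that are not confirmed by the cp_fb sources: red is taken to sit in bits 16-23, green in bits 8-15 and blue in bits 0-7, and the stride is taken to be measured in pixels (consistent with the 480i dump, where line_length is 2880 bytes, or 720 pixels of 4 bytes each). The helper names are invented for the example.

```c
#include <stdint.h>

/* Pack 8-bit channels into a 32-bit pixel. The top 8 bits stay zero (unused).
   Assumed channel order: red 16-23, green 8-15, blue 0-7. */
uint32_t pack_rgb24(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Index of pixel (x, y) in a frame whose rows are `stride` pixels apart. */
uint32_t pixel_index(uint32_t x, uint32_t y, uint32_t stride)
{
    return (y * stride) + x;
}

/* Alternate between frame 0 and frame 1 for double buffering. */
uint32_t next_frame(uint32_t frame_ndx)
{
    return frame_ndx ^ 0x01;
}
```

With these, the write in the loop would read `frame_top[ pixel_index( x, y, fb.stride ) ] = pack_rgb24( r, g, b );`.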

HowTo: Huge TLB pages on PS3 Linux

2007-01-30T08:23:08Z

Updated! (22 Mar 07) Minor edits. Added notes for YellowDog Linux. Added source code for using huge page allocation.
Updated! (30 Mar 07) A couple minor fixes. Thanks to Guénaël Renault for pointing them out!
Updated! (15 July 07) Added notes for kernel 2.6.21

Guest article: Understanding the TLB and minimizing misses is a critical part of high performance Cell programming. Unfortunately some PS3 kernels do not come with huge page support enabled. Jakub Kurzak and Alfredo Buttari step through the details of recompiling the kernel for huge page support.

The availability of huge TLB pages depends on the way the Linux kernel was configured prior to compilation. The default kernel that ships with Fedora Core 5 (and most likely any other distribution that has binary kernel packages) doesn't include this option. So, in order to have huge TLB pages, it is necessary to reconfigure the kernel, recompile it, and tell the boot loader about the newly created kernel image. Finally we will also show a way to allocate the huge TLB pages automatically at boot time.

[Mike Acton] This process also works with YellowDog Linux virtually unchanged.

Rebuilding the PS3 Linux Kernel

[Mike Acton] For more detailed information on the Linux Kernel and the build process, see:

The Linux Kernel HOWTO [faqs.org]
PS3 Linux Distributor's Starter Kit [kernel.org]
Also see: Building an Updated Kernel for PS3 [julipedia.blogspot.com]
Also see: PS3 NFS Root File System HOWTO by Geoff Levand (PS3 kernel maintainer)

[Mike Acton] For more information on using huge TLB pages, especially from user space, read hugetlbpage.txt, which is found in the kernel source under Documentation/vm/.

Here are the steps:

Recompile the kernel in order to have huge TLB pages
1. Take the kernel source from the add-on cd (filename is linux-20061110.tar.bz2)
  [Mike Acton] Download the PS3 Source Add-On CD [qj.net].
  [Mike Acton] A more recent kernel and its sources (2.6.21 as of this update) can be found in the more recent Add-on disc package (CELL-Linux-CL_20070516-ADDON), which is available on various Linux mirrors.
2. unpack it in the /usr/src directory
3. make a link:
```
	$ ln -s /usr/src/linux-20061110 /usr/src/linux
```
  [Mike Acton] For Linux 2.6.21:
```
	$ ln -s /usr/src/linux-2.6.21-20070425 /usr/src/linux
```
4. prepare for kernel configuration:
  [Mike Acton] For Linux 2.6.21:
  To build a more recent kernel you will need to install a few things first:
  1. AsciiDoc. Download: asciidoc-8.2.1.tar.gz [methods.co.nz]
  2. xmlto. Download: xmlto-0.0.18.tar.bz2 [cyberelk.net]
  3. git, a revision control system. Download: git 1.5.2 [kernel.org]
```
$ cd /usr/src
$ tar xzvf git-1.5.2.tar.gz
$ cd git-1.5.2
$ make prefix=/usr all doc
$ make prefix=/usr install install-doc 
```
  4. dtc (Device Tree Compiler) NOTE: To build the kernel, you need a version newer than the dtc-20060419.tar.gz version available on the dtc web page.
```
$ cd /usr/src
$ git clone git://www.jdl.com/software/dtc.git 
$ cd dtc
$ make
$ make install
```
5. [Mike Acton] Run mrproper before make to clean out any older build data, if there is any.
```
$ make mrproper
```
6. copy the kernel config file that comes with the fedora installation into /usr/src/linux
```
$ cp /boot/config-2.6.16 /usr/src/linux/.config
```
  [Mike Acton] On YellowDog Linux, this file is /boot/config-2.6.16-20061110.ydl.1ps3
  [Mike Acton] For Linux 2.6.21:
  The config file has been updated significantly since the original 2.6.16 release. It's much easier to start with the file included in the kernel distribution.
```
$ cd /usr/src/linux
$ cp arch/powerpc/configs/ps3_defconfig .config
```
7. This next step goes through the old configuration file and prompts the user whenever a new kernel option that is not present in the old kernel is encountered (none in this case since the old and the new kernels are exactly the same version)
```
$ make oldconfig
```
  [Mike Acton] For Linux 2.6.21: There's no need for this step if you copied the file from the kernel distribution itself.
8. enable huge TLB pages in the kernel configuration
```
$ make menuconfig
```
  Now go to File systems --> Pseudo filesystems and enable huge TLB pages by pressing the space bar on the "HugeTLB file system support" option. Now select "exit" repeatedly and answer "yes" when asked to save the new kernel configuration
9. compile kernel and modules and install modules (it will take around 20 minutes):
```
$ make all
$ make modules_install
```
install the new kernel:
[Mike Acton] For Linux 2.6.21: Replace references to 2.6.16 with 2.6.21 in this and the following steps.
```
$ cp /usr/src/linux/vmlinux /boot/vmlinux-2.6.16_HTLB
```
create a ramdisk image for the new kernel:
```
$ mkinitrd /boot/initrd-2.6.16_HTLB.img 2.6.16
```
[Mike Acton] On Yellowdog Linux, mkinitrd lives in /sbin.
[Mike Acton] For Linux 2.6.21:
"When I do mkinitrd, it says: No modules available for kernel "2.6.21". What's up?"

The problem is that this version of the kernel isn't installed as "2.6.21", it's installed as "2.6.21-rc7". You can discover that by looking in /lib/modules:
```
$ ls -l /lib/modules
total 16
drwxr-xr-x 3 root root 4096 Mar 22 05:57 2.6.16
drwxr-xr-x 5 root root 4096 Jan 19 06:06 2.6.16-20061110.ydl.1ps3
drwxr-xr-x 3 root root 4096 Jul 15 08:24 2.6.20
drwxr-xr-x 3 root root 4096 Jul 17 06:22 2.6.21-rc7
```
So the actual command you need to run is:
```
$ mkinitrd /boot/initrd-2.6.21_HTLB.img 2.6.21-rc7
```

tell the bootloader (kboot) where the new kernel is:

$ vim /etc/kboot.conf

add the following line

linux_htlb='/boot/vmlinux-2.6.16_HTLB initrd=/boot/initrd-2.6.16_HTLB.img'
[Mike Acton] For YellowDog Linux, use:
ydl_htlb      ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \
root=/dev/sda2 init=/sbin/init video=ps3fb:mode:3 rhgb'
ydl480i_htlb  ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \
root=/dev/sda2 init=/sbin/init video=ps3fb:mode:1 rhgb'
ydl1080i_htlb ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \
root=/dev/sda2 init=/sbin/init video=ps3fb:mode:4 rhgb'
ydltext_htlb  ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \
root=/dev/sda2 init=/sbin/init 3'

if you want this kernel to be loaded by default then change the "default" line into

default=linux_htlb
[Mike Acton] For YellowDog Linux, use one of the modes above.

Instruct the boot process to allocate huge TLB pages. (Pick one of the following two options.)

OPTION 1:

$ vim /etc/rc.local

add the following lines:

mkdir -p /huge
echo 20 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs nodev /huge
chown root:root /huge
chmod 755 /huge

be sure to change the "chown" line according to your system settings.

OPTION 2: create a /etc/init.d/htlb script with the following content:

All the commands added to the rc.local file in the previous option are executed at the end of the boot sequence. This means that the huge TLB page allocation is performed when much of the system memory has already been allocated by other processes, which results in the allocation of only 6 or 7 pages. In order to obtain a few more pages (8 or 9) we have to move the huge TLB page allocation earlier in the boot sequence (i.e. at runlevel 1).

[Mike Acton] chkconfig required some additional settings not in the previous version of this script. Modified version is here:

	#!/bin/sh
	#
	# htlb:	Start/stop huge TLB pages allocation
	#
	# [Mike Acton] The runlevel and priority settings for chkconfig are stolen straight out of cpuspeed.

	# chkconfig: 12345 06 99
	# description: Start/stop huge TLB pages allocation

	. /etc/rc.d/init.d/functions

	start()
	{
	    mkdir -p /huge
	    echo 20 > /proc/sys/vm/nr_hugepages
	    mount -t hugetlbfs nodev /huge
	    chown root:root /huge
	    chmod 775 /huge
	}

	stop()
	{
	    echo 0 > /proc/sys/vm/nr_hugepages
	}

	case "$1" in
	  start)
	    start
	    ;;
	  stop)
	    stop
	    ;;
	  restart|reload)
	    stop
	    start
	    ;;
	  *)
	    # "status" is not implemented, so it is not advertised here
	    echo $"Usage: $0 {start|stop|restart|reload}"
	    exit 1
	    ;;
	esac

	exit 0

Make the new service executable:

$ chmod a+x /etc/init.d/htlb

Add the service to runlevel-1:

$ /sbin/chkconfig --add htlb

Reboot. During the boot process, when presented with the "kboot:" prompt, you'll be able to choose your kernel using the "tab" key.

[Mike Acton] Validate that huge pages are now installed and working by:

$ cat /proc/meminfo | grep Huge

You should see something like:

HugePages_Total:     8
HugePages_Free:      8
Hugepagesize:    16384 kB

and...

$ cat /proc/filesystems  | grep huge

You should see something like:

nodev   hugetlbfs
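The same check can be done from a program by scanning the text of /proc/meminfo for the fields shown above. The function below is a minimal sketch with an invented name (`meminfo_field`). It works on text that is already in memory, so the caller is assumed to have read the file into a NUL-terminated buffer first.

```c
#include <stdio.h>
#include <string.h>

/* Find a line starting with "<key>:" in /proc/meminfo-style text and parse
   the integer that follows it. Returns 1 on success, 0 if the key is missing
   or no number follows it. */
int meminfo_field(const char *text, const char *key, long *value)
{
    size_t      key_len = strlen(key);
    const char *line    = text;

    while (line && *line) {
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':')
            return sscanf(line + key_len + 1, "%ld", value) == 1;

        line = strchr(line, '\n');
        if (line)
            line++;   /* step past the newline to the next line */
    }
    return 0;
}
```

For the output shown above, asking for "HugePages_Total" yields 8 and "Hugepagesize" yields 16384 (the unit, kB, is not parsed).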

[Mike Acton] Here are some helper functions for allocating and freeing huge memory:

cp_hugemem.c
cp_hugemem.h

They are very simple to use:

{
    // Allocate...
    const size_t  hmem_size = 128 * 1024 * 1024;
    cp_hugemem    hmem;

    int was_hugemem_allocated = cp_hugemem_alloc( &hmem, hmem_size );
    if ( !was_hugemem_allocated )
    {
        fprintf(stderr,"Error: Could not allocate hugemem\n");
        return (-1);
    }

    // Use the memory...
    char* ptr = (char*)hmem.addr;

    // Free...
    cp_hugemem_free( &hmem );
}
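For readers curious about what sits underneath helpers like these, the usual user-space route (the one described in the kernel's huge page documentation) is to create a file on the hugetlbfs mount and mmap it. The sketch below shows that general technique only. It is not the contents of cp_hugemem.c, the function names are invented, and the example path "/huge/example" simply assumes the /huge mount point set up earlier.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Round `size` up to a whole number of pages of `page_size` bytes.
   Mappings on hugetlbfs should be a multiple of the huge page size
   (16384 kB in the /proc/meminfo output above). */
size_t round_up_to_page(size_t size, size_t page_size)
{
    return ((size + page_size - 1) / page_size) * page_size;
}

/* Map `size` bytes backed by a file on a hugetlbfs mount, for example
   "/huge/example" (a hypothetical file name). Returns NULL on failure. */
void *hugetlbfs_map(const char *path, size_t size)
{
    int fd = open(path, O_CREAT | O_RDWR, 0755);
    if (fd < 0)
        return NULL;

    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    close(fd);   /* the mapping remains valid after the descriptor is closed */
    return (addr == MAP_FAILED) ? NULL : addr;
}
```

When the memory is no longer needed, the caller would munmap the region and remove the backing file. Note that the 128 MB requested in the example above is exactly 8 huge pages of 16 MB, which matches the HugePages_Total value shown earlier.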

About the Authors

Jakub Kurzak AKA Koobas is a researcher at the University of Tennessee, Knoxville, and a member of the Innovative Computing Lab (ICL - http://icl.cs.utk.edu/), where he mostly does things related to programming multi-core processors and the Cell processor. Before that he was a student at the University of Houston, where he dealt with programming distributed memory machines using message passing (MPI). Jakub's interests are in parallel programming techniques (message passing, multi-threading), parallel number crunching algorithms, and performance optimization.

Alfredo Buttari is a research associate at the Computer Science department of the University of Tennessee, Knoxville. Alfredo is a member of the Innovative Computing Laboratory, which deals with many aspects of High Performance Computing. His interests are in developing high performance software for Linear Algebra, which is mostly achieved through parallel programming techniques of all sorts (MPI, OpenMP, threads...), including more exotic approaches like the Cell programming model. Before coming to Tennessee, Alfredo earned a PhD and a Master's degree in Computer Science from the "Tor Vergata" University of Rome (Italy).

Cross-compiling for PS3 Linux

2006-11-30T05:22:09Z

Now that the PS3 is out and multiple Linux-based distributions are available which can be installed using Open Platform [playstation.com], it's time to start developing on some publicly available hardware!

Although the PPU and SPU compilers can be installed and used on the PS3 directly, I find it much more familiar and convenient to cross-compile from my desktop and just ship the resulting executables over to the target (PS3).

In this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

Install Linux

I have successfully compiled and run using both Yellow Dog Linux [terrasoftsolutions.com] and Fedora Core [redhat.com].

This article assumes that Linux is already installed on the PS3. It's very easy to install and the process is already documented quite well.

Carl Bender over at PS3PC.net has written a very good guide on Installing Fedora 5 Linux on Your PS3 [linuxps3.net]

See also: Installation Guide for Yellow Dog Linux [terrasoftsolutions.com]
See also: Installation Guide for Fedora Core 5
See also: Linux on the Playstation 3 Wiki [pslinux.org]
See also: Installing Gentoo on the PS3 [daniel.jp]

NOTE: For the sake of this article, Yellow Dog Linux 5 (32 bit version for PS3) will be assumed. A 32 bit host PowerPC Fedora Core 5 installation will also be assumed (Although 64 bit and x64 versions of the libraries are available for other types of hosts.)

cat /proc/cpuinfo (For the Target PS3)

processor : 0
cpu : Cell Broadband Engine, altivec supported
clock : 3192.000000MHz
revision : 5.1 (pvr 0070 0501)

processor : 1
cpu : Cell Broadband Engine, altivec supported
clock : 3192.000000MHz
revision : 5.1 (pvr 0070 0501)

timebase : 79800000
machine : PS3PF

cat /proc/interrupts (For the Target PS3)

 CPU0 CPU1
 10: 19437 0 PS3PF irq controller Edge ehci_hcd:usb1
 11: 20767742 0 PS3PF irq controller Edge ehci_hcd:usb2
 16: 0 0 PS3PF irq controller Edge ohci_hcd:usb3
 17: 0 0 PS3PF irq controller Edge ohci_hcd:usb4
128: 0 574866 PS3PF irq controller Edge IPI0 (call function)
129: 0 3024105 PS3PF irq controller Edge IPI1 (reschedule)
130: 0 0 PS3PF irq controller Edge IPI2 (unused)
131: 0 0 PS3PF irq controller Edge IPI3 (debugger break)
132: 555759 0 PS3PF irq controller Edge IPI0 (call function)
133: 2998857 0 PS3PF irq controller Edge IPI1 (reschedule)
134: 0 0 PS3PF irq controller Edge IPI2 (unused)
135: 0 0 PS3PF irq controller Edge IPI3 (debugger break)
136: 0 0 PS3PF irq controller Edge Virtual UART
137: 0 0 PS3PF irq controller Edge spe00.0
138: 1 0 PS3PF irq controller Edge spe00.1
139: 7 0 PS3PF irq controller Edge spe00.2
140: 0 0 PS3PF irq controller Edge spe01.0
141: 2 0 PS3PF irq controller Edge spe01.1
142: 6 0 PS3PF irq controller Edge spe01.2
143: 0 0 PS3PF irq controller Edge spe02.0
144: 2 0 PS3PF irq controller Edge spe02.1
145: 6 0 PS3PF irq controller Edge spe02.2
146: 0 0 PS3PF irq controller Edge spe03.0
147: 2 0 PS3PF irq controller Edge spe03.1
148: 13 0 PS3PF irq controller Edge spe03.2
149: 0 0 PS3PF irq controller Edge spe04.0
150: 2 0 PS3PF irq controller Edge spe04.1
151: 13 0 PS3PF irq controller Edge spe04.2
152: 0 0 PS3PF irq controller Edge spe05.0
153: 1 0 PS3PF irq controller Edge spe05.1
154: 9 0 PS3PF irq controller Edge spe05.2
155: 27210328 0 PS3PF irq controller Edge ps3fb vsync
156: 1809885 0 PS3PF irq controller Edge PS3PF stor
157: 387328 0 PS3PF irq controller Edge PS3PF stor
158: 65 0 PS3PF irq controller Edge PS3PF stor
159: 1509 0 PS3PF irq controller Edge snd_ps3pf
160: 0 78885 PS3PF irq controller Edge gbec connection
BAD: 0

Install elfspe2 and libspe2 on PS3

elfspe2 allows SPU executables to be run standalone from the commandline (aka spulets)
libspe2 is a PPU library for launching and communicating with SPU executables.

1. Copy the following files to the PS3. These files can be found on the PS3 Linux Add-On Packages CD in the spu directory.

libspe2-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm
elfspe2-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm

2. As root, rpm -ivh *.rpm

Install toolchain on host PC

I am using Fedora Core 5 installed on a PowerPC Mac Mini as my host machine for PS3 development. Working from a PowerPC platform is extremely convenient. However, all of the following libraries are also either available as i686 packages or can be recompiled for the i686 platform if you prefer that.

cat /proc/cpuinfo (For the Host PC)

processor : 0
cpu : 7447A, altivec supported
clock : 1249.999995MHz
revision : 0.2 (pvr 8003 0102)
bogomips : 83.20
timebase : 41620997
machine : PowerMac10,1
motherboard : PowerMac10,1 MacRISC3 Power Macintosh
detected as : 287 (Mac mini)
pmac flags : 00000010
L2 cache : 512K unified
pmac-generation : NewWorld

1. Copy the following files to the host PC. These files can be found at Barcelona Supercomputing Center, Linux on Cell [bsc.es] under Programming Models -> Linux on Cell -> Cell BE Components -> GNU Toolchain.

ppu-binutils-3.2-4.ppc.rpm
ppu-gcc-3.2-4.ppc.rpm
ppu-gcc-c++-3.2-4.ppc.rpm
ppu-toolchain-3.2-4.src.rpm
ppu-toolchain-debuginfo-3.2-4.ppc.rpm
spu-binutils-3.2-6.ppc.rpm
spu-gcc-3.2-6.ppc.rpm
spu-gcc-c++-3.2-6.ppc.rpm
spu-newlib-1.14.0.200610300000-1.ps3pf.ppc.rpm
spu-toolchain-3.2-6.src.rpm
spu-toolchain-debuginfo-3.2-6.ppc.rpm

2. As root, rpm -ivh *.rpm

Install libspe2 on host PC

1. Copy the following files to the host PC. These files can be found on the PS3 Linux Add-On Packages CD in the spu directory.

libspe2-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm
libspe2-devel-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm

2. As root, rpm -ivh *.rpm

Building Hello World (for libspe2)

1. On the host PC, compile the example:

ppu-gcc -m32 ppu_hello.c -lspe2 -o ppu_hello
spu-gcc spu_hello.c -o spu_hello

NOTE: If the 64 bit support headers and libraries are installed on the host the -m32 can be omitted from the PPU compilation step.

2. Copy the two executables to the PS3.
3. To execute spu_hello using libspe2, just run ./ppu_hello
4. To execute spu_hello using elfspe2, just run ./spu_hello directly.

Hello World source (for libspe2)

ppu_hello.c

#include <stdlib.h>
#include <libspe2.h>

int
main()
{
  unsigned int          createflags = 0;
  unsigned int          runflags    = 0;
  unsigned int          entry       = SPE_DEFAULT_ENTRY;
  void*                 argp        = NULL;
  void*                 envp        = NULL;

  // Open the SPU executable and create a context to run it in
  spe_program_handle_t* program     = spe_image_open("spu_hello");
  spe_context_ptr_t     spe         = spe_context_create(createflags, NULL);
  spe_stop_info_t       stop_info;

  // Load the program into the context and run it to completion
  spe_program_load(spe, program);
  spe_context_run(spe, &entry, runflags, argp, envp, &stop_info);
  spe_image_close(program);
  spe_context_destroy(spe);

  return (0);
}

spu_hello.c

#include <stdio.h>

int
main( unsigned long spuid )
{
  printf("Hello, World! (From SPU:%lu)\n",spuid);
  return (0);
}

Using the IBM SDK

The IBM SDK uses libspe, not libspe2, so in order to build the IBM libraries and samples, libspe must be installed.

What is the difference between libspe and libspe2? Will both continue to be used?

libspe2 is a re-design of libspe. The folks at IBM have strongly implied that libspe is on its way out and we should expect a future revision of the SDK to be refactored for libspe2.

Roland (RSei) gave an excellent description of reasoning behind the design of libspe2 in IBM's Cell Broadband Engine Architecture forum:

"There have been a number of requirements and issues with libspe1 that led to the design of a new major version with a different API. I'll try to explain a few major aspects just briefly:

1. libspe is supposed to be the "low-level API" to use SPE resources. We think that the "SPE context" introduced in libspe2 is the better low-level construct than the "SPE thread" (as defined in libspe1), which already suggests a particular programming model and view. By using "SPE contexts", it is, e.g., possible to have other models like (synchronous) function offload to SPEs more easily without introducing the complexity and overhead of threading into an application. Another example is the possibility to exchange the code on an SPE, but leaving the data in place, which allows for easy and efficient "chaining" of processing steps und PPE control. In the thread model, this would have to rely on SPE programs using overlays. By the way, it is very easy to have the libspe1 thread model as a special case implemented on top of libspe2 and we have actually done this exercise internally.

2. Many people asked for a more complete "SPE thread library" (similar to what you usually have, e.g., in pthread). By removing the special concept of an "SPE thread" (in the libspe1 sense), we are actually addressing this requirement. When using libspe2, the programmer relies on the thread package of choice and just uses SPEs in these threads. All thread-specific aspects of the application are standard - so you have full functionality.
3. There were many complaints about the event API in libspe1 - from usability to efficiency. We think, we found a good solution in libspe2.

4. We feel that the "SPE groups" in libspe1 were tieing together rather orthogonal concepts like scheduling and event handling. So we gave up this construct. You may have noticed that we introduced "SPE gang contexts" and you have probably already guessed that we are working on gang scheduling to leverage this - but "gangs" are purely a scheduling construct and do *not* replace the previous groups.

5. You are right that binding threads to specific, physical SPEs has been part of the libspe1 API, although it had never been implemented. There are many discussions about this feature. At this point, we don't have a conclusive answer how we want to support "affinity" of threads to physical SPE resources. We simply felt we are not ready yet to define the API and stick to it in the future."

1. Copy the following files to the host PC. These files can be found at Barcelona Supercomputing Center, Linux on Cell [bsc.es] under Programming Models -> Linux on Cell -> Cell BE Components -> GNU Toolchain.

libspe-1.1.0-1.ppc.rpm
libspe-debuginfo-1.1.0-1.ppc.rpm
libspe-devel-1.1.0-1.ppc.rpm

2. As root, rpm -ivh *.rpm

3. Copy the libspe libraries from the Host PC at /usr/lib/libspe.so.* to /usr/lib/ on the PS3.
4. Copy the following file onto the host PC. This file can be found at IBM alphaWorks' IBM Cell Broadband Engine Software Development Kit download page. You will need to agree to the licenses in order to download the file.

cell-sdk-lib-samples-1.1-10.noarch.rpm

5. As root, rpm -ivh cell-sdk-lib-samples-1.1-10.noarch.rpm. The source files should now be installed in /opt/IBM/cell-sdk-1.1.
6. Only minor modifications are needed to cross-compile the SDK.

cd /opt/IBM/cell-sdk-1.1
Open make.footer

Search for (starting at line 84 in my copy):

########################################################################
# Common GNU Defines (Host, PPU32, PPU64, SPU)
########################################################################

Delete the following section (starting at line 91 in my copy):

ifeq "$(HOST_PROCESSOR)" "ppc64"
 SCE_ROOT =
 SCE_SYSROOT =
 SCE_PPU_BINDIR = /usr/bin
 SCE_SPU_BINDIR = /usr/bin
 PPU_TOOL_PREFIX =
 PPU32_TOOL_PREFIX =
else
 # SCE_VERSION is defined in environment or in make.env
 SCE_ROOT = /opt/sce/$(SCE_VERSION)
 SCE_SYSROOT = $(SCE_ROOT)/ppu/sysroot
 SCE_PPU_BINDIR = $(SCE_ROOT)/ppu/bin
 SCE_SPU_BINDIR = $(SCE_ROOT)/spu/bin
 PPU_TOOL_PREFIX = $(PPU_PREFIX)
 PPU32_TOOL_PREFIX = $(PPU32_PREFIX)
endif

Insert the following section at the same location:

 SCE_ROOT =
 SCE_SYSROOT =
 SCE_PPU_BINDIR = /usr/bin
 SCE_SPU_BINDIR = /usr/bin

If 64 bit support is not installed, search for (line 150 in my copy):
```
#********************
# 64-bit PPU Targets
#********************
```

If 64 bit support is not installed, delete the following lines:

PPU64_TARGETS := $(strip $(PROGRAM_ppu64) \
 $(PROGRAMS_ppu64) \
 $(LIBRARY_ppu64) \
 $(SHARED_LIBRARY_ppu64))

ifdef PPU64_TARGETS
 TARGET_PROCESSOR := ppu64
endif

Save the changes

7. If GLUT is not installed on the host PC, install it (for Fedora-based hosts) with yum install freeglut-devel
8. The SDK and samples should now build without errors: cd src; make (although quite a few warnings will be generated; there is a bit of non-standards-compliant code in the SDK which should be fixed).
9. Copy the following files from the host PC to the target PS3's /usr/lib directory.

/opt/IBM/cell-sdk-1.1/src/lib/matrix/ppu_shared/libmatrix.so
/opt/IBM/cell-sdk-1.1/src/lib/image/ppu_shared/libimage.so
/opt/IBM/cell-sdk-1.1/src/lib/vector/ppu_shared/libvector.so
/opt/IBM/cell-sdk-1.1/src/lib/surface/ppu_shared/libsurface.so
/opt/IBM/cell-sdk-1.1/src/lib/noise/ppu_shared/libnoise.so
/opt/IBM/cell-sdk-1.1/src/lib/fft/ppu_shared/libfft.so
/opt/IBM/cell-sdk-1.1/src/lib/gmath/ppu_shared/libgmath.so
/opt/IBM/cell-sdk-1.1/src/lib/math/ppu_shared/libmath.so
/opt/IBM/cell-sdk-1.1/src/lib/misc/ppu_shared/libmisc.so
/opt/IBM/cell-sdk-1.1/src/lib/audio_resample/ppu_shared/libaudio_resample.so

10. Now anything built with the IBM SDK should be able to run on the PS3.

Access the PS3 Over VNC

I have two PlayStation 3 units and only one HDMI input on my HD TV, and that one is going to be used for game playing, not developing. So the PS3 I use for development is headless. The vast majority of the time I can accomplish everything I need with a simple secure shell to the PS3. But occasionally I want to use the machine as though I were local, and that is what VNC is for.

How to set up VNC on the PS3 (for Yellow Dog Linux):
1. Secure shell from the host to the PS3 with X11 using: ssh -X [PS3_IP_ADDRESS]
2. On the PS3, launch the firewall security settings application using: system-config-securitylevel. At this point you will need to enter the root password for the PS3.
3. Click on "Other ports", then "+ Add" and add port 5901 (TCP). A VNC server on display N listens on TCP port 5900 + N, so the default display 1 uses port 5901. This will allow the VNC connection through the firewall running on the PS3. Go ahead and close the application.
4. On the PS3, run the VNC server using: vncserver. If this is the first time you've run the server, you will need to provide a password that will be used to access the machine.
5. On the host PC, start the VNC client using: vncviewer [PS3_IP_ADDRESS]:[DISPLAY_NUMBER]. The display number was printed when the server was started. It defaults to 1 (ONE).
6. After you enter the password, you should now see the PS3 window manager running with an open shell by default.
7. In order to kill the VNC server use: vncserver -kill :[DISPLAY_NUMBER]
8. In order to use the default Yellow Dog window manager (Enlightenment), uncomment the following lines in ~/.vnc/xstartup on the PS3 and restart the server.

unset SESSION_MANAGER
exec /etc/X11/xinit/xinitrc

The only real practical difference between using the PS3 over VNC and using it locally will be if you are writing graphics to the framebuffer. These effects will only display over the locally connected display.

Upgrade libspe and libspe2

The official releases of libspe and libspe2 that were available at launch have some minor issues that were patched recently. Both libraries are being actively developed and there will always be new patches available for brave developers. There is a cumulative version available through December 6.

To build and install the latest version:

1. Download the following files from [Cbe-oss-dev] libspe and libspe2 december release to the Host PC.

libspe-1.2.0.tar.gz
libspe2-2.0.1.tar.gz

The files will probably need to be renamed locally after download.
2. Untar the two files with:

tar xzvf libspe-1.2.0.tar.gz
tar xzvf libspe2-2.0.1.tar.gz

3. In the libspe2-2.0.1 directory, open the make.defines file, and change the equivalent section to be:

ifeq "$(CROSS_COMPILE)" "1"
SYSROOT ?= sysroot
prefix ?= /usr
CROSS ?= ppu-
EXTRA_CFLAGS = -m32 -mabi=altivec
else

4. Save the file, then apply the patches for speevent using:

patch -p1 < initevent.diff
patch -p1 < event-public.diff
patch -p1 < make_speevent_thread_safe.diff

5. Build the library using: make; make install
6. Copy all the files (recursively) in the libspe2-2.0.1/sysroot/usr/ directory to the /usr/ directory on the PS3 and the Host PC.
7. In the libspe-1.2.0 directory, open the Makefile file, and change the equivalent section to be:

ifeq "$(CROSS_COMPILE)" "1"
SYSROOT ?= sysroot
prefix ?= /usr
CROSS ?= ppu-
EXTRA_CFLAGS = -m32 -mabi=altivec
else

8. Save the file, then build the library using: make; make install
9. Copy all the files (recursively) in the libspe-1.2.0/sysroot/usr/ directory to the /usr/ directory on the PS3 and the Host PC.

Congratulations, libspe-1.2.0 and libspe2-2.0.1 are now installed on the PS3 and will be used by any applications which are dynamically linked to either of those libraries.

Special thanks to Dirk Herrendoerfer for both making the release available and for answering my questions on the build procedures.

atan2 on SPU

2006-09-13T04:35:40Z

On 2006 March 03 on the IBM developerWorks Cell Broadband Engine Architecture forum [ibm.com] an interesting question was asked:

"I am trying to port an application from an older version of SDK to SDK 1.0. It uses atan2(.....) function, which is causing trouble... This code worked fine on SDK28, but now it looks like the new functions dont have this particular function defined..
I did change the makefile to include $(SDKLIB)/libmath.a

I searched in ./sysroot/usr/spu/include/* and src/include/spu/* but couldn't find a headerfile that has it defined.

Can anyone please suggest if I should just change the code to not use that function or is there a way to invoke it still?

Thanks!"

It turned out this function was not available in the SDK.

The following is a branch-free implementation of atan2 for vector floats on the SPU. A scalar version which simply casts to vector and back is also provided. This implementation is fairly quick-and-dirty and no particular level of accuracy is guaranteed, but it should be usable for many purposes.

Or download the source files:
cp_fatan-cbe-spu.h
cp_fatan-cbe-spu.c

This code is C99 source. For gcc, use the following flags: -std=c99 -pedantic

// ## cp_fatan-cbe-spu.h (C99)
// ## Version 1.0
// ##
// ## Copyright (c) 2006 Mike Acton
// ##
// ## SIGNIFICANT REFERENCES:
// ##
// ## [1] Cephes Math Library Release 2.8: June, 2000
// ##     Copyright 1984, 1995, 2000, Stephen L. Moshier
// ## [2] Numerical Computation Guide (PDF)
// ##     Copyright 2000, Sun Microsystems, Inc.
// ## [3] IEEE 754 Support in C99 (PDF)
// ##     Copyright 2001, Jim Thomas
// ## [4] Solaris 10 Reference Manual : atan2(3M)
// ##     Copyright 1994-2005, Sun Microsystems, Inc.
// ##
// ## Permission is hereby granted, free of charge, to any person obtaining
// ## a copy of this software and associated documentation files
// ## (the "Software"), to deal in the Software without restriction, including
// ## without limitation the rights to use, copy, modify, merge, publish,
// ## distribute, sublicense, and/or sell copies of the Software, and to permit
// ## persons to whom the Software is furnished to do so, subject to the
// ## following conditions:
// ##
// ## The above copyright notice and this permission notice shall be included
// ## in all copies or substantial portions of the Software.
// ##
// ## THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
// ## OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// ## FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// ## AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// ## LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// ## OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// ## THE SOFTWARE.
// ##

#ifndef CP_FATAN_CBE_SPU_H
#define CP_FATAN_CBE_SPU_H

#include <stdint.h>          // intptr_t, int16_t (header name assumed)
#include <spu_intrinsics.h>  // qword, si_* intrinsics (header name assumed)

// ##
// ## Global Floating-point constants (32 bit)
// ##
// ## Constant is loaded in each element of 32 bit floating-point vector
// ## from local store.
// ##
// ## cp_flpio4()   +PI/+4
// ## cp_flt3p8()   tan( +3.0 * PI / +8.0 )
// ## cp_flnpio2()  -PI/+2
// ## cp_flpio2()   +PI/+2
// ## cp_flpt66()   +0.66
// ## cp_flpi()     +PI
// ## cp_flnpi()    -PI

extern const vector unsigned int _cp_f_pio4;
extern const vector unsigned int _cp_f_t3p8;
extern const vector unsigned int _cp_f_npio2;
extern const vector unsigned int _cp_f_pio2;
extern const vector unsigned int _cp_f_pt66;
extern const vector unsigned int _cp_f_pi;
extern const vector unsigned int _cp_f_npi;

static inline qword
cp_flpio4( void )
{
    return si_lqa( (intptr_t)&_cp_f_pio4 );
}

static inline qword
cp_flt3p8( void )
{
    return si_lqa( (intptr_t)&_cp_f_t3p8 );
}

static inline qword
cp_flnpio2( void )
{
    return si_lqa( (intptr_t)&_cp_f_npio2 );
}

static inline qword
cp_flpio2( void )
{
    return si_lqa( (intptr_t)&_cp_f_pio2 );
}

static inline qword
cp_flpt66( void )
{
    return si_lqa( (intptr_t)&_cp_f_pt66 );
}

static inline qword
cp_flpi( void )
{
    return si_lqa( (intptr_t)&_cp_f_pi );
}

static inline qword
cp_flnpi( void )
{
    return si_lqa( (intptr_t)&_cp_f_npi );
}

// ##
// ## Load-Immediate Floating-point constants (32 bit)
// ##
// ## Constant is loaded in each element of 32 bit floating-point vector
// ## using immediate values. i.e. No loads
// ##
// ## cp_filzero()   +0.0  +0x00000000
// ## cp_filnzero()  -0.0  +0x80000000
// ## cp_filone()    +1.0  +0x3f800000
// ## cp_filtwo()    +2.0  +0x40000000
// ## cp_filinf()    +INF  +0x7f800000
// ## cp_filninf()   -INF  +0xff800000
// ## cp_filnan()    NaN   +0x7fc00000
// ##

static inline qword
cp_filzero( void )
{
    return si_ilhu( (int16_t)0x0000 );
}

static inline qword
cp_filnzero( void )
{
    return si_ilhu( (int16_t)0x8000 );
}

static inline qword
cp_filone( void )
{
    return si_ilhu( (int16_t)0x3f80 );
}

static inline qword
cp_filtwo( void )
{
    return si_ilhu( (int16_t)0x4000 );
}

static inline qword
cp_filinf( void )
{
    return si_ilhu( (int16_t)0x7f80 );
}

static inline qword
cp_filninf( void )
{
    return si_ilhu( (int16_t)0xff80 );
}

static inline qword
cp_filnan( void )
{
    return si_ilhu( (int16_t)0x7fc0 );
}

// ##
// ## cp_fatan() Coefficients and other constants
// ##

extern const vector unsigned int _cp_f_atan_q4;
extern const vector unsigned int _cp_f_atan_q3;
extern const vector unsigned int _cp_f_atan_q2;
extern const vector unsigned int _cp_f_atan_q1;
extern const vector unsigned int _cp_f_atan_q0;
extern const vector unsigned int _cp_f_atan_p4;
extern const vector unsigned int _cp_f_atan_p3;
extern const vector unsigned int _cp_f_atan_p2;
extern const vector unsigned int _cp_f_atan_p1;
extern const vector unsigned int _cp_f_atan_p0;
extern const vector unsigned int _cp_f_hmorebits;
extern const vector unsigned int _cp_f_morebits;

// ## cp_fatan(x)
// ##
// ## 0 <= x <= 0.66
// ## -PI/2 <= cp_fatan(x) <= +PI/2
// ##
// ## Each floating-point component of the result is a function of
// ## the corresponding components of x:
// ##
// ##   0.0  { x == 0.0
// ##
// ##   +PI  {
// ##   ---  { x == INF
// ##   2.0  {
// ##
// ##   -PI  {
// ##   ---  { x == -INF
// ##   2.0  {
// ##
// ##
// ##              2      4      6      8            {
// ##      P  + P x  + P x  + P x  + P x             {
// ##    2  0    1      2      3      4              {
// ## x x  ------------------------------------ + x  { otherwise
// ##              2      4      6      8    10      {
// ##      Q  + Q x  + Q x  + Q x  + Q x  + x        {
// ##       0    1      2      3      4              {

static inline qword
_cp_fatan( const qword x )
{
    // ##
    // ## Load constants
    // ##

    const qword f_one  = cp_filone();
    const qword f_inf  = cp_filinf();
    const qword f_ninf = cp_filninf();
    const qword f_msb  = cp_filnzero();
    const qword f_zero = cp_filzero();

    const qword f_pt66  = si_lqa( (intptr_t)&_cp_f_pt66 );
    const qword f_pio2  = si_lqa( (intptr_t)&_cp_f_pio2 );
    const qword f_npio2 = si_lqa( (intptr_t)&_cp_f_npio2 );
    const qword f_pio4  = si_lqa( (intptr_t)&_cp_f_pio4 );
    const qword f_t3p8  = si_lqa( (intptr_t)&_cp_f_t3p8 );

    const qword f_atan_p0   = si_lqa( (intptr_t)&_cp_f_atan_p0 );
    const qword f_atan_p1   = si_lqa( (intptr_t)&_cp_f_atan_p1 );
    const qword f_atan_p2   = si_lqa( (intptr_t)&_cp_f_atan_p2 );
    const qword f_atan_p3   = si_lqa( (intptr_t)&_cp_f_atan_p3 );
    const qword f_atan_p4   = si_lqa( (intptr_t)&_cp_f_atan_p4 );
    const qword f_atan_q0   = si_lqa( (intptr_t)&_cp_f_atan_q0 );
    const qword f_atan_q1   = si_lqa( (intptr_t)&_cp_f_atan_q1 );
    const qword f_atan_q2   = si_lqa( (intptr_t)&_cp_f_atan_q2 );
    const qword f_atan_q3   = si_lqa( (intptr_t)&_cp_f_atan_q3 );
    const qword f_atan_q4   = si_lqa( (intptr_t)&_cp_f_atan_q4 );
    const qword f_morebits  = si_lqa( (intptr_t)&_cp_f_morebits );
    const qword f_hmorebits = si_lqa( (intptr_t)&_cp_f_hmorebits );

    // ##
    // ## pos_x = -x { x < 0
    // ##          x { otherwise
    // ##

    const qword neg_x     = si_xor( x, f_msb );
    const qword sign_mask = si_fcgt( f_zero, x );
    const qword pos_x     = si_selb( x, neg_x, sign_mask );

    // ##
    // ## Range reduction
    // ##

    // ##
    // ## range0_mask = ( pos_x > tan( 3.0 * PI / 8.0 ) )
    // ## range1_mask = ( pos_x <= 0.66 )
    // ## range2_mask = !( range0_mask || range1_mask )
    // ##

    const qword range0_mask    = si_fcgt( pos_x, f_t3p8 );
    const qword range1_gt_mask = si_fcgt( f_pt66, pos_x );
    const qword range1_eq_mask = si_fceq( f_pt66, pos_x );
    const qword range1_mask    = si_or( range1_gt_mask, range1_eq_mask );
    const qword range2_mask    = si_nor( range0_mask, range1_mask );

    // ##
    // ## range0_x = -1.0
    // ##            -----
    // ##            pos_x
    // ##
    // ## range0_y =  PI
    // ##            ---
    // ##            2.0
    // ##

    const qword range0_x0 = si_frest( pos_x );
    const qword range0_x1 = si_fi( pos_x, range0_x0 );
    const qword range0_x2 = si_fnms( range0_x1, pos_x, f_one );
    const qword range0_x3 = si_fma( range0_x2, range0_x1, range0_x1 );
    const qword range0_x  = si_xor( range0_x3, f_msb );
    const qword range0_y  = f_pio2;

    // ##
    // ## range1_x = pos_x
    // ## range1_y = 0.0
    // ##

    const qword range1_x = pos_x;
    const qword range1_y = f_zero;


    // ##
    // ## range2_x = (pos_x-1.0)
    // ##            -----------
    // ##            (pos_x+1.0)
    // ##
    // ## range2_y =  PI
    // ##            ---
    // ##            4.0
    // ##

    const qword range2_y     = f_pio4;
    const qword range2_x0num = si_fs( pos_x, f_one );
    const qword range2_x0den = si_fa( pos_x, f_one );
    const qword range2_x0    = si_frest( range2_x0den );
    const qword range2_x1    = si_fnms( range2_x0, range2_x0den, f_one );
    const qword range2_x2    = si_fma( range2_x1, range2_x0, range2_x0 );
    const qword range2_x     = si_fm( range2_x0num, range2_x2 );

    // ##
    // ## range_x = range0_x { range0_mask
    // ##           range1_x { range1_mask
    // ##           range2_x { range2_mask
    // ##
    // ## range_y = range0_y { range0_mask
    // ##           range1_y { range1_mask
    // ##           range2_y { range2_mask
    // ##

    const qword range_x0 = si_selb( range2_x, range0_x, range0_mask );
    const qword range_x  = si_selb( range_x0, range1_x, range1_mask );
    const qword range_y0 = si_selb( range2_y, range0_y, range0_mask );
    const qword range_y  = si_selb( range_y0, range1_y, range1_mask );

    // ##
    // ##                2
    // ## xp2  = range_x
    // ##                          2        3        4
    // ##        P  + P xp2 + P xp2  + P xp2  + P xp2
    // ##         0    1       2        3        4
    // ## zdiv = ------------------------------------------------
    // ##                          2        3        4        5
    // ##        Q  + Q xp2 + Q xp2  + Q xp2  + Q xp2  + xp2
    // ##         0    1       2        3        4
    // ##
    // ## z1 = range_x * ( xp2 * zdiv ) + range_x
    // ##

    const qword xp2     = si_fm( range_x, range_x );
    const qword znum0   = f_atan_p0;
    const qword znum1   = si_fma( znum0, xp2, f_atan_p1 );
    const qword znum2   = si_fma( znum1, xp2, f_atan_p2 );
    const qword znum3   = si_fma( znum2, xp2, f_atan_p3 );
    const qword znum    = si_fma( znum3, xp2, f_atan_p4 );
    const qword zden0   = si_fa( xp2, f_atan_q0 );
    const qword zden1   = si_fma( zden0, xp2, f_atan_q1 );
    const qword zden2   = si_fma( zden1, xp2, f_atan_q2 );
    const qword zden3   = si_fma( zden2, xp2, f_atan_q3 );
    const qword zden    = si_fma( zden3, xp2, f_atan_q4 );
    const qword zden_r0 = si_frest( zden );
    const qword zden_r1 = si_fnms( zden_r0, zden, f_one );
    const qword zden_r  = si_fma( zden_r1, zden_r0, zden_r0 );
    const qword zdiv    = si_fm( znum, zden_r );
    const qword z0      = si_fm( xp2, zdiv );
    const qword z1      = si_fma( range_x, z0, range_x );

    // ##
    // ## zadd = z1 + 0.5 * MOREBITS { range2_mask
    // ##        z1 + MOREBITS       { range1_mask
    // ##        z1                  { otherwise
    // ##
    // ## yaddz = range_y + zadd
    // ##
    // ## pos_yaddz =  yaddz { yaddz >= 0
    // ##             -yaddz { yaddz <  0
    // ##

    const qword zadd0     = si_selb( f_zero, f_hmorebits, range2_mask );
    const qword zadd1     = si_selb( zadd0, f_morebits, range1_mask );
    const qword zadd      = si_fa( z1, zadd1 );
    const qword yaddz     = si_fa( range_y, zadd );
    const qword neg_yaddz = si_xor( yaddz, f_msb );
    const qword pos_yaddz = si_selb( yaddz, neg_yaddz, sign_mask );

    // ##
    // ## result_y0 = 0.0       { x == 0.0
    // ##             pos_yaddz { otherwise
    // ##

    const qword x_eqz_mask = si_fceq( f_zero, x );
    const qword result_y0  = si_selb( pos_yaddz, x, x_eqz_mask );

    // ##
    // ## result_y2 = +PI {
    // ##             --- { x == INF
    // ##             2.0 {
    // ##
    // ##             -PI {
    // ##             --- { x == -INF
    // ##             2.0 {
    // ##
    // ##             result_y0 { otherwise
    // ##

    const qword x_eqinf_mask  = si_fceq( f_inf, x );
    const qword x_eqninf_mask = si_fceq( f_ninf, x );
    const qword result_y1     = si_selb( result_y0, f_pio2, x_eqinf_mask );
    const qword result        = si_selb( result_y1, f_npio2, x_eqninf_mask );

    return (result);
}

static inline vector float
cp_fatan( const vector float x )
{
    return (vector float)( _cp_fatan( (qword)x ) );
}

static inline float
cp_fatan_scalar( const float x )
{
    const qword vx      = si_from_float( x );
    const qword vresult = _cp_fatan( vx );
    const float result  = si_to_float( vresult );

    return (result);
}

// ## cp_fatan2(y,x)
// ##
// ## -INF <= x <= INF
// ## -INF <= y <= INF
// ## -PI <= cp_fatan2(y,x) <= +PI
// ##
// ## Each floating-point component of the result is a function of
// ## the corresponding components of y and x:
// ##
// ##   +PI      { (y == +0.0) && (x < 0.0)
// ##
// ##   -PI      { (y == -0.0) && (x < 0.0)
// ##
// ##   +0.0     { (y == +0.0) && (x > 0.0)
// ##
// ##   -0.0     { (y == -0.0) && (x > 0.0)
// ##
// ##   -PI      {
// ##   ----     { (y < 0.0) && (x == 0.0)
// ##   +2.0     {
// ##
// ##   +PI      {
// ##   ----     { (y > 0.0) && (x == 0.0)
// ##   +2.0     {
// ##
// ##   NaN      { (y == NaN) || (x == NaN)
// ##
// ##   +PI      { (y == +0.0) && (x == -0.0)
// ##
// ##   -PI      { (y == -0.0) && (x == -0.0)
// ##
// ##   +0.0     { (y == +0.0) && (x == +0.0)
// ##
// ##   -0.0     { (y == -0.0) && (x == +0.0)
// ##
// ##   +PI      {
// ##   ---      { (y == +INF) && (x == +INF)
// ##   4.0      {
// ##
// ##   -PI      {
// ##   ---      { (y == -INF) && (x == +INF)
// ##   4.0      {
// ##
// ##   +3.0 PI  {
// ##   -------  { (y == +INF) && (x == -INF)
// ##   +4.0     {
// ##
// ##   -3.0 PI  {
// ##   -------  { (y == -INF) && (x == -INF)
// ##   +4.0     {
// ##
// ##   +PI      { isfinite(y) && (+y > 0) && (x == -INF)
// ##
// ##   -PI      { isfinite(y) && (-y > 0) && (x == -INF)
// ##
// ##   +0.0     { isfinite(y) && (+y > 0) && (x == +INF)
// ##
// ##   -0.0     { isfinite(y) && (-y > 0) && (x == +INF)
// ##
// ##   +PI      {
// ##   ----     { isfinite(x) && (y == +INF)
// ##   +2.0     {
// ##
// ##   -PI      {
// ##   ---      { isfinite(x) && (y == -INF)
// ##   +2.0     {
// ##
// ##                  ( y )   {
// ##   +PI + cp_atan(  -  )   { ( x < 0.0 ) && ( y >= 0.0 )
// ##                  ( x )   {
// ##
// ##                  ( y )   {
// ##   -PI + cp_atan(  -  )   { ( x < 0.0 ) && ( y < 0.0 )
// ##                  ( x )   {
// ##
// ##                  ( y )   {
// ##   +0.0 + cp_atan( -  )   { otherwise
// ##                  ( x )   {
// ##

static inline qword
_cp_fatan2( qword y, qword x )
{
    const qword f_one  = cp_filone();
    const qword f_zero = cp_filzero();
    const qword f_pi   = si_lqa( (intptr_t)&_cp_f_pi );
    const qword f_npi  = si_lqa( (intptr_t)&_cp_f_npi );

    // ##
    // ## yox = y
    // ##       -
    // ##       x
    // ##
    // ## z = +PI + cp_atan( yox ) { ( x < 0.0 ) && ( y >= 0.0 )
    // ##     -PI + cp_atan( yox ) { ( x < 0.0 ) && ( y < 0.0 )
    // ##     0.0 + cp_atan( yox ) { otherwise

    const qword x_ltz_mask  = si_fcgt( f_zero, x );
    const qword y_ltz_mask  = si_fcgt( f_zero, y );
    const qword xy_ltz_mask = si_and( x_ltz_mask, y_ltz_mask );
    const qword zadd0       = si_selb( f_zero, f_pi, x_ltz_mask );
    const qword zadd        = si_selb( zadd0, f_npi, xy_ltz_mask );
    const qword x_r0        = si_frest( x );
    const qword x_r1        = si_fnms( x_r0, x, f_one );
    const qword x_r         = si_fma( x_r1, x_r0, x_r0 );
    const qword yox         = si_fm( y, x_r );
    const qword atan_yox    = _cp_fatan( yox );
    const qword result      = si_fa( zadd, atan_yox );

    return (result);
}

static inline vector float
cp_fatan2( vector float arg0 /* y */, vector float arg1 /* x */ )
{
    const qword y      = (qword)arg0;
    const qword x      = (qword)arg1;
    const qword result = _cp_fatan2( y, x );

    return (vector float)(result);
}

static inline float
cp_fatan2_scalar( float arg0 /* y */, float arg1 /* x */ )
{
    const qword y      = si_from_float( arg0 );
    const qword x      = si_from_float( arg1 );
    const qword z      = _cp_fatan2( y, x );
    const float result = si_to_float( z );

    return( result );
}

#endif /* CP_FATAN_CBE_SPU_H */
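The header relies on one idiom throughout: a comparison produces an all-ones or all-zeros mask (si_fcgt, si_fceq), and a bitwise select (si_selb) picks between two candidate results using that mask. As a rough illustration only, here is a portable C sketch of that idiom for a single float. The helper names (f2u, u2f, mask_gt, select_bits, abs_select) are hypothetical and are not part of the SPU intrinsics; whether the compiler emits a branch for the comparison on a given host is outside the sketch's control.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back (memcpy avoids aliasing issues). */
static uint32_t f2u(float f)    { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    u2f(uint32_t u) { float f;    memcpy(&f, &u, sizeof f); return f; }

/* All-ones when a > b, all-zeros otherwise: the per-element role of si_fcgt. */
static uint32_t mask_gt(float a, float b)
{
    return (uint32_t)0 - (uint32_t)(a > b);
}

/* Take bits from b where mask is set and from a elsewhere: the role of si_selb. */
static uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* pos_x = -x when x < 0, x otherwise, written without an if/else. */
static float abs_select(float x)
{
    const uint32_t x_bits    = f2u(x);
    const uint32_t neg_bits  = x_bits ^ 0x80000000u;  /* flip the sign bit */
    const uint32_t sign_mask = mask_gt(0.0f, x);      /* set when x < 0    */
    return u2f(select_bits(x_bits, neg_bits, sign_mask));
}
```

Both candidate values are always computed and the mask only decides which one survives, which is the same shape as the neg_x / sign_mask / pos_x sequence at the top of _cp_fatan.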

// ## cp_fatan-cbe-spu.c (C99)
// ## Version 1.0
// ##
// ## Copyright (c) 2006 Mike Acton
// ##
// ## SIGNIFICANT REFERENCES:
// ##
// ## [1] Cephes Math Library Release 2.8: June, 2000
// ##     Copyright 1984, 1995, 2000, Stephen L. Moshier
// ## [2] Numerical Computation Guide (PDF)
// ##     Copyright 2000, Sun Microsystems, Inc.
// ## [3] IEEE 754 Support in C99 (PDF)
// ##     Copyright 2001, Jim Thomas
// ## [4] Solaris 10 Reference Manual : atan2(3M)
// ##     Copyright 1994-2005, Sun Microsystems, Inc.
// ##
// ## Permission is hereby granted, free of charge, to any person obtaining
// ## a copy of this software and associated documentation files
// ## (the "Software"), to deal in the Software without restriction, including
// ## without limitation the rights to use, copy, modify, merge, publish,
// ## distribute, sublicense, and/or sell copies of the Software, and to permit
// ## persons to whom the Software is furnished to do so, subject to the
// ## following conditions:
// ##
// ## The above copyright notice and this permission notice shall be included
// ## in all copies or substantial portions of the Software.
// ##
// ## THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
// ## OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// ## FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// ## AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// ## LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// ## OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// ## THE SOFTWARE.
// ##

// Loading these constants from (global) SPU local memory is going to be a win over building them
// or storing them locally near the function.

const vector unsigned int _cp_f_pio4  = {+0x3F490FDA,+0x3F490FDA,+0x3F490FDA,+0x3F490FDA};
const vector unsigned int _cp_f_t3p8  = {+0x401A8279,+0x401A8279,+0x401A8279,+0x401A8279};
const vector unsigned int _cp_f_npio2 = {-0x4036F026,-0x4036F026,-0x4036F026,-0x4036F026};
const vector unsigned int _cp_f_pio2  = {+0x3FC90FDA,+0x3FC90FDA,+0x3FC90FDA,+0x3FC90FDA};
const vector unsigned int _cp_f_pt66  = {+0x3F28F5C2,+0x3F28F5C2,+0x3F28F5C2,+0x3F28F5C2};
const vector unsigned int _cp_f_pi    = {+0x40490fda,+0x40490fda,+0x40490fda,+0x40490fda};
const vector unsigned int _cp_f_npi   = {-0x3fb6f026,-0x3fb6f026,-0x3fb6f026,-0x3fb6f026};

const vector unsigned int _cp_f_atan_q4   = {+0x43428CF7,+0x43428CF7,+0x43428CF7,+0x43428CF7};
const vector unsigned int _cp_f_atan_q3   = {+0x43F2B1F8,+0x43F2B1F8,+0x43F2B1F8,+0x43F2B1F8};
const vector unsigned int _cp_f_atan_q2   = {+0x43D870C6,+0x43D870C6,+0x43D870C6,+0x43D870C6};
const vector unsigned int _cp_f_atan_q1   = {+0x432506EA,+0x432506EA,+0x432506EA,+0x432506EA};
const vector unsigned int _cp_f_atan_q0   = {+0x41C6DE22,+0x41C6DE22,+0x41C6DE22,+0x41C6DE22};
const vector unsigned int _cp_f_atan_p4   = {-0x3D7E4CB1,-0x3D7E4CB1,-0x3D7E4CB1,-0x3D7E4CB1};
const vector unsigned int _cp_f_atan_p3   = {-0x3D0A3A07,-0x3D0A3A07,-0x3D0A3A07,-0x3D0A3A07};
const vector unsigned int _cp_f_atan_p2   = {-0x3D69FB9F,-0x3D69FB9F,-0x3D69FB9F,-0x3D69FB9F};
const vector unsigned int _cp_f_atan_p1   = {-0x3E7EBD5E,-0x3E7EBD5E,-0x3E7EBD5E,-0x3E7EBD5E};
const vector unsigned int _cp_f_atan_p0   = {-0x409FFC03,-0x409FFC03,-0x409FFC03,-0x409FFC03};
const vector unsigned int _cp_f_hmorebits = {+0x240D3131,+0x240D3131,+0x240D3131,+0x240D3131};
const vector unsigned int _cp_f_morebits  = {+0x248D3131,+0x248D3131,+0x248D3131,+0x248D3131};
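For checking the constants on a host machine, the range reduction and rational polynomial can be sketched as a plain scalar C function. This is only a reference sketch under a few assumptions: it uses ordinary branches rather than the mask-and-select form, it leaves out the MOREBITS correction (which is far below single-precision resolution), and the names from_bits, ref_atan and ref_atan2 are hypothetical. The bit patterns are copied from the listing above.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE 754 single-precision float. */
static float from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Scalar reference: reduce |x| to a small argument, evaluate P/Q, add the offset. */
static float ref_atan(float x)
{
    /* Negative literals wrap to the unsigned pattern, as they do in the listing. */
    const float p[5] = {
        from_bits((uint32_t)(-0x409FFC03)), from_bits((uint32_t)(-0x3E7EBD5E)),
        from_bits((uint32_t)(-0x3D69FB9F)), from_bits((uint32_t)(-0x3D0A3A07)),
        from_bits((uint32_t)(-0x3D7E4CB1))
    };
    const float q[5] = {
        from_bits(0x41C6DE22u), from_bits(0x432506EAu), from_bits(0x43D870C6u),
        from_bits(0x43F2B1F8u), from_bits(0x43428CF7u)
    };
    const float t3p8 = from_bits(0x401A8279u);  /* tan(3*PI/8) */
    const float pt66 = from_bits(0x3F28F5C2u);  /* 0.66        */
    const float pio2 = from_bits(0x3FC90FDAu);  /* PI/2        */
    const float pio4 = from_bits(0x3F490FDAu);  /* PI/4        */

    const float ax = (x < 0.0f) ? -x : x;
    float rx, ry;

    if (ax > t3p8)       { rx = -1.0f / ax;                 ry = pio2; }
    else if (ax <= pt66) { rx = ax;                         ry = 0.0f; }
    else                 { rx = (ax - 1.0f) / (ax + 1.0f);  ry = pio4; }

    /* Horner evaluation in xp2 = rx*rx; the denominator has an implicit leading 1. */
    const float xp2 = rx * rx;
    float num = p[0];
    float den = xp2 + q[0];
    for (int i = 1; i < 5; ++i) {
        num = num * xp2 + p[i];
        den = den * xp2 + q[i];
    }

    const float r = ry + (rx * (xp2 * num / den) + rx);
    return (x < 0.0f) ? -r : r;
}

/* Quadrant fix-up in the same shape as _cp_fatan2: offset plus atan(y/x). */
static float ref_atan2(float y, float x)
{
    const float pi = from_bits(0x40490FDAu);
    float offset = 0.0f;
    if (x < 0.0f) {
        offset = (y < 0.0f) ? -pi : pi;
    }
    return offset + ref_atan(y / x);
}
```

Comparing ref_atan against a host atanf over a sweep of inputs is a cheap way to catch a mistyped coefficient before moving to the SPU.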

Open Source and Console Games

2006-08-10T05:54:17Z

On August 16, 2006 I participated in a panel discussion on Open Source and media as part of Digital Hollywood's Building Blocks 2006 conference.

Here is the description of the panel [from digitalhollywood.com]

The Open Source movement began during the dot.com rise with young companies developing great tools to deliver applications and services across multiple platforms. The consumer's appetite for new content driven experiences has expanded to include ways to view, manage, and share content across devices. With the changing landscape around the home, Open Source promises to power a new generation of applications running over today's high-speed networks and the systems used to create, manage, and distribute that content.

Come join key leaders in the global electronics, online, and media communities to discuss Open Source's definition, and learn how companies will create systems, infrastructure, and applications for the next generation of the Consumer Entertainment Experience.

For those of you who did not attend, I would like to take an opportunity to discuss here my personal opinions on these issues.

Background

From the description of the panel, some people might be led to believe that free and open source software are new phenomena, somehow linked to the internet bubble. This is definitely not true.

Certainly the history of "free software" can be traced back much further than the "dot com rise", and because much of the software we use (for example, GNU/Linux) is a mix of both "open" and "free" software, we should consider the larger context.

By most accounts, this begins with Richard Stallman announcing his plan to "free unix" on Usenet in 1983. But of course, the distribution of source code freely among programmers can be traced back much further than that.

Terminology

No discussion of "open source" can be complete without distinguishing between the subtle differences of "open source software" and "free software":

See: Definition of "free software"
See: Definition of "open source software"
Here's an article which tries to clarify the differences.

There has been some muddying of the waters by Microsoft's relatively recent "shared source" initiative (but there is general agreement that this is neither really "open" nor "free" and so not part of this discussion).

Licenses

There are very many open source licenses and one canonical free software license. Many companies (including IBM, SGI, Apple, ...) have also produced their own variants.

I think applicability of license to product merits some discussion here. For example, in my field (console video games), we are restricted by the platform owners (Sony, Nintendo, etc.) by NDA and cannot release specific details to the public. This necessarily limits our choices when using "open source software" and nearly eliminates "free software" as an option, as we often cannot fully reciprocate our modifications to the public.

I do not see this as a problem, nor as a challenge to be overcome. Authors of free software are often willing to distribute their software without cost so that others may take advantage of the work that they've done, and they only ask for one thing in return: that the software remain free. Just as when a middleware vendor charges half a million dollars more than you're willing to pay, if the price for the software is too steep, then something else must be used instead. I wholeheartedly respect the work of the FSF (Free Software Foundation), but I understand that the practical nature of our business makes it very difficult to directly use the products of their hard work, and the work of so many other free software developers, in our own products.

Free software definitely has its place in game development, however. I do most of my work on GNU/Linux desktops and GCC is my compiler of choice. Additionally, many offline tools used directly or indirectly to develop the games themselves are based on free or open software, and I'm grateful that those tools exist.

Reciprocity

I think reciprocity is the most important thing we can be discussing in the context of open source software and console games. It cannot be simply a matter of "how open source benefits us", but we must also discuss "how we can participate in the open source community" and what responsibilities we have for doing so.

The free and open source software which we gladly take advantage of (if not in the games themselves, then certainly in the tools that develop them) can be thought of as the proverbial "shoulders of giants". When we forget what brought us the advantages to get where we are, we do a disservice to ourselves and the health of our industry, and thus ultimately a disservice to our shareholders and customers.

I think Yahoo Search's vision statement applies equally well to the role of open source software:

"Enable people to find, use, share and expand all human knowledge"

To share and contribute not only benefits us now, but will continue to benefit us when our current products are forgotten and dusty.

Cost of Openness

There is an ongoing debate on the cost of sharing your work with the world. Perhaps there will be a higher cost in support when calls and emails arrive from users that have configured the software in some strange environment. Maybe it will give competitors an edge when they can clearly read the "secrets" of your product in the source code. Most arguments, including these, are never really so much about the costs involved (consider how many millions of dollars are spent developing the typical console title) but rather question the value of sharing, i.e. the return on the investment.

Consider this: The console game industry is a fast-moving industry. Consoles change, methods change and even the developers themselves change rapidly and constantly. Success of a title is usually determined by the quality of the content, not the engine that drives it, although occasionally the field of successful titles is punctuated by technical achievement. But if competitors need access to the source of a successful product in order to become successful themselves, they are already behind, and no amount of access will allow them to gain on the continued developments of the leaders. And if it does help to make their product a little better, that's a good thing. Good games are good for the platform, and what's good for the platform is good for developers wanting to sell their games on that platform.

The value of openness is in the people, not the source code.

Invest in the future. The programmers reading, modifying and commenting on the source may belong to the next generation of coders in the industry. Help them learn by providing examples of real-world challenges and their solutions.
Invest in your team. The best way to learn is to teach. Simply by explaining what they've done, programmers will come up with new ideas and find areas that they've missed. This is no minor point: a studio's value is in its people, and since there are very few traditional training courses for the professional developer, a good studio must find different ways of helping make those developers better each day at what they do.

Call to Arms

Electronic Arts made a considerable difference to not only games but to many different industries when they released the EA IFF 85 Standard for Interchange Format Files. And it is in that tradition, almost twenty-two years later, that I hope game developers, studios and publishers will re-double their efforts to share what they have created and learned with the community. id Software, the modern poster-child for sharing their technology, certainly hasn't lost anything by releasing some of their older sources.

Start small - a function, a snippet even. But if we make it a habit, we will all be rewarded.

Has your studio released something into the wild? Tell me about it and I will happily list it here.

Branch-free implementation of half-precision (16 bit) floating point

2006-07-18T06:19:16Z

Update! (19 July 06) Added Multiply. Fixed a problem with using __builtin_clz().

Update! (17 July 06) The code has been considerably refactored. Decided to go with single function per expression. The expressions have been reduced as a first optimization pass.

Project

The goal of this project is to serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers (one that does not use floating point hardware). This example should echo the IEEE 754 standard for floating point numbers as closely as reasonable, including support for +/- INF, QNaN, SNaN, and denormalized numbers. However, exceptions will not be implemented.

Half-precision floats are used in cases where neither the range nor the precision of 32 bit floating point numbers is needed, but where some dynamic precision is required. Two common uses are image transformation, where the range of each component (e.g. red, green, blue, alpha) is typically limited to or near [0.0,1.0], and vertex data (e.g. position, texture coordinates, color values, etc.).

The main advantage of half-precision floats is their size. Beyond the considerable potential for memory savings, processing a large number of half-precision values is more cache-friendly than using 32 bit values.
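
To make the 16 bit format a little more concrete, here is a minimal sketch of pulling a half apart and classifying it without a compare or branch. It assumes the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits); the names HalfFields, half_unpack and half_inf_or_nan_mask are made up for illustration and are not taken from half.c.

```cpp
#include <cstdint>

// Assumed layout (IEEE 754 binary16): 1 sign bit, 5 exponent bits (bias 15),
// 10 mantissa bits. All names here are hypothetical, not taken from half.c.
struct HalfFields
{
    uint16_t sign;      // 0 or 1
    uint16_t exponent;  // biased exponent, 0..31
    uint16_t mantissa;  // 0..1023
};

static inline HalfFields half_unpack( const uint16_t h )
{
    HalfFields f;
    f.sign     = (uint16_t)( h >> 15 );
    f.exponent = (uint16_t)( ( h >> 10 ) & 0x1F );
    f.mantissa = (uint16_t)( h & 0x3FF );
    return f;
}

// 0xFFFF if h encodes +/-INF or a NaN (exponent field all ones), else 0.
// Assumes two's complement conversion and an arithmetic >> on negative values.
static inline uint16_t half_inf_or_nan_mask( const uint16_t h )
{
    // exp_diff is 0 only when the exponent field is 31.
    const uint16_t exp_diff    = (uint16_t)( ( ( h >> 10 ) & 0x1F ) ^ 0x1F );
    // For any nonzero exp_diff, the OR with its negation has the top bit set.
    const uint16_t either      = (uint16_t)( exp_diff | ( 0u - exp_diff ) );
    // Smear the top bit across the word: 0xFFFF if exp_diff != 0, else 0.
    const uint16_t exp_diff_nz = (uint16_t)( (int16_t)either >> 15 );
    return (uint16_t)~exp_diff_nz;
}
```

A mask like this can then be used to select between a "special value" result and a "normal" result with AND/OR instead of a branch.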

The current released version (including tests) can be downloaded here: half.c half.h

Increment And Decrement Wrapping Values

2006-07-11T07:10:07Z

Small code, big impact

Occasionally you have a set of values that you want to wrap around as you increment and decrement them. For example, in a GUI where the user keys right or left and you want to wrap around the menu.

A typical implementation:

static inline int wrap_inc( int value, int min, int max ) { return ( value == max ) ? min : value + 1; } static inline int wrap_dec( int value, int min, int max ) { return ( value == min ) ? max : value - 1; }

But on processors (such as the PowerPC) where compare and branch is very costly these small one-liners can have a significant impact on performance when used in critical code. They also make optimization more difficult for the compiler for the surrounding code.

Breakdown

Store the desired result:

const type result_inc = val + 1;

This value may overflow if val == INT(SIZE)_MAX, but in that case the correct value will still be selected.

Get the difference between the max (or min) value and the current value:

const type max_diff = max - val;

All that matters is whether this value is zero or not. If it's zero, we know we are at the max (or min) value; otherwise we can increment (or decrement).

Create a mask based on the difference:

const type max_diff_nz = (type)( (stype)( max_diff | -max_diff ) >> bit_mask );

i.e.

max_diff_nz = ( max_diff != 0 ) ? (type)-1 : (type)0;

(Remember that -1 is all bits on in two's complement.)
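
As a standalone sketch of the mask step, here it is for a 32 bit unsigned value. The function name nonzero_mask_u32 is hypothetical, and the code assumes that >> on a negative signed value is an arithmetic shift (true of the compilers I'd expect this to be used with, but not guaranteed by older language standards).

```cpp
#include <cstdint>

// All bits set if x != 0, otherwise 0.
// For any nonzero x, either x or its two's complement negation has the top
// bit set, so the OR of the two has it set as well. An arithmetic shift
// right by 31 then copies that bit into every position.
static inline uint32_t nonzero_mask_u32( const uint32_t x )
{
    const uint32_t either = x | ( 0u - x );
    return (uint32_t)( (int32_t)either >> 31 );
}
```

Note that x = 0x80000000 is its own negation, which is why the OR is needed: neither x alone nor -x alone has the top bit set for every nonzero input.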

Complement the mask:

const type max_diff_eqz = ~max_diff_nz;

Select the correct result based on the masks:

const type result = ( result_inc & max_diff_nz ) | ( min & max_diff_eqz );

Only one of the two values can possibly be selected.
i.e.

result = ( val == max ) ? min : val + 1;

Final Code

//
// wrap_int.h
//

#ifndef WRAP_INT_H
#define WRAP_INT_H

//
// Increment wrapping value
//
// val = { ( val == max ), min
//     = { otherwise,      val + 1
//
// uint8_t  wrap_inc_u8 ( const uint8_t  val, const uint8_t  min, const uint8_t  max );
// uint16_t wrap_inc_u16( const uint16_t val, const uint16_t min, const uint16_t max );
// uint32_t wrap_inc_u32( const uint32_t val, const uint32_t min, const uint32_t max );
// uint64_t wrap_inc_u64( const uint64_t val, const uint64_t min, const uint64_t max );
// int8_t   wrap_inc_s8 ( const int8_t   val, const int8_t   min, const int8_t   max );
// int16_t  wrap_inc_s16( const int16_t  val, const int16_t  min, const int16_t  max );
// int32_t  wrap_inc_s32( const int32_t  val, const int32_t  min, const int32_t  max );
// int64_t  wrap_inc_s64( const int64_t  val, const int64_t  min, const int64_t  max );

#define DECL_WRAP_INC( type_name, type, stype, bit_mask )                                     \
static inline type wrap_inc_##type_name( const type val, const type min, const type max )     \
{                                                                                             \
    const type result_inc   = val + 1;                                                        \
    const type max_diff     = max - val;                                                      \
    const type max_diff_nz  = (type)( (stype)( max_diff | -max_diff ) >> bit_mask );          \
    const type max_diff_eqz = ~max_diff_nz;                                                   \
    const type result       = ( result_inc & max_diff_nz ) | ( min & max_diff_eqz );          \
                                                                                              \
    return (result);                                                                          \
}

DECL_WRAP_INC( u8,  uint8_t,  int8_t,   7 );
DECL_WRAP_INC( u16, uint16_t, int16_t, 15 );
DECL_WRAP_INC( u32, uint32_t, int32_t, 31 );
DECL_WRAP_INC( u64, uint64_t, int64_t, 63 );
DECL_WRAP_INC( s8,  int8_t,   int8_t,   7 );
DECL_WRAP_INC( s16, int16_t,  int16_t, 15 );
DECL_WRAP_INC( s32, int32_t,  int32_t, 31 );
DECL_WRAP_INC( s64, int64_t,  int64_t, 63 );

//
// Decrementing wrapping value
//
// val = { ( val == min ), max
//     = { otherwise,      val - 1
//
// uint8_t  wrap_dec_u8 ( const uint8_t  val, const uint8_t  min, const uint8_t  max );
// uint16_t wrap_dec_u16( const uint16_t val, const uint16_t min, const uint16_t max );
// uint32_t wrap_dec_u32( const uint32_t val, const uint32_t min, const uint32_t max );
// uint64_t wrap_dec_u64( const uint64_t val, const uint64_t min, const uint64_t max );
// int8_t   wrap_dec_s8 ( const int8_t   val, const int8_t   min, const int8_t   max );
// int16_t  wrap_dec_s16( const int16_t  val, const int16_t  min, const int16_t  max );
// int32_t  wrap_dec_s32( const int32_t  val, const int32_t  min, const int32_t  max );
// int64_t  wrap_dec_s64( const int64_t  val, const int64_t  min, const int64_t  max );

#define DECL_WRAP_DEC( type_name, type, stype, bit_mask )                                     \
static inline type wrap_dec_##type_name( const type val, const type min, const type max )     \
{                                                                                             \
    const type result_dec   = val - 1;                                                        \
    const type min_diff     = min - val;                                                      \
    const type min_diff_nz  = (type)( (stype)( min_diff | -min_diff ) >> bit_mask );          \
    const type min_diff_eqz = ~min_diff_nz;                                                   \
    const type result       = ( result_dec & min_diff_nz ) | ( max & min_diff_eqz );          \
                                                                                              \
    return (result);                                                                          \
}

DECL_WRAP_DEC( u8,  uint8_t,  int8_t,   7 );
DECL_WRAP_DEC( u16, uint16_t, int16_t, 15 );
DECL_WRAP_DEC( u32, uint32_t, int32_t, 31 );
DECL_WRAP_DEC( u64, uint64_t, int64_t, 63 );
DECL_WRAP_DEC( s8,  int8_t,   int8_t,   7 );
DECL_WRAP_DEC( s16, int16_t,  int16_t, 15 );
DECL_WRAP_DEC( s32, int32_t,  int32_t, 31 );
DECL_WRAP_DEC( s64, int64_t,  int64_t, 63 );

#endif /* #ifndef WRAP_INT_H */
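
A cheap way to gain some confidence in code like this is to compare it exhaustively against the branching version for the smallest type. The sketch below does that for the unsigned 8 bit case. It restates the two functions by hand rather than including wrap_int.h, and the names wrap_*_ref_u8, wrap_*_bf_u8 and count_wrap_mismatches_u8 are hypothetical. As before, it assumes an arithmetic right shift on negative values.

```cpp
#include <cstdint>

// Branching reference versions.
static inline uint8_t wrap_inc_ref_u8( uint8_t val, uint8_t min, uint8_t max )
{
    return ( val == max ) ? min : (uint8_t)( val + 1 );
}

static inline uint8_t wrap_dec_ref_u8( uint8_t val, uint8_t min, uint8_t max )
{
    return ( val == min ) ? max : (uint8_t)( val - 1 );
}

// Branch-free versions, written out for uint8_t following the breakdown.
static inline uint8_t wrap_inc_bf_u8( uint8_t val, uint8_t min, uint8_t max )
{
    const uint8_t result_inc   = (uint8_t)( val + 1 );
    const uint8_t max_diff     = (uint8_t)( max - val );
    const uint8_t max_diff_nz  = (uint8_t)( (int8_t)( max_diff | -max_diff ) >> 7 );
    const uint8_t max_diff_eqz = (uint8_t)~max_diff_nz;
    return (uint8_t)( ( result_inc & max_diff_nz ) | ( min & max_diff_eqz ) );
}

static inline uint8_t wrap_dec_bf_u8( uint8_t val, uint8_t min, uint8_t max )
{
    const uint8_t result_dec   = (uint8_t)( val - 1 );
    const uint8_t min_diff     = (uint8_t)( min - val );
    const uint8_t min_diff_nz  = (uint8_t)( (int8_t)( min_diff | -min_diff ) >> 7 );
    const uint8_t min_diff_eqz = (uint8_t)~min_diff_nz;
    return (uint8_t)( ( result_dec & min_diff_nz ) | ( max & min_diff_eqz ) );
}

// Count disagreements over every (val, min, max) combination of 8 bit inputs.
static inline long count_wrap_mismatches_u8( void )
{
    long mismatches = 0;
    for ( int min = 0; min < 256; ++min )
    for ( int max = 0; max < 256; ++max )
    for ( int val = 0; val < 256; ++val )
    {
        const uint8_t v = (uint8_t)val, lo = (uint8_t)min, hi = (uint8_t)max;
        mismatches += ( wrap_inc_bf_u8( v, lo, hi ) != wrap_inc_ref_u8( v, lo, hi ) );
        mismatches += ( wrap_dec_bf_u8( v, lo, hi ) != wrap_dec_ref_u8( v, lo, hi ) );
    }
    return mismatches;
}
```

For the menu example, a four-item menu would use wrap_inc_bf_u8( item, 0, 3 ) on "right" and wrap_dec_bf_u8( item, 0, 3 ) on "left".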

Box Overlap

2006-06-19T05:18:02Z

Interactive 3D applications frequently need to check whether one geometric object overlaps another. In this article, we'll look at a function to test for overlap between 3D boxes, and we'll show how to optimize this function for the CBE.

Before Optimization

Let's start with the example below, which is similar to solid-2.5.4/src/complex/DT_CBox.h in the SOLID library from Dtecta, but doesn't need to include a bunch of other stuff.

#include <math.h>

struct Vector3
{
    float m_co[3];

    Vector3() {}

    Vector3(const float& x, const float& y, const float& z)
    {
      m_co[0] = x;
      m_co[1] = y;
      m_co[2] = z;
    }

    float&       operator[]( int i )       { return m_co[i]; }
    const float& operator[]( int i ) const { return m_co[i]; }

    Vector3& operator-=(const Vector3& v)
    {
      this->m_co[0] -= v.m_co[0];
      this->m_co[1] -= v.m_co[1];
      this->m_co[2] -= v.m_co[2];

      return (*this);
    }

    Vector3& operator+=(const Vector3& v)
    {
      this->m_co[0] += v.m_co[0];
      this->m_co[1] += v.m_co[1];
      this->m_co[2] += v.m_co[2];

      return (*this);
    }
};

struct Box
{
    Vector3 m_center;
    Vector3 m_extent;

    Box() {}

    Box(const Vector3& center, const Vector3& extent) 
      : m_center(center),
        m_extent(extent)
    {}

    bool overlaps(const Box& b) const
    {
        return ::fabs(m_center[0] - b.m_center[0]) <= m_extent[0] + b.m_extent[0] &&
               ::fabs(m_center[1] - b.m_center[1]) <= m_extent[1] + b.m_extent[1] &&
               ::fabs(m_center[2] - b.m_center[2]) <= m_extent[2] + b.m_extent[2];
    }
};

bool
test_overlap( const Box& a, const Box& b )
{
  return a.overlaps( b );
}

We'll be looking at the compiler output for the test_overlap() function. I made the data public, since it makes the optimization pass much simpler. I'm not going to debate what makes "better" C++ code. We're just talking about what's faster here.
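
Before digging into the compiler output, it may help to see what the test actually decides. Reading the formula, the extents act as half-widths: two boxes overlap when, on every axis, the distance between centers is no more than the sum of the extents. Here's a small portable sketch with a few concrete cases; Box3, make_box3 and box3_overlaps are hypothetical names, and this version makes no attempt to be fast.

```cpp
#include <cmath>

// Hypothetical plain-data box: center plus half-width (extent) per axis.
struct Box3
{
    float center[3];
    float extent[3];
};

static inline Box3 make_box3( float cx, float cy, float cz,
                              float ex, float ey, float ez )
{
    const Box3 box = { { cx, cy, cz }, { ex, ey, ez } };
    return box;
}

// Same test as Box::overlaps, written as a loop over the three axes.
static inline bool box3_overlaps( const Box3& a, const Box3& b )
{
    for ( int axis = 0; axis < 3; ++axis )
    {
        const float distance = std::fabs( a.center[axis] - b.center[axis] );
        const float reach    = a.extent[axis] + b.extent[axis];
        if ( distance > reach )
        {
            return false;   // separated on this axis
        }
    }
    return true;            // no separating axis among x, y, z
}
```

With two unit-extent boxes, centers 1.5 apart overlap, centers exactly 2.0 apart still count as overlapping (the comparison is <=), and centers 3.0 apart do not.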

Nothing is obviously wrong (i.e. slow) with this code. Everything is inline, so we shouldn't expect any unneeded jumps for such small functions. There's only one reference to each element, so we shouldn't expect any extra loads.

But if we look at the compiler output, we see that almost every operation stalls waiting for the operands to load.

_Z12test_overlapRK3BoxS1_:
    stwu 1,-16(1)
    lfs 3,0(3)    -- LOAD(3)
    li 0,0
    lfs 11,0(4)   -- LOAD(11)
    fsubs 1,3,11  -- WAIT FOR LOAD(3), LOAD(11)
    lfs 2,16(3)   -- LOAD(2)
    lfs 0,16(4)   -- LOAD(0)
    fadds 12,2,0  -- WAIT FOR LOAD(2), LOAD(0)
    fabs 13,1
    fcmpu 7,13,12
    bgt- 7,.L2
    lfs 9,4(3)    -- LOAD(9)
    lfs 10,4(4)   -- LOAD(10)
    fsubs 6,9,10  -- WAIT FOR LOAD(9), LOAD(10)
    lfs 7,20(3)   -- LOAD(7)
    lfs 8,20(4)   -- LOAD(8)
    fadds 4,7,8   -- WAIT FOR LOAD(7), LOAD(8) 
    fabs 5,6
    fcmpu 0,5,4
    bgt- 0,.L2
    lfs 0,8(3)    -- LOAD(0)
    lfs 11,8(4)   -- LOAD(11)
    fsubs 1,0,11  -- WAIT FOR LOAD(0), LOAD(11)
    lfs 2,24(3)   -- LOAD(2)
    lfs 3,24(4)   -- LOAD(3)
    fadds 13,2,3  -- WAIT FOR LOAD(2), LOAD(3)
    fabs 12,1
    fcmpu 1,12,13
    bgt- 1,.L2
    li 0,1
.L2:
    mr 3,0
    addi 1,1,16
    blr

The compiler has built the dependency graph around the branches. You might think we benefit by branching out immediately when we find a case that fails, since we skip the subsequent loads. But this turns out to be a bad idea for the following reasons:

1. Operands that are adjacent in memory probably lie on the same cache line. The PPE is dual-threaded, so the longer the delay between loads of adjacent operands, the greater the chance of the other thread (or an interrupt) flushing the cache line.
2. The compiler has used "bgt-", meaning the branches are statically predicted unlikely. It doesn't make much sense for the compiler to hide loads behind unlikely branches.

Separating Loads From Calculations

What we want to do is queue up the loads as deep as we can before we start doing any calculations.

Don't use class or struct fields (or array elements) directly in calculations. Always follow this pattern:
1. Load everything you need into local variables of native types.
2. Do all your calculations.
3. Store your final result, while trying to avoid branches.
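
The same three-step pattern applies to any small function, not just the box test. As a sketch on a different (made up) example, here is a sphere overlap test laid out as loads, then calculations, then a single result. Sphere, make_sphere and spheres_overlap are hypothetical names that don't appear elsewhere in this article.

```cpp
// Hypothetical example type, used only to illustrate the pattern.
struct Sphere
{
    float center[3];
    float radius;
};

static inline Sphere make_sphere( float x, float y, float z, float radius )
{
    const Sphere s = { { x, y, z }, radius };
    return s;
}

static inline int spheres_overlap( const Sphere& a, const Sphere& b )
{
    // 1. LOADS: copy every field we need into locals of native type.
    const float ax = a.center[0];
    const float ay = a.center[1];
    const float az = a.center[2];
    const float ar = a.radius;
    const float bx = b.center[0];
    const float by = b.center[1];
    const float bz = b.center[2];
    const float br = b.radius;

    // 2. CALCULATIONS: work only on the locals.
    const float dx       = ax - bx;
    const float dy       = ay - by;
    const float dz       = az - bz;
    const float dist_sq  = dx * dx + dy * dy + dz * dz;
    const float r_sum    = ar + br;
    const float r_sum_sq = r_sum * r_sum;

    // 3. RESULT: one comparison, returned as an int.
    const int result = ( dist_sq <= r_sum_sq );

    return (result);
}
```

Writing it this way doesn't force the compiler to schedule the loads first, but it does make the intended grouping explicit in the source.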

Here's the second version of the overlaps() method:

    bool overlaps(const Box& b) const
    {
      // 
      // LOADS
      // 

      const float a_c0 = m_center[0];
      const float a_c1 = m_center[1];
      const float a_c2 = m_center[2];
      const float a_e0 = m_extent[0];
      const float a_e1 = m_extent[1];
      const float a_e2 = m_extent[2];
      const float b_c0 = b.m_center[0];
      const float b_c1 = b.m_center[1];
      const float b_c2 = b.m_center[2];
      const float b_e0 = b.m_extent[0];
      const float b_e1 = b.m_extent[1];
      const float b_e2 = b.m_extent[2];

      // 
      // CALCULATIONS
      // 

      const float delta_c0     = a_c0 - b_c0;
      const float delta_c1     = a_c1 - b_c1;
      const float delta_c2     = a_c2 - b_c2;
      const float abs_delta_c0 = ::fabs( delta_c0 );
      const float abs_delta_c1 = ::fabs( delta_c1 );
      const float abs_delta_c2 = ::fabs( delta_c2 );
      const float sum_e0       = a_e0 + b_e0;
      const float sum_e1       = a_e1 + b_e1;
      const float sum_e2       = a_e2 + b_e2;

      // 
      // COMPARES AND BRANCHES
      // 

      const bool  in_0     = abs_delta_c0 <= sum_e0;
      const bool  in_1     = abs_delta_c1 <= sum_e1;
      const bool  in_2     = abs_delta_c2 <= sum_e2;
      const bool  result   = in_0 && in_1 && in_2;

      return (result);
    }

The results are not much better at this point. The compiler reorders things, doing each subtraction as soon as the needed operands are loaded.

_Z12test_overlapRK3BoxS1_:
    stwu 1,-16(1)
    lfs 9,4(3)
    li 9,0
    lfs 0,4(4)
    fsubs 2,9,0
    lfs 8,0(3)
    lfs 1,0(4)
    fsubs 5,8,1
    lfs 10,8(3)
    lfs 12,8(4)
    fsubs 1,10,12
    lfs 3,16(3)
    lfs 13,16(4)
    fadds 0,3,13
    fabs 9,2
    lfs 11,12(4)
    lfs 6,12(3)
    fadds 12,6,11
    lfs 4,20(3)
    lfs 7,20(4)
    fabs 8,5
    fadds 11,4,7
    fabs 10,1
    fcmpu 7,9,0
    crnot 30,29
    mfcr 0
    rlwinm 0,0,31,1
    fcmpu 1,8,12
    fcmpu 6,10,11
    crnot 26,25
    cmpwi 7,0,0
    mfcr 0
    rlwinm 0,0,27,1
    bgt- 1,.L14
    cmpwi 6,0,0
    beq- 7,.L14
    beq- 6,.L14
    li 9,1
.L14:
    mr 3,9
    addi 1,1,16
    blr

We'll use a trick to prevent the compiler from mixing loads and calculations. First we define this macro:

#define GCC_SPLIT_BLOCK __asm__ ("");

An empty inline assembly statement doesn't add any code, but it splits the basic block, forcing the compiler to schedule the code on either side separately. We'll add this macro after the loads but before the calculations. We'll also add it just after the calculations so it's easier to see what's happening, but this second split isn't really important for optimization.

Here's the third version of the overlaps() method:

    bool overlaps(const Box& b) const
    {
      // 
      // LOADS
      // 

      const float a_c0 = m_center[0];
      const float a_c1 = m_center[1];
      const float a_c2 = m_center[2];
      const float a_e0 = m_extent[0];
      const float a_e1 = m_extent[1];
      const float a_e2 = m_extent[2];
      const float b_c0 = b.m_center[0];
      const float b_c1 = b.m_center[1];
      const float b_c2 = b.m_center[2];
      const float b_e0 = b.m_extent[0];
      const float b_e1 = b.m_extent[1];
      const float b_e2 = b.m_extent[2];

      GCC_SPLIT_BLOCK

      // 
      // CALCULATIONS
      // 

      const float delta_c0     = a_c0 - b_c0;
      const float delta_c1     = a_c1 - b_c1;
      const float delta_c2     = a_c2 - b_c2;
      const float abs_delta_c0 = ::fabs( delta_c0 );
      const float abs_delta_c1 = ::fabs( delta_c1 );
      const float abs_delta_c2 = ::fabs( delta_c2 );
      const float sum_e0       = a_e0 + b_e0;
      const float sum_e1       = a_e1 + b_e1;
      const float sum_e2       = a_e2 + b_e2;

      GCC_SPLIT_BLOCK

      // 
      // COMPARES AND BRANCHES
      // 

      const bool  in_0     = abs_delta_c0 <= sum_e0;
      const bool  in_1     = abs_delta_c1 <= sum_e1;
      const bool  in_2     = abs_delta_c2 <= sum_e2;
      const bool  result   = in_0 && in_1 && in_2;

      return (result);
    }

The new output clearly shows that the code was scheduled on either side of the splits.

_Z12test_overlapRK3BoxS1_:
    //
    // PUSH STACK
    //

    stwu 1,-16(1)

    //
    // LOADS
    //

    lfs 4,20(3)
    lfs 3,20(4)
    lfs 1,0(3)
    lfs 13,4(3)
    lfs 12,8(3)
    lfs 11,12(3)
    lfs 10,16(3)
    lfs 9,0(4)
    lfs 8,4(4)
    lfs 7,8(4)
    lfs 6,12(4)
    lfs 5,16(4)

    //
    // CALCULATIONS
    //

    fsubs 0,1,9
    fsubs 2,13,8
    fsubs 1,12,7
    fadds 11,11,6
    fadds 10,10,5
    fadds 4,4,3
    fabs 0,0
    fabs 13,2
    fabs 12,1

    //
    // COMPARES AND BRANCHES
    //

    fcmpu 7,13,10
    li 3,0
    crnot 30,29
    fcmpu 1,0,11
    mfcr 0
    rlwinm 0,0,31,1
    fcmpu 6,12,4
    crnot 26,25
    cmpwi 7,0,0
    mfcr 0
    rlwinm 0,0,27,1
    bgt- 1,.L14
    cmpwi 6,0,0
    beq- 7,.L14
    beq- 6,.L14
    li 3,1

    //
    // POP STACK AND RETURN
    //
.L14:
    addi 1,1,16
    blr

We still have 3 branches and a lot of compares. Let's see what we can do about that.

Removing Branches

On the CBE (as with most pipelined architectures), it's good to reduce or eliminate branches where possible. In this case, we can use the fsel instruction to replace a compare and branch. This is an optional PowerPC instruction, but the PPU implements it. Unfortunately, the compiler doesn't generate fsel instructions for the PPU, so we'll have to emit them manually:

static inline float ppc_fsels( const float fra, const float frc, const float frb ) 
{
    float frt;

    // From: http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixassem/alangref/fsel.htm
    //     The double-precision floating-point operand in floating-point register (FPR) FRA 
    //     is compared with the value zero. If the value in FRA is greater than or equal to 
    //     zero, floating point register FRT is set to the contents of floating-point 
    //     register FRC. If the value in FRA is less than zero or is a NaN, floating point 
    //     register FRT is set to the contents of floating-point register FRB. The comparison 
    //     ignores the sign of zero; both +0 and -0 are equal to zero. 
    //     
    // i.e. frt = ( fra >= 0.0 ) ? frc : frb;
    //     
    __asm__( "fsel %0, %1, %2, %3" : "=f"(frt) : "f"(fra), "f"(frc), "f"(frb) );

    return (frt);
}
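
When testing on a machine without fsel, it can be handy to have a plain C++ stand-in with the same selection rule. The sketch below is only a model of the semantics quoted in the comment above; fsel_ref is a hypothetical name, and a compiler is free to turn it back into a compare and branch.

```cpp
#include <cmath>   // NAN, for exercising the NaN case

// Portable model of the fsel selection rule:
//   fra >= 0 (including -0.0)  -> frc
//   fra <  0 or fra is a NaN   -> frb
// A comparison with NaN is false, so NaN falls through to frb as required.
static inline float fsel_ref( const float fra, const float frc, const float frb )
{
    return ( fra >= 0.0f ) ? frc : frb;
}
```

For example, fsel_ref( -0.0f, 10.0f, 20.0f ) selects 10.0f, while fsel_ref( NAN, 10.0f, 20.0f ) selects 20.0f.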

Now let's focus on the compares and branches portion of the method:

      const bool  in_0     = abs_delta_c0 <= sum_e0;
      const bool  in_1     = abs_delta_c1 <= sum_e1;
      const bool  in_2     = abs_delta_c2 <= sum_e2;
      const bool  result   = in_0 && in_1 && in_2;

This code can be rewritten as follows:

      const float  overlap_0 = sum_e0 - abs_delta_c0;
      const float  overlap_1 = sum_e1 - abs_delta_c1;
      const float  overlap_2 = sum_e2 - abs_delta_c2;
      const double temp_01   = ( overlap_1 >= 0.0f ) ? overlap_0 : overlap_1;
      const double temp_012  = ( overlap_2 >= 0.0f ) ? temp_01   : overlap_2;
      const bool   result    = temp_012 >= 0.0f;

The calculations of temp_01 and temp_012 can be expressed using fsel.

      const float  overlap_0 = sum_e0 - abs_delta_c0;
      const float  overlap_1 = sum_e1 - abs_delta_c1;
      const float  overlap_2 = sum_e2 - abs_delta_c2;
      const double temp_01   = ppc_fsels( overlap_1, overlap_0, overlap_1 );
      const double temp_012  = ppc_fsels( overlap_2, temp_01,   overlap_2 );
      const bool   result    = temp_012 >= 0.0f;
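
It's worth convincing yourself that the select chain really computes "all three overlaps are non-negative". The sketch below checks the chain against the straightforward AND over a handful of awkward values, using a portable model of fsel. The names fsel_ref, all_nonneg_and, all_nonneg_chain and count_chain_mismatches are hypothetical.

```cpp
#include <limits>

// Portable model of fsel: frc if fra >= 0, frb if fra < 0 or NaN.
static inline float fsel_ref( float fra, float frc, float frb )
{
    return ( fra >= 0.0f ) ? frc : frb;
}

// Straightforward version: three compares joined with &&.
static inline bool all_nonneg_and( float o0, float o1, float o2 )
{
    return ( o0 >= 0.0f ) && ( o1 >= 0.0f ) && ( o2 >= 0.0f );
}

// Select chain: a negative (or NaN) overlap is passed along to the final
// compare; otherwise the previous candidate is passed along.
static inline bool all_nonneg_chain( float o0, float o1, float o2 )
{
    const float temp_01  = fsel_ref( o1, o0,      o1 );
    const float temp_012 = fsel_ref( o2, temp_01, o2 );
    return temp_012 >= 0.0f;
}

// Compare both versions over every triple drawn from a set of edge cases.
static inline int count_chain_mismatches( void )
{
    const float inf = std::numeric_limits<float>::infinity();
    const float nan = std::numeric_limits<float>::quiet_NaN();
    const float samples[] = { -inf, -1.0f, -0.0f, 0.0f, 1.0f, inf, nan };
    const int   count     = (int)( sizeof( samples ) / sizeof( samples[0] ) );

    int mismatches = 0;
    for ( int i = 0; i < count; ++i )
    for ( int j = 0; j < count; ++j )
    for ( int k = 0; k < count; ++k )
    {
        const bool expected = all_nonneg_and  ( samples[i], samples[j], samples[k] );
        const bool actual   = all_nonneg_chain( samples[i], samples[j], samples[k] );
        mismatches += ( expected != actual );
    }
    return mismatches;
}
```

One caveat about the earlier step of the rewrite: abs_delta <= sum and sum - abs_delta >= 0 are not quite the same test at the extremes. If both sides are infinite, the first is true but the subtraction yields NaN, so the second is false.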

Now take a look at the constant value 0.0f in the last statement above. Keep in mind the PowerPC has no instruction to move an immediate value into a floating point register, so each constant appearing in an expression means an additional load from memory. That's why it's a good idea to restructure expressions if possible to reduce or eliminate constants.

Here we can't easily avoid comparing with zero, but we can get rid of the load by doing something unconventional. We can replace 0.0f with a parameter named zero, which in this case will be passed in via FPR1. Then it's up to the caller to find an optimal way to provide the value 0.0f. For example, if the calling function has plenty of register variables available, 0.0f can be loaded into one of them near the top. Alternatively, the constant can be put some place in memory where it will be on the same cache line as some other data that's needed anyway.

You might think we could construct the constant 0.0f cheaply by subtracting any float value (e.g., a_c0) from itself. But this doesn't work if the value is NaN, because you end up with NaN instead of 0.0f.
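
A quick sketch of why the subtract-from-itself trick is unsafe. It assumes IEEE 754 arithmetic and no "fast math" style compiler options (which may fold x - x to zero); the function names are hypothetical.

```cpp
#include <cmath>

// The tempting way to make a zero without loading a constant.
static inline float zero_by_self_subtract( const float x )
{
    return x - x;
}

// True only if the trick actually produced a zero for this input.
static inline bool self_subtract_is_zero( const float x )
{
    const float diff = zero_by_self_subtract( x );
    return !std::isnan( diff ) && ( diff == 0.0f );
}
```

The trick fails for NaN inputs, and it also fails for infinities, since INF - INF is NaN as well.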

Anyway, let's also change the GCC_SPLIT_BLOCK macro so that we can inject comments into the asm output (to make it easier to track down our changes).

#define GCC_SPLIT_BLOCK(str)  __asm__( "//\n\t// " str "\n\t//\n" );

Here's the fourth version of the overlaps() method:

    bool overlaps(const Box& b, float zero) const
    {
      GCC_SPLIT_BLOCK("LOADS")

      const float a_c0 = m_center[0];
      const float a_c1 = m_center[1];
      const float a_c2 = m_center[2];
      const float a_e0 = m_extent[0];
      const float a_e1 = m_extent[1];
      const float a_e2 = m_extent[2];
      const float b_c0 = b.m_center[0];
      const float b_c1 = b.m_center[1];
      const float b_c2 = b.m_center[2];
      const float b_e0 = b.m_extent[0];
      const float b_e1 = b.m_extent[1];
      const float b_e2 = b.m_extent[2];

      GCC_SPLIT_BLOCK("CALCULATIONS")

      const float delta_c0     = a_c0 - b_c0;
      const float delta_c1     = a_c1 - b_c1;
      const float delta_c2     = a_c2 - b_c2;
      const float abs_delta_c0 = ::fabs( delta_c0 );
      const float abs_delta_c1 = ::fabs( delta_c1 );
      const float abs_delta_c2 = ::fabs( delta_c2 );
      const float sum_e0       = a_e0 + b_e0;
      const float sum_e1       = a_e1 + b_e1;
      const float sum_e2       = a_e2 + b_e2;
      const float overlap_0    = sum_e0 - abs_delta_c0;
      const float overlap_1    = sum_e1 - abs_delta_c1;
      const float overlap_2    = sum_e2 - abs_delta_c2;

      GCC_SPLIT_BLOCK("SELECT RESULT")

      const double temp_01   = ppc_fsels( overlap_1, overlap_0, overlap_1 );
      const double temp_012  = ppc_fsels( overlap_2, temp_01,   overlap_2 );
      const bool   result    = temp_012 >= zero;

      return (result);
    }

We'll also change the test_overlap function to add zero as a parameter:

bool
test_overlap( const Box& a, const Box& b, float zero )
{
  return a.overlaps( b, zero );
}

The output shows that we have reduced the cost for the comparisons significantly:

_Z12test_overlapRK3BoxS1_f:
    stwu 1,-16(1)

    //
    // LOADS
    //

    lfs 0,20(3)
    lfs 3,20(4)
    lfs 2,0(3)
    lfs 10,4(3)
    lfs 9,8(3)
    lfs 12,12(3)
    lfs 13,16(3)
    lfs 8,0(4)
    lfs 7,4(4)
    lfs 6,8(4)
    lfs 5,12(4)
    lfs 4,16(4)

    //
    // CALCULATIONS
    //

    fsubs 11,2,8
    fsubs 2,10,7
    fsubs 8,9,6
    fadds 7,12,5
    fadds 6,13,4
    fadds 5,0,3
    fabs 11,11
    fabs 10,2
    fabs 9,8
    fsubs 12,7,11
    fsubs 4,6,10
    fsubs 3,5,9

    //
    // SELECT RESULT
    //

    fsel 13, 4, 12, 4
    addi 1,1,16
    fsel 2, 3, 13, 3
    fmr 0,2
    fcmpu 7,1,0
    cror 30,28,30
    mfcr 3
    rlwinm 3,3,31,1
    blr

Right now our main problem is that we have 12 loads and we're not doing enough work to make up for that. Next we'll look at how to reduce loads.

Moving to VMX/Altivec

Always look for ways to reduce loads and stores. It's one of the most effective techniques for improving performance.

We're going to use the VXU (Altivec unit), which operates on 128-bit (16-byte) operands. A typical operand is a vector of 4 float values, of which we'll use 3. The compiler recognizes a set of vector data types and vector intrinsics.

Here are some of the main advantages to using Altivec:

More available registers - General purpose code will eat up most of your fixed point registers, making it more likely you'll need to keep dumping data on the stack.

Mixed integer and floating point - Mixing integer and floating point code, or converting between the two, is very expensive with scalar operations. This is because the only method of moving between the FXU (fixed point execution unit) and the FPU is through memory (typically the stack). This often creates a Load-Hit-Store data hazard event which will cause your processor to wait around until the register has been loaded. On the VXU you can freely use vector integer instructions on vector floating point values without penalty. There are also conversion instructions for your convenience.

Much higher throughput - This is really the whole point of a SIMD instruction set. One instruction works with 128 bit wide registers, so much more work can be done. Each instruction is also very fast.

Saturated arithmetic instructions - Saturated instructions are operations that basically cannot overflow or underflow. Any calculated value that is greater than the maximum value for the type of the vector component (8, 16 or 32 bits) is clamped to the maximum. Conversely, any value below the minimum is clamped to the minimum. This is extremely handy for any kind of fixed point math.

Bit manipulation on all types (permute, shift, rotate) - There is a large set of instructions for bit manipulation which you can apply to all the vector types. The permute instruction is a special instruction that lets you shuffle around the bytes in a vector. By itself, this instruction makes Altivec a win.
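
To illustrate the saturated arithmetic point, here is a scalar model of what a saturating add does to one unsigned 8 bit element. This is only a sketch of the per-element behavior (the kind of thing I'd expect from an intrinsic such as vec_adds on unsigned char vectors), not Altivec code, and add_sat_u8 is a hypothetical name.

```cpp
#include <cstdint>

// Scalar model of an unsigned 8 bit saturating add, written without branches.
static inline uint8_t add_sat_u8( const uint8_t a, const uint8_t b )
{
    const unsigned sum  = (unsigned)a + (unsigned)b;   // 0..510, cannot wrap
    const unsigned over = 0u - ( sum >> 8 );           // all ones if sum > 255, else 0
    return (uint8_t)( ( sum | over ) & 0xFFu );        // clamp to 255 on overflow
}
```

With ordinary modular arithmetic 200 + 100 wraps to 44; the saturated version returns 255, which is usually what you want for things like color components.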

For our current application (testing for overlap), I'm just going to remove Vector3 completely and opt for using the vector types directly. If I did have some reason to hide the vector types (cross platform code?) I would completely remove the following methods:

    Vector3(const float& x, const float& y, const float& z)
    {
      m_co[0] = x;
      m_co[1] = y;
      m_co[2] = z;
    }

    float&       operator[]( int i )       { return m_co[i]; }
    const float& operator[]( int i ) const { return m_co[i]; }

There's no fast way to implement these methods. They're the very antithesis of working with SIMD instructions.

Anyway, the conversion to Altivec is quite straightforward in this case. The fourth element (w) must be masked out or else initialized to zero in each vector.

The functions beginning with vec_ are vector intrinsics. Note that vec_all_ge returns an int, not a vector type value. Specifically, it returns 1 if all elements of the first vector argument are greater than or equal to the corresponding elements of the second vector argument.

Here, we don't need to pass in zero as a parameter, because we can easily build a zero vector using vec_splat_u8. I've also used int instead of bool as the return type of overlaps and test_overlap. That way, a calling function that needs to test multiple boxes can use bitwise logical operators (& and |) to avoid branches.
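
To show what the int return type buys the caller, here is a portable sketch (plain floats, no Altivec) in which the results of two overlap tests are combined with | and & rather than || and &&. The names BoxF, make_boxf, overlaps_int, overlaps_any2 and overlaps_all2 are hypothetical.

```cpp
#include <cmath>

// Hypothetical plain-float box, standing in for the vector version.
struct BoxF
{
    float center[3];
    float extent[3];
};

static inline BoxF make_boxf( float cx, float cy, float cz,
                              float ex, float ey, float ez )
{
    const BoxF box = { { cx, cy, cz }, { ex, ey, ez } };
    return box;
}

// Returns 1 or 0, so results can be combined with bitwise operators.
static inline int overlaps_int( const BoxF& a, const BoxF& b )
{
    const int in_0 = std::fabs( a.center[0] - b.center[0] ) <= a.extent[0] + b.extent[0];
    const int in_1 = std::fabs( a.center[1] - b.center[1] ) <= a.extent[1] + b.extent[1];
    const int in_2 = std::fabs( a.center[2] - b.center[2] ) <= a.extent[2] + b.extent[2];
    return in_0 & in_1 & in_2;
}

// Both tests are always evaluated; there is no short-circuit to branch on.
static inline int overlaps_any2( const BoxF& target, const BoxF& a, const BoxF& b )
{
    return overlaps_int( target, a ) | overlaps_int( target, b );
}

static inline int overlaps_all2( const BoxF& target, const BoxF& a, const BoxF& b )
{
    return overlaps_int( target, a ) & overlaps_int( target, b );
}
```

With || and && the second test is only evaluated conditionally, which implies a branch; with | and & the compiler is free to evaluate both and merge the results.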

Here's the fifth version of the code:

#include <altivec.h>

#define GCC_SPLIT_BLOCK(str)  __asm__( "//\n\t// " str "\n\t//\n" );

struct Box
{
    vector float m_v[2];

    enum
    {
      m_centerOffset = 0x00,
      m_extentOffset = 0x10
    };

    Box() {}

    Box(const vector float& center, const vector float& extent) 
    {
      vec_st( center, m_centerOffset, (vector float*)m_v );
      vec_st( extent, m_extentOffset, (vector float*)m_v );
    }

    int overlaps(const Box& b) const
    {
      GCC_SPLIT_BLOCK("LOADS")
      const vector float zero             = (vector float)vec_splat_u8( 0x00 );
      const vector float a_c              = vec_ld( m_centerOffset, (vector float*)m_v );
      const vector float a_e              = vec_ld( m_extentOffset, (vector float*)m_v );
      const vector float b_c              = vec_ld( m_centerOffset, (vector float*)b.m_v );
      const vector float b_e              = vec_ld( m_extentOffset, (vector float*)b.m_v );
      GCC_SPLIT_BLOCK("CALCULATE RESULT")
      const vector float delta_c          = vec_sub( a_c, b_c );
      const vector float abs_delta_c      = vec_abs( delta_c );
      const vector float sum_e            = vec_add( a_e, b_e );
      const vector float overlap          = vec_sub( sum_e, abs_delta_c );
      const int          result           = vec_all_ge( overlap, zero );

      return (result);
    }
};

int
test_overlap( const Box& a, const Box& b )
{
  return a.overlaps( b );
}

This straightforward translation to vector types reduces the number of loads from 12 to 4:

_Z12test_overlapRK3BoxS1_:
    stwu 1,-16(1)

    //
    // LOADS
    //

    li 0,16
    vspltisb 11,0
    lvx 12,4,0
    lvx 1,3,0
    lvx 0,0,3
    lvx 13,0,4

    //
    // CALCULATE RESULT
    //

    vsubfp 0,0,13
    addi 1,1,16
    vaddfp 1,1,12
    vspltisw 13,-1
    vslw 12,13,13
    vandc 0,0,12
    vsubfp 1,1,0
    vcmpgefp. 11,1,11
    mfcr 3
    rlwinm 3,3,25,1
    blr

This is fine for doing a single overlap test. But what if we need to perform a great many tests for overlap? That's where Altivec really shines.

Doing Four Overlap Tests At Once

We'll be declaring a struct box4 representing 4 boxes. The following uniform vector layout will be used, where K is a box4 and J is the corresponding array of 4 Box objects:

K.center_x = { J[0].m_center[0], J[1].m_center[0], J[2].m_center[0], J[3].m_center[0] }
K.center_y = { J[0].m_center[1], J[1].m_center[1], J[2].m_center[1], J[3].m_center[1] }
K.center_z = { J[0].m_center[2], J[1].m_center[2], J[2].m_center[2], J[3].m_center[2] }
K.extent_x = { J[0].m_extent[0], J[1].m_extent[0], J[2].m_extent[0], J[3].m_extent[0] }
K.extent_y = { J[0].m_extent[1], J[1].m_extent[1], J[2].m_extent[1], J[3].m_extent[1] }
K.extent_z = { J[0].m_extent[2], J[1].m_extent[2], J[2].m_extent[2], J[3].m_extent[2] }
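
The layout above is just a transpose of four boxes. As a portable sketch of the repacking (plain float arrays standing in for the vector types), something like the following would build K from J. BoxScalar, Box4Scalar, box4_pack and box4_pack_demo are hypothetical names, and a real version would want to do this once when the data is built, not per test.

```cpp
// Hypothetical scalar stand-ins: one box (array-of-structures element) ...
struct BoxScalar
{
    float m_center[3];
    float m_extent[3];
};

// ... and four boxes in the uniform layout, one float[4] per vector.
struct Box4Scalar
{
    float center_x[4];
    float center_y[4];
    float center_z[4];
    float extent_x[4];
    float extent_y[4];
    float extent_z[4];
};

// Transpose four boxes: element i of each array comes from J[i].
static inline Box4Scalar box4_pack( const BoxScalar J[4] )
{
    Box4Scalar K;
    for ( int i = 0; i < 4; ++i )
    {
        K.center_x[i] = J[i].m_center[0];
        K.center_y[i] = J[i].m_center[1];
        K.center_z[i] = J[i].m_center[2];
        K.extent_x[i] = J[i].m_extent[0];
        K.extent_y[i] = J[i].m_extent[1];
        K.extent_z[i] = J[i].m_extent[2];
    }
    return K;
}

// Demo input: box i, axis n gets center 10*i + n and extent 100 + 10*i + n,
// so every value identifies where it came from.
static inline Box4Scalar box4_pack_demo( void )
{
    BoxScalar J[4];
    for ( int i = 0; i < 4; ++i )
    for ( int axis = 0; axis < 3; ++axis )
    {
        J[i].m_center[axis] = (float)( 10 * i + axis );
        J[i].m_extent[axis] = (float)( 100 + 10 * i + axis );
    }
    return box4_pack( J );
}
```

In the demo, K.center_y[2] holds the y center of box 2 and K.extent_z[1] holds the z extent of box 1.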

The new function box4_overlaps accepts two box4 pointers as parameters, and returns a signed int vector of overlap test results. Specifically, the Nth element of the return vector will be -1 if the Nth element of the first box4 overlaps the Nth element of the second box4. It will be 0 otherwise.

Once again we use vec_splat_u8 to build a zero vector, so we don't need zero passed in as a parameter.

Here's the sixth version of the code:

#include <altivec.h>

typedef struct box4 box4;

struct box4
{
  vector float center_x;
  vector float center_y;
  vector float center_z;
  vector float extent_x;
  vector float extent_y;
  vector float extent_z;
};

vector signed int
box4_overlaps( box4* const a, box4* const b )
{
  const vector float      zero       = (vector float)vec_splat_u8( 0x00 );
  const vector float      acx        = vec_ld( 0x00, &a->center_x );
  const vector float      acy        = vec_ld( 0x00, &a->center_y );
  const vector float      acz        = vec_ld( 0x00, &a->center_z );
  const vector float      aex        = vec_ld( 0x00, &a->extent_x );
  const vector float      aey        = vec_ld( 0x00, &a->extent_y );
  const vector float      aez        = vec_ld( 0x00, &a->extent_z );
  const vector float      bcx        = vec_ld( 0x00, &b->center_x );
  const vector float      bcy        = vec_ld( 0x00, &b->center_y );
  const vector float      bcz        = vec_ld( 0x00, &b->center_z );
  const vector float      bex        = vec_ld( 0x00, &b->extent_x );
  const vector float      bey        = vec_ld( 0x00, &b->extent_y );
  const vector float      bez        = vec_ld( 0x00, &b->extent_z );
  const vector float      dx         = vec_sub( acx, bcx );
  const vector float      dy         = vec_sub( acy, bcy );
  const vector float      dz         = vec_sub( acz, bcz );
  const vector float      abs_dx     = vec_abs( dx );
  const vector float      abs_dy     = vec_abs( dy );
  const vector float      abs_dz     = vec_abs( dz );
  const vector float      sum_ex     = vec_add( aex, bex );
  const vector float      sum_ey     = vec_add( aey, bey );
  const vector float      sum_ez     = vec_add( aez, bez );
  const vector float      overlap_x  = vec_sub( sum_ex, abs_dx );
  const vector float      overlap_y  = vec_sub( sum_ey, abs_dy );
  const vector float      overlap_z  = vec_sub( sum_ez, abs_dz );
  const vector signed int result_x   = vec_cmpge( overlap_x, zero );
  const vector signed int result_y   = vec_cmpge( overlap_y, zero );
  const vector signed int result_z   = vec_cmpge( overlap_z, zero );
  const vector signed int result_xy  = vec_and( result_x, result_y );
  const vector signed int result_xyz = vec_and( result_xy, result_z );

  return (result_xyz);
}
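
For checking results on a machine without Altivec, a scalar reference that produces the same kind of per-lane mask can be useful. The sketch below is a model of the function above, not a replacement for it; Box4Scalar, Mask4, box4_overlaps_ref and box4_overlaps_demo are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical scalar stand-in for box4: one float[4] per vector.
struct Box4Scalar
{
    float center_x[4], center_y[4], center_z[4];
    float extent_x[4], extent_y[4], extent_z[4];
};

// Stand-in for the vector signed int result: -1 (all bits set) or 0 per lane.
struct Mask4
{
    int32_t lane[4];
};

static inline Mask4 box4_overlaps_ref( const Box4Scalar& a, const Box4Scalar& b )
{
    Mask4 result;
    for ( int i = 0; i < 4; ++i )
    {
        const float overlap_x = ( a.extent_x[i] + b.extent_x[i] ) - std::fabs( a.center_x[i] - b.center_x[i] );
        const float overlap_y = ( a.extent_y[i] + b.extent_y[i] ) - std::fabs( a.center_y[i] - b.center_y[i] );
        const float overlap_z = ( a.extent_z[i] + b.extent_z[i] ) - std::fabs( a.center_z[i] - b.center_z[i] );

        // 1 if the boxes in lane i overlap on all three axes, else 0.
        const int32_t all_in = ( overlap_x >= 0.0f ) & ( overlap_y >= 0.0f ) & ( overlap_z >= 0.0f );

        result.lane[i] = -all_in;   // 1 -> -1 (all bits set), 0 -> 0
    }
    return result;
}

// Demo: lane i of a is a unit-extent box at the origin; lane i of b is a
// unit-extent box whose center is at x = 1.5 * i.
static inline Mask4 box4_overlaps_demo( void )
{
    Box4Scalar a = {};
    Box4Scalar b = {};
    for ( int i = 0; i < 4; ++i )
    {
        a.extent_x[i] = a.extent_y[i] = a.extent_z[i] = 1.0f;
        b.extent_x[i] = b.extent_y[i] = b.extent_z[i] = 1.0f;
        b.center_x[i] = 1.5f * (float)i;
    }
    return box4_overlaps_ref( a, b );
}
```

In the demo, the centers are 0, 1.5, 3 and 4.5 apart against a combined extent of 2, so the first two lanes report overlap and the last two do not.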

The compiler output shows 12 loads, but we're doing 4 overlap tests instead of 1, so we've nearly quadrupled the performance compared to the fourth version.

box4_overlaps:
    addi 12,3,16
    addi 5,4,16
    stwu 1,-16(1)
    lvx 1,0,4
    addi 9,3,32
    lvx 11,0,12
    addi 11,3,48
    lvx 7,0,5
    addi 10,3,64
    lvx 0,0,3
    vsubfp 2,11,7
    vsubfp 0,0,1
    addi 8,4,32
    addi 7,4,48
    lvx 7,0,10
    addi 6,4,64
    lvx 9,0,9
    lvx 10,0,6
    addi 3,3,80
    lvx 8,0,11
    addi 4,4,80
    lvx 13,0,8
    addi 1,1,16
    lvx 12,0,7
    vsubfp 1,9,13
    vaddfp 9,8,12
    lvx 13,0,4
    vaddfp 8,7,10
    lvx 12,0,3
    vaddfp 12,12,13
    vspltisw 10,-1
    vslw 7,10,10
    vandc 0,0,7
    vspltisw 13,-1
    vslw 10,13,13
    vandc 11,2,10
    vspltisw 13,-1
    vslw 10,13,13
    vandc 2,1,10
    vsubfp 10,9,0
    vsubfp 9,8,11
    vsubfp 1,12,2
    vspltisb 7,0
    vcmpgefp 8,10,7
    vcmpgefp 11,9,7
    vcmpgefp 2,1,7
    vand 0,8,11
    vand 2,0,2
    blr

Synergistic Processor Unit

The CBE has eight Synergistic Processor Units (SPUs) that are designed for computation-intensive tasks. Suppose we want our application to run on an SPU. How can we adapt the overlap test function for the SPU environment?

We'll use the same box4 structure as in the last example. The SPU compiler recognizes vector intrinsics (beginning with si_) that are similar to those of the VXU, but not identical. Here are some of the differences that have a direct bearing on the problem at hand:

1. The return from a vector comparison is a vector unsigned int, instead of a vector signed int. A value of true is still all bits set in each element, so it reads as 0xFFFFFFFF instead of -1.
2. The SPU has no instruction for absolute value. We can calculate the absolute value of a vector float operand via a sequence of two instructions: si_shli (shift left immediate) and si_rotmi (rotate and mask immediate). The term rotate is misleading. In effect, si_rotmi(v, -n) is a logical shift right of each element of v by n bits.
3. The SPU can't directly test whether one vector float operand is greater than or equal to another. So we'll use si_fcgt (vector float greater than) with operands reversed to perform a "less than" test. Then we'll invert the result with si_nor.
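
The three differences above can be modelled one lane at a time in plain C. This is a minimal sketch on a single 32-bit lane, not SPU code: the helper names (float_bits, bits_float, lane_abs, lane_fcgt, lane_ge) are made up for illustration, and true is modelled as all bits set, which is what the si_nor inversion relies on.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy a float's bits into an unsigned integer without type-punned pointers. */
static uint32_t float_bits( float f )
{
  uint32_t u;
  assert( sizeof( float ) == sizeof( uint32_t ) );
  memcpy( &u, &f, sizeof u );
  return u;
}

/* Copy 32 bits back into a float. */
static float bits_float( uint32_t u )
{
  float f;
  memcpy( &f, &u, sizeof f );
  return f;
}

/* Models si_shli( v, 1 ) followed by si_rotmi( v, -1 ): shifting left and then
   logically right by one bit clears the sign bit. */
static float lane_abs( float v )
{
  const uint32_t without_sign = float_bits( v ) << 1;
  return bits_float( without_sign >> 1 );
}

/* Models one lane of si_fcgt: all bits set when a > b, zero otherwise. */
static uint32_t lane_fcgt( float a, float b )
{
  return ( a > b ) ? 0xFFFFFFFFu : 0x00000000u;
}

/* a >= b built as NOT( b > a ), i.e. si_fcgt with reversed operands
   followed by si_nor of the result with itself. */
static uint32_t lane_ge( float a, float b )
{
  const uint32_t less_than = lane_fcgt( b, a );
  return ~( less_than | less_than );
}
```

For instance, lane_abs( -2.5f ) gives 2.5f, and lane_ge( 1.0f, 1.0f ) gives a mask with all bits set.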

The data type qword means a quadword (128 bits = 16 bytes) with unspecified structure. It could be a vector float, or a vector unsigned int, or some other vector type. It could even be a scalar kept in the first 32 bits, with the remaining 96 bits unused. For example, the first parameter of si_lqd is a qword with an address in the first 32 bits and the remaining 96 bits unused. The function si_from_uint casts an unsigned int to a qword. It doesn't generate any actual machine instructions.

Here's the seventh and final version of the code:

#include <spu_intrinsics.h>

typedef struct box4 box4;

struct box4
{
  vector float center_x;
  vector float center_y;
  vector float center_z;
  vector float extent_x;
  vector float extent_y;
  vector float extent_z;
};

vector unsigned int
box4_overlaps( box4* const a, box4* const b )
{
  const qword zero       = si_il( 0 );
  const qword a_addr     = si_from_uint( (unsigned int) a );
  const qword b_addr     = si_from_uint( (unsigned int) b );
  const qword acx        = si_lqd( a_addr, 0x00 );
  const qword acy        = si_lqd( a_addr, 0x10 );
  const qword acz        = si_lqd( a_addr, 0x20 );
  const qword aex        = si_lqd( a_addr, 0x30 );
  const qword aey        = si_lqd( a_addr, 0x40 );
  const qword aez        = si_lqd( a_addr, 0x50 );
  const qword bcx        = si_lqd( b_addr, 0x00 );
  const qword bcy        = si_lqd( b_addr, 0x10 );
  const qword bcz        = si_lqd( b_addr, 0x20 );
  const qword bex        = si_lqd( b_addr, 0x30 );
  const qword bey        = si_lqd( b_addr, 0x40 );
  const qword bez        = si_lqd( b_addr, 0x50 );
  const qword dx         = si_fs( acx, bcx ); 
  const qword dy         = si_fs( acy, bcy );
  const qword dz         = si_fs( acz, bcz );
  const qword uns_dx     = si_shli( dx, 1 );
  const qword uns_dy     = si_shli( dy, 1 );
  const qword uns_dz     = si_shli( dz, 1 );
  const qword abs_dx     = si_rotmi( uns_dx, -1 );
  const qword abs_dy     = si_rotmi( uns_dy, -1 );
  const qword abs_dz     = si_rotmi( uns_dz, -1 );
  const qword sum_ex     = si_fa( aex, bex );
  const qword sum_ey     = si_fa( aey, bey );
  const qword sum_ez     = si_fa( aez, bez );
  const qword overlap_x  = si_fs( sum_ex, abs_dx );
  const qword overlap_y  = si_fs( sum_ey, abs_dy );
  const qword overlap_z  = si_fs( sum_ez, abs_dz );
  /* result_x/y/z are true where the boxes are separated on that axis. */
  const qword result_x   = si_fcgt( zero, overlap_x );
  const qword result_y   = si_fcgt( zero, overlap_y );
  const qword result_z   = si_fcgt( zero, overlap_z );
  /* Separation on any one axis rules out an overlap, so the three masks
     are ORed together before the final inversion. */
  const qword result_xy  = si_or( result_x, result_y );
  const qword result_xyz = si_or( result_xy, result_z );
  const qword inv_result = si_nor( result_xyz, result_xyz );

  return (vector unsigned int)(inv_result);
}

The SPU compiler output shows that there's practically a one-to-one correspondence between C statements and machine instructions. The compiler has done some reordering, but that shouldn't be a problem here.

box4_overlaps:
    hbr .L2,$lr
    lnop
    il $14,0
    lqd $34,16($3)
    lqd $35,16($4)
    lqd $32,0($3)
    lqd $33,0($4)
    lqd $30,32($3)
    lqd $31,32($4)
    lqd $27,64($3)
    fs $29,$34,$35
    lqd $28,64($4)
    nop $127
    lqd $24,48($3)
    fs $26,$32,$33
    lqd $25,48($4)
    lqd $20,80($3)
    fs $23,$30,$31
    lnop
    lnop
    shli $22,$29,1
    lqd $21,80($4)
    fa $16,$27,$28
    shli $19,$26,1
    fa $13,$24,$25
    shli $18,$23,1
    rotmi $17,$22,-1
    fa $11,$20,$21
    rotmi $15,$19,-1
    rotmi $12,$18,-1
    fs $10,$16,$17
    fs $8,$13,$15
    fs $7,$11,$12
    fcgt $6,$14,$10
    fcgt $5,$14,$8
    fcgt $9,$14,$7
    or $4,$6,$5
    or $3,$4,$9
    nor $2,$3,$3
    ori $3,$2,0
    nop $127
.L2:
    bi $lr

Additional Reading

Basic Altivec references:
Altivec Instruction Cross Reference, Apple (HTML)
Altivec Programming Environments Manual, Freescale (PDF)
Altivec Programmer's Interface Manual, Freescale (PDF)

Useful Altivec introductions and tutorials:
Understanding SIMD, Apple (HTML)
Altivec Tutorial, Apple (HTML)
Altivec Tutorial, Ian Ollmann (PDF)
Practical Altivec Strategies, Ian Ollmann (PDF)
Unrolling Altivec, Peter Seebach (HTML)
AltiVec Revealed, Tom Thompson

Basic SPU references:
SPU C/C++ Language Extensions (PDF)
SPU Instruction Set Architecture (PDF)

A 4x4 Matrix Inverse

2006-06-04T05:25:26Z

GUEST ARTICLE! Cédric Lallain is a Frenchman who has been working with me on Cell/PS3 research at Highmoon Studios in Carlsbad, CA. I hope that this is only the first of many contributions to the community by Cédric. Welcome aboard! -- Mike.

Inverse matrix on PPU and on SPU using SIMD instructions.

This article shows how to convert scalar code to SIMD code for the PPU and SPU, using the matrix inverse as an example.

Most of the time in video games, programmers do not compute a general matrix inverse because it is too expensive. Instead, they assume the matrix is orthonormal and simply transpose the 3x3 rotation part, with a dot product for each component of the translation. Sometimes, however, the full inverse algorithm is necessary.
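
The orthonormal shortcut can be sketched as follows. This is only an illustration under stated assumptions: the matrix is column-major with the same layout as the s_matrix type used later in this article, its upper-left 3x3 block is a pure rotation, and its bottom row is (0, 0, 0, 1). The function name inverse_orthonormal is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct s_vector { float row[4]; } s_vector;
typedef struct s_matrix { s_vector cols[4]; } s_matrix;

/* Inverse of a rigid transform [R | t]: the rotation block is transposed and
   the translation becomes -(R^T * t). Not valid for a general matrix. */
static void inverse_orthonormal( s_matrix* out, const s_matrix* m )
{
  int c, r;

  assert( out != NULL && m != NULL && out != m );

  /* 3x3 transpose of the rotation part. */
  for ( c = 0; c < 3; c++ )
  {
    for ( r = 0; r < 3; r++ )
    {
      out->cols[c].row[r] = m->cols[r].row[c];
    }
    out->cols[c].row[3] = 0.0f;
  }

  /* One dot product per component of the new translation. */
  for ( r = 0; r < 3; r++ )
  {
    out->cols[3].row[r] = -( m->cols[r].row[0] * m->cols[3].row[0] +
                             m->cols[r].row[1] * m->cols[3].row[1] +
                             m->cols[r].row[2] * m->cols[3].row[2] );
  }
  out->cols[3].row[3] = 1.0f;
}
```

The whole operation costs nine copies and three dot products, which is why it is preferred whenever the orthonormal assumption holds.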

The main goal is to compute the inverse as fast as possible, which is why the code should use SIMD instructions wherever it can.

A vector is an instruction operand containing a set of data elements packed into a one-dimensional array. The elements can be fixed-point or floating-point values. Most Vector/SIMD Multimedia Extension and SPU instructions operate on vector operands. Vectors are also called Single-Instruction, Multiple-Data (SIMD) operands, or packed operands.
SIMD processing exploits data-level parallelism. Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel.

[Chapter 2.5.1 in the released pdf by IBM: Cell Broadband Engine Programming Handbook [ibm.com]].

Each SPE is a 128-bit RISC processor specialized for data-rich, compute-intensive SIMD and scalar applications.

[Chapter 3 in the released pdf by IBM: Cell Broadband Engine Programming Handbook [ibm.com]].

The number of branches should also be kept to a strict minimum, because every extra branch slows down the final solution. The first step is to choose the algorithm best suited to these objectives. Several algorithms exist to invert a matrix:

The Gauss-Jordan elimination: Gauss-Jordan elimination finds the inverse matrix by solving a system of linear equations. A good explanation of how this algorithm works can be found in the book "Numerical Recipes in C" [library.cornell.edu], chapter 2.1.
For a visual demonstration using a Java applet see: Gauss-Jordan Elimination [cse.uiuc.edu]. In this algorithm, choosing a good pivot is critical. To do so, all the floating-point values of a given column have to be compared with each other, one by one, which is a poor fit for SIMD code.
While the algorithm runs, some multiplications are done between columns (e.g. to apply the pivot) and other operations between rows (e.g. to apply the multiplier to the rest of the matrix). This requires extra code to swap rows and columns in order to use SIMD instructions.

Inversion using LU decomposition: The description of the inverse calculation can be found in "Numerical Recipes in C" [library.cornell.edu] chapter 2.3.

In linear algebra, a block LU decomposition is a decomposition of a block matrix into a lower block triangular matrix L and an upper block triangular matrix U. This decomposition is used in numerical analysis to reduce the complexity of the block matrix formula.

[Block LU decomposition [wikipedia.org]]

This algorithm would probably be very useful if the matrix were 8x8. For a 4x4 matrix, it requires doing the calculation two floats at a time, whereas a vector type contains four.

Inversion by Partitioning: To invert a matrix A (size N) by partitioning, the matrix is partitioned into:

       |  A0    A1  |
   A = |            | with A0 and A3 square matrices of respective sizes
       |  A2    A3  |                s0 and s3, where s0 + s3 = N

The inverse is

          |  B0    B1  |
   InvA = |            |
          |  B2    B3  |

with:

  B0 = Inv(A0 - A1 * InvA3 * A2)
  B1 = - B0 * (A1 * InvA3)
  B2 = - (InvA3 * A2) * B0
  B3 = InvA3 - B2 * (A1 * InvA3)

More information can be found in "Numerical Recipes in C" [library.cornell.edu] chapter 2.7
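
The partitioning formulas can be checked on the smallest possible case, a 2x2 matrix whose four blocks are single floats. This is a sketch for illustration only: the blocks2x2 type and both function names are hypothetical. Because B2 carries its own leading minus sign, the last block is InvA3 minus B2 * (A1 * InvA3).

```c
#include <assert.h>

/* A 2x2 matrix seen as four 1x1 blocks:  | a0 a1 |
                                          | a2 a3 |  */
typedef struct blocks2x2 { float a0, a1, a2, a3; } blocks2x2;

/* Inversion by partitioning with scalar blocks. */
static blocks2x2 invert_by_partitioning( blocks2x2 a )
{
  blocks2x2 b;
  float     inv_a3;

  assert( a.a3 != 0.0f );
  inv_a3 = 1.0f / a.a3;

  b.a0 = 1.0f / ( a.a0 - a.a1 * inv_a3 * a.a2 );
  b.a1 = -b.a0 * ( a.a1 * inv_a3 );
  b.a2 = -( inv_a3 * a.a2 ) * b.a0;
  b.a3 = inv_a3 - b.a2 * ( a.a1 * inv_a3 );
  return b;
}

/* Returns 1 when x and y differ by less than a small tolerance. */
static int nearly_equal( float x, float y )
{
  float diff = x - y;
  if ( diff < 0.0f ) diff = -diff;
  return diff < 1e-5f;
}
```

With a = { 4, 7, 2, 6 } the determinant is 10, and the result matches the textbook inverse { 0.6, -0.7, -0.2, 0.4 } within floating-point tolerance.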

The issue described above for the LU decomposition also applies here: the idea is to work on four floats at a time, not only two.

Using the inverse formula ( (1/det(M)) * Transpose(Cofactor(M))): Check the article about Matrix Inverse [mathworld.wolfram.com] for more information about this formula.

This is the algorithm that will be used to invert the matrix. Each step factors well: the operations can be grouped and replaced by SIMD instructions.
The most critical part of this algorithm is the calculation of all the cofactors. This part also has two great advantages for our objectives. It is pure calculation, which allows writing code without branching. And all cofactor values are computed the same way, in parallel and independently of each other. This is a perfect place to use SIMD instructions.

This article starts with a basic implementation of the inverse formula using scalar instructions. That code is then modified to prepare the SIMD version. The first SIMD version is written for the PPU. The final one is a conversion using the SPU intrinsic instruction set.

A 4x4 matrix inverse

The general formula is:

   InvM = (1/det(M)) * Transpose(Cofactor(M))

which can also be written:

   InvM = (1/det(M)) * Adjoint(M) with
   Adjoint(M) = Transpose(Cofactor(M))

For the scalar version, the matrix is defined as follows:

  typedef struct s_vector
  {
    float row[4];
  } s_vector;

  typedef struct s_matrix
  {
    s_vector cols[4];
  } s_matrix;

The first version of the code does a standard implementation of the formula. The inverse function calls the cofactor function which computes and returns the cofactor matrix.

Definition 1 - If A is a square matrix then the minor of a(i,j), denoted by M(i,j), is the determinant of the submatrix that results from removing the ith row and jth column of A.
Definition 2 - If A is a square matrix then the cofactor of a(i,j), denoted by C(i,j), is the number ((-1)^(i+j))*M(i,j).

[from The method of Cofactors [tutorial.math.lamar.edu]]

Once the cofactor matrix is computed, the result is used to calculate the determinant and also the adjoint matrix.

Theorem 1 - if A is a matrix.

Choose any row, say row i, then,
- det(A) = a(i,1)C(i,1) + a(i,2)C(i,2) + ... + a(i,n)C(i,n)
Choose any column, say column j, then,
- det(A) = a(1,j)C(1,j) + a(2,j)C(2,j) + ... + a(n,j)C(n,j)

The adjoint of A is the transpose of the matrix of cofactors and is denoted by adj(A).

[from The method of Cofactors [tutorial.math.lamar.edu]]

From there, the inverse matrix is just a division of the adjoint matrix by the determinant.

The full code is available here: inverse_v1.h
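
The steps described above (cofactors, determinant by row expansion, transpose, division) can be sketched in plain C as follows. This is not the content of inverse_v1.h, which is not reproduced here: the function names minor_ij and inverse_scalar are hypothetical, and the code favours readability over speed.

```c
#include <assert.h>

typedef struct s_vector { float row[4]; } s_vector;
typedef struct s_matrix { s_vector cols[4]; } s_matrix;

/* Determinant of the 3x3 submatrix left after removing one row and one column. */
static float minor_ij( const s_matrix* m, int skip_row, int skip_col )
{
  float e[3][3];
  int   r, c, rr, cc;

  for ( r = 0, rr = 0; r < 4; r++ )
  {
    if ( r == skip_row ) continue;
    for ( c = 0, cc = 0; c < 4; c++ )
    {
      if ( c == skip_col ) continue;
      e[rr][cc++] = m->cols[c].row[r];
    }
    rr++;
  }

  return e[0][0] * ( e[1][1] * e[2][2] - e[1][2] * e[2][1] )
       - e[0][1] * ( e[1][0] * e[2][2] - e[1][2] * e[2][0] )
       + e[0][2] * ( e[1][0] * e[2][1] - e[1][1] * e[2][0] );
}

/* Returns 0 when the matrix is singular, 1 otherwise. */
static int inverse_scalar( s_matrix* out, const s_matrix* m )
{
  float cof[4][4];   /* cof[row][col] */
  float det = 0.0f;
  int   r, c;

  assert( out != m );

  for ( r = 0; r < 4; r++ )
  {
    for ( c = 0; c < 4; c++ )
    {
      const float sign = ( ( r + c ) & 1 ) ? -1.0f : 1.0f;
      cof[r][c] = sign * minor_ij( m, r, c );
    }
  }

  /* Theorem 1: expansion along row 0. */
  for ( c = 0; c < 4; c++ )
  {
    det += m->cols[c].row[0] * cof[0][c];
  }
  if ( det == 0.0f ) return 0;

  /* Adjoint = transpose of the cofactor matrix, then divide by det. */
  for ( r = 0; r < 4; r++ )
  {
    for ( c = 0; c < 4; c++ )
    {
      out->cols[c].row[r] = cof[c][r] / det;
    }
  }
  return 1;
}
```

The loops and the sign test are exactly what the following sections remove, first by unrolling and then by replacing groups of scalar operations with SIMD instructions.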

Toward the SIMD

Even for scalar code, it's better to unroll the loops: this gives the compiler more options for optimization and also gets rid of the branches. This rule is especially true for small loops with few iterations, like:

    for ( col = 0 ; col < 4 ; col++ )
    {
        for ( row = 0; row < 4; row++ )
        {
            output->cols[col].row[row] =  source->cols[col].row[row] * factor;
        }
    }

The second reason for this refactoring is to locate the SIMD blocks. Unrolling the multiplication is straightforward. The same changes can be applied to the transpose and the determinant functions.
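
As an illustration, the small scaling loop could be unrolled like this. It is a sketch, not the code from inverse_v2.h: the function name scale_matrix_unrolled is hypothetical. Each group of four statements maps onto a single SIMD multiply later on.

```c
#include <assert.h>
#include <stddef.h>

typedef struct s_vector { float row[4]; } s_vector;
typedef struct s_matrix { s_vector cols[4]; } s_matrix;

/* Multiply every element by factor, with both loops fully unrolled. */
static void scale_matrix_unrolled( s_matrix* output, const s_matrix* source, float factor )
{
  assert( output != NULL && source != NULL );

  /* Column 0 */
  output->cols[0].row[0] = source->cols[0].row[0] * factor;
  output->cols[0].row[1] = source->cols[0].row[1] * factor;
  output->cols[0].row[2] = source->cols[0].row[2] * factor;
  output->cols[0].row[3] = source->cols[0].row[3] * factor;

  /* Column 1 */
  output->cols[1].row[0] = source->cols[1].row[0] * factor;
  output->cols[1].row[1] = source->cols[1].row[1] * factor;
  output->cols[1].row[2] = source->cols[1].row[2] * factor;
  output->cols[1].row[3] = source->cols[1].row[3] * factor;

  /* Column 2 */
  output->cols[2].row[0] = source->cols[2].row[0] * factor;
  output->cols[2].row[1] = source->cols[2].row[1] * factor;
  output->cols[2].row[2] = source->cols[2].row[2] * factor;
  output->cols[2].row[3] = source->cols[2].row[3] * factor;

  /* Column 3 */
  output->cols[3].row[0] = source->cols[3].row[0] * factor;
  output->cols[3].row[1] = source->cols[3].row[1] * factor;
  output->cols[3].row[2] = source->cols[3].row[2] * factor;
  output->cols[3].row[3] = source->cols[3].row[3] * factor;
}
```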

The following section details the code of the cofactor matrix. The second scalar version can be found here: inverse_v2.h

The case of the cofactor matrix

To avoid confusion, in the second scalar version the helper function is called 'cofactor_column_v2' instead of 'cofactor_ij_v1'. It computes a whole column of cofactors rather than one cofactor at a time.

The new cofactor code is:

    cofactor_column_v2(output->cols[0].row, source, 0);
    cofactor_column_v2(output->cols[1].row, source, 1);
    cofactor_column_v2(output->cols[2].row, source, 2);
    cofactor_column_v2(output->cols[3].row, source, 3);

Inside cofactor_column_v2, the rows are grouped together to have a better view of what to do to convert this into SIMD code:

    const float r0_pos_part1 = mat->cols[col0].row[1] * 
                               mat->cols[col1].row[2] * 
                               mat->cols[col2].row[3];
                               
    const float r1_pos_part1 = mat->cols[col0].row[2] * 
                               mat->cols[col1].row[3] * 
                               mat->cols[col2].row[0];
                               
    const float r2_pos_part1 = mat->cols[col0].row[3] * 
                               mat->cols[col1].row[0] * 
                               mat->cols[col2].row[1];
                               
    const float r3_pos_part1 = mat->cols[col0].row[0] * 
                               mat->cols[col1].row[1] * 
                               mat->cols[col2].row[2];

The row indices clearly follow a pattern. Writing r0_pos_part1 as follows:

    r[0]_pos_part1 = mat->cols[c0].row[r0] * 
                     mat->cols[c1].row[r1] * 
                     mat->cols[c2].row[r2]

the next rows can be written like this:

    r[N]_pos_part1 = mat->cols[c0].row[(r0+N)&3] * 
                     mat->cols[c1].row[(r1+N)&3] * 
                     mat->cols[c2].row[(r2+N)&3]

The same relation holds in all positive and negative parts of the calculation. In other words, to calculate the different parts of the 3x3 determinants for a given column, the three other columns are multiplied together after each has been rotated by a specific amount.

Those 3x3 determinants, also called the minors of the matrix, then need to have their signs adjusted.
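
The index rotation can be written as a small loop to make the pattern explicit. This is a sketch only: the function name rotated_products and its parameter list are hypothetical, and it computes just one of the product terms (the one called pos_part1 above) for the four rows of a column.

```c
#include <assert.h>
#include <stddef.h>

typedef struct s_vector { float row[4]; } s_vector;
typedef struct s_matrix { s_vector cols[4]; } s_matrix;

/* out[N] = cols[c0].row[(r0+N)&3] * cols[c1].row[(r1+N)&3] * cols[c2].row[(r2+N)&3]
   for N = 0..3. In SIMD terms: rotate three columns, then multiply them. */
static void rotated_products( float out[4], const s_matrix* mat,
                              int c0, int c1, int c2,
                              int r0, int r1, int r2 )
{
  int n;

  assert( out != NULL && mat != NULL );

  for ( n = 0; n < 4; n++ )
  {
    out[n] = mat->cols[c0].row[( r0 + n ) & 3] *
             mat->cols[c1].row[( r1 + n ) & 3] *
             mat->cols[c2].row[( r2 + n ) & 3];
  }
}
```

Calling it with columns 1, 2, 3 and starting rows 1, 2, 3 reproduces the four r0..r3_pos_part1 expressions listed above.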

Following the idea of converting the code using SIMD instructions, two variables have been created:

    static const unsigned int znzn[] = { 0x00000000, 0x80000000, 0x00000000, 0x80000000 };
    static const unsigned int nznz[] = { 0x80000000, 0x00000000, 0x80000000, 0x00000000 };

They contain the two possible sign masks for a whole column. When the column number is even, znzn is the mask to select; when it is odd, nznz is the one to choose.

The basic way to select the correct variable (and probably the most common one nowadays) would be an 'if'. As indicated at the beginning of this article, the 'if' statement is something to avoid as much as possible because it generates branches. The solution is to build a mask (col_mask) and do the selection with it:

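    /* col_mask is all ones when col is odd and zero when it is even
       (this relies on an arithmetic right shift of a signed int). */
    const unsigned int col_mask   = (const unsigned int)(((const int)((col & 1) << 31)) >> 31);
    /* The table addresses are handled as 32-bit integers (32-bit pointers assumed). */
    const unsigned int u_znzn     = (const unsigned int)(&znzn[0]);
    const unsigned int u_nznz     = (const unsigned int)(&nznz[0]);
    union 
    {
        unsigned int  u;
        unsigned int *p;
    } mask;
    /* Selects nznz for odd columns and znzn for even ones, without a branch. */
    mask.u = (u_nznz&col_mask)|(u_znzn&~col_mask);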

The union is there to comply with the strict aliasing rule.

Once the correct pointer is selected, the final calculation is simple:

    r0_cofactor.i ^= mask.p[0];
    r1_cofactor.i ^= mask.p[1];
    r2_cofactor.i ^= mask.p[2];
    r3_cofactor.i ^= mask.p[3];
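
For readers who want to try the idea on a machine where pointers are not 32 bits wide, here is a variation that selects the mask values instead of the table address. It is a sketch with a hypothetical function name (apply_cofactor_signs), not the code from inverse_v2.h; it uses unsigned negation to build col_mask and memcpy to reach the float bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const uint32_t znzn[4] = { 0x00000000u, 0x80000000u, 0x00000000u, 0x80000000u };
static const uint32_t nznz[4] = { 0x80000000u, 0x00000000u, 0x80000000u, 0x00000000u };

/* Flip the sign bit of each cofactor of one column, without an 'if'. */
static void apply_cofactor_signs( float cofactors[4], unsigned int col )
{
  /* All ones when col is odd, zero when col is even. */
  const uint32_t col_mask = 0u - (uint32_t)( col & 1u );
  int            row;

  assert( col < 4u );

  /* Fixed-length loop, meant to be unrolled like the rest of the code. */
  for ( row = 0; row < 4; row++ )
  {
    const uint32_t sign = ( nznz[row] & col_mask ) | ( znzn[row] & ~col_mask );
    uint32_t       bits;

    memcpy( &bits, &cofactors[row], sizeof bits );
    bits ^= sign;
    memcpy( &cofactors[row], &bits, sizeof bits );
  }
}
```

For an even column the second and fourth values change sign; for an odd column the first and third do.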

The next step is the conversion of this scalar version for the PPU using the Altivec instruction set.

Altivec version

A new definition for the matrix type is required:

  typedef struct s_matrix
  {
      vector float cols[4];
  } s_matrix;

In version 2, the rows were grouped together, which revealed the idea of rotating their indices to do the calculation. That idea is used in the SIMD code. Each column needs several different rotations, and those rotations are computed first. The variable names are defined as follows: cXuY is column X rotated up by Y floats:

    const vector float c0u1 = vec_sld(c0, c0,  4);
    const vector float c0u2 = vec_sld(c0, c0,  8);
    const vector float c0u3 = vec_sld(c0, c0, 12);
    ....
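
The effect of these rotations can be modelled in scalar C. This is an emulation for illustration only: the vec4 type and the rotate_up function are hypothetical stand-ins for a vector float and for vec_sld(v, v, 4 * n).

```c
#include <assert.h>

typedef struct vec4 { float f[4]; } vec4;

/* Scalar stand-in for vec_sld( v, v, 4 * n ): element i of the result is
   element (i + n) mod 4 of the input, so the values move "up" by n floats. */
static vec4 rotate_up( vec4 v, unsigned int n )
{
  vec4         r;
  unsigned int i;

  assert( n < 4u );

  for ( i = 0; i < 4u; i++ )
  {
    r.f[i] = v.f[( i + n ) & 3u];
  }
  return r;
}
```

With c0 = (X, Y, Z, W), rotate_up( c0, 1 ) plays the role of c0u1 and holds (Y, Z, W, X).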

The order of the calculations really matters to minimize the number of operations.

The calculation of each cofactor is based on the determinant of the 3x3 matrix created by removing the cofactor's column and row from the source matrix. That is why, for the first column, the multiplication is done in reverse order (i.e. the third and fourth columns are multiplied together before the second one is brought in). This way, the partial result can be reused to compute the cofactors of the second column.
With the third column, the multiplication is done in the initial order (multiplying the first and second columns together first) to share the results with the fourth column.
Note: in the source code, the fourth column is computed before the third one, for convenience only (it avoids mistakes when working with columns 0, 1, and 2 instead of 0, 1, and 3). The final result is identical.

The same masking operation for the sign bit is done using SIMD instructions.

In order to calculate the adjoint matrix, the transpose code has also been converted. The unrolled version wasn't really helpful here. Knowledge of the Altivec instructions was required, especially the ones that rearrange data: vec_mergeh and vec_mergel.

The determinant function now returns a vector float in which each element is nearly equal (nearly, because of floating-point precision) to the determinant of the matrix.
The algorithm is the same as before: a row or a column is multiplied by the corresponding values in the cofactor matrix, and all the products are added together.
The multiplication is a single SIMD instruction; unfortunately, no instruction exists to sum all the values of a float vector. The solution is to rotate the result vector twice, adding it to itself each time, as follows:

   (  A   B   C   D  )
 + (  C   D   A   B  )
 =====================
   ( A+C B+D C+A B+D )
 + ( B+D C+A B+D A+C )
 =====================
 = the vector with the determinant stored in each element.

It's important to know that the values are not necessarily the same along the vector: because of the order of the calculation and the limited precision of floating point, they can differ slightly. A vec_splat can be applied to this vector to force them to be identical.
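
The rotate-and-add scheme can be emulated in scalar C to see why every element ends up holding the full sum. This is a sketch with hypothetical names (vec4, rotate_up, add4, horizontal_sum), not Altivec code.

```c
#include <assert.h>

typedef struct vec4 { float f[4]; } vec4;

/* Scalar stand-in for vec_sld( v, v, 4 * n ). */
static vec4 rotate_up( vec4 v, unsigned int n )
{
  vec4         r;
  unsigned int i;

  assert( n < 4u );
  for ( i = 0; i < 4u; i++ )
  {
    r.f[i] = v.f[( i + n ) & 3u];
  }
  return r;
}

/* Scalar stand-in for vec_add. */
static vec4 add4( vec4 a, vec4 b )
{
  vec4 r;
  int  i;

  for ( i = 0; i < 4; i++ )
  {
    r.f[i] = a.f[i] + b.f[i];
  }
  return r;
}

/* Two rotations and two additions leave A+B+C+D in every element. */
static vec4 horizontal_sum( vec4 v )
{
  const vec4 pairs = add4( v, rotate_up( v, 2 ) );   /* ( A+C B+D C+A D+B ) */
  return add4( pairs, rotate_up( pairs, 1 ) );
}
```

Each element is built by a different sequence of additions, which is exactly why the four results may differ in their last bits on real data.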

The final multiplication (by one over the determinant) is easy to perform, since the function already has a vector filled with the determinant.

The code of this first PPU version of the inverse matrix can be found here: inverse_v3.h

Optimization Altivec

Once the code is working on PPU, the next step is the optimization.

The Altivec instruction set doesn't include an instruction to load an arbitrary constant into a register: every constant has to be either constructed from simpler values or loaded from memory. The following lines:

    const vector unsigned int u_znzn = { 0x00000000, 0x80000000, 0x00000000, 0x80000000 };
    const vector unsigned int u_nznz = { 0x80000000, 0x00000000, 0x80000000, 0x00000000 };

will generate the following code (using ppu-gcc from mambo):

ld 4, .LC18@toc(2)
ld 11, .LC16@toc(2)

If one operation is known for being slow, it is memory access. To avoid loading constant values, they are built on the fly as follows:

    const vector unsigned int u_zero     = (vector unsigned int)zero;
    const vector unsigned int u_two      = vec_splat_u32(2);
    const vector unsigned int u_fifteen  = vec_splat_u32(15);
    const vector unsigned int u_2shift15 = vec_sl(u_two, u_fifteen);
    const vector unsigned int u_signmask = vec_sl(u_2shift15, u_fifteen);
    const vector unsigned int u_nznz     = vec_mergeh(u_signmask, u_zero);
    const vector unsigned int u_znzn     = vec_mergeh(u_zero, u_signmask);

After its initialization, u_signmask contains the vector { 0x80000000, 0x80000000, 0x80000000, 0x80000000 }. vec_mergeh then interleaves it with zeros to produce the final constants.
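
The arithmetic behind this construction can be checked in scalar C. The sketch below only models the values involved; build_sign_mask and merge_high are hypothetical helpers standing in for the vec_sl sequence and for vec_mergeh.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the two vec_sl steps on one element: (2 << 15) << 15 = 0x80000000. */
static uint32_t build_sign_mask( void )
{
  const uint32_t two       = 2u;
  const uint32_t fifteen   = 15u;
  const uint32_t two_sh_15 = two << fifteen;     /* 0x00010000 */
  return two_sh_15 << fifteen;                   /* 0x80000000 */
}

/* Scalar stand-in for vec_mergeh: interleaves the first two elements of
   a and b as ( a0, b0, a1, b1 ). */
static void merge_high( uint32_t out[4], const uint32_t a[4], const uint32_t b[4] )
{
  assert( out != a && out != b );

  out[0] = a[0];
  out[1] = b[0];
  out[2] = a[1];
  out[3] = b[1];
}
```

Merging the sign mask with zero gives the nznz pattern, and swapping the two arguments gives znzn.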

Another part of the code can also be improved by using a shift instruction instead of a multiplication:

    const vector float m_c2u1_c3u2 = vec_madd(c2u1, c3u2, zero);

can be replaced by:

    const vector float m_c2u1_c3u2 = vec_sld(m_c2u2_c3u3, m_c2u2_c3u3, 12);

On the PPU, vec_madd and vec_sld don't use the same pipeline, so they may now be executed in parallel.

[cf: Appendix A.3.2 in the released pdf by IBM: Cell Broadband Engine Programming Handbook [ibm.com]].
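
The reason the replacement is valid is that rotating a product of rotated columns is the same as multiplying columns that were each rotated a little further. The scalar emulation below checks that identity; vec4, rotate_up, mul4 and rotated_product_matches are hypothetical names used only for this sketch.

```c
#include <assert.h>

typedef struct vec4 { float f[4]; } vec4;

/* Scalar stand-in for vec_sld( v, v, 4 * n ). */
static vec4 rotate_up( vec4 v, unsigned int n )
{
  vec4         r;
  unsigned int i;

  assert( n < 4u );
  for ( i = 0; i < 4u; i++ )
  {
    r.f[i] = v.f[( i + n ) & 3u];
  }
  return r;
}

/* Scalar stand-in for vec_madd( a, b, zero ). */
static vec4 mul4( vec4 a, vec4 b )
{
  vec4 r;
  int  i;

  for ( i = 0; i < 4; i++ )
  {
    r.f[i] = a.f[i] * b.f[i];
  }
  return r;
}

/* Returns 1 when rotating c2u2*c3u3 by three floats (12 bytes) gives
   exactly c2u1*c3u2, element by element. */
static int rotated_product_matches( vec4 c2, vec4 c3 )
{
  const vec4 m_c2u2_c3u3 = mul4( rotate_up( c2, 2 ), rotate_up( c3, 3 ) );
  const vec4 by_multiply = mul4( rotate_up( c2, 1 ), rotate_up( c3, 2 ) );
  const vec4 by_rotation = rotate_up( m_c2u2_c3u3, 3 );
  int        i;

  for ( i = 0; i < 4; i++ )
  {
    if ( by_multiply.f[i] != by_rotation.f[i] ) return 0;
  }
  return 1;
}
```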

The optimized version of the previous PPU code can be found here: inverse_v4.h

The final version of the inverse matrix for PPU where the whole code has been placed in a single function can be downloaded here: inverse_v5.h

SPU version

The SPU is a very powerful calculator with a rich set of intrinsics, and even when an Altivec function has no direct equivalent, it can be replaced by a combination of them.

Some Altivec instructions have a direct equivalent as SPU intrinsics:

vec_madd is replaced by either spu_madd or simply by spu_mul when the third parameter is zero.
vec_xor, vec_re, vec_sub respectively become spu_xor, spu_re, spu_sub

Some others require a workaround:

vec_sld
vec_mergeh
vec_mergel

For the PPU, there was a clear need to build the constant values. By contrast, the loading time on the SPU is almost nothing; the SPU even has instructions to extract values, insert values and create constants on the fly without going through memory.

[Table B-1 in the Appendix B.1.2 in the released pdf by IBM: Cell Broadband Engine Programming Handbook [ibm.com]]

There is no need to worry too much about creating constant values. Constants can replace the PPU calculation for nznz and znzn. The spu_shuffle instruction, with associated constant patterns, is used to replace vec_mergeh, vec_mergel and vec_sld. Five different shuffle patterns have to be created. To explain them, the four floats of the first vector are designated by the letters X, Y, Z and W, and the four floats of the second vector by A, B, C and D.

To replace the sld function, three patterns are required:

YZWX to replace: sld(v, v, 4)
ZWXY to replace: sld(v, v, 8)
WXYZ to replace: sld(v, v, 12)

To replace vec_mergeh, and vec_mergel, only two other patterns are defined: XAYB, ZCWD
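
The five patterns can be tried out with a scalar model of the shuffle. This sketch only covers the plain byte-select case, where control bytes 0..15 pick a byte from the first vector and 16..31 pick one from the second; the special control codes that produce constant bytes are not modelled. The names qword_bytes, shuffle_bytes and word_pattern are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct qword_bytes { uint8_t b[16]; } qword_bytes;

/* Scalar model of a byte shuffle: each control byte selects one byte out of
   the 32 bytes formed by a followed by b. */
static qword_bytes shuffle_bytes( qword_bytes a, qword_bytes b, qword_bytes pattern )
{
  qword_bytes r;
  int         i;

  for ( i = 0; i < 16; i++ )
  {
    const uint8_t sel = pattern.b[i];
    assert( sel < 32u );
    r.b[i] = ( sel < 16u ) ? a.b[sel] : b.b[sel - 16u];
  }
  return r;
}

/* Expands four word selectors into a 16-byte control pattern.
   Selectors 0..3 mean X, Y, Z, W and 4..7 mean A, B, C, D. */
static qword_bytes word_pattern( int w0, int w1, int w2, int w3 )
{
  const int   words[4] = { w0, w1, w2, w3 };
  qword_bytes p;
  int         i, j;

  for ( i = 0; i < 4; i++ )
  {
    assert( words[i] >= 0 && words[i] < 8 );
    for ( j = 0; j < 4; j++ )
    {
      p.b[i * 4 + j] = (uint8_t)( words[i] * 4 + j );
    }
  }
  return p;
}
```

In this model YZWX is word_pattern( 1, 2, 3, 0 ), ZWXY is word_pattern( 2, 3, 0, 1 ), WXYZ is word_pattern( 3, 0, 1, 2 ), XAYB is word_pattern( 0, 4, 1, 5 ) and ZCWD is word_pattern( 2, 6, 3, 7 ).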

The SPU version of the inverse matrix can be found here: inverse_v6.h

Summary

Avoid algorithms that deal with special cases (like Gauss-Jordan elimination).
Start with a simple scalar implementation.
Unroll the loops and group the code that can be executed in parallel and follows the same patterns (like in the cofactor function).
Get used to the data manipulation instructions (vec_mergeh, vec_sld, spu_shuffle...).
Look at the generated assembly code.
Prefer to build the PPU data on the fly instead of loading them from memory.

About The Author

Cédric Lallain is a Senior Programmer working on PS3/Cell research at Highmoon Studios (Vivendi Games).
In past years, Cédric mostly worked on AI. He also optimized some PS2 code (both high-level and low-level optimization) to help his game reach a correct frame rate.
The last game he worked on is Darkwatch for Highmoon Studios.
Previously he was lead AI programmer on Street Racing Syndicate at Eutechnyx.

Understanding Strict Aliasing

2006-06-02T06:53:12Z

UPDATED! (08 Aug 06) More Clarifications! Special thanks to Nicolas Riesch, André de Leiradella and pinskia for their comments and suggestions.

UPDATED! (28 Dec 06) Minor fixes. Special thanks to Kobi Cohen-Arazi and Chris Pickett.

Aliasing

One pointer is said to alias another when both refer to the same memory location. For example:

  uint32_t
  swap_words( uint32_t arg )
  {
    uint16_t* const sp = (uint16_t*)&arg;
    uint16_t        hi = sp[0];
    uint16_t        lo = sp[1];

    sp[1] = hi;
    sp[0] = lo;

    return (arg);
  }

Using GCC 3.4.1 and above, the above code will generate warning: dereferencing type-punned pointer will break strict-aliasing rules on the line that initializes sp.

Here sp is an alias of arg: both refer to the same address in memory, but through different types, which is illegal under the strict aliasing rule.

Dereferencing a cast of a variable from one type of pointer to a different type is usually in violation of the strict aliasing rule.

All of the examples in this article have been tested with various versions of GCC. Although you can expect most of the examples to generate similar results across the major compilers, programmers' expectations should always be validated for the compilers and compiler revisions required.

What is strict aliasing?

Strict aliasing is an assumption, made by the C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location (i.e. alias each other).

Here are some basic examples of assumptions that may be made by the compiler when strict aliasing is enabled:

Pointers to different built in types do not alias:

  int16_t* foo;
  int32_t* bar;

The compiler will assume that *foo and *bar never refer to the same location.

Pointers to aggregate or union types with differing tags do not alias:

  typedef struct
  {
    uint16_t a;
    uint16_t b;
    uint16_t c;
  } Foo;

  typedef struct
  {
    uint16_t a;
    uint16_t b;
    uint16_t c;
  } Bar;

  Foo* foo;
  Bar* bar;

The compiler will assume that *foo and *bar never refer to the same location, even though the contents of the structures are the same.

Pointers to aggregate or union types which differ only by name may alias:

  typedef struct
  {
    uint16_t a;
    uint16_t b;
    uint16_t c;
  } Foo;

  typedef Foo Bar;

  Foo* foo;
  Bar* bar;

The compiler will assume that *foo and *bar may refer to the same location, and will not perform the optimizations described below.

Benefits to The Strict Aliasing Rule

When the compiler cannot assume that two objects are not aliased, it must act very conservatively when accessing memory. For example:

  typedef struct
  {
    uint16_t a;
    uint16_t b;
    uint16_t c;
  } Sample;

  void
  test( uint32_t* values,
        Sample*   uniform,
        uint64_t  count )
  {
    uint64_t i;

    for (i=0;i<count;i++)
    {
      values[i] += (uint32_t)uniform->b;
    }
  }

Compiled with -fno-strict-aliasing -O3 -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

  test:
    li     10, 0      # i      = 0
    cmpld  7,  10, 5  # done   = (i==count)
    bgelr- 7          # if (done) return
    mtctr  5          # ctr    = count
  .L8:
    sldi   11, 10, 2  # offset = i * 4
    lwz    9,  4(4)   # b      = *(uniform+4)
    addi   10, 10, 1  # i++
    lwzx   5,  11, 3  # value  = *(values+offset)
    add    0,  5,  9  # value  = value + b
    stwx   0,  11, 3  # *(values+offset) = value
    bdnz  .L8         # if (ctr--) goto .L8
    blr               # return

In this case uniform->b must be loaded during each iteration of the loop. This is because the compiler cannot be certain that values does not overlap b in memory. If, in fact, they do overlap, the programmer would expect that uniform->b would be properly updated and the values stored into the values array adjusted accordingly. The only method for the compiler to guarantee these results is reloading uniform->b at every iteration.

It was noted that this case is extremely uncommon in most code, and the decision was made to presume that objects of different types are not aliased and to be more aggressive with optimizations. The fact that this presumption would break some existing code was certainly discussed in detail. It must have been decided that those most likely to use memory aliasing techniques for optimization are few, and that those who do use them are the most willing and capable of making the necessary changes.

The result, even for this small case, can make a significant performance impact. Compiled with -fstrict-aliasing -Wstrict-aliasing=2 -O3 -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

  test:
    li     11,0     # i      = 0
    cmpld  7,11,5   # done   = (i == count)
    bgelr- 7        # if (done) return
    lhz    4,2(4)   # b      = uniform.b
    mtctr  5        # ctr    = count
  .L8:
    sldi   9,11,2   # offset = i * 4
    addi   11,11,1  # i++
    lwzx   5,9,3    # value  = *(values+offset)
    add    0,5,4    # value  = value + b
    stwx   0,9,3    # *(values+offset) = value
    bdnz   .L8      # if (ctr--) goto .L8
    blr             # return

The load of b is now only done once, outside the loop. For more examples of optimizations for non-aliasing memory see: Demystifying The Restrict Keyword
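
The same single load can also be requested by hand, whatever the compiler's aliasing assumptions are, by copying uniform->b into a local before the loop. This is a sketch with a hypothetical function name (test_hoisted); note that it changes the behaviour in the unusual case where values really does overlap uniform, because the stores no longer affect the value being added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
  uint16_t a;
  uint16_t b;
  uint16_t c;
} Sample;

/* Same loop as test(), with the load of uniform->b hoisted manually. */
static void test_hoisted( uint32_t*     values,
                          const Sample* uniform,
                          uint64_t      count )
{
  const uint32_t b = (uint32_t)uniform->b;   /* loaded once, before the loop */
  uint64_t       i;

  assert( values != NULL || count == 0 );

  for ( i = 0; i < count; i++ )
  {
    values[i] += b;
  }
}
```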

Casting Compatible Types

Aliases are permitted for types that only differ by qualifier or sign.

  uint32_t
  test( uint32_t a )
  {
    uint32_t* const       a0 = &a;
    uint32_t* volatile    a1 = &a;
    int32_t*              a2 = (int32_t*)&a;
    int32_t* const        a3 = (int32_t*)&a;
    int32_t* volatile     a4 = (int32_t*)&a;
    const int32_t* const  a5 = (int32_t*)&a;

    (*a0)++;
    (*a1)++;
    (*a2)++;
    (*a3)++;
    (*a4)++;

    return (*a5);
  }

In this case a0-a5 are all valid aliases of a and this function will return (a + 5).

GCC has two flags to enable warnings related to strict aliasing. -Wstrict-aliasing enables warnings for most common errors related to type-punning. -Wstrict-aliasing=2 attempts to warn about a larger class of cases, however false positives may be returned.

Casting through a union (1)

The most commonly accepted method of converting one type of object to another is by using a union type as in this example:

  0typedef union
  1{
  2  uint32_t u32;
  3  uint16_t u16[2];
  4}
  5U32;
  6
  7uint32_t
  8swap_words( uint32_t arg )
  9{
 10  U32      in;
 11  uint16_t lo;
 12  uint16_t hi;
 13
 14  in.u32    = arg;
 15  hi        = in.u16[0];
 16  lo        = in.u16[1];
 17  in.u16[0] = lo;
 18  in.u16[1] = hi;
 19
 20  return (in.u32);
 21}

This method is not properly called casting at all (although it may be called type-punning), as the value is simply copied into a union which permits aliasing among its members. From a performance point of view, this method relies on the ability of the optimizer to remove the redundant stores and loads. When using recent versions of GCC, if the transformation is reasonably simple, it is very likely that the compiler will be able to remove the redundancies and produce an optimal code sequence.

Strictly speaking, reading a member of a union different from the one written to is undefined in ANSI/ISO C99 except in the special case of type-punning to a char*, similar to the example below: Casting to char*. However, it is an extremely common idiom and is well-supported by all major compilers. As a practical matter, reading and writing to any member of a union, in any order, is acceptable practice.

For example, when compiled with GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8), the argument is simply rotated 16 bits.

swap_words:
  rlwinm r3,r3,16,0xffffffff
  blr

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU, the loads and stores are removed but the instruction sequence is less than optimal.

swap_words:
  slwi    4,3,16     ; hi    = arg << 16
  rldicl  3,3,48,48  ; lo    = arg >> 16
  or      0,4,3      ; out   = hi | lo;
  rldicl  3,0,0,32   ; final = out & 0xffffffff
  blr

In order to generate reasonably good code across both the GCC3 and GCC4 families, use C99 style initializers:

uint32_t
swap_words( uint32_t arg )
{
  U32    in  = { .u32=arg };
  U32    out = { .u16[0]=in.u16[1],
                 .u16[1]=in.u16[0] };

  return (out.u32);
}

Compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 32 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

swap_words:
  stwu 1,-16(1)              ; Push stack
  rlwinm 3,3,16,0xffffffff   ; Rotate 16 bits
  addi 1,1,16                ; Pop stack
  blr

It is a peculiarity of the 32 bit build of GCC 3.4.1 for the Cell PPU that the stack is always pushed and popped regardless of whether or not it is used.

This method is most valuable for use with primitive types which can be returned by value. This is because it relies on doing a complete copy of the object (by value) and removing the redundancies. With more complex aggregate or union types copying may be done on the stack or through the memcpy function and redundancies are harder to eliminate.
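
The same copy-by-value pattern applies to other primitive types. As a minimal sketch (F32 and float_bits are hypothetical names, not from the article, and the code assumes float is a 32 bit IEEE 754 single, which the C standard does not guarantee), the bit pattern of a float can be read back through a union using the C99 initializer style shown above:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical union for this sketch.
   Assumes sizeof(float) == sizeof(uint32_t). */
typedef union
{
  float    f32;
  uint32_t u32;
}
F32;

/* Copy the float into the union by value, then read the same
   bytes back as an unsigned integer. */
static inline uint32_t
float_bits( float arg )
{
  F32 in = { .f32 = arg };

  return (in.u32);
}

/* Usage check; the constant assumes IEEE 754 single precision. */
static void
float_bits_example( void )
{
  assert( float_bits( 1.0f ) == 0x3f800000u );
}
```

As with swap_words, the value of this form depends on the optimizer removing the copy, so the generated code is worth checking on each target compiler.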

Casting through a union (2)

Casting proper may be done between a pointer to a type and a pointer to an aggregate or union type which contains a member of a compatible type, as in the following example:

uint32_t
swap_words( uint32_t arg )
{
  U32*     in = (U32*)&arg;
  uint16_t lo = in->u16[0];
  uint16_t hi = in->u16[1];

  in->u16[0] = hi;
  in->u16[1] = lo;

  return (in->u32);
}

in is a pointer to a U32 type, which contains the member u32 which is of type uint32_t which is compatible with arg, which is also of type uint32_t.

The above source when compiled with GCC 4.0 with the -Wstrict-aliasing=2 flag enabled will generate a warning. This warning is an example of a false positive. This type of cast is allowed and will generate the appropriate code (see below). It is documented clearly that -Wstrict-aliasing=2 may return false positives.

Compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8),

swap_words:
  stw r3,24(r1)  ; Store arg
  lhz r0,24(r1)  ; Load hi
  lhz r2,26(r1)  ; Load lo
  sth r0,26(r1)  ; Store result[1] = hi
  sth r2,24(r1)  ; Store result[0] = lo
  lwz r3,24(r1)  ; Load result
  blr            ; Return

GCC is extremely poor at combining loads and stores done through a pointer to a union type as can be seen from the generated code above. The output is a very naive interpretation of the source and would perform badly compared to the previous examples on most architectures.

However, once this fact is accounted for, this method can be very useful. Rather than copying the argument by value, which is problematic on large or complex structures, a pointer can be passed in and the value modified directly. If the loads and stores can be combined in the source the results will usually be excellent.

"But when the address of a variable is taken, doesn't the compiler force it to be stored in memory rather than in a register?"

Yes, both a store and a load may then be generated as part of the trace. However, when alias analysis is done it can be determined that the object cannot be changed by another mechanism, so the load and store may be marked as redundant and removed.

Do not rely on the compiler to combine loads and stores. The programmer is always better equipped to make those decisions based on alignment concerns and complex instruction penalty rules.

void
swap_words( uint16_t* arg )
{
  U32*     combined = (U32*)arg;
  uint32_t start    = combined->u32;
  uint32_t lo       = start >> 16;
  uint32_t hi       = start << 16;
  uint32_t final    = lo | hi;

  combined->u32 = final;
}

Compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8),

swap_words:
  lwz r0,0(r3)                ; Load arg
  rlwinm r0,r0,16,0xffffffff  ; Rotate 16 bits
  stw r0,0(r3)                ; Store arg
  blr                         ; Return

If the above source is called as a non-inline function, there will be a significant penalty on most architectures waiting for the load before the rotate and the store on return.
If the above source is called as an inline function, it can be safely assumed the load and store will be removed by the compiler as redundant.

In C99, a static inline function, which may be included in a header file, differs from automatic inlining in that the function may be defined multiple times (e.g. included by multiple source files). Each definition of a static inline function must be identical.

static inline void
swap_words( uint16_t* arg )
{
  U32*     combined = (U32*)arg;
  uint32_t start    = combined->u32;
  uint32_t lo       = start >> 16;
  uint32_t hi       = start << 16;
  uint32_t final    = lo | hi;

  combined->u32 = final;
}

With some care, this method is the most appropriate for modifying large or complex structures by multiple types.
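
As a usage sketch of the in-place approach: when the data is declared with the union type to begin with, no pointer cast is needed at the call site at all. The names swap_words_in_place and swap_all_words are hypothetical, and the U32 typedef is repeated only so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef union
{
  uint32_t u32;
  uint16_t u16[2];
}
U32;

/* One combined 32 bit load, one rotate, one combined 32 bit store. */
static inline void
swap_words_in_place( U32* value )
{
  const uint32_t start = value->u32;

  value->u32 = (start >> 16) | (start << 16);
}

/* Hypothetical helper: swap the halves of every element of an array. */
static inline void
swap_all_words( U32* values, size_t count )
{
  for (size_t i=0;i<count;i++)
  {
    swap_words_in_place( &values[i] );
  }
}

static void
swap_all_words_example( void )
{
  U32 values[2] = { { .u32 = 0x11112222u }, { .u32 = 0xaaaabbbbu } };

  swap_all_words( values, 2 );

  assert( values[0].u32 == 0x22221111u );
  assert( values[1].u32 == 0xbbbbaaaau );
}
```

Because the loads and stores are already combined in the source, the compiler only has to inline the call; it is not being asked to merge halfword accesses on its own.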

Casting through a union (3)

Occasionally a programmer may encounter the following INVALID method for creating an alias with a pointer of a different type:

typedef union
{
  uint16_t* sp;
  uint32_t* wp;
} U32P;

uint32_t
swap_words( uint32_t arg )
{
  U32P             in = { .wp = &arg };
  const uint16_t   hi = in.sp[0];
  const uint16_t   lo = in.sp[1];

  in.sp[0] = lo;
  in.sp[1] = hi;

  return ( arg ); /* <-- RESULT IS UNDEFINED */
}

The problem with this method is although U32P does in fact say that sp is an alias for wp, it does not say anything about the relationship between the values pointed to by sp and wp. This differs in a critical way from "Casting Through a Union (1)" and "Casting Through a Union (2)" which both define aliases for the values being pointed to, not the pointers themselves.

The presumption of strict aliasing remains true: Two pointers of different types are assumed, except in a few very limited conditions specified in the C99 standard, not to alias. This is not one of those exceptions.

The above source, when compiled with GCC 3.4.1 or GCC 4.0 with the -Wstrict-aliasing=2 flag enabled, will NOT generate a warning. This should serve as a reminder to always check the generated code. Warnings are often helpful hints, but they are by no means exhaustive and do not always detect when a programmer makes an error. Like any piece of software, a compiler has limits. Knowing them can only be helpful.

For example, when compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8),

  0swap_words:      ; RETURNS ARG UNCHANGED
  1  lhz r0,24(r1)  ; Load lo from stack (What value?!)
  2  lhz r2,26(r1)  ; Load hi from stack (What value?!)
  3  stw r3,24(r1)  ; Store arg to stack
  4  sth r0,26(r1)  ; Store hi to stack
  5  sth r2,24(r1)  ; Store lo to stack
  6  blr            ; Return

In this case notice that because hi, lo and arg are assumed not to alias, the resulting order of instruction has no value:

[Line 1]: lo is loaded from the stack before anything is stored to the stack
[Line 2]: hi is loaded from the stack before anything is stored to the stack
[Line 3]: arg is stored to the stack, but this value will not be read.
[Line 4]: hi is stored to the stack, but this value will not be read.
[Line 5]: lo is stored to the stack, but this value will not be read.

Or when compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

swap_words:     # RETURNS ARG UNCHANGED
  stw 3,48(1)   # Store arg to stack
  lhz 9,48(1)   # Load hi
  lhz 0,50(1)   # Load lo
  lwz 3,48(1)   # Load arg
  sth 0,48(1)   # Store hi to stack
  sth 9,50(1)   # Store lo to stack
  blr           # Return

Or when compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 32 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

swap_words:     # RETURNS ARG UNCHANGED
  stwu 1,-16(1) # Push stack
  addi 1,1,16   # Pop stack
  blr           # Return

Casting to char*

It is always presumed that a char* may refer to an alias of any object. It is therefore quite safe, if perhaps a bit suboptimal (for architectures with wide loads and stores), to cast any pointer of any type to a char* type.

uint32_t
swap_words( uint32_t arg )
{
  char* const cp = (char*)&arg;
  const char  c0 = cp[0];
  const char  c1 = cp[1];
  const char  c2 = cp[2];
  const char  c3 = cp[3];

  cp[0] = c2;
  cp[1] = c3;
  cp[2] = c0;
  cp[3] = c1;

  return (arg);
}

The converse is not true. Casting a char* to a pointer of any type other than a char* and dereferencing it is usually in violation of the strict aliasing rule.

In other words, casting from a pointer of one type to a pointer of an unrelated type through a char* is undefined.

uint32_t
test( uint32_t arg )
{
  char*     const cp = (char*)&arg;
  uint16_t* const sp = (uint16_t*)cp;

  sp[0] = 0x0001;
  sp[1] = 0x0002;

  return (arg);
}

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test:
  stw 3, 48(1)   # arg stored to stack
  li  0, 1       # hi = 0x0001
  li  9, 2       # lo = 0x0002
  lwz 3, 48(1)   # result = loaded from stack
  sth 0, 48(1)   # store hi to stack
  sth 9, 50(1)   # store lo to stack
  blr            # return (result) <-- RETURNS ARG UNCHANGED

As noted by Pinskla, it is not dereferencing a char* per se that is specifically recognized as a potential alias of any object, but any address referring to a char object. This includes an array of char objects, as in the following example, which will also break the strict aliasing assumption.

  char      const cp[4] = { arg0, arg1, arg2, arg3 };
  uint16_t* const sp    = (uint16_t*)cp;

  sp[0] = 0x0001;
  sp[1] = 0x0002;
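
By contrast, going in the permitted direction stays within the rule: the destination object is written only through character-type lvalues. The following is a minimal sketch (set_halves is a hypothetical name, not from the article) that stores two 16 bit values into a uint32_t byte by byte. It assumes 8 bit bytes and uses unsigned char to avoid sign surprises.

```c
#include <assert.h>
#include <stdint.h>

/* Overwrite the first two bytes of arg with the bytes of first and
   the last two bytes with the bytes of second, using only
   character-type accesses. */
static inline uint32_t
set_halves( uint32_t arg, uint16_t first, uint16_t second )
{
  unsigned char* const       dst = (unsigned char*)&arg;
  const unsigned char* const f   = (const unsigned char*)&first;
  const unsigned char* const s   = (const unsigned char*)&second;

  dst[0] = f[0];
  dst[1] = f[1];
  dst[2] = s[0];
  dst[3] = s[1];

  return (arg);
}

static void
set_halves_example( void )
{
  /* Build the expected value through a union so the check does not
     depend on byte order. */
  const union { uint32_t u32; uint16_t u16[2]; } expected =
    { .u16 = { 0x0001, 0x0002 } };

  assert( set_halves( 0, 0x0001, 0x0002 ) == expected.u32 );
}
```

Whether the compiler merges the four byte stores into wider ones is a separate question, and as elsewhere the generated code should be inspected.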

GCC RULE BREAKING

GCC allows type-punned values to be dereferenced at independent locations in memory (i.e. different objects) when the source of the lvalue is not directly known.

void
set_value( uint64_t* c,
           uint32_t  a_val,
           uint16_t  b_val )
{
  uint32_t* a = (uint32_t*)c;
  uint16_t* b = (uint16_t*)c;

  a[0] = a_val; // <--- Address of c + 0
  b[2] = b_val; // <--- Address of c + 4
  b[3] = b_val; // <--- Address of c + 6
}

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

set_value:
  stw 4,0(3)   # (c+0) = a_val
  sth 5,6(3)   # (c+6) = b_val
  sth 5,4(3)   # (c+4) = b_val
  blr          # return

Note any use of c[0] here would be (more?) undefined because it would alias the uses of a and b.

void
set_value( uint64_t* c,
           uint32_t  a_val,
           uint16_t  b_val )
{
  uint32_t* a = (uint32_t*)c;
  uint16_t* b = (uint16_t*)c;

  a[0] = a_val; // <--- Address of c + 0
  b[2] = b_val; // <--- Address of c + 4
  b[3] = b_val; // <--- Address of c + 6

  // WHAT VALUE THIS WOULD PRINT IS UNDEFINED
  printf("c = 0x%016llx\n", (unsigned long long)c[0] );
}

However, when set_value is compiled inline (perhaps automatically), the source of c may be known and GCC will assume the values do not alias and may reduce the expression differently and generate completely different code.

static inline void
set_value( uint64_t* c,
           uint32_t  a_val,
           uint16_t  b_val )
{
  uint32_t* a = (uint32_t*)c;
  uint16_t* b = (uint16_t*)c;

  a[0] = a_val; // <--- Address of c + 0
  b[2] = b_val; // <--- Address of c + 4
  b[3] = b_val; // <--- Address of c + 6
}

int64_t
test( int64_t  a
     ,int64_t  b
     ,uint32_t hi32
     ,uint16_t lo16 )
{
  int64_t c = a + b;

  set_value( &c, hi32, lo16 );

  return (c);
}

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test:
  add 3,3,4    # c = (a+b)
  blr          # return (c)

In this case because the object c is never accessed through any valid aliases in set_value, the expression is reduced out.

The above example will NOT currently generate any warnings with -Wstrict-aliasing=2 and will simply generate different results depending on whether or not the expression is inlined. This is another good reason to always double check the generated code. Also, when writing unit tests, it is a good idea to test a function both as an inline function and an extern function.
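
One way to arrange such a test is sketched below. It assumes GCC (or another compiler that accepts GCC's noinline function attribute), and rotate_words and the wrapper names are hypothetical. The same body is exercised once where the optimizer can see through it and once behind a call it has been told not to inline.

```c
#include <assert.h>
#include <stdint.h>

/* Function under test: visible to the optimizer at every call site. */
static inline uint32_t
rotate_words( uint32_t arg )
{
  return ((arg >> 16) | (arg << 16));
}

/* Same body behind a real call. noinline is a GCC function attribute. */
__attribute__((noinline)) uint32_t
rotate_words_extern( uint32_t arg )
{
  return rotate_words( arg );
}

/* Returns 1 when the inlined and out-of-line paths both produce
   the expected result for this input. */
static int
rotate_words_paths_agree( uint32_t arg, uint32_t expected )
{
  return (rotate_words( arg ) == expected)
      && (rotate_words_extern( arg ) == expected);
}

static void
rotate_words_test( void )
{
  assert( rotate_words_paths_agree( 0x12345678u, 0x56781234u ) );
}
```

In a real project the out-of-line version would more likely live in a separate source file; the attribute is used here only to keep the sketch in one file.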

With GCC, strict aliasing warnings are more likely to be generated at the point where an address is taken (e.g. uint16_t* a = (uint16_t*)&b;) than with pre-existing pointers (e.g. uint16_t* a = (uint16_t*)b_ptr;). Take special care when type-punning pre-existing pointers.

Perhaps surprisingly, illegal aliasing within a loop generates completely different results. It is probably not completely accidental though, as most of the historical arguments against strict aliasing have revolved around optimized versions of functions like memset and memcpy which would cast the data to the widest available register size to minimize the trips to and from memory.

void
set_value( uint64_t* c,
           uint32_t  a_val,
           uint16_t  b_val,
           uint32_t  count )
{
  uint32_t* a  = (uint32_t*)c;
  uint16_t* b  = (uint16_t*)c;
  uint32_t  i  = 0;

  for (i=0;i<count;i++,a++,b+=2)
  {
    a[0]  = a_val;
    b[2]  = b_val;
    b[3]  = b_val;
  }
}

As expected from the previous example above, this should still generate the "expected" result:

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 32 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

set_value:
  cmpwi 0, 6, 0   # done = (count == 0)
  stwu  1, -16(1) # Push stack
  mr    9, 3      # Copy c
  beq-  0, .L7    # if (done) goto .L7
  mtctr 6         # i = count
.L8:
  stw   4, 0(9)   # a[0] = a_val
  addi  9, 9, 4   # a++
  sth   5, 4(3)   # b[2] = b_val
  sth   5, 6(3)   # b[3] = b_val
  addi  3, 3, 4   # b+=2
  bdnz  .L8       # if (i) goto .L8
.L7:
  addi  1, 1, 16  # Pop stack
  blr             # return

When called inline, the previous example would suggest that the compiler, assuming c is not aliased would also return (a + b):

int64_t
test_loop( int64_t  a,
           int64_t  b,
           uint32_t hi32,
           uint16_t lo16,
           uint32_t count )
{
  static int64_t c[ C_COUNT ];

  c[0] = a + b;

  set_value( c, hi32, lo16, count );

  return (c[0]);
}

When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 32 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test_loop:
  lis   12, c.0@ha      # cloc     = location of c
  mr.   0,  9           # i        = count
  la    11, c.0@l(12)   # c        = *cloc
  addc  10, 4, 6        # c1       = addlo (a,b)
  adde  9,  3, 5        # c2       = addhi (a,b)
  stwu  1, -16(1)       # Push stack
  stw   9,  0(11)       # c[0].hi  = c2
  mr    6,  11          # a        = c
  stw   10, 4(11)       # c[0].lo  = c1
  mr    9,  11          # b        = c
  beq-  0,  .L19        # if (i==0) goto .L19
  mtctr 0               # i        = count
.L20:
  stw   7,  0(9)        # a[0]     = hi32
  addi  9,  9, 4        # a++
  sth   8,  4(6)        # b[2]     = lo16
  sth   8,  6(6)        # b[3]     = lo16
  addi  6,  6, 4        # b+=2
  bdnz  .L20            # if (i) goto .L20
.L19:
  la    9,  c.0@l(12)   # c        = *cloc
  addi  1,  1, 16       # Pop stack
  lwz   3,  0(9)        # result.hi = c[0].hi
  lwz   4,  4(9)        # result.lo = c[0].lo
  blr                   # return (result)

The result is clearly different from the original version without the loop.

It is not the existence of the loop in the source that changes the transformation, but rather the existence of a loop after the initial optimization passes. For example, GCC is fairly good at optimizing (unrolling) loops with a fixed iteration count. Examine the following example:

int64_t
test_noloop( int64_t  a,
             int64_t  b,
             uint32_t hi32,
             uint16_t lo16 )
{
  int64_t c = a + b;

  set_value( &c, hi32, lo16, 1 );

  return (c);
}

It wouldn't be completely outrageous to expect the above example to generate similar, albeit unrolled, code. That is unless you know to expect simple loop transformations to be done fairly early in the compilation process and alias analysis to be done later. When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 32 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test_noloop:      # <--- RETURNS (A+B)
  stwu 1,-16(1)   # Push stack
  addc 4,4,6      # c.lo = addlo(a,b)
  adde 3,3,5      # c.hi = addhi(a,b)
  addi 1,1,16     # Pop stack
  blr             # return (c)

The existence of a loop around accessed aliases and whether or not the iteration count is known at compile time may impact the generated code. Tests should include both constant and extern'd iteration counts.
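
A sketch of that advice follows, with hypothetical names throughout: fill is the function under test and runtime_count stands in for a count defined in another translation unit. It is declared volatile here only so that a single-file example cannot be constant-folded; in a real test suite an extern variable defined in a separate source file would serve the same purpose.

```c
#include <assert.h>
#include <stdint.h>

#define FILL_MAX 4u

/* Stand-in for a count the compiler cannot know at compile time. */
static volatile uint32_t runtime_count = FILL_MAX;

/* Function under test (hypothetical). */
static inline void
fill( uint32_t* values, uint32_t value, uint32_t count )
{
  for (uint32_t i=0;i<count;i++)
  {
    values[i] = value;
  }
}

static void
fill_example( void )
{
  uint32_t fixed[ FILL_MAX ]   = { 0 };
  uint32_t dynamic[ FILL_MAX ] = { 0 };

  fill( fixed,   7, FILL_MAX );       /* Count known at compile time  */
  fill( dynamic, 7, runtime_count );  /* Count known only at run time */

  for (uint32_t i=0;i<FILL_MAX;i++)
  {
    assert( fixed[i] == 7 );
    assert( fixed[i] == dynamic[i] );
  }
}
```

The function under test here is well defined, so both paths agree. The point of running both is to catch the cases where they do not.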

What is surprising is that the 64 bit build of the same version of the same compiler generates different results. When compiled with -fstrict-aliasing -O3 -Wstrict-aliasing -std=c99 on the 64 bit build of GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux) for the Cell PPU.

test_loop:
  li     10, 0           # i = 0
  cmplw  7,  10, 7       # done = (i==count)
  add    4,  3, 4        # sum  = a + b
  ld     3,  .LC0@toc(2) # cloc = location of c
  std    4,  0(3)        # c[0] = sum
  mr     9,  3           # a    = c
  mr     11, 3           # b    = c
  bge-   7,  .L18        # if (done) goto .L18
.L22:
  addi   0,  10, 1       # i++
  stw    5,  0(11)       # a[0] = hi32
  rldicl 10, 0, 0, 32    # i    = i & 0xffffffff
  sth    6,  4(9)        # b[2] = lo16
  sth    6,  6(9)        # b[3] = lo16
  cmplw  7,  10, 7       # done = (i==count)
  addi   11, 11, 4       # a++
  addi   9,  9, 4        # b+= 2
  blt+   7,  .L22        # if (!done) goto .L22
.L18:
  ld     3,0(3)          # result = c[0]
  blr                    # return (result)

This indicates that there are significant non-obvious side-effects to building GCC as 32 bits versus 64 bits that someone might want to look into.

The platform, version number and build date (i.e. the output of gcc --version) are not sufficient information for compatibility testing. To be thorough, unit tests should be run across all builds of the same compiler version, if more than one is known to exist.

C99 Standard

This article has been pretty relaxed with the use of terminology and there is always room for some interpretation when reading a standard. There are many additional cases not covered above and compiler specific issues to consider. But for those interested in up-to-date definitive information on the C standard refer to ISO/IEC 9899:TC2 [open-std.org]. Here is the most relevant text from section "6.5 Expressions":

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

a type compatible with the effective type of the object,
a qualified version of a type compatible with the effective type of the object,
a type that is the signed or unsigned type corresponding to the effective type of the object,
a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
a character type.

Note the use of types like uint64_t and uint32_t in the above examples. For decades programmers have been creating their own integer types and reworking their header files for each platform simply to get consistent integer sizes across multiple architectures. This is because the standard does not guarantee types like int or short to be of any particular width; it only guarantees their sizes relative to each other. But finally, with C99, the debate is over. Standard width integers are now defined in stdint.h. Always use this header, and if your implementation does not have it (e.g. Microsoft), there are portable public domain versions available (e.g. This stdint.h can be used for Win32).
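
As a brief illustration of why fixed widths matter, the sketch below (record_header and the check typedef are hypothetical names) describes an 8 byte record using stdint.h types and verifies its size at compile time. It uses the negative-array-size idiom because C99 itself has no static assertion, and it assumes 8 bit bytes and no padding between naturally aligned members.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical on-disk record; every field has a known width. */
typedef struct
{
  uint16_t id;
  uint16_t flags;
  uint32_t offset;
}
record_header;

/* Compilation fails (negative array size) if the layout is not the
   8 bytes the rest of the code would rely on. */
typedef char check_record_header_size[ (sizeof(record_header) == 8) ? 1 : -1 ];

static void
record_header_example( void )
{
  const record_header header = { 1, 2, 3 };

  assert( sizeof(header.id)     * CHAR_BIT == 16 );
  assert( sizeof(header.offset) * CHAR_BIT == 32 );
  assert( header.offset == 3 );
}
```

Had the fields been declared as short and int, the same check could pass on one platform and fail on another.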

Summary

Strict aliasing means the compiler may assume that two objects of different types do not refer to the same location in memory. Enable this option in GCC with the -fstrict-aliasing flag. Be sure that all code can safely run with this rule enabled. Enable strict aliasing related warnings with -Wstrict-aliasing, but do not expect to be warned in all cases.
In order to discover aliasing problems as quickly as possible, -fstrict-aliasing should always be included in the compilation flags for GCC. Otherwise problems may only be visible at the highest optimization levels where it is the most difficult to debug.

Be wary of code that requires the use of -fno-strict-aliasing (which turns off strict aliasing at any optimization level) in order to work. This is a very good indication that the code relies on aliased memory access and is likely to be dominated by poor memory access patterns. At the very least, only the minimum number of files should have it disabled, and only because time has not yet permitted their repair. Although it may seem complex to properly alias memory, the cases where it is really necessary for performance are actually quite few and should already be tested rigorously. It is unlikely that code that does not enable strict aliasing would be able to take advantage of the restrict keyword. Using the restrict keyword allows a significant class of memory access optimizations critical to high performance code. For more information on the restrict keyword see: Demystifying The Restrict Keyword

Demystifying The Restrict Keyword

2006-05-30T05:38:59Z

UPDATED! More examples! More detailed explanations!

Contract

The restrict keyword can be considered an extension to the strict aliasing rule. It allows the programmer to declare that pointers which share the same type (or were otherwise validly created) do not alias each other. By using restrict the programmer can declare that any loads and stores through the qualified pointer (or through another pointer copied either directly or indirectly from the restricted pointer) are the only loads and stores to the same address during the lifetime of the pointer. In other words, the pointer is not aliased by any pointers other than its own copies.

Restrict is a "no data hazards will be generated" contract between the programmer and the compiler. The compiler relies on this information to make optimizations. If the data is, in fact, aliased, the results are undefined and a programmer should not expect the compiler to output a warning. The compiler assumes the programmer is not lying.

THE RESTRICT CONTRACT

I, [insert your name], a PROFESSIONAL or AMATEUR [circle one] programmer recognize that there are limits to what a compiler can do. I certify that, to the best of my knowledge, there are no magic elves or monkeys in the compiler which through the forces of fairy dust can always make code faster. I understand that there are some problems for which there is not enough information to solve. I hereby declare that given the opportunity to provide the compiler with sufficient information, perhaps through some key word, I will gladly use said keyword and not bitch and moan about how "the compiler should be doing this for me."

In this case, I promise that the pointer declared along with the restrict qualifier is not aliased. I certify that writes through this pointer will not affect the values read through any other pointer available in the same context which is also declared as restricted.

* Your agreement to this contract is implied by use of the restrict keyword ;)

Read on for more information on the practical use and benefits of the restrict keyword...

Restrict is a type qualifier

A new feature of C99: The restrict type qualifier allows programs to be written so that translators can produce significantly faster executables. [...] Anyone for whom this is not a concern can safely ignore this feature of the language.

-- From Rationale for International Standard - Programming Languages - C [std.dkuug.dk] (6.7.3.1 Formal definition of restrict)

The restrict keyword is a type qualifier for pointers and is a formal part of the C99 standard.

Example usage:

int* restrict foo;

Notice that the restrict keyword qualifies the pointer and not the object being pointed to.
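
To make the placement concrete, here is a small sketch (add_arrays is a hypothetical function, not taken from the article). In each parameter the qualifier sits after the *, so it is the pointer that is restricted; const, where present, still applies to the floats being pointed to.

```c
#include <assert.h>
#include <stddef.h>

/* The caller promises that the memory written through dst is not
   also reached through a or b for the duration of the call. */
static void
add_arrays( float* restrict       dst,
            const float* restrict a,
            const float* restrict b,
            size_t                count )
{
  for (size_t i=0;i<count;i++)
  {
    dst[i] = a[i] + b[i];
  }
}

static void
add_arrays_example( void )
{
  const float a[3] = { 1.0f, 2.0f, 3.0f };
  const float b[3] = { 10.0f, 20.0f, 30.0f };
  float       sum[3];

  add_arrays( sum, a, b, 3 );  /* Three distinct arrays: contract honored */

  assert( sum[0] == 11.0f );
  assert( sum[2] == 33.0f );
}
```

Calling add_arrays( a, a, b, 3 ) would break the promise, and the compiler is not required to notice.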

Not all compilers are compliant with the C99 standard. For example, Microsoft's compiler does not support the C99 standard at all. If you are using MSVC on an x86 platform you will not have access to this critical optimization option.

When using GCC, remember to enable the C99 standard by adding -std=c99 to your compilation flags. In code that cannot be compiled with C99, use either __restrict or __restrict__ to enable the keyword as a GCC extension.
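
One common way to package this is a project-wide macro. The following is a sketch only: RESTRICT is a hypothetical macro name, copy_ints is a hypothetical function, and the final fallback simply drops the qualifier, which is safe but gives up the optimization.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)
#  define RESTRICT restrict       /* C99 or later: standard keyword  */
#elif defined(__GNUC__)
#  define RESTRICT __restrict__   /* GCC extension outside C99 mode  */
#else
#  define RESTRICT                /* Unknown compiler: no qualifier  */
#endif

/* Written in C90 style so it also compiles where the first branch
   of the macro is not taken. */
static void
copy_ints( int* RESTRICT dst, const int* RESTRICT src, size_t count )
{
  size_t i;

  for (i=0;i<count;i++)
  {
    dst[i] = src[i];
  }
}

static void
copy_ints_example( void )
{
  const int src[2] = { 4, 5 };
  int       dst[2] = { 0, 0 };

  copy_ints( dst, src, 2 );

  assert( dst[0] == 4 && dst[1] == 5 );
}
```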

The restrict keyword was not included as part of the C++98 standard. However, some C++ compilers may support it as an extension. When restrict is used in C++, it is important to remember that the implicit this pointer should also be restricted. Consult your compiler's manual for how to do this, if possible.

An understanding of the strict aliasing rule will provide good context for problems related to the restrict keyword.

Why was restrict introduced into C99?

The problem that the restrict qualifier addresses is that potential aliasing can inhibit optimizations. Specifically, if a translator cannot determine that two different pointers are being used to reference different objects, then it cannot apply optimizations such as maintaining the values of the objects in registers rather than in memory, or reordering loads and stores of these values. This problem can have a significant effect on a program that, for example, performs arithmetic calculations on large arrays of numbers. The effect can be measured by comparing a program that uses pointers with a similar program that uses file scope arrays (or with a similar Fortran program). The array version can run faster by a factor of ten or more on a system with vector processors. Where such large performance gains are possible, implementations have of course offered their own solutions, usually in the form of compiler directives that specify particular optimizations. Differences in the spelling, scope, and precise meaning of these directives have made them troublesome to use in a program that must run on many different systems. This was the motivation for a standard solution.

-- From Rationale for International Standard - Programming Languages - C [std.dkuug.dk] (6.7.3.1 Formal definition of restrict)

In other words, proper use of the restrict keyword gives the compiler enough information to select a more optimal order of loads and stores to/from memory and to potentially make better use of registers to store non-aliased objects.
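
A small sketch of the register case (scale_all is a hypothetical function): without restrict, a store to out[i] might change *factor, since both are floats, so the compiler has to reload it on every iteration. With both pointers restricted it is free to load *factor once and keep it in a register. Whether a given compiler actually does so is worth confirming in the generated code.

```c
#include <assert.h>
#include <stddef.h>

static void
scale_all( float* restrict out, const float* restrict factor, size_t count )
{
  for (size_t i=0;i<count;i++)
  {
    out[i] *= *factor;  /* *factor cannot be changed by the store to out[i] */
  }
}

static void
scale_all_example( void )
{
  const float factor    = 2.0f;
  float       values[3] = { 1.0f, 2.0f, 3.0f };

  scale_all( values, &factor, 3 );

  assert( values[0] == 2.0f );
  assert( values[1] == 4.0f );
  assert( values[2] == 6.0f );
}
```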

Non-aliased Memory Windows

Given the following structure, there is a significant difference in performance in even the smallest update loops.

typedef struct vector3  vector3;

struct vector3
{
  float x;
  float y;
  float z;
};

What follows is a simple example function that updates some "particles" with unrestricted pointers. Note that the pointers share the same type, so the compiler will assume they can be aliased, per the strict aliasing rule.

The example code sections in the article are not meant to serve as examples of real production code, but rather as examples of real patterns often found in production code.

void
move( vector3* velocity,
      vector3* position,
      vector3* acceleration,
      float    time_step,
      size_t   count )
{
  for (size_t i=0;i<count;i++)
  {
    velocity[i].x += acceleration[i].x * time_step;
    velocity[i].y += acceleration[i].y * time_step;
    velocity[i].z += acceleration[i].z * time_step;
    position[i].x += velocity[i].x     * time_step;
    position[i].y += velocity[i].y     * time_step;
    position[i].z += velocity[i].z     * time_step;
  }
}

This article will examine the assembly output generated for the PowerPC. However, the principles and suggestions presented are applicable to many common architectures.

# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
  cmpwi  0,6,0
  stwu   1,-16(1)
  beq-   0,.L7
  li     8,0
  mtctr  6
.L8:
  add    9,8,3
  lfsx   13,8,5
  add    10,8,5
  lfsx   0,8,3
  lfs    8,4(9)
  add    11,8,4
  lfs    5,8(10)
  lfs    7,4(10)
  lfs    6,8(9)
  fmadds 4,13,1,0
  fmadds 3,7,1,8
  fmadds 2,5,1,6
  stfsx  4,8,3      # Store velocity_x
  stfs   3,4(9)     # Store velocity_y
  stfs   2,8(9)     # Store velocity_z
  lfsx   11,8,4     # Load position_x
  lfs    10,4(11)   # Load position_y
  lfs    9,8(11)    # Load position_z
  fmadds 12,4,1,11
  fmadds 0,3,1,10
  fmadds 13,2,1,9
  stfsx  12,8,4
  addi   8,8,12
  stfs   0,4(11)
  stfs   13,8(11)
  bdnz   .L8
.L7:
  addi   1,1,16
  blr

Notice above that position must wait for velocity to be stored. This is because the compiler cannot guarantee that the two are not aliased and must assume that the write to velocity can overwrite the location where position will be read. Because the compiler must effectively perform the operations in the order declared in the source, it must assume this is the behavior the programmer intended.

The use of unrestricted pointers inhibits the compiler's ability to schedule loads and may cause redundant loads in many cases. With few exceptions, accessing any value through a pointer will force the compiler to load, or reload, the value after any store. This is because the compiler cannot guarantee that the value being loaded was not aliased by the value that was stored.

For instance, there is no reason (other than sanity) why the programmer could not call the function in this way:

void 
call_move( vector3* some_data, float time_step, size_t count )
{
  move( some_data, some_data, some_data, time_step, count );
}

The use of restricted pointers would specifically disallow this.
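To make the hazard concrete, the following is a minimal sketch of what the aliased call actually computes. The vector3 layout and the two helper functions are assumptions made for illustration; they are not part of the original article. With all three parameters pointing at the same element, the position update sees the freshly stored velocity in its own location, so the result differs from the independent case.

```c
#include <stddef.h>

/* Assumed layout: the article does not show the definition of vector3. */
typedef struct vector3 { float x, y, z; } vector3;

/* Same loop body as the unrestricted move() above. */
void
move_unrestricted( vector3* velocity,
                   vector3* position,
                   vector3* acceleration,
                   float    time_step,
                   size_t   count )
{
  for (size_t i=0;i<count;i++)
  {
    velocity[i].x += acceleration[i].x * time_step;
    velocity[i].y += acceleration[i].y * time_step;
    velocity[i].z += acceleration[i].z * time_step;
    position[i].x += velocity[i].x     * time_step;
    position[i].y += velocity[i].y     * time_step;
    position[i].z += velocity[i].z     * time_step;
  }
}

/* Hypothetical helper: velocity, position and acceleration all alias. */
float
aliased_result_x( float start, float time_step )
{
  vector3 data = { start, start, start };
  move_unrestricted( &data, &data, &data, time_step, 1 );
  return data.x;
}

/* Hypothetical helper: three independent elements with the same values. */
float
independent_result_x( float start, float time_step )
{
  vector3 v = { start, start, start };
  vector3 p = { start, start, start };
  vector3 a = { start, start, start };
  move_unrestricted( &v, &p, &a, time_step, 1 );
  return p.x;
}
```

Both calls are legal for the unrestricted function, which is exactly why the compiler has to keep the store to velocity ahead of the load of position.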

Compare this to the same function working with arrays of file scope. Working with file scope arrays represents the best case for the compiler with regard to alias analysis and should be used as the baseline for implementing functions with restricted pointers.

vector3 velocity     [ PARTICLE_COUNT ];
vector3 position     [ PARTICLE_COUNT ];
vector3 acceleration [ PARTICLE_COUNT ];
 
void
move( float time_step )
{
  for (size_t i=0;i<PARTICLE_COUNT;i++)
  {
    velocity[i].x += acceleration[i].x * time_step;
    velocity[i].y += acceleration[i].y * time_step;
    velocity[i].z += acceleration[i].z * time_step;
    position[i].x += velocity[i].x     * time_step;
    position[i].y += velocity[i].y     * time_step;
    position[i].z += velocity[i].z     * time_step;
  }
}

With the above code the compiler knows the arrays will be stored separately and can determine that they are three independent data windows, or stripes, and that there can be no aliasing among them. A data stripe can be thought of as a data channel made up of indexable elements.

Data Channel	Channel Elements (by Index)
velocity	[0] ---> [1] ---> [2] ---> [N]
position	[0] ---> [1] ---> [2] ---> [N]
acceleration	[0] ---> [1] ---> [2] ---> [N]

An element in a restricted data stripe can be a function of one or more elements of any other restricted data stripes, but cannot be a function of a change in an element of a data stripe.

# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
  lis    3,velocity@ha
  lis    11,acceleration@ha
  lis    9,position@ha
  la     6,velocity@l(3)
  la     5,acceleration@l(11)
  la     7,position@l(9)
  li     8,0
  stwu   1,-16(1)
  li     0,8192
  mtctr  0
.L18:
  add    12,8,6
  lfsx   12,8,6     # Load  velocity     + 0
  add    10,8,5
  lfsx   13,8,5     # Load  acceleration + 0
  lfs    8,4(12)    # Load  velocity     + 4
  add    4,8,7
  lfs    5,8(10)    # Load  acceleration + 8
  lfs    6,8(12)    # Load  velocity     + 8
  lfs    7,4(10)    # Load  acceleration + 4
  fmadds 9,13,1,12
  fmadds 10,7,1,8
  fmadds 11,5,1,6
  lfsx   4,8,7      # Load  position     + 0
  lfs    3,4(4)     # Load  position     + 4
  lfs    2,8(4)     # Load  position     + 8
  fmadds 0,9,1,4
  fmadds 13,10,1,3
  fmadds 12,11,1,2
  stfsx  9,8,6      # Store velocity     + 0
  stfs   11,8(12)   # Store velocity     + 8
  stfs   10,4(12)   # Store velocity     + 4
  stfsx  0,8,7      # Store position     + 0
  addi   8,8,12
  stfs   13,4(4)    # Store position     + 4
  stfs   12,8(4)    # Store position     + 8
  bdnz   .L18
  addi   1,1,16
  blr

All the stores are completed at the end of the loop. More specifically, the load for position is scheduled before the store of velocity. This validates that the compiler has enough information to determine that the values stored do not alias the values loaded.

In order to get this same behavior with non-file scope pointers, use the restrict keyword to declare that every location which is either loaded or stored has no aliases.

void
move( vector3* velocity, 
      vector3* position, 
      vector3* acceleration, 
      float    time_step, 
      size_t   count, 
      size_t   stride )
{
  float* restrict acceleration_x = &acceleration->x;
  float* restrict velocity_x     = &velocity->x;
  float* restrict position_x     = &position->x;
  float* restrict acceleration_y = &acceleration->y;
  float* restrict velocity_y     = &velocity->y;
  float* restrict position_y     = &position->y;
  float* restrict acceleration_z = &acceleration->z;
  float* restrict velocity_z     = &velocity->z;
  float* restrict position_z     = &position->z;

  for (size_t i=0;i<(count*stride);i+=stride)
  {
    velocity_x[i] += acceleration_x[i] * time_step;
    velocity_y[i] += acceleration_y[i] * time_step;
    velocity_z[i] += acceleration_z[i] * time_step;
    position_x[i] += velocity_x[i]     * time_step;
    position_y[i] += velocity_y[i]     * time_step;
    position_z[i] += velocity_z[i]     * time_step;
  }
}

Nine (9) non-aliased memory stripes were declared in the above code. This completely defines the aliasing relationships between all the loads and stores.

Data Channel	Channel Elements (by Index)
velocity_x	[0] ---> [1] ---> [2] ---> [N]
velocity_y	[0] ---> [1] ---> [2] ---> [N]
velocity_z	[0] ---> [1] ---> [2] ---> [N]
position_x	[0] ---> [1] ---> [2] ---> [N]
position_y	[0] ---> [1] ---> [2] ---> [N]
position_z	[0] ---> [1] ---> [2] ---> [N]
acceleration_x	[0] ---> [1] ---> [2] ---> [N]
acceleration_y	[0] ---> [1] ---> [2] ---> [N]
acceleration_z	[0] ---> [1] ---> [2] ---> [N]

By copying addresses from one pointer to another, an implicit hierarchy (or tree) of pointers is created. The child pointers are usually completely aliased by the parent pointer and it's important not to use them both at the same time (i.e. in the same scope). When restricted child pointers are created, consider the parent pointer to be out of scope and do not make any accesses through it. Note that in this case, any use of velocity, position or acceleration would invalidate the restrict contract and the results would be undefined.

                |---> velocity_x
velocity -------|---> velocity_y
                |---> velocity_z

                |---> position_x
position -------|---> position_y
                |---> position_z

                |---> acceleration_x
acceleration ---|---> acceleration_y
                |---> acceleration_z

Typically, only the leaf nodes in a hierarchy of restricted pointers should be used.

# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
  stwu   1,-32(1)
  stw    31,28(1)
  mullw  31,6,7
  stw    30,24(1)
  cmplwi 7,31,0
  mr     30,7
  addi   12,3,4
  addi   6,5,4
  addi   8,4,4
  addi   7,5,8
  addi   10,3,8
  addi   11,4,8
  li     9,0
  ble-   7,.L27
.L31:
  slwi   0,9,2
  lfsx   13,3,0     # Load  velocity_x
  add    9,9,30
  lfsx   8,12,0     # Load  velocity_y
  cmplw  7,31,9
  lfsx   6,10,0     # Load  velocity_z
  lfsx   12,5,0     # Load  acceleration_x
  lfsx   7,6,0      # Load  acceleration_y
  lfsx   5,7,0      # Load  acceleration_z
  fmadds 11,12,1,13
  fmadds 10,7,1,8
  fmadds 9,5,1,6
  lfsx   4,4,0      # Load  position_x
  lfsx   3,8,0      # Load  position_y
  lfsx   2,11,0     # Load  position_z
  fmadds 0,11,1,4
  fmadds 13,10,1,3
  fmadds 12,9,1,2
  stfsx  11,3,0     # Store velocity_x
  stfsx  10,12,0    # Store velocity_y
  stfsx  9,10,0     # Store velocity_z
  stfsx  0,4,0      # Store position_x
  stfsx  13,8,0     # Store position_y
  stfsx  12,11,0    # Store position_z
  bgt+   7,.L31
.L27:
  lwz    30,24(1)
  lwz    31,28(1)
  addi   1,1,32
  blr

This version has all the flexibility of the first (unrestricted) version and the performance of the second (file scope arrays) version. You should expect code where all aliasing information is declared with the restrict keyword to almost always perform significantly better, and never worse, than with unrestricted pointers. This is especially true on superscalar RISC, or RISC-like architectures with large register files, like the PowerPC or MIPS R4000.

The astute reader may also have noticed that because nine (9) restricted stripes were used instead of three (3) file scope arrays, the compiler has been able to select a much simpler addressing scheme. Much of the pointer arithmetic has been hoisted out of the loop. The version with the restricted pointers is actually more efficient than the one with file scope arrays.

Non-aliased Memory Access Patterns

An important distinction to make is that the restrict keyword is not restricting anything. It is in fact allowing the compiler to do more than it could previously. It should also be noted that the type of the pointer that is qualified with restrict is not important, it is only important what location and size was used when loading or storing from the pointer. The restrict keyword does not declare that the object being pointed to is completely without aliases, only that the addresses that are loaded and stored from are unaliased.

For example, the following setup would be a completely valid use of restricted pointers:

struct particle
{
  vector3 position;
  vector3 velocity;
  vector3 acceleration;
};
 
[ ... ]
 
void 
call_move( particle* particles, float time_step, size_t count )
{
  move( &particles->position, 
        &particles->velocity, 
        &particles->acceleration, 
        time_step, 
        count, 
        sizeof(particle) );
}

Although each stripe of data is part of the same "object", none of the accesses would be aliased. Some runtime systems try to determine whether or not pointers are aliased by simply checking to see if the memory windows overlap. That is not sufficient.

Memory windows can overlap and still be non-aliased.
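As a smaller illustration of the same point, here is a sketch (the function and parameter names are hypothetical) in which two restricted pointers walk the same array. The memory windows overlap almost entirely, yet no individual float is ever reached through both pointers: even-indexed elements are touched only through one pointer and odd-indexed elements only through the other.

```c
#include <stddef.h>

/* Hypothetical example. Intended to be called as
 *   add_odd_to_even( data, data + 1, element_count / 2 );
 * so that `even` reaches data[0], data[2], ... and `odd` reaches
 * data[1], data[3], ... The windows overlap; the accesses do not. */
void
add_odd_to_even( float* restrict       even,
                 const float* restrict odd,
                 size_t                pair_count )
{
  for (size_t i=0;i<pair_count;i++)
  {
    even[2*i] += odd[2*i];
  }
}
```

A runtime check that only compared the two address ranges would reject this call even though the restrict contract holds.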

Usage and Suggestions

Use of the restrict keyword should be very common. It should be used as a standard part of all new code. Older code should be revisited as possible to take advantage of the new optimization opportunities. It is somewhat difficult to refactor restricted requirements into pre-existing code as a certain amount of alias analysis must be done by the programmer. However, for the majority of live code in typical applications, memory access is not aliased (nor are memory windows overlapping) and aliasing hazards will be limited to a small fraction of the code base.

Before modifying code to use the restrict keyword, ensure that all code can compile safely with strict aliasing enabled.

Programmers using functions that make assumptions about aliasing must know what those assumptions are. Certainly, if at all possible, memory usage patterns should be documented. However, at the very least, aliasing assumptions in the parameters passed to the functions should be declared. In the above examples, the parameters velocity, position and acceleration must not be aliased and the restrict contract should be made public by also declaring those parameters restricted.

void 
move( vector3* restrict velocity, 
      vector3* restrict position, 
      vector3* restrict acceleration, 
      float             time_step, 
      size_t            count, 
      size_t            stride );

Not publishing aliasing assumptions will lead to bugs that are very difficult to find. Programmers will not know that the data must be independent and someone, someday, will find a reason to pass the same array through two or more pointers.

Take for example memcpy, which has been officially changed to have the following declaration:

void* 
memcpy(void*       restrict s1, 
       const void* restrict s2, 
       size_t               n );

Can you guess why?
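One way to read that declaration: memcpy has always required that source and destination not overlap, and the restrict qualifiers now state that requirement in the prototype. memmove, which tolerates overlap, is declared without them. The sketch below uses two hypothetical helpers to show which call fits which situation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: discard the first n bytes of a buffer by sliding
 * the remainder down. Source and destination overlap, so memmove is the
 * only valid choice. Assumes n <= length. */
void
discard_front( char* buffer, size_t length, size_t n )
{
  memmove( buffer, buffer + n, length - n );
}

/* Hypothetical helper: the caller promises dst and src are independent,
 * which matches the contract published by memcpy. */
void
copy_independent( char* restrict       dst,
                  const char* restrict src,
                  size_t               length )
{
  memcpy( dst, src, length );
}
```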

Use restrict in function prototypes and in structure definitions to publish the assumptions made about aliasing.
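For the structure case, a minimal sketch might look like the following. The particle_stream type and advance_x function are made up for illustration; the only point is that the restrict qualifiers on the members tell every user of the structure that the two arrays must be independent.

```c
#include <stddef.h>

/* Hypothetical structure: the restrict qualifiers publish the assumption
 * that position_x and velocity_x never refer to the same floats. */
typedef struct particle_stream
{
  float* restrict position_x;
  float* restrict velocity_x;
  size_t          count;
} particle_stream;

void
advance_x( const particle_stream* stream, float time_step )
{
  for (size_t i=0;i<stream->count;i++)
  {
    stream->position_x[i] += stream->velocity_x[i] * time_step;
  }
}
```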

Restricted pointers can be copied from one to another to create a hierarchy of pointers. However there is one limitation defined in the C99 standard. The child pointer must not be in the same block-level scope as the parent pointer. The result of copying restricted pointers in the same block-level scope is undefined.

{
  vector3* restrict position   = &obj_a->position;
  float*   restrict position_x = &position->x; <-- UNDEFINED
  {
    float* restrict position_y = &position->y; <-- VALID
  }
}

Restricted child pointers must be in a different block-level scope than the parent pointer.

There is one additional problem in the assembly output above which is somewhat particular to the GCC scheduler. Notice that the load for position happens immediately before its update and store. The first multiply-add will stall waiting for the first load to complete before executing. The first float (position_x) will not be ready in three (3) cycles. It would be considerably better (and faster) if the load could be pushed closer to the top of the loop so that it is more likely to be completed by the time it is needed.

  lfsx   4,4,0      # Load   position_x
  lfsx   3,8,0      # Load   position_y
  lfsx   2,11,0     # Load   position_z
  fmadds 0,11,1,4   # Update position_x
  fmadds 13,10,1,3  # Update position_y
  fmadds 12,9,1,2   # Update position_z

Due to the order in which scheduling is done in GCC, it is always better to simplify expressions. Do not mix memory access with calculations. The code can be re-written as follows:

void
move( vector3* restrict velocity, 
      vector3* restrict position, 
      vector3* restrict acceleration, 
      float             time_step,  
      size_t            count, 
      size_t            stride )
{
  float* restrict acceleration_x = &acceleration->x;
  float* restrict velocity_x     = &velocity->x;
  float* restrict position_x     = &position->x;
  float* restrict acceleration_y = &acceleration->y;
  float* restrict velocity_y     = &velocity->y;
  float* restrict position_y     = &position->y;
  float* restrict acceleration_z = &acceleration->z;
  float* restrict velocity_z     = &velocity->z;
  float* restrict position_z     = &position->z;

  for (size_t i=0;i<(count*stride);i+=stride)
  {
    const float ax  = acceleration_x[i];
    const float ay  = acceleration_y[i];
    const float az  = acceleration_z[i];
    const float vx  = velocity_x[i];
    const float vy  = velocity_y[i];
    const float vz  = velocity_z[i];
    const float px  = position_x[i];
    const float py  = position_y[i];
    const float pz  = position_z[i];

    const float nvx = vx + ( ax * time_step );
    const float nvy = vy + ( ay * time_step );
    const float nvz = vz + ( az * time_step );
    const float npx = px + ( vx * time_step );
    const float npy = py + ( vy * time_step );
    const float npz = pz + ( vz * time_step );

    velocity_x[i]   = nvx;
    velocity_y[i]   = nvy;
    velocity_z[i]   = nvz;
    position_x[i]   = npx;
    position_y[i]   = npy;
    position_z[i]   = npz;
  }
}

# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
  stwu   1,-32(1)
  stw    31,28(1)
  mullw  31,6,7
  stw    30,24(1)
  cmplwi 7,31,0
  mr     30,7
  addi   12,3,4
  addi   6,5,4
  addi   8,4,4
  addi   7,5,8
  addi   10,3,8
  addi   11,4,8
  li     9,0
  ble-   7,.L47
.L51:
  slwi   0,9,2
  lfsx   8,3,0       # Load   vx
  add    9,9,30
  lfsx   7,12,0      # Load   vy
  cmplw  7,31,9
  lfsx   6,10,0      # Load   vz
  lfsx   10,4,0      # Load   px
  lfsx   9,8,0       # Load   py
  lfsx   5,11,0      # Load   pz
  lfsx   4,5,0       # Load   ax
  lfsx   3,6,0       # Load   ay
  lfsx   2,7,0       # Load   az
  fmadds 0,8,1,10    # Update npx
  fmadds 13,7,1,9    # Update npy
  fmadds 12,6,1,5    # Update npz
  fmadds 11,4,1,8    # Update nvx
  fmadds 10,3,1,7    # Update nvy
  fmadds 9,2,1,6     # Update nvz
  stfsx  0,4,0       # Store  npx
  stfsx  13,8,0      # Store  npy
  stfsx  12,11,0     # Store  npz
  stfsx  11,3,0      # Store  nvx
  stfsx  10,12,0     # Store  nvy
  stfsx  9,10,0      # Store  nvz
  bgt+   7,.L51
.L47:
  lwz    30,24(1)
  lwz    31,28(1)
  addi   1,1,32
  blr

The loads are now properly scheduled and moved as far in advance as possible. The pattern [Load --> Update --> Store] is usually the optimal pattern for simple memory transformations on a superscalar RISC-like architecture, and is exactly what is being emitted. This is reasonably close to good hand-written assembly for the same code (without re-defining the problem), and the code is now very suitable for unrolling.

Simplify expressions. Do not mix memory access with calculations. Use the [ Load --> Update --> Store ] pattern.
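To show why this shape unrolls well, here is a sketch of the same pattern unrolled by two on a simpler, hypothetical transformation (scale_add is not from the original article). All loads for both elements come first, then the updates, then the stores. Whether a given compiler schedules this better than the rolled loop has to be confirmed by reading the assembly output.

```c
#include <stddef.h>

/* Hypothetical example: dst[i] += src[i] * scale, unrolled by two. */
void
scale_add( float* restrict       dst,
           const float* restrict src,
           float                 scale,
           size_t                count )
{
  size_t i = 0;

  for (;i+1<count;i+=2)
  {
    /* Load */
    const float d0 = dst[i];
    const float d1 = dst[i+1];
    const float s0 = src[i];
    const float s1 = src[i+1];

    /* Update */
    const float n0 = d0 + ( s0 * scale );
    const float n1 = d1 + ( s1 * scale );

    /* Store */
    dst[i]   = n0;
    dst[i+1] = n1;
  }

  /* Remaining element when count is odd. */
  if (i<count)
  {
    const float d0 = dst[i];
    const float s0 = src[i];
    dst[i] = d0 + ( s0 * scale );
  }
}
```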

Summary

Strict aliasing means that two objects of different types cannot refer to the same location in memory. Enable this option in GCC with the -fstrict-aliasing flag. Be sure that all code can safely run with this rule enabled. Enable strict aliasing related warnings with -Wstrict-aliasing, but do not expect to be warned in all cases.
Compare the assembly output of the function with restricted pointers and file scope arrays to ensure that all of the possible aliasing information has been used.
Only use restricted leaf pointers. Use of parent pointers may break the restrict contract.
Publish as many assumptions as possible about aliasing information in the function declaration.
Memory windows may be overlapping and still be without aliases. Do not limit the data design to non-overlapping windows.
Begin using the restrict keyword immediately. Retrofit old code as soon as possible.
Keep loads and stores separated from calculations. This results in better scheduling in GCC, and makes the relationship between the output assembly and the original source clearer.
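On the first point above, code that reinterprets one type as another through a pointer cast is the usual reason a code base cannot be compiled safely with strict aliasing. One common way to make such code safe, sketched here under the assumption of a 32-bit IEEE 754 float, is to copy the bytes instead of casting the pointer. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read the bit pattern of a float without accessing
 * it through a uint32_t pointer, which would break the strict aliasing
 * rule. Assumes sizeof(float) == sizeof(uint32_t). */
uint32_t
float_bits( float value )
{
  uint32_t bits;
  memcpy( &bits, &value, sizeof(bits) );
  return bits;
}
```

Compilers commonly reduce a fixed-size memcpy like this to a plain register move, but that should be checked in the assembly output rather than assumed.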

Additional Reading

Avoiding Microcoded Instructions On The PPU

2006-04-29T06:31:44Z

What are microcoded instructions?

Microcode is a special instruction set that is (usually) only available to the hardware. On the PPU (PowerPC Unit), small microprograms made up of microcode are stored in ROM and executed in the place of those PowerPC instructions that were too costly to implement directly in hardware or do not fit into the pipeline design very well. The size of a microprogram is measured in microwords.

The PowerPC instructions for which a microprogram is executed are often called microcoded instructions.

Microcoded instructions may be conditionally executed or unconditionally executed. Unconditionally executed microcoded instructions always execute the microprogram. Conditionally executed microcoded instructions will only execute the microprogram when the values of the register operands are exceptional in some way. Microcoded instructions are a special case of normal instructions and conditionally executed microcoded instructions are a special case of those.

Why avoid microcoded instructions?

The G5 core implements several instructions in microcode. These instructions cause a pipeline bubble during decode. The most commonly used microcoded instructions are load and store multiple -- lmw and stmw. These are often generated by the compiler to save space when saving and restoring registers on the stack. You can force GCC to avoid these instructions by specifying -mnomultiple. Indexed forms and/or algebraic forms of updating load and stores are also executed as microcode. You can force GCC to avoid these instructions by specifying -mno-update.

-- From G5 Performance Primer [apple.com]

Like the G5, the PPU contains microcoded instructions. Microcoded instructions are implemented in order to maintain compatibility with the PowerPC standard (a processor can only be called a PowerPC processor if it adheres to the standard [ibm.com].) When one of these instructions is decoded, the current pipeline is flushed, and the microcoded program is then fetched from ROM and executed as a single atomic unit. The process of flushing the pipeline, fetching the microcode and executing the program takes quite a long time compared to other instructions. Additionally, because the instruction must be executed atomically in order to remain as transparent to the user as possible, any resources needed by the microcode program must be locked.

;; micr insns will stall at least 7 cycles to get the first instr from ROM, micro instructions are not dual issued.

-- From cellpu.md (CBE Toolchain 2.3 source code [bsc.es])

The minimum seven (7) cycle stall for microcoded instructions is derived from the fixed stages of the microcode section of the instruction pipeline. Microcoded stages are inserted after the last instruction buffer stage and before the first instruction decode stage. The actual penalty is determined by the complexity and length of the instruction.

For more information on the PPU pipeline stages see: Introduction to the Cell multiprocessor [ibm.com]

The details on which instructions are microcoded and the associated penalties are specific to each PowerPC device and are outlined in the User's Guide for the individual processor. For example, see the IBM PowerPC 970FX RISC Microprocessor User's Guide [ibm.com] paying particular attention to Section 6.3.3 Instruction Decode, Cracking, and Microcode.

The PPU User's Guide has not been released publicly. So how is a programmer to know which instructions are microcoded and how to avoid them?

Read on to find out.

UPDATE: 11 MAY 2006

On May 10, 2006 IBM released the Cell Broadband Engine Programming Handbook [ibm.com]. Section A.1.3.1 (Unconditionally Microcoded Instructions) has a detailed list of those instructions which are always microcoded, including latency information and microword count. Before this document was released there were no public documents which described in detail the penalties for using microcoded instructions. This article has been updated to reflect those details.

From the document:

Note: A minimum of 11 cycles are required before the first instruction is received from the microcode ROM, so microcoded instructions should be avoided if possible.

Most microcoded instructions are decoded into two or three simple PowerPC instructions, and they can be avoided in most cases. The microcoded instructions are typically decomposed into an integer and a load or store operation, with a dependency between them. Although most microcoded PowerPC instructions are decoded into only a few simple instructions, it is important to keep in mind that there are typically dependencies between the internal operations of the microcode, which generate stalls at the issue stage. Replacing the microcoded instructions with PowerPC instructions not only avoids stalling but also gives more latitude in scheduling instructions to avoid stalls, as well as potentially improving multithreaded performance.

Microcoded instruction scheduling

Like many of the specific details of the processor, any good compiler needs to understand (and take advantage of) the predicted latency and throughput information on each instruction. So a programmer need look no further than the CBE GCC source code [bsc.es] for a list of microcoded instructions.

Here's an example entry from the rs6000.md file (which is used by the cell-ppu target) which flags the first instruction in the replacement ("rldicl.") as being microcoded.

(define_insn ""
  [(set (match_operand:CC 0 "cc_reg_operand" "=x,?y")
        (compare:CC (zero_extend:DI (match_operand:QI 1 "gpc_reg_operand" "r,r"))
                    (const_int 0)))
   (clobber (match_scratch:DI 2 "=r,r"))]
  "TARGET_64BIT"
  "@
   rldicl. %2,%1,0,56
   #"
  [(set_attr "type" "compare")
   (set_attr "microcode" "mc,*")
   (set_attr "length" "4,8")])

The above snippet is written in RTL [gnu.org], which stands for Register Transfer Language and is used to describe the processor specific assembly output in GCC. Assembly-level transformations, such as peephole optimizations, are also described in RTL. For an introduction to RTL see: Using and Porting the GNU Compiler Collection (GCC) [gnu.org] and Porting GCC For Dummies [axis.se]

Here is a partial list of the microcoded instructions explicitly flagged in the same file:

and. andi. andil. andis. andiu. doz*. lhau lhaux lm lmw lsi lswi mr. mullw. muls. neg. nor. or. rldic*. rlinm. rlwinm. s*i. s*wi. sf. sl sle. sli. slw slwi. sr sre. srw stm stmw stsi stswi subf. subfc.

For the definitive list of microcoded instructions see the Cell Broadband Engine Programming Handbook [ibm.com]. The corresponding instructions have been added to the sections below.

Avoiding microcoded instructions

Microcoded instructions, such as load/store multiple, were designed to save space in compiled code and offer no performance advantage over using multiple instructions. Because of the way these instructions are handled inside the processor, they might have a greater latency and take longer to execute than a sequence of individual instructions that produces the same results. Some compilers (gcc for example) have options that prevent generation of these instructions.

-- From PowerPC processor tips: Improve PowerPC 970FX performance [ibm.com]

Fortunately, there is a GCC flag that will warn the programmer if a known microcoded instruction is emitted. Simply add -mwarn-microcode to your compilation flags.

This flag is defined in rs6000.h:

{"warn-microcode", &rs6000_warn_microcode_switch,          \
 N_("Emitting warning of microcode") },                    \
{"no-warn-microcode", &rs6000_warn_microcode_switch, "" }, \

And is processed in rs6000.c:

/* Handle -m(no-)warn-microcode similarly.  */
if (rs6000_warn_microcode_switch)
  {
    const char *base = rs6000_warn_microcode_switch;
    while (base[-1] != 'm') base--;

    if (*rs6000_warn_microcode_switch != '\0')
      error ("invalid option `%s'", base);
    rs6000_warn_microcode = (base[0] != 'n');
  }

And is used in final.c:

#ifdef RS6000_GENERATE_MICROCODE
  /* 0 - notmicrocode, 1 - conditional microcode, 2 - microcode */
  if (rs6000_warn_microcode)
    {
      if (get_attr_microcode(insn) == 2)
        pedwarn ("emitting microcode insn %s\t[%s] #%d",template,
                 insn_data[INSN_CODE(insn)].name,INSN_UID(insn));
      else if (get_attr_microcode(insn) == 1)
        pedwarn ("emitting conditional microcode insn %s\t[%s] #%d",template,
                 insn_data[INSN_CODE(insn)].name,INSN_UID(insn));
    }
#endif

The other PowerPC specific compilation flags can also be found in these files.

The compiler source is the best source for information on compiler flags and processor specific options. Some flags do not make it into the help output.

Note that -mwarn-microcode is not in the gcc help list of flags.

How does this affect code in practice? From the above list, there are three main classes of microcoded instructions to watch out for.

Avoid multiple load/store instructions

These instructions are handy to load or store a small contiguous area of memory. However, it will always be faster to simply load each individual value into a register. GCC will not emit these instructions if the -mno-multiple flag is passed to the compiler.

List of microcoded load and store instructions, including load/store multiple. From: Cell Broadband Engine Programming Handbook [ibm.com]

Unconditionally Microcoded Loads and Stores

A microcode load or store operation can access an 8-bit byte or a 32-bit word, indicated as "by byte" or "by word" respectively.

|-------------|--------------------------|-------------|--------------------------------------|--------------------------------------|
| INSTRUCTION | CLASS                    | LATENCY     | MICROWORD SIZE                       | COMMENT                              |
|-------------|--------------------------|-------------|--------------------------------------|--------------------------------------|
| lha         | load algebraic           | 11          | 7                                    | Handled by byte.                     |
| lhau        | load algebraic           | 11          | 8                                    | Handled by byte.                     |
| lhaux       | load algebraic           | 11          | 8                                    | Handled by byte.                     |
| lhax        | load algebraic           | 11          | 8                                    | Handled by byte.                     |
| lmw         | load multiple            | 11          | (2 + 1 × words)                      | This instruction is broken down      |
|             |                          |             |                                      | into a series of load words.         |
| lswi        | load string / optimized  | 10          | By word: (1 × words + 2 × bytes)     | Optimized instruction[1]             |
|             |                          |             | By byte: (2 × bytes)                 |                                      |
| lswx        | load string / optimized  | By word: 10 | By word: 4 + (1 × words + 2 × bytes) | Optimized instruction[1]             |
|             |                          | By byte: 7  | By byte: 4 + (2 × bytes)             |                                      |
| lwa         | load algebraic           | 11          | 13                                   | Handled by byte.                     |
| lwaux       | load algebraic           | 11          | 12                                   | Handled by byte.                     |
| lwax        | load algebraic           | 11          | 12                                   | Handled by byte.                     |
| stmw        | store multiple           | 11          | (2 + 1 × words)                      | Broken into a series of store words. |
| stswi       | store string / optimized | 10          | By word: (1 × words + 2 × bytes)     | Optimized instruction[1]             |
|             |                          |             | By byte: (2 × bytes)                 |                                      |
| stswx       | store string / optimized | 7           | By word: 4 + (1 × words + 2 × bytes) | Optimized instruction[1]             |
|             |                          |             | By byte: 4 + (2 × bytes)             |                                      |
|-------------|--------------------------|-------------|--------------------------------------|--------------------------------------|

[1] The instruction is first broken down into a series of load-word instructions (odd bytes are handled by byte). If this does not cause an alignment exception, then the instruction is complete. If an alignment exception occurs, the first attempt is flushed. When the instruction is returned to microcode it is then handled a byte at a time. Odd bytes, if any, are defined as the remainder of string_count / 4. For store instructions, it is a series of store words.

Avoid Condition Register recording integer instructions

Many of the integer functions, when the Condition Register (CR) modify bit is set (denoted by a "dot" at the end of the instruction), are microcoded. With this bit set, fixed-point instructions will automatically set the first field (field zero) in the Condition Register with the value's compare-with-zero result. For example, if the result of the "or." instruction is greater than zero, the GT bit will be set in CR[0].

In general, this makes branching on integer expressions more expensive and an effort should be made to eliminate them.

List of CR recording microcoded instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]

Unconditionally Microcoded Instructions (CR recording)

Record instructions are all handled the same way. The "root" instruction is issued followed by the cmpi_x instruction. The nonrecord form used in the microcode sequence is only available to microcode.

|-------------|---------|----------------|
| INSTRUCTION | LATENCY | MICROWORD SIZE |
|-------------|---------|----------------|
| and.        | 11      | 2              |
| andc.       | 11      | 2              |
| andi.       | 11      | 2              |
| andis.      | 11      | 2              |
| nand.       | 11      | 2              |
| nor.        | 11      | 2              |
| nego.       | 11      | 2              |
| or.         | 11      | 2              |
| orc.        | 11      | 2              |
| xor.        | 11      | 2              |
| cntlzd.     | 11      | 2              |
| cntlzw.     | 11      | 2              |
| divd.       | 11      | 2              |
| divdo.      | 11      | 2              |
| divdu.      | 11      | 2              |
| divduo.     | 11      | 2              |
| divw.       | 11      | 2              |
| divwo.      | 11      | 2              |
| divwu.      | 11      | 2              |
| divwuo.     | 11      | 2              |
| eqv.        | 11      | 2              |
| extsb.      | 11      | 2              |
| extsh.      | 11      | 2              |
| extsw.      | 11      | 2              |
| mulhd.      | 11      | 2              |
| mulhdu.     | 11      | 2              |
| mulhw.      | 11      | 2              |
| mulhwu.     | 11      | 2              |
| mulld.      | 11      | 2              |
| mulldo.     | 11      | 2              |
| mullw.      | 11      | 2              |
| mullwo.     | 11      | 2              |
| rldcl.      | 11      | 5              |
| rldcr.      | 11      | 5              |
| rldic.      | 11      | 2              |
| rldicl.     | 11      | 2              |
| rldicr.     | 11      | 2              |
| rldimi.     | 11      | 2              |
| rlwimi.     | 11      | 2              |
| rlwinm.     | 11      | 2              |
| rlwnm.      | 11      | 5              |
| sld.        | 11      | 5              |
| slw.        | 11      | 5              |
| srad.       | 11      | 5              |
| sradi.      | 11      | 2              |
| sraw.       | 11      | 5              |
| srawi.      | 11      | 2              |
| srd.        | 11      | 5              |
| srw.        | 11      | 5              |
|-------------|---------|----------------|

Avoid indirect shift and rotate instructions

This is the simplest case to find, but the hardest to eliminate:

int64_t
right_shift_64( int64_t a, int64_t sa )
{
  return ( a >> sa );
}

This code will emit this deceptively simple function:

.right_shift_64:
  srad 3,3,4
  blr

And the following warning (if -mwarn-microcode is enabled):

test.c: In function `right_shift_64':
test.c:7: warning: emitting microcode insn srad%I2 %0,%1,%H2 [*ashrdi3_internal1] #20

The best option for eliminating indirect shift instructions is to know the range of possible shift amounts and create an alternate branch-free expression that selects between those choices.
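As a rough illustration of that idea, here is a minimal sketch for a case where the shift amount is known to be one of exactly two values. The function name, the choice of 4 and 12 as the shift amounts, and the 0-or-1 `sel` parameter are all hypothetical. The expectation is that both shifts compile to immediate-form shifts and the final select uses only and/or logic, but that is an assumption to check against the compiler output with -mwarn-microcode.

```c
#include <stdint.h>

/* Hypothetical helper: the caller guarantees sel is 0 or 1.
   sel == 0 selects (a >> 4), sel == 1 selects (a >> 12). */
static inline int64_t right_shift_4_or_12( int64_t a, int64_t sel )
{
    int64_t r0   = a >> 4;    /* constant shift amount, no indirect shift */
    int64_t r1   = a >> 12;   /* constant shift amount, no indirect shift */
    int64_t mask = -sel;      /* 0 -> all bits clear, 1 -> all bits set   */

    /* Branch-free select between the two precomputed results. */
    return ( r0 & ~mask ) | ( r1 & mask );
}
```

Both results are always computed, so this only pays off when the set of possible shift amounts is small.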

List of indirect shift and rotate microcoded instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]

|--------------------------------------------------|
| Unconditionally Microcoded Instructions          |
| (Shift and Rotate)                               |
|                                                  |
| All indirect shift and rotate instructions       |
| are handled using the same technique. First      |
| the mt_shr is issued, followed by two noops      |
| for delay, followed by the root instruction      |
| (that is, rldcl_sh).                             |
|                                                  |
|--------------------------------------------------|
| INSTRUCTION         | LATENCY | MICROWORD SIZE   |
|---------------------|---------|------------------|
| rldcl               | 11      | 4                |
| rldcr               | 11      | 4                |
| rlwnm               | 11      | 4                |
| sld                 | 11      | 4                |
| slw                 | 11      | 4                |
| srad                | 11      | 4                |
| sraw                | 11      | 4                |
| srd                 | 11      | 4                |
| srw                 | 11      | 4                |
|--------------------------------------------------|

Non-Pipelined, Complex Instructions

In addition to microcoded instructions there is another class of low-performance instructions worth mentioning: the complex pipeline instructions. These instructions are not microcoded (i.e. the resources already local to the execution pipeline can be used directly); however, they are complex enough that special handling is required. In order for these instructions to be executed the instruction pipeline must be evacuated (i.e. flushed). Therefore the throughput of these instructions will be equal to the latency. They will be slow.

List of the non-pipelined instructions. From: Cell Broadband Engine Programming Handbook [ibm.com]

|----------------------------------------------------|
| Non-Microcoded, Non-Pipelined Integer Instructions |
|---------|----------|-------------------------------|
| Instr.  | Pipeline | Latency (cycles)              |
|---------|----------|-------------------------------|
| mulli   | FXU      | 6                             |
| mullw   | FXU      | 9                             |
| mulhw   | FXU      | 9                             |
| mulhwu  | FXU      | 9                             |
| mullwo  | FXU      | 9                             |
| mulld   | FXU      | 15                            |
| mulhd   | FXU      | 15                            |
| mulhdu  | FXU      | 15                            |
| mulldo  | FXU      | 15                            |
| divd    | FXU      | 10-70                         |
| divdu   | FXU      | 10-70                         |
| divdo   | FXU      | 10-70                         |
| divduo  | FXU      | 10-70                         |
| divw    | FXU      | 10-38                         |
| divwu   | FXU      | 10-38                         |
| divwo   | FXU      | 10-38                         |
| divwuo  | FXU      | 10-38                         |
|---------|----------|-------------------------------|

Note on divide instructions:

The fixed-point divide is a variable latency operation that calculates RA and RB for word or doubleword and signed or unsigned fixed-point (integer) operands. Division is defined by the following equation:

    dividend = (quotient x divisor) + r

where:

    0 ≤ r < |divisor|, when dividend ≥ 0
    and -|divisor| < r ≤ 0, when dividend < 0

Overflow is set when an attempt is made to compute either the least negative integer divided by negative one or any integer divided by zero.

The performance is determined by the number of bits required to represent the result. PPU cycles equal:

    ((1 setup) + (ceil((rb leading digits - ra leading digits) / 2) + 1 iterations) + (1 fixup)) × 2

    word:       minimum = 10, maximum = 38 cycles
    doubleword: minimum = 10, maximum = 70 cycles

Overflow cases will complete in 10 cycles.
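To get a feel for the numbers, the cycle formula in the note can be sketched as a small helper. This is only an estimate under stated assumptions. The function name is hypothetical. `lead_diff` stands for "rb leading digits - ra leading digits", read here as a non-negative difference in leading zero counts. The clamp to 10 cycles is an assumption taken from the documented minimum, since the raw formula yields smaller values for very small differences.

```c
#include <stdint.h>

/* Hypothetical estimator for the divide latency formula quoted above.
   Assumption: lead_diff = (rb leading digits - ra leading digits), >= 0. */
static unsigned ppu_div_cycles_estimate( unsigned lead_diff )
{
    unsigned iterations = ( lead_diff + 1u ) / 2u + 1u;   /* ceil(diff / 2) + 1      */
    unsigned cycles     = ( 1u + iterations + 1u ) * 2u;  /* (setup + iter + fixup) x 2 */

    /* Assumption: clamp to the documented minimum of 10 cycles. */
    return ( cycles < 10u ) ? 10u : cycles;
}
```

With a difference of 32 this gives the documented word maximum of 38 cycles, and with 64 the doubleword maximum of 70.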

There is no method to detect complex instructions emitted by the GCC compiler. Avoid integer multiplies and divides.

Good luck with that! ;)
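One place where it is sometimes feasible is index arithmetic inside loops. The sketch below is a hypothetical example (the function and its parameters are not from the original text): rather than computing `i * stride` on every iteration, it keeps a running offset and adds to it. An optimizing compiler will often perform this strength reduction on its own, so treat this as an illustration of the idea and check the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum every stride-th element of data,
   visiting count elements in total. */
static int32_t sum_strided( const int32_t* data, size_t count, size_t stride )
{
    int32_t sum    = 0;
    size_t  offset = 0;              /* stands in for i * stride */

    for ( size_t i = 0; i < count; ++i )
    {
        sum    += data[ offset ];
        offset += stride;            /* add replaces the multiply */
    }

    return sum;
}
```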

Summary

Keep an eye out for microcoded instructions: use -mwarn-microcode in GCC.
Don't use multiple load/store instructions: use -mno-multiple in GCC.
Avoid CR recording integer instructions
Avoid indirect shift and rotate instructions
Avoid integer multiply and divide instructions

Eliminating microcoded and other non-pipelined instructions is sometimes difficult and not always desirable (for example, when code size is the determining factor in performance). However, it is important to know the penalty and make an informed choice. And as always, if you are optimizing at this level, be sure to double-check your results with a real profile on real hardware.