<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>CellPerformance</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/" />
    <link rel="self" type="application/atom+xml" href="http://cellperformance.beyond3d.com/articles/atom.xml" />
    <id>tag:cellperformance.beyond3d.com,2009-08-03:/articles//3</id>
    <updated>2009-08-07T07:45:57Z</updated>
    <subtitle>Sharing tips and experience with the cell processor, performance, data design and game programming.</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Pro 4.3-en</generator>

<entry>
    <title>Roundup: Recent sketches on concurrency, data design and performance.</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2009/08/roundup-recent-sketches-on-concurrency-data-design-and-performance.html" />
    <id>tag:cellperformance.beyond3d.com,2009:/articles//3.26</id>

    <published>2009-08-07T07:43:24Z</published>
    <updated>2009-08-07T07:45:57Z</updated>

    <summary>Recently I&apos;ve been doing some presentations as well as just general sketches of some things I&apos;ve been thinking about regarding optimization, concurrency and data design. I&apos;ve been...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[
					
			
					
    	        
          
					
					<div class="bodytext">
						<div id="FDqGpGHrEv" class="posterousGalleryMainDiv"><a class="posterousGalleryMainlink" onclick="return false;" href="http://macton.posterous.com/roundup-recent-sketches-on-concurrency-data-d#"><img id="mainImage" src="http://posterous.com/getfile/files.posterous.com/macton/MP1YPZCziTxR9EYEgNm43z7iyQyvfHDziBW5oAQbqTotCT5PMjvTB6bSZ13b/578066108_qy4bS-L.jpg.scaled.500.jpg" width="500" height="334" /><span class="show"></span></a></div>
      <p>Recently
I've been doing some presentations as well as just general sketches of
some things I've been thinking about regarding optimization,
concurrency and data design. I've been posting them on Twitter to
gather feedback from my pals there. A couple have caused a little
controversy, but remember that all of them are given in the simple
spirit of sharing ideas among peers. And don't forget it's all in good
fun!</p>
<div class="ii gt">
<ul><li><a href="http://cellperformance.beyond3d.com/articles/public/concurrency_rabit_hole.pdf" target="_blank">Intro to concurrency e.g. doubly-linked list is *not* a concurrent data structure.</a></li><li><a href="http://macton.smugmug.com/gallery/8611752_9SU2a/1/568079120_gzhk8#568079120_gzhk8" target="_blank">Problem #1: Increment Problem</a></li><li> <a href="http://macton.smugmug.com/gallery/8589754_JQx7x#566272111_EMEhx" target="_blank">Problem #2: Barbershop problem</a></li><li> <a href="http://tinyurl.com/n3empj" target="_blank">Problem #3: Hilzer's barbershop problem</a></li><li> <a href="http://tinyurl.com/lnnqsy" target="_blank">Problem #4: Insert/delete/search problem</a></li><li> <a href="http://tinyurl.com/kw49px" target="_blank">Problem #5: River crossing problem</a></li><li> <a href="http://tinyurl.com/lsuj3l" target="_blank">Problem #6: Unisex bathroom problem</a></li><li> <a href="http://tinyurl.com/l5zszx" target="_blank">Problem #7: Senate Bus</a></li><li><a href="http://tinyurl.com/mtn4hm"><span class="status-body"><span class="msgtxt en">A quick sketch on why qsort is not a concurrent algorithm</span></span></a></li><li><span class="status-body"><span class="msgtxt en"><a href="http://macton.smugmug.com/gallery/8936708_T6zQX#593426709_ZX4pZ">Typical C++ bullshit</a></span></span></li><li><span class="status-body"><span class="msgtxt en"><a href="http://macton.smugmug.com/gallery/8966729_SSA7c#595848343_NzpGQ">Linkers Suck! Part 1</a></span></span></li></ul></div></div>]]>
        
    </content>
</entry>

<entry>
    <title>Three Big Lies</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2008/03/three-big-lies.html" />
    <id>tag:cellperformance.beyond3d.com,2008:/articles//3.27</id>

    <published>2008-03-15T04:44:50Z</published>
    <updated>2009-08-08T04:57:48Z</updated>

    <summary><![CDATA[ This is a repost of a blog entry I wrote for the Insomniac R&amp;D site (Three Big Lies). It's representative of what I believe are some of the fundamental problems in the culture of software development in general, and...]]></summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[<div class="sticky-note">
This is a repost of a blog entry I wrote for the Insomniac R&amp;D site (<a href="http://www.insomniacgames.com/tech/articles/0308/three_big_lies.php">Three Big Lies</a>). It's representative of what I believe are some of the fundamental problems in the culture of software development in general, and games in particular. There are some fundamental truths that seem to be often forgotten. For example, that the point of any program is simply to transform data from one form into another and nothing else. When one "solution" ignores the real core problems of development, and others are built on top of it over time, we're left with systems that are over-designed, perform poorly and simply do not accomplish what they were intended to in the first place - and certainly not well. I continue to suggest that we all take a step back from what we're doing and the methods we're using to solve problems and try to remember what the real issues are.
</div>

<p>One of the things we talked about this year at GDC was what we called the "Three Big Lies of Software Development." How much programmers buy into these "lies" has a pretty profound effect on the design (and performance!) of an engine, or any high-performance embedded system for that matter.</p>]]>
        <![CDATA[<div class="subtitle">(Lie #1) Software is a platform</div>
I blame the universities for this one. Academics like to remove as many variables from a problem as possible and try to solve things under "ideal" or completely general conditions. It's like the old physicist jokes that go "We have made several simplifying assumptions... first, let each horse be a perfect rolling sphere..."<br /><br /><p>The reality is that software is not a platform. You can't idealize the hardware. And the constants that "Big-O notation" so often ignores are often the parts that actually matter in reality (for example, memory performance). You can't judge code in a vacuum. Hardware impacts data design. Data design impacts code choices. If you forget that, you have something that might work, but you won't know whether it will work well on the platform you're working with, with the data you actually have.</p>

<div class="subtitle">(Lie #2) Code should be designed around a model of the world</div>

<p>There is no value in code being some kind of model or map of an imaginary world. I don't know why this one is so compelling for some programmers, but it is extremely popular. If there's a rocket in the game, rest assured that there is a "Rocket" class (assuming the code is C++) which contains data for exactly one rocket and does rockety stuff. With no regard at all for what data transformation is really being done, or for the layout of the data. Or for that matter, without the basic understanding that where there's one thing, there's probably more than one.</p>

<p>Though there are a lot of performance penalties for this kind of design, the most significant one is that it doesn't scale. At all. One hundred rockets cost one hundred times as much as one rocket. And it's extremely likely they cost even more than that! Even to a non-programmer, that shouldn't make any sense. Economy of scale. If you have more of something, it should get cheaper, not more expensive. And the way to do that is to design the data properly and group things by similar transformations.</p>
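<p>As a rough illustration of grouping things by similar transformations, here is a minimal sketch in C. The <code>rocket_batch</code> layout, the field names and the simple integration step are all hypothetical and are not taken from any real engine; the point is only that every rocket is updated by one transformation in one linear pass over tightly packed arrays.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: one array per field, shared by all rockets. */
typedef struct {
    float *pos_y;   /* height of each rocket */
    float *vel_y;   /* vertical speed of each rocket */
    size_t count;   /* number of live rockets */
} rocket_batch;

/* Apply one transformation to every rocket in a single linear pass. */
static void rocket_batch_integrate(rocket_batch *batch, float gravity, float dt)
{
    assert(batch != NULL);
    for (size_t i = 0; i < batch->count; ++i) {
        batch->vel_y[i] += gravity * dt;
        batch->pos_y[i] += batch->vel_y[i] * dt;
    }
}
```

<p>Calling <code>rocket_batch_integrate</code> once per frame touches each array front to back, so the cost per rocket can go down as the batch grows instead of up.</p>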

<div class="subtitle">(Lie #3) Code is more important than data</div>

<p>This is the biggest lie of all. Programmers have spent untold billions of man-years writing about code, how to write it faster, better, prettier, etc. and at the end of the day, it's not that significant. Code is ephemeral and has no real intrinsic value. The algorithms certainly do, sure. But the code itself isn't worth all this time (and shelf space! - have you seen how many books there are on UML diagrams?). The code, the performance and the features hinge on one thing - the data. Bad data equals a slow and crappy application. Writing a good engine means, first and foremost, understanding the data. </p>]]>
    </content>
</entry>

<entry>
    <title>Utility: match</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2007/04/utility-match.html" />
    <id>tag:cellperformance.beyond3d.com,2007:/articles//3.5</id>

    <published>2007-04-08T05:51:54Z</published>
    <updated>2009-08-05T05:54:11Z</updated>

    <summary> Update! I&apos;ve fixed up all the greater-than and less-than symbols in this entry. It didn&apos;t make much sense before. I always forget to change those up in the HTML. I&apos;m just sharing a little utility I use all the...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[<div class="sticky-note">
<b>Update!</b> I've fixed up all the greater-than and less-than symbols
in this entry. It didn't make much sense before. I always forget to
change those up in the HTML.
</div>

I'm just sharing a little utility I use all the time called <b>match</b>. <br />
<br />

<pre class="code">Usage: ./match [-h] &lt;source_file&gt; &lt;uniq_file&gt;<br /><br />For each line in &lt;source_file&gt; print the index to the <br />first matching line in &lt;uniq_file&gt;.<br /><br />[-h] Print results in 32 bit hexidecimal (default is decimal)<br /><br />Note: The max line width supported is 4095 characters.<br />Note: Maximum number of lines supported is (2^32)<br /></pre>If I have a source file of data represented as text (as I often
do because it's often easier for me to read binary dumps in a text
editor than a special "hex editor"), I use match to create a table of
indices to unique lines (often these correspond to 128 bits since
that's the size of an SPU register).<br />
<br />
I commonly use it like so (given I have a file called "source_file")
<pre class="code">sort source_file | uniq &gt; uniq_file<br />match source_file uniq_file<br /></pre>

Now I have a handy table of indices! <br />
<br />
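The core of the lookup described in the usage text can be sketched in a few lines of C. This is not the code from match.c; it is a minimal sketch that assumes both files have already been read into memory as arrays of strings, and the <code>NO_MATCH</code> sentinel is a hypothetical choice for a line that has no match.<br />
<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NO_MATCH UINT32_MAX  /* hypothetical sentinel, not taken from match.c */

/* Return the index of the first line in uniq_lines equal to line. */
static uint32_t first_match_index(const char *line,
                                  const char *const *uniq_lines,
                                  uint32_t uniq_count)
{
    assert(line != NULL && uniq_lines != NULL);
    for (uint32_t i = 0; i < uniq_count; ++i) {
        if (strcmp(line, uniq_lines[i]) == 0) {
            return i;
        }
    }
    return NO_MATCH;
}
```

Running this for every line of the source file, in order, gives the table of indices. A linear scan is the simplest thing that works; since the output of uniq is sorted, a binary search would also be possible.<br />
<br />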
Download: <a href="http://cellperformance-snippets.googlecode.com/files/match.c">match.c</a> ]]>
        
    </content>
</entry>

<entry>
    <title>Handy PS3 Linux Framebuffer Utilities</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2007/03/handy-ps3-linux-framebuffer-utilities.html" />
    <id>tag:cellperformance.beyond3d.com,2007:/articles//3.24</id>

    <published>2007-03-31T07:02:25Z</published>
    <updated>2009-08-07T07:05:14Z</updated>

    <summary>While the documentation within Sony&apos;s vsync example should be enough to get you started with writing to the framebuffer, here are a couple of handy functions to test the framebuffer settings, open the virtual terminal and get access to the frame...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        While the documentation within Sony&apos;s vsync example should be enough to
get you started with writing to the framebuffer, here are a couple of
handy functions to test the framebuffer settings, open the virtual
terminal and get access to the frame buffer. 
        <![CDATA[Open the virtual terminal:<br />
<a href="http://cellperformance-snippets.googlecode.com/files/cp_vt.h">cp_vt.h</a><br />
<a href="http://cellperformance-snippets.googlecode.com/files/cp_vt.c">cp_vt.c</a><br />
<br />
Open the framebuffer:<br />
<a href="http://cellperformance-snippets.googlecode.com/files/cp_fb.h">cp_fb.h</a><br />
<a href="http://cellperformance-snippets.googlecode.com/files/cp_fb.c">cp_fb.c</a><br />
<br />
Dump framebuffer info:<br />
<a href="http://cellperformance-snippets.googlecode.com/files/fb_info.c">fb_info.c</a><br />
<br />

<div class="sticky-note">
Files should be compiled with:
<pre class="code">ppu-gcc -std=c99 -pedantic -W -Wall -O3<br /></pre>
</div>

        <div id="fb_info" class="subtitle">fb_info</div>

fb_info dumps the current settings for the framebuffer setup on the PS3.<br />
<br />
For example - for 480i the output should look something like this:
<pre class="code">FBIOGET_VBLANK:<br />  flags:<br />    FB_VBLANK_VBLANKING   : FALSE<br />    FB_VBLANK_HBLANKING   : FALSE<br />    FB_VBLANK_HAVE_VBLANK : FALSE<br />    FB_VBLANK_HAVE_HBLANK : FALSE<br />    FB_VBLANK_HAVE_COUNT  : FALSE<br />    FB_VBLANK_HAVE_VCOUNT : FALSE<br />    FB_VBLANK_HAVE_HCOUNT : FALSE<br />    FB_VBLANK_VSYNCING    : FALSE<br />    FB_VBLANK_HAVE_VSYNC  : TRUE<br />  count  : 0<br />  vcount : 1<br />  hcount : 0<br />-------------------------------------<br />FBIOGET_FSCREENINFO:<br />  id          : "PS3 FB"<br />  smem_start  : 0x00000000<br />  smem_len    : 18874368<br />  type        : FB_TYPE_PACKED_PIXELS (0)<br />  type_aux    : N/A<br />  visual      : FB_VISUAL_TRUECOLOR (2)<br />  xpanstep    : 1<br />  ypanstep    : 1<br />  ywrapstep   : 1<br />  line_length : 2880<br />  mmio_start  : 0x00000000<br />  mmio_len    : 0<br />  accel       : FB_ACCEL_NONE (0)<br />-------------------------------------<br />PS3FB_IOCTL_SCREENINFO:<br />    xres        : 720<br />    yres        : 480<br />    xoff        : 72<br />    yoff        : 48<br />    num_frames  : 2<br />-------------------------------------<br /></pre>

<div id="fb_use" class="subtitle">Using cp_vt and cp_fb</div>
These functions are very simple to use. The user running them should
have read/write access to the framebuffer (/dev/fb0) and the main
console (/dev/console).
<pre class="code">{<br />    cp_vt vt;<br />    cp_fb fb;<br /><br />    cp_vt_open_graphics(&amp;vt);<br />    cp_fb_open(&amp;fb);<br /><br />    uint32_t frame_ndx = 0;<br /><br />    while (1)<br />    {<br />        uint32_t* const restrict frame_top = (uint32_t*)fb.draw_addr[ frame_ndx ];<br /><br />        // Write pixel to the frame buffer ...<br />        // x and y are image position<br />        // rgb24 is 32bit pixel value (where top 8 bits are unused)<br /><br />        frame_top[ ( y * fb.stride ) + x ] = rgb24;<br /><br />        // At the vsync, the previous frame is finished sending to the CRT<br />        cp_fb_wait_vsync( &amp;fb );<br /><br />        // Send the frame just drawn to the CRT by the next vblank<br />        cp_fb_flip( &amp;fb, frame_ndx );<br /><br />        frame_ndx  = frame_ndx ^ 0x01;<br />    }<br /><br />    cp_vt_close(&amp;vt);<br />    cp_fb_close(&amp;fb);<br />}<br /></pre>
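The pixel write in the loop above can be broken out into small helpers. This is only a sketch under stated assumptions: the 0x00RRGGBB channel order is assumed rather than taken from the PS3 documentation, and <code>pack_rgb24</code>, <code>pixel_index</code> and <code>put_pixel</code> are hypothetical names that are not part of cp_fb. The stride is treated as a count of 32-bit pixels per row, matching the way <code>fb.stride</code> is used to index a <code>uint32_t</code> pointer above.<br />
<br />

```c
#include <assert.h>
#include <stdint.h>

/* Pack 8-bit channels as 0x00RRGGBB. The channel order is an assumption. */
static uint32_t pack_rgb24(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Index of pixel (x, y) in a frame whose rows are stride_px pixels apart. */
static uint32_t pixel_index(uint32_t x, uint32_t y, uint32_t stride_px)
{
    assert(x < stride_px);
    return (y * stride_px) + x;
}

/* Write one pixel into the frame that starts at frame_top. */
static void put_pixel(uint32_t *frame_top, uint32_t stride_px,
                      uint32_t x, uint32_t y, uint32_t rgb24)
{
    frame_top[pixel_index(x, y, stride_px)] = rgb24;
}
```

With the 480i settings shown by fb_info (line_length of 2880 bytes, 4 bytes per pixel), a row would be 720 pixels apart, so pixel (3, 2) would land at index 1443.<br />
<br />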

A more complete example: <a href="http://cellperformance-snippets.googlecode.com/files/fb_test.c">fb_test.c</a>]]>
    </content>
</entry>

<entry>
    <title>HowTo: Huge TLB pages on PS3 Linux</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2007/01/howto-huge-tlb-pages-on-ps3-linux.html" />
    <id>tag:cellperformance.beyond3d.com,2007:/articles//3.25</id>

    <published>2007-01-30T08:23:08Z</published>
    <updated>2009-08-07T07:26:35Z</updated>

    <summary> Updated! (22 Mar 07) Minor edits. Added notes for YellowDog Linux. Added source code for using huge page allocation. Updated! (30 Mar 07) A couple minor fixes. Thanks to Guénaël Renault for pointing them out! Updated! (15 July 07)...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[<div class="sticky-note">
<b>Updated! (22 Mar 07) Minor edits. Added notes for YellowDog Linux. Added source code for using huge page allocation.</b> <br />
<b>Updated! (30 Mar 07) A couple minor fixes. Thanks to Guénaël Renault for pointing them out!</b><br />
<b>Updated! (15 July 07) Added notes for kernel 2.6.21</b>
</div>

<div class="sticky-note">
Guest article: Understanding the TLB and minimizing misses is a
critical part of high performance Cell programming. Unfortunately some
PS3 kernels do not come with huge page support enabled. Jakub Kurzak
and Alfredo Buttari step through the details of recompiling the kernel
for huge page support.
</div>The availability of huge TLB pages depends on the way the Linux
kernel has been configured prior to compilation. The default kernel
that ships with Fedora Core 5 (and most likely with any other distribution
that has binary kernel packages) doesn't include this option. So, in
order to have huge TLB pages, it is necessary to reconfigure the
kernel, recompile it, and tell the boot loader about the newly created
kernel image. Finally we will also show a way to allocate the TLB pages
automatically at boot time.<br />
<br />

<div class="sticky-note">
[Mike Acton] This process also works with YellowDog Linux virtually unchanged.
</div> ]]>
        <![CDATA[<div class="subtitle">Rebuilding the PS3 Linux Kernel</div>

<div class="sticky-note">
[Mike Acton] For more detailed information on the Linux Kernel and the build process, see: 
<ul><li><a href="http://www.faqs.org/docs/Linux-HOWTO/Kernel-HOWTO.html">The Linux Kernel HOWTO [faqs.org]</a></li><li><a href="http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/ps3-linux-docs-08.06.09/">PS3 Linux Distributor's Starter Kit [kernel.org]</a></li><li>Also see: <a href="http://julipedia.blogspot.com/2007/03/building-updated-kernel-for-ps3.html">Building an Updated Kernel for PS3 [julipedia.blogspot.com]</a></li><li>Also see: <a href="http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-nfs-root-howto.txt">PS3 NFS Root File System HOWTO</a> by Geoff Levand (PS3 kernel maintainer)</li></ul>
</div>

<div class="sticky-note">
[Mike Acton] For more information on using huge TLB pages, especially from user space, read <a href="http://www.gelato.unsw.edu.au/lxr/source/Documentation/vm/hugetlbpage.txt?v=2.6.16;a=ppc">hugetlbpage.txt</a>, which is found in the kernel source under Documentation/vm/
</div>

Here are the steps:<br />
<br />

<ol><li>Recompile the kernel in order to have huge TLB pages
<ol><li> Take the kernel source from the add-on cd (filename is linux-20061110.tar.bz2)
<div class="sticky-note">[Mike Acton] Download the <a href="http://dl.qj.net/PS3-Linux-Addon-Disc-Source-PlayStation-3/pg/12/fid/11310/catid/514">PS3 Source Add-On CD [qj.net]</a>.</div>

<div class="sticky-note">[Mike Acton] A more recent (2.6.21 as of this
update) kernel and sources can be found in the more recent Add-on disc
package (CELL-Linux-CL_20070516-ADDON), which is available from various
Linux mirrors: <br />
<ul><li><a href="http://ftp.uk.linux.org/pub/linux/Sony-PS3/">http://ftp.uk.linux.org/pub/linux/Sony-PS3/</a></li><li><a href="http://www.kernel.org/pub/linux/kernel/people/geoff/cell/">http://www.kernel.org/pub/linux/kernel/people/geoff/cell/</a></li><li><a href="http://ftp.riken.go.jp/pub/Linux/kernel/people/geoff/cell/">http://ftp.riken.go.jp/pub/Linux/kernel/people/geoff/cell/</a></li></ul>
</div>

</li><li> unpack it in the /usr/src directory

</li><li> make a link:
<pre class="code">	$ ln -s /usr/src/linux-20061110 /usr/src/linux<br /></pre>
<div class="sticky-note">
[Mike Acton] <b>For Linux 2.6.21:</b><br />
<pre class="code">	$ ln -s /usr/src/linux-2.6.21-20070425 /usr/src/linux<br /></pre>
</div>
</li><li> prepare for kernel configuration:
<div class="sticky-note">
[Mike Acton] <b>For Linux 2.6.21:</b><br />
To build a more recent kernel you will need to install a few things first:

<ol><li><a href="http://www.methods.co.nz/asciidoc/index.html">AsciiDoc</a>. Download: <a href="http://www.methods.co.nz/asciidoc/asciidoc-8.2.1.tar.gz">asciidoc-8.2.1.tar.gz [methods.co.nz]</a><pre class="code">$ cd /usr/src<br />$ tar xzvf asciidoc-8.2.1.tar.gz<br />$ cd asciidoc-8.2.1<br />$ ./install.sh<br /></pre></li><li><a href="http://cyberelk.net/tim/software/xmlto/">xmlto</a>. Download: <a href="http://cyberelk.net/tim/data/xmlto/stable/xmlto-0.0.18.tar.bz2">xmlto-0.0.18.tar.bz2 [cyberelk.net]</a><pre class="code">$ cd /usr/src<br />$ tar xjvf xmlto-0.0.18.tar.bz2<br />$ cd xmlto-0.0.18<br />$ ./configure<br />$ make<br />$ make install <br /></pre></li><li><a href="http://git.or.cz/">git</a>, a revision control system. Download: <a href="http://www.kernel.org/pub/software/scm/git/git-1.5.2.tar.gz">git 1.5.2 [kernel.org]</a>
<pre class="code">$ cd /usr/src<br />$ tar xzvf git-1.5.2.tar.gz<br />$ cd git-1.5.2<br />$ make prefix=/usr all doc<br />$ make prefix=/usr install install-doc <br /></pre>
</li><li><a href="http://dtc.ozlabs.org/">dtc</a> (Device Tree Compiler)
NOTE: To build the kernel, you need a version newer than the
dtc-20060419.tar.gz version available on the dtc web page.
<pre class="code">$ cd /usr/src<br />$ git clone git://www.jdl.com/software/dtc.git <br />$ cd dtc<br />$ make<br />$ make install<br /><br /></pre></li></ol>

</div>
</li><li><div class="sticky-note">[Mike Acton] mrproper should be done before make to clean any older build data, if you have them.</div>
<pre class="code">$ make mrproper<br /></pre>
</li><li> copy the kernel config file that comes with the fedora installation into /usr/src/linux
<pre class="code">$ cp /boot/config-2.6.16 /usr/src/linux/.config<br /></pre>

<div class="sticky-note">
[Mike Acton] On YellowDog Linux, this file is /boot/config-2.6.16-20061110.ydl.1ps3
</div>
<div class="sticky-note">[Mike Acton] <b>For Linux 2.6.21:</b><br />The
config file has been updated significantly since the original 2.6.16
release. It's much easier to start with the file included in the kernel
distribution. <pre class="code">$ cd /usr/src/linux<br />$ cp arch/powerpc/configs/ps3_defconfig .config<br /></pre>
</div>
</li><li>This next step goes through the old configuration file and prompts the user whenever 
     a new kernel option that is not present in the old kernel is encountered (none in this case
     since the old and the new kernels are exactly the same version)
<pre class="code">$ make oldconfig<br /></pre>
<div class="sticky-note">[Mike Acton] <b>For Linux 2.6.21:</b> There's no need for this step if you copied the file from the kernel distribution itself.
</div>
</li><li> enable huge TLB pages in the kernel configuration
<pre class="code">$ make menuconfig<br /></pre>
     Now go to File systems --&gt; Pseudo filesystems and enable huge TLB pages by pressing
     the space bar on the "HugeTLB file system support" option. Now select "exit" repeatedly and
     answer "yes" when asked to save the new kernel configuration
</li><li> compile kernel and modules and install modules (it will take around 20 minutes):
<pre class="code">$ make all<br />$ make modules_install<br /></pre>
</li></ol>

</li><li> install the new kernel:
<div class="sticky-note">
[Mike Acton] <b>For Linux 2.6.21:</b> Replace references to 2.6.16 with 2.6.21 in this and the following steps.
</div>
<pre class="code">$ cp /usr/src/linux/vmlinux /boot/vmlinux-2.6.16_HTLB<br /></pre>
</li><li> create a ramdisk image for the new kernel:

<pre class="code">$ mkinitrd /boot/initrd-2.6.16_HTLB.img 2.6.16<br /></pre>

<div class="sticky-note">
[Mike Acton] On Yellowdog Linux, mkinitrd lives in /sbin.
</div>

<div class="sticky-note">
[Mike Acton] <b>For Linux 2.6.21:</b><br />
<i>"When I do mkinitrd, it says: No modules available for kernel "2.6.21". What's up?"</i><br />
<br />
The problem is that this version of the kernel isn't installed as
"2.6.21", it's installed as "2.6.21-rc7". You can discover that by
looking in /lib/modules:
<pre>$ ls -l /lib/modules<br />total 16<br />drwxr-xr-x 3 root root 4096 Mar 22 05:57 2.6.16<br />drwxr-xr-x 5 root root 4096 Jan 19 06:06 2.6.16-20061110.ydl.1ps3<br />drwxr-xr-x 3 root root 4096 Jul 15 08:24 2.6.20<br />drwxr-xr-x 3 root root 4096 Jul 17 06:22 2.6.21-rc7<br /></pre>
So the actual command you need to run is:
<pre class="code">$ mkinitrd /boot/initrd-2.6.21_HTLB.img 2.6.21-rc7<br /></pre>
</div>

</li><li> tell the bootloader (kboot) where the new kernel is:
<pre class="code">$ vim /etc/kboot.conf<br /></pre>
     add the following line
<pre class="code">linux_htlb='/boot/vmlinux-2.6.16_HTLB initrd=/boot/initrd-2.6.16_HTLB.img'<br /><div class="sticky-note">[Mike Acton] For YellowDog Linux, use:<br />ydl_htlb      ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \<br />root=/dev/sda2 init=/sbin/init video=ps3fb:mode:3 rhgb'<br />ydl480i_htlb  ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \<br />root=/dev/sda2 init=/sbin/init video=ps3fb:mode:1 rhgb'<br />ydl1080i_htlb ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \<br />root=/dev/sda2 init=/sbin/init video=ps3fb:mode:4 rhgb'<br />ydltext_htlb  ='/dev/sda1:/vmlinux-2.6.16_HTLB initrd=/dev/sda1:/initrd-2.6.16_HTLB.img \<br />root=/dev/sda2 init=/sbin/init 3'</div><br /></pre>
     if you want this kernel to be loaded by default then change the "default" line into
<pre class="code">default=linux_htlb<br /><div class="sticky-note">[Mike Acton] For YellowDog Linux, use one of the modes above.</div><br /></pre>
</li><li> instruct the boot process in order to allocate huge TLB pages. (Pick one of the following two options)
<ol><li> OPTION 1:
<pre class="code">$ vim /etc/rc.local<br /></pre>
     add the following lines:
<pre class="code">mkdir -p /huge<br />echo 20 &gt; /proc/sys/vm/nr_hugepages<br />mount -t hugetlbfs nodev /huge<br />chown root:root /huge<br />chmod 755 /huge<br /></pre>
     be sure to change the "chown" line according to your system settings.
</li><li> OPTION 2: create a /etc/init.d/htlb script with the following content:<br />
<div class="quote"><i>All the commands added to the rc.local file in the previous step are executed at the end of the boot sequence.
This means that the huge TLB page allocation is performed when a lot of the system memory has
already been allocated by other processes. This results in the allocation of only 6 or 7 pages. In order to obtain
a few more pages (8 or 9) we have to move the huge TLB page allocation earlier in the boot sequence (i.e. at
runlevel-1)</i>
</div>
<br />

<div class="sticky-note">
[Mike Acton] chkconfig required some additional settings not in the previous version of this script. Modified version is here:
<pre class="code">	#!/bin/sh<br />	#<br />	# htlb:	Start/stop huge TLB pages allocation<br />	#<br />        # [Mike Acton] The runlevel and priority settings for chkconfig are stolen straight out of cpuspeed.<br />        <br />        # chkconfig: 12345 06 99<br />        # description: Start/stop huge TLB pages allocation<br /><br />	. /etc/rc.d/init.d/functions<br /><br />	start()<br />	{<br />	    mkdir -p /huge<br />	    echo 20 &gt; /proc/sys/vm/nr_hugepages<br />	    mount -t hugetlbfs nodev /huge<br />	    chown root:root /huge<br />	    chmod 775 /huge<br />        }<br /><br />	stop()<br />	{<br />	    echo 0 &gt; /proc/sys/vm/nr_hugepages<br />	}<br />	<br />	case "$1" in<br />	  start)<br />		start<br />		;;<br />	  stop)<br />		stop<br />		;;<br />	  restart|reload)<br />	        stop<br />	        start<br />	        ;;<br />	  *)<br />	        echo $"Usage: $0 {start|stop|status|restart|reload}"<br />	        exit 1<br />		;;<br />	esac<br />	<br />	exit 0<br /></pre>
</div>
Make the new service executable:
<pre class="code">$ chmod a+x /etc/init.d/htlb<br /></pre>
Add the service to runlevel-1:
<pre class="code">$ /sbin/chkconfig --add htlb<br /></pre>
</li></ol>
</li><li> reboot. During the boot process, when presented with the "kboot:" prompt, you'll be able to choose your kernel using the "tab" key.
</li></ol>

<div class="sticky-note">
[Mike Acton] Validate that huge pages are now installed and working by running:
<pre class="code">$ cat /proc/meminfo | grep Huge<br /></pre>

You should see something like:

<pre class="code">HugePages_Total:     8<br />HugePages_Free:      8<br />Hugepagesize:    16384 kB<br /></pre>

and...
<pre class="code">$ cat /proc/filesystems  | grep huge<br /></pre>

You should see something like:

<pre class="code">nodev   hugetlbfs<br /></pre>
</div>


<div class="sticky-note">
[Mike Acton] Here are some helper functions for allocating and freeing huge memory:<br />
<br />
<a href="http://cellperformance-snippets.googlecode.com/files/cp_hugemem.c">cp_hugemem.c</a><br />
<a href="http://cellperformance-snippets.googlecode.com/files/cp_hugemem.h">cp_hugemem.h</a><br />
<br />
They are very simple to use:
<pre class="code">{<br />    // Allocate...<br />    const size_t  hmem_size = 128 * 1024 * 1024;<br />    cp_hugemem    hmem;<br /><br />    int was_hugemem_allocated = cp_hugemem_alloc( &amp;hmem, hmem_size );<br />    if ( !was_hugemem_allocated )<br />    {<br />        fprintf(stderr,"Error: Could not allocate hugemem\n");<br />        return (-1);<br />    }<br /><br />    // Use the memory...<br />    char* ptr = (char*)hmem.addr;<br /><br />    // Free...<br />    cp_hugemem_free( &amp;hmem );<br />}<br /></pre>
</div>
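Huge page mappings are made in whole pages, so it can be useful to work out how many pages a request will consume before choosing a value for nr_hugepages. A minimal sketch follows; <span class="monospace-strong">hugepages_needed</span> is a hypothetical helper, and the 16384 kB default is simply the page size from the sample /proc/meminfo output above, which may differ on other kernels.

```shell
# Hypothetical helper: number of huge pages needed to back a request.
# $1 = request size in bytes, $2 = huge page size in kB (assumed default: 16384).
hugepages_needed() {
    bytes="$1"
    page_kb="${2:-16384}"
    page_bytes=$((page_kb * 1024))
    # Round up to a whole number of pages.
    echo $(( (bytes + page_bytes - 1) / page_bytes ))
}
```

For the 128 MB request in the example above, <span class="monospace-strong">hugepages_needed $((128 * 1024 * 1024))</span> prints 8 when the page size is 16384 kB.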

<div class="subtitle">About the Authors</div>

<b><a href="http://www.cs.utk.edu/%7Ekurzak/">Jakub Kurzak</a> AKA Koobas</b>
is a researcher at the University of Tennessee, Knoxville, and a member
of the Innovative Computing Lab (ICL - http://icl.cs.utk.edu/), where
he mostly does things related to programming multi-core processors and the
Cell processor. Before that he was a student at the University of Houston,
where he dealt with programming distributed memory machines using
message passing (MPI). Jakub's interests are in parallel programming
techniques (message passing, multi-threading), parallel
number crunching algorithms, and performance optimization.<br />
<br />

<b><a href="http://www.cs.utk.edu/%7Ebuttari/">Alfredo Buttari</a></b>
is a research associate at the Computer Science dept. of the University
of Tennessee Knoxville. Alfredo is a member of the Innovative Computing
Laboratory which deals with many aspects of High Performance Computing.
His interests are in developing high performance software for Linear
Algebra which is mostly achieved through parallel programming
techniques of all sorts (MPI, OpenMP, threads...), including the more
exotic approaches like the Cell programming model. Before coming to Tennessee,
Alfredo earned a PhD and a Master's degree in Computer Science from the "Tor
Vergata" University of Rome (Italy).]]>
    </content>
</entry>

<entry>
    <title>Cross-compiling for PS3 Linux</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2006/11/cross-compiling-for-ps3-linux-1.html" />
    <id>tag:cellperformance.beyond3d.com,2006:/articles//3.20</id>

    <published>2006-11-30T05:22:09Z</published>
    <updated>2009-08-06T04:28:51Z</updated>

    <summary>Now that the PS3 is out and multiple Linux-based distributions are available which can be installed using Open Platform [playstation.com] it&apos;s time to start developing on some publicly available hardware! Although the PPU and SPU compilers can be installed and...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[Now that the PS3 is out and multiple Linux-based distributions are available which can be installed using <a href="http://www.playstation.com/ps3-openplatform/index.html">Open Platform [playstation.com]</a> it's time to start developing on some publicly available hardware!<br />
<br />
Although the PPU and SPU compilers can be installed and used on the PS3
directly, I find it much more familiar and convenient to cross-compile
from my desktop and just ship the resulting executables over to the
target (PS3). <br />
<br />
In this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

]]>
        <![CDATA[<div class="subtitle" id="install_linux" s=""> Install Linux </div> 
I have successfully compiled and run using both <a href="http://www.terrasoftsolutions.com/products/ydl/">Yellow Dog Linux [terrasoftsolutions.com]</a> and <a href="http://fedora.redhat.com/">Fedora Core [redhat.com]</a>.<br /> <br />
This article assumes that Linux is already installed on the PS3. It's
very easy to install and the process is already documented quite well.<br /> <br /> 

Carl Bender over at <a href="http://www.ps3pc.net/">PS3PC.net</a> has written a very good guide on <a href="http://linuxps3.net/index.php?option=com_content&amp;task=view&amp;id=33&amp;Itemid=32">Installing Fedora 5 Linux on Your PS3 [linuxps3.net]</a><br /><br />


See also: <a href="http://www.terrasoftsolutions.com/support/installation/">Installation Guide for Yellow Dog Linux [terrasoftsolutions.com]</a><br /> 
See also: <a href="https://docs.google.com/Doc?docid=0AQvxOJpR_xIAZGZxcXpqajhfNzcxOTc4anJjY3c&amp;hl=en">Installation Guide for Fedora Core 5</a><br /> 
See also: <a href="http://www.pslinux.org/index.php?title=Main_Page">Linux on the Playstation 3 Wiki [pslinux.org]</a><br />
See also: <a href="http://www.daniel.jp/joomla/info/ps3/installing-gentoo-on-the-ps3.html">Installing Gentoo on the PS3 [daniel.jp]</a><br />
<br />

<div class="sticky-note"> 
<b>
NOTE: For the sake of this article, Yellow Dog Linux 5 (32 bit version
for PS3) will be assumed. A 32 bit host PowerPC Fedora Core 5
installation will also be assumed (Although 64 bit and x64 versions of
the libraries are available for other types of hosts.)
</b>
</div>

<div class="sticky-note"> 
<span class="monospace-strong">cat /proc/cpuinfo</span> (For the Target PS3)<br /> 
<pre class="code">processor : 0<br />cpu : Cell Broadband Engine, altivec supported<br />clock : 3192.000000MHz<br />revision : 5.1 (pvr 0070 0501)<br /><br />processor : 1<br />cpu : Cell Broadband Engine, altivec supported<br />clock : 3192.000000MHz<br />revision : 5.1 (pvr 0070 0501)<br /><br />timebase : 79800000<br />machine : PS3PF<br /></pre>
<br />

<span class="monospace-strong">cat /proc/interrupts</span> (For the Target PS3)<br /> 
<pre class="code"> CPU0 CPU1<br /> 10: 19437 0 PS3PF irq controller Edge ehci_hcd:usb1<br /> 11: 20767742 0 PS3PF irq controller Edge ehci_hcd:usb2<br /> 16: 0 0 PS3PF irq controller Edge ohci_hcd:usb3<br /> 17: 0 0 PS3PF irq controller Edge ohci_hcd:usb4<br />128: 0 574866 PS3PF irq controller Edge IPI0 (call function)<br />129: 0 3024105 PS3PF irq controller Edge IPI1 (reschedule)<br />130: 0 0 PS3PF irq controller Edge IPI2 (unused)<br />131: 0 0 PS3PF irq controller Edge IPI3 (debugger break)<br />132: 555759 0 PS3PF irq controller Edge IPI0 (call function)<br />133: 2998857 0 PS3PF irq controller Edge IPI1 (reschedule)<br />134: 0 0 PS3PF irq controller Edge IPI2 (unused)<br />135: 0 0 PS3PF irq controller Edge IPI3 (debugger break)<br />136: 0 0 PS3PF irq controller Edge Virtual UART<br />137: 0 0 PS3PF irq controller Edge spe00.0<br />138: 1 0 PS3PF irq controller Edge spe00.1<br />139: 7 0 PS3PF irq controller Edge spe00.2<br />140: 0 0 PS3PF irq controller Edge spe01.0<br />141: 2 0 PS3PF irq controller Edge spe01.1<br />142: 6 0 PS3PF irq controller Edge spe01.2<br />143: 0 0 PS3PF irq controller Edge spe02.0<br />144: 2 0 PS3PF irq controller Edge spe02.1<br />145: 6 0 PS3PF irq controller Edge spe02.2<br />146: 0 0 PS3PF irq controller Edge spe03.0<br />147: 2 0 PS3PF irq controller Edge spe03.1<br />148: 13 0 PS3PF irq controller Edge spe03.2<br />149: 0 0 PS3PF irq controller Edge spe04.0<br />150: 2 0 PS3PF irq controller Edge spe04.1<br />151: 13 0 PS3PF irq controller Edge spe04.2<br />152: 0 0 PS3PF irq controller Edge spe05.0<br />153: 1 0 PS3PF irq controller Edge spe05.1<br />154: 9 0 PS3PF irq controller Edge spe05.2<br />155: 27210328 0 PS3PF irq controller Edge ps3fb vsync<br />156: 1809885 0 PS3PF irq controller Edge PS3PF stor<br />157: 387328 0 PS3PF irq controller Edge PS3PF stor<br />158: 65 0 PS3PF irq controller Edge PS3PF stor<br />159: 1509 0 PS3PF irq controller Edge snd_ps3pf<br />160: 0 78885 PS3PF irq controller Edge gbec 
connection<br />BAD: 0<br /></pre> 
</div> 

<div class="subtitle" id="install_libspe2"> Install elfspe2 and libspe2 on PS3 </div> 
<b>elfspe2</b> allows SPU executables to be run standalone from the commandline (aka spulets)<br /> 
<b>libspe2</b> is a PPU library for launching and communicating with SPU executables.<br /> 
<br /> 

1. Copy the following files to the PS3. These files can be found on the <a href="http://dl.qj.net/PS3-Linux-Addon-Disc-PlayStation-3/pg/12/fid/11308/catid/514">PS3 Linux Add-On Packages CD</a> in the <b>spu</b> directory.<br /> 
<ul><li> libspe2-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm </li><li> elfspe2-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm </li></ul> 

2. As root, <span class="monospace-strong">rpm -ivh *.rpm</span><br /> 
<br /> 

<div class="subtitle" id="install_toolchain"> Install toolchain on host PC </div>
I am using Fedora Core 5 installed on a PowerPC Mac Mini as my host
machine for PS3 development. Working from a PowerPC platform is
extremely convenient. However, all of the following libraries are also
either available as i686 packages or can be recompiled for the i686
platform if you prefer that.<br /> 
<br /> 

<div class="sticky-note"> 
<span class="monospace-strong">cat /proc/cpuinfo</span> (For the Host PC)<br /> 
<pre class="code">processor : 0<br />cpu : 7447A, altivec supported<br />clock : 1249.999995MHz<br />revision : 0.2 (pvr 8003 0102)<br />bogomips : 83.20<br />timebase : 41620997<br />machine : PowerMac10,1<br />motherboard : PowerMac10,1 MacRISC3 Power Macintosh<br />detected as : 287 (Mac mini)<br />pmac flags : 00000010<br />L2 cache : 512K unified<br />pmac-generation : NewWorld<br /></pre> 
</div> 

1. Copy the following files to the host PC. These files can be found at <a href="http://www.bsc.es/projects/deepcomputing/linuxoncell/">Barcelona Supercomputing Center, Linux on Cell [bsc.es]</a> under <span class="monospace-strong">Programming Models -&gt; Linux on Cell -&gt; Cell BE Components -&gt; GNU Toolchain</span>. 
<ul><li> ppu-binutils-3.2-4.ppc.rpm </li><li> ppu-gcc-3.2-4.ppc.rpm </li><li> ppu-gcc-c++-3.2-4.ppc.rpm </li><li> ppu-toolchain-3.2-4.src.rpm </li><li> ppu-toolchain-debuginfo-3.2-4.ppc.rpm </li><li> spu-binutils-3.2-6.ppc.rpm </li><li> spu-gcc-3.2-6.ppc.rpm </li><li> spu-gcc-c++-3.2-6.ppc.rpm </li><li> spu-newlib-1.14.0.200610300000-1.ps3pf.ppc.rpm </li><li> spu-toolchain-3.2-6.src.rpm </li><li> spu-toolchain-debuginfo-3.2-6.ppc.rpm </li></ul>
<br />

2. As root, <span class="monospace-strong">rpm -ivh *.rpm</span><br /> 
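After the install, a quick sanity check is to confirm that the cross-compilers are on the PATH. The following is a minimal sketch; <span class="monospace-strong">missing_tools</span> is a hypothetical helper name, and it relies only on the standard <span class="monospace-strong">command -v</span> builtin.

```shell
# Hypothetical helper: report which of the named tools are not on the PATH.
# Prints one "missing: NAME" line per absent tool and returns non-zero if
# any tool is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>/dev/null; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return $rc
}
```

Running <span class="monospace-strong">missing_tools ppu-gcc spu-gcc</span> should print nothing and return success once both compilers are installed.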

<div class="subtitle" id="install_libspe2_host"> Install libspe2 on host PC </div>

1. Copy the following files to the host PC. These files can be found on the <a href="http://dl.qj.net/PS3-Linux-Addon-Disc-PlayStation-3/pg/12/fid/11308/catid/514">PS3 Linux Add-On Packages CD</a> in the <b>spu</b> directory.<br /> 
<ul><li> libspe2-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm </li><li> libspe2-devel-2.0.0-be0644.3.20061107.1.ps3pf.ppc.rpm </li></ul> 

2. As root, <span class="monospace-strong">rpm -ivh *.rpm</span><br /> 
<br /> 
<div class="subtitle" id="build_hello_libspe2"> Building Hello World (for libspe2)</div> 

1. On the host PC, compile the example:<br /> 
<br /> 
<span class="monospace-strong">ppu-gcc -m32 ppu_hello.c -lspe2 -o ppu_hello</span><br /> 
<span class="monospace-strong">spu-gcc spu_hello.c -o spu_hello</span><br /> 
<br /> 
NOTE: If the 64 bit support headers and libraries are installed on the host the <span class="monospace-strong">-m32</span> can be omitted from the PPU compilation step.<br /> 
<br /> 

2. Copy the two executables to the PS3.<br /> 
3. To execute <span class="monospace-strong">spu_hello</span> using libspe2, just run <span class="monospace-strong">./ppu_hello</span><br /> 
4. To execute <span class="monospace-strong">spu_hello</span> using elfspe2, just run <span class="monospace-strong">./spu_hello</span> directly.<br /> 
<br /> 
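The steps above can be wrapped in a small script. This is a sketch under stated assumptions rather than part of the original workflow: <span class="monospace-strong">run</span> and <span class="monospace-strong">build_and_deploy</span> are hypothetical helper names, the ssh destination is a placeholder, and scp is only one possible way to copy the files to the PS3.

```shell
# run: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_and_deploy TARGET [DEST_DIR]
# TARGET is a placeholder ssh destination such as user@ps3.
# DEST_DIR is the directory on the PS3 (defaults to the login directory).
build_and_deploy() {
    target="$1"
    dest="${2:-.}"
    # Same compile commands as in step 1.
    run ppu-gcc -m32 ppu_hello.c -lspe2 -o ppu_hello || return 1
    run spu-gcc spu_hello.c -o spu_hello || return 1
    # Copy both executables to the same directory on the PS3.
    run scp ppu_hello spu_hello "${target}:${dest}/" || return 1
}
```

Calling <span class="monospace-strong">DRY_RUN=1 build_and_deploy user@ps3 /tmp</span> prints the three commands without running them. Note that ppu_hello opens spu_hello by its relative name, so both files need to sit in the directory that ppu_hello is run from.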

<div class="subtitle" id="hello_source_libspe2"> Hello World source (for libspe2)</div> 

<a href="http://cellperformance-snippets.googlecode.com/files/ppu_hello.c">ppu_hello.c</a> 
<pre class="code"><span class="line-number">  0</span>#include &lt;stdlib.h&gt;<br /><span class="line-number">  1</span>#include &lt;libspe2.h&gt;<br /><span class="line-number">  2</span><br /><span class="line-number">  3</span>int<br /><span class="line-number">  4</span>main()<br /><span class="line-number">  5</span>{<br /><span class="line-number">  6</span>  unsigned int          createflags = 0;<br /><span class="line-number">  7</span>  unsigned int          runflags    = 0;<br /><span class="line-number">  8</span>  unsigned int          entry       = SPE_DEFAULT_ENTRY;<br /><span class="line-number">  9</span>  void*                 argp        = NULL;<br /><span class="line-number"> 10</span>  void*                 envp        = NULL;<br /><span class="line-number"> 11</span><br /><span class="line-number"> 12</span>  spe_program_handle_t* program     = spe_image_open("spu_hello");<br /><span class="line-number"> 13</span>  spe_context_ptr_t     spe         = spe_context_create(createflags, NULL);<br /><span class="line-number"> 14</span>  spe_stop_info_t       stop_info;<br /><span class="line-number"> 15</span><br /><span class="line-number"> 16</span>  spe_program_load(spe, program);<br /><span class="line-number"> 17</span>  spe_context_run(spe, &amp;entry, runflags, argp, envp, &amp;stop_info);<br /><span class="line-number"> 18</span>  spe_image_close(program);<br /><span class="line-number"> 19</span>  spe_context_destroy(spe);<br /><span class="line-number"> 20</span><br /><span class="line-number"> 21</span>  return (0);<br /><span class="line-number"> 22</span>}<br /></pre>


<a href="http://cellperformance-snippets.googlecode.com/files/spu_hello.c">spu_hello.c</a> 
<pre class="code"><span class="line-number"> 0</span>#include &lt;stdio.h&gt;<br /><span class="line-number"> 1</span> <br /><span class="line-number"> 2</span>int<br /><span class="line-number"> 3</span>main( unsigned long spuid )<br /><span class="line-number"> 4</span>{<br /><span class="line-number"> 5</span> printf("Hello, World! (From SPU:%d)\n",spuid);<br /><span class="line-number"> 6</span> return (0);<br /><span class="line-number"> 7</span>}<br /></pre> 

<div class="subtitle" id="using_ibm_sdk"> Using the IBM SDK </div> 

The IBM SDK uses libspe, not libspe2, so libspe must be installed in order to build the IBM libraries and samples.<br /> 
<br /> 

<div class="sticky-note">
<b>What is the difference between libspe and libspe2? Will both continue to be used?</b><br />
<br />
libspe2 is a re-design of libspe. The folks at IBM have strongly
implied that libspe is on its way out and we should expect a future
revision of the SDK to be refactored for libspe2.<br />
<br />
<b>Roland (RSei)</b> gave an excellent description of reasoning behind the design of libspe2 in <a href="http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?message=13896030&amp;cat=46&amp;thread=144504&amp;treeDisplayType=threadmode1&amp;forum=739#13896030">IBM's Cell Broadband Engine Architecture forum</a>:

<div class="quote">
"There have been a number of requirements and issues with libspe1 that
led to the design of a new major version with a different API. I'll try
to explain a few major aspects just briefly:<br /><br />
1. libspe is supposed to be the "low-level API" to use SPE resources.
We think that the "SPE context" introduced in libspe2 is the better
low-level construct than the "SPE thread" (as defined in libspe1),
which already suggests a particular programming model and view. By
using "SPE contexts", it is, e.g., possible to have other models like
(synchronous) function offload to SPEs more easily without introducing
the complexity and overhead of threading into an application. Another
example is the possibility to exchange the code on an SPE, but leaving
the data in place, which allows for easy and efficient "chaining" of
processing steps und PPE control. In the thread model, this would have
to rely on SPE programs using overlays. By the way, it is very easy to
have the libspe1 thread model as a special case implemented on top of
libspe2 and we have actually done this exercise internally.<br /><br />
2. Many people asked for a more complete "SPE thread library" (similar
to what you usually have, e.g., in pthread). By removing the special
concept of an "SPE thread" (in the libspe1 sense), we are actually
addressing this requirement. When using libspe2, the programmer relies
on the thread package of choice and just uses SPEs in these threads.
All thread-specific aspects of the application are standard - so you
have full functionality.<br />
3. There were many complaints about the event API in libspe1 - from
usability to efficiency. We think, we found a good solution in libspe2.<br /><br />
4. We feel that the "SPE groups" in libspe1 were tieing together rather
orthogonal concepts like scheduling and event handling. So we gave up
this construct. You may have noticed that we introduced "SPE gang
contexts" and you have probably already guessed that we are working on
gang scheduling to leverage this - but "gangs" are purely a scheduling
construct and do *not* replace the previous groups.<br /><br />
5. You are right that binding threads to specific, physical SPEs has
been part of the libspe1 API, although it had never been implemented.
There are many discussions about this feature. At this point, we don't
have a conclusive answer how we want to support "affinity" of threads
to physical SPE resources. We simply felt we are not ready yet to
define the API and stick to it in the future."<br />
</div>

</div>

1. Copy the following files to the host PC. These files can be found at <a href="http://www.bsc.es/projects/deepcomputing/linuxoncell/">Barcelona Supercomputing Center, Linux on Cell [bsc.es]</a> under <span class="monospace-strong">Programming Models -&gt; Linux on Cell -&gt; Cell BE Components -&gt; GNU Toolchain</span>. 
<ul><li> libspe-1.1.0-1.ppc.rpm </li><li> libspe-debuginfo-1.1.0-1.ppc.rpm </li><li> libspe-devel-1.1.0-1.ppc.rpm </li></ul> 

2. As root, <span class="monospace-strong">rpm -ivh *.rpm</span><br /> <br /> 
3. Copy the libspe libraries from the Host PC at <span class="monospace-strong">/usr/lib/libspe.so.*</span> to <span class="monospace-strong">/usr/lib/</span> on the PS3.<br />
4. Copy the following file onto the host PC. This file can be found at IBM alphaWorks' <a href="http://www.alphaworks.ibm.com/tech/cellsw/download">IBM Cell Broadband Engine Software Development Kit download page</a>. You will need to agree to the licenses in order to download the file. 
<ul><li> cell-sdk-lib-samples-1.1-10.noarch.rpm </li></ul> 

5. As root, <span class="monospace-strong">rpm -ivh cell-sdk-lib-samples-1.1-10.noarch.rpm</span>. The source files should now be installed in <span class="monospace-strong">/opt/IBM/cell-sdk-1.1</span>.<br /> 
6. Only minor modifications are needed to cross-compile the SDK.<br /> 
<ul><li> <span class="monospace-strong">cd /opt/IBM/cell-sdk-1.1</span> </li><li> Open <span class="monospace-strong">make.footer</span> </li><li> Search for (starting at line 84 in my copy):<br /> 
<pre class="code">########################################################################<br /># Common GNU Defines (Host, PPU32, PPU64, SPU)<br />########################################################################<br /></pre> 
</li><li> Delete the following section (starting at line 91 in my copy): 
<pre class="code">ifeq "$(HOST_PROCESSOR)" "ppc64"<br /> SCE_ROOT =<br /> SCE_SYSROOT =<br /> SCE_PPU_BINDIR = /usr/bin<br /> SCE_SPU_BINDIR = /usr/bin<br /> PPU_TOOL_PREFIX =<br /> PPU32_TOOL_PREFIX =<br />else<br /> # SCE_VERSION is defined in environment or in make.env<br /> SCE_ROOT = /opt/sce/$(SCE_VERSION)<br /> SCE_SYSROOT = $(SCE_ROOT)/ppu/sysroot<br /> SCE_PPU_BINDIR = $(SCE_ROOT)/ppu/bin<br /> SCE_SPU_BINDIR = $(SCE_ROOT)/spu/bin<br /> PPU_TOOL_PREFIX = $(PPU_PREFIX)<br /> PPU32_TOOL_PREFIX = $(PPU32_PREFIX)<br />endif<br /></pre>
</li><li> Insert the following section at the same location: 
<pre class="code"> SCE_ROOT =<br /> SCE_SYSROOT =<br /> SCE_PPU_BINDIR = /usr/bin<br /> SCE_SPU_BINDIR = /usr/bin</pre> 
</li><li> If 64 bit support is not installed, search for (line 150 in my copy):<br /> 
<pre class="code">#********************<br /># 64-bit PPU Targets<br />#********************<br /></pre> 
</li><li> If 64 bit support is not installed, delete the following lines:<br /> 
<pre class="code">PPU64_TARGETS := $(strip $(PROGRAM_ppu64) \<br /> $(PROGRAMS_ppu64) \<br /> $(LIBRARY_ppu64) \<br /> $(SHARED_LIBRARY_ppu64))<br /><br />ifdef PPU64_TARGETS<br /> TARGET_PROCESSOR := ppu64<br />endif<br /></pre> 
</li><li> Save the changes </li></ul> 
7. If GLUT is not installed on the host PC, install it (for Fedora-based hosts) with <span class="monospace-strong">yum install freeglut-devel</span><br /> 
8. The SDK and samples should now build without errors: <span class="monospace-strong">cd src; make</span>
(although quite a few warnings will be generated, since the SDK contains some
code that is not standards-compliant and should be fixed).<br /> 
9. Copy the following files from the host PC to the target PS3's <span class="monospace-strong">/usr/lib</span> directory. 
<ul><li> /opt/IBM/cell-sdk-1.1/src/lib/matrix/ppu_shared/libmatrix.so </li><li> /opt/IBM/cell-sdk-1.1/src/lib/image/ppu_shared/libimage.so </li><li> /opt/IBM/cell-sdk-1.1/src/lib/vector/ppu_shared/libvector.so </li><li> /opt/IBM/cell-sdk-1.1/src/lib/surface/ppu_shared/libsurface.so </li><li> /opt/IBM/cell-sdk-1.1/src/lib/noise/ppu_shared/libnoise.so </li><li> /opt/IBM/cell-sdk-1.1/src/lib/fft/ppu_shared/libfft.so </li><li> /opt/IBM/cell-sdk-1.1/src/lib/gmath/ppu_shared/libgmath.so </li><li> /opt/IBM/cell-sdk-1.1/src/lib/math/ppu_shared/libmath.so </li><li> /opt/IBM/cell-sdk-1.1/src/lib/misc/ppu_shared/libmisc.so </li><li> /opt/IBM/cell-sdk-1.1/src/lib/audio_resample/ppu_shared/libaudio_resample.so </li></ul> 10. Now anything built with the IBM SDK should be able to run on the PS3. 
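Since all ten libraries in step 9 follow the same path pattern, the copy can be driven by a loop. A minimal sketch: <span class="monospace-strong">sdk_lib_paths</span> is a hypothetical helper, and the destination used in the usage line is a placeholder.

```shell
# Hypothetical helper: print the path of each shared library listed in step 9.
# The optional argument overrides the SDK root used in this article.
sdk_lib_paths() {
    sdk_root="${1:-/opt/IBM/cell-sdk-1.1}"
    for name in matrix image vector surface noise fft gmath math misc audio_resample; do
        echo "${sdk_root}/src/lib/${name}/ppu_shared/lib${name}.so"
    done
}
```

One possible use is <span class="monospace-strong">scp $(sdk_lib_paths) root@ps3:/usr/lib/</span>, where root@ps3 stands in for however you reach the target machine.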

<div class="subtitle" id="access_ps3_over_vnc"> Access the PS3 Over VNC </div>
I have two Playstation 3 units and only one HDMI input on my HD TV and
that one is going to be used for game playing, not developing. So the
PS3 I use for development is headless. The vast majority of the time I
can accomplish everything I need by a simple secure shell to the PS3.
But occasionally I want to use the machine as though I were local, and
that is what VNC is for.<br /> 
<br /> 

How to set up VNC on the PS3 (for Yellow Dog Linux):<br /> 
1. Secure shell from the host to the PS3 with X11 using: <span class="monospace-strong">ssh -X [PS3_IP_ADDRESS]</span><br /> 
2. On the PS3, launch the firewall security settings application using: <span class="monospace-strong">system-config-securitylevel</span>. At this point you will need to enter the root password for the PS3.<br />
3. Click on "Other ports", then "+ Add" and add port 5901 (TCP). This
will allow the VNC connection through the firewall running on the PS3.
Go ahead and close the application.<br /> 
4. On the PS3, run the VNC server using: <span class="monospace-strong">vncserver</span>. If this is the first time you've run the server, you will need to provide a password that will be used to access the machine.<br /> 
5. On the host PC, start the VNC client using: <span class="monospace-strong">vncviewer [PS3_IP_ADDRESS]:[DISPLAY_NUMBER]</span>. The display number was printed when the server was started. It defaults to 1 (ONE).<br /> 
6. After you enter the password, you should now see the PS3 window manager running with an open shell by default.<br /> 
7. In order to kill the VNC server use: <span class="monospace-strong">vncserver -kill :[DISPLAY_NUMBER]</span><br />
8. In order to use the default Yellow Dog window manager
(Enlightenment), uncomment the following lines in ~/.vnc/xstartup on
the PS3 and restart the server.<br /> 

<pre class="code">unset SESSION_MANAGER<br />exec /etc/X11/xinit/xinitrc<br /></pre> 
<br /> The only real practical difference between using the PS3 over VNC
and using it locally will be if you are writing graphics to the
framebuffer. These effects will only display over the locally connected
display.<br /> <br /> 
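The port opened in step 3 follows from the display number: a VNC server conventionally listens on TCP port 5900 plus the display number, so display 1 uses port 5901. If vncserver reports a different display, the firewall rule needs the matching port. A one-function sketch (<span class="monospace-strong">vnc_port</span> is a hypothetical helper name):

```shell
# Hypothetical helper: TCP port for a given VNC display number.
# Defaults to display 1, the usual first display started by vncserver.
vnc_port() {
    display="${1:-1}"
    echo $((5900 + display))
}
```

For example, <span class="monospace-strong">vnc_port 2</span> prints 5902, which is the port to open if the server started on display 2.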

<div class="subtitle" id="upgrade_libspe"> Upgrade libspe and libspe2 </div>
The official releases of libspe and libspe2 that were available at
launch have some minor issues that were patched recently. Both
libraries are being actively developed and there will always be new
patches available for brave developers. A cumulative release that
includes the patches through December 6 is available.<br />
<br /> 

To build and install the latest version:<br /> <br /> 

1. Download the following files from <a href="http://ozlabs.org/pipermail/cbe-oss-dev/2006-December/000682.html"> [Cbe-oss-dev] libspe and libspe2 december release</a> to the Host PC.
<ul><li>libspe-1.2.0.tar.gz</li><li>libspe2-2.0.1.tar.gz</li></ul>
The files will probably need to be renamed locally after download.<br />
2. Untar the two files with:
<ul><li><span class="monospace-strong">tar xzvf libspe-1.2.0.tar.gz</span></li><li><span class="monospace-strong">tar xzvf libspe2-2.0.1.tar.gz</span></li></ul>
3. In the <span class="monospace-strong">libspe2-2.0.1</span> directory, open the <span class="monospace-strong">make.defines</span> file, and change the equivalent section to be:
<pre class="code">ifeq "$(CROSS_COMPILE)" "1"<br />SYSROOT ?= sysroot<br />prefix ?= /usr<br />CROSS ?= ppu-<br />EXTRA_CFLAGS = -m32 -mabi=altivec<br />else<br /></pre>
4. Save the file, then apply the patches for <span class="monospace-strong">speevent</span> using:
<pre class="code">patch -p1 &lt; initevent.diff<br />patch -p1 &lt; event-public.diff<br />patch -p1 &lt; make_speevent_thread_safe.diff<br /></pre>
5. Build the library using: <span class="monospace-strong">make; make install</span><br />
6. Copy all the files (recursively) in the <span class="monospace-strong">libspe2-2.0.1/sysroot/usr/</span> directory to the <span class="monospace-strong">/usr/</span> directory on the PS3 <b>and</b> the Host PC.<br />
7. In the <span class="monospace-strong">libspe-1.2.0</span> directory, open
the <span class="monospace-strong">Makefile</span>
and change the equivalent section to be:
<pre class="code">ifeq "$(CROSS_COMPILE)" "1"<br />SYSROOT ?= sysroot<br />prefix ?= /usr<br />CROSS ?= ppu-<br />EXTRA_CFLAGS = -m32 -mabi=altivec<br />else<br /></pre>
8. Save the file, then build the library using: <span class="monospace-strong">make; make install</span><br />
9. Copy all the files (recursively) in the <span class="monospace-strong">libspe-1.2.0/sysroot/usr/</span>
directory to the <span class="monospace-strong">/usr/</span> directory on the
PS3 <b>and</b> the Host PC.<br />
<br />
Congratulations, <span class="monospace-strong">libspe-1.2.0</span> and <span class="monospace-strong">libspe2-2.0.1</span>
are now installed on the PS3 and will be used by any applications
that are dynamically linked to either of those libraries.<br />
<br />
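To confirm which libspe libraries an executable will actually load, the output of ldd can be filtered on the PS3. A minimal sketch; <span class="monospace-strong">libspe_deps</span> is a hypothetical helper, and ./ppu_hello is just the example binary built earlier in this article.

```shell
# Hypothetical helper: read ldd output on stdin and print "name path" for
# each shared library whose name contains "libspe".
libspe_deps() {
    awk '$1 ~ /libspe/ { print $1, $3 }'
}
```

Use it as <span class="monospace-strong">ldd ./ppu_hello | libspe_deps</span>. Running <span class="monospace-strong">ls -l</span> on each printed path should then show which versioned file the library name resolves to.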
<div class="sticky-note">
Special thanks to <b>Dirk Herrendoerfer</b> for both making the release available and for answering my questions on the build procedures.
</div>]]>
    </content>
</entry>

<entry>
    <title>atan2 on SPU</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2006/09/atan2-on-spu.html" />
    <id>tag:cellperformance.beyond3d.com,2006:/articles//3.21</id>

    <published>2006-09-13T04:35:40Z</published>
    <updated>2009-08-06T04:39:04Z</updated>

    <summary>On 2006 March 03 on the IBM developerWorks Cell Broadband Engine Architecture forum [ibm.com] an interesting question was asked: &quot;I am trying to port an application from an older version of SDK to SDK 1.0. It uses atan2(.....) function, which...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[On 2006 March 03 on the IBM developerWorks <a href="http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=739&amp;thread=109947&amp;message=13795522&amp;cat=46&amp;q=atan2#13795522">Cell Broadband Engine Architecture forum [ibm.com]</a> an interesting question was asked:<br />
<div class="quote">
"I am trying to port an application from an older version of SDK to SDK
1.0. It uses atan2(.....) function, which is causing trouble... This
code worked fine on SDK28, but now it looks like the new functions dont
have this particular function defined..<br />
I did change the makefile to include $(SDKLIB)/libmath.a<br />
<br />
I searched in ./sysroot/usr/spu/include/* and src/include/spu/* but couldn't find a headerfile that has it defined.<br />
<br />
Can anyone please suggest if I should just change the code to not use that function or is there a way to invoke it still?<br />
<br />
Thanks!"
</div>
<br />
It turned out this function was not available in the SDK.<br />
<br />
The following is a branch-free implementation of atan2 for vector floats
on the SPU. A scalar version which simply casts to vector and back is
also provided. This implementation is fairly quick-and-dirty and no
particular level of accuracy is guaranteed, but it should be usable for
many purposes.<br /><br /><br />
Or download the source files:<br />
<a href="http://cellperformance-snippets.googlecode.com/files/cp_fatan-cbe-spu.h">cp_fatan-cbe-spu.h</a><br />
<a href="http://cellperformance-snippets.googlecode.com/files/cp_fatan-cbe-spu.c">cp_fatan-cbe-spu.c</a><br />
 ]]>
        <![CDATA[<br />

<div class="sticky-note">
This code is C99 source. For gcc, use the following flags: <span class="monospace-strong">-std=c99 -pedantic</span>
</div>
        <div class="code">
<span class="line-number">  0</span>// ## cp_fatan-cbe-spu.h (C99)
<span class="line-number">  1</span>// ## Version 1.0
<span class="line-number">  2</span>// ##                        
<span class="line-number">  3</span>// ## Copyright (c) 2006 Mike Acton &lt;macton@gmail.com&gt;
<span class="line-number">  4</span>// ##                        
<span class="line-number">  5</span>// ## SIGNIFICANT REFERENCES:
<span class="line-number">  6</span>// ##                        
<span class="line-number">  7</span>// ##    [1] Cephes Math Library Release 2.8:  June, 2000
<span class="line-number">  8</span>// ##        Copyright 1984, 1995, 2000, Stephen L. Moshier
<span class="line-number">  9</span>// ##    [2] Numerical Computation Guide (PDF)
<span class="line-number"> 10</span>// ##        Copyright 2000, Sun Microsystems, Inc.
<span class="line-number"> 11</span>// ##    [3] IEEE 754 Support in C99 (PDF)
<span class="line-number"> 12</span>// ##        Copyright 2001, Jim Thomas
<span class="line-number"> 13</span>// ##    [4] Solaris 10 Reference Manual : atan2(3M)
<span class="line-number"> 14</span>// ##        Copyright 1994-2005, Sun Microsystems, Inc.
<span class="line-number"> 15</span>// ##                        
<span class="line-number"> 16</span>// ## Permission is hereby granted, free of charge, to any person obtaining
<span class="line-number"> 17</span>// ## a copy of this software and associated documentation files 
<span class="line-number"> 18</span>// ## (the "Software"), to deal in the Software without restriction, including
<span class="line-number"> 19</span>// ## without limitation the rights to use, copy, modify, merge, publish, 
<span class="line-number"> 20</span>// ## distribute, sublicense, and/or sell copies of the Software, and to permit
<span class="line-number"> 21</span>// ## persons to whom the Software is furnished to do so, subject to the 
<span class="line-number"> 22</span>// ## following conditions:
<span class="line-number"> 23</span>// ##                        
<span class="line-number"> 24</span>// ## The above copyright notice and this permission notice shall be included 
<span class="line-number"> 25</span>// ## in all copies or substantial portions of the Software.
<span class="line-number"> 26</span>// ##                        
<span class="line-number"> 27</span>// ## THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 
<span class="line-number"> 28</span>// ## OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
<span class="line-number"> 29</span>// ## FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 
<span class="line-number"> 30</span>// ## AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
<span class="line-number"> 31</span>// ## LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 
<span class="line-number"> 32</span>// ## OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 
<span class="line-number"> 33</span>// ## THE SOFTWARE.
<span class="line-number"> 34</span>// ##                        
<span class="line-number"> 35</span>
<span class="line-number"> 36</span>#ifndef CP_FATAN_CBE_SPU_H
<span class="line-number"> 37</span>#define CP_FATAN_CBE_SPU_H
<span class="line-number"> 38</span>
<span class="line-number"> 39</span>#include &lt;stdint.h&gt;
<span class="line-number"> 40</span>#include &lt;spu_intrinsics.h&gt;
<span class="line-number"> 41</span>
<span class="line-number"> 42</span>// ##                        
<span class="line-number"> 43</span>// ## Global Floating-point constants (32 bit)
<span class="line-number"> 44</span>// ##                        
<span class="line-number"> 45</span>// ## Constant is loaded in each element of a 32 bit floating-point vector
<span class="line-number"> 46</span>// ## from local store.
<span class="line-number"> 47</span>// ##                        
<span class="line-number"> 48</span>// ## cp_flpio4()  +PI/+4
<span class="line-number"> 49</span>// ## cp_flt3p8()  tan( +3.0 * PI / +8.0 )
<span class="line-number"> 50</span>// ## cp_flnpio2() -PI/+2
<span class="line-number"> 51</span>// ## cp_flpio2()  +PI/+2
<span class="line-number"> 52</span>// ## cp_flpt66()  +0.66
<span class="line-number"> 53</span>// ## cp_flpi()    +PI
<span class="line-number"> 54</span>// ## cp_flnpi()   -PI
<span class="line-number"> 55</span>
<span class="line-number"> 56</span>extern const vector unsigned int _cp_f_pio4;
<span class="line-number"> 57</span>extern const vector unsigned int _cp_f_t3p8;
<span class="line-number"> 58</span>extern const vector unsigned int _cp_f_npio2;
<span class="line-number"> 59</span>extern const vector unsigned int _cp_f_pio2;
<span class="line-number"> 60</span>extern const vector unsigned int _cp_f_pt66;
<span class="line-number"> 61</span>extern const vector unsigned int _cp_f_pi;
<span class="line-number"> 62</span>extern const vector unsigned int _cp_f_npi;
<span class="line-number"> 63</span>
<span class="line-number"> 64</span>static inline qword
<span class="line-number"> 65</span>cp_flpio4( void )
<span class="line-number"> 66</span>{
<span class="line-number"> 67</span>    return si_lqa( (intptr_t)&amp;_cp_f_pio4 );
<span class="line-number"> 68</span>}
<span class="line-number"> 69</span>
<span class="line-number"> 70</span>static inline qword
<span class="line-number"> 71</span>cp_flt3p8( void )
<span class="line-number"> 72</span>{
<span class="line-number"> 73</span>    return si_lqa( (intptr_t)&amp;_cp_f_t3p8 );
<span class="line-number"> 74</span>}
<span class="line-number"> 75</span>
<span class="line-number"> 76</span>static inline qword
<span class="line-number"> 77</span>cp_flnpio2( void )
<span class="line-number"> 78</span>{
<span class="line-number"> 79</span>    return si_lqa( (intptr_t)&amp;_cp_f_npio2 );
<span class="line-number"> 80</span>}
<span class="line-number"> 81</span>
<span class="line-number"> 82</span>static inline qword
<span class="line-number"> 83</span>cp_flpio2( void )
<span class="line-number"> 84</span>{
<span class="line-number"> 85</span>    return si_lqa( (intptr_t)&amp;_cp_f_pio2 );
<span class="line-number"> 86</span>}
<span class="line-number"> 87</span>
<span class="line-number"> 88</span>static inline qword
<span class="line-number"> 89</span>cp_flpt66( void )
<span class="line-number"> 90</span>{
<span class="line-number"> 91</span>    return si_lqa( (intptr_t)&amp;_cp_f_pt66 );
<span class="line-number"> 92</span>}
<span class="line-number"> 93</span>
<span class="line-number"> 94</span>static inline qword
<span class="line-number"> 95</span>cp_flpi( void )
<span class="line-number"> 96</span>{
<span class="line-number"> 97</span>    return si_lqa( (intptr_t)&amp;_cp_f_pi );
<span class="line-number"> 98</span>}
<span class="line-number"> 99</span>
<span class="line-number">100</span>static inline qword
<span class="line-number">101</span>cp_flnpi( void )
<span class="line-number">102</span>{
<span class="line-number">103</span>    return si_lqa( (intptr_t)&amp;_cp_f_npi );
<span class="line-number">104</span>}
<span class="line-number">105</span>
<span class="line-number">106</span>// ##                        
<span class="line-number">107</span>// ## Load-Immediate Floating-point constants (32 bit)
<span class="line-number">108</span>// ##                        
<span class="line-number">109</span>// ## Constant is loaded in each element of a 32 bit floating-point vector
<span class="line-number">110</span>// ## using immediate values, i.e. no loads from local store.
<span class="line-number">111</span>// ##                        
<span class="line-number">112</span>// ## cp_filzero()   +0.0  +0x00000000
<span class="line-number">113</span>// ## cp_filnzero()  -0.0  +0x80000000
<span class="line-number">114</span>// ## cp_filone()    +1.0  +0x3f800000
<span class="line-number">115</span>// ## cp_filtwo()    +2.0  +0x40000000
<span class="line-number">116</span>// ## cp_filinf()    +INF  +0x7f800000
<span class="line-number">117</span>// ## cp_filninf()   -INF  +0xff800000
<span class="line-number">118</span>// ## cp_filnan()     NaN  +0x7fc00000
<span class="line-number">119</span>// ##                        
<span class="line-number">120</span>
<span class="line-number">121</span>static inline qword 
<span class="line-number">122</span>cp_filzero( void )
<span class="line-number">123</span>{
<span class="line-number">124</span>    return si_ilhu( (int16_t)0x0000 );
<span class="line-number">125</span>}
<span class="line-number">126</span>
<span class="line-number">127</span>static inline qword 
<span class="line-number">128</span>cp_filnzero( void )
<span class="line-number">129</span>{
<span class="line-number">130</span>    return si_ilhu( (int16_t)0x8000 );
<span class="line-number">131</span>}
<span class="line-number">132</span>
<span class="line-number">133</span>static inline qword 
<span class="line-number">134</span>cp_filone( void )
<span class="line-number">135</span>{
<span class="line-number">136</span>    return si_ilhu( (int16_t)0x3f80 );
<span class="line-number">137</span>}
<span class="line-number">138</span>
<span class="line-number">139</span>static inline qword 
<span class="line-number">140</span>cp_filtwo( void )
<span class="line-number">141</span>{
<span class="line-number">142</span>    return si_ilhu( (int16_t)0x4000 );
<span class="line-number">143</span>}
<span class="line-number">144</span>
<span class="line-number">145</span>static inline qword 
<span class="line-number">146</span>cp_filinf( void )
<span class="line-number">147</span>{
<span class="line-number">148</span>    return si_ilhu( (int16_t)0x7f80 );
<span class="line-number">149</span>}
<span class="line-number">150</span>
<span class="line-number">151</span>static inline qword 
<span class="line-number">152</span>cp_filninf( void )
<span class="line-number">153</span>{
<span class="line-number">154</span>    return si_ilhu( (int16_t)0xff80 );
<span class="line-number">155</span>}
<span class="line-number">156</span>
<span class="line-number">157</span>static inline qword 
<span class="line-number">158</span>cp_filnan( void )
<span class="line-number">159</span>{
<span class="line-number">160</span>    return si_ilhu( (int16_t)0x7fc0 );
<span class="line-number">161</span>}
<span class="line-number">162</span>
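// ##                        
// ## Hedged sketch, not part of the original header: a portable scalar
// ## model of the bit patterns listed above. si_ilhu places a 16 bit
// ## immediate in the upper halfword of each word and clears the lower
// ## halfword. The name cp_example_ilhu_as_float is hypothetical, and
// ## the sketch assumes float is a 32 bit IEEE 754 value on the host.
// ##                        

static inline float
cp_example_ilhu_as_float( const uint16_t upper )
{
    union { uint32_t u; float f; } bits;

    bits.u = (uint32_t)upper * 65536u;  // upper halfword set, lower halfword zero

    return (bits.f);
}
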
<span class="line-number">163</span>// ##                        
<span class="line-number">164</span>// ## cp_fatan() Coefficients and other constants
<span class="line-number">165</span>// ##                        
<span class="line-number">166</span>
<span class="line-number">167</span>extern const vector unsigned int _cp_f_atan_q4;
<span class="line-number">168</span>extern const vector unsigned int _cp_f_atan_q3;
<span class="line-number">169</span>extern const vector unsigned int _cp_f_atan_q2;
<span class="line-number">170</span>extern const vector unsigned int _cp_f_atan_q1;
<span class="line-number">171</span>extern const vector unsigned int _cp_f_atan_q0;
<span class="line-number">172</span>extern const vector unsigned int _cp_f_atan_p4;
<span class="line-number">173</span>extern const vector unsigned int _cp_f_atan_p3;
<span class="line-number">174</span>extern const vector unsigned int _cp_f_atan_p2;
<span class="line-number">175</span>extern const vector unsigned int _cp_f_atan_p1;
<span class="line-number">176</span>extern const vector unsigned int _cp_f_atan_p0;
<span class="line-number">177</span>extern const vector unsigned int _cp_f_hmorebits;
<span class="line-number">178</span>extern const vector unsigned int _cp_f_morebits;
<span class="line-number">179</span>
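// ##                        
// ## Hedged sketch, not part of the original header: a portable scalar
// ## model of the order in which _cp_fatan consumes the coefficients
// ## declared above. The cp_example_ names and the array parameters are
// ## hypothetical; the real coefficient values live in the _cp_f_atan_*
// ## constants and are not reproduced here.
// ##                        
// ## Numerator:   p[0] multiplies the highest power of xp2.
// ## Denominator: monic, so the leading power of xp2 has an implicit 1.0.
// ##                        

static inline float
cp_example_atan_num( const float p[5], const float xp2 )
{
    float acc = p[0];            // znum0
    acc = acc * xp2 + p[1];      // znum1 = si_fma( znum0, xp2, p1 )
    acc = acc * xp2 + p[2];      // znum2
    acc = acc * xp2 + p[3];      // znum3
    acc = acc * xp2 + p[4];      // znum

    return (acc);
}

static inline float
cp_example_atan_den( const float q[5], const float xp2 )
{
    float acc = xp2 + q[0];      // zden0 = si_fa( xp2, q0 )
    acc = acc * xp2 + q[1];      // zden1 = si_fma( zden0, xp2, q1 )
    acc = acc * xp2 + q[2];      // zden2
    acc = acc * xp2 + q[3];      // zden3
    acc = acc * xp2 + q[4];      // zden

    return (acc);
}
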
<span class="line-number">180</span>// ## cp_fatan(x)
<span class="line-number">181</span>// ##                        
<span class="line-number">182</span>// ## -INF  &lt;= x           &lt;= +INF
<span class="line-number">183</span>// ## -PI/2 &lt;= cp_fatan(x) &lt;= +PI/2
<span class="line-number">184</span>// ##                        
<span class="line-number">185</span>// ## Each floating-point component of the result is a function of
<span class="line-number">186</span>// ## the corresponding components of x:
<span class="line-number">187</span>// ##
<span class="line-number">188</span>// ##    0.0                                             { x == 0.0
<span class="line-number">189</span>// ##                        
<span class="line-number">190</span>// ##    +PI                                             {
<span class="line-number">191</span>// ##    ---                                             { x == INF
<span class="line-number">192</span>// ##    2.0                                             {
<span class="line-number">193</span>// ##                        
<span class="line-number">194</span>// ##    -PI                                             {
<span class="line-number">195</span>// ##    ---                                             { x == -INF
<span class="line-number">196</span>// ##    2.0                                             {
<span class="line-number">197</span>// ##                        
<span class="line-number">198</span>// ##                           
<span class="line-number">199</span>// ##                 8      6      4      2             {
<span class="line-number">200</span>// ##              P x  + P x  + P x  + P x  + P         {
<span class="line-number">201</span>// ##       2       0      1      2      3      4        {
<span class="line-number">202</span>// ##  x * x  * ------------------------------------ + x { otherwise
<span class="line-number">203</span>// ##            10      8      6      4      2          {
<span class="line-number">204</span>// ##           x   + Q x  + Q x  + Q x  + Q x  + Q      {
<span class="line-number">205</span>// ##                  0      1      2      3      4     {
<span class="line-number">206</span>
<span class="line-number">207</span>static inline qword
<span class="line-number">208</span>_cp_fatan( const qword x )
<span class="line-number">209</span>{
<span class="line-number">210</span>    // ##                        
<span class="line-number">211</span>    // ## Load constants
<span class="line-number">212</span>    // ##                        
<span class="line-number">213</span>    
<span class="line-number">214</span>    const qword f_one           = cp_filone();
<span class="line-number">215</span>    const qword f_inf           = cp_filinf();
<span class="line-number">216</span>    const qword f_ninf          = cp_filninf();
<span class="line-number">217</span>    const qword f_msb           = cp_filnzero();
<span class="line-number">218</span>    const qword f_zero          = cp_filzero();
<span class="line-number">219</span>
<span class="line-number">220</span>    const qword f_pt66          = si_lqa( (intptr_t)&amp;_cp_f_pt66      );
<span class="line-number">221</span>    const qword f_pio2          = si_lqa( (intptr_t)&amp;_cp_f_pio2      );
<span class="line-number">222</span>    const qword f_npio2         = si_lqa( (intptr_t)&amp;_cp_f_npio2     );
<span class="line-number">223</span>    const qword f_pio4          = si_lqa( (intptr_t)&amp;_cp_f_pio4      );
<span class="line-number">224</span>    const qword f_t3p8          = si_lqa( (intptr_t)&amp;_cp_f_t3p8      );
<span class="line-number">225</span>
<span class="line-number">226</span>    const qword f_atan_p0       = si_lqa( (intptr_t)&amp;_cp_f_atan_p0    );
<span class="line-number">227</span>    const qword f_atan_p1       = si_lqa( (intptr_t)&amp;_cp_f_atan_p1    );
<span class="line-number">228</span>    const qword f_atan_p2       = si_lqa( (intptr_t)&amp;_cp_f_atan_p2    );
<span class="line-number">229</span>    const qword f_atan_p3       = si_lqa( (intptr_t)&amp;_cp_f_atan_p3    );
<span class="line-number">230</span>    const qword f_atan_p4       = si_lqa( (intptr_t)&amp;_cp_f_atan_p4    );
<span class="line-number">231</span>    const qword f_atan_q0       = si_lqa( (intptr_t)&amp;_cp_f_atan_q0    );
<span class="line-number">232</span>    const qword f_atan_q1       = si_lqa( (intptr_t)&amp;_cp_f_atan_q1    );
<span class="line-number">233</span>    const qword f_atan_q2       = si_lqa( (intptr_t)&amp;_cp_f_atan_q2    );
<span class="line-number">234</span>    const qword f_atan_q3       = si_lqa( (intptr_t)&amp;_cp_f_atan_q3    );
<span class="line-number">235</span>    const qword f_atan_q4       = si_lqa( (intptr_t)&amp;_cp_f_atan_q4    );
<span class="line-number">236</span>    const qword f_morebits      = si_lqa( (intptr_t)&amp;_cp_f_morebits  );
<span class="line-number">237</span>    const qword f_hmorebits     = si_lqa( (intptr_t)&amp;_cp_f_hmorebits );
<span class="line-number">238</span>    
<span class="line-number">239</span>    // ##                        
<span class="line-number">240</span>    // ## pos_x = -x            { x &lt; 0
<span class="line-number">241</span>    // ##          x            { otherwise
<span class="line-number">242</span>    // ##                        
<span class="line-number">243</span>    
<span class="line-number">244</span>    const qword neg_x           = si_xor( x, f_msb );          
<span class="line-number">245</span>    const qword sign_mask       = si_fcgt( f_zero, x );
<span class="line-number">246</span>    const qword pos_x           = si_selb( x, neg_x, sign_mask );
<span class="line-number">247</span>    
<span class="line-number">248</span>    // ##                        
<span class="line-number">249</span>    // ## Range reduction
<span class="line-number">250</span>    // ##                        
<span class="line-number">251</span>    
<span class="line-number">252</span>    // ##                        
<span class="line-number">253</span>    // ## range0_mask = ( pos_x &gt; tan( 3.0 * PI / 8.0 ) )
<span class="line-number">254</span>    // ## range1_mask = ( pos_x &lt;= 0.66 )
<span class="line-number">255</span>    // ## range2_mask = !( range0_mask || range1_mask )
<span class="line-number">256</span>    // ##                        
<span class="line-number">257</span>    
<span class="line-number">258</span>    const qword range0_mask     = si_fcgt( pos_x, f_t3p8 );
<span class="line-number">259</span>    const qword range1_gt_mask  = si_fcgt( f_pt66, pos_x );
<span class="line-number">260</span>    const qword range1_eq_mask  = si_fceq( f_pt66, pos_x );
<span class="line-number">261</span>    const qword range1_mask     = si_or( range1_gt_mask, range1_eq_mask );
<span class="line-number">262</span>    const qword range2_mask     = si_nor( range0_mask, range1_mask );
<span class="line-number">263</span>    
<span class="line-number">264</span>    // ##                        
<span class="line-number">265</span>    // ## range0_x = -1.0 
<span class="line-number">266</span>    // ##            -----
<span class="line-number">267</span>    // ##            pos_x
<span class="line-number">268</span>    // ##                        
<span class="line-number">269</span>    // ## range0_y = PI
<span class="line-number">270</span>    // ##            ---
<span class="line-number">271</span>    // ##            2.0
<span class="line-number">272</span>    // ##                        
<span class="line-number">273</span>    
<span class="line-number">274</span>    const qword range0_x0       = si_frest( pos_x );
<span class="line-number">275</span>    const qword range0_x1       = si_fi( pos_x, range0_x0 );
<span class="line-number">276</span>    const qword range0_x2       = si_fnms( range0_x1, pos_x, f_one );
<span class="line-number">277</span>    const qword range0_x3       = si_fma( range0_x2, range0_x1, range0_x1 );
<span class="line-number">278</span>    const qword range0_x        = si_xor( range0_x3, f_msb );
<span class="line-number">279</span>    const qword range0_y        = f_pio2;
<span class="line-number">280</span>    
<span class="line-number">281</span>    // ##                        
<span class="line-number">282</span>    // ## range1_x = pos_x
<span class="line-number">283</span>    // ## range1_y = 0.0
<span class="line-number">284</span>    // ##                        
<span class="line-number">285</span>    
<span class="line-number">286</span>    const qword range1_x        = pos_x;
<span class="line-number">287</span>    const qword range1_y        = f_zero;
<span class="line-number">288</span>    
<span class="line-number">289</span>    
<span class="line-number">290</span>    // ##                        
<span class="line-number">291</span>    // ## range2_x = (pos_x-1.0)
<span class="line-number">292</span>    // ##            -----------
<span class="line-number">293</span>    // ##            (pos_x+1.0)
<span class="line-number">294</span>    // ##                        
<span class="line-number">295</span>    // ## range2_y = PI
<span class="line-number">296</span>    // ##            ---
<span class="line-number">297</span>    // ##            4.0
<span class="line-number">298</span>    // ##                        
<span class="line-number">299</span>    
<span class="line-number">300</span>    const qword range2_y        = f_pio4;
<span class="line-number">301</span>    const qword range2_x0num    = si_fs( pos_x, f_one );
<span class="line-number">302</span>    const qword range2_x0den    = si_fa( pos_x, f_one );
<span class="line-number">303</span>    const qword range2_x0       = si_fi( range2_x0den, si_frest( range2_x0den ) );
<span class="line-number">304</span>    const qword range2_x1       = si_fnms( range2_x0, range2_x0den, f_one );
<span class="line-number">305</span>    const qword range2_x2       = si_fma( range2_x1, range2_x0, range2_x0 );
<span class="line-number">306</span>    const qword range2_x        = si_fm( range2_x0num, range2_x2 );
<span class="line-number">307</span>    
<span class="line-number">308</span>    // ##                        
<span class="line-number">309</span>    // ## range_x  = range0_x { range0_mask
<span class="line-number">310</span>    // ##            range1_x { range1_mask
<span class="line-number">311</span>    // ##            range2_x { range2_mask
<span class="line-number">312</span>    // ##                        
<span class="line-number">313</span>    // ## range_y  = range0_y { range0_mask
<span class="line-number">314</span>    // ##            range1_y { range1_mask
<span class="line-number">315</span>    // ##            range2_y { range2_mask
<span class="line-number">316</span>    // ##                        
<span class="line-number">317</span>    
<span class="line-number">318</span>    const qword range_x0        = si_selb( range2_x, range0_x, range0_mask );
<span class="line-number">319</span>    const qword range_x         = si_selb( range_x0, range1_x, range1_mask );
<span class="line-number">320</span>    const qword range_y0        = si_selb( range2_y, range0_y, range0_mask );
<span class="line-number">321</span>    const qword range_y         = si_selb( range_y0, range1_y, range1_mask );
<span class="line-number">322</span>    
<span class="line-number">323</span>    // ##                        
<span class="line-number">324</span>    // ##                  2
<span class="line-number">325</span>    // ## xp2    =  range_x 
<span class="line-number">326</span>    // ##                4        3        2
<span class="line-number">327</span>    // ##           P xp2  + P xp2  + P xp2  + P xp2 + P
<span class="line-number">328</span>    // ##            0        1        2        3       4
<span class="line-number">329</span>    // ## zdiv   =  --------------------------------------------
<span class="line-number">330</span>    // ##              5        4        3        2
<span class="line-number">331</span>    // ##           xp2  + Q xp2  + Q xp2  + Q xp2  + Q xp2 + Q
<span class="line-number">332</span>    // ##                   0        1        2        3       4
<span class="line-number">333</span>    // ## 
<span class="line-number">334</span>    // ## z1     = range_x * ( xp2 * zdiv ) + range_x
<span class="line-number">335</span>    // ## 
<span class="line-number">336</span>    
<span class="line-number">337</span>    const qword xp2             = si_fm( range_x, range_x );
<span class="line-number">338</span>    const qword znum0           = f_atan_p0;
<span class="line-number">339</span>    const qword znum1           = si_fma( znum0, xp2, f_atan_p1 );
<span class="line-number">340</span>    const qword znum2           = si_fma( znum1, xp2, f_atan_p2 );
<span class="line-number">341</span>    const qword znum3           = si_fma( znum2, xp2, f_atan_p3 );
<span class="line-number">342</span>    const qword znum            = si_fma( znum3, xp2, f_atan_p4 );
<span class="line-number">343</span>    const qword zden0           = si_fa( xp2, f_atan_q0 );
<span class="line-number">344</span>    const qword zden1           = si_fma( zden0, xp2, f_atan_q1 );
<span class="line-number">345</span>    const qword zden2           = si_fma( zden1, xp2, f_atan_q2 );
<span class="line-number">346</span>    const qword zden3           = si_fma( zden2, xp2, f_atan_q3 );
<span class="line-number">347</span>    const qword zden            = si_fma( zden3, xp2, f_atan_q4 );
<span class="line-number">348</span>    const qword zden_r0         = si_fi( zden, si_frest( zden ) );
<span class="line-number">349</span>    const qword zden_r1         = si_fnms( zden_r0, zden, f_one );
<span class="line-number">350</span>    const qword zden_r          = si_fma( zden_r1, zden_r0, zden_r0 );
<span class="line-number">351</span>    const qword zdiv            = si_fm( znum, zden_r );
<span class="line-number">352</span>    const qword z0              = si_fm( xp2, zdiv );
<span class="line-number">353</span>    const qword z1              = si_fma( range_x, z0, range_x );
<span class="line-number">354</span>    
<span class="line-number">355</span>    // ##                        
<span class="line-number">356</span>    // ## zadd      =  z1 + 0.5 * MOREBITS { range2_mask
<span class="line-number">357</span>    // ##              z1 + MOREBITS       { range0_mask
<span class="line-number">358</span>    // ##              z1                  { otherwise
<span class="line-number">359</span>    // ##                        
<span class="line-number">360</span>    // ## yaddz     = range_y + zadd
<span class="line-number">361</span>    // ##                        
<span class="line-number">362</span>    // ## pos_yaddz = yaddz      { x &gt;= 0
<span class="line-number">363</span>    // ##             -yaddz     { x &lt;  0
<span class="line-number">364</span>    // ##                        
<span class="line-number">365</span>
<span class="line-number">366</span>    const qword zadd0           = si_selb( f_zero, f_hmorebits, range2_mask );
<span class="line-number">367</span>    const qword zadd1           = si_selb( zadd0,  f_morebits,  range0_mask );
<span class="line-number">368</span>    const qword zadd            = si_fa( z1, zadd1 );
<span class="line-number">369</span>    const qword yaddz           = si_fa( range_y, zadd );
<span class="line-number">370</span>    const qword neg_yaddz       = si_xor( yaddz, f_msb );
<span class="line-number">371</span>    const qword pos_yaddz       = si_selb( yaddz,  neg_yaddz,  sign_mask );
<span class="line-number">372</span>    
<span class="line-number">373</span>    // ##                        
<span class="line-number">374</span>    // ## result_y0 = 0.0        { x == 0.0
<span class="line-number">375</span>    // ##             pos_yaddz  { otherwise
<span class="line-number">376</span>    // ##                        
<span class="line-number">377</span>    
<span class="line-number">378</span>    const qword x_eqz_mask      = si_fceq( f_zero, x );
<span class="line-number">379</span>    const qword result_y0       = si_selb( pos_yaddz, x, x_eqz_mask );
<span class="line-number">380</span>
<span class="line-number">381</span>    // ##                        
<span class="line-number">382</span>    // ## result    = +PI         {
<span class="line-number">383</span>    // ##             ---         { x == INF
<span class="line-number">384</span>    // ##             2.0         {
<span class="line-number">385</span>    // ##                        
<span class="line-number">386</span>    // ##             -PI         {
<span class="line-number">387</span>    // ##             ---         { x == -INF
<span class="line-number">388</span>    // ##             2.0         {
<span class="line-number">389</span>    // ##                        
<span class="line-number">390</span>    // ##             result_y0   { otherwise
<span class="line-number">391</span>    // ##                        
<span class="line-number">392</span>
<span class="line-number">393</span>    const qword x_eqinf_mask    = si_fceq( f_inf,  x );
<span class="line-number">394</span>    const qword x_eqninf_mask   = si_fceq( f_ninf, x );
<span class="line-number">395</span>    const qword result_y1       = si_selb( result_y0, f_pio2,  x_eqinf_mask );
<span class="line-number">396</span>    const qword result          = si_selb( result_y1, f_npio2, x_eqninf_mask );
<span class="line-number">397</span>
<span class="line-number">398</span>    return (result);
<span class="line-number">399</span>}
<span class="line-number">400</span>
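// ##                        
// ## Hedged sketch, not part of the original header: portable scalar
// ## models of two steps used inside _cp_fatan. The cp_example_ names
// ## are hypothetical and the decimal constants are rounded stand-ins
// ## for _cp_f_t3p8, _cp_f_pt66, _cp_f_pio2 and _cp_f_pio4.
// ##                        
// ## cp_example_fatan_reduce maps a non-negative pos_x to the pair
// ## ( range_x, range_y ) so that atan( pos_x ) = range_y + atan( range_x ).
// ##                        
// ## cp_example_recip_refine is one Newton-Raphson step on a reciprocal
// ## estimate r0 of 1.0 / d, as done with si_fnms followed by si_fma.
// ##                        

static inline void
cp_example_fatan_reduce( const float pos_x, float* const range_x, float* const range_y )
{
    if ( pos_x > 2.41421356f )          // range0: pos_x above tan( 3 PI / 8 )
    {
        *range_x = -1.0f / pos_x;
        *range_y = 1.57079633f;         // PI / 2
    }
    else if ( pos_x > 0.66f )           // range2: between 0.66 and tan( 3 PI / 8 )
    {
        *range_x = ( pos_x - 1.0f ) / ( pos_x + 1.0f );
        *range_y = 0.78539816f;         // PI / 4
    }
    else                                // range1: pos_x at or below 0.66
    {
        *range_x = pos_x;
        *range_y = 0.0f;
    }
}

static inline float
cp_example_recip_refine( const float d, const float r0 )
{
    const float err = 1.0f - r0 * d;    // si_fnms( r0, d, f_one )

    return (err * r0 + r0);             // si_fma( err, r0, r0 )
}
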
<span id="cp_fatan" class="line-number">401</span>static inline vector float
<span class="line-number">402</span>cp_fatan( const vector float x )
<span class="line-number">403</span>{
<span class="line-number">404</span>    return (vector float)( _cp_fatan( (qword)x ) );
<span class="line-number">405</span>}
<span class="line-number">406</span>
<span id="cp_fatan_scalar" class="line-number">407</span>static inline float
<span class="line-number">408</span>cp_fatan_scalar( const float x )
<span class="line-number">409</span>{
<span class="line-number">410</span>    const qword vx      = si_from_float( x );
<span class="line-number">411</span>    const qword vresult = _cp_fatan( vx );
<span class="line-number">412</span>    const float result  = si_to_float( vresult );
<span class="line-number">413</span>
<span class="line-number">414</span>    return (result);
<span class="line-number">415</span>}
<span class="line-number">416</span>
<span class="line-number">417</span>// ## cp_fatan2(y,x)
<span class="line-number">418</span>// ## 
<span class="line-number">419</span>// ## -INF &lt;= x              &lt;= INF
<span class="line-number">420</span>// ## -INF &lt;= y              &lt;= INF
<span class="line-number">421</span>// ## -PI  &lt;= cp_fatan2(y,x) &lt;= +PI
<span class="line-number">422</span>// ##                        
<span class="line-number">423</span>// ## Each floating-point component of the result is a function of
<span class="line-number">424</span>// ## the corresponding components of y and x:
<span class="line-number">425</span>// ##                        
<span class="line-number">426</span>// ##     +PI                  { (y == +0.0) &amp;&amp; (x &lt; 0.0)
<span class="line-number">427</span>// ##                        
<span class="line-number">428</span>// ##     -PI                  { (y == -0.0) &amp;&amp; (x &lt; 0.0)
<span class="line-number">429</span>// ##     
<span class="line-number">430</span>// ##     +0.0                 { (y == +0.0) &amp;&amp; (x &gt; 0.0)
<span class="line-number">431</span>// ##     
<span class="line-number">432</span>// ##     -0.0                 { (y == -0.0) &amp;&amp; (x &gt; 0.0)
<span class="line-number">433</span>// ##      
<span class="line-number">434</span>// ##     -PI                  {
<span class="line-number">435</span>// ##     ----                 { (y &lt; 0.0) &amp;&amp; (x == 0.0)
<span class="line-number">436</span>// ##     +2.0                 {
<span class="line-number">437</span>// ##     
<span class="line-number">438</span>// ##     +PI                  {
<span class="line-number">439</span>// ##     ----                 { (y &gt; 0.0) &amp;&amp; (x == 0.0)
<span class="line-number">440</span>// ##     +2.0                 {
<span class="line-number">441</span>// ##     
<span class="line-number">442</span>// ##     NaN                  { (y == NaN) || (x == NaN) 
<span class="line-number">443</span>// ##     
<span class="line-number">444</span>// ##     +PI                  { (y == +0.0) &amp;&amp; (x == -0.0)
<span class="line-number">445</span>// ##                        
<span class="line-number">446</span>// ##     -PI                  { (y == -0.0) &amp;&amp; (x == -0.0)
<span class="line-number">447</span>// ##                        
<span class="line-number">448</span>// ##     +0.0                 { (y == +0.0) &amp;&amp; (x == +0.0)
<span class="line-number">449</span>// ##                        
<span class="line-number">450</span>// ##     -0.0                 { (y == -0.0) &amp;&amp; (x == +0.0)
<span class="line-number">451</span>// ##                        
<span class="line-number">452</span>// ##     +PI                  {
<span class="line-number">453</span>// ##     ---                  { (y == +INF) &amp;&amp; (x == +INF)
<span class="line-number">454</span>// ##     4.0                  {
<span class="line-number">455</span>// ##                        
<span class="line-number">456</span>// ##     -PI                  {
<span class="line-number">457</span>// ##     ---                  { (y == -INF) &amp;&amp; (x == +INF)
<span class="line-number">458</span>// ##     4.0                  {
<span class="line-number">459</span>// ##                        
<span class="line-number">460</span>// ##     +3.0 PI              {
<span class="line-number">461</span>// ##     -------              { (y == +INF) &amp;&amp; (x == -INF)
<span class="line-number">462</span>// ##     +4.0                 {
<span class="line-number">463</span>// ##                        
<span class="line-number">464</span>// ##     -3.0 PI              {
<span class="line-number">465</span>// ##     -------              { (y == -INF) &amp;&amp; (x == -INF)
<span class="line-number">466</span>// ##     +4.0                 {
<span class="line-number">467</span>// ##                        
<span class="line-number">468</span>// ##     +PI                  { isfinite(y) &amp;&amp; (+y &gt; 0) &amp;&amp; (x == -INF)
<span class="line-number">469</span>// ##                        
<span class="line-number">470</span>// ##     -PI                  { isfinite(y) &amp;&amp; (-y &gt; 0) &amp;&amp; (x == -INF)
<span class="line-number">471</span>// ##                        
<span class="line-number">472</span>// ##     +0.0                 { isfinite(y) &amp;&amp; (+y &gt; 0) &amp;&amp; (x == +INF)
<span class="line-number">473</span>// ##                        
<span class="line-number">474</span>// ##     -0.0                 { isfinite(y) &amp;&amp; (-y &gt; 0) &amp;&amp; (x == +INF)
<span class="line-number">475</span>// ##                        
<span class="line-number">476</span>// ##     +PI                  {
<span class="line-number">477</span>// ##     ----                 { isfinite(x) &amp;&amp; (y == +INF)
<span class="line-number">478</span>// ##     +2.0                 {
<span class="line-number">479</span>// ##                        
<span class="line-number">480</span>// ##     -PI                  {
<span class="line-number">481</span>// ##     ---                  { isfinite(x) &amp;&amp; (y == -INF)
<span class="line-number">482</span>// ##     +2.0                 {
<span class="line-number">483</span>// ##                        
<span class="line-number">484</span>// ##                   ( y )  {
<span class="line-number">485</span>// ##     +PI  + cp_atan( - )  { ( x &lt;  0.0 ) &amp;&amp; ( y &gt;= 0.0 )
<span class="line-number">486</span>// ##                   ( x )  {
<span class="line-number">487</span>// ##                                     
<span class="line-number">488</span>// ##                   ( y )  {
<span class="line-number">489</span>// ##     -PI  + cp_atan( - )  { ( x &lt;  0.0 ) &amp;&amp; ( y &lt; 0.0 )
<span class="line-number">490</span>// ##                   ( x )  {
<span class="line-number">491</span>// ##                                     
<span class="line-number">492</span>// ##                   ( y )  {
<span class="line-number">493</span>// ##     +0.0 + cp_atan( - )  { otherwise
<span class="line-number">494</span>// ##                   ( x )  {
<span class="line-number">495</span>// ##                                     
<span class="line-number">496</span>
<span class="line-number">497</span>qword _cp_fatan2( qword y, qword x )
<span class="line-number">498</span>{
<span class="line-number">499</span>    const qword f_one       = cp_filone();
<span class="line-number">500</span>    const qword f_zero      = cp_filzero();
<span class="line-number">501</span>    const qword f_pi        = si_lqa( (intptr_t)&amp;_cp_f_pi  );
<span class="line-number">502</span>    const qword f_npi       = si_lqa( (intptr_t)&amp;_cp_f_npi );
<span class="line-number">503</span>
<span class="line-number">504</span>    // ##                        
<span class="line-number">505</span>    // ## yox = y
<span class="line-number">506</span>    // ##       -
<span class="line-number">507</span>    // ##       x
<span class="line-number">508</span>    // ##                        
<span class="line-number">509</span>    // ## z   = +PI + cp_atan( yox ) { ( x &lt;  0.0 ) &amp;&amp; ( y &gt;= 0.0 )
<span class="line-number">510</span>    // ##       -PI + cp_atan( yox ) { ( x &lt;  0.0 ) &amp;&amp; ( y &lt;  0.0 )
<span class="line-number">511</span>    // ##       0.0 + cp_atan( yox ) { otherwise
<span class="line-number">512</span>
<span class="line-number">513</span>    const qword x_ltz_mask  = si_fcgt( f_zero, x );
<span class="line-number">514</span>    const qword y_ltz_mask  = si_fcgt( f_zero, y );
<span class="line-number">515</span>    const qword xy_ltz_mask = si_and( x_ltz_mask, y_ltz_mask );
<span class="line-number">516</span>    const qword zadd0       = si_selb( f_zero, f_pi, x_ltz_mask );
<span class="line-number">517</span>    const qword zadd        = si_selb( zadd0, f_npi, xy_ltz_mask );
<span class="line-number">518</span>    const qword x_r0        = si_frest( x );
<span class="line-number">519</span>    const qword x_r1        = si_fnms( x_r0, x, f_one );
<span class="line-number">520</span>    const qword x_r         = si_fma( x_r1, x_r0, x_r0 );
<span class="line-number">521</span>    const qword yox         = si_fm( y, x_r );
<span class="line-number">522</span>    const qword atan_yox    = _cp_fatan( yox );
<span class="line-number">523</span>    const qword result      = si_fa( zadd, atan_yox );
<span class="line-number">524</span>
<span class="line-number">525</span>    return (result);
<span class="line-number">526</span>}
<span class="line-number">527</span>
<span id="cp_fatan2" class="line-number">528</span>vector float cp_fatan2( vector float arg0 /* y */, vector float arg1 /* x */ )
<span class="line-number">529</span>{
<span class="line-number">530</span>    const qword y           = (qword)arg0;
<span class="line-number">531</span>    const qword x           = (qword)arg1;
<span class="line-number">532</span>    const qword result      = _cp_fatan2( y, x );
<span class="line-number">533</span>
<span class="line-number">534</span>    return (vector float)(result);
<span class="line-number">535</span>}
<span class="line-number">536</span>
<span id="cp_fatan2_scalar" class="line-number">537</span>float cp_fatan2_scalar( float arg0 /* y */, float arg1 /* x */ )
<span class="line-number">538</span>{
<span class="line-number">539</span>    const qword y           = si_from_float( arg0 );
<span class="line-number">540</span>    const qword x           = si_from_float( arg1 );
<span class="line-number">541</span>    const qword z           = _cp_fatan2( y, x );
<span class="line-number">542</span>    const float result      = si_to_float( z );
<span class="line-number">543</span>
<span class="line-number">544</span>    return( result );
<span class="line-number">545</span>}
<span class="line-number">546</span>
<span class="line-number">547</span>#endif /* CP_FATAN_CBE_SPU_H */
</macton@gmail.com></div>
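As a rough scalar illustration of how _cp_fatan2 above picks the offset that is added to cp_atan( y / x ), the sketch below emulates the two selects on plain 32 bit integers. The helper names (selb_u32, fcgt_mask, atan2_offset_bits) are hypothetical stand-ins and not part of the SPU intrinsics; the bit patterns are the ones used for _cp_f_pi and _cp_f_npi in the listings. On an ordinary CPU the float compare may well compile to a branch, so this only models the data flow, not the performance.

```c
#include <assert.h>
#include <stdint.h>

// Bit patterns copied from the _cp_f_pi / _cp_f_npi constants.
#define F_ZERO_BITS 0x00000000u
#define F_PI_BITS   0x40490fdau
// The listing writes -PI as a negated hex literal; as an unsigned
// 32 bit value that is the +PI pattern with the sign bit set.
#define F_NPI_BITS  ((uint32_t)-0x3fb6f026)

// Hypothetical scalar stand-in for si_selb: bits of b where mask is 1,
// bits of a where mask is 0.
static inline uint32_t selb_u32(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

// Hypothetical scalar stand-in for si_fcgt: all ones when a > b, else zero.
static inline uint32_t fcgt_mask(float a, float b)
{
    return 0u - (uint32_t)(a > b);
}

// Mirrors the zadd0 / zadd selection in _cp_fatan2.
static inline uint32_t atan2_offset_bits(float y, float x)
{
    const uint32_t x_ltz_mask  = fcgt_mask(0.0f, x);
    const uint32_t y_ltz_mask  = fcgt_mask(0.0f, y);
    const uint32_t xy_ltz_mask = x_ltz_mask & y_ltz_mask;
    const uint32_t zadd0       = selb_u32(F_ZERO_BITS, F_PI_BITS, x_ltz_mask);
    const uint32_t zadd        = selb_u32(zadd0, F_NPI_BITS, xy_ltz_mask);

    return zadd;
}

static void atan2_offset_selfcheck(void)
{
    assert(F_NPI_BITS == 0xC0490FDAu);                      // sign bit set on the +PI pattern
    assert(atan2_offset_bits( 1.0f, -1.0f) == F_PI_BITS);   // x < 0, y >= 0
    assert(atan2_offset_bits(-1.0f, -1.0f) == F_NPI_BITS);  // x < 0, y <  0
    assert(atan2_offset_bits( 1.0f,  1.0f) == F_ZERO_BITS); // otherwise
}
```

The second select simply overwrites the first whenever both x and y are negative, which is why no "x negative and y non-negative" mask is needed.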
<div class="code">
<span class="line-number">  0</span>// ## cp_fatan-cbe-spu.c (C99)
<span class="line-number">  1</span>// ## Version 1.0
<span class="line-number">  2</span>// ##                        
<span class="line-number">  3</span>// ## Copyright (c) 2006 Mike Acton <macton@gmail.com>
<span class="line-number">  4</span>// ##                        
<span class="line-number">  5</span>// ## SIGNIFICANT REFERENCES:
<span class="line-number">  6</span>// ##                        
<span class="line-number">  7</span>// ##    [1] Cephes Math Library Release 2.8:  June, 2000
<span class="line-number">  8</span>// ##        Copyright 1984, 1995, 2000, Stephen L. Moshier
<span class="line-number">  9</span>// ##    [2] Numerical Computation Guide (PDF)
<span class="line-number"> 10</span>// ##        Copyright 2000, Sun Microsystems, Inc.
<span class="line-number"> 11</span>// ##    [3] IEEE 754 Support in C99 (PDF)
<span class="line-number"> 12</span>// ##        Copyright 2001, Jim Thomas
<span class="line-number"> 13</span>// ##    [4] Solaris 10 Reference Manual : atan2(3M)
<span class="line-number"> 14</span>// ##        Copyright 1994-2005, Sun Microsystems, Inc.
<span class="line-number"> 15</span>// ##                        
<span class="line-number"> 16</span>// ## Permission is hereby granted, free of charge, to any person obtaining
<span class="line-number"> 17</span>// ## a copy of this software and associated documentation files 
<span class="line-number"> 18</span>// ## (the "Software"), to deal in the Software without restriction, including
<span class="line-number"> 19</span>// ## without limitation the rights to use, copy, modify, merge, publish, 
<span class="line-number"> 20</span>// ## distribute, sublicense, and/or sell copies of the Software, and to permit
<span class="line-number"> 21</span>// ## persons to whom the Software is furnished to do so, subject to the 
<span class="line-number"> 22</span>// ## following conditions:
<span class="line-number"> 23</span>// ##                        
<span class="line-number"> 24</span>// ## The above copyright notice and this permission notice shall be included 
<span class="line-number"> 25</span>// ## in all copies or substantial portions of the Software.
<span class="line-number"> 26</span>// ##                        
<span class="line-number"> 27</span>// ## THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 
<span class="line-number"> 28</span>// ## OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
<span class="line-number"> 29</span>// ## FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 
<span class="line-number"> 30</span>// ## AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
<span class="line-number"> 31</span>// ## LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 
<span class="line-number"> 32</span>// ## OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 
<span class="line-number"> 33</span>// ## THE SOFTWARE.
<span class="line-number"> 34</span>// ##                        
<span class="line-number"> 35</span>
<span class="line-number"> 36</span>// Loading these constants from (global) SPU local memory is going to be a win over building them
<span class="line-number"> 37</span>// or storing them locally near the function.
<span class="line-number"> 38</span>
<span class="line-number"> 39</span>const vector unsigned int _cp_f_pio4            = {+0x3F490FDA,+0x3F490FDA,+0x3F490FDA,+0x3F490FDA};
<span class="line-number"> 40</span>const vector unsigned int _cp_f_t3p8            = {+0x401A8279,+0x401A8279,+0x401A8279,+0x401A8279};
<span class="line-number"> 41</span>const vector unsigned int _cp_f_npio2           = {-0x4036F026,-0x4036F026,-0x4036F026,-0x4036F026};
<span class="line-number"> 42</span>const vector unsigned int _cp_f_pio2            = {+0x3FC90FDA,+0x3FC90FDA,+0x3FC90FDA,+0x3FC90FDA};
<span class="line-number"> 43</span>const vector unsigned int _cp_f_pt66            = {+0x3F28F5C2,+0x3F28F5C2,+0x3F28F5C2,+0x3F28F5C2};
<span class="line-number"> 44</span>const vector unsigned int _cp_f_pi              = {+0x40490fda,+0x40490fda,+0x40490fda,+0x40490fda};
<span class="line-number"> 45</span>const vector unsigned int _cp_f_npi             = {-0x3fb6f026,-0x3fb6f026,-0x3fb6f026,-0x3fb6f026};
<span class="line-number"> 46</span>
<span class="line-number"> 47</span>const vector unsigned int _cp_f_atan_q4         = {+0x43428CF7,+0x43428CF7,+0x43428CF7,+0x43428CF7};
<span class="line-number"> 48</span>const vector unsigned int _cp_f_atan_q3         = {+0x43F2B1F8,+0x43F2B1F8,+0x43F2B1F8,+0x43F2B1F8};
<span class="line-number"> 49</span>const vector unsigned int _cp_f_atan_q2         = {+0x43D870C6,+0x43D870C6,+0x43D870C6,+0x43D870C6};
<span class="line-number"> 50</span>const vector unsigned int _cp_f_atan_q1         = {+0x432506EA,+0x432506EA,+0x432506EA,+0x432506EA};
<span class="line-number"> 51</span>const vector unsigned int _cp_f_atan_q0         = {+0x41C6DE22,+0x41C6DE22,+0x41C6DE22,+0x41C6DE22};
<span class="line-number"> 52</span>const vector unsigned int _cp_f_atan_p4         = {-0x3D7E4CB1,-0x3D7E4CB1,-0x3D7E4CB1,-0x3D7E4CB1};
<span class="line-number"> 53</span>const vector unsigned int _cp_f_atan_p3         = {-0x3D0A3A07,-0x3D0A3A07,-0x3D0A3A07,-0x3D0A3A07};
<span class="line-number"> 54</span>const vector unsigned int _cp_f_atan_p2         = {-0x3D69FB9F,-0x3D69FB9F,-0x3D69FB9F,-0x3D69FB9F};
<span class="line-number"> 55</span>const vector unsigned int _cp_f_atan_p1         = {-0x3E7EBD5E,-0x3E7EBD5E,-0x3E7EBD5E,-0x3E7EBD5E};
<span class="line-number"> 56</span>const vector unsigned int _cp_f_atan_p0         = {-0x409FFC03,-0x409FFC03,-0x409FFC03,-0x409FFC03};
<span class="line-number"> 57</span>const vector unsigned int _cp_f_hmorebits       = {+0x240D3131,+0x240D3131,+0x240D3131,+0x240D3131};
<span class="line-number"> 58</span>const vector unsigned int _cp_f_morebits        = {+0x248D3131,+0x248D3131,+0x248D3131,+0x248D3131};
<span class="line-number"> 59</span>
</macton@gmail.com></div>]]>
    </content>
</entry>

<entry>
    <title>Open Source and Console Games</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2006/08/open-source-and-console-games.html" />
    <id>tag:cellperformance.beyond3d.com,2006:/articles//3.6</id>

    <published>2006-08-10T05:54:17Z</published>
    <updated>2009-08-05T05:56:30Z</updated>

    <summary> On August 16, 2006 I participated in a panel discussion on Open Source and media as part of Digital Hollywood&apos;s Building Blocks 2006 conference. Here is the description of the panel [from digitalhollywood.com] The Open Source movement began during...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[<div class="sticky-note">
	On August 16, 2006 I participated in a <a href="http://www.digitalhollywood.com/%231BBlkSessions/BBWedFourWork.html">
	panel discussion on Open Source and media</a> as part of Digital Hollywood's <a href="http://www.digitalhollywood.com/BuildingBlocks.html">Building Blocks 2006</a>
	conference.<br />
	<br />
	Here is the description of the panel [from digitalhollywood.com]
	<div class="quote">
		The Open Source movement began during the dot.com rise with young companies
		developing great tools to deliver applications and services across
		multiple platforms. The consumer's appetite for new content driven
		experiences has expanded to include ways to view, manage, and share
		content across devices. With the changing landscape around the home,
		Open Source promises to power a new generation of applications running
		over today's high-speed networks and the systems used to create,
		manage, and distribute that content.<br /> 
		<br /> 
		Come join key leaders in the global electronics, online, and media 
		communities to discuss Open Source's definition, and learn how companies
		will create systems, infrastructure, and applications for the next 
		generation of the Consumer Entertainment Experience.
	</div><br />
	<br />
	For those of you who did not attend, I would like to take an
	opportunity to discuss here my personal opinions on these issues.
</div>
             ]]>
        <![CDATA[<div class="subtitle">Background</div>

From the description of the panel, some people might be led to believe that free and
open source software are new phenomena, somehow linked to the internet bubble. This
is definitely not true.<br />
<br />

Certainly the history of "free software" can be traced back much further than the
"dot com rise", and because much of the software we use (for example, GNU/Linux) is a mix 
of both "open" and "free" software, we should consider the larger context.<br />
<br />

By most accounts, this begins with Richard Stallman <a href="http://www.gnu.org/gnu/initial-announcement.html">announcing his plan</a>
to "free unix" on usenet in 1983. But of course, the distribution of source code
freely among programmers can be traced back much further than that.<br />
<br />

<div class="subtitle">Terminology</div>

No discussion of "open source" can be complete without drawing the subtle
distinction between "open source software" and "free software":<br />
<br />
See: <a href="http://www.gnu.org/philosophy/free-sw.html">Definition of "free software"</a><br />
See: <a href="http://www.opensource.org/docs/definition_plain.php">Definition of "open source software"</a>
<br />
Here's an article which tries to <a href="http://www.itworld.com/AppDev/350/LWD010523vcontrol4/">clarify the differences.</a><br />
<br />
There has been some muddying of the waters by Microsoft's relatively recent 
<a href="http://en.wikipedia.org/wiki/Shared_source=">"shared source"</a> initiative
(but there is general agreement that this is neither really "open" nor "free", and
so it is not part of this discussion).<br />
<br />

<div class="subtitle">Licenses</div>

There are very many <a href="http://www.opensource.org/licenses/">open source licenses</a> 
and one canonical <a href="http://www.gnu.org/copyleft/gpl.html">free software license</a>.
Many companies (including IBM, SGI, Apple, ...) have also produced their own variants.<br />
<br />

I think the applicability of a license to a product merits some discussion here. 
For example, in my field (console video games), we are restricted by NDAs with the 
platform owners (Sony, Nintendo, etc.) and cannot release specific details to the public. 
This necessarily limits our choices when using "open source software" and nearly 
eliminates "free software" as an option, as we often cannot release our 
modifications back to the public.<br />
<br />
I do not see this as a problem, nor as a challenge to be overcome. Authors of free
software are often willing to distribute their software without cost so that 
others may take advantage of the work that they've done, and they only ask for
one thing in return - that the software remain free. Just as when a
middleware vendor charges half a million dollars more than you're willing
to pay, if the price for the software is too steep, then something else must
be used instead. I wholeheartedly respect the work of the FSF 
(<a href="http://www.fsf.org/">Free Software Foundation</a>), but I understand
that the practical nature of our business makes it very difficult to directly
use the products of their hard work, and the work of so many other free software
developers, in our own products.<br />
<br />
Free software definitely has its place in game development, however. I do most
of my work on GNU/Linux desktops and GCC is my compiler of choice. Additionally, many
offline tools used directly or indirectly to develop the games themselves are
based on free or open software, and I'm grateful that those tools exist.<br />
<br />

<div class="subtitle">Reciprocity</div>

I think reciprocity is the most important thing we can be discussing in the 
context of open source software and console games. It cannot simply be a 
matter of "how open source benefits us", but we must also discuss "how we can
participate in the open source community" and what responsibilities we have
for doing so.<br />
<br />

The free and open source software which we gladly take advantage of (if not in the games 
themselves, then certainly in the tools that develop them) can be thought
of as the proverbial "shoulders of giants". When we forget what gave us the 
advantages that got us where we are, we do a disservice to ourselves and the health of 
our industry, and thus ultimately a disservice to our shareholders and customers.<br />
<br />

I think Yahoo Search's vision statement applies equally well to the role of open source
software:<br />
<br />
"Enable people to <b>find</b>, <b>use</b>, <b>share</b> and <b>expand</b> all human knowledge"<br />
<br />

To share and contribute not only benefits us now, but will continue to benefit us when our 
current products are forgotten and dusty.<br />
<br />

<div class="subtitle">Cost of Openness</div>

There is an ongoing debate on the cost of sharing your work with the world. Perhaps there
will be a higher cost in support when calls and emails arrive from users that have configured
the software in some strange environment. Maybe it will give competitors an edge when they
can clearly read the "secrets" of your product in the source code. Most arguments, including
these, are never really so much about the costs involved (consider how many millions of dollars
are spent developing the typical console title) but rather question the value of sharing, i.e. the return
on the investment. <br />
<br />  

<div class="sticky-note">
Consider this: The console game industry is a fast-moving industry. Consoles
change, methods change and even the developers themselves change rapidly and constantly. Success of
a title is usually determined by the quality of the content, not the engine that drives it, although occasionally
the field of successful titles is punctuated by technical achievement. But if competitors need access to the
source of a successful product in order to become successful themselves, <i>they are already behind</i>, and
no amount of access will allow them to gain on the continued developments of the leaders. And if it does help
to make their product a little better, that's a good thing - good games are good for the platform, and what's good
for the platform is good for developers wanting to sell their games on that platform.
</div>

<b>The value of openness is in the people, not the source code.</b><br />
<br />   
<ul><li><b>Invest in the future.</b> The programmers reading, modifying and commenting on the source may 
belong to the next generation of coders in the industry. Help them learn by providing examples of real-world
challenges and their solutions.</li><li><b>Invest in your team.</b> The best way to learn is to teach. Simply by explaining what they've done, 
programmers will come up with new ideas and find areas that they've missed. This is no minor point - a 
studio's value is in its people and since there are very few traditional training courses for the professional developer,
a good studio must find different ways of helping make those developers better each day at what they do.
</li></ul>

<div class="subtitle">Call to Arms</div>
 
Electronic Arts made a considerable difference to not only games but to many different industries when they released
the <a href="http://www.szonye.com/bradd/iff.html">EA IFF 85</a> Standard for Interchange Format Files. And it is
in that tradition, almost twenty-two years later, that I hope game developers, studios and publishers will redouble their
efforts to share what they have created and learned with the community. id Software, the modern poster child for
sharing their technology, certainly hasn't lost anything by releasing some of their older sources.<br />
<br />  

Start small - a function, a snippet even. But if we make it a habit, we will all be rewarded. <br />
<br />  

<div class="sticky-note">
Has your studio released something into the wild? Tell me about it and I will happily list it here.
</div>


            ]]>
    </content>
</entry>

<entry>
    <title>Branch-free implementation of half-precision (16 bit) floating point</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2006/07/update-19-july-06-added.html" />
    <id>tag:cellperformance.beyond3d.com,2009:/articles//3.10</id>

    <published>2006-07-18T06:19:16Z</published>
    <updated>2009-08-05T06:22:24Z</updated>

    <summary>Update! (19 July 06) Added Multiply. Fixed a problem with using __builtin_clz(). Update! (17 July 06) The code has been considerably refactored. Decided to go with single function per expression. The expressions have been reduced as a first optimization pass....</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[<div class="sticky-note">Update! (19 July 06) Added Multiply. Fixed a problem with using __builtin_clz().</div>
<div class="sticky-note">Update! (17 July 06) The code has been
considerably refactored. Decided to go with single function per
expression. The expressions have been reduced as a first optimization
pass.</div>

<div class="subtitle">Project</div>
The goal of this project is to serve as an example of developing some
relatively complex operations completely without branches - a software
implementation of half-precision floating point numbers (one that does not
use floating point hardware). This example should echo the IEEE 754
standard for floating point numbers as closely as reasonable, including
support for +/- INF, QNaN, SNaN, and denormalized numbers. However,
exceptions will not be implemented.<br />
<br />
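To give a feel for what "completely without branches" means here, the following is a minimal sketch of classifying a 16 bit value as INF, NaN or denormalized using only masks. It is not taken from half.c; the function names are hypothetical, and it assumes the IEEE 754 binary16 layout of 1 sign bit, 5 exponent bits and 10 mantissa bits.

```c
#include <assert.h>
#include <stdint.h>

// Assumed layout (IEEE 754 binary16): 1 sign bit, 5 exponent bits, 10 mantissa bits.
#define HALF_EXP_MASK 0x7C00u
#define HALF_MAN_MASK 0x03FFu

// All ones when x != 0, all zeros when x == 0. No compare, no branch:
// for any non-zero x, ( x | -x ) has its top bit set.
static inline uint32_t mask_nz_u32(uint32_t x)
{
    return 0u - ((x | (0u - x)) >> 31);
}

// All ones when the exponent field is all ones (INF or NaN).
static inline uint32_t half_exp_max_mask(uint16_t h)
{
    return ~mask_nz_u32((h & HALF_EXP_MASK) ^ HALF_EXP_MASK);
}

// INF: exponent all ones, mantissa zero.
static inline uint32_t half_is_inf_mask(uint16_t h)
{
    return half_exp_max_mask(h) & ~mask_nz_u32(h & HALF_MAN_MASK);
}

// NaN (quiet or signalling): exponent all ones, mantissa non-zero.
static inline uint32_t half_is_nan_mask(uint16_t h)
{
    return half_exp_max_mask(h) & mask_nz_u32(h & HALF_MAN_MASK);
}

// Denormalized: exponent zero, mantissa non-zero.
static inline uint32_t half_is_denorm_mask(uint16_t h)
{
    return ~mask_nz_u32(h & HALF_EXP_MASK) & mask_nz_u32(h & HALF_MAN_MASK);
}

static void half_classify_selfcheck(void)
{
    assert(half_is_inf_mask(0x7C00) == 0xFFFFFFFFu);    // +INF
    assert(half_is_inf_mask(0xFC00) == 0xFFFFFFFFu);    // -INF
    assert(half_is_nan_mask(0x7E00) == 0xFFFFFFFFu);    // a NaN pattern
    assert(half_is_nan_mask(0x7C00) == 0u);             // INF is not NaN
    assert(half_is_denorm_mask(0x0001) == 0xFFFFFFFFu); // smallest denormal
    assert(half_is_denorm_mask(0x3C00) == 0u);          // 1.0 is normal
}
```

Each result is a full-width mask rather than a boolean, so it can be fed straight into an and/or select of the kind used for the rest of the arithmetic.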
Half-precision floats are used in cases where neither the range nor the
precision of 32 bit floating point numbers is needed, but where some
dynamic precision is required. Two common uses are image
transformation, where the range of each component (e.g. red, green,
blue, alpha) is typically limited to or near [0.0,1.0], and vertex data
(e.g. position, texture coordinates, color values, etc.).<br />
<br />
The main advantage of half-precision floats is their size. Beyond the
considerable potential for memory savings, processing a large number of
half-precision values is more cache-friendly than using 32 bit values.<br /><br />The current released version (including tests) can be downloaded here: <a href="http://cellperformance-snippets.googlecode.com/files/half.c">half.c</a> <a href="http://cellperformance-snippets.googlecode.com/files/half.h">half.h</a><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>Increment And Decrement Wrapping Values</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2006/07/increment-and-decrement-wrapping-values.html" />
    <id>tag:cellperformance.beyond3d.com,2006:/articles//3.14</id>

    <published>2006-07-11T07:10:07Z</published>
    <updated>2009-08-05T07:11:43Z</updated>

    <summary>Small code, big impact Occasionally you have a set of values that you want to wrap around as you increment and decrement them. For example, in a GUI where the user keys right or left and you want to wrap...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[<div class="subtitle">Small code, big impact</div>
Occasionally you have a set of values that you want to wrap around as
you increment and decrement them. For example, in a GUI menu where the
user keys right or left, you want the selection to wrap around at either end.<br />
<br />
A typical implementation:
<div class="code">
static inline int wrap_inc( int value, int min, int max )
{
  return ( value == max ) ? min : value + 1;
}

static inline int wrap_dec( int value, int min, int max )
{
  return ( value == min ) ? max : value - 1;
}
</div>
<br />
But on processors (such as the PowerPC) where compare and branch is
very costly, these small one-liners can have a significant impact on
performance when used in critical code. They also make it harder for
the compiler to optimize the surrounding code. ]]>
        <![CDATA[<div class="subtitle">Breakdown</div>

Store the desired result:
<div class="code">
const type result_inc   = val + 1;
</div>
This value may overflow if val == INT(SIZE)_MAX, but in that case val == max (for any val within [min, max]), so min is selected and the overflowed value is discarded.<br />
<br />

Get the difference between the max (or min) value and the current value:
<div class="code">
const type max_diff     = max - val;
</div>
All that matters is whether this value is zero. If it's zero, we
know we are at the max (or min) value; otherwise we can increment (or
decrement).<br />
<br />

Create a mask based on the difference:
<div class="code">
const type max_diff_nz  = (type)( (stype)( max_diff | -max_diff ) &gt;&gt; bit_mask );
</div>
i.e. 
<div class="code">
max_diff_nz = ( max_diff != 0 ) ? (type)-1 : (type)0;
</div>
(Remember that -1 is all bits on in two's complement.)<br />
<br />
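The mask step is the least obvious one, so here is a small sketch that checks it exhaustively for 8 bit values. The function name nz_mask_u8 is made up for this example. Like the expression above, it assumes that right-shifting a negative signed value is an arithmetic shift, which holds for GCC and Clang but is implementation-defined in C99.

```c
#include <assert.h>
#include <stdint.h>

// 0xFF when x != 0, 0x00 when x == 0.
// For any non-zero x, ( x | -x ) has its sign bit set; the arithmetic
// shift then copies that bit into every position.
// Assumption: >> on a negative signed value is an arithmetic shift.
static inline uint8_t nz_mask_u8(uint8_t x)
{
    return (uint8_t)((int8_t)(x | -x) >> 7);
}

// Compare against the obvious compare-and-select form for every input.
static void nz_mask_u8_selfcheck(void)
{
    for (unsigned v = 0; v < 256; ++v)
    {
        const uint8_t expected = (v != 0) ? 0xFF : 0x00;
        assert(nz_mask_u8((uint8_t)v) == expected);
    }
}
```

The shift count is always the index of the sign bit (7, 15, 31 or 63), which is what the bit_mask parameter carries in the final code.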

Complement the mask:
<div class="code">
const type max_diff_eqz = ~max_diff_nz;
</div>
<br />

Select the correct result based on the masks:
<div class="code">
const type result       = ( result_inc &amp; max_diff_nz ) | ( min &amp; max_diff_eqz );
</div>
Only one of the two values can possibly be selected.<br />
i.e. 
<div class="code">
result = ( val == max ) ? min : val + 1;
</div>
<br />
<br />
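Putting the steps together for a single type, the sketch below writes out 32 bit unsigned versions and checks them against the compare-and-select form. The _ref and _bf suffixes are only labels for this example. As above, it assumes an arithmetic right shift for negative signed values.

```c
#include <assert.h>
#include <stdint.h>

// Reference versions, equivalent to the typical implementation.
static inline uint32_t wrap_inc_ref(uint32_t val, uint32_t min, uint32_t max)
{
    return (val == max) ? min : val + 1;
}

static inline uint32_t wrap_dec_ref(uint32_t val, uint32_t min, uint32_t max)
{
    return (val == min) ? max : val - 1;
}

// Branch-free versions, one line per step of the breakdown.
// Assumption: >> on a negative int32_t is an arithmetic shift.
static inline uint32_t wrap_inc_bf(uint32_t val, uint32_t min, uint32_t max)
{
    const uint32_t result_inc   = val + 1;
    const uint32_t max_diff     = max - val;
    const uint32_t max_diff_nz  = (uint32_t)((int32_t)(max_diff | (0u - max_diff)) >> 31);
    const uint32_t max_diff_eqz = ~max_diff_nz;

    return (result_inc & max_diff_nz) | (min & max_diff_eqz);
}

static inline uint32_t wrap_dec_bf(uint32_t val, uint32_t min, uint32_t max)
{
    const uint32_t result_dec   = val - 1;
    const uint32_t min_diff     = min - val;
    const uint32_t min_diff_nz  = (uint32_t)((int32_t)(min_diff | (0u - min_diff)) >> 31);
    const uint32_t min_diff_eqz = ~min_diff_nz;

    return (result_dec & min_diff_nz) | (max & min_diff_eqz);
}

static void wrap_selfcheck(void)
{
    const uint32_t min = 2;
    const uint32_t max = 6;

    for (uint32_t v = min; v <= max; ++v)
    {
        assert(wrap_inc_bf(v, min, max) == wrap_inc_ref(v, min, max));
        assert(wrap_dec_bf(v, min, max) == wrap_dec_ref(v, min, max));
    }

    // val + 1 overflows here, but the mask selects min instead.
    assert(wrap_inc_bf(UINT32_MAX, 5, UINT32_MAX) == 5);
}
```

Both versions expect val to already lie within [min, max]; neither clamps an out-of-range input.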

<div class="subtitle">Final Code</div>
<div class="code">
//
// wrap_int.h
//
 
#ifndef WRAP_INT_H
#define WRAP_INT_H

#include &lt;stdint.h&gt;

//
// Increment wrapping value      
//
// val = { ( val == max ), min
//     = { otherwise,      val + 1
//
// uint8_t  wrap_inc_u8 ( const uint8_t  val, const uint8_t  min, const uint8_t  max );
// uint16_t wrap_inc_u16( const uint16_t val, const uint16_t min, const uint16_t max );
// uint32_t wrap_inc_u32( const uint32_t val, const uint32_t min, const uint32_t max );
// uint64_t wrap_inc_u64( const uint64_t val, const uint64_t min, const uint64_t max );
// int8_t   wrap_inc_s8 ( const int8_t   val, const int8_t   min, const int8_t   max );
// int16_t  wrap_inc_s16( const int16_t  val, const int16_t  min, const int16_t  max );
// int32_t  wrap_inc_s32( const int32_t  val, const int32_t  min, const int32_t  max );
// int64_t  wrap_inc_s64( const int64_t  val, const int64_t  min, const int64_t  max );

#define DECL_WRAP_INC( type_name, type, stype, bit_mask )                                  \
static inline type wrap_inc_##type_name( const type val, const type min, const type max )  \
{                                                                                          \
  const type result_inc   = val + 1;                                                       \
  const type max_diff     = max - val;                                                     \
  const type max_diff_nz  = (type)( (stype)( max_diff | -max_diff ) &gt;&gt; bit_mask );         \
  const type max_diff_eqz = ~max_diff_nz;                                                  \
  const type result       = ( result_inc &amp; max_diff_nz ) | ( min &amp; max_diff_eqz );         \
                                                                                           \
  return (result);                                                                         \
}

DECL_WRAP_INC( u8,  uint8_t,  int8_t,  7  ); 
DECL_WRAP_INC( u16, uint16_t, int16_t, 15 ); 
DECL_WRAP_INC( u32, uint32_t, int32_t, 31 ); 
DECL_WRAP_INC( u64, uint64_t, int64_t, 63 ); 
DECL_WRAP_INC( s8,  int8_t,   int8_t,  7  ); 
DECL_WRAP_INC( s16, int16_t,  int16_t, 15 ); 
DECL_WRAP_INC( s32, int32_t,  int32_t, 31 ); 
DECL_WRAP_INC( s64, int64_t,  int64_t, 63 ); 

//
// Decrement wrapping value      
//
// val = { ( val == min ), max
//     = { otherwise,      val - 1
//
// uint8_t  wrap_dec_u8 ( const uint8_t  val, const uint8_t  min, const uint8_t  max );
// uint16_t wrap_dec_u16( const uint16_t val, const uint16_t min, const uint16_t max );
// uint32_t wrap_dec_u32( const uint32_t val, const uint32_t min, const uint32_t max );
// uint64_t wrap_dec_u64( const uint64_t val, const uint64_t min, const uint64_t max );
// int8_t   wrap_dec_s8 ( const int8_t   val, const int8_t   min, const int8_t   max );
// int16_t  wrap_dec_s16( const int16_t  val, const int16_t  min, const int16_t  max );
// int32_t  wrap_dec_s32( const int32_t  val, const int32_t  min, const int32_t  max );
// int64_t  wrap_dec_s64( const int64_t  val, const int64_t  min, const int64_t  max );

#define DECL_WRAP_DEC( type_name, type, stype, bit_mask )                                  \
static inline type wrap_dec_##type_name( const type val, const type min, const type max )  \
{                                                                                          \
  const type result_dec   = val - 1;                                                       \
  const type min_diff     = min - val;                                                     \
  const type min_diff_nz  = (type)( (stype)( min_diff | -min_diff ) &gt;&gt; bit_mask );         \
  const type min_diff_eqz = ~min_diff_nz;                                                  \
  const type result       = ( result_dec &amp; min_diff_nz ) | ( max &amp; min_diff_eqz );         \
                                                                                           \
  return (result);                                                                         \
}

DECL_WRAP_DEC( u8,  uint8_t,  int8_t,  7  ); 
DECL_WRAP_DEC( u16, uint16_t, int16_t, 15 ); 
DECL_WRAP_DEC( u32, uint32_t, int32_t, 31 ); 
DECL_WRAP_DEC( u64, uint64_t, int64_t, 63 ); 
DECL_WRAP_DEC( s8,  int8_t,   int8_t,  7  ); 
DECL_WRAP_DEC( s16, int16_t,  int16_t, 15 ); 
DECL_WRAP_DEC( s32, int32_t,  int32_t, 31 ); 
DECL_WRAP_DEC( s64, int64_t,  int64_t, 63 ); 

#endif /* #ifndef WRAP_INT_H */
</div>]]>
    </content>
</entry>

<entry>
    <title>Box Overlap</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2006/06/box-overlap.html" />
    <id>tag:cellperformance.beyond3d.com,2006:/articles//3.22</id>

    <published>2006-06-19T05:18:02Z</published>
    <updated>2009-08-06T05:19:53Z</updated>

    <summary>Interactive 3D applications frequently need to check whether one geometric object overlaps another. In this article, we&apos;ll look at a function to test for overlap between 3D boxes, and we&apos;ll show how to optimize this function for the CBE....</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        Interactive 3D applications frequently need to check whether one
geometric object overlaps another. In this article, we&apos;ll look at a
function to test for overlap between 3D boxes, and we&apos;ll show how to
optimize this function for the CBE. 
        <![CDATA[<div class="subtitle">Before Optimization</div>
Let's start with the example below, which is similar to solid-2.5.4/src/complex/DT_CBox.h in the SOLID library from <a href="http://dtecta.com/">Dtecta</a>,
but doesn't need to include a bunch of other stuff.<br />

<pre class="code">#include &lt;math.h&gt;<br />#include &lt;stdint.h&gt;<br /><br />struct Vector3<br />{<br />    float m_co[3];<br /><br />    Vector3() {}<br /><br />    Vector3(const float&amp; x, const float&amp; y, const float&amp; z)<br />    {<br />      m_co[0] = x;<br />      m_co[1] = y;<br />      m_co[2] = z;<br />    }<br /><br />    float&amp;       operator[]( int i )       { return m_co[i]; }<br />    const float&amp; operator[]( int i ) const { return m_co[i]; }<br /><br />    Vector3&amp; operator-=(const Vector3&amp; v)<br />    {<br />      this-&gt;m_co[0] -= v.m_co[0];<br />      this-&gt;m_co[1] -= v.m_co[1];<br />      this-&gt;m_co[2] -= v.m_co[2];<br /><br />      return (*this);<br />    }<br /><br />    Vector3&amp; operator+=(const Vector3&amp; v)<br />    {<br />      this-&gt;m_co[0] += v.m_co[0];<br />      this-&gt;m_co[1] += v.m_co[1];<br />      this-&gt;m_co[2] += v.m_co[2];<br /><br />      return (*this);<br />    }<br />};<br /><br />struct Box<br />{<br />    Vector3 m_center;<br />    Vector3 m_extent;<br /><br />    Box() {}<br /><br />    Box(const Vector3&amp; center, const Vector3&amp; extent) <br />      : m_center(center),<br />        m_extent(extent)<br />    {}<br /><br />    bool overlaps(const Box&amp; b) const<br />    {<br />        return ::fabs(m_center[0] - b.m_center[0]) &lt;= m_extent[0] + b.m_extent[0] &amp;&amp;<br />               ::fabs(m_center[1] - b.m_center[1]) &lt;= m_extent[1] + b.m_extent[1] &amp;&amp;<br />               ::fabs(m_center[2] - b.m_center[2]) &lt;= m_extent[2] + b.m_extent[2];<br />    }<br />};<br /><br />bool<br />test_overlap( const Box&amp; a, const Box&amp; b )<br />{<br />  return a.overlaps( b );<br />}<br /></pre>
We'll be looking at the compiler output for the test_overlap()
function. I made the data public, since it makes the optimization pass
much simpler.
I'm not going to debate what makes "better" C++ code. We're just
talking about what's <i>faster</i> here.<br />
<br />
Nothing is <i>obviously</i> wrong (i.e. slow) with this code.
Everything is inline, so we shouldn't expect any unneeded jumps for such small functions.
There's only one reference to each element, so we shouldn't expect any extra loads.<br />
<br />
But if we look at the compiler output, we see that almost every operation stalls waiting for the operands to load.<br />
<pre class="code">_Z12test_overlapRK3BoxS1_:<br />    stwu 1,-16(1)<br />    lfs 3,0(3)    -- LOAD(3)<br />    li 0,0<br />    lfs 11,0(4)   -- LOAD(11)<br />    fsubs 1,3,11  -- WAIT FOR LOAD(3), LOAD(11)<br />    lfs 2,16(3)   -- LOAD(2)<br />    lfs 0,16(4)   -- LOAD(0)<br />    fadds 12,2,0  -- WAIT FOR LOAD(2), LOAD(0)<br />    fabs 13,1<br />    fcmpu 7,13,12<br />    bgt- 7,.L2<br />    lfs 9,4(3)    -- LOAD(9)<br />    lfs 10,4(4)   -- LOAD(10)<br />    fsubs 6,9,10  -- WAIT FOR LOAD(9), LOAD(10)<br />    lfs 7,20(3)   -- LOAD(7)<br />    lfs 8,20(4)   -- LOAD(8)<br />    fadds 4,7,8   -- WAIT FOR LOAD(7), LOAD(8)<br />    fabs 5,6<br />    fcmpu 0,5,4<br />    bgt- 0,.L2<br />    lfs 0,8(3)    -- LOAD(0)<br />    lfs 11,8(4)   -- LOAD(11)<br />    fsubs 1,0,11  -- WAIT FOR LOAD(0), LOAD(11)<br />    lfs 2,24(3)   -- LOAD(2)<br />    lfs 3,24(4)   -- LOAD(3)<br />    fadds 13,2,3  -- WAIT FOR LOAD(2), LOAD(3)<br />    fabs 12,1<br />    fcmpu 1,12,13<br />    bgt- 1,.L2<br />    li 0,1<br />.L2:<br />    mr 3,0<br />    addi 1,1,16<br />    blr<br /></pre>
The compiler has built the dependency graph around the branches. You
might think we benefit by branching out immediately when we find a case
that fails, since we skip
the subsequent loads. But this turns out to be a bad idea for the
following reasons:<br />
<br />
1. Operands that are adjacent in memory probably lie on the same cache
line. The PPE is dual-threaded, so the longer the delay between loads of
adjacent operands, the greater the chance that the other thread (or an
interrupt) evicts the cache line before we're done with it.<br />
2. The compiler has used "bgt-", meaning the branches are statically predicted <i>unlikely</i>. It doesn't make much sense for the compiler to hide loads behind
unlikely branches.<br />
<div class="subtitle">Separating Loads From Calculations</div>
What we want to do is queue up the loads as deep as we can before we start doing any calculations.<br />
<div class="rule-of-thumb">
Don't use class or struct fields (or array elements) directly in calculations. Always follow this pattern:<br />
1. Load everything you need into local variables of native types.<br />
2. Do all your calculations.<br />
3. Store your final result, while trying to avoid branches.<br />
</div>
Here's the second version of the overlaps() method:<br />
<pre class="code">    bool overlaps(const Box&amp; b) const<br />    {<br />      // <br />      // LOADS<br />      // <br /><br />      const float a_c0 = m_center[0];<br />      const float a_c1 = m_center[1];<br />      const float a_c2 = m_center[2];<br />      const float a_e0 = m_extent[0];<br />      const float a_e1 = m_extent[1];<br />      const float a_e2 = m_extent[2];<br />      const float b_c0 = b.m_center[0];<br />      const float b_c1 = b.m_center[1];<br />      const float b_c2 = b.m_center[2];<br />      const float b_e0 = b.m_extent[0];<br />      const float b_e1 = b.m_extent[1];<br />      const float b_e2 = b.m_extent[2];<br /><br />      // <br />      // CALCULATIONS<br />      // <br /><br />      const float delta_c0     = a_c0 - b_c0;<br />      const float delta_c1     = a_c1 - b_c1;<br />      const float delta_c2     = a_c2 - b_c2;<br />      const float abs_delta_c0 = ::fabs( delta_c0 );<br />      const float abs_delta_c1 = ::fabs( delta_c1 );<br />      const float abs_delta_c2 = ::fabs( delta_c2 );<br />      const float sum_e0       = a_e0 + b_e0;<br />      const float sum_e1       = a_e1 + b_e1;<br />      const float sum_e2       = a_e2 + b_e2;<br /><br />      // <br />      // COMPARES AND BRANCHES<br />      // <br /><br />      const bool  in_0     = abs_delta_c0 &lt;= sum_e0;<br />      const bool  in_1     = abs_delta_c1 &lt;= sum_e1;<br />      const bool  in_2     = abs_delta_c2 &lt;= sum_e2;<br />      const bool  result   = in_0 &amp;&amp; in_1 &amp;&amp; in_2;<br /><br />      return (result);<br />    }<br /></pre>
The results are not much better at this point. The compiler reorders
things, doing each subtraction as soon as the needed operands are
loaded.<br />
<pre class="code">_Z12test_overlapRK3BoxS1_:<br />    stwu 1,-16(1)<br />    lfs 9,4(3)<br />    li 9,0<br />    lfs 0,4(4)<br />    fsubs 2,9,0<br />    lfs 8,0(3)<br />    lfs 1,0(4)<br />    fsubs 5,8,1<br />    lfs 10,8(3)<br />    lfs 12,8(4)<br />    fsubs 1,10,12<br />    lfs 3,16(3)<br />    lfs 13,16(4)<br />    fadds 0,3,13<br />    fabs 9,2<br />    lfs 11,12(4)<br />    lfs 6,12(3)<br />    fadds 12,6,11<br />    lfs 4,20(3)<br />    lfs 7,20(4)<br />    fabs 8,5<br />    fadds 11,4,7<br />    fabs 10,1<br />    fcmpu 7,9,0<br />    crnot 30,29<br />    mfcr 0<br />    rlwinm 0,0,31,1<br />    fcmpu 1,8,12<br />    fcmpu 6,10,11<br />    crnot 26,25<br />    cmpwi 7,0,0<br />    mfcr 0<br />    rlwinm 0,0,27,1<br />    bgt- 1,.L14<br />    cmpwi 6,0,0<br />    beq- 7,.L14<br />    beq- 6,.L14<br />    li 9,1<br />.L14:<br />    mr 3,9<br />    addi 1,1,16<br />    blr<br /></pre>
We'll use a trick to prevent the compiler from mixing loads and calculations.  First we define this macro:<br />
<pre class="code">#define GCC_SPLIT_BLOCK __asm__ ("");<br /></pre>
An empty inline assembly statement doesn't add any code, but it splits
the basic block, forcing the compiler to schedule the code on either
side separately.
We'll add this macro after the loads but before the calculations. We'll
also add it just after the calculations so it's easier to see what's
happening,
but this second split isn't really important for optimization.<br />
<br />
Here's the third version of the overlaps() method:<br />
<pre class="code">    bool overlaps(const Box&amp; b) const<br />    {<br />      // <br />      // LOADS<br />      // <br /><br />      const float a_c0 = m_center[0];<br />      const float a_c1 = m_center[1];<br />      const float a_c2 = m_center[2];<br />      const float a_e0 = m_extent[0];<br />      const float a_e1 = m_extent[1];<br />      const float a_e2 = m_extent[2];<br />      const float b_c0 = b.m_center[0];<br />      const float b_c1 = b.m_center[1];<br />      const float b_c2 = b.m_center[2];<br />      const float b_e0 = b.m_extent[0];<br />      const float b_e1 = b.m_extent[1];<br />      const float b_e2 = b.m_extent[2];<br /><br />      GCC_SPLIT_BLOCK<br /><br />      // <br />      // CALCULATIONS<br />      // <br /><br />      const float delta_c0     = a_c0 - b_c0;<br />      const float delta_c1     = a_c1 - b_c1;<br />      const float delta_c2     = a_c2 - b_c2;<br />      const float abs_delta_c0 = ::fabs( delta_c0 );<br />      const float abs_delta_c1 = ::fabs( delta_c1 );<br />      const float abs_delta_c2 = ::fabs( delta_c2 );<br />      const float sum_e0       = a_e0 + b_e0;<br />      const float sum_e1       = a_e1 + b_e1;<br />      const float sum_e2       = a_e2 + b_e2;<br /><br />      GCC_SPLIT_BLOCK<br /><br />      // <br />      // COMPARES AND BRANCHES<br />      // <br /><br />      const bool  in_0     = abs_delta_c0 &lt;= sum_e0;<br />      const bool  in_1     = abs_delta_c1 &lt;= sum_e1;<br />      const bool  in_2     = abs_delta_c2 &lt;= sum_e2;<br />      const bool  result   = in_0 &amp;&amp; in_1 &amp;&amp; in_2;<br /><br />      return (result);<br />    }<br /></pre>
The new output clearly shows that the code was scheduled on either side of the splits.<br />
<pre class="code">_Z12test_overlapRK3BoxS1_:<br />    //<br />    // PUSH STACK<br />    //<br /><br />    stwu 1,-16(1)<br /><br />    //<br />    // LOADS<br />    //<br /><br />    lfs 4,20(3)<br />    lfs 3,20(4)<br />    lfs 1,0(3)<br />    lfs 13,4(3)<br />    lfs 12,8(3)<br />    lfs 11,12(3)<br />    lfs 10,16(3)<br />    lfs 9,0(4)<br />    lfs 8,4(4)<br />    lfs 7,8(4)<br />    lfs 6,12(4)<br />    lfs 5,16(4)<br /><br />    //<br />    // CALCULATIONS<br />    //<br /><br />    fsubs 0,1,9<br />    fsubs 2,13,8<br />    fsubs 1,12,7<br />    fadds 11,11,6<br />    fadds 10,10,5<br />    fadds 4,4,3<br />    fabs 0,0<br />    fabs 13,2<br />    fabs 12,1<br /><br />    //<br />    // COMPARES AND BRANCHES<br />    //<br /><br />    fcmpu 7,13,10<br />    li 3,0<br />    crnot 30,29<br />    fcmpu 1,0,11<br />    mfcr 0<br />    rlwinm 0,0,31,1<br />    fcmpu 6,12,4<br />    crnot 26,25<br />    cmpwi 7,0,0<br />    mfcr 0<br />    rlwinm 0,0,27,1<br />    bgt- 1,.L14<br />    cmpwi 6,0,0<br />    beq- 7,.L14<br />    beq- 6,.L14<br />    li 3,1<br /><br />    //<br />    // POP STACK AND RETURN<br />    //<br />.L14:<br />    addi 1,1,16<br />    blr<br /></pre>
We still have 3 branches and a lot of compares.  Let's see what we can do about that.<br />
<div class="subtitle">Removing Branches</div>
On the CBE (as on most pipelined architectures), it's good to reduce
or eliminate branches where possible. In this case, we can use the <i>fsel</i>
instruction to replace a compare and branch. This is an optional
PowerPC instruction, but the PPU implements it. Unfortunately, the
compiler doesn't generate fsel instructions for the PPU, so we'll have to
emit it ourselves with inline assembly:<br />
<pre class="code">static inline float ppc_fsels( const float fra, const float frc, const float frb ) <br />{<br />    float frt;<br /><br />    // From: http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixassem/alangref/fsel.htm<br />    //     The double-precision floating-point operand in floating-point register (FPR) FRA <br />    //     is compared with the value zero. If the value in FRA is greater than or equal to <br />    //     zero, floating point register FRT is set to the contents of floating-point <br />    //     register FRC. If the value in FRA is less than zero or is a NaN, floating point <br />    //     register FRT is set to the contents of floating-point register FRB. The comparison <br />    //     ignores the sign of zero; both +0 and -0 are equal to zero. <br />    //     <br />    // i.e. frt = ( fra &gt;= 0.0 ) ? frc : frb;<br />    //     <br />    __asm__( "fsel %0, %1, %2, %3" : "=f"(frt) : "f"(fra), "f"(frc), "f"(frb) );<br /><br />    return (frt);<br />}<br /></pre>
Now let's focus on the compares and branches portion of the method:<br />
<pre class="code">      const bool  in_0     = abs_delta_c0 &lt;= sum_e0;<br />      const bool  in_1     = abs_delta_c1 &lt;= sum_e1;<br />      const bool  in_2     = abs_delta_c2 &lt;= sum_e2;<br />      const bool  result   = in_0 &amp;&amp; in_1 &amp;&amp; in_2;<br /></pre>
This code can be rewritten as follows:<br />
<pre class="code">      const float  overlap_0 = sum_e0 - abs_delta_c0;<br />      const float  overlap_1 = sum_e1 - abs_delta_c1;<br />      const float  overlap_2 = sum_e2 - abs_delta_c2;<br />      const double temp_01   = ( overlap_1 &gt;= 0.0f ) ? overlap_0 : overlap_1;<br />      const double temp_012  = ( overlap_2 &gt;= 0.0f ) ? temp_01   : overlap_2;<br />      const bool   result    = temp_012 &gt;= 0.0f;<br /></pre>
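It's worth convincing yourself that the two forms agree before going further. The sketch below puts them side by side as portable functions. The names <i>overlap_by_compare</i> and <i>overlap_by_select</i> are made up for this illustration, and the inputs are assumed to be finite.<br />

```cpp
// Original form: three compares joined by &&.
static inline bool overlap_by_compare( const float abs_delta[3], const float sum_e[3] )
{
    return ( abs_delta[0] <= sum_e[0] ) &&
           ( abs_delta[1] <= sum_e[1] ) &&
           ( abs_delta[2] <= sum_e[2] );
}

// Rewritten form: a negative overlap on any axis is carried through the
// select chain, so the final value is >= 0 only when all three axes overlap.
static inline bool overlap_by_select( const float abs_delta[3], const float sum_e[3] )
{
    const float overlap_0 = sum_e[0] - abs_delta[0];
    const float overlap_1 = sum_e[1] - abs_delta[1];
    const float overlap_2 = sum_e[2] - abs_delta[2];
    const float temp_01   = ( overlap_1 >= 0.0f ) ? overlap_0 : overlap_1;
    const float temp_012  = ( overlap_2 >= 0.0f ) ? temp_01   : overlap_2;

    return temp_012 >= 0.0f;
}
```

One caveat: if a sum and a delta are both infinite, the subtraction yields NaN and the select form reports no overlap, while the compare form reports an overlap.<br />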
The calculations of temp_01 and temp_012 can be expressed using fsel.<br />
<pre class="code">      const float  overlap_0 = sum_e0 - abs_delta_c0;<br />      const float  overlap_1 = sum_e1 - abs_delta_c1;<br />      const float  overlap_2 = sum_e2 - abs_delta_c2;<br />      const double temp_01   = ppc_fsels( overlap_1, overlap_0, overlap_1 );<br />      const double temp_012  = ppc_fsels( overlap_2, temp_01,   overlap_2 );<br />      const bool   result    = temp_012 &gt;= 0.0f;<br /></pre>
Now take a look at the constant value 0.0f in the last statement above.
Keep in mind the PowerPC has no instruction to move an immediate value
into a
floating point register, so each constant appearing in an expression
means an additional load from memory. That's why it's a good idea to
restructure expressions
if possible to reduce or eliminate constants.<br />
<br />
Here we can't easily avoid comparing with zero, but we can get rid of
the load by doing something unconventional. We can replace 0.0f with a
parameter
named zero, which in this case will be passed in via FPR1. Then it's up
to the caller to find an optimal way to provide the value 0.0f. For
example, if
the calling function has plenty of register variables available, 0.0f
can be loaded into one of them near the top. Alternatively, the
constant can be put
some place in memory where it will be on the same cache line as some
other data that's needed anyway.<br />
<br />
You might think we could construct the constant 0.0f cheaply by
subtracting any float value (e.g., a_c0) from itself. But this doesn't
work if the value
is NaN, because you end up with NaN instead of 0.0f.<br />
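Here's a small portable sketch of that failure, using hypothetical helper names. It also shows a second case that follows from the same IEEE rules: infinity minus infinity is NaN as well.<br />

```cpp
#include <cmath>

// Hypothetical helper: tries to make 0.0f by subtracting a value from itself.
static inline float zero_by_self_subtract( const float x )
{
    return x - x;
}

// Reports whether the trick produced NaN instead of zero for this input.
static inline bool self_subtract_failed( const float x )
{
    return std::isnan( zero_by_self_subtract( x ) );
}
```

This assumes the compiler keeps IEEE semantics. With options like -ffast-math it may fold x - x to zero, which hides the problem instead of fixing it.<br />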
<br />
Anyway, let's also change the GCC_SPLIT_BLOCK macro so that we can
inject comments into the asm output (to make it easier to track down
our changes).<br />
<pre class="code">#define GCC_SPLIT_BLOCK(str)  __asm__( "//\n\t// " str "\n\t//\n" );<br /></pre>
Here's the fourth version of the overlaps() method:<br />
<pre class="code">    bool overlaps(const Box&amp; b, float zero) const<br />    {<br />      GCC_SPLIT_BLOCK("LOADS")<br /><br />      const float a_c0 = m_center[0];<br />      const float a_c1 = m_center[1];<br />      const float a_c2 = m_center[2];<br />      const float a_e0 = m_extent[0];<br />      const float a_e1 = m_extent[1];<br />      const float a_e2 = m_extent[2];<br />      const float b_c0 = b.m_center[0];<br />      const float b_c1 = b.m_center[1];<br />      const float b_c2 = b.m_center[2];<br />      const float b_e0 = b.m_extent[0];<br />      const float b_e1 = b.m_extent[1];<br />      const float b_e2 = b.m_extent[2];<br /><br />      GCC_SPLIT_BLOCK("CALCULATIONS")<br /><br />      const float delta_c0     = a_c0 - b_c0;<br />      const float delta_c1     = a_c1 - b_c1;<br />      const float delta_c2     = a_c2 - b_c2;<br />      const float abs_delta_c0 = ::fabs( delta_c0 );<br />      const float abs_delta_c1 = ::fabs( delta_c1 );<br />      const float abs_delta_c2 = ::fabs( delta_c2 );<br />      const float sum_e0       = a_e0 + b_e0;<br />      const float sum_e1       = a_e1 + b_e1;<br />      const float sum_e2       = a_e2 + b_e2;<br />      const float overlap_0    = sum_e0 - abs_delta_c0;<br />      const float overlap_1    = sum_e1 - abs_delta_c1;<br />      const float overlap_2    = sum_e2 - abs_delta_c2;<br /><br />      GCC_SPLIT_BLOCK("SELECT RESULT")<br /><br />      const double temp_01   = ppc_fsels( overlap_1, overlap_0, overlap_1 );<br />      const double temp_012  = ppc_fsels( overlap_2, temp_01,   overlap_2 );<br />      const bool   result    = temp_012 &gt;= zero;<br /><br />      return (result);<br />    }<br /></pre>
We'll also change the test_overlap function to add zero as a parameter:<br />
<pre class="code">bool<br />test_overlap( const Box&amp; a, const Box&amp; b, float zero )<br />{<br />  return a.overlaps( b, zero );<br />}<br /></pre>
The output shows that we have reduced the cost for the comparisons significantly:<br />
<pre class="code">_Z12test_overlapRK3BoxS1_f:<br />    stwu 1,-16(1)<br /><br />    //<br />    // LOADS<br />    //<br /><br />    lfs 0,20(3)<br />    lfs 3,20(4)<br />    lfs 2,0(3)<br />    lfs 10,4(3)<br />    lfs 9,8(3)<br />    lfs 12,12(3)<br />    lfs 13,16(3)<br />    lfs 8,0(4)<br />    lfs 7,4(4)<br />    lfs 6,8(4)<br />    lfs 5,12(4)<br />    lfs 4,16(4)<br /><br />    //<br />    // CALCULATIONS<br />    //<br /><br />    fsubs 11,2,8<br />    fsubs 2,10,7<br />    fsubs 8,9,6<br />    fadds 7,12,5<br />    fadds 6,13,4<br />    fadds 5,0,3<br />    fabs 11,11<br />    fabs 10,2<br />    fabs 9,8<br />    fsubs 12,7,11<br />    fsubs 4,6,10<br />    fsubs 3,5,9<br /><br />    //<br />    // SELECT RESULT<br />    //<br /><br />    fsel 13, 4, 12, 4<br />    addi 1,1,16<br />    fsel 2, 3, 13, 3<br />    fmr 0,2<br />    fcmpu 7,1,0<br />    cror 30,28,30<br />    mfcr 3<br />    rlwinm 3,3,31,1<br />    blr<br /></pre>
Right now our main problem is that we have 12 loads and we're not doing
enough work to make up for that. Next we'll look at how to reduce loads.<br />
<div class="subtitle">Moving to VMX/Altivec</div>
<div class="rule-of-thumb">Always look for ways to reduce loads and stores.  It's one of the most effective techniques for improving performance.</div>
We're going to use the VXU (Altivec unit), which operates on 128-bit
(16-byte) operands. A typical operand is a vector of 4 float values, of
which we'll
use 3. The compiler recognizes a set of vector data types and vector
intrinsics.<br />
<br />
Here are some of the main advantages to using Altivec:<br />
<br />
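<i>More available registers</i> - General purpose code will eat up most of
your fixed point registers, making it more likely you'll need to keep dumping
data on the stack. The VXU has its own set of 32 vector registers, separate
from the fixed point and floating point registers.<br />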
<br />
<i>Mixed integer and floating point</i> - Mixing integer and floating point
code, or converting between the two, is very expensive with scalar
operations. This is because the only method of moving between the FXU (fixed
point execution unit) and the FPU is through memory (typically the stack).
This often creates a Load-Hit-Store data hazard event which will cause your
processor to wait around until the register has been loaded. On the VXU you
can freely use vector integer instructions on vector floating point values
without penalty. There are also conversion instructions for your
convenience.<br />
<br />
<i>Much higher throughput</i> - This is really the whole point of a SIMD
instruction set. One instruction operates on a whole 128-bit register (four
floats, for example), so much more work gets done per instruction. Each
instruction is also very fast.<br />
<br />
<i>Saturated arithmetic instructions</i> - Saturated instructions are
operations that basically cannot overflow or underflow. Any calculated value
that is greater than the maximum value for the type of the vector component
(8, 16 or 32 bits) is clamped to the maximum. Conversely, any value below the
minimum is clamped to the minimum.
This is extremely handy for any kind of fixed point math.<br />
<br />
<i>Bit manipulation on all types (permute, shift, rotate)</i> - There is a
large set of instructions for bit manipulation which you can apply to all
the vector types. The permute instruction is a special instruction that lets
you shuffle around the bytes in a vector. By itself, this instruction makes
Altivec a win.<br />
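<br />
To make the saturated arithmetic point concrete, here's a scalar model of a single unsigned 8-bit lane. It's a sketch for illustration only. The function names are hypothetical, and on the VXU the same clamping is done per lane by the saturating intrinsics such as <i>vec_adds</i>.<br />

```cpp
#include <stdint.h>

// Scalar model of one lane of an unsigned 8-bit saturating add:
// the sum is clamped to 255 instead of wrapping around.
static inline uint8_t add_sat_u8( const uint8_t a, const uint8_t b )
{
    const unsigned sum = (unsigned)a + (unsigned)b;

    return (uint8_t)( ( sum > 255u ) ? 255u : sum );
}

// The ordinary (modular) add wraps: 200 + 100 = 300, which becomes 300 - 256 = 44.
static inline uint8_t add_wrap_u8( const uint8_t a, const uint8_t b )
{
    return (uint8_t)( a + b );
}
```

With pixel or audio data, the wrapped result (a bright value suddenly turning dark) is almost never what you want, which is why the clamped version is so handy.<br />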
<br />
For our current application (testing for overlap), I'm just going to
remove Vector3 completely and opt for using the vector types directly.
If I did have some reason to hide the vector types (cross platform
code?) I would completely remove the following methods:<br />
<pre class="code">    Vector3(const float&amp; x, const float&amp; y, const float&amp; z)<br />    {<br />      m_co[0] = x;<br />      m_co[1] = y;<br />      m_co[2] = z;<br />    }<br /><br />    float&amp;       operator[]( int i )       { return m_co[i]; }<br />    const float&amp; operator[]( int i ) const { return m_co[i]; }<br /></pre>
There's no <i>fast</i> way to implement these methods. They're the very antithesis of working with SIMD instructions.<br />
<br />
Anyway, the conversion to Altivec is quite straightforward in this
case. The fourth element (w) must be masked out or else initialized to
zero in each vector.<br />
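<br />
To see why the w element matters, here's a portable four-lane model of the test we're about to write. It's a sketch with a hypothetical name (<i>overlaps4_model</i>), using plain float arrays in place of vector registers. Every lane has to pass, so garbage in w can turn a real overlap into a miss.<br />

```cpp
#include <cmath>

// Portable model of the vector test: all four lanes must pass, including w.
static inline int overlaps4_model( const float a_c[4], const float a_e[4],
                                   const float b_c[4], const float b_e[4] )
{
    int all_ge = 1;

    for ( int i = 0; i < 4; i++ )
    {
        const float abs_delta = std::fabs( a_c[i] - b_c[i] );
        const float overlap   = ( a_e[i] + b_e[i] ) - abs_delta;

        all_ge &= ( overlap >= 0.0f ) ? 1 : 0;
    }

    return all_ge;
}
```

With w set to zero everywhere, the fourth lane computes 0 - 0 = 0, which passes the test and leaves the x, y and z lanes to decide the result.<br />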
<br />
The functions beginning with <i>vec_</i> are vector intrinsics. Note that <i>vec_all_ge</i>
returns an int, not a vector type value. Specifically,
it returns 1 if all elements of the first vector argument are greater
than or equal to the corresponding elements of the second vector
argument.<br />
<br />
Here, we don't need to pass in zero as a parameter, because we can easily build a zero vector using <i>vec_splat_u8</i>.  I've also used int instead
of bool as the return type of <i>overlaps</i> and <i>test_overlap</i>.  That way, a calling function that needs to test multiple boxes can use
bitwise logical operators (&amp; and |) to avoid branches.<br />
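<br />
Here's a sketch of what such a caller might look like. To keep it portable, it uses a hypothetical one-axis <i>Interval</i> struct in place of Box. The point is only that int results can be accumulated with |= and &amp;=, with no branch on each individual test.<br />

```cpp
#include <cmath>

// Hypothetical one-axis stand-in for Box, used only for this illustration.
struct Interval
{
    float m_center;
    float m_extent;
};

// Returns 1 if the intervals overlap (touching counts), 0 otherwise.
static inline int overlaps( const Interval& a, const Interval& b )
{
    return std::fabs( a.m_center - b.m_center ) <= ( a.m_extent + b.m_extent );
}

// 1 if the query overlaps at least one item. Results are OR-ed together.
static inline int overlaps_any( const Interval& query, const Interval* items, const int count )
{
    int any = 0;

    for ( int i = 0; i < count; i++ )
    {
        any |= overlaps( query, items[i] );
    }

    return any;
}

// 1 if the query overlaps every item. Results are AND-ed together.
static inline int overlaps_all( const Interval& query, const Interval* items, const int count )
{
    int all = 1;

    for ( int i = 0; i < count; i++ )
    {
        all &= overlaps( query, items[i] );
    }

    return all;
}
```

The loops still branch on the counter, but that branch is easy to predict. The data-dependent branch on each test result is what goes away.<br />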
<br />
Here's the fifth version of the code:<br />
<pre class="code">#include &lt;altivec.h&gt;<br /><br />#define GCC_SPLIT_BLOCK(str)  __asm__( "//\n\t// " str "\n\t//\n" );<br /><br />struct Box<br />{<br />    vector float m_v[2];<br /><br />    enum<br />    {<br />      m_centerOffset = 0x00,<br />      m_extentOffset = 0x10<br />    };<br /><br />    Box() {}<br /><br />    Box(const vector float&amp; center, const vector float&amp; extent) <br />    {<br />      vec_st( center, m_centerOffset, (vector float*)m_v );<br />      vec_st( extent, m_extentOffset, (vector float*)m_v );<br />    }<br /><br />    int overlaps(const Box&amp; b) const<br />    {<br />      GCC_SPLIT_BLOCK("LOADS")<br />      const vector float zero             = (vector float)vec_splat_u8( 0x00 );<br />      const vector float a_c              = vec_ld( m_centerOffset, (vector float*)m_v );<br />      const vector float a_e              = vec_ld( m_extentOffset, (vector float*)m_v );<br />      const vector float b_c              = vec_ld( m_centerOffset, (vector float*)b.m_v );<br />      const vector float b_e              = vec_ld( m_extentOffset, (vector float*)b.m_v );<br />      GCC_SPLIT_BLOCK("CALCULATE RESULT")<br />      const vector float delta_c          = vec_sub( a_c, b_c );<br />      const vector float abs_delta_c      = vec_abs( delta_c );<br />      const vector float sum_e            = vec_add( a_e, b_e );<br />      const vector float overlap          = vec_sub( sum_e, abs_delta_c );<br />      const int          result           = vec_all_ge( overlap, zero );<br /><br />      return (result);<br />    }<br />};<br /><br />int<br />test_overlap( const Box&amp; a, const Box&amp; b )<br />{<br />  return a.overlaps( b );<br />}<br /></pre>
This straightforward translation to vector types reduces the number of loads from 12 to 4:<br />
<pre class="code">_Z12test_overlapRK3BoxS1_:<br />    stwu 1,-16(1)<br /><br />    //<br />    // LOADS<br />    //<br /><br />    li 0,16<br />    vspltisb 11,0<br />    lvx 12,4,0<br />    lvx 1,3,0<br />    lvx 0,0,3<br />    lvx 13,0,4<br /><br />    //<br />    // CALCULATE RESULT<br />    //<br /><br />    vsubfp 0,0,13<br />    addi 1,1,16<br />    vaddfp 1,1,12<br />    vspltisw 13,-1<br />    vslw 12,13,13<br />    vandc 0,0,12<br />    vsubfp 1,1,0<br />    vcmpgefp. 11,1,11<br />    mfcr 3<br />    rlwinm 3,3,25,1<br />    blr<br /></pre>
This is fine for doing a single overlap test. But what if we need to
perform a great many tests for overlap? That's where Altivec really
shines.<br />
<div class="subtitle">Doing Four Overlap Tests At Once</div>
We'll be declaring a struct <i>box4</i> representing 4 boxes.  The following uniform vector layout will be used, where K is a box4 and J is the corresponding
array of 4 Box objects:<br />
<pre class="code">K.center_x = { J[0].m_center[0], J[1].m_center[0], J[2].m_center[0], J[3].m_center[0] }<br />K.center_y = { J[0].m_center[1], J[1].m_center[1], J[2].m_center[1], J[3].m_center[1] }<br />K.center_z = { J[0].m_center[2], J[1].m_center[2], J[2].m_center[2], J[3].m_center[2] }<br />K.extent_x = { J[0].m_extent[0], J[1].m_extent[0], J[2].m_extent[0], J[3].m_extent[0] }<br />K.extent_y = { J[0].m_extent[1], J[1].m_extent[1], J[2].m_extent[1], J[3].m_extent[1] }<br />K.extent_z = { J[0].m_extent[2], J[1].m_extent[2], J[2].m_extent[2], J[3].m_extent[2] }<br /></pre>
The new function box4_overlaps accepts two box4 pointers as parameters,
and returns a signed int vector of overlap test results. Specifically,
the Nth element
of the return vector will be -1 if the Nth element of the first box4
overlaps the Nth element of the second box4. It will be 0 otherwise.<br />
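<br />
The layout above amounts to a transpose from an array of boxes into one array per component. Here's a portable sketch of that packing step. The struct and function names (<i>BoxScalar</i>, <i>Box4Scalar</i>, <i>box4_pack</i>) are hypothetical, and plain float[4] arrays stand in for the vector float members.<br />

```cpp
// Hypothetical scalar stand-ins for Box (J) and box4 (K).
struct BoxScalar
{
    float m_center[3];
    float m_extent[3];
};

struct Box4Scalar
{
    float center_x[4];
    float center_y[4];
    float center_z[4];
    float extent_x[4];
    float extent_y[4];
    float extent_z[4];
};

// Gathers component k of box i into lane i of the matching component array.
static inline void box4_pack( Box4Scalar* const K, const BoxScalar J[4] )
{
    for ( int i = 0; i < 4; i++ )
    {
        K->center_x[i] = J[i].m_center[0];
        K->center_y[i] = J[i].m_center[1];
        K->center_z[i] = J[i].m_center[2];
        K->extent_x[i] = J[i].m_extent[0];
        K->extent_y[i] = J[i].m_extent[1];
        K->extent_z[i] = J[i].m_extent[2];
    }
}
```

Packing like this on every test would throw away the gain. The layout pays off when the boxes are stored this way in the first place.<br />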
<br />
Once again we use vec_splat_u8 to build a zero vector, so we don't need zero passed in as a parameter.<br />
<br />
Here's the sixth version of the code:<br />
<pre class="code">#include &lt;altivec.h&gt;<br />#include &lt;stdint.h&gt;<br /><br />typedef struct box4 box4;<br /><br />struct box4<br />{<br />  vector float center_x;<br />  vector float center_y;<br />  vector float center_z;<br />  vector float extent_x;<br />  vector float extent_y;<br />  vector float extent_z;<br />};<br /><br />vector signed int<br />box4_overlaps( box4* const a, box4* const b )<br />{<br />  const vector float      zero       = (vector float)vec_splat_u8( 0x00 );<br />  const vector float      acx        = vec_ld( 0x00, &amp;a-&gt;center_x );<br />  const vector float      acy        = vec_ld( 0x00, &amp;a-&gt;center_y );<br />  const vector float      acz        = vec_ld( 0x00, &amp;a-&gt;center_z );<br />  const vector float      aex        = vec_ld( 0x00, &amp;a-&gt;extent_x );<br />  const vector float      aey        = vec_ld( 0x00, &amp;a-&gt;extent_y );<br />  const vector float      aez        = vec_ld( 0x00, &amp;a-&gt;extent_z );<br />  const vector float      bcx        = vec_ld( 0x00, &amp;b-&gt;center_x );<br />  const vector float      bcy        = vec_ld( 0x00, &amp;b-&gt;center_y );<br />  const vector float      bcz        = vec_ld( 0x00, &amp;b-&gt;center_z );<br />  const vector float      bex        = vec_ld( 0x00, &amp;b-&gt;extent_x );<br />  const vector float      bey        = vec_ld( 0x00, &amp;b-&gt;extent_y );<br />  const vector float      bez        = vec_ld( 0x00, &amp;b-&gt;extent_z );<br />  const vector float      dx         = vec_sub( acx, bcx );<br />  const vector float      dy         = vec_sub( acy, bcy );<br />  const vector float      dz         = vec_sub( acz, bcz );<br />  const vector float      abs_dx     = vec_abs( dx );<br />  const vector float      abs_dy     = vec_abs( dy );<br />  const vector float      abs_dz     = vec_abs( dz );<br />  const vector float      sum_ex     = vec_add( aex, bex );<br />  const vector float      sum_ey     = vec_add( aey, bey );<br />  const vector float   
   sum_ez     = vec_add( aez, bez );<br />  const vector float      overlap_x  = vec_sub( sum_ex, abs_dx );<br />  const vector float      overlap_y  = vec_sub( sum_ey, abs_dy );<br />  const vector float      overlap_z  = vec_sub( sum_ez, abs_dz );<br />  const vector signed int result_x   = vec_cmpge( overlap_x, zero );<br />  const vector signed int result_y   = vec_cmpge( overlap_y, zero );<br />  const vector signed int result_z   = vec_cmpge( overlap_z, zero );<br />  const vector signed int result_xy  = vec_and( result_x, result_y );<br />  const vector signed int result_xyz = vec_and( result_xy, result_z );<br /><br />  return (result_xyz);<br />}<br /></pre>
The compiler output shows 12 loads, but we're doing 4 overlap tests
instead of 1. That's 3 loads per test instead of the 12 in the fourth
version, so we've nearly quadrupled the throughput.<br />
<pre class="code">box4_overlaps:<br />    addi 12,3,16<br />    addi 5,4,16<br />    stwu 1,-16(1)<br />    lvx 1,0,4<br />    addi 9,3,32<br />    lvx 11,0,12<br />    addi 11,3,48<br />    lvx 7,0,5<br />    addi 10,3,64<br />    lvx 0,0,3<br />    vsubfp 2,11,7<br />    vsubfp 0,0,1<br />    addi 8,4,32<br />    addi 7,4,48<br />    lvx 7,0,10<br />    addi 6,4,64<br />    lvx 9,0,9<br />    lvx 10,0,6<br />    addi 3,3,80<br />    lvx 8,0,11<br />    addi 4,4,80<br />    lvx 13,0,8<br />    addi 1,1,16<br />    lvx 12,0,7<br />    vsubfp 1,9,13<br />    vaddfp 9,8,12<br />    lvx 13,0,4<br />    vaddfp 8,7,10<br />    lvx 12,0,3<br />    vaddfp 12,12,13<br />    vspltisw 10,-1<br />    vslw 7,10,10<br />    vandc 0,0,7<br />    vspltisw 13,-1<br />    vslw 10,13,13<br />    vandc 11,2,10<br />    vspltisw 13,-1<br />    vslw 10,13,13<br />    vandc 2,1,10<br />    vsubfp 10,9,0<br />    vsubfp 9,8,11<br />    vsubfp 1,12,2<br />    vspltisb 7,0<br />    vcmpgefp 8,10,7<br />    vcmpgefp 11,9,7<br />    vcmpgefp 2,1,7<br />    vand 0,8,11<br />    vand 2,0,2<br />    blr<br /></pre>
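Returning -1 or 0 per lane, rather than 1 or 0, makes the result easy to consume without branches. The scalar sketch below shows two ways a caller might use it. Both function names are hypothetical, and an int32_t array stands in for the vector signed int.<br />

```cpp
#include <stdint.h>

// Each lane is -1 (overlap) or 0 (no overlap), as described for box4_overlaps.
// Summing the lanes and negating gives the number of overlapping pairs.
static inline int count_overlaps_in_mask( const int32_t mask[4] )
{
    return -( mask[0] + mask[1] + mask[2] + mask[3] );
}

// An all-ones or all-zeros lane works directly as a bitwise select mask.
static inline int32_t select_by_mask( const int32_t mask, const int32_t if_set, const int32_t if_clear )
{
    return ( if_set & mask ) | ( if_clear & ~mask );
}
```

The select form is the scalar equivalent of what a vector select does on all four lanes at once.<br />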
<div class="subtitle">Synergistic Processor Unit</div>
The CBE has eight Synergistic Processor Units (SPUs) that are designed
for computation-intensive tasks. Suppose we want our application to run
on an SPU.
How can we adapt the overlap test function for the SPU environment?<br />
<br />
We'll use the same box4 structure as in the last example.  The SPU compiler recognizes vector intrinsics (beginning with <i>si_</i>) that are similar to
those of the VXU, but not identical. Here are some of the differences that have a direct bearing on the problem at hand:<br />
<br />
1. The return from a vector comparison is a vector unsigned int, instead of a vector signed int. A value of <i>true</i> is still all one bits, which reads as 0xFFFFFFFF rather than -1.<br />
2. The SPU has no instruction for absolute value. We can calculate the
absolute value of a vector float operand via a sequence of two
instructions:
<i>si_shli</i> (shift left immediate) and <i>si_rotmi</i> (rotate and mask immediate).  The term <i>rotate</i> is misleading.  In effect, si_rotmi(v, -n)
is a logical shift right of each element of v by n bits.<br />
3. The SPU can't directly test whether one vector float operand is greater than or equal to another.  So we'll use <i>si_fcgt</i> (vector float greater than)
with operands reversed to perform a "less than" test.  Then we'll invert the result with <i>si_nor</i>.<br />
<br />
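The shift trick from point 2 can be checked with a scalar model. This is a minimal sketch, assuming a 32-bit IEEE-754 float; the name abs_by_shifts is hypothetical and is not an SPU intrinsic.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the si_shli / si_rotmi pair on one 32-bit lane:
   shifting the bit pattern left by one discards the sign bit, and the
   logical shift right by one restores the other 31 bits with a zero sign. */
static float abs_by_shifts(float x)
{
    uint32_t bits;

    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float as raw bits */
    bits <<= 1;                       /* models si_shli(v, 1)   */
    bits >>= 1;                       /* models si_rotmi(v, -1) */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Because only the sign bit is touched, the exponent and mantissa come back unchanged.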
The data type <i>qword</i> means a quadword (128 bits = 16 bytes)
with unspecified structure. It could be a vector float, or a vector
unsigned int, or
some other vector type. It could even be a scalar kept in the first 32
bits, with the remaining 96 bits unused. For example, the first
parameter of <i>si_lqd</i>
is a qword with an address in the first 32 bits and the remaining 96 bits unused.  The function <i>si_from_uint</i> casts an unsigned int to a qword.  It
doesn't generate any actual machine instructions.<br />
<br />
Here's the seventh and final version of the code:<br />
<pre class="code">#include &lt;spu_intrinsics.h&gt;<br /><br />typedef struct box4 box4;<br /><br />struct box4<br />{<br />  vector float center_x;<br />  vector float center_y;<br />  vector float center_z;<br />  vector float extent_x;<br />  vector float extent_y;<br />  vector float extent_z;<br />};<br /><br />vector unsigned int<br />box4_overlaps( box4* const a, box4* const b )<br />{<br />  const qword zero       = si_il( 0 );<br />  const qword a_addr     = si_from_uint( (unsigned int) a );<br />  const qword b_addr     = si_from_uint( (unsigned int) b );<br />  const qword acx        = si_lqd( a_addr, 0x00 );<br />  const qword acy        = si_lqd( a_addr, 0x10 );<br />  const qword acz        = si_lqd( a_addr, 0x20 );<br />  const qword aex        = si_lqd( a_addr, 0x30 );<br />  const qword aey        = si_lqd( a_addr, 0x40 );<br />  const qword aez        = si_lqd( a_addr, 0x50 );<br />  const qword bcx        = si_lqd( b_addr, 0x00 );<br />  const qword bcy        = si_lqd( b_addr, 0x10 );<br />  const qword bcz        = si_lqd( b_addr, 0x20 );<br />  const qword bex        = si_lqd( b_addr, 0x30 );<br />  const qword bey        = si_lqd( b_addr, 0x40 );<br />  const qword bez        = si_lqd( b_addr, 0x50 );<br />  const qword dx         = si_fs( acx, bcx ); <br />  const qword dy         = si_fs( acy, bcy );<br />  const qword dz         = si_fs( acz, bcz );<br />  const qword uns_dx     = si_shli( dx, 1 );<br />  const qword uns_dy     = si_shli( dy, 1 );<br />  const qword uns_dz     = si_shli( dz, 1 );<br />  const qword abs_dx     = si_rotmi( uns_dx, -1 );<br />  const qword abs_dy     = si_rotmi( uns_dy, -1 );<br />  const qword abs_dz     = si_rotmi( uns_dz, -1 );<br />  const qword sum_ex     = si_fa( aex, bex );<br />  const qword sum_ey     = si_fa( aey, bey );<br />  const qword sum_ez     = si_fa( aez, bez );<br />  const qword overlap_x  = si_fs( sum_ex, abs_dx );<br />  const qword overlap_y  = si_fs( sum_ey, abs_dy );<br />  const 
qword overlap_z  = si_fs( sum_ez, abs_dz );<br />  const qword result_x   = si_fcgt( zero, overlap_x );<br />  const qword result_y   = si_fcgt( zero, overlap_y );<br />  const qword result_z   = si_fcgt( zero, overlap_z );<br />  const qword result_xy  = si_and( result_x, result_y );<br />  const qword result_xyz = si_and( result_xy, result_z );<br />  const qword inv_result = si_nor( result_xyz, result_xyz );<br /><br />  return (vector unsigned int)(inv_result);<br />}<br /></pre>
The SPU compiler output shows that there's practically a one to one
correspondence between C statements and machine instructions. The
compiler
has done some reordering, but that shouldn't be a problem here.<br />
<pre class="code">box4_overlaps:<br />    hbr .L2,$lr<br />    lnop<br />    il $14,0<br />    lqd $34,16($3)<br />    lqd $35,16($4)<br />    lqd $32,0($3)<br />    lqd $33,0($4)<br />    lqd $30,32($3)<br />    lqd $31,32($4)<br />    lqd $27,64($3)<br />    fs $29,$34,$35<br />    lqd $28,64($4)<br />    nop $127<br />    lqd $24,48($3)<br />    fs $26,$32,$33<br />    lqd $25,48($4)<br />    lqd $20,80($3)<br />    fs $23,$30,$31<br />    lnop<br />    lnop<br />    shli $22,$29,1<br />    lqd $21,80($4)<br />    fa $16,$27,$28<br />    shli $19,$26,1<br />    fa $13,$24,$25<br />    shli $18,$23,1<br />    rotmi $17,$22,-1<br />    fa $11,$20,$21<br />    rotmi $15,$19,-1<br />    rotmi $12,$18,-1<br />    fs $10,$16,$17<br />    fs $8,$13,$15<br />    fs $7,$11,$12<br />    fcgt $6,$14,$10<br />    fcgt $5,$14,$8<br />    fcgt $9,$14,$7<br />    and $4,$6,$5<br />    and $3,$4,$9<br />    nor $2,$3,$3<br />    ori $3,$2,0<br />    nop $127<br />.L2:<br />    bi $lr<br /></pre>
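The mask logic of the inverted comparison can be checked with a scalar model of a single lane. This is a sketch only: box1 and the function names are hypothetical. Since each axis is tested with "less than" and inverted afterwards, De Morgan's law applies: "overlap on all three axes" is the inverse of "any axis is less than zero", so the three masks are ORed before the inversion. Inverting their AND gives a different answer, namely "overlap on at least one axis". As far as this model goes, an and/and/nor sequence corresponds to the second form, so it is worth comparing against the semantics of the VXU version.

```c
#include <stdint.h>

/* Hypothetical scalar box: one lane of box4. */
typedef struct box1 { float cx, cy, cz, ex, ey, ez; } box1;

#define MASK_TRUE  0xFFFFFFFFu
#define MASK_FALSE 0x00000000u

/* "Less than zero" mask for one axis, as a greater-than compare of
   zero against the overlap amount would produce. */
static uint32_t axis_lt_zero(float ca, float cb, float ea, float eb)
{
    const float d       = ca - cb;
    const float abs_d   = (d < 0.0f) ? -d : d;
    const float overlap = (ea + eb) - abs_d;

    return (0.0f > overlap) ? MASK_TRUE : MASK_FALSE;
}

/* True when the boxes overlap on all three axes: nor of the ORed masks. */
static uint32_t overlaps_all_axes(const box1 *a, const box1 *b)
{
    const uint32_t lx = axis_lt_zero(a->cx, b->cx, a->ex, b->ex);
    const uint32_t ly = axis_lt_zero(a->cy, b->cy, a->ey, b->ey);
    const uint32_t lz = axis_lt_zero(a->cz, b->cz, a->ez, b->ez);

    return ~(lx | ly | lz);
}

/* Inverting the ANDed masks: true when at least one axis overlaps. */
static uint32_t overlaps_any_axis(const box1 *a, const box1 *b)
{
    const uint32_t lx = axis_lt_zero(a->cx, b->cx, a->ex, b->ex);
    const uint32_t ly = axis_lt_zero(a->cy, b->cy, a->ey, b->ey);
    const uint32_t lz = axis_lt_zero(a->cz, b->cz, a->ez, b->ez);

    return ~(lx & ly & lz);
}
```

Two unit boxes whose centers are close on x but far apart on y and z are enough to tell the two forms apart.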
<div class="subtitle">Additional Reading</div>
Basic Altivec references:<br />
<a href="http://developer.apple.com/hardware/ve/instruction_crossref.html">Altivec Instruction Cross Reference, Apple (HTML)</a><br />
<a href="http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPEM.pdf">Altivec Programming Environments Manual, Freescale (PDF)</a><br />
<a href="http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf">Altivec Programmer's Interface Manual, Freescale (PDF)</a><br />
<br />
Useful Altivec introductions and tutorials:<br />
<a href="http://developer.apple.com/hardware/ve/simd.html">Understanding SIMD, Apple (HTML)</a><br />
<a href="http://developer.apple.com/hardware/ve/tutorial.html">Altivec Tutorial, Apple (HTML)</a><br />
<a href="http://www.simdtech.org/apps/group_public/download.php/26/Altivec%20formatted.1.2.pdf">Altivec Tutorial, Ian Ollmann (PDF)</a><br />
<a href="http://www.adhocconference.com/papers/2001/2001Ollmann.pdf">Practical Altivec Strategies, Ian Ollmann (PDF)</a><br />
<a href="http://www-128.ibm.com/developerworks/power/library/pa-unrollav1/index.html">Unrolling Altivec, Peter Seebach (HTML)</a><br />
<a href="http://www.mactech.com/articles/mactech/Vol.15/15.07/AltiVecRevealed/">AltiVec Revealed, Tom Thompson</a><br />
<br />
Basic SPU references:<br />
<a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E/$file/SPU_language_extensions_2.1.pdf">SPU C/C++ Language Extensions (PDF)</a><br />
<a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/76CA6C7304210F3987257060006F2C44/$file/SPU_ISA_1.1_pub.pdf">SPU Instruction Set Architecture (PDF)</a><br />
    
  


      

		           

    
  




    ]]>
    </content>
</entry>

<entry>
    <title>A 4x4 Matrix Inverse</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2006/06/a-4x4-matrix-inverse.html" />
    <id>tag:cellperformance.beyond3d.com,2006:/articles//3.23</id>

    <published>2006-06-04T05:25:26Z</published>
    <updated>2009-08-07T05:31:02Z</updated>

    <summary> GUEST ARTICLE! Cédric Lallain is a Frenchman who has been working with me on Cell/PS3 research at Highmoon Studios in Carlsbad, CA. I hope that this is only the first of many contributions to the community by Cédric. Welcome...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[<div class="sticky-note">
<b>GUEST ARTICLE!</b> Cédric Lallain is a Frenchman who has been
working with me on Cell/PS3 research at Highmoon Studios in Carlsbad,
CA. I hope that this is only the first of many contributions to the
community by Cédric. Welcome aboard! -- Mike.
</div>
<div class="subtitle">
Inverse matrix on PPU and on SPU using SIMD instructions.
</div>
<p>This article shows how to convert scalar code to SIMD
code for the PPU and SPU, using the matrix inverse as an example.
</p><p>Most of the time in video games, programmers do not compute
a full matrix inverse because it is too expensive. Instead, they
assume the matrix is orthonormal and simply transpose the 3x3
rotation part, then use dot products to rebuild the translation.
Sometimes, however, the full inverse algorithm is necessary. </p><p>
The main goal is to compute it as fast as possible,
which is why the code should use SIMD instructions wherever it can.
  </p><div class="quote">A
vector is an instruction operand containing a set of data elements
packed into a one-dimensional array. The elements can be fixed-point or
floating-point values. Most Vector/SIMD Multimedia Extension and SPU
instructions operate on vector operands. Vectors are also called
Single-Instruction, Multiple-Data (SIMD) operands, or packed operands.<br />
SIMD processing exploits data-level parallelism. Data-level parallelism
means that the operations required to transform a set of vector
elements can be performed on all elements of the vector at the same
time. That is, a single instruction can be applied to multiple data
elements in parallel.</div>
<div class="quote-cite">[Chapter 2.5.1 in the released pdf by IBM: <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>]. </div>

  <div class="quote">Each SPE is a 128-bit RISC processor specialized for data-rich, compute-intensive SIMD and scalar applications.</div>
  <div class="quote-cite">[Chapter 3 in the released pdf by IBM: <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>]. </div>
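The orthonormal shortcut mentioned at the start of the article can be sketched in a few lines of scalar C. This is a minimal illustration, assuming a column-major matrix whose first three columns hold an orthonormal rotation, whose fourth column holds the translation, and whose bottom row is (0, 0, 0, 1). The name inverse_rigid is hypothetical.

```c
typedef struct s_vector { float row[4]; } s_vector;
typedef struct s_matrix { s_vector cols[4]; } s_matrix;

/* Inverse of [ R | t ] with R orthonormal: [ R^T | -R^T * t ].
   'out' must not point to the same matrix as 'm'. */
static void inverse_rigid(s_matrix *out, const s_matrix *m)
{
    const float *t = m->cols[3].row;
    int r, c;

    for (c = 0; c < 3; c++) {
        for (r = 0; r < 3; r++)
            out->cols[c].row[r] = m->cols[r].row[c];   /* 3x3 transpose */
        out->cols[c].row[3] = 0.0f;
    }

    for (r = 0; r < 3; r++) {
        const float *axis = m->cols[r].row;            /* column r of R */
        /* Row r of R^T is column r of R, so this is -(R^T * t)[r]. */
        out->cols[3].row[r] = -(axis[0] * t[0] + axis[1] * t[1] + axis[2] * t[2]);
    }
    out->cols[3].row[3] = 1.0f;
}
```

This costs nine copies and three dot products, which is why it is preferred whenever the orthonormal assumption holds.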
<p><br />
Also, the number of branches should be kept to a strict minimum, since any extra branch will slow down the final solution. The first step is to choose the algorithm best suited to these objectives.
Several algorithms exist to invert a matrix:

</p>]]>
        <![CDATA[<p><b> The Gauss-Jordan elimination: </b>
Gauss-Jordan elimination finds the inverse matrix by solving a system of linear equations.
 A good explanation of how this algorithm works can be found in the book <a href="http://www.library.cornell.edu/nr/cbookcpdf.html"> "Numerical Recipes in C" [library.cornell.edu] </a> chapter 2.1. <br />
For a visual demonstration using a Java applet see: <a href="http://www.cse.uiuc.edu/eot/modules/linear_equations/gauss_jordan/"> Gauss-Jordan Elimination [cse.uiuc.edu]</a>.
In
this algorithm, choosing a good pivot is critical. To do
so, all floating point values of a given column need to be compared
with each other, one by one. This kind of data-dependent search maps
poorly to SIMD code. <br />
While performing the algorithm, some multiplications are done between columns 
(e.g.: to apply the pivot) and other operations between rows
(e.g.: to apply the multiplier to the rest of the matrix). 
This requires extra code to swap rows and columns in order to use SIMD instructions.
</p><p>
<b> Inversion using LU decomposition: </b>
The description of the inverse calculation can be found in <a href="http://www.library.cornell.edu/nr/cbookcpdf.html"> "Numerical Recipes in C" [library.cornell.edu] </a> chapter 2.3.
</p><div class="quote">
In linear algebra, a block LU decomposition is a decomposition of a
block matrix into a lower block triangular matrix L and an upper block
triangular matrix U. This decomposition is used in numerical analysis
to reduce the complexity of the block matrix formula.</div>
<div class="quote-cite">[<a href="http://en.wikipedia.org/wiki/Block_LU_decomposition">Block LU decomposition [wikipedia.org]</a>]</div>
This algorithm would probably be very useful for an 8x8 matrix. 
For a 4x4 matrix, however, it works on two floats at a time, 
whereas a vector type holds four.<p>

  <b> Inversion by Partitioning: </b>
To invert a matrix A (size N) by partitioning, the matrix is split into:
</p><pre>       |  A0    A1  |<br />   A = |            | with A0 and A3 square matrices of respective sizes<br />       |  A2    A3  |                s0 and s3 following the rule: s0 + s3 = N<br /></pre>
The inverse is
<pre>          |  B0    B1  |<br />   InvA = |            |<br />          |  B2    B3  |<br /></pre>
with:
<pre>  B0 = Inv(A0 - A1 * InvA3 * A2)<br />  B1 = - B0 * (A1 * InvA3)<br />  B2 = - (InvA3 * A2) * B0<br />  B3 = InvA3 - B2 * (A1 * InvA3)<br /></pre>
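A quick way to sanity-check the partition formulas is to apply them to a 2x2 matrix, where every block is a single float (s0 = s3 = 1). This is only a sketch with a hypothetical function name. B3 is taken here as InvA3 - B2 * (A1 * InvA3), which is the form that reproduces the usual 2x2 inverse.

```c
/* Inverts the row-major 2x2 matrix { A0, A1, A2, A3 } with the partition
   formulas, treating each entry as a 1x1 block.
   Assumes A3 and (A0 - A1 * InvA3 * A2) are non-zero. */
static void inverse_by_partition_2x2(const float a[4], float b[4])
{
    const float inv_a3    = 1.0f / a[3];
    const float a1_inv_a3 = a[1] * inv_a3;    /* A1 * InvA3 */
    const float inv_a3_a2 = inv_a3 * a[2];    /* InvA3 * A2 */

    b[0] = 1.0f / (a[0] - a1_inv_a3 * a[2]);  /* B0 */
    b[1] = -b[0] * a1_inv_a3;                 /* B1 */
    b[2] = -inv_a3_a2 * b[0];                 /* B2 */
    b[3] = inv_a3 - b[2] * a1_inv_a3;         /* B3 */
}
```

For a 4x4 matrix the same steps apply with 2x2 blocks, which is where the two-floats-at-a-time limitation comes from.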
More information can be found in <a href="http://www.library.cornell.edu/nr/cbookcpdf.html"> "Numerical Recipes in C" [library.cornell.edu] </a> chapter 2.7<p>
The issue described above for LU decomposition is also present here: the goal is to work on four floats at a time, not just two. </p><p>

  <b> Using the inverse formula ( (1/det(M)) * Transpose(Cofactor(M))): </b>
Check the article about <a href="http://mathworld.wolfram.com/MatrixInverse.html"> Matrix Inverse [mathworld.wolfram.com]</a> for more information about this formula.
</p><p>
This is the algorithm which will be used to invert the matrix. Each
step factors well: the operations can be grouped
and replaced by SIMD instructions. <br />
The most critical part of this algorithm is the calculation of all the
cofactors. This part also has two great advantages for our objectives.
It is pure calculation, which allows writing code without branches. And all
cofactor values are computed the same way, independently of each other,
so they can be computed in parallel. This is a perfect place to
use SIMD instructions.
</p><p>This article will start with a basic implementation of the
inverse formula using scalar instructions. Then this code will be
modified to prepare the SIMD version. The first SIMD version will be
written for the PPU. The final one will be a conversion to the SPU
intrinsic instruction set. </p><div class="subtitle">
A 4x4 matrix inverse
</div>

The general formula is:
<pre class="code">   InvM = (1/det(M)) * Transpose(Cofactor(M))<br /></pre>

which can also be written:
<pre class="code">   InvM = (1/det(M)) * Adjoint(M) with<br />   Adjoint(M) = Transpose(Cofactor(M))<br /></pre>

For the scalar version, the matrix is defined as follows:
<pre class="code">  typedef struct s_vector<br />  {<br />    float row[4];<br />  } s_vector;<br /><br />  typedef struct s_matrix<br />  {<br />    s_vector cols[4];<br />  } s_matrix;<br /></pre>
<p>
The first version of the code does a standard implementation of the
formula. The inverse function calls the cofactor function which
computes and returns the cofactor matrix. </p><div class="quote">
Definition 1 - If A is a square matrix then the minor of a(i,j),
denoted by M(i,j), is the determinant of the submatrix that results
from removing the ith row and jth column of A.<br />
Definition 2 - If A is a square matrix then the cofactor of a(i,j), denoted by C(i,j), is the number ((-1)^(i+j))*M(i,j).
</div>
<div class="quote-cite">
[from <a href="http://tutorial.math.lamar.edu/AllBrowsers/2318/MethodOfCofactors.asp">The method of Cofactors [tutorial.math.lamar.edu]</a>]
</div>
 
Once the cofactor matrix is computed, the result is used to calculate the determinant and also the adjoint matrix.
<div class="quote">
Theorem 1 - if A is a matrix.
<ul><li> Choose any row, say row i, then,
    <ul><li>det(A) = a(i,1)C(i,1) + a(i,2)C(i,2) + ... + a(i,n)C(i,n)</li></ul></li><li> Choose any column, say column j, then,
    <ul><li>det(A) = a(1,j)C(1,j) + a(2,j)C(2,j) + ... + a(n,j)C(n,j)</li></ul></li></ul>
</div>
<div class="quote">
The adjoint of A is the transpose of the matrix of cofactors and is denoted 
by adj(A).
</div>
<div class="quote-cite">
[from <a href="http://tutorial.math.lamar.edu/AllBrowsers/2318/MethodOfCofactors.asp">The method of Cofactors [tutorial.math.lamar.edu]</a>]
</div>

From there, the inverse matrix is just a division of the adjoint matrix by the determinant.
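The steps above can be sketched in scalar C as follows. This is a minimal illustration written against the s_matrix layout defined earlier, with element (row, col) stored at cols[col].row[row]. It is not the contents of inverse_v1.h, and the function names are hypothetical.

```c
typedef struct s_vector { float row[4]; } s_vector;
typedef struct s_matrix { s_vector cols[4]; } s_matrix;

/* Minor M(i,j): determinant of the 3x3 matrix left after removing
   row i and column j. */
static float minor_ij(const s_matrix *m, int i, int j)
{
    float s[3][3];
    int r, c, sr = 0;

    for (r = 0; r < 4; r++) {
        int sc = 0;
        if (r == i) continue;
        for (c = 0; c < 4; c++) {
            if (c == j) continue;
            s[sr][sc++] = m->cols[c].row[r];
        }
        sr++;
    }
    return s[0][0] * (s[1][1] * s[2][2] - s[1][2] * s[2][1])
         - s[0][1] * (s[1][0] * s[2][2] - s[1][2] * s[2][0])
         + s[0][2] * (s[1][0] * s[2][1] - s[1][1] * s[2][0]);
}

/* Cofactor C(i,j) = (-1)^(i+j) * M(i,j). */
static float cofactor_ij(const s_matrix *m, int i, int j)
{
    const float minor = minor_ij(m, i, j);
    return ((i + j) & 1) ? -minor : minor;
}

/* Returns 0 when the determinant is zero (no inverse), 1 otherwise. */
static int inverse_sketch(s_matrix *out, const s_matrix *m)
{
    float cof[4][4];   /* cof[i][j] = C(i,j) */
    float det = 0.0f;
    int i, j;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            cof[i][j] = cofactor_ij(m, i, j);

    /* Theorem 1, expanding along row 0. */
    for (j = 0; j < 4; j++)
        det += m->cols[j].row[0] * cof[0][j];
    if (det == 0.0f)
        return 0;

    /* Adjoint = transpose of the cofactor matrix, divided by det. */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            out->cols[j].row[i] = cof[j][i] / det;
    return 1;
}
```

The loops and the branches on i and j are exactly what the later versions set out to remove.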
<p>
The full code is available here: <a href="http://cellperformance-snippets.googlecode.com/files/inverse_v1.h">inverse_v1.h</a>

</p><div class="subtitle">
Toward the SIMD
</div>

Even for the scalar code, it's better to unroll the loops: 
this gives the compiler more options for optimization 
and also gets rid of the branches. 
This is especially true for small loops with few iterations, like:

<pre class="code">    for ( col = 0 ; col &lt; 4 ; col++ )<br />    {<br />        for ( row = 0; row &lt; 4; row++ )<br />        {<br />            output-&gt;cols[col].row[row] =  source-&gt;cols[col].row[row] * factor;<br />        }<br />    }<br /></pre>
<p>
The second reason for this refactoring is to locate the SIMD blocks. 
Unrolling the multiplication is straightforward. 
The same changes can be applied to the transpose and the determinant functions.
</p><p>
The following section details the code of the cofactor matrix.
The second scalar version can be found here: <a href="http://cellperformance-snippets.googlecode.com/files/inverse_v2.h">inverse_v2.h</a>

</p><div class="subtitle">
The case of the cofactor matrix
</div>
To avoid too much confusion, in the second scalar version, the new
helper function is now called 'cofactor_column_v2' instead of
'cofactor_ij_v1'. It takes care of a whole column of cofactors and not
just one at a time.
<p>
The new cofactor code is:

</p><pre class="code">    cofactor_column_v2(output-&gt;cols[0].row, source, 0);<br />    cofactor_column_v2(output-&gt;cols[1].row, source, 1);<br />    cofactor_column_v2(output-&gt;cols[2].row, source, 2);<br />    cofactor_column_v2(output-&gt;cols[3].row, source, 3);<br /></pre>

Inside cofactor_column_v2, the rows are grouped together to have a better view of what to do to convert this into SIMD code: 

<pre class="code">    const float r0_pos_part1 = mat-&gt;cols[col0].row[1] * <br />                               mat-&gt;cols[col1].row[2] * <br />                               mat-&gt;cols[col2].row[3];<br />                               <br />    const float r1_pos_part1 = mat-&gt;cols[col0].row[2] * <br />                               mat-&gt;cols[col1].row[3] * <br />                               mat-&gt;cols[col2].row[0];<br />                               <br />    const float r2_pos_part1 = mat-&gt;cols[col0].row[3] * <br />                               mat-&gt;cols[col1].row[0] * <br />                               mat-&gt;cols[col2].row[1];<br />                               <br />    const float r3_pos_part1 = mat-&gt;cols[col0].row[0] * <br />                               mat-&gt;cols[col1].row[1] * <br />                               mat-&gt;cols[col2].row[2];<br /></pre>

The row indices clearly show a relation between them.
Writing r0_pos_part1 as follows:
<pre class="code">    r[0]_pos_part1 = mat-&gt;cols[c0].row[r0] * <br />                     mat-&gt;cols[c1].row[r1] * <br />                     mat-&gt;cols[c2].row[r2]<br /></pre>
the next rows can be written like this:
<pre class="code">    r[N]_pos_part1 = mat-&gt;cols[c0].row[(r0+N)&amp;3] * <br />                     mat-&gt;cols[c1].row[(r1+N)&amp;3] * <br />                     mat-&gt;cols[c2].row[(r2+N)&amp;3]<br /></pre>The same relation is present in all positive and negative parts
of the calculation.
Basically, in order to calculate the different parts of the 3x3
determinants for a given column, the three other columns need to be
multiplied together after each has been rotated by a specific amount.
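The index relation can be written as a short loop, which makes the "rotate, then multiply" pattern explicit. This is a sketch with a hypothetical function name; the base row indices 1, 2, 3 are the ones used for r0_pos_part1 above.

```c
typedef struct s_vector { float row[4]; } s_vector;
typedef struct s_matrix { s_vector cols[4]; } s_matrix;

/* Computes pos_part1 for the four rows of one cofactor column.
   c0, c1 and c2 are the three columns other than the one being computed.
   Row n uses the base row indices 1, 2, 3 rotated by n (modulo 4). */
static void pos_part1_column(float out[4], const s_matrix *mat,
                             int c0, int c1, int c2)
{
    int n;

    for (n = 0; n < 4; n++) {
        out[n] = mat->cols[c0].row[(1 + n) & 3] *
                 mat->cols[c1].row[(2 + n) & 3] *
                 mat->cols[c2].row[(3 + n) & 3];
    }
}
```

Each iteration reads the same three columns at shifted positions, so in SIMD form the loop becomes two multiplications of rotated column vectors.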
<p>
Those 3x3 determinants, also called the minors of the matrix, need to have their signs adjusted.
</p><p>

Following the idea of converting the code using SIMD instructions, two variables have been created:
</p><pre class="code">    static const unsigned int znzn[] = { 0x00000000, 0x80000000, 0x00000000, 0x80000000 };<br />    static const unsigned int nznz[] = { 0x80000000, 0x00000000, 0x80000000, 0x00000000 };<br /></pre>
They contain the two possible sign masks for a whole column.
  When the column number is odd, nznz is the mask to select. Otherwise, znzn is the one to choose.
<p>To select the correct variable, the obvious way (and probably also the
most common one) would be to use an 'if'. As indicated at
the beginning of this article, the 'if' statement is something to avoid
as much as possible because it generates branches. The solution is
to build a mask (col_mask) and do the selection with it:
</p><pre class="code">    const unsigned int col_mask   = (const unsigned int)(((const int)((col &amp; 1) &lt;&lt; 31)) &gt;&gt; 31);<br />    const unsigned int u_znzn     = (const unsigned int)(&amp;znzn[0]);<br />    const unsigned int u_nznz     = (const unsigned int)(&amp;nznz[0]);<br />    union <br />    {<br />        unsigned int  u;<br />        unsigned int *p;<br />    } mask;<br />    mask.u = (u_nznz&amp;col_mask)|(u_znzn&amp;~col_mask);<br /></pre>

<p>The union is there to comply with the strict aliasing rule.
</p><p>
Once the correct pointer is selected, the final calculation is simple:

</p><pre class="code">    r0_cofactor.i ^= mask.p[0];<br />    r1_cofactor.i ^= mask.p[1];<br />    r2_cofactor.i ^= mask.p[2];<br />    r3_cofactor.i ^= mask.p[3];<br /></pre>
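The selection and the sign flip can be combined into one small routine. This is a sketch under a few assumptions: floats are 32-bit IEEE-754, the function name is hypothetical, and the table entries are selected directly instead of selecting a pointer, which avoids the pointer-to-integer cast (that cast only works when pointers are 32 bits wide). It follows the selection in the code above: the all-ones mask of an odd column picks nznz.

```c
#include <stdint.h>
#include <string.h>

static const uint32_t znzn[4] = { 0x00000000u, 0x80000000u, 0x00000000u, 0x80000000u };
static const uint32_t nznz[4] = { 0x80000000u, 0x00000000u, 0x80000000u, 0x00000000u };

/* Flips the signs of one column of cofactors without using an 'if'. */
static void apply_cofactor_signs(float cof[4], unsigned int col)
{
    /* All ones when col is odd, all zeros when col is even. */
    const uint32_t col_mask = 0u - (uint32_t)(col & 1u);
    int r;

    for (r = 0; r < 4; r++) {
        /* Bitwise select between the two table entries. */
        const uint32_t sign = (nznz[r] & col_mask) | (znzn[r] & ~col_mask);
        uint32_t bits;

        memcpy(&bits, &cof[r], sizeof bits);
        bits ^= sign;                      /* toggles the sign bit or nothing */
        memcpy(&cof[r], &bits, sizeof bits);
    }
}
```

The memcpy calls play the same role as the union: they reinterpret the float bits without breaking the aliasing rules.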


The next step is the conversion of this scalar version for the PPU using the altivec instruction set.

<div class="subtitle">
Altivec version
</div>

A new definition for the matrix type is required:

<pre class="code">  typedef struct s_matrix<br />  {<br />      vector float cols[4];<br />  } s_matrix;<br /></pre>In version 2, the rows were grouped together, which revealed the
idea of rotating their indices to do the calculation.
That idea is used in the SIMD code. Different rotations for each column
are required. Those rotations are computed first. The variable names
are defined as follows: cXuY, where column X is rotated up by
Y floats:
<pre class="code">    const vector float c0u1 = vec_sld(c0, c0,  4);<br />    const vector float c0u2 = vec_sld(c0, c0,  8);<br />    const vector float c0u3 = vec_sld(c0, c0, 12);<br />    ....<br /></pre>

<div class="rule-of-thumb">
The order of the calculations is really important to minimize the number of operations. 
</div>
<p>The calculation of each cofactor is based on the determinant of the
3x3 matrix created by removing the cofactor's column and row from the
source matrix. That is why, for the first column, the multiplication is
done in reverse order (i.e.: the third and fourth columns are
multiplied together before the second one is brought in).
This way, that product is already available to compute the cofactors of the
second column.<br />
For the third column, the multiplication is done in the initial order 
(multiplying the first and second columns together first) to share the result with the fourth column.
<br />
Note: in the source code, the fourth column is computed before
the third one, for convenience only (to avoid mistakes when working with
columns 0, 1, and 2 instead of 0, 1, and 3). The final result is
identical.
</p><p>
The same masking operation for the sign bit is done using SIMD instructions.
</p><p>In order to calculate the adjoint matrix, the transpose code has
also been converted. The unrolled version wasn't really helpful here;
what was required was knowledge of the altivec instructions that
rearrange data: vec_mergeh and vec_mergel.
</p><p>The determinant function now returns a vector float in which each
element is nearly equal (nearly, due to floating point precision) to
the determinant of the matrix.
<br />The algorithm is the same as before: a row or a column is multiplied
by the corresponding values in the cofactor
matrix, and all the products are added together.
<br />The multiplication is a single SIMD instruction;
unfortunately no instruction exists to sum all the values
of a vector. The solution is to rotate the result vector twice and add
it to itself each time, as follows:
</p><pre class="code">   (  A   B   C   D  )<br /> + (  C   D   A   B  )<br /> =====================<br />   ( A+C B+D C+A D+B )<br /> + ( B+D C+A D+B A+C )<br /> =====================<br /> = the vector with the determinant stored in each element.<br /> </pre>
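The rotate-and-add scheme can be modeled in portable C to see why two steps are enough. This is a scalar sketch: vec4, rotate_up, add4 and sum_across are hypothetical names, and rotate_up(v, n) stands in for vec_sld(v, v, 4 * n).

```c
/* Scalar stand-in for a vector of four floats. */
typedef struct vec4 { float f[4]; } vec4;

/* Rotates the elements up by n floats: element i receives element i + n. */
static vec4 rotate_up(vec4 v, int n)
{
    vec4 r;
    int i;

    for (i = 0; i < 4; i++)
        r.f[i] = v.f[(i + n) & 3];
    return r;
}

static vec4 add4(vec4 a, vec4 b)
{
    vec4 r;
    int i;

    for (i = 0; i < 4; i++)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* Leaves A+B+C+D in every element. */
static vec4 sum_across(vec4 v)
{
    const vec4 pairs = add4(v, rotate_up(v, 2));   /* A+C B+D C+A D+B */

    return add4(pairs, rotate_up(pairs, 1));       /* all four lanes summed */
}
```

Each lane adds the four terms in a different order, which is the source of the small differences between elements.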
 
<div class="rule-of-thumb">It's important to know that the values are
not necessarily the same along the vector. Because of the order of
the calculation and the limited precision of floating point,
they can be slightly different; a vec_splat can be applied to
this vector to force them to be identical.
</div>The final multiplication (by one over the determinant) is
easy to perform since the function already has a vector filled with
the determinant.
<p>
The code of this first PPU version of the inverse matrix can be found here: <a href="http://cellperformance-snippets.googlecode.com/files/inverse_v3.h">inverse_v3.h</a>


</p><div class="subtitle">
Altivec Optimization
</div>

Once the code is working on PPU, the next step is the optimization.
<p>The altivec instruction set doesn't include any instruction to
define arbitrary constants; every constant has to be either constructed or loaded
from memory. The following lines:
</p><pre class="code">    const vector unsigned int u_znzn = { 0x00000000, 0x80000000, 0x00000000, 0x80000000 };<br />    const vector unsigned int u_nznz = { 0x80000000, 0x00000000, 0x80000000, 0x00000000 };<br /></pre>
will generate the following code (using ppu-gcc from mambo):
<pre class="code">ld 4, .LC18@toc(2)<br />ld 11, .LC16@toc(2)<br /></pre>

If one operation is known for being slow, it is memory access.
To avoid loading constant values, they are built on the fly as follows:

<pre class="code">    const vector unsigned int u_zero     = (vector unsigned int)zero;<br />    const vector unsigned int u_two      = vec_splat_u32(2);<br />    const vector unsigned int u_fifteen  = vec_splat_u32(15);<br />    const vector unsigned int u_2shift15 = vec_sl(u_two, u_fifteen);<br />    const vector unsigned int u_signmask = vec_sl(u_2shift15, u_fifteen);<br />    const vector unsigned int u_nznz     = vec_mergeh(u_signmask, u_zero);<br />    const vector unsigned int u_znzn     = vec_mergeh(u_zero, u_signmask);<br /></pre>

u_signmask, after its initialization, contains the vector:
  { 0x80000000, 0x80000000, 0x80000000, 0x80000000 }
vec_mergeh then interleaves it with zero values to produce the final constants.
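The arithmetic behind this construction can be checked on plain integers. The sketch below models a single 32-bit lane and the interleaving done by vec_mergeh; the function names are hypothetical. The two-step shift is needed because the splat-immediate forms only accept a 5-bit signed literal, so 15 is the largest shift count that can be splatted directly.

```c
#include <stdint.h>

/* One lane of the construction: (2 << 15) << 15 == 0x80000000. */
static uint32_t build_sign_bit(void)
{
    const uint32_t two         = 2u;               /* vec_splat_u32(2)  */
    const uint32_t fifteen     = 15u;              /* vec_splat_u32(15) */
    const uint32_t two_shift15 = two << fifteen;   /* 0x00010000        */

    return two_shift15 << fifteen;                 /* 0x80000000        */
}

/* Model of vec_mergeh on 32-bit elements: { a0, b0, a1, b1 }. */
static void mergeh_u32(uint32_t out[4], const uint32_t a[4], const uint32_t b[4])
{
    out[0] = a[0];
    out[1] = b[0];
    out[2] = a[1];
    out[3] = b[1];
}
```

Merging the sign mask with zero gives nznz, and swapping the two arguments gives znzn.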
<p>
Another part of the code can also be improved by using shift instructions instead of multiplications:
</p><pre class="code">    const vector float m_c2u1_c3u2 = vec_madd(c2u1, c3u2, zero);<br /></pre>
can be replaced by:
<pre class="code">    const vector float m_c2u1_c3u2 = vec_sld(m_c2u2_c3u3, m_c2u2_c3u3, 12);<br /></pre>

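This works because rotating the product of two vectors gives the same result as multiplying the two rotated vectors: rotating c2u2 * c3u3 up by three more floats yields c2u1 * c3u2. On the PPU, vec_madd and vec_sld aren't on the same pipeline, so they may now be executed in parallel.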
<div class="quote-site">
[cf: Appendix A.3.2 in the released pdf by IBM: <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>].
</div>
<p>

The optimized version of the previous PPU code can be found here: <a href="http://cellperformance-snippets.googlecode.com/files/inverse_v4.h">inverse_v4.h</a>
</p><p>The final version of the inverse matrix for PPU where the whole
code has been placed in a single function can be downloaded here: <a href="http://cellperformance-snippets.googlecode.com/files/inverse_v5.h">inverse_v5.h</a>

</p><div class="subtitle">
SPU version
</div>
The SPU is a very powerful calculator with a rich set of intrinsic
instructions, and even if some altivec functions don't have a direct
equivalent, they can be rebuilt from the intrinsic set.
<p>
Some altivec instructions have a direct equivalent as SPU intrinsic instructions:
</p><ul><li> vec_madd is replaced by either spu_madd or simply by spu_mul when the  third parameter is zero.</li><li> vec_xor, vec_re, vec_sub respectively become spu_xor, spu_re, spu_sub
</li></ul>
Some others require a workaround:
<ul><li> vec_sld </li><li> vec_mergeh </li><li> vec_mergel </li></ul>For the PPU, the need to build the constant values was clear.
On the SPU, by contrast, the loading time is almost
nothing; the SPU even has instructions to extract and insert values and
to create constant values on the fly without going through memory.
<div class="quote-cite">
[Table B-1 in the Appendix B.1.2 in the released pdf by IBM: <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>]
</div>There is no need to worry too much about creating constant values.
Constants can replace the PPU calculation for nznz and znzn. The
instruction spu_shuffle with associated constant values will be used to
replace vec_mergeh, vec_mergel and vec_sld.
Five different shuffling patterns have to be created. To explain them,
the four floats of the first vector will be designated by the letters:
X, Y, Z, and W. The four floats of the second vector will be: A, B, C,
D.
<p>
To replace the sld function, three patterns are required:
</p><ul><li> YZWX to replace: sld(v, v, 4)
</li><li> ZWXY to replace: sld(v, v, 8)
</li><li> WXYZ to replace: sld(v, v, 12)
</li></ul>

To replace vec_mergeh, and vec_mergel, only two other patterns are defined: XAYB, ZCWD
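The five patterns can be written down as index tables and tried out with a word-level model of the shuffle. This is a sketch only: the real spu_shuffle pattern selects individual bytes (with the second vector's bytes numbered after the first's), whereas the hypothetical shuffle_words below selects whole floats, with indices 0 to 3 for X, Y, Z, W and 4 to 7 for A, B, C, D.

```c
/* Word-level model of a two-input shuffle. */
static void shuffle_words(float out[4], const float first[4],
                          const float second[4], const int pattern[4])
{
    int i;

    for (i = 0; i < 4; i++) {
        const int sel = pattern[i] & 7;
        out[i] = (sel < 4) ? first[sel] : second[sel - 4];
    }
}

/* Rotations, replacing sld(v, v, 4), sld(v, v, 8) and sld(v, v, 12). */
static const int PAT_YZWX[4] = { 1, 2, 3, 0 };
static const int PAT_ZWXY[4] = { 2, 3, 0, 1 };
static const int PAT_WXYZ[4] = { 3, 0, 1, 2 };

/* Interleaves, replacing vec_mergeh and vec_mergel. */
static const int PAT_XAYB[4] = { 0, 4, 1, 5 };
static const int PAT_ZCWD[4] = { 2, 6, 3, 7 };
```

For the three rotation patterns the same vector is passed as both inputs, just as with sld(v, v, n).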
<p>
The SPU version of the inverse matrix can be found here: <a href="http://cellperformance-snippets.googlecode.com/files/inverse_v6.h">inverse_v6.h</a>

</p><div class="subtitle">Summary</div>
<ul><li> Avoid algorithms which deal with special cases (like the Gauss-Jordan elimination).</li><li> Start with a simple scalar implementation. </li><li> Unroll the loops and group the code which can be executed in
parallel and which follows the same patterns (like in the cofactor
function). </li><li> Get used to the data manipulation instructions (vec_mergeh, vec_sld, spu_shuffle...). </li><li> Look at the generated assembly code. </li><li> Prefer to build the PPU data on the fly instead of loading them from memory. </li></ul>

<div class="subtitle">About The Author</div>
Cedric Lallain is a Senior Programmer working on PS3/Cell
research at Highmoon Studios (Vivendi Games).<br />
In past years, Cedric mostly worked on AI. He also optimized 
(high and low level optimization) some PS2 code to help his game reach an 
acceptable frame rate.<br />
The last game he worked on is Darkwatch for Highmoon Studios.<br />
Previously he was lead AI programmer on Street Racing Syndicate at Eutechnyx.

]]>
    </content>
</entry>

<entry>
    <title>Understanding Strict Aliasing</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html" />
    <id>tag:cellperformance.beyond3d.com,2006:/articles//3.2</id>

    <published>2006-06-02T06:53:12Z</published>
    <updated>2009-08-07T07:39:18Z</updated>

    <summary>UPDATED! (08 Aug 06) More Clarifications! Special thanks to Nicolas Riesch, André de Leiradella and pinskia for their comments and suggestions. UPDATED! (28 Dec 06) Minor fixes. Special thanks to Kobi Cohen-Arazi and Chris Pickett. Aliasing One pointer is said...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[<ul><div class="sticky-note"><strong>UPDATED! (08 Aug 06) More Clarifications! 
Special thanks to <span class="monospace-strong">Nicolas Riesch</span>, <span class="monospace-strong">André de Leiradella</span> and <span class="monospace-strong">pinskia</span> for their comments and suggestions.
</strong></div><div class="sticky-note"><strong>UPDATED! (28 Dec 06) Minor fixes. 
Special thanks to <span class="monospace-strong">Kobi Cohen-Arazi</span> and <span class="monospace-strong">Chris Pickett</span>.
</strong></div><div class="subtitle">Aliasing</div>

One pointer is said to <i>alias</i> another pointer when both refer to the same location or object. In this example,
<pre class="code"><span class="line-number">  0</span>uint32_t <br /><span class="line-number">  1</span>swap_words( uint32_t arg )<br /><span class="line-number">  2</span>{<br /><span class="line-number">  3</span>  uint16_t* const sp = (uint16_t*)&amp;arg;<br /><span class="line-number">  4</span>  uint16_t        hi = sp[0];<br /><span class="line-number">  5</span>  uint16_t        lo = sp[1];<br /><span class="line-number">  6</span>  <br /><span class="line-number">  7</span>  sp[1] = hi;<br /><span class="line-number">  8</span>  sp[0] = lo;<br /><span class="line-number">  9</span><br /><span class="line-number"> 10</span>  return (arg);<br /><span class="line-number"> 11</span>} <br /></pre><div class="rule-of-thumb">
Using GCC 3.4.1 and above, the above code will generate <b>warning: dereferencing type-punned pointer will break strict-aliasing rules</b> on line 3.
</div>
The memory referred to by <b>sp</b> is an alias of <b>arg</b> because they refer to the same address in memory. In C99, it is <i>illegal</i> to create an alias of a different type than the original. This is often referred to as the <b>strict aliasing</b>
rule. The rule is enabled by default in GCC at optimization level O2
and above. Although the above example would compile, the results are
undefined. More than likely, <strong>arg</strong> would be returned unchanged because a pointer to uint16_t cannot be an alias
to a pointer to uint32_t when applying the strict aliasing rule.<br /><br /><div class="rule-of-thumb">
Dereferencing a cast of a variable from one type of pointer to a different type is <em>usually</em> in violation of the strict aliasing rule.
</div>
However, having multiple representations of the same location in memory
is often beneficial. Properly balancing the compiler's memory
optimizations and the programmer's optimizations based on real-world
context and data is a bit of a black art. It requires an understanding
of the tradeoffs among what's permitted by the standard, what's the
reality of compilers and the value of a particular transformation based
on the architecture and the data. It's worth it in the end though when
the results speak for themselves.<br /><div class="sticky-note"> All of the examples in this article have been
tested with various versions of GCC. Although you can expect most of
the examples to generate similar results across the major compilers,
programmers' expectations should always be validated for the compilers
and compiler revisions required. </div><br /> 

Read on for details on the strict aliasing rule and some common pitfalls.<br /><br /></ul>]]>
        <![CDATA[<div id="introduction" class="subtitle">What is strict aliasing?</div>

<span class="monospace-strong">Strict aliasing is an assumption, made by the C (or C++) compiler, 
that dereferencing pointers to objects of different types will never refer to the same memory location 
(i.e. alias each other).</span>
<br /><br />

Here are some basic examples of assumptions that may be made by the compiler when strict aliasing is
enabled:<br />
<br />

<b>Pointers to different built-in types do not alias:</b>
<pre class="code"><span class="line-number">  0</span>int16_t* foo;<br /><span class="line-number">  1</span>int32_t* bar;<br /></pre>
The compiler will assume that <span class="monospace-strong">*foo</span> and <span class="monospace-strong">*bar</span>
never refer to the same location.
<br /><br />

<b>Pointers to aggregate or union types with differing tags do not alias:</b>
<pre class="code"><span class="line-number">  0</span>typedef struct<br /><span class="line-number">  1</span>{<br /><span class="line-number">  2</span>  uint16_t a;<br /><span class="line-number">  3</span>  uint16_t b;<br /><span class="line-number">  4</span>  uint16_t c;<br /><span class="line-number">  5</span>} Foo;<br /><span class="line-number">  6</span><br /><span class="line-number">  7</span>typedef struct<br /><span class="line-number">  8</span>{<br /><span class="line-number">  9</span>  uint16_t a;<br /><span class="line-number"> 10</span>  uint16_t b;<br /><span class="line-number"> 11</span>  uint16_t c;<br /><span class="line-number"> 12</span>} Bar;<br /><span class="line-number"> 13</span><br /><span class="line-number"> 14</span>Foo* foo;<br /><span class="line-number"> 15</span>Bar* bar;<br /></pre>
The compiler will assume that <span class="monospace-strong">*foo</span> and <span class="monospace-strong">*bar</span>
never refer to the same location, even though the contents of the structures are the same.
<br /><br />

<b>Pointers to aggregate or union types which differ only by name may alias:</b>
<pre class="code"><span class="line-number">  0</span>typedef struct<br /><span class="line-number">  1</span>{<br /><span class="line-number">  2</span>  uint16_t a;<br /><span class="line-number">  3</span>  uint16_t b;<br /><span class="line-number">  4</span>  uint16_t c;<br /><span class="line-number">  5</span>} Foo;<br /><span class="line-number">  6</span><br /><span class="line-number">  7</span>typedef Foo Bar;<br /><span class="line-number">  8</span><br /><span class="line-number">  9</span>Foo* foo;<br /><span class="line-number"> 10</span>Bar* bar;<br /></pre>
The compiler will assume that <span class="monospace-strong">*foo</span> and <span class="monospace-strong">*bar</span>
may refer to the same location, and will not perform the optimizations described below.
<br /><br />


<div id="benefits" class="subtitle">Benefits to The Strict Aliasing Rule</div>
When the compiler cannot assume that two objects are not aliased, it
must act very conservatively when accessing memory. For example:
<pre class="code"><span class="line-number">  0</span>typedef struct<br /><span class="line-number">  1</span>{<br /><span class="line-number">  2</span>  uint16_t a;<br /><span class="line-number">  3</span>  uint16_t b;<br /><span class="line-number">  4</span>  uint16_t c;<br /><span class="line-number">  5</span>} Sample;<br /><span class="line-number">  6</span><br /><span class="line-number">  7</span>void<br /><span class="line-number">  8</span>test( uint32_t* values,<br /><span class="line-number">  9</span>      Sample*   uniform,<br /><span class="line-number"> 10</span>      uint64_t  count )<br /><span class="line-number"> 11</span>{<br /><span class="line-number"> 12</span>  uint64_t i;<br /><span class="line-number"> 13</span><br /><span class="line-number"> 14</span>  for (i=0;i&lt;count;i++)<br /><span class="line-number"> 15</span>  {<br /><span class="line-number"> 16</span>    values[i] += (uint32_t)uniform-&gt;b;<br /><span class="line-number"> 17</span>  }<br /><span class="line-number"> 18</span>}<br /></pre>

Compiled with <b><span style="color: rgb(255, 0, 0);">-fno-strict-aliasing</span> -O3 -std=c99</b> on the <span style="color: rgb(255, 0, 255);">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code"><span class="line-number">  0</span>test:<br /><span class="line-number">  1</span>  li     10, 0      # i      = 0<br /><span class="line-number">  2</span>  cmpld  7,  10, 5  # done   = (i==count)<br /><span class="line-number">  3</span>  bgelr- 7          # if (done) return<br /><span class="line-number">  4</span>  mtctr  5          # ctr    = count<br /><span class="line-number">  5</span>.L8:<br /><span class="line-number">  6</span>  sldi   11, 10, 2  # offset = i * 4<br /><span class="line-number">  7</span><span style="color: rgb(255, 0, 0);">  lwz    9,  4(4)   # b      = *(uniform+4)</span><br /><span class="line-number">  8</span>  addi   10, 10, 1  # i++<br /><span class="line-number">  9</span>  lwzx   5,  11, 3  # value  = *(values+offset)<br /><span class="line-number"> 10</span>  add    0,  5,  9  # value  = value + b<br /><span class="line-number"> 11</span>  stwx   0,  11, 3  # *(values+offset) = value<br /><span class="line-number"> 12</span>  bdnz  .L8         # if (ctr--) goto .L8<br /><span class="line-number"> 13</span>  blr               # return<br /></pre>

In this case <b>uniform-&gt;b</b> <i>must</i> be loaded during each iteration of the loop. This is because the compiler cannot be certain that <b>values</b> does not overlap <b>b</b> in memory. If, in fact, they do overlap, the programmer would expect that <b>uniform-&gt;b</b> would be properly updated and the values stored into the <b>values</b> array adjusted accordingly. The only method for the compiler to guarantee these results is reloading <b>uniform-&gt;b</b> at every iteration.<br />
<br />

It was noted that this case is extremely uncommon in <i>most</i> code and the decision was made to <i>presume</i>
objects of different types are not aliased and to be more aggressive
with optimizations. The fact that this presumption would break
some existing code was certainly discussed in detail. It must have been decided
that those most likely to use memory aliasing techniques for
optimization are few, and that those who do use them are the most willing
and capable of making the necessary changes. <br />
<br />

The result, even for this small case, can make a significant performance impact. Compiled with <b><span style="color: rgb(255, 0, 0);">-fstrict-aliasing</span> -Wstrict-aliasing=2 -O3 -std=c99</b> on the <span style="color: rgb(255, 0, 255);">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre class="code"><span class="line-number">  0</span>test:<br /><span class="line-number">  1</span>  li     11,0     # i      = 0<br /><span class="line-number">  2</span>  cmpld  7,11,5   # done   = (i == count)<br /><span class="line-number">  3</span>  bgelr- 7        # if (done) return<br /><span class="line-number">  4</span><span style="color: rgb(255, 0, 0);">  lhz    4,2(4)   # b      = uniform.b</span><br /><span class="line-number">  5</span>  mtctr  5        # ctr    = count<br /><span class="line-number">  6</span>.L8:<br /><span class="line-number">  7</span>  sldi   9,11,2   # offset = i * 4<br /><span class="line-number">  8</span>  addi   11,11,1  # i++<br /><span class="line-number">  9</span>  lwzx   5,9,3    # value  = *(values+offset)<br /><span class="line-number"> 10</span>  add    0,5,4    # value  = value + b<br /><span class="line-number"> 11</span>  stwx   0,9,3    # *(values+offset) = value<br /><span class="line-number"> 12</span>  bdnz   .L8      # if (ctr--) goto .L8<br /><span class="line-number"> 13</span>  blr             # return<br /></pre>

The load of <b>b</b> is now only done once, outside the loop. For more examples of optimizations for non-aliasing memory see: <a href="http://web.archive.org/web/20071223232457/http://www.cellperformance.com/mike_acton/2006/05/demystifying_the_restrict_keyw.html">Demystifying The Restrict Keyword</a>
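
When the result must not depend on which aliasing flags are in effect, the same hoisting can be written into the source. The following is a minimal sketch, not taken from the original test case: the function name add_uniform_b is hypothetical, and the parameter types are tightened (const, size_t) for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
  uint16_t a;
  uint16_t b;
  uint16_t c;
} Sample;

/* Hypothetical helper: adds uniform->b to every element of values.
   The member is copied into a local before the loop, so the single
   load is explicit in the source rather than left to alias analysis. */
void
add_uniform_b( uint32_t* values, const Sample* uniform, size_t count )
{
  assert( values != NULL && uniform != NULL );

  const uint32_t b = (uint32_t)uniform->b;  /* loaded once, before the loop */
  size_t         i;

  for (i = 0; i < count; i++)
  {
    values[i] += b;
  }
}
```

Because b is a local whose address is never taken, no store through values can change it, so the compiler has no reason to reload it inside the loop with or without strict aliasing enabled.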

<div id="compatible_type" class="subtitle">Casting Compatible Types</div>
Aliases are permitted for types that only differ by qualifier or sign.
<pre class="code"><span class="line-number">  0</span>uint32_t<br /><span class="line-number">  1</span>test( uint32_t a )<br /><span class="line-number">  2</span>{<br /><span class="line-number">  3</span>  uint32_t* const       a0 = &amp;a;<br /><span class="line-number">  4</span>  uint32_t* volatile    a1 = &amp;a;<br /><span class="line-number">  5</span>  int32_t*              a2 = (int32_t*)&amp;a;<br /><span class="line-number">  6</span>  int32_t* const        a3 = (int32_t*)&amp;a;<br /><span class="line-number">  7</span>  int32_t* volatile     a4 = (int32_t*)&amp;a;<br /><span class="line-number">  8</span>  const int32_t* const  a5 = (int32_t*)&amp;a;<br /><span class="line-number">  9</span><br /><span class="line-number"> 10</span>  (*a0)++;<br /><span class="line-number"> 11</span>  (*a1)++;<br /><span class="line-number"> 12</span>  (*a2)++;<br /><span class="line-number"> 13</span>  (*a3)++;<br /><span class="line-number"> 14</span>  (*a4)++;<br /><span class="line-number"> 15</span><br /><span class="line-number"> 16</span>  return (*a5);<br /><span class="line-number"> 17</span>}<br /></pre>
In this case <b>a0</b>-<b>a5</b> are all valid aliases of <b>a</b> and this function will return <b>(a + 5)</b>.
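
As a small illustration of the sign rule, the sketch below reads a uint32_t through a pointer to int32_t to test its top bit. The function name top_bit_is_set is made up for this example; it relies only on the fact that int32_t, where it exists, is a two's complement type.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 when the most significant bit of
   value is set, 0 otherwise. The uint32_t object is read through an
   int32_t lvalue; the signed and unsigned versions of the same type
   are permitted to alias. */
int
top_bit_is_set( uint32_t value )
{
  const int32_t* const as_signed = (const int32_t*)&value;

  return (*as_signed < 0);
}
```

A plain comparison against 0x80000000 would do the same job; the point here is only that this particular cast is a permitted alias.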

<div class="sticky-note">
GCC has two flags to enable warnings related to strict aliasing. <b>-Wstrict-aliasing</b> enables warnings for most common errors related to type-punning. <b>-Wstrict-aliasing=2</b> attempts to warn about a larger class of cases, however false positives may be returned.</div>

<div id="union_1" class="subtitle">Casting through a union (1)</div>

The most commonly accepted method of converting one type of object to another is by using a union type as in this example:
<pre class="code"><span class="line-number">  0</span>typedef union<br /><span class="line-number">  1</span>{<br /><span class="line-number">  2</span>  uint32_t u32;<br /><span class="line-number">  3</span>  uint16_t u16[2];<br /><span class="line-number">  4</span>}<br /><span class="line-number">  5</span>U32;<br /><span class="line-number">  6</span><br /><span class="line-number">  7</span>uint32_t<br /><span class="line-number">  8</span>swap_words( uint32_t arg )<br /><span class="line-number">  9</span>{<br /><span class="line-number"> 10</span>  U32      in;<br /><span class="line-number"> 11</span>  uint16_t lo;<br /><span class="line-number"> 12</span>  uint16_t hi;<br /><span class="line-number"> 13</span><br /><span class="line-number"> 14</span>  in.u32    = arg;<br /><span class="line-number"> 15</span>  hi        = in.u16[0];<br /><span class="line-number"> 16</span>  lo        = in.u16[1];<br /><span class="line-number"> 17</span>  in.u16[0] = lo;<br /><span class="line-number"> 18</span>  in.u16[1] = hi;<br /><span class="line-number"> 19</span><br /><span class="line-number"> 20</span>  return (in.u32);<br /><span class="line-number"> 21</span>}<br /></pre>
This method is not properly called <i>casting</i> at all (although it may be called <em>type-punning</em>)
as the value is simply copied into a union which permits aliasing
among its members. From a performance point of view, this method relies
on the ability of the optimizer to remove the redundant stores and
loads. When using recent versions of GCC, if the transformation is
reasonably simple, it is very likely that the compiler will be able to
remove the redundancies and produce an optimal code sequence.<br />

<div class="sticky-note">
Strictly speaking, reading a member of a union different from the one
written to is undefined in ANSI/ISO C99 except in the special case of
type-punning to a <b>char*</b>, similar to the example below: <a href="#cast_to_char_pointer">Casting to <b>char*</b></a>.
However, it is an extremely common idiom and is well-supported by all
major compilers. As a practical matter, reading and writing to any
member of a union, in any order, is acceptable practice.</div>

For example, when compiled with <span class="monospace-strong">GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8)</span>, the argument is simply rotated 16 bits.
<pre class="code"><span class="line-number">  0</span>swap_words:<br /><span class="line-number">  1</span>  rlwinm r3,r3,16,0xffffffff<br /><span class="line-number">  2</span>  blr<br /></pre>

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU, the loads and stores are removed but the instruction sequence is less than optimal.
<pre class="code"><span class="line-number">  0</span>swap_words:<br /><span class="line-number">  1</span>  slwi    4,3,16     ; hi    = arg &lt;&lt; 16<br /><span class="line-number">  2</span>  rldicl  3,3,48,48  ; lo    = arg &gt;&gt; 16<br /><span class="line-number">  3</span>  or      0,4,3      ; out   = hi | lo;<br /><span class="line-number">  4</span>  rldicl  3,0,0,32   ; final = out &amp; 0xffffffff<br /><span class="line-number">  5</span>  blr<br /></pre>
<br />

In order to generate reasonably good code across both the GCC3 and GCC4 families, use C99-style initializers:
<pre class="code"><span class="line-number">  0</span>uint32_t<br /><span class="line-number">  1</span>swap_words( uint32_t arg )<br /><span class="line-number">  2</span>{<br /><span class="line-number">  3</span>  U32    in  = { .u32=arg };<br /><span class="line-number">  4</span>  U32    out = { .u16[0]=in.u16[1], <br /><span class="line-number">  5</span>                 .u16[1]=in.u16[0] };<br /><span class="line-number">  6</span><br /><span class="line-number">  7</span>  return (out.u32);<br /><span class="line-number">  8</span>}<br /></pre>

Compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the 32 bit build of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code"><span class="line-number">  0</span>swap_words:<br /><span class="line-number">  1</span>  stwu 1,-16(1)              ; Push stack<br /><span class="line-number">  2</span>  rlwinm 3,3,16,0xffffffff   ; Rotate 16 bits<br /><span class="line-number">  3</span>  addi 1,1,16                ; Pop stack<br /><span class="line-number">  4</span>  blr<br /></pre>

<div class="sticky-note">
It is a peculiarity of the 32 bit build of GCC 3.4.1 for the Cell PPU that the stack is <i>always</i> pushed and popped regardless of whether or not it is used. 

</div>

<div class="rule-of-thumb">
This method is most valuable for use with primitive types which can be returned <i>by value</i>.
This is because it relies on doing a complete copy of the object (by value) and removing the redundancies. 
With more complex aggregate or union types copying may be done on the stack or through the memcpy function
and redundancies are harder to eliminate.
</div>
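
One way to check a union-based conversion like this is to compare it against the plain arithmetic it is expected to reduce to. The sketch below repeats the C99 initializer version next to a rotate written with shifts; rotate16 is a name invented for this comparison. Whichever half the target's byte order places in u16[0], exchanging the two halves is the same as rotating by 16 bits, so the two functions should agree on any target where reading the other union member is supported.

```c
#include <stdint.h>

typedef union
{
  uint32_t u32;
  uint16_t u16[2];
}
U32;

/* Swap the two 16-bit halves of arg by copying it through a union. */
uint32_t
swap_words( uint32_t arg )
{
  U32 in  = { .u32 = arg };
  U32 out = { .u16[0] = in.u16[1],
              .u16[1] = in.u16[0] };

  return (out.u32);
}

/* Reference version (hypothetical name): rotate by 16 bits using shifts. */
uint32_t
rotate16( uint32_t arg )
{
  return ( arg << 16 ) | ( arg >> 16 );
}
```

For example, both functions map 0x11223344 to 0x33441122, and applying either one twice returns the original value.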

<div id="union_2" class="subtitle">Casting through a union (2)</div>

Casting proper may be done between a pointer to a type and a pointer to an aggregate or union type which contains a member of a <a href="#compatible_type">compatible type</a>, as in the following example:
<pre class="code"><span class="line-number">  0</span>uint32_t<br /><span class="line-number">  1</span>swap_words( uint32_t arg )<br /><span class="line-number">  2</span>{<br /><span class="line-number">  3</span>  U32*     in = (U32*)&amp;arg;<br /><span class="line-number">  4</span>  uint16_t lo = in-&gt;u16[0];<br /><span class="line-number">  5</span>  uint16_t hi = in-&gt;u16[1];<br /><span class="line-number">  6</span><br /><span class="line-number">  7</span>  in-&gt;u16[0] = hi;<br /><span class="line-number">  8</span>  in-&gt;u16[1] = lo;<br /><span class="line-number">  9</span><br /><span class="line-number"> 10</span>  return (in-&gt;u32);<br /><span class="line-number"> 11</span>}<br /></pre>

<b>in</b> is a pointer to a <b>U32</b> type, which contains the member <b>u32</b> which is of type <b>uint32_t</b> which is compatible with <b>arg</b>, which is also of type <b>uint32_t</b>.


<div class="sticky-note">
The above source when compiled with GCC 4.0 with the <b>-Wstrict-aliasing=2</b> flag enabled will generate a warning. This warning is an example of a <b>false positive</b>. This type of cast is  allowed and will generate the appropriate code (see below). It is documented clearly that <b>-Wstrict-aliasing=2</b> may return false positives.</div>

Compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on <span class="monospace-strong">GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8)</span>,

<pre class="code"><span class="line-number">  0</span>swap_words:<br /><span class="line-number">  1</span>  stw r3,24(r1)  ; Store arg<br /><span class="line-number">  2</span>  lhz r0,24(r1)  ; Load hi<br /><span class="line-number">  3</span>  lhz r2,26(r1)  ; Load lo<br /><span class="line-number">  4</span>  sth r0,26(r1)  ; Store result[1] = hi<br /><span class="line-number">  5</span>  sth r2,24(r1)  ; Store result[0] = lo<br /><span class="line-number">  6</span>  lwz r3,24(r1)  ; Load result<br /><span class="line-number">  7</span>  blr            ; Return<br /></pre>
GCC is extremely poor at combining loads and stores done through a
pointer to a union type as can be seen from the generated code above.
The output is a very naive interpretation of the source and would
perform badly compared to the previous examples on most architectures.<br /> 
<br />

However, once this fact is accounted for, this method can be very useful. Rather than copying the argument <i>by value</i>,
which is problematic on large or complex structures, a pointer can be
passed in and the value modified directly. If the loads and stores can
be combined in the source the results will usually be excellent.
<div class="sticky-note">
<i>"But when the address of a variable is taken, 
doesn't the compiler force it to be stored in memory rather than in a register?"</i>
<br /><br />
Yes, both a store and a load may then be generated as part of the trace. However, when alias analysis is done it
can be determined that the object cannot be changed by another mechanism, so the load and store may be marked as
redundant and removed.
</div>

<div class="rule-of-thumb">
Do not rely on the compiler to combine loads and stores. The programmer is <i>always</i> better equipped to make those decisions based on alignment concerns and complex instruction penalty rules.
</div>

<pre class="code"><span class="line-number">  0</span>void<br /><span class="line-number">  1</span>swap_words( uint16_t* arg )<br /><span class="line-number">  2</span>{<br /><span class="line-number">  3</span>  U32*     combined = (U32*)arg;<br /><span class="line-number">  4</span>  uint32_t start    = combined-&gt;u32;<br /><span class="line-number">  5</span>  uint32_t lo       = start &gt;&gt; 16;<br /><span class="line-number">  6</span>  uint32_t hi       = start &lt;&lt; 16;<br /><span class="line-number">  7</span>  uint32_t final    = lo | hi;<br /><span class="line-number">  8</span><br /><span class="line-number">  9</span>  combined-&gt;u32 = final;<br /><span class="line-number"> 10</span>}<br /></pre>

Compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on <span class="monospace-strong">GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8)</span>,
<pre class="code"><span class="line-number">  0</span>swap_words:<br /><span class="line-number">  1</span>  lwz r0,0(r3)                ; Load arg<br /><span class="line-number">  2</span>  rlwinm r0,r0,16,0xffffffff  ; Rotate 16 bits<br /><span class="line-number">  3</span>  stw r0,0(r3)                ; Store arg<br /><span class="line-number">  4</span>  blr                         ; Return<br /></pre>

<div class="rule-of-thumb">
If the above source is called as a <i>non-inline</i>
function, there will be a significant penalty on most architectures
waiting for the load before the rotate and the store on return.<br />

If the above source is called as an <i>inline</i> function, it can be safely assumed the load and store will be removed by the compiler as redundant.
</div>

<div class="sticky-note">
In C99, a <b>static inline</b> function,
which may be included in a header file, differs from automatic inlining
in that the function may be defined multiple times (e.g. included by
multiple source files). Each definition of a <b>static inline</b> function must be identical.
</div>

<pre class="code"><span class="line-number">  0</span>static inline void<br /><span class="line-number">  1</span>swap_words( uint16_t* arg )<br /><span class="line-number">  2</span>{<br /><span class="line-number">  3</span>  U32*     combined = (U32*)arg;<br /><span class="line-number">  4</span>  uint32_t start    = combined-&gt;u32;<br /><span class="line-number">  5</span>  uint32_t lo       = start &gt;&gt; 16;<br /><span class="line-number">  6</span>  uint32_t hi       = start &lt;&lt; 16;<br /><span class="line-number">  7</span>  uint32_t final    = lo | hi;<br /><span class="line-number">  8</span><br /><span class="line-number">  9</span>  combined-&gt;u32 = final;<br /><span class="line-number"> 10</span>}<br /></pre>

<div class="rule-of-thumb">
With some care, this method is the most appropriate for modifying large or complex structures by multiple types.
</div>

<div id="union_3" class="subtitle">Casting through a union (3)</div>

Occasionally a programmer may encounter the following <span style="color: rgb(255, 0, 0);">INVALID</span> method for creating an alias with a pointer of a different type:
<pre class="code"><span class="line-number">  0</span>typedef union <br /><span class="line-number">  1</span>{<br /><span class="line-number">  2</span>  uint16_t* sp; <br /><span class="line-number">  3</span>  uint32_t* wp;<br /><span class="line-number">  4</span>} U32P;<br /><span class="line-number">  5</span><br /><span class="line-number">  6</span>uint32_t <br /><span class="line-number">  7</span>swap_words( uint32_t arg )<br /><span class="line-number">  8</span>{<br /><span class="line-number">  9</span>  U32P             in = { .wp = &amp;arg };<br /><span class="line-number"> 10</span>  const uint16_t   hi = in.sp[0];<br /><span class="line-number"> 11</span>  const uint16_t   lo = in.sp[1];<br /><span class="line-number"> 12</span>  <br /><span class="line-number"> 13</span>  in.sp[0] = lo;<br /><span class="line-number"> 14</span>  in.sp[1] = hi;<br /><span class="line-number"> 15</span><br /><span class="line-number"> 16</span>  return ( arg ); <span style="color: rgb(255, 0, 0);">&lt;-- RESULT IS UNDEFINED</span><br /><span class="line-number"> 17</span>} <br /></pre>
The problem with this method is although <b>U32P</b> does in fact say that <b>sp</b> is an alias for <b>wp</b>, it does not say anything about the relationship between the values pointed to by <b>sp</b> and <b>wp</b>. This differs in a critical way from <a href="#union_1">"Casting Through a Union (1)"</a> and <a href="#union_2">"Casting Through a Union (2)"</a> which both define aliases for the <i>values being pointed to</i>, not the pointers themselves.<br />
<br />
The presumption of strict aliasing remains true: Two pointers of
different types are assumed, except in a few very limited conditions <a href="#c99_standard">specified in the C99 standard</a>, not to alias. This is <b>not</b> one of those exceptions.

<div class="sticky-note">
The above source when compiled with GCC 3.4.1 or GCC 4.0 with the <b>-Wstrict-aliasing=2</b> flag enabled will <b>NOT</b> generate a warning. This should serve as an example to <i>always</i>
check the generated code. Warnings are often helpful hints, but they
are by no means exhaustive and do not always detect when a programmer
makes an error. Like any piece of software, a compiler has limits.
Knowing them can <i>only</i> be helpful.</div>

For example, when compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on <span class="monospace-strong">GNU C version 4.0.0 (Apple Computer, Inc. build 5026) (powerpc-apple-darwin8)</span>,
<pre class="code"><span class="line-number">  0</span>swap_words:      ; <span style="color: rgb(255, 0, 0);">RETURNS ARG UNCHANGED</span><br /><span class="line-number">  1</span>  lhz r0,24(r1)  ; Load lo from stack (<i>What value?!</i>)<br /><span class="line-number">  2</span>  lhz r2,26(r1)  ; Load hi from stack (<i>What value?!</i>)<br /><span class="line-number">  3</span>  stw r3,24(r1)  ; Store arg to stack<br /><span class="line-number">  4</span>  sth r0,26(r1)  ; Store hi to stack<br /><span class="line-number">  5</span>  sth r2,24(r1)  ; Store lo to stack<br /><span class="line-number">  6</span>  blr            ; Return<br /></pre>

In this case, notice that because <b>hi</b>, <b>lo</b> and <b>arg</b> are assumed not to alias, 
the resulting order of instructions is meaningless:
<ul><li><span class="monspace-strong">[Line 1]: </span><b>lo</b> is loaded from the stack before anything is stored to the stack</li><li><span class="monspace-strong">[Line 2]: </span><b>hi</b> is loaded from the stack before anything is stored to the stack</li><li><span class="monspace-strong">[Line 3]: </span><b>arg</b> is stored to the stack, but this value will not be read.</li><li><span class="monspace-strong">[Line 4]: </span><b>hi</b> is stored to the stack, but this value will not be read.</li><li><span class="monspace-strong">[Line 5]: </span><b>lo</b> is stored to the stack, but this value will not be read.</li></ul>

Or when compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color: rgb(255, 0, 255);">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code"><span class="line-number">  0</span>swap_words:     # <span style="color: rgb(255, 0, 0);">RETURNS ARG UNCHANGED</span><br /><span class="line-number">  1</span>  stw 3,48(1)   # Store arg to stack<br /><span class="line-number">  2</span>  lhz 9,48(1)   # Load hi<br /><span class="line-number">  3</span>  lhz 0,50(1)   # Load lo<br /><span class="line-number">  4</span>  lwz 3,48(1)   # Load arg<br /><span class="line-number">  5</span>  sth 0,48(1)   # Store hi to stack<br /><span class="line-number">  6</span>  sth 9,50(1)   # Store lo to stack<br /><span class="line-number">  7</span>  blr           # Return<br /></pre>

Or when compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color: rgb(255, 0, 255);">32 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code"><span class="line-number">  0</span>swap_words:     # <span style="color: rgb(255, 0, 0);">RETURNS ARG UNCHANGED</span><br /><span class="line-number">  1</span>  stwu 1,-16(1) # Push stack<br /><span class="line-number">  2</span>  addi 1,1,16   # Pop stack<br /><span class="line-number">  3</span>  blr           # Return <br /></pre>


<div id="cast_to_char_pointer" class="subtitle">Casting to <i>char*</i></div>
  
It is always presumed that a <b>char*</b> may refer to an alias of any object. It is therefore quite safe, if perhaps a bit <i>suboptimal</i> (for architectures with wide loads and stores), to cast any pointer of any type to a <b>char*</b> type.
<pre class="code"><span class="line-number">  0</span>uint32_t <br /><span class="line-number">  1</span>swap_words( uint32_t arg )<br /><span class="line-number">  2</span>{<br /><span class="line-number">  3</span>  char* const cp = (char*)&amp;arg;<br /><span class="line-number">  4</span>  const char  c0 = cp[0];<br /><span class="line-number">  5</span>  const char  c1 = cp[1];<br /><span class="line-number">  6</span>  const char  c2 = cp[2];<br /><span class="line-number">  7</span>  const char  c3 = cp[3];<br /><span class="line-number">  8</span><br /><span class="line-number">  9</span>  cp[0] = c2;<br /><span class="line-number"> 10</span>  cp[1] = c3;<br /><span class="line-number"> 11</span>  cp[2] = c0;<br /><span class="line-number"> 12</span>  cp[3] = c1;<br /><span class="line-number"> 13</span><br /><span class="line-number"> 14</span>  return (arg);<br /><span class="line-number"> 15</span>} <br /></pre>
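
The same exemption makes it possible to write helpers that inspect an object of any type one byte at a time. The following is a rough sketch under that assumption: bytes_equal is a hypothetical name, and unsigned char is used instead of plain char so the byte values compare without sign surprises.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the first size bytes of the two
   objects are identical, 0 otherwise. Any object may be read through
   a pointer to a character type, so no aliasing rule is broken. */
int
bytes_equal( const void* lhs, const void* rhs, size_t size )
{
  assert( lhs != NULL && rhs != NULL );

  const unsigned char* const a = (const unsigned char*)lhs;
  const unsigned char* const b = (const unsigned char*)rhs;
  size_t                     i;

  for (i = 0; i < size; i++)
  {
    if (a[i] != b[i])
    {
      return 0;
    }
  }

  return 1;
}
```

It could be called with the addresses of two uint32_t values and sizeof(uint32_t), for instance. As noted above, the byte-wide loads make this slower than a single wide compare on most architectures.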

The converse is not true. Casting a <b>char*</b> to a pointer of any type other than a <b>char*</b> and dereferencing it is <em>usually</em> in violation of the strict aliasing rule.
<div class="rule-of-thumb">In other words, casting from a pointer of one type to a pointer of an unrelated type through a <b>char*</b> and then dereferencing the result is <b>undefined</b>. </div>

<pre class="code"><span class="line-number">  0</span>uint32_t<br /><span class="line-number">  1</span>test( uint32_t arg )<br /><span class="line-number">  2</span>{<br /><span class="line-number">  3</span>  char*     const cp = (char*)&amp;arg;<br /><span class="line-number">  4</span>  uint16_t* const sp = (uint16_t*)cp;<br /><span class="line-number">  5</span><br /><span class="line-number">  6</span>  sp[0] = 0x0001;<br /><span class="line-number">  7</span>  sp[1] = 0x0002;<br /><span class="line-number">  8</span><br /><span class="line-number">  9</span>  return (arg);<br /><span class="line-number"> 10</span>}<br /></pre>

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color: rgb(255, 0, 255);">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre class="code"><span class="line-number">  0</span>test:<br /><span class="line-number">  1</span>  stw 3, 48(1)   # arg stored to stack<br /><span class="line-number">  2</span>  li  0, 1       # hi = 0x0001<br /><span class="line-number">  3</span>  li  9, 2       # lo = 0x0002<br /><span class="line-number">  4</span>  lwz 3, 48(1)   # result = loaded from stack<br /><span class="line-number">  5</span>  sth 0, 48(1)   # store hi to stack<br /><span class="line-number">  6</span>  sth 9, 50(1)   # store lo to stack<br /><span class="line-number">  7</span>  blr            # return (result) <span style="color: rgb(255, 0, 0);">&lt;-- RETURNS ARG UNCHANGED</span><br /></pre>

As noted by Pinskla, it is not dereferencing a <b>char*</b> per se that is specifically recognized
as a potential alias of any object, but any address referring to a <b>char</b> object. This includes an array of <b>char</b>
objects, as in the following example, which will also break the strict aliasing assumption.

<pre class="code"><span class="line-number">  0</span>  char      const cp[4] = { arg0, arg1, arg2, arg3 };<br /><span class="line-number">  1</span>  uint16_t* const sp    = (uint16_t*)cp;<br /><span class="line-number">  2</span><br /><span class="line-number">  3</span>  sp[0] = 0x0001;<br /><span class="line-number">  4</span>  sp[1] = 0x0002;<br /></pre>
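Where the halves of a word really do need to be written independently, one well-defined alternative is to copy the bytes with <b>memcpy</b>, which accesses the destination as characters. The following is a minimal sketch rather than code from the examples above: the name <b>set_halves</b> is hypothetical, and which half ends up in the high bits depends on the byte order of the target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the article): overwrite the two 16-bit
 * halves of a 32-bit value. memcpy copies the bytes as characters, so
 * the strict aliasing rule is not violated. */
uint32_t
set_halves( uint32_t arg, uint16_t first, uint16_t second )
{
  unsigned char* const cp = (unsigned char*)&arg;

  memcpy( cp,                 &first,  sizeof(first)  ); /* bytes 0..1 */
  memcpy( cp + sizeof(first), &second, sizeof(second) ); /* bytes 2..3 */

  return (arg);
}
```

Compilers commonly reduce small fixed-size copies like these to plain stores, but as with everything else here, check the generated code rather than assume it.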


<div id="gcc_rule_breaking" class="subtitle">GCC RULE BREAKING</div>
GCC allows type-punned values to be dereferenced at independent locations
in memory (i.e. different objects) when the source of the lvalue is not
directly known.<br />

<pre class="code"><span class="line-number">  0</span>void<br /><span class="line-number">  1</span>set_value( uint64_t* c, <br /><span class="line-number">  2</span>           uint32_t  a_val, <br /><span class="line-number">  3</span>           uint16_t  b_val ) <br /><span class="line-number">  4</span>{<br /><span class="line-number">  5</span>  uint32_t* a = (uint32_t*)c;<br /><span class="line-number">  6</span>  uint16_t* b = (uint16_t*)c;<br /><span class="line-number">  7</span>  <br /><span class="line-number">  8</span>  a[0] = a_val; // &lt;--- Address of c + 0<br /><span class="line-number">  9</span>  b[2] = b_val; // &lt;--- Address of c + 4<br /><span class="line-number"> 10</span>  b[3] = b_val; // &lt;--- Address of c + 6<br /><span class="line-number"> 11</span>}<br /></pre>

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color: rgb(255, 0, 255);">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre class="code"><span class="line-number">  0</span>set_value:<br /><span class="line-number">  1</span>  stw 4,0(3)   # (c+0) = a_val<br /><span class="line-number">  2</span>  sth 5,6(3)   # (c+6) = b_val<br /><span class="line-number">  3</span>  sth 5,4(3)   # (c+4) = b_val<br /><span class="line-number">  4</span>  blr          # return (c)<br /></pre>

Note any use of <b>c[0]</b> here would be (more?) undefined because it would alias the uses of <b>a</b> and <b>b</b>.

<pre class="code"><span class="line-number">  0</span>void<br /><span class="line-number">  1</span>set_value( uint64_t* c, <br /><span class="line-number">  2</span>           uint32_t  a_val, <br /><span class="line-number">  3</span>           uint16_t  b_val ) <br /><span class="line-number">  4</span>{<br /><span class="line-number">  5</span>  uint32_t* a = (uint32_t*)c;<br /><span class="line-number">  6</span>  uint16_t* b = (uint16_t*)c;<br /><span class="line-number">  7</span>  <br /><span class="line-number">  8</span>  a[0] = a_val; // &lt;--- Address of c + 0<br /><span class="line-number">  9</span>  b[2] = b_val; // &lt;--- Address of c + 4<br /><span class="line-number"> 10</span>  b[3] = b_val; // &lt;--- Address of c + 6<br /><span class="line-number"> 11</span>  <br /><span class="line-number"> 12</span>  <span style="color: rgb(255, 0, 0);">// WHAT VALUE THIS WOULD PRINT IS UNDEFINED</span><br /><span class="line-number"> 13</span>  printf("c = 0x%016llx\n", (unsigned long long)c[0] ); <br /><span class="line-number"> 14</span>}<br /></pre>

However, when <b>set_value</b> is compiled inline (perhaps automatically), the source of <b>c</b> may be known and GCC will assume the values do <b>not</b> alias and may reduce the expression differently and generate completely different code.

<pre class="code"><span class="line-number">  0</span>static inline void<br /><span class="line-number">  1</span>set_value( uint64_t* c, <br /><span class="line-number">  2</span>           uint32_t  a_val, <br /><span class="line-number">  3</span>           uint16_t  b_val ) <br /><span class="line-number">  4</span>{<br /><span class="line-number">  5</span>  uint32_t* a = (uint32_t*)c;<br /><span class="line-number">  6</span>  uint16_t* b = (uint16_t*)c;<br /><span class="line-number">  7</span>  <br /><span class="line-number">  8</span>  a[0] = a_val; // &lt;--- Address of c + 0<br /><span class="line-number">  9</span>  b[2] = b_val; // &lt;--- Address of c + 4<br /><span class="line-number"> 10</span>  b[3] = b_val; // &lt;--- Address of c + 6<br /><span class="line-number"> 11</span>}<br /></pre>

<pre class="code"><span class="line-number">  0</span>int64_t<br /><span class="line-number">  1</span>test( int64_t  a<br /><span class="line-number">  2</span>     ,int64_t  b<br /><span class="line-number">  3</span>     ,uint32_t hi32<br /><span class="line-number">  4</span>     ,uint16_t lo16 )<br /><span class="line-number">  5</span>{<br /><span class="line-number">  6</span>  int64_t c = a + b;<br /><span class="line-number">  7</span><br /><span class="line-number">  8</span>  set_value( &amp;c, hi32, lo16 );<br /><span class="line-number">  9</span><br /><span class="line-number"> 10</span>  return (c);<br /><span class="line-number"> 11</span>}<br /></pre>

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color: rgb(255, 0, 255);">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre id="test_set_value_original" class="code"><span class="line-number">  0</span>test:<br /><span class="line-number">  1</span>  add 3,3,4    # c = (a+b)<br /><span class="line-number">  2</span>  blr          # return (c)<br /></pre>

In this case because the object <b>c</b> is never accessed through any <i>valid</i> aliases in <b>set_value</b>, the expression is reduced out.

<div class="sticky-note"> The above example will <strong>NOT</strong> currently generate any warnings with <b>-Wstrict-aliasing=2</b> and will simply generate <i>different</i>
results depending on whether or not the expression is inlined. This is
another good reason to always double check the generated code. Also,
when writing unit tests, it is a good idea to test a function both as
an inline function and an extern function.</div>
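A minimal sketch of that testing advice follows. It is not the article's code: <b>set_value_inline</b> and <b>set_value_extern</b> are hypothetical names, the body uses <b>memcpy</b> so that the behavior is well defined, and <b>__attribute__((noinline))</b> is a GCC extension used here (as an assumption about the toolchain) to keep one copy out of line.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical well-defined variant of set_value: the bytes are copied
 * with memcpy instead of being stored through type-punned pointers. */
static inline void
set_value_inline( uint64_t* c, uint32_t a_val, uint16_t b_val )
{
  unsigned char* const bytes = (unsigned char*)c;

  memcpy( bytes + 0, &a_val, sizeof(a_val) ); /* Address of c + 0 */
  memcpy( bytes + 4, &b_val, sizeof(b_val) ); /* Address of c + 4 */
  memcpy( bytes + 6, &b_val, sizeof(b_val) ); /* Address of c + 6 */
}

/* Same operation, kept out of line at its call sites (GCC attribute). */
__attribute__((noinline)) void
set_value_extern( uint64_t* c, uint32_t a_val, uint16_t b_val )
{
  set_value_inline( c, a_val, b_val );
}
```

A unit test would call both versions with the same arguments and require identical results. For code that breaks the aliasing rule, such as the original <b>set_value</b>, that comparison is exactly where the difference would be expected to show up.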

<div class="sticky-note"> With GCC, strict aliasing warnings are <em>more likely</em> to be generated at the point where an address is taken (e.g. <span class="monospace-strong">uint16_t* a = (uint16_t*)&amp;b;</span>) than with pre-existing pointers (e.g. <span class="monospace-strong">uint16_t* a = (uint16_t*)b_ptr;</span>). Take special care when type-punning pre-existing pointers. </div>
Perhaps surprisingly, illegal aliasing within a loop generates
completely different results. It is probably not completely accidental
though, as most of the historical arguments <i>against</i> strict aliasing have revolved around optimized versions of functions like <b>memset</b> and <b>memcpy</b> which would cast the data to the widest available register size to minimize the trips to and from memory.

<pre class="code"><span class="line-number">  0</span>void<br /><span class="line-number">  1</span>set_value( uint64_t* c,<br /><span class="line-number">  2</span>           uint32_t  a_val,<br /><span class="line-number">  3</span>           uint16_t  b_val,<br /><span class="line-number">  4</span>           uint32_t  count )<br /><span class="line-number">  5</span>{<br /><span class="line-number">  6</span>  uint32_t* a  = (uint32_t*)c;<br /><span class="line-number">  7</span>  uint16_t* b  = (uint16_t*)c;<br /><span class="line-number">  8</span>  uint32_t  i  = 0;<br /><span class="line-number">  9</span><br /><span class="line-number"> 10</span>  for (i=0;i&lt;count;i++,a++,b+=2)<br /><span class="line-number"> 11</span>  {<br /><span class="line-number"> 12</span>    a[0]  = a_val;<br /><span class="line-number"> 13</span>    b[2]  = b_val;<br /><span class="line-number"> 14</span>    b[3]  = b_val;<br /><span class="line-number"> 15</span>  }<br /><span class="line-number"> 16</span>}<br /></pre>

As expected from the previous example above, this should still generate the "expected" result:<br />
<br />

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color: rgb(255, 0, 255);">32 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code"><span class="line-number">  0</span>set_value:<br /><span class="line-number">  1</span>  cmpwi 0, 6, 0   # done = (count == 0)<br /><span class="line-number">  2</span>  stwu  1, -16(1) # Push stack<br /><span class="line-number">  3</span>  mr    9, 3      # Copy c<br /><span class="line-number">  4</span>  beq-  0, .L7    # if (done) goto .L7<br /><span class="line-number">  5</span>  mtctr 6         # i = count<br /><span class="line-number">  6</span>.L8:<br /><span class="line-number">  7</span>  stw   4, 0(9)   # a[0] = a_val<br /><span class="line-number">  8</span>  addi  9, 9, 4   # a++<br /><span class="line-number">  9</span>  sth   5, 4(3)   # b[2] = b_val<br /><span class="line-number"> 10</span>  sth   5, 6(3)   # b[3] = b_val<br /><span class="line-number"> 11</span>  addi  3, 3, 4   # b+=2<br /><span class="line-number"> 12</span>  bdnz  .L8       # if (i) goto .L8<br /><span class="line-number"> 13</span>.L7:<br /><span class="line-number"> 14</span>  addi  1, 1, 16  # Pop stack<br /><span class="line-number"> 15</span>  blr             # return<br /></pre>

When called inline, the previous example would suggest that the compiler, assuming <b>c</b> is not aliased, would also return <span class="monospace-strong">(a + b)</span>:<br />
<br />

<pre class="code"><span class="line-number">  0</span>int64_t<br /><span class="line-number">  1</span>test_loop( int64_t  a,<br /><span class="line-number">  2</span>           int64_t  b,<br /><span class="line-number">  3</span>           uint32_t hi32,<br /><span class="line-number">  4</span>           uint16_t lo16,<br /><span class="line-number">  5</span>           uint32_t count )<br /><span class="line-number">  6</span>{<br /><span class="line-number">  7</span>  static int64_t c[ C_COUNT ];<br /><span class="line-number">  8</span><br /><span class="line-number">  9</span>  c[0] = a + b;<br /><span class="line-number"> 10</span><br /><span class="line-number"> 11</span>  set_value( c, hi32, lo16, count );<br /><span class="line-number"> 12</span><br /><span class="line-number"> 13</span>  return (c[0]);<br /><span class="line-number"> 14</span>}<br /></pre>

When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color: rgb(255, 0, 255);">32 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.
<pre class="code"><span class="line-number">  0</span>test_loop:<br /><span class="line-number">  1</span>  lis   12, c.0@ha      # cloc     = location of c<br /><span class="line-number">  2</span>  mr.   0,  9           # i        = count<br /><span class="line-number">  3</span>  la    11, c.0@l(12)   # c        = *cloc<br /><span class="line-number">  4</span>  addc  10, 4, 6        # c1       = addlo (a,b)<br /><span class="line-number">  5</span>  adde  9,  3, 5        # c2       = addhi (a,b)<br /><span class="line-number">  6</span>  stwu  1, -16(1)       # Push stack<br /><span class="line-number">  7</span>  stw   9,  0(11)       # c[0].hi  = c2<br /><span class="line-number">  8</span>  mr    6,  11          # a        = c<br /><span class="line-number">  9</span>  stw   10, 4(11)       # c[0].lo  = c1<br /><span class="line-number"> 10</span>  mr    9,  11          # b        = c<br /><span class="line-number"> 11</span>  beq-  0,  .L19        # if (i==0) goto .L19<br /><span class="line-number"> 12</span>  mtctr 0               # i        = count<br /><span class="line-number"> 13</span>.L20:<br /><span class="line-number"> 14</span>  stw   7,  0(9)        # a[0]     = hi32<br /><span class="line-number"> 15</span>  addi  9,  9, 4        # a++<br /><span class="line-number"> 16</span>  sth   8,  4(6)        # b[2]     = lo16<br /><span class="line-number"> 17</span>  sth   8,  6(6)        # b[3]     = lo16<br /><span class="line-number"> 18</span>  addi  6,  6, 4        # b+=2<br /><span class="line-number"> 19</span>  bdnz  .L20            # if (i) goto .L20<br /><span class="line-number"> 20</span>.L19:<br /><span class="line-number"> 21</span>  la    9,  c.0@l(12)   # c        = *cloc<br /><span class="line-number"> 22</span>  addi  1,  1, 16       # Pop stack<br /><span class="line-number"> 23</span>  lwz   3,  0(9)        # result.hi = c[0].hi<br /><span class="line-number"> 24</span>  lwz   4,  4(9)        # result.lo = c[0].lo<br /><span 
class="line-number"> 25</span>  blr                   # return (result)<br /></pre>

The result is clearly different from the <a href="#test_set_value_original">original version</a> without the loop.<br />
<br />

It is not the existence of the loop in the source that changes the transformation, but rather the existence of a loop <i>after</i>
the initial optimization passes. For example, GCC is fairly good at
optimizing (unrolling) loops with a fixed iteration count. Examine the
following example:
<pre class="code"><span class="line-number">  0</span>int64_t<br /><span class="line-number">  1</span>test_noloop( int64_t  a,<br /><span class="line-number">  2</span>             int64_t  b,<br /><span class="line-number">  3</span>             uint32_t hi32,<br /><span class="line-number">  4</span>             uint16_t lo16 )<br /><span class="line-number">  5</span>{<br /><span class="line-number">  6</span>  int64_t c = a + b;<br /><span class="line-number">  7</span><br /><span class="line-number">  8</span>  set_value( &amp;c, hi32, lo16, 1 );<br /><span class="line-number">  9</span><br /><span class="line-number"> 10</span>  return (c);<br /><span class="line-number"> 11</span>}<br /></pre>It wouldn't be completely outrageous to expect the above example
to generate similar, albeit unrolled, code. That is unless you know to
expect simple loop transformations to be done fairly early in the
compilation process and alias analysis to be done later. When compiled
with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color: rgb(255, 0, 255);">32 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre class="code"><span class="line-number">  0</span>test_noloop:      # &lt;--- RETURNS (A+B)<br /><span class="line-number">  1</span>  stwu 1,-16(1)   # Push stack<br /><span class="line-number">  2</span>  addc 4,4,6      # c.lo = addlo(a,b)<br /><span class="line-number">  3</span>  adde 3,3,5      # c.hi = addhi(a,b)<br /><span class="line-number">  4</span>  addi 1,1,16     # Pop stack<br /><span class="line-number">  5</span>  blr             # return (c)<br /></pre>

<div class="sticky-note">
The existence of a loop around accessed aliases and whether or not the
iteration count is known at compile time may impact the generated code.
Tests should include both constant and <b>extern</b>'d iteration counts.
</div>
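As a rough sketch of what such a pair of tests might exercise (an illustration only, not code from this article): <b>fill_values</b> is a hypothetical, well-defined variant of the looped <b>set_value</b>, and the volatile global <b>runtime_count</b> is an assumption standing in for an <b>extern</b>'d count that the optimizer cannot see.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical well-defined variant of the looped set_value.
 * Each iteration advances 4 bytes and writes 8, as in the original,
 * so the buffer must hold at least (4 * count) + 4 bytes. */
static inline void
fill_values( uint64_t* c, uint32_t a_val, uint16_t b_val, uint32_t count )
{
  unsigned char* bytes = (unsigned char*)c;

  for (uint32_t i=0;i<count;i++,bytes+=4)
  {
    memcpy( bytes + 0, &a_val, sizeof(a_val) );
    memcpy( bytes + 4, &b_val, sizeof(b_val) );
    memcpy( bytes + 6, &b_val, sizeof(b_val) );
  }
}

/* Assumption: a volatile global stands in for an extern'd iteration
 * count, so the compiler cannot fold the loop at compile time. */
volatile uint32_t runtime_count = 1;
```

One test calls <b>fill_values</b> with the literal count <b>1</b> and another with <b>runtime_count</b>, on identically initialized buffers, and the two results are compared.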
What is surprising is that the 64 bit build of the same version of the
same compiler generates different results. When compiled with <b>-fstrict-aliasing -O3 -Wstrict-aliasing -std=c99</b> on the <span style="color: rgb(255, 0, 255);">64 bit build</span> of <span class="monospace-strong">GNU C version 3.4.1 (CELL 2.3, Jul 21 2005) (powerpc64-linux)</span> for the Cell PPU.

<pre class="code"><span class="line-number">  0</span>test_loop:<br /><span class="line-number">  1</span>  li     10, 0           # i = 0<br /><span class="line-number">  2</span>  cmplw  7,  10, 7       # done = (i==count)<br /><span class="line-number">  3</span>  add    4,  3, 4        # sum  = a + b<br /><span class="line-number">  4</span>  ld     3,  .LC0@toc(2) # cloc = location of c<br /><span class="line-number">  5</span>  std    4,  0(3)        # c[0] = sum<br /><span class="line-number">  6</span>  mr     9,  3           # a    = c<br /><span class="line-number">  7</span>  mr     11, 3           # b    = c<br /><span class="line-number">  8</span>  bge-   7,  .L18        # if (done) goto .L18<br /><span class="line-number">  9</span>.L22:<br /><span class="line-number"> 10</span>  addi   0,  10, 1       # i++<br /><span class="line-number"> 11</span>  stw    5,  0(11)       # a[0] = hi32<br /><span class="line-number"> 12</span>  rldicl 10, 0, 0, 32    # i    = i &amp; 0xffffffff<br /><span class="line-number"> 13</span>  sth    6,  4(9)        # b[2] = lo16<br /><span class="line-number"> 14</span>  sth    6,  6(9)        # b[3] = lo16<br /><span class="line-number"> 15</span>  cmplw  7,  10, 7       # done = (i==count)<br /><span class="line-number"> 16</span>  addi   11, 11, 4       # a++<br /><span class="line-number"> 17</span>  addi   9,  9, 4        # b+= 2<br /><span class="line-number"> 18</span>  blt+   7,  .L22        # if (!done) goto .L22<br /><span class="line-number"> 19</span>.L18:<br /><span class="line-number"> 20</span>  ld     3,0(3)          # result = c[0]<br /><span class="line-number"> 21</span>  blr                    # return (result)<br /></pre>

This indicates that there are significant <b>non-obvious</b> side-effects to building GCC as 32 bits versus 64 bits that <em>someone might want to look into</em>.
<div class="sticky-note">
The platform, version number and build data (i.e. the output of <span class="monospace-strong">gcc --version</span>)
are not sufficient information for compatibility testing. To be
thorough, unit tests should be run across all builds of the same
compiler, if more than one is known to exist.</div>

<div id="c99_standard" class="subtitle">C99 Standard</div>
This article has been pretty relaxed with the use of terminology and
there is always room for some interpretation when reading a standard.
There are many additional cases not covered above and compiler specific
issues to consider. But for those interested in up-to-date definitive
information on the C standard refer to <a href="http://web.archive.org/web/20071223232457/http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf">ISO/IEC 9899:TC2 [open-std.org]</a>. Here is the most relevant text from section "6.5 Expressions":<br />
<br />
<br />
<div class="monospace-strong">
An object shall have its stored value accessed only by an lvalue expression that has one of
the following types:
<ul><li>a type compatible with the effective type of the object,</li><li>a qualified version of a type compatible with the effective type of the object,</li><li>a type that is the signed or unsigned type corresponding to the effective type of the
object,</li><li>a type that is the signed or unsigned type corresponding to a qualified version of the
effective type of the object,</li><li>an aggregate or union type that includes one of the aforementioned types among its
members (including, recursively, a member of a subaggregate or contained union), or</li><li>a character type.</li></ul>
</div>
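To make two of the items in that list concrete, here is a small sketch (function names are hypothetical, not from the standard or this article) of an access through the corresponding unsigned type and an access through a character type, with a disallowed access left as a comment.

```c
#include <stdint.h>

/* Allowed: uint32_t is the unsigned type corresponding to the
 * effective type (int32_t) of the object. */
uint32_t
read_as_unsigned( int32_t* value )
{
  return *(uint32_t*)value;
}

/* Allowed: a character type may access the stored value of any object. */
unsigned char
read_first_byte( uint32_t* value )
{
  return *(unsigned char*)value;
}

/* NOT allowed: float is unrelated to uint32_t, so this dereference
 * would be undefined under the rule quoted above.
 *
 *   float bad = *(float*)value;
 */
```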

<div class="sticky-note">
Note the use of types like <b>uint64_t</b> and <b>uint32_t</b>
in the above examples. For decades programmers have been creating their
own integer types and reworking their header files for each platform
simply to get consistent integer sizes across multiple architectures.
This is because the standard does not guarantee types like <b>int</b> or <b>short</b> to be of any <i>particular</i>
width; it only guarantees their sizes relative to each other. But
finally, with C99, the debate is over. Standard width integers are now
defined in <b>stdint.h</b>. <i>Always</i> use this header, and if your
implementation does not have it (e.g. Microsoft), there are portable
public domain versions available (e.g. This <a href="http://web.archive.org/web/20071223232457/http://www.cs.colorado.edu/%7Emain/cs1300/include/stdint.h">stdint.h</a> can be used for Win32).
</div>

<div id="summary" class="subtitle">Summary</div>

<ul><li>Strict aliasing means the compiler may assume that pointers to objects of different types never
refer to the same location in memory. Enable this option in GCC with
the <strong>-fstrict-aliasing</strong> flag. Be sure that <i>all</i> code can safely run with this rule enabled. Enable strict aliasing related warnings with <strong>-Wstrict-aliasing</strong>, but do not expect to be warned in all cases. </li><li>In order to discover aliasing problems as quickly as possible, <b>-fstrict-aliasing</b>
should always be included in the compilation flags for GCC. Otherwise
problems may only be visible at the highest optimization levels where
it is the most difficult to debug.</li></ul>

<div class="sticky-note">
Be wary of code that <i>requires</i> the use of <b>-fno-strict-aliasing</b>
(turns off strict aliasing at any level) in order to work. This is a
very good indication that the code relies on aliased memory access and
is likely to be dominated by poor memory access patterns. At the very
least, only the minimum number of files should have it disabled, and
only because time has not permitted their repair <i>yet</i>. Although
it may seem complex to properly alias memory, the tests where it is
really necessary for performance are actually quite few and should
already be tested rigorously. It is unlikely that code that does not
enable strict aliasing would be able to take advantage of the <b>restrict</b>
keyword. Using the restrict keyword allows a significant class of
memory access optimizations critical to high performance code. For more
information on the restrict keyword see: <a href="http://web.archive.org/web/20071223232457/http://www.cellperformance.com/mike_acton/2006/05/demystifying_the_restrict_keyw.html">Demystifying The Restrict Keyword</a>
</div>

]]>
    </content>
</entry>

<entry>
    <title>Demystifying The Restrict Keyword</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html" />
    <id>tag:cellperformance.beyond3d.com,2006:/articles//3.4</id>

    <published>2006-05-30T05:38:59Z</published>
    <updated>2009-08-26T04:18:09Z</updated>

    <summary> UPDATED! More examples! More detailed explanations! Contract The restrict keyword can be considered an extension to the strict aliasing rule. It allows the programmer to declare that pointers which share the same type (or were otherwise validly created) do...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[<div class="sticky-note"><strong>
UPDATED! More examples! More detailed explanations! </strong></div>
<div class="subtitle">Contract</div>
The restrict keyword can be considered an extension to the strict
aliasing rule. It allows the programmer to declare that pointers which
share the same type (or were otherwise validly created) <b>do not</b>
alias each other. By using restrict, the programmer can declare that any
loads and stores through the qualified pointer (or through another
pointer copied either directly or indirectly from the restricted
pointer) are the <b>only</b> loads and stores to the same address
during the lifetime of the pointer. In other words, the pointer is not
aliased by any pointers other than its own copies.<br />
<br />
<div class="rule-of-thumb">
Restrict is a "no data hazards will be generated" contract between the
programmer and the compiler. The compiler relies on this information to
make optimizations. If the data is, in fact, aliased, the results are
undefined and a programmer should not expect the compiler to output a
warning. The compiler assumes the programmer is not <i>lying</i>.</div>
<br />
<br />
<div class="contract-header">THE RESTRICT CONTRACT</div>
<div class="contract">
I, [insert your name], a PROFESSIONAL or AMATEUR [circle one] programmer recognize that there are
limits to what a compiler can do. I certify that, to the best of my knowledge, there are no magic
elves or monkeys in the compiler which through the forces of fairy dust can always make code faster.
I understand that there are some problems for which there is not enough information to solve. I 
hereby declare that given the opportunity to provide the compiler with sufficient information,
perhaps through some key word, I will gladly use said keyword and not bitch and moan about how 
"the compiler should be doing this for me."<br />
<br />
In this case, I promise that the pointer declared along with the restrict qualifier is not aliased.
I certify that writes through this pointer will not affect the values read through any other pointer
available in the same context which is also declared as restricted.<br />
<br />
* Your agreement to this contract is implied by use of the restrict keyword ;)
</div>
<br />
<br />
Read on for more information on the practical use and benefits to using the restrict keyword...
            ]]>
        <![CDATA[<div class="subtitle">Restrict is a type qualifier</div>

<div class="quote"> A new feature of C99: The restrict type qualifier
allows programs to be written so that translators can produce
significantly faster executables. [...] Anyone for whom this is not a
concern can safely ignore this feature of the language.</div>
<div class="quote-cite"> -- <a href="http://std.dkuug.dk/JTC1/SC22/WG14/www/C99RationaleV5.10.pdf">From Rationale for International Standard - Programming Languages - C [std.dkuug.dk]</a> (6.7.3.1 Formal definition of restrict)</div>
<br />

The restrict keyword is a type qualifier for pointers and is a formal part of the C99 standard.<br />
<br />
Example usage:
<div class="code">
int* restrict foo;
</div>
Notice that the restrict keyword qualifies the pointer and not the object being pointed to.
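As a slightly larger illustration (a sketch only; the function <b>scale</b> and its parameters are hypothetical), here is how the qualifier typically appears on function parameters. Note where each qualifier sits relative to the <b>*</b>.

```c
#include <stddef.h>

/* restrict follows the '*', so it qualifies the pointer itself.
 * const precedes the '*' on src, so it qualifies the floats pointed to.
 * The caller promises that dst and src do not overlap. */
void
scale( float* restrict dst, const float* restrict src, float factor, size_t count )
{
  for (size_t i=0;i<count;i++)
  {
    dst[i] = src[i] * factor;
  }
}
```

Calling <b>scale</b> with two distinct arrays honors the promise. Calling it with overlapping ranges would break it, and the result would be undefined.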

<div class="sticky-note">
Not all compilers are compliant with the C99 standard. For example, Microsoft's compiler does not support the C99 standard <i>at all</i>. If you are using MSVC on an x86 platform you will not have access to this critical optimization option.<br />
</div>

<div class="sticky-note">
When using GCC, remember to enable the C99 standard by adding <b>-std=c99</b> to your compilation flags. In code that cannot be compiled with C99, use either <b>__restrict</b> or <b>__restrict__</b> to enable the keyword as a GCC extension.<br />
</div>
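One common way to handle both cases in shared code is a small wrapper macro. The following is a sketch under stated assumptions: the macro name <b>RESTRICT</b> and the function <b>add_arrays</b> are hypothetical, and the fallback for unknown compilers simply drops the qualifier.

```c
#include <stddef.h>

/* Hypothetical portability macro; the name RESTRICT is an assumption. */
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)
#  define RESTRICT restrict      /* C99 or later: the standard keyword    */
#elif defined(__GNUC__)
#  define RESTRICT __restrict__  /* GCC extension, usable outside of C99  */
#else
#  define RESTRICT               /* unknown compiler: no qualifier at all */
#endif

/* Written without C99-only syntax so it also compiles in older modes. */
void
add_arrays( int* RESTRICT out, const int* RESTRICT in, size_t count )
{
  size_t i;

  for (i=0;i<count;i++)
  {
    out[i] += in[i];
  }
}
```

Dropping the qualifier in the fallback case keeps the code correct, since restrict only ever adds a promise. It just gives up the optimization.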

<div class="sticky-note">
The restrict keyword was not included as part of the C++98 standard. However, some C++ compilers <i>may</i> support it as an extension. When restrict is used in C++, it is important to remember that the implicit <i>this</i> pointer should also be restricted. Consult your compiler's manual for how to do this, if possible.
</div>

<div class="rule-of-thumb">An understanding of the <a href="http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html">strict aliasing rule</a> will provide good context for problems related to the restrict keyword. </div>

<div class="subtitle">Why was restrict introduced into C99?</div>
<div class="quote">
The problem that the restrict qualifier addresses is that potential
aliasing can inhibit optimizations. Specifically, if a translator
cannot determine that two different pointers are being used to
reference different objects, then it cannot apply optimizations such as
maintaining the values of the objects in registers rather than in
memory, or reordering loads and stores of these values. This problem
can have a significant effect on a program that, for example, performs
arithmetic calculations on large arrays of numbers. The effect can be
measured by comparing a program that uses pointers with a similar
program that uses file scope arrays (or with a similar Fortran
program). The array version can run faster by a factor of ten or more
on a system with vector processors. Where such large performance gains
are possible, implementations have of course offered their own
solutions, usually in the form of compiler directives that specify
particular optimizations. Differences in the spelling, scope, and
precise meaning of these directives have made them troublesome to use
in a program that must run on many different systems. This was the
motivation for a standard solution.</div>
<div class="quote-cite"> -- <a href="http://std.dkuug.dk/JTC1/SC22/WG14/www/C99RationaleV5.10.pdf">From Rationale for International Standard - Programming Languages - C [std.dkuug.dk]</a> (6.7.3.1 Formal definition of restrict)</div>
<br />
In other words, proper use of the restrict keyword gives the compiler
enough information to select a more optimal order of loads and stores
to/from memory and to potentially make better use of registers to store
non-aliased objects.<br />
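The smallest illustration of the register point is probably a pair of stores followed by a read. This is a sketch with hypothetical function names, not an example taken from the Rationale.

```c
/* Without restrict, the compiler must assume b may point at the same
 * int as a, so *a has to be reloaded after the store through b. */
int
store_and_sum( int* a, int* b )
{
  *a = 1;
  *b = 2;
  return (*a + *b);   /* 3 when a and b differ, 4 when a == b */
}

/* With restrict, the programmer promises a and b do not alias, so the
 * compiler is free to keep both values in registers and fold the sum.
 * Passing the same address for both would break that promise. */
int
store_and_sum_restrict( int* restrict a, int* restrict b )
{
  *a = 1;
  *b = 2;
  return (*a + *b);
}
```

Whether a given compiler actually folds the second version to a constant is something to confirm in the generated code.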

<div class="subtitle">Non-aliased Memory Windows</div>

Given the following structure, there is a significant difference in performance between restricted and unrestricted pointers in even the smallest update loops.

<pre class="code">typedef struct vector3  vector3;<br /><br />struct vector3<br />{<br />  float x;<br />  float y;<br />  float z;<br />};<br /></pre>
What follows is a simple example function that updates some "particles"
with unrestricted pointers. Note that the pointers share the same type,
so the compiler will assume they can be aliased, per the strict
aliasing rule.
<div class="sticky-note">The example code sections in the article are
not meant to serve as examples of real production code, but rather as
examples of real <em>patterns</em> often found in production code.</div>

<pre class="code">void<br />move( vector3* velocity, <br />      vector3* position, <br />      vector3* acceleration, <br />      float    time_step, <br />      size_t   count )<br />{<br />  for (size_t i=0;i&lt;count;i++)<br />  {<br />    velocity[i].x += acceleration[i].x * time_step;<br />    velocity[i].y += acceleration[i].y * time_step;<br />    velocity[i].z += acceleration[i].z * time_step;<br />    position[i].x += velocity[i].x     * time_step;<br />    position[i].y += velocity[i].y     * time_step;<br />    position[i].z += velocity[i].z     * time_step;<br />  }<br />}<br /></pre>
<br />

<div class="sticky-note">This article will examine the assembly output
generated for the PowerPC. However, the principles and suggestions
presented are applicable to many common architectures.</div>
<pre class="code"># This code was compiled with GCC 3.4.1 for PowerPC,<br /># with the following options: <b>-O3 -fstrict-aliasing -std=c99</b><br />#<br />move:<br />  cmpwi  0,6,0<br />  stwu   1,-16(1)<br />  beq-   0,.L7<br />  li     8,0<br />  mtctr  6<br />.L8:<br />  add    9,8,3<br />  lfsx   13,8,5<br />  add    10,8,5<br />  lfsx   0,8,3<br />  lfs    8,4(9)<br />  add    11,8,4<br />  lfs    5,8(10)<br />  lfs    7,4(10)<br />  lfs    6,8(9)<br />  fmadds 4,13,1,0<br />  fmadds 3,7,1,8<br />  fmadds 2,5,1,6<br />  <span style="color: rgb(255, 0, 0);">stfsx  4,8,3      # Store velocity_x<br />  stfs   3,4(9)     # Store velocity_y<br />  stfs   2,8(9)     # Store velocity_z</span><br />  <span style="color: rgb(0, 0, 255);">lfsx   11,8,4     # Load position_x<br />  lfs    10,4(11)   # Load position_y<br />  lfs    9,8(11)    # Load position_z</span><br />  fmadds 12,4,1,11<br />  fmadds 0,3,1,10<br />  fmadds 13,2,1,9<br />  stfsx  12,8,4<br />  addi   8,8,12<br />  stfs   0,4(11)<br />  stfs   13,8(11)<br />  bdnz   .L8<br />.L7:<br />  addi   1,1,16<br />  blr<br /></pre>

Notice above that <b>position</b> must wait for <b>velocity</b> to be stored. This is because the compiler cannot guarantee that the two are not aliased and must assume that the write to <b>velocity</b> can overwrite the location where <b>position</b> will be read. Because the compiler must <i>effectively</i> perform the operations in the order declared in the source, it must assume this is the behavior the programmer intended.<br />

<div class="rule-of-thumb">
The use of unrestricted pointers inhibits the compiler's ability to
schedule loads and may cause redundant loads in many cases. With few
exceptions, accessing any value through a pointer will force the
compiler to load, or reload, the value after any store. This is because
the compiler cannot guarantee that the value being loaded was not
aliased by the value that was stored.</div>

For instance, there is no reason (other than sanity) why the programmer could not call the function in this way:
<pre class="code">void <br />call_move( vector3* some_data, float time_step, size_t count )<br />{<br />  move( some_data, some_data, some_data, time_step, count );<br />}<br /></pre>
The use of restricted pointers would specifically disallow this.<br />
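To make the hazard concrete, here is a small sketch that mirrors the unrestricted update above and compares an aliased call with an independent one. The helper names (aliased_result_x, independent_result_x) are hypothetical and exist only for this illustration. Because the pointers are unrestricted, both calls are well defined. They simply do not compute the same thing.

```c
#include <assert.h>
#include <stddef.h>

typedef struct vector3 vector3;

struct vector3
{
  float x;
  float y;
  float z;
};

/* Mirrors the unrestricted move() shown above, plus a precondition check. */
void move(vector3* velocity, vector3* position, vector3* acceleration,
          float time_step, size_t count)
{
  assert(velocity != NULL && position != NULL && acceleration != NULL);
  for (size_t i = 0; i < count; i++)
  {
    velocity[i].x += acceleration[i].x * time_step;
    velocity[i].y += acceleration[i].y * time_step;
    velocity[i].z += acceleration[i].z * time_step;
    position[i].x += velocity[i].x     * time_step;
    position[i].y += velocity[i].y     * time_step;
    position[i].z += velocity[i].z     * time_step;
  }
}

/* Hypothetical helper: all three parameters refer to the same element. */
float aliased_result_x(void)
{
  vector3 some_data[1] = { { 1.0f, 1.0f, 1.0f } };
  move(some_data, some_data, some_data, 1.0f, 1);
  return some_data[0].x;  /* x is doubled twice: 1 -> 2 -> 4 */
}

/* Hypothetical helper: three independent objects with the same start values. */
float independent_result_x(void)
{
  vector3 v = { 1.0f, 1.0f, 1.0f };
  vector3 p = { 1.0f, 1.0f, 1.0f };
  vector3 a = { 1.0f, 1.0f, 1.0f };
  move(&v, &p, &a, 1.0f, 1);
  return p.x;  /* v.x becomes 2, then p.x becomes 1 + 2 = 3 */
}
```

The compiler has to preserve the aliased result, which is why it cannot move the position loads above the velocity stores.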
<br />
Compare this to the same function working with arrays of file scope.
Working with file scope arrays represents the best case for the
compiler with regard to alias analysis and should be used as the
baseline for implementing functions with restricted pointers.
<pre class="code">vector3 velocity     [ PARTICLE_COUNT ];<br />vector3 position     [ PARTICLE_COUNT ];<br />vector3 acceleration [ PARTICLE_COUNT ];<br />&nbsp;<br />void<br />move( float time_step )<br />{<br />  for (size_t i=0;i&lt;PARTICLE_COUNT;i++)<br />  {<br />    velocity[i].x += acceleration[i].x * time_step;<br />    velocity[i].y += acceleration[i].y * time_step;<br />    velocity[i].z += acceleration[i].z * time_step;<br />    position[i].x += velocity[i].x     * time_step;<br />    position[i].y += velocity[i].y     * time_step;<br />    position[i].z += velocity[i].z     * time_step;<br />  }<br />}<br /></pre>With the above code the compiler knows the arrays will be stored
separately and can determine that they are three independent data <i>windows</i>, or <i>stripes</i>, and that there can be no aliasing among them. A data stripe can be thought of as a <i>data channel</i> made up of indexable elements. <br />
<br />

<table width="400" border="1">
  <tbody><tr>
    <th scope="col">Data Channel </th>
    <th scope="col">Channel Elements (by Index) </th>
    </tr>
  <tr>
    <td>velocity</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N] </td>
    </tr>
  <tr>
    <td>position</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
    </tr>
  <tr>
    <td>acceleration</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
    </tr>
</tbody></table>

<div class="rule-of-thumb">
An element in a restricted data stripe can be a function of one or more elements of any other restricted data stripes, but <b>cannot</b> be a function of a <i>change</i> in an element of a data stripe.</div>

<pre class="code"># This code was compiled with GCC 3.4.1 for PowerPC,<br /># with the following options: <b>-O3 -fstrict-aliasing -std=c99</b><br />#<br />move:<br />  lis    3,velocity@ha<br />  lis    11,acceleration@ha<br />  lis    9,position@ha<br />  la     6,velocity@l(3)<br />  la     5,acceleration@l(11)<br />  la     7,position@l(9)<br />  li     8,0<br />  stwu   1,-16(1)<br />  li     0,8192<br />  mtctr  0<br />.L18:<br />  add    12,8,6<br />  <span style="color: rgb(0, 0, 255);">lfsx   12,8,6     # Load  velocity     + 0</span><br />  add    10,8,5<br />  <span style="color: rgb(0, 0, 255);">lfsx   13,8,5     # Load  acceleration + 0<br />  lfs    8,4(12)    # Load  velocity     + 4</span><br />  add    4,8,7<br />  <span style="color: rgb(0, 0, 255);">lfs    5,8(10)    # Load  acceleration + 8<br />  lfs    6,8(12)    # Load  velocity     + 8<br />  lfs    7,4(10)    # Load  acceleration + 4</span><br />  fmadds 9,13,1,12<br />  fmadds 10,7,1,8<br />  fmadds 11,5,1,6<br />  <span style="color: rgb(0, 0, 255);">lfsx   4,8,7      # Load  position     + 0<br />  lfs    3,4(4)     # Load  position     + 4<br />  lfs    2,8(4)     # Load  position     + 8</span><br />  fmadds 0,9,1,4<br />  fmadds 13,10,1,3<br />  fmadds 12,11,1,2<br />  <span style="color: rgb(255, 0, 0);">stfsx  9,8,6      # Store velocity     + 0<br />  stfs   11,8(12)   # Store velocity     + 8<br />  stfs   10,4(12)   # Store velocity     + 4<br />  stfsx  0,8,7      # Store position     + 0</span><br />  addi   8,8,12<br />  <span style="color: rgb(255, 0, 0);">stfs   13,4(4)    # Store position     + 4<br />  stfs   12,8(4)    # Store position     + 8</span><br />  bdnz   .L18<br />  addi   1,1,16<br />  blr<br /></pre>

All the stores are completed at the end of the loop. More specifically, the load for <strong>position</strong> is scheduled <em>before</em> the store of <strong>velocity</strong>. This validates that the compiler has enough information to determine that the values stored do not alias the values loaded. <br />
<br />
In order to get this same behavior with non-file scope pointers, use
the restrict keyword to declare that every location which is either
loaded or stored has no aliases.
<pre class="code">void<br />move( vector3* velocity, <br />      vector3* position, <br />      vector3* acceleration, <br />      float    time_step, <br />      size_t   count, <br />      size_t   stride )<br />{<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> acceleration_x = &amp;acceleration-&gt;x;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> velocity_x     = &amp;velocity-&gt;x;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> position_x     = &amp;position-&gt;x;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> acceleration_y = &amp;acceleration-&gt;y;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> velocity_y     = &amp;velocity-&gt;y;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> position_y     = &amp;position-&gt;y;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> acceleration_z = &amp;acceleration-&gt;z;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> velocity_z     = &amp;velocity-&gt;z;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> position_z     = &amp;position-&gt;z;<br /><br />  for (size_t i=0;i&lt;count*stride;i+=stride)<br />  {<br />    velocity_x[i] += acceleration_x[i] * time_step;<br />    velocity_y[i] += acceleration_y[i] * time_step;<br />    velocity_z[i] += acceleration_z[i] * time_step;<br />    position_x[i] += velocity_x[i]     * time_step;<br />    position_y[i] += velocity_y[i]     * time_step;<br />    position_z[i] += velocity_z[i]     * time_step;<br />  }<br />}<br /></pre>Nine (9) non-aliased memory stripes were declared in the above
code. This completely defines the aliasing relationships between all
the loads and stores.<br />
<br />

<table width="400" border="1">
  <tbody><tr>
    <th scope="col">Data Channel </th>
    <th scope="col">Channel Elements (by Index) </th>
  </tr>
  <tr>
    <td>velocity_x</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N] </td>
  </tr>
  <tr>
    <td>velocity_y</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>velocity_z</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>position_x</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>position_y</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>position_z</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>acceleration_x</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>acceleration_y</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
  <tr>
    <td>acceleration_z</td>
    <td>[0] ---&gt; [1] ---&gt; [2] ---&gt; [N]</td>
  </tr>
</tbody></table>
<br />
By copying addresses from one pointer to another, an implicit
hierarchy (or tree) of pointers is created. The child pointers are
usually completely aliased by the parent pointer and it's important not
to use them both at the same time (i.e. in the same scope). When
restricted child pointers are created, consider the parent pointer to
be <i>out of scope</i> and do not make any accesses through it. Note that in this case, any use of <b>velocity</b>, <b>position</b> or <b>acceleration</b> would invalidate the restrict contract and the results would be undefined.

<pre class="ascii-art">                |---&gt; velocity_x<br />velocity -------|---&gt; velocity_y<br />                |---&gt; velocity_z<br /><br />                |---&gt; position_x<br />position -------|---&gt; position_y<br />                |---&gt; position_z<br /><br />                |---&gt; acceleration_x<br />acceleration ---|---&gt; acceleration_y<br />                |---&gt; acceleration_z<br /></pre>

<div class="rule-of-thumb">
Typically, only the leaf nodes in a hierarchy of restricted pointers should be used.</div> 

<pre class="code"># This code was compiled with GCC 3.4.1 for PowerPC,<br /># with the following options: <b>-O3 -fstrict-aliasing -std=c99</b><br />#<br />move:<br />  stwu   1,-32(1)<br />  stw    31,28(1)<br />  mullw  31,6,7<br />  stw    30,24(1)<br />  cmplwi 7,31,0<br />  mr     30,7<br />  addi   12,3,4<br />  addi   6,5,4<br />  addi   8,4,4<br />  addi   7,5,8<br />  addi   10,3,8<br />  addi   11,4,8<br />  li     9,0<br />  ble-   7,.L27<br />.L31:<br />  slwi   0,9,2<br />  <span style="color: rgb(0, 0, 255);">lfsx   13,3,0</span><span style="color: rgb(0, 0, 255);">     # Load  velocity_x</span><br />  add    9,9,30<br />  <span style="color: rgb(0, 0, 255);">lfsx   8,12,0</span><span style="color: rgb(0, 0, 255);">     # Load  velocity_y</span><br />  cmplw  7,31,9<br /> <span style="color: rgb(0, 0, 255);"> lfsx   6,10,0<span style="color: rgb(0, 0, 255);">     # Load  velocity_z</span><br />  lfsx   12,5,0<span style="color: rgb(0, 0, 255);">     # Load  acceleration_x</span><br />  lfsx   7,6,0<span style="color: rgb(0, 0, 255);">      # Load  acceleration_y</span><br />  lfsx   5,7,0<span style="color: rgb(0, 0, 255);">      # Load  acceleration_z</span></span><br />  fmadds 11,12,1,13<br />  fmadds 10,7,1,8<br />  fmadds 9,5,1,6<br />  <span style="color: rgb(0, 0, 255);">lfsx   4,4,0      # Load  position_x<br />  lfsx   3,8,0      # Load  position_y<br />  lfsx   2,11,0</span><span style="color: rgb(0, 0, 255);">     # Load  position_z</span><br />  fmadds 0,11,1,4<br />  fmadds 13,10,1,3<br />  fmadds 12,9,1,2<br />  <span style="color: rgb(255, 0, 0);">stfsx  11,3,0     # Store velocity_x<br />  stfsx  10,12,0    # Store velocity_y<br />  stfsx  9,10,0     # Store velocity_z<br />  stfsx  0,4,0      # Store position_x<br />  stfsx  13,8,0     # Store position_y<br />  stfsx  12,11,0</span><span style="color: rgb(255, 0, 0);">    # Store position_z</span><br />  bgt+   7,.L31<br />.L27:<br />  lwz    30,24(1)<br />  lwz    31,28(1)<br />  
addi   1,1,32<br />  blr<br /></pre>
This version has all the flexibility of the first (unrestricted)
version and the performance of the second (file scope arrays) version.
You should expect code where all aliasing information is declared with
the restrict keyword to <i>almost always</i> perform significantly better, and <em>never</em>
worse, than with unrestricted pointers. This is especially true on
superscalar RISC, or RISC-like architectures with large register files,
like the PowerPC or MIPS R4000. <br />
<br />
The astute reader may also have noticed that because nine (9) restricted
stripes were used instead of three (3) file scope arrays, the compiler
has been able to select a much simpler addressing scheme. Much of the
pointer arithmetic has been hoisted out of the loop. The version with
the restricted pointers is actually <i>more</i> efficient than the one with file scope arrays.

<div class="subtitle">Non-aliased Memory Access Patterns</div>

An important distinction to make is that the restrict keyword is not restricting anything. It is in fact <i>allowing</i>
the compiler to do more than it could previously. It should also be
noted that the type of the pointer that is qualified with restrict is
not important. What matters is which locations, and how many bytes at each, are
actually loaded or stored through the pointer. The restrict keyword does not
declare that the object being pointed to is completely without aliases,
only that the addresses that are loaded and stored from are unaliased.<br />
<br />
For example, the following setup would be a completely valid use of restricted pointers:
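<pre class="code">typedef struct particle particle;<br /><br />struct particle<br />{<br />  vector3 position;<br />  vector3 velocity;<br />  vector3 acceleration;<br />};<br />&nbsp;<br />[ ... ]<br />&nbsp;<br />void <br />call_move( particle* particles, float time_step, size_t count )<br />{<br />  /* Arguments follow move()'s parameter order: velocity, position, acceleration. */<br />  /* The stride is counted in floats because move() indexes float pointers. */<br />  move( &amp;particles-&gt;velocity, <br />        &amp;particles-&gt;position, <br />        &amp;particles-&gt;acceleration, <br />        time_step, <br />        count, <br />        sizeof(particle) / sizeof(float) );<br />}<br /></pre>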
Although each stripe of data is part of the same "object", none of the
accesses would be aliased. Some runtime systems try to determine
whether or not pointers are aliased by simply checking to see if the
memory windows overlap. That is not sufficient. <div class="rule-of-thumb">
Memory windows <i>can</i> overlap and still be non-aliased.
</div>
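The following is a minimal sketch of that idea, not the article's own code. It updates only the x components of an interleaved particle array through three restricted stripes. The function names are hypothetical, the stride is counted in floats, and the sketch assumes the particle structure contains nothing but floats with no padding. It also follows the article's pattern of indexing past a single member through a float pointer, which relies on that flat layout.

```c
#include <assert.h>
#include <stddef.h>

typedef struct vector3 vector3;
struct vector3 { float x; float y; float z; };

typedef struct particle particle;
struct particle { vector3 position; vector3 velocity; vector3 acceleration; };

/* Hypothetical helper. The three stripes interleave in memory, so their
   windows overlap, but no address is accessed through more than one pointer. */
void integrate_x(float* restrict       position_x,
                 float* restrict       velocity_x,
                 const float* restrict acceleration_x,
                 float                 time_step,
                 size_t                count,
                 size_t                stride)   /* stride in floats */
{
  for (size_t i = 0; i < count * stride; i += stride)
  {
    const float ax  = acceleration_x[i];
    const float vx  = velocity_x[i];
    const float px  = position_x[i];

    const float nvx = vx + (ax  * time_step);
    const float npx = px + (nvx * time_step);

    velocity_x[i]   = nvx;
    position_x[i]   = npx;
  }
}

/* Hypothetical caller: derives the three stripes from one array of particles. */
void integrate_particles_x(particle* particles, float time_step, size_t count)
{
  /* Assumption of this sketch: particle is nine tightly packed floats. */
  assert(sizeof(particle) == 9 * sizeof(float));
  integrate_x(&particles->position.x,
              &particles->velocity.x,
              &particles->acceleration.x,
              time_step,
              count,
              sizeof(particle) / sizeof(float));
}
```

A runtime check that only compared the start and end addresses of the three stripes would reject this call, even though every load and store touches a distinct location.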

<div class="subtitle">Usage and Suggestions</div>
Use of the restrict keyword should be very common. It should be used as
a standard part of all new code. Older code should be revisited as
possible to take advantage of the new optimization opportunities. It is
somewhat difficult to refactor restricted requirements into
pre-existing code as a certain amount of alias analysis must be done by
the programmer. However, for the majority of live code in typical
applications, memory access is not aliased (nor are memory windows
overlapping) and aliasing hazards will be limited to a small fraction
of the code base.<br />

<div class="rule-of-thumb">
Before modifying code to use the restrict keyword, ensure that all code can compile safely with strict aliasing enabled.
</div>
Programmers using functions that make assumptions about aliasing must
know what those assumptions are. Certainly, if at all possible, memory
usage patterns should be documented. However, at the very least,
aliasing assumptions in the parameters passed to the functions should
be declared. In the above examples, the parameters <b>velocity</b>, <b>position</b> and <b>acceleration</b> must not be aliased and the restrict contract should be made public by <i>also</i> declaring those parameters restricted.

<pre class="code">void <br />move( vector3* restrict velocity, <br />      vector3* restrict position, <br />      vector3* restrict acceleration, <br />      float             time_step, <br />      size_t            count, <br />      size_t            stride );<br /></pre>
Not publishing aliasing assumptions will lead to bugs that are very
difficult to find. Programmers will not know that the data must be independent and
someone, someday will find a reason to use the same array in two or
more pointers.<br />
<br />
Take for example <b>memcpy</b>, which has been officially changed to have the following declaration:
<pre class="code">void* <br />memcpy(void*       restrict s1, <br />       const void* restrict s2, <br />       size_t               n );<br /></pre>
<i>Can you guess why?</i><br />
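One way to see the answer: the restrict qualifiers publish the long-standing rule that the source and destination of memcpy must not overlap. When they do overlap, the standard library provides memmove, which carries no such contract. The sketch below uses a hypothetical helper, shift_right_by_one, to show a copy where memmove is the right choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: shifts the first count bytes of buffer right by one.
   The source and destination windows overlap, so this would break the
   restrict contract published by memcpy. memmove allows overlap. */
void shift_right_by_one(char* buffer, size_t count)
{
  assert(buffer != NULL);
  memmove(buffer + 1, buffer, count);
}
```

The caller is responsible for making sure the buffer has room for count + 1 bytes.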

<div class="rule-of-thumb">
Use restrict in function prototypes and in structure definitions to publish the assumptions made about aliasing.
</div>
Restricted pointers can be copied from one to another to create a
hierarchy of pointers. However there is one limitation defined in the
C99 standard. The child pointer <b>must not</b>
be in the same block-level scope as the parent pointer. The result of
copying restricted pointers in the same block-level scope is undefined.
<pre class="code">{<br />  vector3* restrict position   = &amp;obj_a-&gt;position;<br />  float*   restrict position_x = &amp;position-&gt;x; /* UNDEFINED: same block as the parent */<br />  {<br />    float* restrict position_y = &amp;position-&gt;y; /* VALID: nested block */<br />  }<br />}<br /></pre>

<div class="rule-of-thumb">
Restricted child pointers must be in a different block-level scope than the parent pointer.
</div>
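A minimal sketch of a function laid out to follow this rule is shown below. The function name scale_positions is hypothetical. The restricted parent is a parameter, the restricted children are declared in a nested block, and the parent is not accessed while the children are in use. As in the earlier examples, the stride is counted in floats and vector3 is assumed to be three tightly packed floats.

```c
#include <assert.h>
#include <stddef.h>

typedef struct vector3 vector3;
struct vector3 { float x; float y; float z; };

/* Hypothetical example: scales every component of count positions. */
void scale_positions(vector3* restrict position, float scale, size_t count)
{
  assert(position != NULL);
  assert(sizeof(vector3) == 3 * sizeof(float));
  {
    /* Child pointers live in a nested block. From here on only the
       children (the leaf pointers) are used, never the parent. */
    float* restrict position_x = &position->x;
    float* restrict position_y = &position->y;
    float* restrict position_z = &position->z;
    const size_t    stride     = sizeof(vector3) / sizeof(float);

    for (size_t i = 0; i < count * stride; i += stride)
    {
      const float px = position_x[i];
      const float py = position_y[i];
      const float pz = position_z[i];

      position_x[i] = px * scale;
      position_y[i] = py * scale;
      position_z[i] = pz * scale;
    }
  }
}
```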

<br />There is one additional problem in the assembly output for the restricted version above, which
is somewhat particular to the GCC scheduler. Notice that the load for <b>position</b>
happens immediately before its update and store. The first multiply-add
will stall waiting for the first load to complete before executing. The
first float (<b>position_x</b>) <i>will not</i> be ready in three (3)
cycles. It would be considerably better (and faster) if the load could
be pushed closer to the top of the loop so that it is more likely to be
completed by the time it is needed.
<pre class="code">  <span style="color: rgb(0, 0, 255);">lfsx   4,4,0      # Load   position_x<br />  lfsx   3,8,0      # Load   position_y<br />  lfsx   2,11,0     # Load   position_z</span><br />  <span style="color: rgb(255, 0, 255);">fmadds 0,11,1,4   # Update position_x<br />  fmadds 13,10,1,3  # Update position_y<br />  fmadds 12,9,1,2   # Update position_z</span><br /></pre>Due to the order in which scheduling is done in GCC, it is always
better to simplify expressions. Do not mix memory access with
calculations. The code can be re-written as follows:
<pre class="code">void<br />move( vector3* <span style="color: rgb(0, 0, 255);">restrict</span> velocity, <br />      vector3* <span style="color: rgb(0, 0, 255);">restrict</span> position, <br />      vector3* <span style="color: rgb(0, 0, 255);">restrict</span> acceleration, <br />      float             time_step,  <br />      size_t            count, <br />      size_t            stride )<br />{<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> acceleration_x = &amp;acceleration-&gt;x;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> velocity_x     = &amp;velocity-&gt;x;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> position_x     = &amp;position-&gt;x;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> acceleration_y = &amp;acceleration-&gt;y;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> velocity_y     = &amp;velocity-&gt;y;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> position_y     = &amp;position-&gt;y;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> acceleration_z = &amp;acceleration-&gt;z;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> velocity_z     = &amp;velocity-&gt;z;<br />  float* <span style="color: rgb(0, 0, 255);">restrict</span> position_z     = &amp;position-&gt;z;<br /><br />  for (size_t i=0;i&lt;count*stride;i+=stride)<br />  {<br />    const float ax  = acceleration_x[i];<br />    const float ay  = acceleration_y[i];<br />    const float az  = acceleration_z[i];<br />    const float vx  = velocity_x[i];<br />    const float vy  = velocity_y[i];<br />    const float vz  = velocity_z[i];<br />    const float px  = position_x[i];<br />    const float py  = position_y[i];<br />    const float pz  = position_z[i];<br /><br />    const float nvx = vx + ( ax * time_step );<br />    const float nvy = vy + ( ay * time_step );<br />    const float nvz = vz + ( az * time_step );<br />    const float 
npx = px + ( vx * time_step );<br />    const float npy = py + ( vy * time_step );<br />    const float npz = pz + ( vz * time_step );<br /><br />    velocity_x[i]   = nvx;<br />    velocity_y[i]   = nvy;<br />    velocity_z[i]   = nvz;<br />    position_x[i]   = npx;<br />    position_y[i]   = npy;<br />    position_z[i]   = npz;<br />  }<br />}<br /></pre>

<pre class="code"># This code was compiled with GCC 3.4.1 for PowerPC,<br /># with the following options: <b>-O3 -fstrict-aliasing -std=c99</b><br />#<br />move:<br />  stwu   1,-32(1)<br />  stw    31,28(1)<br />  mullw  31,6,7<br />  stw    30,24(1)<br />  cmplwi 7,31,0<br />  mr     30,7<br />  addi   12,3,4<br />  addi   6,5,4<br />  addi   8,4,4<br />  addi   7,5,8<br />  addi   10,3,8<br />  addi   11,4,8<br />  li     9,0<br />  ble-   7,.L47<br />.L51:<br />  slwi   0,9,2<br />  <span style="color: rgb(0, 0, 255);">lfsx   8,3,0       # Load   vx</span><br />  add    9,9,30<br />  <span style="color: rgb(0, 0, 255);">lfsx   7,12,0      # Load   vy</span><br />  cmplw  7,31,9<br />  <span style="color: rgb(0, 0, 255);">lfsx   6,10,0      # Load   vz<br />  lfsx   10,4,0      # Load   px<br />  lfsx   9,8,0       # Load   py<br />  lfsx   5,11,0      # Load   pz<br />  lfsx   4,5,0       # Load   ax<br />  lfsx   3,6,0       # Load   ay<br />  lfsx   2,7,0       # Load   az</span><br />  <span style="color: rgb(255, 0, 255);">fmadds 0,8,1,10    # Update npx<br />  fmadds 13,7,1,9    # Update npy<br />  fmadds 12,6,1,5    # Update npz<br />  fmadds 11,4,1,8    # Update nvx<br />  fmadds 10,3,1,7    # Update nvy<br />  fmadds 9,2,1,6     # Update nvz</span><br />  <span style="color: rgb(255, 0, 0);">stfsx  0,4,0       # Store  npx<br />  stfsx  13,8,0      # Store  npy<br />  stfsx  12,11,0     # Store  npz<br />  stfsx  11,3,0      # Store  nvx<br />  stfsx  10,12,0     # Store  nvy<br />  stfsx  9,10,0      # Store  nvz</span><br />  bgt+   7,.L51<br />.L47:<br />  lwz    30,24(1)<br />  lwz    31,28(1)<br />  addi   1,1,32<br />  blr<br /></pre>The loads are now properly scheduled and moved as far in advance
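as possible. The pattern [Load --&gt; Update --&gt; Store] is usually
the optimal pattern for simple memory transformations on a superscalar
RISC-like architecture, and is exactly what is being emitted. This is
reasonably close to good hand-written assembly for the same code
(without re-defining the problem), and the code is now very suitable for
unrolling. Note that this version computes the new position from the velocity loaded at the top of the iteration (<b>vx</b>) rather than from the updated velocity (<b>nvx</b>) as the earlier versions did. Substitute <b>nvx</b>, <b>nvy</b> and <b>nvz</b> in the position update if the original behavior is required.<br />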

<div class="rule-of-thumb">
Simplify expressions. Do not mix memory access with calculations. Use the [ Load --&gt; Update --&gt; Store ] pattern.
</div>
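As a sketch of why this pattern unrolls well (the function below is hypothetical and works on a single contiguous stripe rather than the strided layout above), a two-way unrolled loop simply groups two sets of loads, then two updates, then two stores. This sketch assumes an even element count to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-way unrolled update of one velocity stripe. */
void update_velocity_x2(float* restrict       velocity_x,
                        const float* restrict acceleration_x,
                        float                 time_step,
                        size_t                count)
{
  assert((count % 2) == 0);  /* assumption of this sketch: even count */

  for (size_t i = 0; i < count; i += 2)
  {
    /* Load */
    const float ax0  = acceleration_x[i];
    const float ax1  = acceleration_x[i + 1];
    const float vx0  = velocity_x[i];
    const float vx1  = velocity_x[i + 1];

    /* Update */
    const float nvx0 = vx0 + (ax0 * time_step);
    const float nvx1 = vx1 + (ax1 * time_step);

    /* Store */
    velocity_x[i]     = nvx0;
    velocity_x[i + 1] = nvx1;
  }
}
```

Because no store separates the two sets of loads, the scheduler is free to issue all four loads before the first multiply-add needs its operands.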
 
<div class="subtitle">Summary</div>
<ul><li>Strict aliasing means that the compiler may assume two pointers to objects of different
(incompatible) types do not refer to the same location in memory. Enable this option in GCC with
the <strong>-fstrict-aliasing</strong> flag. Be sure that <i>all</i> code can safely run with this rule enabled. Enable strict aliasing related warnings with <strong>-Wstrict-aliasing</strong>, but do not expect to be warned in all cases. </li><li>Compare the assembly output of the function with restricted
pointers and file scope arrays to ensure that all of the possible
aliasing information has been used.</li><li>Only use restricted leaf pointers. Use of parent pointers may break the restrict contract.</li><li>Publish as many assumptions as possible about aliasing information in the function declaration.</li><li>Memory windows may be overlapping and still be without aliases. Do not limit the data design to non-overlapping windows.</li><li>Begin using the restrict keyword immediately. Retrofit old code as soon as possible.</li><li>Keep loads and stores separated from calculations. This results in
better scheduling in GCC, and makes the relationship between the output
assembly and the original source clearer.</li></ul>

<div class="subtitle">Additional Reading</div>
<ul><li><a href="http://en.wikipedia.org/wiki/Aliasing_%28computing%29">Aliasing (computing) [wikipedia.org]</a></li><li><a href="http://mail-index.netbsd.org/tech-kern/2003/08/11/0001.html">Aliasing, Krister Walfridsson [netbsd.org]</a></li><li><a href="http://www.intel.com/software/products/compilers/clin/docs/main_cls/mergedprojects/optaps_cls/common/optaps_perf_run.htm">Memory Aliasing on Itanium®-based Systems [intel.com]</a></li><li><a href="http://www.cs.princeton.edu/%7Ejqwu/Memory/survey.html">Survey of Alias Analysis [princeton.edu]</a></li><li><a href="http://realtimecollisiondetection.net/pubs/GDC03_Ericson_Memory_Optimization.ppt">Memory Optimization, Christer Ericson [realtimecollisiondetection.net]</a></li><li><a href="http://www.cs.pitt.edu/%7Emock/papers/clei2004.pdf">Why Programmer-specified Aliasing is a Bad Idea, Markus Mock [pitt.edu]</a></li><li><a href="http://www.hlrs.de/organization/tsc/services/tools/docu/kcc/UserGuide/chapter_4.html">KAI C++ User's Guide, 4.1 Writing Optimizable Code [hlrs.de]</a></li></ul>

            ]]>
    </content>
</entry>

<entry>
    <title>Avoiding Microcoded Instructions On The PPU</title>
    <link rel="alternate" type="text/html" href="http://cellperformance.beyond3d.com/articles/2006/04/avoiding-microcoded-instructions-on-the-ppu.html" />
    <id>tag:cellperformance.beyond3d.com,2006:/articles//3.12</id>

    <published>2006-04-29T06:31:44Z</published>
    <updated>2009-08-05T06:40:19Z</updated>

    <summary>What are microcoded instructions? Microcode is a special instruction set that is (usually) only available to the hardware. On the PPU (PowerPC Unit), small microprograms made up of microcode are stored in ROM and executed in the place of those...</summary>
    <author>
        <name>Mike Acton</name>
        <uri>http://cellperformance.beyond3d.com/mt/mt-cp.cgi?__mode=view&amp;blog_id=3&amp;id=1</uri>
    </author>
    
    
    <content type="html" xml:lang="en-us" xml:base="http://cellperformance.beyond3d.com/articles/">
        <![CDATA[<div class="subtitle">What are microcoded instructions?</div>

Microcode is a special instruction set that is (usually) only available
to the hardware. On the PPU (PowerPC Unit), small microprograms made up
of microcode are stored in ROM and executed in the place of those
PowerPC instructions that were too costly to implement directly in
hardware or do not fit into the pipeline design very well. The size of
a microprogram is measured in microwords. <br />

<br />

The PowerPC instructions for which a microprogram is executed are often called <i>microcoded instructions</i>.<br />

<br />

Microcoded instructions may be <i>conditionally executed</i> or <i>unconditionally executed</i>. Unconditionally executed microcoded instructions <i>always</i>
execute the microprogram. Conditionally executed microcoded
instructions will only execute the microprogram when the values of the
register operands are exceptional in some way. Microcoded instructions
are a special case of normal instructions and conditionally executed
microcoded instructions are a special case of those. ]]>
        <![CDATA[<br />

<div class="subtitle">Why avoid microcoded instructions?</div>

<div class="quote-open">
<div class="quote">The G5 core implements several instructions in
microcode. These instructions cause a pipeline bubble during decode.
The most commonly used microcoded instructions are load and store
multiple -- lmw and stmw. These are often generated by the compiler to
save space when saving and restoring registers on the stack. You can
force GCC to avoid these instructions by specifying -mnomultiple.
Indexed forms and/or algebraic forms of updating load and stores are
also executed as microcode. You can force GCC to avoid these
instructions by specifying -mno-update.</div>
<div class="quote-cite">-- From <a href="http://developer.apple.com/technotes/tn/tn2087.html">G5 Performance Primer [apple.com]</a></div>
</div>
<br />
Like the G5, the PPU contains microcoded instructions. Microcoded
instructions are implemented in order to maintain compatibility with
the PowerPC standard (a processor can only be called a PowerPC
processor if it adheres to <a href="http://www-128.ibm.com/developerworks/eserver/articles/archguide.html?S_TACT=105AGX16&amp;S_CMP=DWPA">the standard [ibm.com]</a>.)
When one of these instructions is decoded, the current pipeline is
flushed and the microcode program is fetched from ROM and executed as
a single atomic unit. The process of flushing the pipeline, fetching
the microcode and executing the program takes quite a long time
compared to other instructions. Additionally, because the instruction
must be executed atomically in order to remain as transparent to the
user as possible, any resources needed by the microcode program must be
locked.<br />

<br />

<div class="quote">
;; micr insns will stall at least 7 cycles to get the first instr from ROM, micro instructions are not dual issued. 
</div>
<div class="quote-cite">
-- From <a href="file:///home/macton/plannednonoperational/www.cellperformance.com/public/attachments/cellpu.md">cellpu.md</a> (<a href="http://www.bsc.es/projects/deepcomputing/linuxoncell/gcctoolchain_cbe.html">CBE Toolchain 2.3 source code [bsc.es]</a>)
</div>

<div class="rule-of-thumb">
The minimum seven (7) cycle stall for microcoded instructions is
derived from the fixed stages of the microcode section of the
instruction pipeline. Microcoded stages are inserted after the last
instruction buffer stage and before the first instruction decode stage.
The actual penalty is determined by the complexity and length of the
instruction. <br />
<br />
For more information on the PPU pipeline stages see: <a href="http://www.research.ibm.com/journal/rd/494/kahle.html">Introduction to the Cell multiprocessor [ibm.com]</a>
</div>

<br />
The details on which instructions are microcoded and the associated
penalties are specific to each PowerPC device and are outlined in the
User's Guide for the individual processor. For example, see the <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/AE818B5D1DBB02EC87256DDE00007821/$file/970FX_user_manual.v1.6.2006FEB09.pdf">IBM PowerPC 970FX RISC Microprocessor User's Guide [ibm.com]</a> paying particular attention to <i>Section 6.3.3 Instruction Decode, Cracking, and Microcode</i>.<br />

<br />
The PPU User's Guide has not been released publicly. So how is a
programmer to know which instructions are microcoded and how to avoid
them?<br />

<br />
Read on to find out.

<div class="sticky-note">
<b>UPDATE: 11 MAY 2006</b><br />
<br />
On May 10, 2006 IBM released the <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>.
Section A.1.3.1 (Unconditionally Microcoded Instructions) has a
detailed list of those instructions which are always microcoded,
including latency information and microword count. Before this document
was released there were no public documents which described in detail,
the penalties for using microcoded instructions. This article has been
updated to reflect those details.<br />
<br />
From the document:<br />
<br />
<div class="quote">
<b>Note:</b> A minimum of 11 cycles are required before the first
instruction is received from the microcode ROM, so microcoded
instructions should be avoided if possible.<br />
<br />
Most microcoded instructions are decoded into two or three simple
PowerPC instructions, and they can be avoided in most cases. The
microcoded instructions are typically decomposed into an integer and a
load or store operation, with a dependency between them. Although most
microcoded PowerPC instructions are decoded into only a few simple
instructions, it is important to keep in mind that there are typically
dependencies between the internal operations of the microcode, which
generate stalls at the issue stage. Replacing the microcoded
instructions with PowerPC instructions not only avoids stalling but
also gives more latitude in scheduling instructions to avoid stalls, as
well as potentially improving multithreaded performance.<br />
</div>
</div>
        <div class="subtitle">Microcoded instruction scheduling</div>
As with many processor-specific details, any good compiler needs to
understand (and take advantage of) the predicted latency and
throughput of each instruction. So a programmer need look
no further than the <a href="http://www.bsc.es/projects/deepcomputing/linuxoncell/gcctoolchain_cbe.html">CBE GCC source code [bsc.es]</a> for a list of microcoded instructions.<br />
<br />
Here's an example entry from the <a href="file:///home/macton/plannednonoperational/www.cellperformance.com/public/attachments/rs6000.md">rs6000.md</a>
file (which is used by the cell-ppu target) that flags the first
instruction in the replacement ("rldicl.") as being microcoded.
<div class="code">
(define_insn ""
  [(set (match_operand:CC 0 "cc_reg_operand" "=x,?y")
	(compare:CC (zero_extend:DI (match_operand:QI 1 "gpc_reg_operand" "r,r"))
		    (const_int 0)))
   (clobber (match_scratch:DI 2 "=r,r"))]
  "TARGET_64BIT"
  "@
   <span style="color: rgb(255, 0, 0);">rldicl. %2,%1,0,56</span>
   #"
  [(set_attr "type" "compare")
  <span style="color: rgb(255, 0, 0);"> (set_attr "microcode" "mc,*")</span>
   (set_attr "length" "4,8")])
</div>
<div class="code-cite">The above snippet is written in <a href="http://gcc.gnu.org/onlinedocs/gccint/RTL.html">RTL [gnu.org]</a>, which stands for <i>Register Transfer Language</i>
and is used to describe the processor specific assembly output in GCC.
Assembly-level transformations, such as peephole optimizations, are
also described in RTL. For an introduction to RTL see: <a href="http://gcc.gnu.org/onlinedocs/gcc-2.95.3/gcc.html">Using and Porting the GNU Compiler Collection (GCC) [gnu.org]</a> and <a href="ftp://ftp.axis.se/pub/users/hp/pgccfd/pgccfd-0.5.pdf">Porting GCC For Dummies [axis.se]</a></div>
<br />
Here is a partial list of the microcoded instructions explicitly flagged in the same file:
<div class="code">
and.     andi.   andil.   andis.
andiu.   doz*.   lhau     lhaux
lm       lmw     lsi      lswi
mr.      mullw.  muls.    neg.
nor.     or.     rldic*.  rlinm.
rlwinm.  s*i.    s*wi.    sf.
sl       sle.    sli.     slw
slwi.    sr      sre.     srw
stm      stmw    stsi     stswi
subf.    subfc.
</div>
<br />
<div class="sticky-note">
For the definitive list of microcoded instructions see the <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>. The corresponding instructions have been added to the sections below.
</div>
<br />
<div class="subtitle">Avoiding microcoded instructions</div>
<div class="quote">
Microcoded instructions, such as load/store multiple, were designed to
save space in compiled code and offer no performance advantage over
using multiple instructions. Because of the way these instructions are
handled inside the processor, they might have a greater latency and
take longer to execute than a sequence of individual instructions that
produces the same results. Some compilers (gcc for example) have
options that prevent generation of these instructions.
</div>
<div class="quote-cite">-- From <a href="http://www-128.ibm.com/developerworks/power/library/pa-nl17-tip.html">PowerPC processor tips: Improve PowerPC 970FX performance [ibm.com]</a></div>
<br />
Fortunately, there is a GCC flag that will warn the programmer if a known microcoded instruction is emitted. Simply add <b>-mwarn-microcode</b> to your compilation flags. <br />

<br />
This flag is defined in <a href="file:///home/macton/plannednonoperational/www.cellperformance.com/public/attachments/rs6000.h">rs6000.h</a>:
<div class="code">
{"warn-microcode", &amp;rs6000_warn_microcode_switch,                     \
  N_("Emitting warning of microcode") },                              \
{"no-warn-microcode", &amp;rs6000_warn_microcode_switch, "" },            \
</div>

It is processed in <a href="file:///home/macton/plannednonoperational/www.cellperformance.com/public/attachments/rs6000.c">rs6000.c</a>:
<div class="code">
/* Handle -m(no-)warn-microcode similarly.  */
if (rs6000_warn_microcode_switch)
  {
    const char *base = rs6000_warn_microcode_switch;
    while (base[-1] != 'm') base--;
                                                                                                      
    if (*rs6000_warn_microcode_switch != '\0')
      error ("invalid option `%s'", base);
    rs6000_warn_microcode = (base[0] != 'n');
  }
</div>

And is used in <a href="file:///home/macton/plannednonoperational/www.cellperformance.com/public/attachments/final.c">final.c</a>:
<div class="code">
#ifdef RS6000_GENERATE_MICROCODE
  /* 0 - notmicrocode, 1 - conditional microcode, 2 - microcode */
  if (rs6000_warn_microcode)
    {
      if (get_attr_microcode(insn) == 2)
        pedwarn ("emitting microcode insn %s\t[%s] #%d",template,
                 insn_data[INSN_CODE(insn)].name,INSN_UID(insn));
      else if (get_attr_microcode(insn) == 1)
        pedwarn ("emitting conditional microcode insn %s\t[%s] #%d",template,
                 insn_data[INSN_CODE(insn)].name,INSN_UID(insn));
    }
#endif
</div>

The other PowerPC specific compilation flags can also be found in these files.

<div class="rule-of-thumb">The compiler source is the best source for
information on compiler flags and processor specific options. Some
flags do not make it into the help output. </div>

Note that <b>-mwarn-microcode</b> is not in the gcc help list of flags.<br />
<br />
How does this affect code in practice? From the above list, there are
three main classes of microcoded instructions to watch out for.
<div class="subtitle">Avoid multiple load/store instructions</div>These
instructions are handy to load or store a small contiguous area of
memory. However, it will always be faster to simply load each
individual value into a register. GCC will not emit these instructions
if the <b>-mno-multiple</b> flag is passed to the compiler.<br />
<br />
List of microcoded load and store instructions, including load/store multiple. From: <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>
<div class="code">
|--------------------------------------------------------------------------------------------------------------------|
| Unconditionally Microcoded Loads and Stores                                                                        |
|--------------------------------------------------------------------------------------------------------------------|
|                                                                                                                    |
|   A microcode load or store operation can access an 8-bit byte or a 32-bit word, indicated as "by byte" or         |
|   "by word" respectively.                                                                                          |
|                                                                                                                    |
|--------------------------------------------------------------------------------------------------------------------|
| INSTRUCTION         | CLASS           | LATENCY  | MICROWORD SIZE              | COMMENT                           |
|---------------------|-----------------|----------|-----------------------------|-----------------------------------|
| lha                 | load algebraic  | 11       | 7                           | Handled by byte.                  |
| lhau                | load algebraic  | 11       | 8                           | Handled by byte.                  |
| lhaux               | load algebraic  | 11       | 8                           | Handled by byte.                  |
| lhax                | load algebraic  | 11       | 8                           | Handled by byte.                  |
| lmw                 | load multiple   | 11       | (2 + 1 × words)             | This instruction is broken down   |
|                     |                 |          |                             | into a series of load words.      |
| lswi                | load string /   | 10       | By word:                    | Optimized instruction[1]          |
|                     | optimized       |          | (1 × words + 2 × bytes)     |                                   |
|                     |                 |          | By byte:                    |                                   |
|                     |                 |          | (2 × bytes)                 |                                   |
| lswx                | load string /   | By word: | By word:                    | Optimized instruction[1]          |
|                     | optimized       | 10       | 4 + (1 × words + 2 × bytes) |                                   |
|                     |                 | By byte: | By byte:                    |                                   |
|                     |                 | 7        | 4 + (2 × bytes)             |                                   |
|                     |                 |          |                             |                                   |         
|                     |                 |          |                             |                                   | 
| lwa                 | load algebraic  | 11       | 13                          | Handled by byte.                  |
| lwaux               | load algebraic  | 11       | 12                          | Handled by byte.                  |
| lwax                | load algebraic  | 11       | 12                          | Handled by byte.                  |
| stmw                | store multiple  | 11       | (2 + 1 × words)             | Broken into a series of store     |
|                     |                 |          |                             | words.                            |
| stswi               | store string /  | 10       | By word:                    | Optimized instruction[1]          |
|                     | optimized       |          | (1 × words + 2 × bytes)     |                                   |
|                     |                 |          | By byte:                    |                                   |
|                     |                 |          | (2 × bytes)                 |                                   |
| stswx               | store string /  | 7        | By word:                    | Optimized instruction[1]          |
|                     | optimized       |          | 4 + (1 × words + 2 × bytes) |                                   |
|                     |                 |          | By byte:                    |                                   |
|                     |                 |          | 4 + (2 × bytes)             |                                   |
|                     |                 |          |                             |                                   |         
|--------------------------------------------------------------------------------------------------------------------|

    [1] The instruction is first broken down into a series of load-word instructions 
        (odd bytes are handled by byte). If this does not cause an alignment exception,
        then the instruction is complete. If an alignment exception occurs, the first
        attempt is flushed. When the instruction is returned to microcode it is then 
        handled a byte at a time. Odd bytes, if any, are defined as the remainder of
        string_count / 4. For store instructions, it is a series of store words.
</div>

<div class="subtitle">Avoid Condition Register recording integer instructions</div>
Many of the integer instructions are microcoded when the record (Rc)
bit is set (denoted by a "dot" at the end of the instruction
mnemonic). With this bit set, fixed-point instructions will
automatically set the first field (field zero) in the Condition
Register with the value's compare-with-zero result. For example, if the
result of the "or." instruction is greater than zero, the GT bit will
be set in CR[0].<br /> 
<br />
In general, this makes branching on integer expressions more expensive, so an effort should be made to eliminate such branches.<br /><br />
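As a hedged illustration of removing a branch on an integer expression, the sketch below selects between two values with mask arithmetic instead of an if statement. The names mask_nonzero_u32 and select_u32 are hypothetical and do not come from the handbook or the GCC source. Whether a given compiler actually avoids record-form instructions for this code is an assumption that should be checked with <b>-mwarn-microcode</b>.<br />

```c
#include <stdint.h>

/* Hypothetical helper: all ones when cond != 0, zero when cond == 0. */
static inline uint32_t mask_nonzero_u32(uint32_t cond)
{
    /* Either cond or its two's-complement negation has the top bit set
       exactly when cond is nonzero. */
    uint32_t top = (cond | ((uint32_t)0 - cond)) >> 31;   /* 0 or 1 */
    return (uint32_t)0 - top;                             /* 0 or 0xFFFFFFFF */
}

/* Branch-free equivalent of: cond ? a : b */
static inline uint32_t select_u32(uint32_t cond, uint32_t a, uint32_t b)
{
    uint32_t mask = mask_nonzero_u32(cond);
    return (a & mask) | (b & ~mask);
}
```

The only shift in this sketch uses a constant amount, so it does not depend on the indirect shift forms.<br />
<br />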
List of CR recording microcoded instructions. From: <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>
<div class="code">
|-------------------------------------------------|
| Unconditionally Microcoded Instructions         |
| (CR recording)                                  |
|                                                 |
|     Record instructions are all handled the     |
|     same way. The "root" instruction is issued  |
|     followed by the cmpi_x instruction.         |
|                                                 |
|    The nonrecord form used in the microcode     |
|    sequence is only available to microcode.     |
|                                                 |
|-------------------------------------------------|
| INSTRUCTION         | LATENCY | MICROWORD SIZE  |
|---------------------|---------|-----------------|
| and.                | 11      | 2               |
| andc.               | 11      | 2               |
| andi.               | 11      | 2               |
| andis.              | 11      | 2               |
| nand.               | 11      | 2               |
| nor.                | 11      | 2               |
| nego.               | 11      | 2               |
| or.                 | 11      | 2               |
| orc.                | 11      | 2               |
| xor.                | 11      | 2               |
| cntlzd.             | 11      | 2               |
| cntlzw.             | 11      | 2               |
| divd.               | 11      | 2               |
| divdo.              | 11      | 2               |
| divdu.              | 11      | 2               |
| divduo.             | 11      | 2               |
| divw.               | 11      | 2               |
| divwo.              | 11      | 2               |
| divwu.              | 11      | 2               |
| divwuo.             | 11      | 2               |
| eqv.                | 11      | 2               |
| extsb.              | 11      | 2               |
| extsh.              | 11      | 2               |
| extsw.              | 11      | 2               |
| mulhd.              | 11      | 2               |
| mulhdu.             | 11      | 2               |
| mulhw.              | 11      | 2               |
| mulhwu.             | 11      | 2               |
| mulld.              | 11      | 2               |
| mulldo.             | 11      | 2               |
| mullw.              | 11      | 2               |
| mullwo.             | 11      | 2               |
| rldcl.              | 11      | 5               |
| rldcr.              | 11      | 5               |
| rldic.              | 11      | 2               |
| rldicl.             | 11      | 2               |
| rldicr.             | 11      | 2               |
| rldimi.             | 11      | 2               |
| rlwimi.             | 11      | 2               |
| rlwinm.             | 11      | 2               |
| rlwnm.              | 11      | 5               |
| sld.                | 11      | 5               |
| slw.                | 11      | 5               |
| srad.               | 11      | 5               |
| sradi.              | 11      | 2               |
| sraw.               | 11      | 5               |
| srawi.              | 11      | 2               |
| srd.                | 11      | 5               |
| srw.                | 11      | 5               |
|---------------------|---------|-----------------|
</div>
<br />
<div class="subtitle">Avoid indirect shift and rotate instructions</div>
This is the simplest case to find, but the hardest to eliminate:
<div class="code">
int64_t right_shift_64( int64_t a, int64_t sa )
{
  return ( a &gt;&gt; sa );
}
</div>
This code compiles to the following deceptively simple assembly:
<div class="code">
.right_shift_64:
	srad 3,3,4
	blr
</div>
And the following warning (if <b>-mwarn-microcode</b> is enabled):
<div class="code">
test.c: In function `right_shift_64':
test.c:7: warning: emitting microcode insn srad%I2 %0,%1,%H2 [*ashrdi3_internal1] #20
</div>
<br />
The best option for eliminating indirect shift instructions is to know
the range of possible shift amounts and create an alternate branch-free
expression that selects between those choices.<br />
<br />
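A minimal sketch of that idea, assuming the shift amount is known to be 0, 1, 2 or 3: compute every candidate with a constant shift, then pick one with masks. The range and the names mask_eq_u64 and right_shift_64_range4 are illustrative assumptions, not part of the original article. The code also relies on GCC's arithmetic right shift for negative signed values, which the C standard leaves implementation-defined.<br />

```c
#include <stdint.h>

/* Hypothetical helper: all ones when x == y, zero otherwise, no branch. */
static inline uint64_t mask_eq_u64(uint64_t x, uint64_t y)
{
    uint64_t d  = x ^ y;                          /* zero only when x == y */
    uint64_t ne = (d | ((uint64_t)0 - d)) >> 63;  /* 1 when x != y, else 0 */
    return ne - 1;                                /* all ones when x == y  */
}

/* Arithmetic right shift for sa assumed to be in 0..3 (illustrative range). */
static inline int64_t right_shift_64_range4(int64_t a, uint64_t sa)
{
    /* Every candidate uses a constant shift amount. Right-shifting a
       negative signed value is implementation-defined in C; GCC shifts
       arithmetically. */
    uint64_t r0 = (uint64_t)a;
    uint64_t r1 = (uint64_t)(a >> 1);
    uint64_t r2 = (uint64_t)(a >> 2);
    uint64_t r3 = (uint64_t)(a >> 3);

    /* Exactly one mask is all ones when sa is inside the assumed range. */
    uint64_t r = (r0 & mask_eq_u64(sa, 0))
               | (r1 & mask_eq_u64(sa, 1))
               | (r2 & mask_eq_u64(sa, 2))
               | (r3 & mask_eq_u64(sa, 3));
    return (int64_t)r;
}
```

This trades one microcoded instruction for several simple ones, so it is only likely to pay off when the range of shift amounts is small. A shift amount outside the assumed range returns zero here.<br />
<br />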
List of indirect shift and rotate microcoded instructions. From: <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>
<div class="code">
|--------------------------------------------------|
| Unconditionally Microcoded Instructions          |
| (Shift and Rotate)                               |
|                                                  |
|     All indirect shift and rotate instructions   |
|     are handled using the same technique. First  |
|     the mt_shr is issued, followed by two noops  |
|     for delay, followed by the root instruction  |
|     (that is, rldcl_sh).                         |
|                                                  |
|--------------------------------------------------|
| INSTRUCTION         | LATENCY | MICROWORD SIZE   |
|---------------------|---------|------------------|
| rldcl               | 11      | 4                |
| rldcr               | 11      | 4                |
| rlwnm               | 11      | 4                |
| sld                 | 11      | 4                |
| slw                 | 11      | 4                |
| srad                | 11      | 4                |
| sraw                | 11      | 4                |
| srd                 | 11      | 4                |
| srw                 | 11      | 4                |
|--------------------------------------------------|
</div>
<br />
<div class="subtitle">Non-Pipelined, Complex Instructions</div>
In addition to microcoded instructions there is another class of low
performance instructions worth mentioning: the complex pipeline
instructions. These instructions are not microcoded (i.e. the
resources already local to the execution pipeline can be used
directly), however they are complex enough that special handling is
required. In order for these instructions to be executed the
instruction pipeline must be evacuated (i.e. flushed). Therefore the
throughput of these instructions will be equal to the latency, which
means they will be slow.<br />
<br />
List of the non-pipelined instructions. From: <a href="http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F">Cell Broadband Engine Programming Handbook [ibm.com]</a>
<div class="code">
|----------------------------------------------------|
| Non-Microcoded, Non-Pipelined Integer Instructions |
|---------|----------|-------------------------------|
| Instr.  | Pipeline | Latency (cycles)              |
|---------|----------|-------------------------------|
| mulli   | FXU      | 6                             |
| mullw   | FXU      | 9                             |
| mulhw   | FXU      | 9                             |
| mulhwu  | FXU      | 9                             |
| mullwo  | FXU      | 9                             |
| mulld   | FXU      | 15                            |
| mulhd   | FXU      | 15                            |
| mulhdu  | FXU      | 15                            |
| mulldo  | FXU      | 15                            |
| divd    | FXU      | 10-70                         |
| divdu   | FXU      | 10-70                         |
| divdo   | FXU      | 10-70                         |
| divduo  | FXU      | 10-70                         |
| divw    | FXU      | 10-38                         |
| divwu   | FXU      | 10-38                         |
| divwo   | FXU      | 10-38                         |
| divwuo  | FXU      | 10-38                         |
|---------|----------|-------------------------------|

Note on divide instructions:

    The fixed-point divide is a variable latency operation
    that calculates RA ÷ RB for word or doubleword and
    signed or unsigned fixed-point (integer) operands.

    Division is defined by the following equation:
        dividend = (quotient x divisor) + r
        where: 0 &lt;= r &lt; |divisor|, when dividend &gt;= 0
        and -|divisor| &lt; r &lt;= 0, when dividend &lt; 0

        Overflow is set when an attempt is made to compute
        either the least negative integer divided by negative
        one or any integer divided by zero.

    The performance is determined by the number of
    bits required to represent the result.

    PPU cycles equal:
        ((1 setup) 
        + (ceil ((rb leading digits - ra leading digits)/2)
        + 1 iterations) 
        + (1 fixup)) × 2

    word minimum       = 10,
    maximum            = 38 cycles
    doubleword minimum = 10,
    maximum            = 70 cycles

    Overflow cases will complete in 10 cycles
</div>
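To make the cycle formula above concrete, here is a rough sketch for the unsigned word divide. It assumes that "leading digits" means the count of leading zero bits in each operand, and it clamps the result to the stated 10 to 38 cycle range. Both are interpretations rather than statements from the handbook, and the function names are hypothetical.<br />

```c
#include <stdint.h>

/* Number of leading zero bits in a 32-bit value (32 for zero). */
static inline int leading_zeros_u32(uint32_t x)
{
    int n = 0;
    while (n < 32 && (x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Estimated cycles for an unsigned word divide ra / rb.
   Assumption: "leading digits" in the formula means leading zero bits. */
static inline int divwu_cycles_estimate(uint32_t ra, uint32_t rb)
{
    if (rb == 0)
        return 10;                      /* overflow cases complete in 10 cycles */

    int diff = leading_zeros_u32(rb) - leading_zeros_u32(ra);
    if (diff < 0)
        diff = 0;                       /* assumption: no negative iteration count */

    int iterations = (diff + 1) / 2 + 1;        /* ceil(diff / 2) + 1 */
    int cycles = (1 + iterations + 1) * 2;      /* (setup + iterations + fixup) x 2 */

    if (cycles < 10) cycles = 10;       /* assumption: clamp to the stated minimum */
    if (cycles > 38) cycles = 38;       /* stated word maximum */
    return cycles;
}
```

Under these assumptions the largest dividend over the smallest non-zero divisor reaches the 38 cycle maximum, while operands of similar magnitude stay at the 10 cycle minimum.<br />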
<br />
<div class="rule-of-thumb">
There is no method to detect complex instructions emitted by the GCC compiler. Avoid integer multiplies and divides.<br />
<br />
Good luck with that! ;)
</div>
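<br />
When one operand is a known constant, a multiply or divide can sometimes be rewritten with constant shifts and adds. The sketch below shows the idea for unsigned values. The function names are hypothetical, and an optimizing compiler may already perform these rewrites itself, so treat this as an illustration rather than a guaranteed win.<br />

```c
#include <stdint.h>

/* x * 10 computed as (x * 8) + (x * 2): two constant shifts and an add. */
static inline uint32_t mul10_u32(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* x / 16 for unsigned x: a constant right shift replaces the divide. */
static inline uint32_t div16_u32(uint32_t x)
{
    return x >> 4;
}
```

The shift form of the divide is only valid as written for unsigned values. Signed division by a power of two rounds toward zero and needs an extra correction for negative inputs.<br />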

<br />
<div class="subtitle">Summary</div>
<ul><li>Keep an eye out for microcoded instructions: <i>use <b>-mwarn-microcode</b> in GCC.</i></li><li>Don't use multiple load/store instructions: <i>use <b>-mno-multiple</b> in GCC.</i></li><li>Avoid CR recording integer instructions </li><li>Avoid indirect shift and rotate instructions </li><li>Avoid integer multiply and divide instructions </li></ul> 
<div class="rule-of-thumb">
Eliminating microcoded and other non-pipelined instructions is
sometimes difficult and not always desirable (for example, when code
size is the determining factor in performance). However, it is
important to know the penalty and make an informed choice. And as
always, if you are optimizing at this level, be sure to double-check
your results with a real profile on real hardware.
</div>
]]>
    </content>
</entry>

</feed>
