Demystifying The Restrict Keyword

UPDATED! More examples! More detailed explainations!
Contract
The restrict keyword can be considered an extension to the strict aliasing rule. It allows the programmer to declare that pointers which share the same type (or were otherwise validly created) do not alias eachother. By using restrict the programmer can declare that any loads and stores through the qualified pointer (or through another pointer copied either directly or indirectly from the restricted pointer) are the only loads and stores to the same address during the lifetime of the pointer. In other words, the pointer is not aliased by any pointers other than its own copies.

Restrict is a "no data hazards will be generated" contract between the programmer and the compiler. The compiler relies on this information to make optimizations. If the data is, in fact, aliased, the results are undefined and a programmer should not expect the compiler to output a warning. The compiler assumes the programmer is not lying.


THE RESTRICT CONTRACT
I, [insert your name], a PROFESSIONAL or AMATEUR [circle one] programmer recognize that there are limits to what a compiler can do. I certify that, to the best of my knowledge, there are no magic elves or monkeys in the compiler which through the forces of fairy dust can always make code faster. I understand that there are some problems for which there is not enough information to solve. I hereby declare that given the opportunity to provide the compiler with sufficient information, perhaps through some key word, I will gladly use said keyword and not bitch and moan about how "the compiler should be doing this for me."

In this case, I promise that the pointer declared along with the restrict qualifier is not aliased. I certify that writes through this pointer will not effect the values read through any other pointer available in the same context which is also declared as restricted.

* Your agreement to this contract is implied by use of the restrict keyword ;)


Read on for more information on the practical use and benefits to using the restrict keyword...
Restrict is a type qualifier
A new feature of C99: The restrict type qualifier allows programs to be written so that translators can produce significantly faster executables. [...] Anyone for whom this is not a concern can safely ignore this feature of the language.

The restrict keyword is a type qualifier for pointers and is a formal part of the C99 standard.

Example usage:
int* restrict foo;
Notice that the restrict keyword qualifies the pointer and not the object being pointed to.
Not all compilers are compliant with the C99 standard. For example Microsoft's compiler, does not support the C99 standard at all. If you are using MSVC on a x86 platform you will not have access to this critical optimization option.
When using GCC, remember to enable the C99 standard by adding -std=c99 to your compilation flags. In code that cannot be compiled with C99, use either __restrict or __restrict__ to enable the keyword as a GCC extension.
The restrict keyword was not included as part of the C++98 standard. However some C++ compilers may support it as an extension. It's important that when restrict is used in C++ to remember that the implicit this pointer should also be restricted. Consult your compiler's manual for how to do this, if possible.
An understanding the strict aliasing rule will provide good context for problems related to the restrict keyword.
Why was restrict introduced into C99?
The problem that the restrict qualifier addresses is that potential aliasing can inhibit optimizations. Specifically, if a translator cannot determine that two different pointers are being used to reference different objects, then it cannot apply optimizations such as maintaining the values of the objects in registers rather than in memory, or reordering loads and stores of these values. This problem can have a significant effect on a program that, for example, performs arithmetic calculations on large arrays of numbers. The effect can be measured by comparing a program that uses pointers with a similar program that uses file scope arrays (or with a similar Fortran program). The array version can run faster by a factor of ten or more on a system with vector processors. Where such large performance gains are possible, implementations have of course offered their own solutions, usually in the form of compiler directives that specify particular optimizations. Differences in the spelling, scope, and precise meaning of these directives have made them troublesome to use in a program that must run on many different systems. This was the motivation for a standard solution.

In other words, proper use of the restrict keyword gives the compiler enough information to select a more optimal order of loads and stores to/from memory and to potentially make better use of registers to store non-aliased objects.
Non-aliased Memory Windows
Given the following structure, there is a significant difference in performance in even the smallest update loops.
typedef struct vector3  vector3;

struct vector3
{
float x;
float y;
float z;
};
What follows is a simple example function that updates some "particles" with unrestricted pointers. Note that the pointers share the same type, so the compiler will assume they can be aliased, per the strict aliasing rule.
The example code sections in the article are not meant to serve as examples of real production code, but rather as examples of real patterns often found in production code.
void
move( vector3* velocity,
vector3* position,
vector3* acceleration,
float time_step,
size_t count )
{
for (size_t i=0;i<count;i++)
{
velocity[i].x += acceleration[i].x * time_step;
velocity[i].y += acceleration[i].y * time_step;
velocity[i].z += acceleration[i].z * time_step;
position[i].x += velocity[i].x * time_step;
position[i].y += velocity[i].y * time_step;
position[i].z += velocity[i].z * time_step;
}
}

This article will examine the assembly output generated for the PowerPC. However, the principles and suggestions presented are applicable to many common architectures.
# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
cmpwi 0,6,0
stwu 1,-16(1)
beq- 0,.L7
li 8,0
mtctr 6
.L8:
add 9,8,3
lfsx 13,8,5
add 10,8,5
lfsx 0,8,3
lfs 8,4(9)
add 11,8,4
lfs 5,8(10)
lfs 7,4(10)
lfs 6,8(9)
fmadds 4,13,1,0
fmadds 3,7,1,8
fmadds 2,5,1,6
stfsx 4,8,3 # Store velocity_x
stfs 3,4(9) # Store velocity_y
stfs 2,8(9) # Store velocity_z

lfsx 11,8,4 # Load position_x
lfs 10,4(11) # Load position_y
lfs 9,8(11) # Load position_z

fmadds 12,4,1,11
fmadds 0,3,1,10
fmadds 13,2,1,9
stfsx 12,8,4
addi 8,8,12
stfs 0,4(11)
stfs 13,8(11)
bdnz .L8
.L7:
addi 1,1,16
blr
Notice above that position must wait for velocity to be stored. This is because the compiler cannot gaurantee that the two are not aliased and must assume that the write to velocity can overwrite the location where position will be read. Because the compiler must effectively perform the operations in the order declared in the source, it must assume this is the behavior the programmer intended.
The use of unrestricted pointers inhibits the compiler's ability to schedule loads and may cause redundant loads in many cases. With few exceptions, accessing any value through a pointer will force the compiler to load, or reload, the value after any store. This is because the compiler cannot gaurantee that the value being loaded was not aliased by the value that was stored.
For instance, there is no reason (other than sanity) why the programmer could not call the function in this way:
void 
call_move( vector3* some_data, float time_step, count )
{
move( some_data, some_data, some_data, time_step, count );
}
The use of restricted pointers would specifically disallow this.

Compare this to the same function working with arrays of file scope. Working with file scope arrays represents the best case for the compiler with regard to alias analysis and should be used as the baseline for implementing functions with restricted pointers.
vector3 velocity     [ PARTICLE_COUNT ];
vector3 position [ PARTICLE_COUNT ];
vector3 acceleration [ PARTICLE_COUNT ];
 
void
move( float time_step )
{
for (size_t i=0;i<PARTICLE_COUNT;i++)
{
velocity[i].x += acceleration[i].x * time_step;
velocity[i].y += acceleration[i].y * time_step;
velocity[i].z += acceleration[i].z * time_step;
position[i].x += velocity[i].x * time_step;
position[i].y += velocity[i].y * time_step;
position[i].z += velocity[i].z * time_step;
}
}
With the above code the compiler knows the arrays will be stored seperately and can determine that they are three independent data windows, or stripes and there can be no aliasing among them. A data stripe can be thought of as a data channel made up of indexable elements.

Data Channel Channel Elements (by Index)
velocity [0] ---> [1] ---> [2] ---> [N]
position [0] ---> [1] ---> [2] ---> [N]
acceleration [0] ---> [1] ---> [2] ---> [N]
An element in a restricted data stripe can be a function of one or more elements of any other restricted data stripes, but cannot be a function of a change in an element of a data stripe.
# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
lis 3,velocity@ha
lis 11,acceleration@ha
lis 9,position@ha
la 6,velocity@l(3)
la 5,acceleration@l(11)
la 7,position@l(9)
li 8,0
stwu 1,-16(1)
li 0,8192
mtctr 0
.L18:
add 12,8,6
lfsx 12,8,6 # Load velocity + 0
add 10,8,5
lfsx 13,8,5 # Load acceleration + 0
lfs 8,4(12) # Load velocity + 4

add 4,8,7
lfs 5,8(10) # Load acceleration + 8
lfs 6,8(12) # Load velocity + 8
lfs 7,4(10) # Load acceleration + 4

fmadds 9,13,1,12
fmadds 10,7,1,8
fmadds 11,5,1,6
lfsx 4,8,7 # Load position + 0
lfs 3,4(4) # Load position + 4
lfs 2,8(4) # Load position + 8

fmadds 0,9,1,4
fmadds 13,10,1,3
fmadds 12,11,1,2
stfsx 9,8,6 # Store velocity + 0
stfs 11,8(12) # Store velocity + 8
stfs 10,4(12) # Store velocity + 4
stfsx 0,8,7 # Store position + 0

addi 8,8,12
stfs 13,4(4) # Store position + 4
stfs 12,8(4) # Store position + 8

bdnz .L18
addi 1,1,16
blr
All the stores are completed at the end of the loop. More specifically, the load for position is scheduled before the store of velocity. This validates that the compiler has enough information to determine that the values stored do not alias the values loaded.

In order to get this same behavior with non-file scope pointers, use the restrict keyword to declare that every location which is either loaded or stored has no aliases.
void
move( vector3* velocity,
vector3* position,
vector3* acceleration,
float time_step,
size_t count,
size_t stride )
{
float* restrict acceleration_x = &acceleration->x;
float* restrict velocity_x = &velocity->x;
float* restrict position_x = &position->x;
float* restrict acceleration_y = &acceleration->y;
float* restrict velocity_y = &velocity->y;
float* restrict position_y = &position->y;
float* restrict acceleration_z = &acceleration->z;
float* restrict velocity_z = &velocity->z;
float* restrict position_z = &position->z;

for (size_t i=0;i<count*stride;i+=stride)
{
velocity_x[i] += acceleration_x[i] * time_step;
velocity_y[i] += acceleration_y[i] * time_step;
velocity_z[i] += acceleration_z[i] * time_step;
position_x[i] += velocity_x[i] * time_step;
position_y[i] += velocity_y[i] * time_step;
position_z[i] += velocity_z[i] * time_step;
}
}
Nine (9) non-aliased memory stipes were declared in the above code. This completely defines the aliasing relationships between all the loads and stores.

Data Channel Channel Elements (by Index)
velocity_x [0] ---> [1] ---> [2] ---> [N]
velocity_y [0] ---> [1] ---> [2] ---> [N]
velocity_z [0] ---> [1] ---> [2] ---> [N]
position_x [0] ---> [1] ---> [2] ---> [N]
position_y [0] ---> [1] ---> [2] ---> [N]
position_z [0] ---> [1] ---> [2] ---> [N]
acceleration_x [0] ---> [1] ---> [2] ---> [N]
acceleration_y [0] ---> [1] ---> [2] ---> [N]
acceleration_z [0] ---> [1] ---> [2] ---> [N]

By copying addresses from from pointer to another, an implicit hierarchy (or tree) of pointers is created. The child pointers are usually completely aliased by the parent pointer and it's important not to use them both at the same time (i.e. in the same scope). When restricted child pointers are created, consider the parent pointer to be out of scope and do not make an accesses through it. Note that in this case, any use of velocity, position or acceleration would invalidate the restrict contract and the results would be undefined.
                |---> velocity_x
velocity -------|---> velocity_y
|---> velocity_z

|---> position_x
position -------|---> position_y
|---> position_z

|---> acceleration_x
acceleration ---|---> acceleration_y
|---> acceleration_z
Typically, only the leaf nodes in a hierarchy of restricted pointers should be used.
This code was compiled with GCC 3.4.1 for PowerPC with the following options: -O3 -fstrict-aliasing -std=c99
# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
stwu 1,-32(1)
stw 31,28(1)
mullw 31,6,7
stw 30,24(1)
cmplwi 7,31,0
mr 30,7
addi 12,3,4
addi 6,5,4
addi 8,4,4
addi 7,5,8
addi 10,3,8
addi 11,4,8
li 9,0
ble- 7,.L27
.L31:
slwi 0,9,2
lfsx 13,3,0 # Load velocity_x
add 9,9,30
lfsx 8,12,0 # Load velocity_y
cmplw 7,31,9
lfsx 6,10,0 # Load velocity_z
lfsx 12,5,0 # Load acceleration_x
lfsx 7,6,0 # Load acceleration_y
lfsx 5,7,0 # Load acceleration_z

fmadds 11,12,1,13
fmadds 10,7,1,8
fmadds 9,5,1,6
lfsx 4,4,0 # Load position_x
lfsx 3,8,0 # Load position_y
lfsx 2,11,0
# Load position_z
fmadds 0,11,1,4
fmadds 13,10,1,3
fmadds 12,9,1,2
stfsx 11,3,0 # Store velocity_x
stfsx 10,12,0 # Store velocity_y
stfsx 9,10,0 # Store velocity_z
stfsx 0,4,0 # Store position_x
stfsx 13,8,0 # Store position_y
stfsx 12,11,0
# Store position_z
bgt+ 7,.L31
.L27:
lwz 30,24(1)
lwz 31,28(1)
addi 1,1,32
blr
This version has all the flexibility of the first (unrestricted) version and the performance of the second (file scope arrays) version. You should expect code where all aliasing information is declared with the restrict keyword to almost always perform significantly better, and never worse, than with unrestricted pointers. This is especially true on superscalar RISC, or RISC-like architectures with large register files, like the PowerPC or MIPS R4000.

The asute reader may also have noticed that because nine (9) restricted stripes were used instead of three (3) file scope arrays, the compiler has been able to select a much simplier addressing scheme. Much of the pointer arithmetic has been hoisted out of the loop. The version with the restricted pointers is actually more efficient than the one with file scope arrays.
Non-aliased Memory Access Patterns
An important distinction to make is that the restrict keyword is not restricting anything. It is in fact allowing the compiler to do more than it could previously. It should also be noted that the type of the pointer that is qualified with restrict is not important, it is only important what location and size was used when loading or storing from the pointer. The restrict keyword does not declare that the object being pointed to is completely without aliases, only that the addresses that are loaded and stored from are unaliased.

For example, the following setup would be a completely valid use of restricted pointers:
struct particle
{
vector3 position;
vector3 velocity;
vector3 acceleration;
};
 
[ ... ]
 
void
call_move( particle* particles, float time_step, count )
{
move( &particles->position,
&particles->velocity,
&particles->acceleration,
time_step,
count,
sizeof(particle) );
}
Although each stripe of data is part of the same "object", none of the accesses would be aliased. Some runtime systems try to determine whether or not pointers are aliased by simply checking to see if the memory windows overlap. That is not sufficient.
Memory windows can overlap and still be non-aliased.
Usage and Suggestions
Use of the restrict keyword should be very common. It should be used as a standard part of all new code. Older code should be revisited as possible to take advantage of the new optimization opportunities. It is somewhat difficult to refactor restricted requirements into pre-existing code as a certain amount of alias analysis must be done by the programmer. However, for the majority of live code in typical applications, memory access is not aliased (nor are memory windows overlapping) and aliasing hazards will be limited to a small fraction of the code base.
Before modifying code to use the restrict keyword, ensure that all code can compile safely with strict aliasing enabled.
Programmers using functions that make assumptions about aliasing must know what those assumptions are. Certainly, if at all possible, memory usage patterns should be documented. However, at the very least, aliasing assumptions in the parameters passed to the functions should be declared. In the above examples, the parameters velocity, position and acceleration must not be aliased and the restrict contract should be made public by also declaring those parameters restricted.
void 
move( vector3* restrict velocity,
vector3* restrict position,
vector3* restrict acceleration,
float time_step,
size_t count,
size_t stride );
Not publishing aliasing assumptions will lead to very difficult to find bugs. Programmers will not know that the data must be independent and someone, someday will find a reason to use the same array in two or more pointers.

Take for example memcpy, which has been officially changed to have the following declaration:
void* 
memcpy(void* restrict s1,
const void* restrict s2,
size_t n );
Can you guess why?
Use restrict in function prototypes and in structure definitions to publish the assumptions made about aliasing.
Restricted pointers can be copied from one to another to create a hierarchy of pointers. However there is one limitation defined in the C99 standard. The child pointer must not be in the same block-level scope as the parent pointer. The result of copying restricted pointers in the same block-level scope is undefined.
{
vector3* restrict position = &obj_a->position;
float* restrict position_x = &position->x; <-- UNDEFINED
{
float* restrict position_y = &position->y; <-- VALID
}
}
Restricted child pointers must be in a different block-level scope than the parent pointer.

There is one additional problem in the assembly output above which is somewhat particular to the GCC scheduler. Notice that the load for position happens immediately before its update and store. The first multiply-add will stall waiting the first load to be completed before executing. The first float (position_x) will not be ready in three (3) cycles. It would be considerably better (and faster) if the load could be pushed closer to the top of the loop so that it is more likely to be completed by the time it is needed.
  lfsx   4,4,0      # Load   position_x
lfsx 3,8,0 # Load position_y
lfsx 2,11,0 # Load position_z

fmadds 0,11,1,4 # Update position_y
fmadds 13,10,1,3 # Update position_x
fmadds 12,9,1,2 # Update position_z

Due to the order in which scheduling is done in GCC, it is always better to simplify expressions. Do not mix memory access with calculations. The code can be re-written as follows:
void
move( vector3* restrict velocity,
vector3* restrict position,
vector3* restrict acceleration,
float time_step,
size_t count,
size_t stride )
{
float* restrict acceleration_x = &acceleration->x;
float* restrict velocity_x = &velocity->x;
float* restrict position_x = &position->x;
float* restrict acceleration_y = &acceleration->y;
float* restrict velocity_y = &velocity->y;
float* restrict position_y = &position->y;
float* restrict acceleration_z = &acceleration->z;
float* restrict velocity_z = &velocity->z;
float* restrict position_z = &position->z;

for (size_t i=0;i<count*stride;i+=stride)
{
const float ax = acceleration_x[i];
const float ay = acceleration_y[i];
const float az = acceleration_z[i];
const float vx = velocity_x[i];
const float vy = velocity_y[i];
const float vz = velocity_z[i];
const float px = position_x[i];
const float py = position_y[i];
const float pz = position_z[i];

const float nvx = vx + ( ax * time_step );
const float nvy = vy + ( ay * time_step );
const float nvz = vz + ( az * time_step );
const float npx = px + ( vx * time_step );
const float npy = py + ( vy * time_step );
const float npz = pz + ( vz * time_step );

velocity_x[i] = nvx;
velocity_y[i] = nvy;
velocity_z[i] = nvz;
position_x[i] = npx;
position_y[i] = npy;
position_z[i] = npz;
}
}
# This code was compiled with GCC 3.4.1 for PowerPC,
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
stwu 1,-32(1)
stw 31,28(1)
mullw 31,6,7
stw 30,24(1)
cmplwi 7,31,0
mr 30,7
addi 12,3,4
addi 6,5,4
addi 8,4,4
addi 7,5,8
addi 10,3,8
addi 11,4,8
li 9,0
ble- 7,.L47
.L51:
slwi 0,9,2
lfsx 8,3,0 # Load vx
add 9,9,30
lfsx 7,12,0 # Load vy
cmplw 7,31,9
lfsx 6,10,0 # Load vz
lfsx 10,4,0 # Load px
lfsx 9,8,0 # Load py
lfsx 5,11,0 # Load pz
lfsx 4,5,0 # Load ax
lfsx 3,6,0 # Load ay
lfsx 2,7,0 # Load az

fmadds 0,8,1,10 # Update npx
fmadds 13,7,1,9 # Update npy
fmadds 12,6,1,5 # Update npz
fmadds 11,4,1,8 # Update nvx
fmadds 10,3,1,7 # Update nvy
fmadds 9,2,1,6 # Update nvz

stfsx 0,4,0 # Store npx
stfsx 13,8,0 # Store npy
stfsx 12,11,0 # Store npz
stfsx 11,3,0 # Store nvx
stfsx 10,12,0 # Store nvy
stfsx 9,10,0 # Store nvz

bgt+ 7,.L51
.L47:
lwz 30,24(1)
lwz 31,28(1)
addi 1,1,32
blr
The loads are now properly scheduled and moved as far in advance as possible. The pattern [Load --> Update --> Store] is usually the optimal pattern for simple memory transformations on a superscalar RISC-like architecture, and is exactly what is being emitted. This is reasonably close to good hand-written assembly for the same code (without re-defining the problem), and the code now very suitable for unrolling.
Simplify expressions. Do not mix memory access with calculations. Use the [ Load --> Update --> Store ] pattern.
Summary
  • Strict aliasing means that two objects of different types cannot refer to the same location in memory. Enable this option in GCC with the -fstrict-aliasing flag. Be sure that all code can safely run with this rule enabled. Enable strict aliasing related warnings with -Wstrict-aliasing, but do not expect to be warned in all cases.
  • Compare the assembly output of the function with restricted pointers and file scope arrays to ensure that all of the possible aliasing information has been used.
  • Only use restricted leaf pointers. Use of parent pointers may break the restrict contract.
  • Publish as many assumptions as possible about aliasing information in the function declaration.
  • Memory windows may be overlapping and still be without aliases. Do not limit the data design to non-overlapping windows.
  • Begin using the restrict keyword immediately. Retrofit old code as soon as possible.
  • Keep loads and stores separated from calculations. This results in better scheduling in GCC, and makes the relationship between the output assembly and the original source clearer.
Additional Reading