Unaligned scalar load and store on the SPU
Mike Acton
September 15, 2006
Albert Noll, a student at UC Irvine, is working on an interesting project. According to him:
"I am currently working on a java virtual machine runtime environment which
hides the heterogeneity of the Cell architecture. Conventional java code
can be executed and benefit from the numerous execution units the Cell
architecture offers. I am doing some java benchmarks (java grande) to test
the efficiency of the implementation, but I still have some problems
achieving really good results."
One of the problems Albert has encountered recently is in loading and storing scalar doubles to the java stack. He recently posed a question on the Cell Broadband Engine Architecture forum [ibm.com]:
"I have the following problem: I want to load a double value from an array (which represents the stack of the application)
and which is of type unsigned int. The two 32-bit values which
represent the double have been cast from a double before, so the bits
are set according to the double representation of the value."
The solution to this problem is to remember that the SPU has no scalar instruction set and cannot access local memory in anything other than 16-byte quadwords. The ability to compile scalar code on the SPU is something of a convenience, but it doesn't come without a penalty.
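To see why that matters, consider what a memory that can only be touched in aligned 16-byte lines forces on a scalar write: even storing a single byte becomes a load-modify-store of the whole containing quadword. A plain-C sketch of that constraint (hypothetical names, not SPU code):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration: emulate a memory that can only be read
 * and written in aligned 16-byte "quadwords", as SPU local store is. */

/* Read the aligned quadword containing addr (low 4 bits ignored). */
static void qword_load( const uint8_t* addr, uint8_t out[16] )
{
    const uint8_t* aligned = (const uint8_t*)( (uintptr_t)addr & ~(uintptr_t)0x0f );
    memcpy( out, aligned, 16 );
}

/* Store one byte: fetch the whole quadword, patch it, write it back. */
static void qword_store_byte( uint8_t* addr, uint8_t value )
{
    uint8_t* aligned = (uint8_t*)( (uintptr_t)addr & ~(uintptr_t)0x0f );
    uint8_t  line[16];
    memcpy( line, aligned, 16 );              /* load   */
    line[ (uintptr_t)addr & 0x0f ] = value;   /* modify */
    memcpy( aligned, line, 16 );              /* store  */
}
```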
The first step, before considering performance, is to be able to load and store the unaligned double values correctly.
Instead of:
{
// [ostack is unsigned int]
double tmp[2];
memcpy(tmp, ostack - 4, 16);
double res = tmp[0] * tmp[1];
memcpy(ostack - 4, &res, 8);
ostack -= 2;
}
Simplify a bit:
{
  const double arg0   = unaligned_load_double( (const char*)(ostack-4) );
  const double arg1   = unaligned_load_double( (const char*)(ostack-2) );
  const double result = arg0 * arg1;
  unaligned_store_double( (char*)(ostack-4), result );
ostack -= 2;
}
After Albert gets his basic scalar code working, the next two steps to optimizing this access would probably be:
- Instead of loading and storing each individual element, keep a "cache" of a couple of quadwords from the stack that are being worked with, and keep using them until the data must be flushed (stored) or a new line must be loaded.
- Start moving away from the scalar implementation completely. Both the SPU compiler and the SPU itself are most effective when working with vectors.
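The first idea can be sketched in portable C with hypothetical names: keep a copy of the quadword holding the active stack slots, work on that copy, and only touch memory when the access moves to a different 16-byte line. Real SPU code would keep the cached line in a register rather than a struct.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of a one-line stack cache. */
typedef struct {
    uint8_t  line[16];   /* cached copy of one aligned 16-byte line   */
    uint8_t* base;       /* aligned address the copy came from, or 0  */
    int      dirty;      /* copy modified since it was loaded?        */
} qword_cache;

static uint8_t* align16( void* p )
{
    return (uint8_t*)( (uintptr_t)p & ~(uintptr_t)0x0f );
}

static void cache_flush( qword_cache* c )
{
    if ( c->base && c->dirty ) memcpy( c->base, c->line, 16 );
    c->dirty = 0;
}

/* Make sure the cache holds the line containing addr. */
static void cache_fill( qword_cache* c, void* addr )
{
    uint8_t* base = align16( addr );
    if ( base != c->base ) {
        cache_flush( c );
        memcpy( c->line, base, 16 );
        c->base = base;
    }
}

/* Read a 32-bit stack slot through the cache. */
static uint32_t cache_read_u32( qword_cache* c, uint32_t* slot )
{
    uint32_t v;
    cache_fill( c, slot );
    memcpy( &v, c->line + ( (uintptr_t)slot & 0x0f ), 4 );
    return v;
}

/* Write a 32-bit stack slot through the cache. No store reaches
 * memory until the access leaves this line or the cache is flushed. */
static void cache_write_u32( qword_cache* c, uint32_t* slot, uint32_t v )
{
    cache_fill( c, slot );
    memcpy( c->line + ( (uintptr_t)slot & 0x0f ), &v, 4 );
    c->dirty = 1;
}
```

Repeated reads and writes within one line then cost no loads or stores at all, which is exactly the case for a JVM operand stack where consecutive slots are hit over and over.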
Have a look at the unaligned load and store functions below and hopefully it will be clear why scalar values necessarily complicate things on the SPU and ultimately lead to poorer performance.
double unaligned_load_double( const char* const arg0 );
void unaligned_store_double( char* const arg0, double arg1 );
Or download the files directly:
unaligned_double.c
unaligned_double.h
vsl.c
vsl.h
If you have a question you think Mike Acton or one of the other members of CellPerformance can help you with, feel free to email or post in our forums.
unaligned_double.h
#ifndef UNALIGNED_DOUBLE_H
#define UNALIGNED_DOUBLE_H

double unaligned_load_double( const char* const arg0 );
void unaligned_store_double( char* const arg0, double arg1 );

#endif
unaligned_double.c
#include "unaligned_double.h"
#include "vsl.h"
#include <stdint.h>
#include <spu_intrinsics.h>

// Return the double stored at the unaligned address at (arg0)
double
unaligned_load_double( const char* const arg0 )
{
  uintptr_t source_addr_u = (uintptr_t)arg0;
  qword     source_addr   = si_from_uint( source_addr_u );

  // Unaligned load requires two reads since the double may be
  // stored across two aligned quadwords. These loads will read
  // starting at the quadword aligned address of source_addr
  // i.e. will ignore the low 4 bits of source_addr

  qword source_lo = si_lqd( source_addr, 0x00 );
  qword source_hi = si_lqd( source_addr, 0x10 );

  // Get low 4 bits of source address. This is the offset from the
  // aligned address.

  qword shift_offset = si_andi( source_addr, 0x0f );

  // Create a shuffle pattern which selects the appropriate
  // bytes of the double from the two quadwords.
  //
  // The value of the shift offset is stored in byte[3] of shift_offset.
  // 1. Generate the pattern 0x03030303_03030303_03030303_03030303
  // 2. Use shuffle to generate a qword with each byte filled with
  //    the shift offset.

  qword lo_byte_pattern      = si_ilh( 0x0303 );
  qword shift_offset_pattern = si_shufb( shift_offset, shift_offset, lo_byte_pattern );

  // Add shift_offset_pattern to a vector for shift left (which is
  // just a byte vector with bytes incrementing from 0-15). This will
  // generate a shuffle pattern which will be used to store the unaligned
  // double in the preferred slot of the result.

  qword vector_shift_left = (qword)_vsl;
  qword shift_pattern     = si_a( shift_offset_pattern, vector_shift_left );

  // Move double into preferred slot and extract.

  qword  result_preferred = si_shufb( source_lo, source_hi, shift_pattern );
  double result           = si_to_double( result_preferred );

  return (result);
}

// Write the double (arg1) at the unaligned address at (arg0)
void
unaligned_store_double( char* const arg0, double arg1 )
{
  uintptr_t source_addr_u = (uintptr_t)arg0;
  qword     source_addr   = si_from_uint( source_addr_u );
  qword     value         = si_from_double( arg1 );

  // Unaligned store requires two reads since the double may be
  // stored across two aligned quadwords. The two source lines
  // will be loaded, modified, then stored.

  qword source_lo = si_lqd( source_addr, 0x00 );
  qword source_hi = si_lqd( source_addr, 0x10 );

  // Get low 4 bits of source address. This is the offset from the
  // aligned address.

  qword shift_offset = si_andi( source_addr, 0x0f );

  // Create a shuffle pattern which selects the appropriate
  // bytes of the double from the low quadword.
  //
  // The value of the shift offset is stored in byte[3] of shift_offset.
  // 1. Generate the pattern 0x03030303_03030303_03030303_03030303
  // 2. Use shuffle to generate a qword with each byte filled with
  //    the shift offset.

  qword lo_byte_pattern         = si_ilh( 0x0303 );
  qword shift_offset_pattern_lo = si_shufb( shift_offset, shift_offset, lo_byte_pattern );

  // Create a shuffle pattern which selects the appropriate
  // bytes of the double from the high quadword:
  // high offset = (16 bytes - low offset)

  qword shift_adjust_hi         = si_ilh( 0x1010 );
  qword shift_offset_pattern_hi = si_sf( shift_adjust_hi, shift_offset_pattern_lo );

  // Subtract the shift offset patterns from a vector for shift left (which
  // is just a byte vector with bytes incrementing from 0-15). This will
  // generate shuffle patterns which will be used to store the unaligned
  // double at the unaligned locations within the two source lines.

  qword vector_shift_left = (qword)_vsl;
  qword shift_pattern_lo  = si_sf( shift_offset_pattern_lo, vector_shift_left );
  qword shift_pattern_hi  = si_sf( shift_offset_pattern_hi, vector_shift_left );

  // Mask the bytes that will be unmodified in the source lines.
  // Any shuffle pattern outside the range [0x00,0x07] is not used
  // by the value, so those bytes will be kept from the source lines.

  qword source_bits_mask_lo = si_clgtbi( shift_pattern_lo, 0x07 );
  qword source_bits_mask_hi = si_clgtbi( shift_pattern_hi, 0x07 );
  qword value_lo            = si_shufb( value, value, shift_pattern_lo );
  qword value_hi            = si_shufb( value, value, shift_pattern_hi );

  // Clear space in the source lines to store the value

  qword prepped_source_lo = si_and( source_lo, source_bits_mask_lo );
  qword prepped_source_hi = si_and( source_hi, source_bits_mask_hi );

  // Clear everything unwanted from the value

  qword prepped_value_lo = si_andc( value_lo, source_bits_mask_lo );
  qword prepped_value_hi = si_andc( value_hi, source_bits_mask_hi );

  // Combine the source lines and the value lines

  qword result_lo = si_or( prepped_source_lo, prepped_value_lo );
  qword result_hi = si_or( prepped_source_hi, prepped_value_hi );

  // Store the result

  si_stqd( result_lo, source_addr, 0x00 );
  si_stqd( result_hi, source_addr, 0x10 );
}
vsl.h
#ifndef VSL_H
#define VSL_H

extern const vector unsigned char _vsl;

#endif
vsl.c
// Vector for shift left
const vector unsigned char _vsl = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
                                   ,0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };