Branch-free implementation of half-precision (16 bit) floating point

Suggestions? Comments? Questions?

Send email to Mike Acton

Articles

Cross-compiling for PS3 Linux
n this article, I will detail the basic steps I used to get started building on a host PC and running on the PS3.

Unaligned scalar load and store on the SPU
An example of unaligned loads and stores on the SPU. The solution to this problem is to remember that the SPU does not have a scalar instruction set or access local memory in anything except 16 bytes quadwords.

atan2 on SPU
A branch-free implementation of atan2 vector floats for the SPU.

Branch-free implementation of half-precision (16 bit) floating point
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers.

Better Performance Through Branch Elimination
An introduction to branch penalties: Why it's a good idea to avoid branchy code.

Box Overlap
A look at a function to test for overlap between 3D boxes, and how to optimize it for the CBE.

A 4x4 Matrix Inverse
Study case about how to convert scalar code indo SIMD code for PPU and SPU using the matrix inverse as example.

Avoiding Microcoded Instructions On The PPU
Executing instructions from microcode can wreck havok on inner loop performance. Find out which instructions are microcoded and how to avoid them.

Choosing to Avoid Branches: A Small Altivec Example
An example of why less instructions doesn't always equal faster code.

More Techniques for Eliminating Branches
Some additional examples for eliminating integer and floating-point branches.

Programming with Branches, Patterns and Tips
GCC follows some straightforward rules that are useful to know when programming with branches.

Benefits to Branch Elimination
The fundamental principal behind branch elimination is that expressing a value as a simple function of its inputs (a single basic block) is often more efficient than selecting a result through a change in control flow (branching).

Background on Branching
A background in understanding how branches operate on the PPU and SPU.

Links

No Insider Info!

Although discussions on applying the Cell processor to game development are welcome here, do not ask for insider information related to Sony's Playstation 3.

The details of the hardware and development are covered by a non-disclosure agreement and under no conditions will confidential information be permitted on this site.

Playstation 3 developers are welcome to participate in the discussions but be aware that this is a publicly accessable site and information not available to the general public may not be disclosed.

Keep it clean so that we can continue to build on the community of Cell developers both inside and outside video game development.

Thank you for your cooperation,
Mike.

Legal

Content Copyright © 2006 by Mike Acton. All Rights Reserved.

This site uses the Movable Type 3.2 content engine.

This site uses the phpBB bulletin board engine Copyright © 2001, 2005 phpBB Group.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc

PowerPC is a trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds in the U.S. and other countries.

Macintosh, and Mac are registered trademarks of Apple Computer, Inc

All other trademarks are the property of their respective owners.

Branch-free implementation of half-precision (16 bit) floating point

Mike Acton
July 17, 2006

Update! (19 July 06) Added Multiply. Fixed a problem with using __builtin_clz().

Update! (17 July 06) The code has been considerably refactored. Decided to go with single function per expression. The expressions have been reduced as a first optimization pass.

Project

The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers (That does not use floating point hardware). This example should echo the IEEE 754 standard for floating point numbers as closely as reasonable, including support for +/- INF, QNan, SNan, and denormalized numbers. However, exceptions will not be implemented.

Half-precision floats are used in cases where neither the range nor the precision of 32 bit floating point numbers are needed, but where some dynamic precision is required. Two common uses are for image transformation, where the range of each component (e.g. red, green, blue, alpha) is typically limited to or near [0.0,1.0] or vertex data (e.g. position, texture coordinates, color values, etc.).

The main advantage of half-precision floats is their size. Beyond the considerable potential for memory savings, processing a large number of half-precision values is more cache-friendly than using 32 bit values.

The current released version (including tests) can be downloaded here: half.tgz

half_to_float() Convert Half To Float (Scalar Version)
half_from_float() Convert Float to Half (Scalar Version)
half_add() Half Add (Scalar Version)
half_sub() Half Subtract (Scalar Version)
half_mul() Half Multiply (Scalar Version)

There is sometimes confusion on the best way to pluralize a half-precision floating-point number (half) in English. I asked Steven Pinker, a renowned expert in the English language and our mis-use of it, what he thought. Here is his reply:

"Dear Mike,

Well, in my line of work I should be asking what sounds right to people and then trying to explain that as a datum, rather than telling people what is right. But if it was me, I would probably say "halfs," not "halves," for the same reason as the team in Toronto is called the Maple Leafs, and as explained in the chapter "Of Mice and Men" in my book Words and Rules. Basically, a half-precision floating point datum is not a "half" of anything, even metaphorically speaking, so the irregular form doesn't get inherited by the neologism. See the chapter for more details.

Best,
Steve Pinker"

Note that all integer operations all encapulated in macros (implemented as static inline functions). This process both helps to enforce the rule of branch-free coding and will make writing the SIMD version much easier.

half.c

  0// Branch-free implementation of half-precision (16 bit) floating point
  1// Copyright 2006 Mike Acton 
  2// 
  3// Permission is hereby granted, free of charge, to any person obtaining a 
  4// copy of this software and associated documentation files (the "Software"),
  5// to deal in the Software without restriction, including without limitation
  6// the rights to use, copy, modify, merge, publish, distribute, sublicense, 
  7// and/or sell copies of the Software, and to permit persons to whom the 
  8// Software is furnished to do so, subject to the following conditions:
  9// 
 10// The above copyright notice and this permission notice shall be included 
 11// in all copies or substantial portions of the Software.
 12// 
 13// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 14// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
 15// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 16// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
 17// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 18// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 19// THE SOFTWARE
 20//
 21// Half-precision floating point format
 22// ------------------------------------
 23//
 24//   | Field    | Last | First | Note
 25//   |----------|------|-------|----------
 26//   | Sign     | 15   | 15    |
 27//   | Exponent | 14   | 10    | Bias = 15
 28//   | Mantissa | 9    | 0     |
 29//
 30// Compiling
 31// ---------
 32//
 33//  Preferred compile flags for GCC: 
 34//     -O3 -fstrict-aliasing -std=c99 -pedantic -Wall -Wstrict-aliasing
 35//
 36//     This file is a C99 source file, intended to be compiled with a C99 
 37//     compliant compiler. However, for the moment it remains combatible
 38//     with C++98. Therefore if you are using a compiler that poorly implements
 39//     C standards (e.g. MSVC), it may be compiled as C++. This is not
 40//     guaranteed for future versions. 
 41//
 42
 43#include "half.h"
 44
 45// Load immediate
 46static inline uint32_t _uint32_li( uint32_t a )
 47{
 48  return (a);
 49}
 50
 51// Decrement
 52static inline uint32_t _uint32_dec( uint32_t a )
 53{
 54  return (a - 1);
 55}
 56
 57// Increment
 58static inline uint32_t _uint32_inc( uint32_t a )
 59{
 60  return (a + 1);
 61}
 62
 63// Complement
 64static inline uint32_t _uint32_not( uint32_t a )
 65{
 66  return (~a);
 67}
 68
 69// Negate
 70static inline uint32_t _uint32_neg( uint32_t a )
 71{
 72  return (-a);
 73}
 74
 75// Extend sign
 76static inline uint32_t _uint32_ext( uint32_t a )
 77{
 78  return (((int32_t)a)>>31);
 79}
 80
 81// And
 82static inline uint32_t _uint32_and( uint32_t a, uint32_t b )
 83{
 84  return (a & b);
 85}
 86
 87// Exclusive Or
 88static inline uint32_t _uint32_xor( uint32_t a, uint32_t b )
 89{
 90  return (a ^ b);
 91}
 92
 93// And with Complement
 94static inline uint32_t _uint32_andc( uint32_t a, uint32_t b )
 95{
 96  return (a & ~b);
 97}
 98
 99// Or
100static inline uint32_t _uint32_or( uint32_t a, uint32_t b )
101{
102  return (a | b);
103}
104
105// Shift Right Logical
106static inline uint32_t _uint32_srl( uint32_t a, int sa )
107{
108  return (a >> sa);
109}
110
111// Shift Left Logical
112static inline uint32_t _uint32_sll( uint32_t a, int sa )
113{
114  return (a << sa);
115}
116
117// Add
118static inline uint32_t _uint32_add( uint32_t a, uint32_t b )
119{
120  return (a + b);
121}
122
123// Subtract
124static inline uint32_t _uint32_sub( uint32_t a, uint32_t b )
125{
126  return (a - b);
127}
128
129// Multiply
130static inline uint32_t _uint32_mul( uint32_t a, uint32_t b )
131{
132  return (a * b);
133}
134
135// Select on Sign bit
136static inline uint32_t _uint32_sels( uint32_t test, uint32_t a, uint32_t b )
137{
138  const uint32_t mask   = _uint32_ext( test );
139  const uint32_t sel_a  = _uint32_and(  a,     mask  );
140  const uint32_t sel_b  = _uint32_andc( b,     mask  );
141  const uint32_t result = _uint32_or(   sel_a, sel_b );
142
143  return (result);
144}
145
146// Select Bits on mask
147static inline uint32_t _uint32_selb( uint32_t mask, uint32_t a, uint32_t b )
148{
149  const uint32_t sel_a  = _uint32_and(  a,     mask  );
150  const uint32_t sel_b  = _uint32_andc( b,     mask  );
151  const uint32_t result = _uint32_or(   sel_a, sel_b );
152
153  return (result);
154}
155
156// Load Immediate
157static inline uint16_t _uint16_li( uint16_t a )
158{
159  return (a);
160}
161
162// Extend sign
163static inline uint16_t _uint16_ext( uint16_t a )
164{
165  return (((int16_t)a)>>15);
166}
167
168// Negate
169static inline uint16_t _uint16_neg( uint16_t a )
170{
171  return (-a);
172}
173
174// Complement
175static inline uint16_t _uint16_not( uint16_t a )
176{
177  return (~a);
178}
179
180// Decrement
181static inline uint16_t _uint16_dec( uint16_t a )
182{
183  return (a - 1);
184}
185
186// Shift Left Logical
187static inline uint16_t _uint16_sll( uint16_t a, int sa )
188{
189  return (a << sa);
190}
191
192// Shift Right Logical
193static inline uint16_t _uint16_srl( uint16_t a, int sa )
194{
195  return (a >> sa);
196}
197
198// Add
199static inline uint16_t _uint16_add( uint16_t a, uint16_t b )
200{
201  return (a + b);
202}
203
204// Subtract
205static inline uint16_t _uint16_sub( uint16_t a, uint16_t b )
206{
207  return (a - b);
208}
209
210// And
211static inline uint16_t _uint16_and( uint16_t a, uint16_t b )
212{
213  return (a & b);
214}
215
216// Or
217static inline uint16_t _uint16_or( uint16_t a, uint16_t b )
218{
219  return (a | b);
220}
221
222// Exclusive Or
223static inline uint16_t _uint16_xor( uint16_t a, uint16_t b )
224{
225  return (a ^ b);
226}
227
228// And with Complement
229static inline uint16_t _uint16_andc( uint16_t a, uint16_t b )
230{
231  return (a & ~b);
232}
233
234// And then Shift Right Logical
235static inline uint16_t _uint16_andsrl( uint16_t a, uint16_t b, int sa )
236{
237  return ((a & b) >> sa);
238}
239
240// Shift Right Logical then Mask
241static inline uint16_t _uint16_srlm( uint16_t a, int sa, uint16_t mask )
242{
243  return ((a >> sa) & mask);
244}
245
246// Add then Mask
247static inline uint16_t _uint16_addm( uint16_t a, uint16_t b, uint16_t mask )
248{
249  return ((a + b) & mask);
250}
251
252
253// Select on Sign bit
254static inline uint16_t _uint16_sels( uint16_t test, uint16_t a, uint16_t b )
255{
256  const uint16_t mask   = _uint16_ext( test );
257  const uint16_t sel_a  = _uint16_and(  a,     mask  );
258  const uint16_t sel_b  = _uint16_andc( b,     mask  );
259  const uint16_t result = _uint16_or(   sel_a, sel_b );
260
261  return (result);
262}
263
264// Count Leading Zeros
265static inline uint32_t _uint32_cntlz( uint32_t x )
266{
267#ifdef __GNUC__
268  /* NOTE: __builtin_clz is undefined for x == 0 */
269  /* On PowerPC, this will map to insn: cntlzw   */
270  /* On Pentium, this will map to insn: clz      */
271  uint32_t is_x_nez_msb = _uint32_neg( x );
272  uint32_t nlz          = __builtin_clz( x );
273  uint32_t result       = _uint32_sels( is_x_nez_msb, nlz, 0x00000020 );
274  return (result);
275#else
276  const uint32_t x0  = _uint32_srl(  x,  1 );
277  const uint32_t x1  = _uint32_or(   x,  x0 );
278  const uint32_t x2  = _uint32_srl(  x1, 2 );
279  const uint32_t x3  = _uint32_or(   x1, x2 );
280  const uint32_t x4  = _uint32_srl(  x3, 4 );
281  const uint32_t x5  = _uint32_or(   x3, x4 );
282  const uint32_t x6  = _uint32_srl(  x5, 8 );
283  const uint32_t x7  = _uint32_or(   x5, x6 );
284  const uint32_t x8  = _uint32_srl(  x7, 16 );
285  const uint32_t x9  = _uint32_or(   x7, x8 );
286  const uint32_t xA  = _uint32_not(  x9 );
287  const uint32_t xB  = _uint32_srl(  xA, 1 );
288  const uint32_t xC  = _uint32_and(  xB, 0x55555555 );
289  const uint32_t xD  = _uint32_sub(  xA, xC );
290  const uint32_t xE  = _uint32_and(  xD, 0x33333333 );
291  const uint32_t xF  = _uint32_srl(  xD, 2 );
292  const uint32_t x10 = _uint32_and(  xF, 0x33333333 );
293  const uint32_t x11 = _uint32_add(  xE, x10 );
294  const uint32_t x12 = _uint32_srl(  x11, 4 );
295  const uint32_t x13 = _uint32_add(  x11, x12 );
296  const uint32_t x14 = _uint32_and(  x13, 0x0f0f0f0f );
297  const uint32_t x15 = _uint32_srl(  x14, 8 );
298  const uint32_t x16 = _uint32_add(  x14, x15 );
299  const uint32_t x17 = _uint32_srl(  x16, 16 );
300  const uint32_t x18 = _uint32_add(  x16, x17 );
301  const uint32_t x19 = _uint32_and(  x18, 0x0000003f );
302  return ( x19 );
303#endif
304}
305
306// Count Leading Zeros
307static inline uint16_t _uint16_cntlz( uint16_t x )
308{
309#ifdef __GNUC__
310  uint16_t nlz32 = (uint16_t)_uint32_cntlz( (uint32_t)x );
311  uint32_t nlz   = _uint32_sub( nlz32, 16 );
312  return (nlz);
313#else
314  const uint16_t x0  = _uint16_srl(  x,  1 );
315  const uint16_t x1  = _uint16_or(   x,  x0 );
316  const uint16_t x2  = _uint16_srl(  x1, 2 );
317  const uint16_t x3  = _uint16_or(   x1, x2 );
318  const uint16_t x4  = _uint16_srl(  x3, 4 );
319  const uint16_t x5  = _uint16_or(   x3, x4 );
320  const uint16_t x6  = _uint16_srl(  x5, 8 );
321  const uint16_t x7  = _uint16_or(   x5, x6 );
322  const uint16_t x8  = _uint16_not(  x7 );
323  const uint16_t x9  = _uint16_srlm( x8, 1, 0x5555 );
324  const uint16_t xA  = _uint16_sub(  x8, x9 );
325  const uint16_t xB  = _uint16_and(  xA, 0x3333 );
326  const uint16_t xC  = _uint16_srlm( xA, 2, 0x3333 );
327  const uint16_t xD  = _uint16_add(  xB, xC );
328  const uint16_t xE  = _uint16_srl(  xD, 4 );
329  const uint16_t xF  = _uint16_addm( xD, xE, 0x0f0f );
330  const uint16_t x10 = _uint16_srl(  xF, 8 );
331  const uint16_t x11 = _uint16_addm( xF, x10, 0x001f );
332  return ( x11 );
333#endif
334}
335

336uint16_t
337half_from_float( uint32_t f )
338{
339  const uint32_t one                        = _uint32_li( 0x00000001 );
340  const uint32_t f_s_mask                   = _uint32_li( 0x80000000 );
341  const uint32_t f_e_mask                   = _uint32_li( 0x7f800000 );
342  const uint32_t f_m_mask                   = _uint32_li( 0x007fffff );
343  const uint32_t f_m_hidden_bit             = _uint32_li( 0x00800000 );
344  const uint32_t f_m_round_bit              = _uint32_li( 0x00001000 );
345  const uint32_t f_snan_mask                = _uint32_li( 0x7fc00000 );
346  const uint32_t f_e_pos                    = _uint32_li( 0x00000017 );
347  const uint32_t h_e_pos                    = _uint32_li( 0x0000000a );
348  const uint32_t h_e_mask                   = _uint32_li( 0x00007c00 );
349  const uint32_t h_snan_mask                = _uint32_li( 0x00007e00 );
350  const uint32_t h_e_mask_value             = _uint32_li( 0x0000001f );
351  const uint32_t f_h_s_pos_offset           = _uint32_li( 0x00000010 );
352  const uint32_t f_h_bias_offset            = _uint32_li( 0x00000070 );
353  const uint32_t f_h_m_pos_offset           = _uint32_li( 0x0000000d );
354  const uint32_t h_nan_min                  = _uint32_li( 0x00007c01 );
355  const uint32_t f_h_e_biased_flag          = _uint32_li( 0x0000008f );
356  const uint32_t f_s                        = _uint32_and( f,               f_s_mask         );
357  const uint32_t f_e                        = _uint32_and( f,               f_e_mask         );
358  const uint16_t h_s                        = _uint32_srl( f_s,             f_h_s_pos_offset );
359  const uint32_t f_m                        = _uint32_and( f,               f_m_mask         );
360  const uint16_t f_e_amount                 = _uint32_srl( f_e,             f_e_pos          );
361  const uint32_t f_e_half_bias              = _uint32_sub( f_e_amount,      f_h_bias_offset  );
362  const uint32_t f_snan                     = _uint32_and( f,               f_snan_mask      );
363  const uint32_t f_m_round_mask             = _uint32_and( f_m,             f_m_round_bit    );
364  const uint32_t f_m_round_offset           = _uint32_sll( f_m_round_mask,  one              );
365  const uint32_t f_m_rounded                = _uint32_add( f_m,             f_m_round_offset );
366  const uint32_t f_m_denorm_sa              = _uint32_sub( one,             f_e_half_bias    );
367  const uint32_t f_m_with_hidden            = _uint32_or(  f_m_rounded,     f_m_hidden_bit   );
368  const uint32_t f_m_denorm                 = _uint32_srl( f_m_with_hidden, f_m_denorm_sa    );
369  const uint32_t h_m_denorm                 = _uint32_srl( f_m_denorm,      f_h_m_pos_offset );
370  const uint32_t f_m_rounded_overflow       = _uint32_and( f_m_rounded,     f_m_hidden_bit   );
371  const uint32_t m_nan                      = _uint32_srl( f_m,             f_h_m_pos_offset );
372  const uint32_t h_em_nan                   = _uint32_or(  h_e_mask,        m_nan            );
373  const uint32_t h_e_norm_overflow_offset   = _uint32_inc( f_e_half_bias );
374  const uint32_t h_e_norm_overflow          = _uint32_sll( h_e_norm_overflow_offset, h_e_pos          );
375  const uint32_t h_e_norm                   = _uint32_sll( f_e_half_bias,            h_e_pos          );
376  const uint32_t h_m_norm                   = _uint32_srl( f_m_rounded,              f_h_m_pos_offset );
377  const uint32_t h_em_norm                  = _uint32_or(  h_e_norm,                 h_m_norm         );
378  const uint32_t is_h_ndenorm_msb           = _uint32_sub( f_h_bias_offset,   f_e_amount    );
379  const uint32_t is_f_e_flagged_msb         = _uint32_sub( f_h_e_biased_flag, f_e_half_bias );
380  const uint32_t is_h_denorm_msb            = _uint32_not( is_h_ndenorm_msb );
381  const uint32_t is_f_m_eqz_msb             = _uint32_dec( f_m   );
382  const uint32_t is_h_nan_eqz_msb           = _uint32_dec( m_nan );
383  const uint32_t is_f_inf_msb               = _uint32_and( is_f_e_flagged_msb, is_f_m_eqz_msb   );
384  const uint32_t is_f_nan_underflow_msb     = _uint32_and( is_f_e_flagged_msb, is_h_nan_eqz_msb );
385  const uint32_t is_e_overflow_msb          = _uint32_sub( h_e_mask_value,     f_e_half_bias    );
386  const uint32_t is_h_inf_msb               = _uint32_or(  is_e_overflow_msb,  is_f_inf_msb     );
387  const uint32_t is_f_nsnan_msb             = _uint32_sub( f_snan,             f_snan_mask      );
388  const uint32_t is_m_norm_overflow_msb     = _uint32_neg( f_m_rounded_overflow );
389  const uint32_t is_f_snan_msb              = _uint32_not( is_f_nsnan_msb );
390  const uint32_t h_em_overflow_result       = _uint32_sels( is_m_norm_overflow_msb, h_e_norm_overflow, h_em_norm                 );
391  const uint32_t h_em_nan_result            = _uint32_sels( is_f_e_flagged_msb,     h_em_nan,          h_em_overflow_result      );
392  const uint32_t h_em_nan_underflow_result  = _uint32_sels( is_f_nan_underflow_msb, h_nan_min,         h_em_nan_result           );
393  const uint32_t h_em_inf_result            = _uint32_sels( is_h_inf_msb,           h_e_mask,          h_em_nan_underflow_result );
394  const uint32_t h_em_denorm_result         = _uint32_sels( is_h_denorm_msb,        h_m_denorm,        h_em_inf_result           );
395  const uint32_t h_em_snan_result           = _uint32_sels( is_f_snan_msb,          h_snan_mask,       h_em_denorm_result        );
396  const uint32_t h_result                   = _uint32_or( h_s, h_em_snan_result );
397
398  return (uint16_t)(h_result);
399}
400

401uint32_t 
402half_to_float( uint16_t h )
403{
404  const uint32_t h_e_mask              = _uint32_li( 0x00007c00 );
405  const uint32_t h_m_mask              = _uint32_li( 0x000003ff );
406  const uint32_t h_s_mask              = _uint32_li( 0x00008000 );
407  const uint32_t h_f_s_pos_offset      = _uint32_li( 0x00000010 );
408  const uint32_t h_f_e_pos_offset      = _uint32_li( 0x0000000d );
409  const uint32_t h_f_bias_offset       = _uint32_li( 0x0001c000 );
410  const uint32_t f_e_mask              = _uint32_li( 0x7f800000 );
411  const uint32_t f_m_mask              = _uint32_li( 0x007fffff );
412  const uint32_t h_f_e_denorm_bias     = _uint32_li( 0x0000007e );
413  const uint32_t h_f_m_denorm_sa_bias  = _uint32_li( 0x00000008 );
414  const uint32_t f_e_pos               = _uint32_li( 0x00000017 );
415  const uint32_t h_e_mask_minus_one    = _uint32_li( 0x00007bff );
416  const uint32_t h_e                   = _uint32_and( h, h_e_mask );
417  const uint32_t h_m                   = _uint32_and( h, h_m_mask );
418  const uint32_t h_s                   = _uint32_and( h, h_s_mask );
419  const uint32_t h_e_f_bias            = _uint32_add( h_e, h_f_bias_offset );
420  const uint32_t h_m_nlz               = _uint32_cntlz( h_m );
421  const uint32_t f_s                   = _uint32_sll( h_s,        h_f_s_pos_offset );
422  const uint32_t f_e                   = _uint32_sll( h_e_f_bias, h_f_e_pos_offset );
423  const uint32_t f_m                   = _uint32_sll( h_m,        h_f_e_pos_offset );
424  const uint32_t f_em                  = _uint32_or(  f_e,        f_m              );
425  const uint32_t h_f_m_sa              = _uint32_sub( h_m_nlz,             h_f_m_denorm_sa_bias );
426  const uint32_t f_e_denorm_unpacked   = _uint32_sub( h_f_e_denorm_bias,   h_f_m_sa             );
427  const uint32_t h_f_m                 = _uint32_sll( h_m,                 h_f_m_sa             );
428  const uint32_t f_m_denorm            = _uint32_and( h_f_m,               f_m_mask             );
429  const uint32_t f_e_denorm            = _uint32_sll( f_e_denorm_unpacked, f_e_pos              );
430  const uint32_t f_em_denorm           = _uint32_or(  f_e_denorm,          f_m_denorm           );
431  const uint32_t f_em_nan              = _uint32_or(  f_e_mask,            f_m                  );
432  const uint32_t is_e_eqz_msb          = _uint32_dec(  h_e );
433  const uint32_t is_m_nez_msb          = _uint32_neg(  h_m );
434  const uint32_t is_e_flagged_msb      = _uint32_sub(  h_e_mask_minus_one, h_e );
435  const uint32_t is_zero_msb           = _uint32_andc( is_e_eqz_msb,       is_m_nez_msb );
436  const uint32_t is_inf_msb            = _uint32_andc( is_e_flagged_msb,   is_m_nez_msb );
437  const uint32_t is_denorm_msb         = _uint32_and(  is_m_nez_msb,       is_e_eqz_msb );
438  const uint32_t is_nan_msb            = _uint32_and(  is_e_flagged_msb,   is_m_nez_msb ); 
439  const uint32_t is_zero               = _uint32_ext(  is_zero_msb );
440  const uint32_t f_zero_result         = _uint32_andc( f_em, is_zero );
441  const uint32_t f_denorm_result       = _uint32_sels( is_denorm_msb, f_em_denorm, f_zero_result );
442  const uint32_t f_inf_result          = _uint32_sels( is_inf_msb,    f_e_mask,    f_denorm_result );
443  const uint32_t f_nan_result          = _uint32_sels( is_nan_msb,    f_em_nan,    f_inf_result    );
444  const uint32_t f_result              = _uint32_or( f_s, f_nan_result );
445 
446  return (f_result);
447}
448

449// half_add
450// --------
451//
452//  (SUM)        uint16_t z = half_add( x, y );
453//  (DIFFERENCE) uint16_t z = half_add( x, -y );
454//
455//  * Difference of ZEROs is always +ZERO
456//  * Sum round with guard + round + sticky bit (grs)
457//  * QNaN + [x]  = QNaN
458//  * [x]  + +INF = +INF
459//  * [x]  - -INF = -INF
460//  * INF  - INF  = SNaN
461//
462//  Will have exactly (0 ulps difference) the same result as:
463//  (Round up)
464//
465//     union FLOAT_32
466//     {
467//       float    f32;
468//       uint32_t u32;
469//     };
470//
471//     union FLOAT_32 fx = { .u32 = half_to_float( x ) };
472//     union FLOAT_32 fy = { .u32 = half_to_float( y ) };
473//     union FLOAT_32 fz = { .f32 = fx.f32 + fy.f32    };
474//     uint16_t       z  = float_to_half( fz );
475//
476uint16_t
477half_add( uint16_t x, uint16_t y )
478{
479  const uint16_t one                       = _uint16_li( 0x0001 );
480  const uint16_t msb_to_lsb_sa             = _uint16_li( 0x000f );
481  const uint16_t h_s_mask                  = _uint16_li( 0x8000 );
482  const uint16_t h_e_mask                  = _uint16_li( 0x7c00 );
483  const uint16_t h_m_mask                  = _uint16_li( 0x03ff );
484  const uint16_t h_m_msb_mask              = _uint16_li( 0x2000 );
485  const uint16_t h_m_msb_sa                = _uint16_li( 0x000d );
486  const uint16_t h_m_hidden                = _uint16_li( 0x0400 );
487  const uint16_t h_e_pos                   = _uint16_li( 0x000a );
488  const uint16_t h_e_bias_minus_one        = _uint16_li( 0x000e );
489  const uint16_t h_m_grs_carry             = _uint16_li( 0x4000 );
490  const uint16_t h_m_grs_carry_pos         = _uint16_li( 0x000e );
491  const uint16_t h_grs_size                = _uint16_li( 0x0003 );
492  const uint16_t h_snan                    = _uint16_li( 0xfe00 );
493  const uint16_t h_e_mask_minus_one        = _uint16_li( 0x7bff );
494  const uint16_t h_grs_round_carry         = _uint16_sll( one, h_grs_size );
495  const uint16_t h_grs_round_mask          = _uint16_sub( h_grs_round_carry, one );
496  const uint16_t x_e                       = _uint16_and( x, h_e_mask );
497  const uint16_t y_e                       = _uint16_and( y, h_e_mask );
498  const uint16_t is_y_e_larger_msb         = _uint16_sub( x_e, y_e );
499  const uint16_t a                         = _uint16_sels( is_y_e_larger_msb, y, x);
500  const uint16_t a_s                       = _uint16_and( a, h_s_mask );
501  const uint16_t a_e                       = _uint16_and( a, h_e_mask );
502  const uint16_t a_m_no_hidden_bit         = _uint16_and( a, h_m_mask );
503  const uint16_t a_em_no_hidden_bit        = _uint16_or( a_e, a_m_no_hidden_bit );
504  const uint16_t b                         = _uint16_sels( is_y_e_larger_msb, x, y);
505  const uint16_t b_s                       = _uint16_and( b, h_s_mask );
506  const uint16_t b_e                       = _uint16_and( b, h_e_mask );
507  const uint16_t b_m_no_hidden_bit         = _uint16_and( b, h_m_mask );
508  const uint16_t b_em_no_hidden_bit        = _uint16_or( b_e, b_m_no_hidden_bit );
509  const uint16_t is_diff_sign_msb          = _uint16_xor( a_s, b_s );
510  const uint16_t is_a_inf_msb              = _uint16_sub( h_e_mask_minus_one, a_em_no_hidden_bit );
511  const uint16_t is_b_inf_msb              = _uint16_sub( h_e_mask_minus_one, b_em_no_hidden_bit );
512  const uint16_t is_undenorm_msb           = _uint16_dec( a_e );
513  const uint16_t is_undenorm               = _uint16_ext( is_undenorm_msb );
514  const uint16_t is_both_inf_msb           = _uint16_and( is_a_inf_msb, is_b_inf_msb );
515  const uint16_t is_invalid_inf_op_msb     = _uint16_and( is_both_inf_msb, b_s );
516  const uint16_t is_a_e_nez_msb            = _uint16_neg( a_e );
517  const uint16_t is_b_e_nez_msb            = _uint16_neg( b_e );
518  const uint16_t is_a_e_nez                = _uint16_ext( is_a_e_nez_msb );
519  const uint16_t is_b_e_nez                = _uint16_ext( is_b_e_nez_msb );
520  const uint16_t a_m_hidden_bit            = _uint16_and( is_a_e_nez, h_m_hidden );
521  const uint16_t b_m_hidden_bit            = _uint16_and( is_b_e_nez, h_m_hidden );
522  const uint16_t a_m_no_grs                = _uint16_or( a_m_no_hidden_bit, a_m_hidden_bit );
523  const uint16_t b_m_no_grs                = _uint16_or( b_m_no_hidden_bit, b_m_hidden_bit );
524  const uint16_t diff_e                    = _uint16_sub( a_e,        b_e );
525  const uint16_t a_e_unbias                = _uint16_sub( a_e,        h_e_bias_minus_one );
526  const uint16_t a_m                       = _uint16_sll( a_m_no_grs, h_grs_size );
527  const uint16_t a_e_biased                = _uint16_srl( a_e,        h_e_pos );
528  const uint16_t m_sa_unbias               = _uint16_srl( a_e_unbias, h_e_pos );
529  const uint16_t m_sa_default              = _uint16_srl( diff_e,     h_e_pos );
530  const uint16_t m_sa_unbias_mask          = _uint16_andc( is_a_e_nez_msb,   is_b_e_nez_msb );
531  const uint16_t m_sa                      = _uint16_sels( m_sa_unbias_mask, m_sa_unbias, m_sa_default );
532  const uint16_t b_m_no_sticky             = _uint16_sll( b_m_no_grs,        h_grs_size );
533  const uint16_t sh_m                      = _uint16_srl( b_m_no_sticky,     m_sa );
534  const uint16_t sticky_overflow           = _uint16_sll( one,               m_sa );
535  const uint16_t sticky_mask               = _uint16_dec( sticky_overflow );
536  const uint16_t sticky_collect            = _uint16_and( b_m_no_sticky, sticky_mask );
537  const uint16_t is_sticky_set_msb         = _uint16_neg( sticky_collect );
538  const uint16_t sticky                    = _uint16_srl( is_sticky_set_msb, msb_to_lsb_sa);
539  const uint16_t b_m                       = _uint16_or( sh_m, sticky );
540  const uint16_t is_c_m_ab_pos_msb         = _uint16_sub( b_m, a_m );
541  const uint16_t c_inf                     = _uint16_or( a_s, h_e_mask );
542  const uint16_t c_m_sum                   = _uint16_add( a_m, b_m );
543  const uint16_t c_m_diff_ab               = _uint16_sub( a_m, b_m );
544  const uint16_t c_m_diff_ba               = _uint16_sub( b_m, a_m );
545  const uint16_t c_m_smag_diff             = _uint16_sels( is_c_m_ab_pos_msb, c_m_diff_ab, c_m_diff_ba );
546  const uint16_t c_s_diff                  = _uint16_sels( is_c_m_ab_pos_msb, a_s,         b_s         );
547  const uint16_t c_s                       = _uint16_sels( is_diff_sign_msb,  c_s_diff,    a_s         );
548  const uint16_t c_m_smag_diff_nlz         = _uint16_cntlz( c_m_smag_diff );
549  const uint16_t diff_norm_sa              = _uint16_sub( c_m_smag_diff_nlz, one );
550  const uint16_t is_diff_denorm_msb        = _uint16_sub( a_e_biased, diff_norm_sa );
551  const uint16_t is_diff_denorm            = _uint16_ext( is_diff_denorm_msb );
552  const uint16_t is_a_or_b_norm_msb        = _uint16_neg( a_e_biased );
553  const uint16_t diff_denorm_sa            = _uint16_dec( a_e_biased );
554  const uint16_t c_m_diff_denorm           = _uint16_sll( c_m_smag_diff, diff_denorm_sa );
555  const uint16_t c_m_diff_norm             = _uint16_sll( c_m_smag_diff, diff_norm_sa );
556  const uint16_t c_e_diff_norm             = _uint16_sub( a_e_biased,  diff_norm_sa );
557  const uint16_t c_m_diff_ab_norm          = _uint16_sels( is_diff_denorm_msb, c_m_diff_denorm, c_m_diff_norm );
558  const uint16_t c_e_diff_ab_norm          = _uint16_andc( c_e_diff_norm, is_diff_denorm );
559  const uint16_t c_m_diff                  = _uint16_sels( is_a_or_b_norm_msb, c_m_diff_ab_norm, c_m_smag_diff );
560  const uint16_t c_e_diff                  = _uint16_sels( is_a_or_b_norm_msb, c_e_diff_ab_norm, a_e_biased    );
561  const uint16_t is_diff_eqz_msb           = _uint16_dec( c_m_diff );
562  const uint16_t is_diff_exactly_zero_msb  = _uint16_and( is_diff_sign_msb, is_diff_eqz_msb );
563  const uint16_t is_diff_exactly_zero      = _uint16_ext( is_diff_exactly_zero_msb );
564  const uint16_t c_m_added                 = _uint16_sels( is_diff_sign_msb, c_m_diff, c_m_sum );
565  const uint16_t c_e_added                 = _uint16_sels( is_diff_sign_msb, c_e_diff, a_e_biased );
566  const uint16_t c_m_carry                 = _uint16_and( c_m_added, h_m_grs_carry );
567  const uint16_t is_c_m_carry_msb          = _uint16_neg( c_m_carry );
568  const uint16_t c_e_hidden_offset         = _uint16_andsrl( c_m_added, h_m_grs_carry, h_m_grs_carry_pos );
569  const uint16_t c_m_sub_hidden            = _uint16_srl( c_m_added, one );
570  const uint16_t c_m_no_hidden             = _uint16_sels( is_c_m_carry_msb, c_m_sub_hidden, c_m_added );
571  const uint16_t c_e_no_hidden             = _uint16_add( c_e_added,         c_e_hidden_offset  );
572  const uint16_t c_m_no_hidden_msb         = _uint16_and( c_m_no_hidden,     h_m_msb_mask       );
573  const uint16_t undenorm_m_msb_odd        = _uint16_srl( c_m_no_hidden_msb, h_m_msb_sa         );
574  const uint16_t undenorm_fix_e            = _uint16_and( is_undenorm,       undenorm_m_msb_odd );
575  const uint16_t c_e_fixed                 = _uint16_add( c_e_no_hidden,     undenorm_fix_e     );
576  const uint16_t c_m_round_amount          = _uint16_and( c_m_no_hidden,     h_grs_round_mask   );
577  const uint16_t c_m_rounded               = _uint16_add( c_m_no_hidden,     c_m_round_amount   );
578  const uint16_t c_m_round_overflow        = _uint16_andsrl( c_m_rounded, h_m_grs_carry, h_m_grs_carry_pos );
579  const uint16_t c_e_rounded               = _uint16_add( c_e_fixed, c_m_round_overflow );
580  const uint16_t c_m_no_grs                = _uint16_srlm( c_m_rounded, h_grs_size,  h_m_mask );
581  const uint16_t c_e                       = _uint16_sll( c_e_rounded, h_e_pos );
582  const uint16_t c_em                      = _uint16_or( c_e, c_m_no_grs );
583  const uint16_t c_normal                  = _uint16_or( c_s, c_em );
584  const uint16_t c_inf_result              = _uint16_sels( is_a_inf_msb, c_inf, c_normal );
585  const uint16_t c_zero_result             = _uint16_andc( c_inf_result, is_diff_exactly_zero );
586  const uint16_t c_result                  = _uint16_sels( is_invalid_inf_op_msb, h_snan, c_zero_result );
587
588  return (c_result);
589}
590

591// half_mul
592// --------
593//
594//  May have 0 or 1 ulp difference from the following result:
595//  (Round to nearest) 
596//  NOTE: Rounding mode differs between conversion and multiply
597//
598//     union FLOAT_32
599//     {
600//       float    f32;
601//       uint32_t u32;
602//     };
603//
604//     union FLOAT_32 fx = { .u32 = half_to_float( x ) };
605//     union FLOAT_32 fy = { .u32 = half_to_float( y ) };
606//     union FLOAT_32 fz = { .f32 = fx.f32 * fy.f32    };
607//     uint16_t       z  = float_to_half( fz );
608//
609uint16_t
610half_mul( uint16_t x, uint16_t y )
611{
612  const uint32_t one                                = _uint32_li( 0x00000001 );
613  const uint32_t h_s_mask                           = _uint32_li( 0x00008000 );
614  const uint32_t h_e_mask                           = _uint32_li( 0x00007c00 );
615  const uint32_t h_m_mask                           = _uint32_li( 0x000003ff );
616  const uint32_t h_m_hidden                         = _uint32_li( 0x00000400 );
617  const uint32_t h_e_pos                            = _uint32_li( 0x0000000a );
618  const uint32_t h_e_bias                           = _uint32_li( 0x0000000f );
619  const uint32_t h_m_bit_count                      = _uint32_li( 0x0000000a );
620  const uint32_t h_m_bit_half_count                 = _uint32_li( 0x00000005 );
621  const uint32_t h_nan_min                          = _uint32_li( 0x00007c01 );
622  const uint32_t h_e_mask_minus_one                 = _uint32_li( 0x00007bff );
623  const uint32_t h_snan                             = _uint32_li( 0x0000fe00 );
624  const uint32_t m_round_overflow_bit               = _uint32_li( 0x00000020 );
625  const uint32_t m_hidden_bit                       = _uint32_li( 0x00100000 );
626  const uint32_t a_s                                = _uint32_and(  x,   h_s_mask );
627  const uint32_t b_s                                = _uint32_and(  y,   h_s_mask );
628  const uint32_t c_s                                = _uint32_xor(  a_s, b_s      );
629  const uint32_t x_e                                = _uint32_and(  x,   h_e_mask );
630  const uint32_t x_e_eqz_msb                        = _uint32_dec(  x_e );
631  const uint32_t a                                  = _uint32_sels( x_e_eqz_msb, y, x );
632  const uint32_t b                                  = _uint32_sels( x_e_eqz_msb, x, y );
633  const uint32_t a_e                                = _uint32_and(  a,   h_e_mask );
634  const uint32_t b_e                                = _uint32_and(  b,   h_e_mask );
635  const uint32_t a_m                                = _uint32_and(  a,   h_m_mask );
636  const uint32_t b_m                                = _uint32_and(  b,   h_m_mask );
637  const uint32_t a_e_amount                         = _uint32_srl(  a_e,                 h_e_pos                 );
638  const uint32_t b_e_amount                         = _uint32_srl(  b_e,                 h_e_pos                 );
639  const uint32_t a_m_with_hidden                    = _uint32_or(   a_m,                 h_m_hidden              );
640  const uint32_t b_m_with_hidden                    = _uint32_or(   b_m,                 h_m_hidden              );
641  const uint32_t c_m_normal                         = _uint32_mul(  a_m_with_hidden,     b_m_with_hidden         );
642  const uint32_t c_m_denorm_biased                  = _uint32_mul(  a_m_with_hidden,     b_m                     );
643  const uint32_t c_e_denorm_unbias_e                = _uint32_sub(  h_e_bias,            a_e_amount              );
644  const uint32_t c_m_denorm_round_amount            = _uint32_and(  c_m_denorm_biased,   h_m_mask                );
645  const uint32_t c_m_denorm_rounded                 = _uint32_add(  c_m_denorm_biased,   c_m_denorm_round_amount );
646  const uint32_t c_m_denorm_inplace                 = _uint32_srl(  c_m_denorm_rounded,  h_m_bit_count           );
647  const uint32_t c_m_denorm_unbiased                = _uint32_srl(  c_m_denorm_inplace,  c_e_denorm_unbias_e     );
648  const uint32_t c_m_denorm                         = _uint32_and(  c_m_denorm_unbiased, h_m_mask                );
649  const uint32_t c_e_amount_biased                  = _uint32_add(  a_e_amount,          b_e_amount              );
650  const uint32_t c_e_amount_unbiased                = _uint32_sub(  c_e_amount_biased,   h_e_bias                );
651  const uint32_t is_c_e_unbiased_underflow          = _uint32_ext(  c_e_amount_unbiased );
652  const uint32_t c_e_underflow_half_sa              = _uint32_neg(  c_e_amount_unbiased );
653  const uint32_t c_e_underflow_sa                   = _uint32_sll(  c_e_underflow_half_sa,     one );
654  const uint32_t c_m_underflow                      = _uint32_srl(  c_m_normal,                c_e_underflow_sa );
655  const uint32_t c_e_underflow_added                = _uint32_andc( c_e_amount_unbiased,       is_c_e_unbiased_underflow );
656  const uint32_t c_m_underflow_added                = _uint32_selb( is_c_e_unbiased_underflow, c_m_underflow, c_m_normal );
657  const uint32_t is_mul_overflow_test               = _uint32_and(  c_e_underflow_added, m_round_overflow_bit );
658  const uint32_t is_mul_overflow_msb                = _uint32_neg(  is_mul_overflow_test );
659  const uint32_t c_e_norm_radix_corrected           = _uint32_inc(  c_e_underflow_added );
660  const uint32_t c_m_norm_radix_corrected           = _uint32_srl(  c_m_underflow_added, one );
661  const uint32_t c_m_norm_hidden_bit                = _uint32_and(  c_m_norm_radix_corrected,  m_hidden_bit );
662  const uint32_t is_c_m_norm_no_hidden_msb          = _uint32_dec(  c_m_norm_hidden_bit );
663  const uint32_t c_m_norm_lo                        = _uint32_srl(  c_m_norm_radix_corrected, h_m_bit_half_count );
664  const uint32_t c_m_norm_lo_nlz                    = _uint16_cntlz( c_m_norm_lo );
665  const uint32_t is_c_m_hidden_nunderflow_msb       = _uint32_sub(  c_m_norm_lo_nlz, c_e_norm_radix_corrected );
666  const uint32_t is_c_m_hidden_underflow_msb        = _uint32_not(  is_c_m_hidden_nunderflow_msb );
667  const uint32_t is_c_m_hidden_underflow            = _uint32_ext(  is_c_m_hidden_underflow_msb  );
668  const uint32_t c_m_hidden_underflow_normalized_sa = _uint32_srl(  c_m_norm_lo_nlz, one );
669  const uint32_t c_m_hidden_underflow_normalized    = _uint32_sll(  c_m_norm_radix_corrected, c_m_hidden_underflow_normalized_sa );
670  const uint32_t c_m_hidden_normalized              = _uint32_sll(  c_m_norm_radix_corrected, c_m_norm_lo_nlz );
671  const uint32_t c_e_hidden_normalized              = _uint32_sub(  c_e_norm_radix_corrected, c_m_norm_lo_nlz );
672  const uint32_t c_e_hidden                         = _uint32_andc( c_e_hidden_normalized, is_c_m_hidden_underflow );
673  const uint32_t c_m_hidden                         = _uint32_sels( is_c_m_hidden_underflow_msb, c_m_hidden_underflow_normalized, c_m_hidden_normalized );
674  const uint32_t c_m_normalized                     = _uint32_sels( is_c_m_norm_no_hidden_msb, c_m_hidden, c_m_norm_radix_corrected );
675  const uint32_t c_e_normalized                     = _uint32_sels( is_c_m_norm_no_hidden_msb, c_e_hidden, c_e_norm_radix_corrected );
676  const uint32_t c_m_norm_round_amount              = _uint32_and(  c_m_normalized, h_m_mask );
677  const uint32_t c_m_norm_rounded                   = _uint32_add(  c_m_normalized, c_m_norm_round_amount );
678  const uint32_t is_round_overflow_test             = _uint32_and(  c_e_normalized, m_round_overflow_bit  );
679  const uint32_t is_round_overflow_msb              = _uint32_neg(  is_round_overflow_test );
680  const uint32_t c_m_norm_inplace                   = _uint32_srl(  c_m_norm_rounded,    h_m_bit_count );
681  const uint32_t c_m                                = _uint32_and(  c_m_norm_inplace,    h_m_mask      );
682  const uint32_t c_e_norm_inplace                   = _uint32_sll(  c_e_normalized, h_e_pos       );
683  const uint32_t c_e                                = _uint32_and(  c_e_norm_inplace,    h_e_mask      );
684  const uint32_t c_em_nan                           = _uint32_or(   h_e_mask,  a_m        );
685  const uint32_t c_nan                              = _uint32_or(   a_s,       c_em_nan   );
686  const uint32_t c_denorm                           = _uint32_or(   c_s,       c_m_denorm );
687  const uint32_t c_inf                              = _uint32_or(   c_s,       h_e_mask   );
688  const uint32_t c_em_norm                          = _uint32_or(   c_e,       c_m        );
689  const uint32_t is_a_e_flagged_msb                 = _uint32_sub(  h_e_mask_minus_one, a_e );
690  const uint32_t is_b_e_flagged_msb                 = _uint32_sub(  h_e_mask_minus_one, b_e );
691  const uint32_t is_a_e_eqz_msb                     = _uint32_dec(  a_e );
692  const uint32_t is_a_m_eqz_msb                     = _uint32_dec(  a_m );
693  const uint32_t is_b_e_eqz_msb                     = _uint32_dec(  b_e );
694  const uint32_t is_b_m_eqz_msb                     = _uint32_dec(  b_m );
695  const uint32_t is_b_eqz_msb                       = _uint32_and(  is_b_e_eqz_msb,          is_b_m_eqz_msb         );
696  const uint32_t is_a_eqz_msb                       = _uint32_and(  is_a_e_eqz_msb,          is_a_m_eqz_msb         );
697  const uint32_t is_c_nan_via_a_msb                 = _uint32_andc( is_a_e_flagged_msb,      is_b_e_flagged_msb     );
698  const uint32_t is_c_nan_via_b_msb                 = _uint32_andc( is_b_e_flagged_msb,      is_b_m_eqz_msb         );
699  const uint32_t is_c_nan_msb                       = _uint32_or(   is_c_nan_via_a_msb,      is_c_nan_via_b_msb     );
700  const uint32_t is_c_denorm_msb                    = _uint32_andc( is_b_e_eqz_msb,          is_a_e_flagged_msb     );
701  const uint32_t is_a_inf_msb                       = _uint32_and(  is_a_e_flagged_msb,      is_a_m_eqz_msb         );
702  const uint32_t is_c_snan_msb                      = _uint32_and(  is_a_inf_msb,            is_b_eqz_msb           );
703  const uint32_t is_c_nan_min_via_a_msb             = _uint32_and(  is_a_e_flagged_msb,      is_b_eqz_msb           );
704  const uint32_t is_c_nan_min_via_b_msb             = _uint32_and(  is_b_e_flagged_msb,      is_a_eqz_msb           );
705  const uint32_t is_c_nan_min_msb                   = _uint32_or(   is_c_nan_min_via_a_msb,  is_c_nan_min_via_b_msb );
706  const uint32_t is_c_inf_msb                       = _uint32_or(   is_a_e_flagged_msb,      is_b_e_flagged_msb     );
707  const uint32_t is_overflow_msb                    = _uint32_or(   is_round_overflow_msb,   is_mul_overflow_msb    );
708  const uint32_t c_em_overflow_result               = _uint32_sels( is_overflow_msb, h_e_mask, c_em_norm );
709  const uint32_t c_common_result                    = _uint32_or(   c_s, c_em_overflow_result );
710  const uint32_t c_zero_result                      = _uint32_sels( is_b_eqz_msb,     c_s,       c_common_result  );
711  const uint32_t c_nan_result                       = _uint32_sels( is_c_nan_msb,     c_nan,     c_zero_result );
712  const uint32_t c_nan_min_result                   = _uint32_sels( is_c_nan_min_msb, h_nan_min, c_nan_result     );
713  const uint32_t c_inf_result                       = _uint32_sels( is_c_inf_msb,     c_inf,     c_nan_min_result   );
714  const uint32_t c_denorm_result                    = _uint32_sels( is_c_denorm_msb,  c_denorm,  c_inf_result);
715  const uint32_t c_result                           = _uint32_sels( is_c_snan_msb,    h_snan,    c_denorm_result );
716
717  return (uint16_t)(c_result);
718}

half.h

  0#ifndef HALF_H
  1#define HALF_H
  2
  3#include <stdint.h>
  4
  5uint32_t half_to_float( uint16_t h );
  6uint16_t half_from_float( uint32_t f );
  7uint16_t half_add( uint16_t arg0, uint16_t arg1 );
  8uint16_t half_mul( uint16_t arg0, uint16_t arg1 );
  9

 10static inline uint16_t 
 11half_sub( uint16_t ha, uint16_t hb ) 
 12{
 13  // (a-b) is the same as (a+(-b))
 14  return half_add( ha, hb ^ 0x8000 );
 15}
 16
 17#endif /* HALF_H */

Comment on this article