Update! (19 July 06) Added Multiply. Fixed a problem with using __builtin_clz().
Update! (17 July 06) The code has been considerably refactored. Decided to go with single function per expression. The expressions have been reduced as a first optimization pass.
Project
The goal of this project is serve as an example of developing some relatively complex operations completely without branches - a software implementation of half-precision floating point numbers (That does not use floating point hardware). This example should echo the IEEE 754 standard for floating point numbers as closely as reasonable, including support for +/- INF, QNan, SNan, and denormalized numbers. However, exceptions will not be implemented.Half-precision floats are used in cases where neither the range nor the precision of 32 bit floating point numbers are needed, but where some dynamic precision is required. Two common uses are for image transformation, where the range of each component (e.g. red, green, blue, alpha) is typically limited to or near [0.0,1.0] or vertex data (e.g. position, texture coordinates, color values, etc.).
The main advantage of half-precision floats is their size. Beyond the considerable potential for memory savings, processing a large number of half-precision values is more cache-friendly than using 32 bit values.
The current released version (including tests) can be downloaded here: half.tgz
half_to_float() Convert Half To Float (Scalar Version)
half_from_float() Convert Float to Half (Scalar Version)
half_add() Half Add (Scalar Version)
half_sub() Half Subtract (Scalar Version)
half_mul() Half Multiply (Scalar Version)
There is sometimes confusion on the best way to pluralize a half-precision floating-point number (half) in English. I asked Steven Pinker, a renowned expert in the English language and our mis-use of it, what he thought. Here is his reply:
"Dear Mike,
Well, in my line of work I should be asking what sounds right to people and then trying to explain that as a datum, rather than telling people what is right. But if it was me, I would probably say "halfs," not "halves," for the same reason as the team in Toronto is called the Maple Leafs, and as explained in the chapter "Of Mice and Men" in my book Words and Rules. Basically, a half-precision floating point datum is not a "half" of anything, even metaphorically speaking, so the irregular form doesn't get inherited by the neologism. See the chapter for more details.
Best,
Steve Pinker"
Well, in my line of work I should be asking what sounds right to people and then trying to explain that as a datum, rather than telling people what is right. But if it was me, I would probably say "halfs," not "halves," for the same reason as the team in Toronto is called the Maple Leafs, and as explained in the chapter "Of Mice and Men" in my book Words and Rules. Basically, a half-precision floating point datum is not a "half" of anything, even metaphorically speaking, so the irregular form doesn't get inherited by the neologism. See the chapter for more details.
Best,
Steve Pinker"
Note that all integer operations all encapulated in macros (implemented as static inline functions). This process both helps to enforce the rule of branch-free
coding and will make writing the SIMD version much easier.
half.c
0// Branch-free implementation of half-precision (16 bit) floating point 1// Copyright 2006 Mike Acton2// 3// Permission is hereby granted, free of charge, to any person obtaining a 4// copy of this software and associated documentation files (the "Software"), 5// to deal in the Software without restriction, including without limitation 6// the rights to use, copy, modify, merge, publish, distribute, sublicense, 7// and/or sell copies of the Software, and to permit persons to whom the 8// Software is furnished to do so, subject to the following conditions: 9// 10// The above copyright notice and this permission notice shall be included 11// in all copies or substantial portions of the Software. 12// 13// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19// THE SOFTWARE 20// 21// Half-precision floating point format 22// ------------------------------------ 23// 24// | Field | Last | First | Note 25// |----------|------|-------|---------- 26// | Sign | 15 | 15 | 27// | Exponent | 14 | 10 | Bias = 15 28// | Mantissa | 9 | 0 | 29// 30// Compiling 31// --------- 32// 33// Preferred compile flags for GCC: 34// -O3 -fstrict-aliasing -std=c99 -pedantic -Wall -Wstrict-aliasing 35// 36// This file is a C99 source file, intended to be compiled with a C99 37// compliant compiler. However, for the moment it remains combatible 38// with C++98. Therefore if you are using a compiler that poorly implements 39// C standards (e.g. MSVC), it may be compiled as C++. This is not 40// guaranteed for future versions. 41// 42 43#include "half.h" 44 45// Load immediate 46static inline uint32_t _uint32_li( uint32_t a ) 47{ 48 return (a); 49} 50 51// Decrement 52static inline uint32_t _uint32_dec( uint32_t a ) 53{ 54 return (a - 1); 55} 56 57// Increment 58static inline uint32_t _uint32_inc( uint32_t a ) 59{ 60 return (a + 1); 61} 62 63// Complement 64static inline uint32_t _uint32_not( uint32_t a ) 65{ 66 return (~a); 67} 68 69// Negate 70static inline uint32_t _uint32_neg( uint32_t a ) 71{ 72 return (-a); 73} 74 75// Extend sign 76static inline uint32_t _uint32_ext( uint32_t a ) 77{ 78 return (((int32_t)a)>>31); 79} 80 81// And 82static inline uint32_t _uint32_and( uint32_t a, uint32_t b ) 83{ 84 return (a & b); 85} 86 87// Exclusive Or 88static inline uint32_t _uint32_xor( uint32_t a, uint32_t b ) 89{ 90 return (a ^ b); 91} 92 93// And with Complement 94static inline uint32_t _uint32_andc( uint32_t a, uint32_t b ) 95{ 96 return (a & ~b); 97} 98 99// Or 100static inline uint32_t _uint32_or( uint32_t a, uint32_t b ) 101{ 102 return (a | b); 103} 104 105// Shift Right Logical 106static inline uint32_t _uint32_srl( uint32_t a, int sa ) 107{ 108 return (a >> sa); 109} 110 111// Shift Left Logical 112static inline uint32_t _uint32_sll( uint32_t a, int sa ) 113{ 114 return (a << sa); 115} 116 117// Add 118static inline uint32_t _uint32_add( uint32_t a, uint32_t b ) 119{ 120 return (a + b); 121} 122 123// Subtract 124static inline uint32_t _uint32_sub( uint32_t a, uint32_t b ) 125{ 126 return (a - b); 127} 128 129// Multiply 130static inline uint32_t _uint32_mul( uint32_t a, uint32_t b ) 131{ 132 return (a * b); 133} 134 135// Select on Sign bit 136static inline uint32_t _uint32_sels( uint32_t test, uint32_t a, uint32_t b ) 137{ 138 const uint32_t mask = _uint32_ext( test ); 139 const uint32_t sel_a = _uint32_and( a, mask ); 140 const uint32_t sel_b = _uint32_andc( b, mask ); 141 const uint32_t result = _uint32_or( sel_a, sel_b ); 142 143 return (result); 144} 145 146// Select Bits on mask 147static inline uint32_t _uint32_selb( uint32_t mask, uint32_t a, uint32_t b ) 148{ 149 const uint32_t sel_a = _uint32_and( a, mask ); 150 const uint32_t sel_b = _uint32_andc( b, mask ); 151 const uint32_t result = _uint32_or( sel_a, sel_b ); 152 153 return (result); 154} 155 156// Load Immediate 157static inline uint16_t _uint16_li( uint16_t a ) 158{ 159 return (a); 160} 161 162// Extend sign 163static inline uint16_t _uint16_ext( uint16_t a ) 164{ 165 return (((int16_t)a)>>15); 166} 167 168// Negate 169static inline uint16_t _uint16_neg( uint16_t a ) 170{ 171 return (-a); 172} 173 174// Complement 175static inline uint16_t _uint16_not( uint16_t a ) 176{ 177 return (~a); 178} 179 180// Decrement 181static inline uint16_t _uint16_dec( uint16_t a ) 182{ 183 return (a - 1); 184} 185 186// Shift Left Logical 187static inline uint16_t _uint16_sll( uint16_t a, int sa ) 188{ 189 return (a << sa); 190} 191 192// Shift Right Logical 193static inline uint16_t _uint16_srl( uint16_t a, int sa ) 194{ 195 return (a >> sa); 196} 197 198// Add 199static inline uint16_t _uint16_add( uint16_t a, uint16_t b ) 200{ 201 return (a + b); 202} 203 204// Subtract 205static inline uint16_t _uint16_sub( uint16_t a, uint16_t b ) 206{ 207 return (a - b); 208} 209 210// And 211static inline uint16_t _uint16_and( uint16_t a, uint16_t b ) 212{ 213 return (a & b); 214} 215 216// Or 217static inline uint16_t _uint16_or( uint16_t a, uint16_t b ) 218{ 219 return (a | b); 220} 221 222// Exclusive Or 223static inline uint16_t _uint16_xor( uint16_t a, uint16_t b ) 224{ 225 return (a ^ b); 226} 227 228// And with Complement 229static inline uint16_t _uint16_andc( uint16_t a, uint16_t b ) 230{ 231 return (a & ~b); 232} 233 234// And then Shift Right Logical 235static inline uint16_t _uint16_andsrl( uint16_t a, uint16_t b, int sa ) 236{ 237 return ((a & b) >> sa); 238} 239 240// Shift Right Logical then Mask 241static inline uint16_t _uint16_srlm( uint16_t a, int sa, uint16_t mask ) 242{ 243 return ((a >> sa) & mask); 244} 245 246// Add then Mask 247static inline uint16_t _uint16_addm( uint16_t a, uint16_t b, uint16_t mask ) 248{ 249 return ((a + b) & mask); 250} 251 252 253// Select on Sign bit 254static inline uint16_t _uint16_sels( uint16_t test, uint16_t a, uint16_t b ) 255{ 256 const uint16_t mask = _uint16_ext( test ); 257 const uint16_t sel_a = _uint16_and( a, mask ); 258 const uint16_t sel_b = _uint16_andc( b, mask ); 259 const uint16_t result = _uint16_or( sel_a, sel_b ); 260 261 return (result); 262} 263 264// Count Leading Zeros 265static inline uint32_t _uint32_cntlz( uint32_t x ) 266{ 267#ifdef __GNUC__ 268 /* NOTE: __builtin_clz is undefined for x == 0 */ 269 /* On PowerPC, this will map to insn: cntlzw */ 270 /* On Pentium, this will map to insn: clz */ 271 uint32_t is_x_nez_msb = _uint32_neg( x ); 272 uint32_t nlz = __builtin_clz( x ); 273 uint32_t result = _uint32_sels( is_x_nez_msb, nlz, 0x00000020 ); 274 return (result); 275#else 276 const uint32_t x0 = _uint32_srl( x, 1 ); 277 const uint32_t x1 = _uint32_or( x, x0 ); 278 const uint32_t x2 = _uint32_srl( x1, 2 ); 279 const uint32_t x3 = _uint32_or( x1, x2 ); 280 const uint32_t x4 = _uint32_srl( x3, 4 ); 281 const uint32_t x5 = _uint32_or( x3, x4 ); 282 const uint32_t x6 = _uint32_srl( x5, 8 ); 283 const uint32_t x7 = _uint32_or( x5, x6 ); 284 const uint32_t x8 = _uint32_srl( x7, 16 ); 285 const uint32_t x9 = _uint32_or( x7, x8 ); 286 const uint32_t xA = _uint32_not( x9 ); 287 const uint32_t xB = _uint32_srl( xA, 1 ); 288 const uint32_t xC = _uint32_and( xB, 0x55555555 ); 289 const uint32_t xD = _uint32_sub( xA, xC ); 290 const uint32_t xE = _uint32_and( xD, 0x33333333 ); 291 const uint32_t xF = _uint32_srl( xD, 2 ); 292 const uint32_t x10 = _uint32_and( xF, 0x33333333 ); 293 const uint32_t x11 = _uint32_add( xE, x10 ); 294 const uint32_t x12 = _uint32_srl( x11, 4 ); 295 const uint32_t x13 = _uint32_add( x11, x12 ); 296 const uint32_t x14 = _uint32_and( x13, 0x0f0f0f0f ); 297 const uint32_t x15 = _uint32_srl( x14, 8 ); 298 const uint32_t x16 = _uint32_add( x14, x15 ); 299 const uint32_t x17 = _uint32_srl( x16, 16 ); 300 const uint32_t x18 = _uint32_add( x16, x17 ); 301 const uint32_t x19 = _uint32_and( x18, 0x0000003f ); 302 return ( x19 ); 303#endif 304} 305 306// Count Leading Zeros 307static inline uint16_t _uint16_cntlz( uint16_t x ) 308{ 309#ifdef __GNUC__ 310 uint16_t nlz32 = (uint16_t)_uint32_cntlz( (uint32_t)x ); 311 uint32_t nlz = _uint32_sub( nlz32, 16 ); 312 return (nlz); 313#else 314 const uint16_t x0 = _uint16_srl( x, 1 ); 315 const uint16_t x1 = _uint16_or( x, x0 ); 316 const uint16_t x2 = _uint16_srl( x1, 2 ); 317 const uint16_t x3 = _uint16_or( x1, x2 ); 318 const uint16_t x4 = _uint16_srl( x3, 4 ); 319 const uint16_t x5 = _uint16_or( x3, x4 ); 320 const uint16_t x6 = _uint16_srl( x5, 8 ); 321 const uint16_t x7 = _uint16_or( x5, x6 ); 322 const uint16_t x8 = _uint16_not( x7 ); 323 const uint16_t x9 = _uint16_srlm( x8, 1, 0x5555 ); 324 const uint16_t xA = _uint16_sub( x8, x9 ); 325 const uint16_t xB = _uint16_and( xA, 0x3333 ); 326 const uint16_t xC = _uint16_srlm( xA, 2, 0x3333 ); 327 const uint16_t xD = _uint16_add( xB, xC ); 328 const uint16_t xE = _uint16_srl( xD, 4 ); 329 const uint16_t xF = _uint16_addm( xD, xE, 0x0f0f ); 330 const uint16_t x10 = _uint16_srl( xF, 8 ); 331 const uint16_t x11 = _uint16_addm( xF, x10, 0x001f ); 332 return ( x11 ); 333#endif 334} 335
336uint16_t 337half_from_float( uint32_t f ) 338{ 339 const uint32_t one = _uint32_li( 0x00000001 ); 340 const uint32_t f_s_mask = _uint32_li( 0x80000000 ); 341 const uint32_t f_e_mask = _uint32_li( 0x7f800000 ); 342 const uint32_t f_m_mask = _uint32_li( 0x007fffff ); 343 const uint32_t f_m_hidden_bit = _uint32_li( 0x00800000 ); 344 const uint32_t f_m_round_bit = _uint32_li( 0x00001000 ); 345 const uint32_t f_snan_mask = _uint32_li( 0x7fc00000 ); 346 const uint32_t f_e_pos = _uint32_li( 0x00000017 ); 347 const uint32_t h_e_pos = _uint32_li( 0x0000000a ); 348 const uint32_t h_e_mask = _uint32_li( 0x00007c00 ); 349 const uint32_t h_snan_mask = _uint32_li( 0x00007e00 ); 350 const uint32_t h_e_mask_value = _uint32_li( 0x0000001f ); 351 const uint32_t f_h_s_pos_offset = _uint32_li( 0x00000010 ); 352 const uint32_t f_h_bias_offset = _uint32_li( 0x00000070 ); 353 const uint32_t f_h_m_pos_offset = _uint32_li( 0x0000000d ); 354 const uint32_t h_nan_min = _uint32_li( 0x00007c01 ); 355 const uint32_t f_h_e_biased_flag = _uint32_li( 0x0000008f ); 356 const uint32_t f_s = _uint32_and( f, f_s_mask ); 357 const uint32_t f_e = _uint32_and( f, f_e_mask ); 358 const uint16_t h_s = _uint32_srl( f_s, f_h_s_pos_offset ); 359 const uint32_t f_m = _uint32_and( f, f_m_mask ); 360 const uint16_t f_e_amount = _uint32_srl( f_e, f_e_pos ); 361 const uint32_t f_e_half_bias = _uint32_sub( f_e_amount, f_h_bias_offset ); 362 const uint32_t f_snan = _uint32_and( f, f_snan_mask ); 363 const uint32_t f_m_round_mask = _uint32_and( f_m, f_m_round_bit ); 364 const uint32_t f_m_round_offset = _uint32_sll( f_m_round_mask, one ); 365 const uint32_t f_m_rounded = _uint32_add( f_m, f_m_round_offset ); 366 const uint32_t f_m_denorm_sa = _uint32_sub( one, f_e_half_bias ); 367 const uint32_t f_m_with_hidden = _uint32_or( f_m_rounded, f_m_hidden_bit ); 368 const uint32_t f_m_denorm = _uint32_srl( f_m_with_hidden, f_m_denorm_sa ); 369 const uint32_t h_m_denorm = _uint32_srl( f_m_denorm, f_h_m_pos_offset ); 370 const uint32_t f_m_rounded_overflow = _uint32_and( f_m_rounded, f_m_hidden_bit ); 371 const uint32_t m_nan = _uint32_srl( f_m, f_h_m_pos_offset ); 372 const uint32_t h_em_nan = _uint32_or( h_e_mask, m_nan ); 373 const uint32_t h_e_norm_overflow_offset = _uint32_inc( f_e_half_bias ); 374 const uint32_t h_e_norm_overflow = _uint32_sll( h_e_norm_overflow_offset, h_e_pos ); 375 const uint32_t h_e_norm = _uint32_sll( f_e_half_bias, h_e_pos ); 376 const uint32_t h_m_norm = _uint32_srl( f_m_rounded, f_h_m_pos_offset ); 377 const uint32_t h_em_norm = _uint32_or( h_e_norm, h_m_norm ); 378 const uint32_t is_h_ndenorm_msb = _uint32_sub( f_h_bias_offset, f_e_amount ); 379 const uint32_t is_f_e_flagged_msb = _uint32_sub( f_h_e_biased_flag, f_e_half_bias ); 380 const uint32_t is_h_denorm_msb = _uint32_not( is_h_ndenorm_msb ); 381 const uint32_t is_f_m_eqz_msb = _uint32_dec( f_m ); 382 const uint32_t is_h_nan_eqz_msb = _uint32_dec( m_nan ); 383 const uint32_t is_f_inf_msb = _uint32_and( is_f_e_flagged_msb, is_f_m_eqz_msb ); 384 const uint32_t is_f_nan_underflow_msb = _uint32_and( is_f_e_flagged_msb, is_h_nan_eqz_msb ); 385 const uint32_t is_e_overflow_msb = _uint32_sub( h_e_mask_value, f_e_half_bias ); 386 const uint32_t is_h_inf_msb = _uint32_or( is_e_overflow_msb, is_f_inf_msb ); 387 const uint32_t is_f_nsnan_msb = _uint32_sub( f_snan, f_snan_mask ); 388 const uint32_t is_m_norm_overflow_msb = _uint32_neg( f_m_rounded_overflow ); 389 const uint32_t is_f_snan_msb = _uint32_not( is_f_nsnan_msb ); 390 const uint32_t h_em_overflow_result = _uint32_sels( is_m_norm_overflow_msb, h_e_norm_overflow, h_em_norm ); 391 const uint32_t h_em_nan_result = _uint32_sels( is_f_e_flagged_msb, h_em_nan, h_em_overflow_result ); 392 const uint32_t h_em_nan_underflow_result = _uint32_sels( is_f_nan_underflow_msb, h_nan_min, h_em_nan_result ); 393 const uint32_t h_em_inf_result = _uint32_sels( is_h_inf_msb, h_e_mask, h_em_nan_underflow_result ); 394 const uint32_t h_em_denorm_result = _uint32_sels( is_h_denorm_msb, h_m_denorm, h_em_inf_result ); 395 const uint32_t h_em_snan_result = _uint32_sels( is_f_snan_msb, h_snan_mask, h_em_denorm_result ); 396 const uint32_t h_result = _uint32_or( h_s, h_em_snan_result ); 397 398 return (uint16_t)(h_result); 399} 400
401uint32_t 402half_to_float( uint16_t h ) 403{ 404 const uint32_t h_e_mask = _uint32_li( 0x00007c00 ); 405 const uint32_t h_m_mask = _uint32_li( 0x000003ff ); 406 const uint32_t h_s_mask = _uint32_li( 0x00008000 ); 407 const uint32_t h_f_s_pos_offset = _uint32_li( 0x00000010 ); 408 const uint32_t h_f_e_pos_offset = _uint32_li( 0x0000000d ); 409 const uint32_t h_f_bias_offset = _uint32_li( 0x0001c000 ); 410 const uint32_t f_e_mask = _uint32_li( 0x7f800000 ); 411 const uint32_t f_m_mask = _uint32_li( 0x007fffff ); 412 const uint32_t h_f_e_denorm_bias = _uint32_li( 0x0000007e ); 413 const uint32_t h_f_m_denorm_sa_bias = _uint32_li( 0x00000008 ); 414 const uint32_t f_e_pos = _uint32_li( 0x00000017 ); 415 const uint32_t h_e_mask_minus_one = _uint32_li( 0x00007bff ); 416 const uint32_t h_e = _uint32_and( h, h_e_mask ); 417 const uint32_t h_m = _uint32_and( h, h_m_mask ); 418 const uint32_t h_s = _uint32_and( h, h_s_mask ); 419 const uint32_t h_e_f_bias = _uint32_add( h_e, h_f_bias_offset ); 420 const uint32_t h_m_nlz = _uint32_cntlz( h_m ); 421 const uint32_t f_s = _uint32_sll( h_s, h_f_s_pos_offset ); 422 const uint32_t f_e = _uint32_sll( h_e_f_bias, h_f_e_pos_offset ); 423 const uint32_t f_m = _uint32_sll( h_m, h_f_e_pos_offset ); 424 const uint32_t f_em = _uint32_or( f_e, f_m ); 425 const uint32_t h_f_m_sa = _uint32_sub( h_m_nlz, h_f_m_denorm_sa_bias ); 426 const uint32_t f_e_denorm_unpacked = _uint32_sub( h_f_e_denorm_bias, h_f_m_sa ); 427 const uint32_t h_f_m = _uint32_sll( h_m, h_f_m_sa ); 428 const uint32_t f_m_denorm = _uint32_and( h_f_m, f_m_mask ); 429 const uint32_t f_e_denorm = _uint32_sll( f_e_denorm_unpacked, f_e_pos ); 430 const uint32_t f_em_denorm = _uint32_or( f_e_denorm, f_m_denorm ); 431 const uint32_t f_em_nan = _uint32_or( f_e_mask, f_m ); 432 const uint32_t is_e_eqz_msb = _uint32_dec( h_e ); 433 const uint32_t is_m_nez_msb = _uint32_neg( h_m ); 434 const uint32_t is_e_flagged_msb = _uint32_sub( h_e_mask_minus_one, h_e ); 435 const uint32_t is_zero_msb = _uint32_andc( is_e_eqz_msb, is_m_nez_msb ); 436 const uint32_t is_inf_msb = _uint32_andc( is_e_flagged_msb, is_m_nez_msb ); 437 const uint32_t is_denorm_msb = _uint32_and( is_m_nez_msb, is_e_eqz_msb ); 438 const uint32_t is_nan_msb = _uint32_and( is_e_flagged_msb, is_m_nez_msb ); 439 const uint32_t is_zero = _uint32_ext( is_zero_msb ); 440 const uint32_t f_zero_result = _uint32_andc( f_em, is_zero ); 441 const uint32_t f_denorm_result = _uint32_sels( is_denorm_msb, f_em_denorm, f_zero_result ); 442 const uint32_t f_inf_result = _uint32_sels( is_inf_msb, f_e_mask, f_denorm_result ); 443 const uint32_t f_nan_result = _uint32_sels( is_nan_msb, f_em_nan, f_inf_result ); 444 const uint32_t f_result = _uint32_or( f_s, f_nan_result ); 445 446 return (f_result); 447} 448
449// half_add 450// -------- 451// 452// (SUM) uint16_t z = half_add( x, y ); 453// (DIFFERENCE) uint16_t z = half_add( x, -y ); 454// 455// * Difference of ZEROs is always +ZERO 456// * Sum round with guard + round + sticky bit (grs) 457// * QNaN + [x] = QNaN 458// * [x] + +INF = +INF 459// * [x] - -INF = -INF 460// * INF - INF = SNaN 461// 462// Will have exactly (0 ulps difference) the same result as: 463// (Round up) 464// 465// union FLOAT_32 466// { 467// float f32; 468// uint32_t u32; 469// }; 470// 471// union FLOAT_32 fx = { .u32 = half_to_float( x ) }; 472// union FLOAT_32 fy = { .u32 = half_to_float( y ) }; 473// union FLOAT_32 fz = { .f32 = fx.f32 + fy.f32 }; 474// uint16_t z = float_to_half( fz ); 475// 476uint16_t 477half_add( uint16_t x, uint16_t y ) 478{ 479 const uint16_t one = _uint16_li( 0x0001 ); 480 const uint16_t msb_to_lsb_sa = _uint16_li( 0x000f ); 481 const uint16_t h_s_mask = _uint16_li( 0x8000 ); 482 const uint16_t h_e_mask = _uint16_li( 0x7c00 ); 483 const uint16_t h_m_mask = _uint16_li( 0x03ff ); 484 const uint16_t h_m_msb_mask = _uint16_li( 0x2000 ); 485 const uint16_t h_m_msb_sa = _uint16_li( 0x000d ); 486 const uint16_t h_m_hidden = _uint16_li( 0x0400 ); 487 const uint16_t h_e_pos = _uint16_li( 0x000a ); 488 const uint16_t h_e_bias_minus_one = _uint16_li( 0x000e ); 489 const uint16_t h_m_grs_carry = _uint16_li( 0x4000 ); 490 const uint16_t h_m_grs_carry_pos = _uint16_li( 0x000e ); 491 const uint16_t h_grs_size = _uint16_li( 0x0003 ); 492 const uint16_t h_snan = _uint16_li( 0xfe00 ); 493 const uint16_t h_e_mask_minus_one = _uint16_li( 0x7bff ); 494 const uint16_t h_grs_round_carry = _uint16_sll( one, h_grs_size ); 495 const uint16_t h_grs_round_mask = _uint16_sub( h_grs_round_carry, one ); 496 const uint16_t x_e = _uint16_and( x, h_e_mask ); 497 const uint16_t y_e = _uint16_and( y, h_e_mask ); 498 const uint16_t is_y_e_larger_msb = _uint16_sub( x_e, y_e ); 499 const uint16_t a = _uint16_sels( is_y_e_larger_msb, y, x); 500 const uint16_t a_s = _uint16_and( a, h_s_mask ); 501 const uint16_t a_e = _uint16_and( a, h_e_mask ); 502 const uint16_t a_m_no_hidden_bit = _uint16_and( a, h_m_mask ); 503 const uint16_t a_em_no_hidden_bit = _uint16_or( a_e, a_m_no_hidden_bit ); 504 const uint16_t b = _uint16_sels( is_y_e_larger_msb, x, y); 505 const uint16_t b_s = _uint16_and( b, h_s_mask ); 506 const uint16_t b_e = _uint16_and( b, h_e_mask ); 507 const uint16_t b_m_no_hidden_bit = _uint16_and( b, h_m_mask ); 508 const uint16_t b_em_no_hidden_bit = _uint16_or( b_e, b_m_no_hidden_bit ); 509 const uint16_t is_diff_sign_msb = _uint16_xor( a_s, b_s ); 510 const uint16_t is_a_inf_msb = _uint16_sub( h_e_mask_minus_one, a_em_no_hidden_bit ); 511 const uint16_t is_b_inf_msb = _uint16_sub( h_e_mask_minus_one, b_em_no_hidden_bit ); 512 const uint16_t is_undenorm_msb = _uint16_dec( a_e ); 513 const uint16_t is_undenorm = _uint16_ext( is_undenorm_msb ); 514 const uint16_t is_both_inf_msb = _uint16_and( is_a_inf_msb, is_b_inf_msb ); 515 const uint16_t is_invalid_inf_op_msb = _uint16_and( is_both_inf_msb, b_s ); 516 const uint16_t is_a_e_nez_msb = _uint16_neg( a_e ); 517 const uint16_t is_b_e_nez_msb = _uint16_neg( b_e ); 518 const uint16_t is_a_e_nez = _uint16_ext( is_a_e_nez_msb ); 519 const uint16_t is_b_e_nez = _uint16_ext( is_b_e_nez_msb ); 520 const uint16_t a_m_hidden_bit = _uint16_and( is_a_e_nez, h_m_hidden ); 521 const uint16_t b_m_hidden_bit = _uint16_and( is_b_e_nez, h_m_hidden ); 522 const uint16_t a_m_no_grs = _uint16_or( a_m_no_hidden_bit, a_m_hidden_bit ); 523 const uint16_t b_m_no_grs = _uint16_or( b_m_no_hidden_bit, b_m_hidden_bit ); 524 const uint16_t diff_e = _uint16_sub( a_e, b_e ); 525 const uint16_t a_e_unbias = _uint16_sub( a_e, h_e_bias_minus_one ); 526 const uint16_t a_m = _uint16_sll( a_m_no_grs, h_grs_size ); 527 const uint16_t a_e_biased = _uint16_srl( a_e, h_e_pos ); 528 const uint16_t m_sa_unbias = _uint16_srl( a_e_unbias, h_e_pos ); 529 const uint16_t m_sa_default = _uint16_srl( diff_e, h_e_pos ); 530 const uint16_t m_sa_unbias_mask = _uint16_andc( is_a_e_nez_msb, is_b_e_nez_msb ); 531 const uint16_t m_sa = _uint16_sels( m_sa_unbias_mask, m_sa_unbias, m_sa_default ); 532 const uint16_t b_m_no_sticky = _uint16_sll( b_m_no_grs, h_grs_size ); 533 const uint16_t sh_m = _uint16_srl( b_m_no_sticky, m_sa ); 534 const uint16_t sticky_overflow = _uint16_sll( one, m_sa ); 535 const uint16_t sticky_mask = _uint16_dec( sticky_overflow ); 536 const uint16_t sticky_collect = _uint16_and( b_m_no_sticky, sticky_mask ); 537 const uint16_t is_sticky_set_msb = _uint16_neg( sticky_collect ); 538 const uint16_t sticky = _uint16_srl( is_sticky_set_msb, msb_to_lsb_sa); 539 const uint16_t b_m = _uint16_or( sh_m, sticky ); 540 const uint16_t is_c_m_ab_pos_msb = _uint16_sub( b_m, a_m ); 541 const uint16_t c_inf = _uint16_or( a_s, h_e_mask ); 542 const uint16_t c_m_sum = _uint16_add( a_m, b_m ); 543 const uint16_t c_m_diff_ab = _uint16_sub( a_m, b_m ); 544 const uint16_t c_m_diff_ba = _uint16_sub( b_m, a_m ); 545 const uint16_t c_m_smag_diff = _uint16_sels( is_c_m_ab_pos_msb, c_m_diff_ab, c_m_diff_ba ); 546 const uint16_t c_s_diff = _uint16_sels( is_c_m_ab_pos_msb, a_s, b_s ); 547 const uint16_t c_s = _uint16_sels( is_diff_sign_msb, c_s_diff, a_s ); 548 const uint16_t c_m_smag_diff_nlz = _uint16_cntlz( c_m_smag_diff ); 549 const uint16_t diff_norm_sa = _uint16_sub( c_m_smag_diff_nlz, one ); 550 const uint16_t is_diff_denorm_msb = _uint16_sub( a_e_biased, diff_norm_sa ); 551 const uint16_t is_diff_denorm = _uint16_ext( is_diff_denorm_msb ); 552 const uint16_t is_a_or_b_norm_msb = _uint16_neg( a_e_biased ); 553 const uint16_t diff_denorm_sa = _uint16_dec( a_e_biased ); 554 const uint16_t c_m_diff_denorm = _uint16_sll( c_m_smag_diff, diff_denorm_sa ); 555 const uint16_t c_m_diff_norm = _uint16_sll( c_m_smag_diff, diff_norm_sa ); 556 const uint16_t c_e_diff_norm = _uint16_sub( a_e_biased, diff_norm_sa ); 557 const uint16_t c_m_diff_ab_norm = _uint16_sels( is_diff_denorm_msb, c_m_diff_denorm, c_m_diff_norm ); 558 const uint16_t c_e_diff_ab_norm = _uint16_andc( c_e_diff_norm, is_diff_denorm ); 559 const uint16_t c_m_diff = _uint16_sels( is_a_or_b_norm_msb, c_m_diff_ab_norm, c_m_smag_diff ); 560 const uint16_t c_e_diff = _uint16_sels( is_a_or_b_norm_msb, c_e_diff_ab_norm, a_e_biased ); 561 const uint16_t is_diff_eqz_msb = _uint16_dec( c_m_diff ); 562 const uint16_t is_diff_exactly_zero_msb = _uint16_and( is_diff_sign_msb, is_diff_eqz_msb ); 563 const uint16_t is_diff_exactly_zero = _uint16_ext( is_diff_exactly_zero_msb ); 564 const uint16_t c_m_added = _uint16_sels( is_diff_sign_msb, c_m_diff, c_m_sum ); 565 const uint16_t c_e_added = _uint16_sels( is_diff_sign_msb, c_e_diff, a_e_biased ); 566 const uint16_t c_m_carry = _uint16_and( c_m_added, h_m_grs_carry ); 567 const uint16_t is_c_m_carry_msb = _uint16_neg( c_m_carry ); 568 const uint16_t c_e_hidden_offset = _uint16_andsrl( c_m_added, h_m_grs_carry, h_m_grs_carry_pos ); 569 const uint16_t c_m_sub_hidden = _uint16_srl( c_m_added, one ); 570 const uint16_t c_m_no_hidden = _uint16_sels( is_c_m_carry_msb, c_m_sub_hidden, c_m_added ); 571 const uint16_t c_e_no_hidden = _uint16_add( c_e_added, c_e_hidden_offset ); 572 const uint16_t c_m_no_hidden_msb = _uint16_and( c_m_no_hidden, h_m_msb_mask ); 573 const uint16_t undenorm_m_msb_odd = _uint16_srl( c_m_no_hidden_msb, h_m_msb_sa ); 574 const uint16_t undenorm_fix_e = _uint16_and( is_undenorm, undenorm_m_msb_odd ); 575 const uint16_t c_e_fixed = _uint16_add( c_e_no_hidden, undenorm_fix_e ); 576 const uint16_t c_m_round_amount = _uint16_and( c_m_no_hidden, h_grs_round_mask ); 577 const uint16_t c_m_rounded = _uint16_add( c_m_no_hidden, c_m_round_amount ); 578 const uint16_t c_m_round_overflow = _uint16_andsrl( c_m_rounded, h_m_grs_carry, h_m_grs_carry_pos ); 579 const uint16_t c_e_rounded = _uint16_add( c_e_fixed, c_m_round_overflow ); 580 const uint16_t c_m_no_grs = _uint16_srlm( c_m_rounded, h_grs_size, h_m_mask ); 581 const uint16_t c_e = _uint16_sll( c_e_rounded, h_e_pos ); 582 const uint16_t c_em = _uint16_or( c_e, c_m_no_grs ); 583 const uint16_t c_normal = _uint16_or( c_s, c_em ); 584 const uint16_t c_inf_result = _uint16_sels( is_a_inf_msb, c_inf, c_normal ); 585 const uint16_t c_zero_result = _uint16_andc( c_inf_result, is_diff_exactly_zero ); 586 const uint16_t c_result = _uint16_sels( is_invalid_inf_op_msb, h_snan, c_zero_result ); 587 588 return (c_result); 589} 590
591// half_mul 592// -------- 593// 594// May have 0 or 1 ulp difference from the following result: 595// (Round to nearest) 596// NOTE: Rounding mode differs between conversion and multiply 597// 598// union FLOAT_32 599// { 600// float f32; 601// uint32_t u32; 602// }; 603// 604// union FLOAT_32 fx = { .u32 = half_to_float( x ) }; 605// union FLOAT_32 fy = { .u32 = half_to_float( y ) }; 606// union FLOAT_32 fz = { .f32 = fx.f32 * fy.f32 }; 607// uint16_t z = float_to_half( fz ); 608// 609uint16_t 610half_mul( uint16_t x, uint16_t y ) 611{ 612 const uint32_t one = _uint32_li( 0x00000001 ); 613 const uint32_t h_s_mask = _uint32_li( 0x00008000 ); 614 const uint32_t h_e_mask = _uint32_li( 0x00007c00 ); 615 const uint32_t h_m_mask = _uint32_li( 0x000003ff ); 616 const uint32_t h_m_hidden = _uint32_li( 0x00000400 ); 617 const uint32_t h_e_pos = _uint32_li( 0x0000000a ); 618 const uint32_t h_e_bias = _uint32_li( 0x0000000f ); 619 const uint32_t h_m_bit_count = _uint32_li( 0x0000000a ); 620 const uint32_t h_m_bit_half_count = _uint32_li( 0x00000005 ); 621 const uint32_t h_nan_min = _uint32_li( 0x00007c01 ); 622 const uint32_t h_e_mask_minus_one = _uint32_li( 0x00007bff ); 623 const uint32_t h_snan = _uint32_li( 0x0000fe00 ); 624 const uint32_t m_round_overflow_bit = _uint32_li( 0x00000020 ); 625 const uint32_t m_hidden_bit = _uint32_li( 0x00100000 ); 626 const uint32_t a_s = _uint32_and( x, h_s_mask ); 627 const uint32_t b_s = _uint32_and( y, h_s_mask ); 628 const uint32_t c_s = _uint32_xor( a_s, b_s ); 629 const uint32_t x_e = _uint32_and( x, h_e_mask ); 630 const uint32_t x_e_eqz_msb = _uint32_dec( x_e ); 631 const uint32_t a = _uint32_sels( x_e_eqz_msb, y, x ); 632 const uint32_t b = _uint32_sels( x_e_eqz_msb, x, y ); 633 const uint32_t a_e = _uint32_and( a, h_e_mask ); 634 const uint32_t b_e = _uint32_and( b, h_e_mask ); 635 const uint32_t a_m = _uint32_and( a, h_m_mask ); 636 const uint32_t b_m = _uint32_and( b, h_m_mask ); 637 const uint32_t a_e_amount = _uint32_srl( a_e, h_e_pos ); 638 const uint32_t b_e_amount = _uint32_srl( b_e, h_e_pos ); 639 const uint32_t a_m_with_hidden = _uint32_or( a_m, h_m_hidden ); 640 const uint32_t b_m_with_hidden = _uint32_or( b_m, h_m_hidden ); 641 const uint32_t c_m_normal = _uint32_mul( a_m_with_hidden, b_m_with_hidden ); 642 const uint32_t c_m_denorm_biased = _uint32_mul( a_m_with_hidden, b_m ); 643 const uint32_t c_e_denorm_unbias_e = _uint32_sub( h_e_bias, a_e_amount ); 644 const uint32_t c_m_denorm_round_amount = _uint32_and( c_m_denorm_biased, h_m_mask ); 645 const uint32_t c_m_denorm_rounded = _uint32_add( c_m_denorm_biased, c_m_denorm_round_amount ); 646 const uint32_t c_m_denorm_inplace = _uint32_srl( c_m_denorm_rounded, h_m_bit_count ); 647 const uint32_t c_m_denorm_unbiased = _uint32_srl( c_m_denorm_inplace, c_e_denorm_unbias_e ); 648 const uint32_t c_m_denorm = _uint32_and( c_m_denorm_unbiased, h_m_mask ); 649 const uint32_t c_e_amount_biased = _uint32_add( a_e_amount, b_e_amount ); 650 const uint32_t c_e_amount_unbiased = _uint32_sub( c_e_amount_biased, h_e_bias ); 651 const uint32_t is_c_e_unbiased_underflow = _uint32_ext( c_e_amount_unbiased ); 652 const uint32_t c_e_underflow_half_sa = _uint32_neg( c_e_amount_unbiased ); 653 const uint32_t c_e_underflow_sa = _uint32_sll( c_e_underflow_half_sa, one ); 654 const uint32_t c_m_underflow = _uint32_srl( c_m_normal, c_e_underflow_sa ); 655 const uint32_t c_e_underflow_added = _uint32_andc( c_e_amount_unbiased, is_c_e_unbiased_underflow ); 656 const uint32_t c_m_underflow_added = _uint32_selb( is_c_e_unbiased_underflow, c_m_underflow, c_m_normal ); 657 const uint32_t is_mul_overflow_test = _uint32_and( c_e_underflow_added, m_round_overflow_bit ); 658 const uint32_t is_mul_overflow_msb = _uint32_neg( is_mul_overflow_test ); 659 const uint32_t c_e_norm_radix_corrected = _uint32_inc( c_e_underflow_added ); 660 const uint32_t c_m_norm_radix_corrected = _uint32_srl( c_m_underflow_added, one ); 661 const uint32_t c_m_norm_hidden_bit = _uint32_and( c_m_norm_radix_corrected, m_hidden_bit ); 662 const uint32_t is_c_m_norm_no_hidden_msb = _uint32_dec( c_m_norm_hidden_bit ); 663 const uint32_t c_m_norm_lo = _uint32_srl( c_m_norm_radix_corrected, h_m_bit_half_count ); 664 const uint32_t c_m_norm_lo_nlz = _uint16_cntlz( c_m_norm_lo ); 665 const uint32_t is_c_m_hidden_nunderflow_msb = _uint32_sub( c_m_norm_lo_nlz, c_e_norm_radix_corrected ); 666 const uint32_t is_c_m_hidden_underflow_msb = _uint32_not( is_c_m_hidden_nunderflow_msb ); 667 const uint32_t is_c_m_hidden_underflow = _uint32_ext( is_c_m_hidden_underflow_msb ); 668 const uint32_t c_m_hidden_underflow_normalized_sa = _uint32_srl( c_m_norm_lo_nlz, one ); 669 const uint32_t c_m_hidden_underflow_normalized = _uint32_sll( c_m_norm_radix_corrected, c_m_hidden_underflow_normalized_sa ); 670 const uint32_t c_m_hidden_normalized = _uint32_sll( c_m_norm_radix_corrected, c_m_norm_lo_nlz ); 671 const uint32_t c_e_hidden_normalized = _uint32_sub( c_e_norm_radix_corrected, c_m_norm_lo_nlz ); 672 const uint32_t c_e_hidden = _uint32_andc( c_e_hidden_normalized, is_c_m_hidden_underflow ); 673 const uint32_t c_m_hidden = _uint32_sels( is_c_m_hidden_underflow_msb, c_m_hidden_underflow_normalized, c_m_hidden_normalized ); 674 const uint32_t c_m_normalized = _uint32_sels( is_c_m_norm_no_hidden_msb, c_m_hidden, c_m_norm_radix_corrected ); 675 const uint32_t c_e_normalized = _uint32_sels( is_c_m_norm_no_hidden_msb, c_e_hidden, c_e_norm_radix_corrected ); 676 const uint32_t c_m_norm_round_amount = _uint32_and( c_m_normalized, h_m_mask ); 677 const uint32_t c_m_norm_rounded = _uint32_add( c_m_normalized, c_m_norm_round_amount ); 678 const uint32_t is_round_overflow_test = _uint32_and( c_e_normalized, m_round_overflow_bit ); 679 const uint32_t is_round_overflow_msb = _uint32_neg( is_round_overflow_test ); 680 const uint32_t c_m_norm_inplace = _uint32_srl( c_m_norm_rounded, h_m_bit_count ); 681 const uint32_t c_m = _uint32_and( c_m_norm_inplace, h_m_mask ); 682 const uint32_t c_e_norm_inplace = _uint32_sll( c_e_normalized, h_e_pos ); 683 const uint32_t c_e = _uint32_and( c_e_norm_inplace, h_e_mask ); 684 const uint32_t c_em_nan = _uint32_or( h_e_mask, a_m ); 685 const uint32_t c_nan = _uint32_or( a_s, c_em_nan ); 686 const uint32_t c_denorm = _uint32_or( c_s, c_m_denorm ); 687 const uint32_t c_inf = _uint32_or( c_s, h_e_mask ); 688 const uint32_t c_em_norm = _uint32_or( c_e, c_m ); 689 const uint32_t is_a_e_flagged_msb = _uint32_sub( h_e_mask_minus_one, a_e ); 690 const uint32_t is_b_e_flagged_msb = _uint32_sub( h_e_mask_minus_one, b_e ); 691 const uint32_t is_a_e_eqz_msb = _uint32_dec( a_e ); 692 const uint32_t is_a_m_eqz_msb = _uint32_dec( a_m ); 693 const uint32_t is_b_e_eqz_msb = _uint32_dec( b_e ); 694 const uint32_t is_b_m_eqz_msb = _uint32_dec( b_m ); 695 const uint32_t is_b_eqz_msb = _uint32_and( is_b_e_eqz_msb, is_b_m_eqz_msb ); 696 const uint32_t is_a_eqz_msb = _uint32_and( is_a_e_eqz_msb, is_a_m_eqz_msb ); 697 const uint32_t is_c_nan_via_a_msb = _uint32_andc( is_a_e_flagged_msb, is_b_e_flagged_msb ); 698 const uint32_t is_c_nan_via_b_msb = _uint32_andc( is_b_e_flagged_msb, is_b_m_eqz_msb ); 699 const uint32_t is_c_nan_msb = _uint32_or( is_c_nan_via_a_msb, is_c_nan_via_b_msb ); 700 const uint32_t is_c_denorm_msb = _uint32_andc( is_b_e_eqz_msb, is_a_e_flagged_msb ); 701 const uint32_t is_a_inf_msb = _uint32_and( is_a_e_flagged_msb, is_a_m_eqz_msb ); 702 const uint32_t is_c_snan_msb = _uint32_and( is_a_inf_msb, is_b_eqz_msb ); 703 const uint32_t is_c_nan_min_via_a_msb = _uint32_and( is_a_e_flagged_msb, is_b_eqz_msb ); 704 const uint32_t is_c_nan_min_via_b_msb = _uint32_and( is_b_e_flagged_msb, is_a_eqz_msb ); 705 const uint32_t is_c_nan_min_msb = _uint32_or( is_c_nan_min_via_a_msb, is_c_nan_min_via_b_msb ); 706 const uint32_t is_c_inf_msb = _uint32_or( is_a_e_flagged_msb, is_b_e_flagged_msb ); 707 const uint32_t is_overflow_msb = _uint32_or( is_round_overflow_msb, is_mul_overflow_msb ); 708 const uint32_t c_em_overflow_result = _uint32_sels( is_overflow_msb, h_e_mask, c_em_norm ); 709 const uint32_t c_common_result = _uint32_or( c_s, c_em_overflow_result ); 710 const uint32_t c_zero_result = _uint32_sels( is_b_eqz_msb, c_s, c_common_result ); 711 const uint32_t c_nan_result = _uint32_sels( is_c_nan_msb, c_nan, c_zero_result ); 712 const uint32_t c_nan_min_result = _uint32_sels( is_c_nan_min_msb, h_nan_min, c_nan_result ); 713 const uint32_t c_inf_result = _uint32_sels( is_c_inf_msb, c_inf, c_nan_min_result ); 714 const uint32_t c_denorm_result = _uint32_sels( is_c_denorm_msb, c_denorm, c_inf_result); 715 const uint32_t c_result = _uint32_sels( is_c_snan_msb, h_snan, c_denorm_result ); 716 717 return (uint16_t)(c_result); 718}
half.h
0#ifndef HALF_H 1#define HALF_H 2 3#include <stdint.h> 4 5uint32_t half_to_float( uint16_t h ); 6uint16_t half_from_float( uint32_t f ); 7uint16_t half_add( uint16_t arg0, uint16_t arg1 ); 8uint16_t half_mul( uint16_t arg0, uint16_t arg1 ); 9
10static inline uint16_t 11half_sub( uint16_t ha, uint16_t hb ) 12{ 13 // (a-b) is the same as (a+(-b)) 14 return half_add( ha, hb ^ 0x8000 ); 15} 16 17#endif /* HALF_H */