How to efficiently perform double/int64 conversions with SSE/AVX? - c++

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.
_mm_cvtps_epi32()
_mm_cvtepi32_ps()
But there are no equivalents for double-precision and 64-bit integers. In other words, they are missing:
_mm_cvtpd_epi64()
_mm_cvtepi64_pd()
It seems that AVX doesn't have them either.
What is the most efficient way to simulate these intrinsics?

There's no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned (plus conversions to/from 32-bit unsigned). See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64.
If you only have AVX2 or earlier, you'll need tricks like the ones below for packed conversion. (For scalar, x86-64 has int64_t <-> double or float conversions since SSE2, but scalar uint64_t <-> FP requires tricks until AVX512 adds the unsigned conversions. Scalar 32-bit unsigned can be handled by zero-extending to 64-bit signed.)
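For example, the scalar fallbacks and the AVX512 path mentioned above look roughly like this (a minimal sketch; the function names are mine, and the AVX512 part assumes a compiler that defines __AVX512DQ__/__AVX512VL__):
#include <immintrin.h>
#include <stdint.h>

// Scalar uint32_t -> double: zero-extend to a signed 64-bit value,
// which compiles to a single cvtsi2sd on x86-64.
double u32_to_double(uint32_t x){ return (double)(int64_t)x; }

// Scalar int64_t -> double is supported directly on x86-64 (cvtsi2sd).
double i64_to_double(int64_t x){ return (double)x; }

#if defined(__AVX512DQ__) && defined(__AVX512VL__)
// With AVX512DQ+VL, the packed conversion is a single instruction.
__m256i double_to_int64_avx512(__m256d x){ return _mm256_cvtpd_epi64(x); }
#endif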
If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions:
If you don't care about infinity or NaN.
For double <-> int64_t, you only care about values in the range [-2^51, 2^51].
For double <-> uint64_t, you only care about values in the range [0, 2^52).
double -> uint64_t
// Only works for inputs in the range: [0, 2^52)
__m128i double_to_uint64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}
double -> int64_t
// Only works for inputs in the range: [-2^51, 2^51]
__m128i double_to_int64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0018000000000000));
    return _mm_sub_epi64(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0018000000000000))
    );
}
uint64_t -> double
// Only works for inputs in the range: [0, 2^52)
__m128d uint64_to_double(__m128i x){
    x = _mm_or_si128(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0010000000000000));
}
int64_t -> double
// Only works for inputs in the range: [-2^51, 2^51]
__m128d int64_to_double(__m128i x){
    x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(0x0018000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0018000000000000));
}
Rounding Behavior:
For the double -> uint64_t conversion, rounding works correctly, following the current rounding mode (which is usually round-to-even).
For the double -> int64_t conversion, rounding will follow the current rounding mode for all modes except truncation. If the current rounding mode is truncation (round towards zero), it will actually round towards negative infinity.
How does it work?
Despite this trick being only 2 instructions, it's not entirely self-explanatory.
The key is to recognize that for double-precision floating-point, values in the range [2^52, 2^53) have the "binary place" just below the lowest bit of the mantissa. In other words, if you zero out the exponent and sign bits, the mantissa becomes precisely the integer representation.
To convert x from double -> uint64_t, you add the magic number M which is the floating-point value of 2^52. This puts x into the "normalized" range of [2^52, 2^53) and conveniently rounds away the fractional part bits.
Now all that's left is to remove the upper 12 bits. This is easily done by masking it out. The fastest way is to recognize that those upper 12 bits are identical to those of M. So rather than introducing an additional mask constant, we can simply subtract or XOR by M. XOR has more throughput.
Converting from uint64_t -> double is simply the reverse of this process. You add back the exponent bits of M. Then un-normalize the number by subtracting M in floating-point.
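The same trick is easy to see in scalar form; here is an illustration of the unsigned direction (my own sketch, assuming IEEE-754 doubles that share their byte order with uint64_t, as on x86):
#include <stdint.h>
#include <string.h>

// Scalar illustration of double -> uint64_t for inputs in [0, 2^52).
uint64_t scalar_double_to_uint64(double x){
    double magic = 4503599627370496.0;      // 2^52
    double y = x + magic;                   // now in [2^52, 2^53): the fraction is rounded away
    uint64_t y_bits, magic_bits;
    memcpy(&y_bits, &y, sizeof y_bits);     // reinterpret the bit patterns
    memcpy(&magic_bits, &magic, sizeof magic_bits);
    return y_bits ^ magic_bits;             // clear the sign/exponent bits of 2^52, leaving the integer
}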
The signed integer conversions are slightly trickier since you need to deal with the 2's complement sign-extension. I'll leave those as an exercise for the reader.
Related: A fast method to round a double to a 32-bit int explained
Full Range int64 -> double:
After many years, I finally had a need for this.
5 instructions for uint64_t -> double
6 instructions for int64_t -> double
uint64_t -> double
__m128d uint64_to_double_full(__m128i x){
    __m128i xH = _mm_srli_epi64(x, 32);
    xH = _mm_or_si128(xH, _mm_castpd_si128(_mm_set1_pd(19342813113834066795298816.)));          //  2^84
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0xcc);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(19342813118337666422669312.));     //  2^84 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}
int64_t -> double
__m128d int64_to_double_full(__m128i x){
    __m128i xH = _mm_srai_epi32(x, 16);
    xH = _mm_blend_epi16(xH, _mm_setzero_si128(), 0x33);
    xH = _mm_add_epi64(xH, _mm_castpd_si128(_mm_set1_pd(442721857769029238784.)));              //  3*2^67
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0x88);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(442726361368656609280.));          //  3*2^67 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}
These work for the entire 64-bit range and are correctly rounded to the current rounding behavior.
These are similar to wim's answer below, but with more abusive optimizations. As such, deciphering these will also be left as an exercise to the reader.
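A quick sanity check of the full-range version could look like this (my own sketch, assuming SSE4.1 for _mm_blend_epi16 and the default round-to-nearest mode):
#include <assert.h>
#include <stdint.h>
#include <smmintrin.h>

void check_int64_to_double_full(void){
    int64_t vals[2] = { INT64_MIN, 1234567890123456789LL };
    __m128i v = _mm_loadu_si128((const __m128i*)vals);
    double out[2];
    _mm_storeu_pd(out, int64_to_double_full(v));
    assert(out[0] == (double)vals[0]);      // exact: -2^63 is representable as a double
    assert(out[1] == (double)vals[1]);      // both sides round to nearest
}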

This answer is about 64-bit integer to double conversion, without cutting corners. In a previous version of this answer (see the paragraph Fast and accurate conversion by splitting ... below), it was shown that it is quite efficient to split the 64-bit integers into a 32-bit low and a 32-bit high part, convert these parts to double, and compute low + high * 2^32.
The instruction counts of these conversions were:
int64_to_double_full_range 9 instructions (with mul and add as one fma)
uint64_to_double_full_range 7 instructions (with mul and add as one fma)
Inspired by Mysticial's updated answer, with better optimized accurate conversions,
I further optimized the int64_t to double conversion:
int64_to_double_fast_precise: 5 instructions.
uint64_to_double_fast_precise: 5 instructions.
The int64_to_double_fast_precise conversion takes one instruction less than Mysticial's solution.
The uint64_to_double_fast_precise code is essentially identical to Mysticial's solution (but with a vpblendd
instead of vpblendw). It is included here because of its similarities with the
int64_to_double_fast_precise conversion: The instructions are identical, only the constants differ:
#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>
__m256d int64_to_double_fast_precise(const __m256i v)
/* Optimized full range int64_t to double conversion */
/* Emulate _mm256_cvtepi64_pd() */
{
__m256i magic_i_lo = _mm256_set1_epi64x(0x4330000000000000); /* 2^52 encoded as floating-point */
__m256i magic_i_hi32 = _mm256_set1_epi64x(0x4530000080000000); /* 2^84 + 2^63 encoded as floating-point */
__m256i magic_i_all = _mm256_set1_epi64x(0x4530000080100000); /* 2^84 + 2^63 + 2^52 encoded as floating-point */
__m256d magic_d_all = _mm256_castsi256_pd(magic_i_all);
__m256i v_lo = _mm256_blend_epi32(magic_i_lo, v, 0b01010101); /* Blend the 32 lowest significant bits of v with magic_int_lo */
__m256i v_hi = _mm256_srli_epi64(v, 32); /* Extract the 32 most significant bits of v */
v_hi = _mm256_xor_si256(v_hi, magic_i_hi32); /* Flip the msb of v_hi and blend with 0x45300000 */
__m256d v_hi_dbl = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_all); /* Compute in double precision: */
__m256d result = _mm256_add_pd(v_hi_dbl, _mm256_castsi256_pd(v_lo)); /* (v_hi - magic_d_all) + v_lo Do not assume associativity of floating point addition !! */
return result; /* With gcc use -O3, then -fno-associative-math is default. Do not use -Ofast, which enables -fassociative-math! */
/* With icc use -fp-model precise */
}
__m256d uint64_to_double_fast_precise(const __m256i v)
/* Optimized full range uint64_t to double conversion */
/* This code is essentially identical to Mysticial's solution. */
/* Emulate _mm256_cvtepu64_pd() */
{
__m256i magic_i_lo = _mm256_set1_epi64x(0x4330000000000000); /* 2^52 encoded as floating-point */
__m256i magic_i_hi32 = _mm256_set1_epi64x(0x4530000000000000); /* 2^84 encoded as floating-point */
__m256i magic_i_all = _mm256_set1_epi64x(0x4530000000100000); /* 2^84 + 2^52 encoded as floating-point */
__m256d magic_d_all = _mm256_castsi256_pd(magic_i_all);
__m256i v_lo = _mm256_blend_epi32(magic_i_lo, v, 0b01010101); /* Blend the 32 lowest significant bits of v with magic_int_lo */
__m256i v_hi = _mm256_srli_epi64(v, 32); /* Extract the 32 most significant bits of v */
v_hi = _mm256_xor_si256(v_hi, magic_i_hi32); /* Blend v_hi with 0x45300000 */
__m256d v_hi_dbl = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_all); /* Compute in double precision: */
__m256d result = _mm256_add_pd(v_hi_dbl, _mm256_castsi256_pd(v_lo)); /* (v_hi - magic_d_all) + v_lo Do not assume associativity of floating point addition !! */
return result; /* With gcc use -O3, then -fno-associative-math is default. Do not use -Ofast, which enables -fassociative-math! */
/* With icc use -fp-model precise */
}
int main(){
int i;
uint64_t j;
__m256i j_4;
__m256d v;
double x[4];
double x0, x1, a0, a1;
j = 0ull;
printf("\nAccurate int64_to_double\n");
for (i = 0; i < 260; i++){
j_4= _mm256_set_epi64x(0, 0, -j, j);
v = int64_to_double_fast_precise(j_4);
_mm256_storeu_pd(x,v);
x0 = x[0];
x1 = x[1];
a0 = _mm_cvtsd_f64(_mm_cvtsi64_sd(_mm_setzero_pd(),j));
a1 = _mm_cvtsd_f64(_mm_cvtsi64_sd(_mm_setzero_pd(),-j));
printf(" j =%21li v =%23.1f v=%23.1f -v=%23.1f -v=%23.1f d=%.1f d=%.1f\n", j, x0, a0, x1, a1, x0-a0, x1-a1);
j = j+(j>>2)-(j>>5)+1ull;
}
j = 0ull;
printf("\nAccurate uint64_to_double\n");
for (i = 0; i < 260; i++){
if (i==258){j=-1;}
if (i==259){j=-2;}
j_4= _mm256_set_epi64x(0, 0, -j, j);
v = uint64_to_double_fast_precise(j_4);
_mm256_storeu_pd(x,v);
x0 = x[0];
x1 = x[1];
a0 = (double)((uint64_t)j);
a1 = (double)((uint64_t)-j);
printf(" j =%21li v =%23.1f v=%23.1f -v=%23.1f -v=%23.1f d=%.1f d=%.1f\n", j, x0, a0, x1, a1, x0-a0, x1-a1);
j = j+(j>>2)-(j>>5)+1ull;
}
return 0;
}
The conversions may fail if unsafe math optimization options are enabled. With gcc, -O3 is
safe, but -Ofast may lead to wrong results, because we may not assume associativity
of floating point addition here (the same holds for Mysticial's conversions).
With icc use -fp-model precise.
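For reference, here is how the constants of int64_to_double_fast_precise fall out; this is my own scalar sketch of the same computation, not part of the original derivation (it assumes IEEE-754 doubles with the same byte order as uint64_t, as on x86):
#include <stdint.h>
#include <string.h>

double int64_to_double_scalar_sketch(int64_t v){
    uint64_t u  = (uint64_t)v;
    uint64_t lo = u & 0xFFFFFFFFull;                /* unsigned low 32 bits                                        */
    uint64_t hi = (u >> 32) ^ 0x80000000ull;        /* signed high half with its sign bit flipped: hi_signed + 2^31 */

    uint64_t lo_bits = 0x4330000000000000ull | lo;  /* double with value 2^52 + lo                                 */
    uint64_t hi_bits = 0x4530000000000000ull | hi;  /* double with value 2^84 + 2^63 + hi_signed * 2^32            */
    uint64_t magic   = 0x4530000080100000ull;       /* double with value 2^84 + 2^63 + 2^52 (magic_d_all)          */

    double lo_d, hi_d, magic_d;
    memcpy(&lo_d, &lo_bits, 8);
    memcpy(&hi_d, &hi_bits, 8);
    memcpy(&magic_d, &magic, 8);

    /* hi_d - magic_d = hi_signed * 2^32 - 2^52, exactly. Adding lo_d = 2^52 + lo
       cancels the 2^52 term and rounds once, giving (double)v.                   */
    return (hi_d - magic_d) + lo_d;
}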
Fast and accurate conversion by splitting the 64-bit integers in a 32-bit low and a 32-bit high part.
We assume that both the integer input and the double output are in 256 bit wide AVX registers.
Two approaches are considered:
int64_to_double_based_on_cvtsi2sd(): as suggested in the comments on the question, use cvtsi2sd 4 times together with some data shuffling.
Unfortunately both cvtsi2sd and the data shuffling instructions need execution port 5. This limits the performance of this approach.
int64_to_double_full_range(): we can use Mysticial's fast conversion method twice in order to get
an accurate conversion for the full 64-bit integer range. The 64-bit integer is split into a 32-bit low and a 32-bit high part,
similarly to the answers to this question: How to perform uint32/float conversion with SSE?.
Each of these pieces is suitable for Mysticial's integer to double conversion.
Finally the high part is multiplied by 2^32 and added to the low part.
The signed conversion is a little bit more complicated than the unsigned conversion (uint64_to_double_full_range()),
because srai_epi64() doesn't exist.
Code:
#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>
/*
gcc -O3 -Wall -m64 -mfma -mavx2 -march=broadwell cvt_int_64_double.c
./a.out A
time ./a.out B
time ./a.out C
etc.
*/
inline __m256d uint64_to_double256(__m256i x){ /* Mysticial's fast uint64_to_double. Works for inputs in the range: [0, 2^52) */
    x = _mm256_or_si256(x, _mm256_castpd_si256(_mm256_set1_pd(0x0010000000000000)));
    return _mm256_sub_pd(_mm256_castsi256_pd(x), _mm256_set1_pd(0x0010000000000000));
}

inline __m256d int64_to_double256(__m256i x){ /* Mysticial's fast int64_to_double. Works for inputs in the range: (-2^51, 2^51) */
    x = _mm256_add_epi64(x, _mm256_castpd_si256(_mm256_set1_pd(0x0018000000000000)));
    return _mm256_sub_pd(_mm256_castsi256_pd(x), _mm256_set1_pd(0x0018000000000000));
}
__m256d int64_to_double_full_range(const __m256i v)
{
__m256i msk_lo =_mm256_set1_epi64x(0xFFFFFFFF);
__m256d cnst2_32_dbl =_mm256_set1_pd(4294967296.0); /* 2^32 */
__m256i v_lo = _mm256_and_si256(v,msk_lo); /* extract the 32 lowest significant bits of v */
__m256i v_hi = _mm256_srli_epi64(v,32); /* 32 most significant bits of v. srai_epi64 doesn't exist */
__m256i v_sign = _mm256_srai_epi32(v,32); /* broadcast sign bit to the 32 most significant bits */
v_hi = _mm256_blend_epi32(v_hi,v_sign,0b10101010); /* restore the correct sign of v_hi */
__m256d v_lo_dbl = int64_to_double256(v_lo); /* v_lo is within specified range of int64_to_double */
__m256d v_hi_dbl = int64_to_double256(v_hi); /* v_hi is within specified range of int64_to_double */
v_hi_dbl = _mm256_mul_pd(cnst2_32_dbl,v_hi_dbl); /* _mm256_mul_pd and _mm256_add_pd may compile to a single fma instruction */
return _mm256_add_pd(v_hi_dbl,v_lo_dbl); /* rounding occurs if the integer doesn't exist as a double */
}
__m256d int64_to_double_based_on_cvtsi2sd(const __m256i v)
{ __m128d zero = _mm_setzero_pd(); /* to avoid uninitialized variables in _mm_cvtsi64_sd */
__m128i v_lo = _mm256_castsi256_si128(v);
__m128i v_hi = _mm256_extracti128_si256(v,1);
__m128d v_0 = _mm_cvtsi64_sd(zero,_mm_cvtsi128_si64(v_lo));
__m128d v_2 = _mm_cvtsi64_sd(zero,_mm_cvtsi128_si64(v_hi));
__m128d v_1 = _mm_cvtsi64_sd(zero,_mm_extract_epi64(v_lo,1));
__m128d v_3 = _mm_cvtsi64_sd(zero,_mm_extract_epi64(v_hi,1));
__m128d v_01 = _mm_unpacklo_pd(v_0,v_1);
__m128d v_23 = _mm_unpacklo_pd(v_2,v_3);
__m256d v_dbl = _mm256_castpd128_pd256(v_01);
v_dbl = _mm256_insertf128_pd(v_dbl,v_23,1);
return v_dbl;
}
__m256d uint64_to_double_full_range(const __m256i v)
{
__m256i msk_lo =_mm256_set1_epi64x(0xFFFFFFFF);
__m256d cnst2_32_dbl =_mm256_set1_pd(4294967296.0); /* 2^32 */
__m256i v_lo = _mm256_and_si256(v,msk_lo); /* extract the 32 lowest significant bits of v */
__m256i v_hi = _mm256_srli_epi64(v,32); /* 32 most significant bits of v */
__m256d v_lo_dbl = uint64_to_double256(v_lo); /* v_lo is within specified range of uint64_to_double */
__m256d v_hi_dbl = uint64_to_double256(v_hi); /* v_hi is within specified range of uint64_to_double */
v_hi_dbl = _mm256_mul_pd(cnst2_32_dbl,v_hi_dbl);
return _mm256_add_pd(v_hi_dbl,v_lo_dbl); /* rounding may occur for inputs >2^52 */
}
int main(int argc, char **argv){
int i;
uint64_t j;
__m256i j_4, j_inc;
__m256d v, v_acc;
double x[4];
char test = argv[1][0];
if (test=='A'){ /* test the conversions for several integer values */
j = 1ull;
printf("\nint64_to_double_full_range\n");
for (i = 0; i<30; i++){
j_4= _mm256_set_epi64x(j-3,j+3,-j,j);
v = int64_to_double_full_range(j_4);
_mm256_storeu_pd(x,v);
printf("j =%21li v =%23.1f -v=%23.1f v+3=%23.1f v-3=%23.1f \n",j,x[0],x[1],x[2],x[3]);
j = j*7ull;
}
j = 1ull;
printf("\nint64_to_double_based_on_cvtsi2sd\n");
for (i = 0; i<30; i++){
j_4= _mm256_set_epi64x(j-3,j+3,-j,j);
v = int64_to_double_based_on_cvtsi2sd(j_4);
_mm256_storeu_pd(x,v);
printf("j =%21li v =%23.1f -v=%23.1f v+3=%23.1f v-3=%23.1f \n",j,x[0],x[1],x[2],x[3]);
j = j*7ull;
}
j = 1ull;
printf("\nuint64_to_double_full_range\n");
for (i = 0; i<30; i++){
j_4= _mm256_set_epi64x(j-3,j+3,j,j);
v = uint64_to_double_full_range(j_4);
_mm256_storeu_pd(x,v);
printf("j =%21lu v =%23.1f v+3=%23.1f v-3=%23.1f \n",j,x[0],x[2],x[3]);
j = j*7ull;
}
}
else{
j_4 = _mm256_set_epi64x(-123,-4004,-312313,-23412731);
j_inc = _mm256_set_epi64x(1,1,1,1);
v_acc = _mm256_setzero_pd();
switch(test){
case 'B' :{
printf("\nLatency int64_to_double_cvtsi2sd()\n"); /* simple test to get a rough idea of the latency of int64_to_double_cvtsi2sd() */
for (i = 0; i<1000000000; i++){
v =int64_to_double_based_on_cvtsi2sd(j_4);
j_4= _mm256_castpd_si256(v); /* cast without conversion, use output as an input in the next step */
}
_mm256_storeu_pd(x,v);
}
break;
case 'C' :{
printf("\nLatency int64_to_double_full_range()\n"); /* simple test to get a rough idea of the latency of int64_to_double_full_range() */
for (i = 0; i<1000000000; i++){
v = int64_to_double_full_range(j_4);
j_4= _mm256_castpd_si256(v);
}
_mm256_storeu_pd(x,v);
}
break;
case 'D' :{
printf("\nThroughput int64_to_double_cvtsi2sd()\n"); /* simple test to get a rough idea of the throughput of int64_to_double_cvtsi2sd() */
for (i = 0; i<1000000000; i++){
j_4 = _mm256_add_epi64(j_4,j_inc); /* each step a different input */
v = int64_to_double_based_on_cvtsi2sd(j_4);
v_acc = _mm256_xor_pd(v,v_acc); /* use somehow the results */
}
_mm256_storeu_pd(x,v_acc);
}
break;
case 'E' :{
printf("\nThroughput int64_to_double_full_range()\n"); /* simple test to get a rough idea of the throughput of int64_to_double_full_range() */
for (i = 0; i<1000000000; i++){
j_4 = _mm256_add_epi64(j_4,j_inc);
v = int64_to_double_full_range(j_4);
v_acc = _mm256_xor_pd(v,v_acc);
}
_mm256_storeu_pd(x,v_acc);
}
break;
default : {}
}
printf("v =%23.1f -v =%23.1f v =%23.1f -v =%23.1f \n",x[0],x[1],x[2],x[3]);
}
return 0;
}
The actual performance of these functions may depend on the surrounding code and the cpu generation.
Timing results for 1e9 conversions (256 bit wide) with simple tests B, C, D, and E in the code above, on an Intel Skylake i5-6500 system:
Latency experiment int64_to_double_based_on_cvtsi2sd() (test B) 5.02 sec.
Latency experiment int64_to_double_full_range() (test C) 3.77 sec.
Throughput experiment int64_to_double_based_on_cvtsi2sd() (test D) 2.82 sec.
Throughput experiment int64_to_double_full_range() (test E) 1.07 sec.
The difference in throughput between int64_to_double_full_range() and int64_to_double_based_on_cvtsi2sd() is larger than I expected.

Thanks @Mysticial and @wim for the full-range i64->f64. I came up with a full-range truncating f64->i64 for the Highway SIMD wrapper.
The first version tried to change the rounding mode, but Clang reorders the MXCSR changes relative to the conversions and ignores asm volatile, memory/cc clobbers, and even an atomic fence. It's not clear to me how to make that safe; NOINLINE works but causes lots of spilling.
A second version (Compiler Explorer link) emulates FP renormalization and turns out to be faster according to llvm-mca (8-10 cycles rthroughput/total).
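For context, the rounding-mode idea that proved unsafe looked roughly like this (a sketch of the approach described above, not the actual first version; it assumes AVX2 and, for brevity, only covers the limited-range unsigned case):
#include <immintrin.h>

// Temporarily force round-toward-zero so the 2^52 magic-number trick truncates
// (for non-negative inputs below 2^52), then restore the previous MXCSR.
// The compiler may reorder the FP operations across the _mm_setcsr calls,
// which is exactly the problem described above.
__m256i double_to_uint64_truncate_unsafe(__m256d x){
    unsigned int old_csr = _mm_getcsr();
    _mm_setcsr((old_csr & ~_MM_ROUND_MASK) | _MM_ROUND_TOWARD_ZERO);
    __m256d shifted = _mm256_add_pd(x, _mm256_set1_pd(0x0010000000000000));   // + 2^52
    __m256i r = _mm256_xor_si256(_mm256_castpd_si256(shifted),
                                 _mm256_castpd_si256(_mm256_set1_pd(0x0010000000000000)));
    _mm_setcsr(old_csr);
    return r;
}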
// Full-range F64 -> I64 conversion
#include <hwy/highway.h>
namespace hwy {
namespace HWY_NAMESPACE {
HWY_API Vec256<int64_t> I64FromF64(Full256<int64_t> di, const Vec256<double> v) {
const RebindToFloat<decltype(di)> dd;
using VD = decltype(v);
using VI = decltype(Zero(di));
const VI k0 = Zero(di);
const VI k1 = Set(di, 1);
const VI k51 = Set(di, 51);
// Exponent indicates whether the number can be represented as int64_t.
const VI biased_exp = ShiftRight<52>(BitCast(di, v)) & Set(di, 0x7FF);
const VI exp = biased_exp - Set(di, 0x3FF);
const auto in_range = exp < Set(di, 63);
// If we were to cap the exponent at 51 and add 2^52, the number would be in
// [2^52, 2^53) and mantissa bits could be read out directly. We need to
// round-to-0 (truncate), but changing rounding mode in MXCSR hits a
// compiler reordering bug: https://gcc.godbolt.org/z/4hKj6c6qc . We instead
// manually shift the mantissa into place (we already have many of the
// inputs anyway).
const VI shift_mnt = Max(k51 - exp, k0);
const VI shift_int = Max(exp - k51, k0);
const VI mantissa = BitCast(di, v) & Set(di, (1ULL << 52) - 1);
// Include implicit 1-bit; shift by one more to ensure it's in the mantissa.
const VI int52 = (mantissa | Set(di, 1ULL << 52)) >> (shift_mnt + k1);
// For inputs larger than 2^52, insert zeros at the bottom.
const VI shifted = int52 << shift_int;
// Restore the one bit lost when shifting in the implicit 1-bit.
const VI restored = shifted | ((mantissa & k1) << (shift_int - k1));
// Saturate to LimitsMin (unchanged when negating below) or LimitsMax.
const VI sign_mask = BroadcastSignBit(BitCast(di, v));
const VI limit = Set(di, LimitsMax<int64_t>()) - sign_mask;
const VI magnitude = IfThenElse(in_range, restored, limit);
// If the input was negative, negate the integer (two's complement).
return (magnitude ^ sign_mask) - sign_mask;
}
void Test(const double* pd, int64_t* pi) {
Full256<int64_t> di;
Full256<double> dd;
for (size_t i = 0; i < 256; i += Lanes(di)) {
Store(I64FromF64(di, Load(dd, pd + i)), di, pi + i);
}
}
}  // namespace HWY_NAMESPACE
}  // namespace hwy
If anyone sees any potential for simplifying the algorithm, please leave a comment.

Related

How to do mask / conditional / branchless arithmetic operations in AVX2

I understand how to do general arithmetic operations in AVX2. However, there are conditional operations in scalar code I would like to translate to AVX2. How shall I do it?
For example, I would like to vectorize
double arr[4] = {1.0,2.0,3.0,4.0};
double condition = 3.0;
for (int i = 0; i < 4; i++) {
if (arr[i] < condition) {
arr[i] *= 1.75;
}
else {
arr[i] *= 6.5;
}
}
for (auto i : arr) {
std::cout << i << '\t';
}
Expected output:
1.75 3.5 19.5 26
How can I perform conditional operations like above in AVX2?
Use branchless AVX2 operations: compute both possible results for the whole vector, then use a compare mask to keep, per element, the result that satisfies your condition. For your case:
double arr[4] = { 1.0,2.0,3.0,4.0 };
double condition = 3.0;
__m256d vArr = _mm256_loadu_pd(&arr[0]);
__m256d vMultiplier1 = _mm256_set1_pd(1.75);
__m256d vMultiplier2 = _mm256_set1_pd(6.5);
__m256d vFirstResult = _mm256_mul_pd(vArr, vMultiplier1); //if-branch
__m256d vSecondResult = _mm256_mul_pd(vArr, vMultiplier2); //else-branch
__m256d vCondition = _mm256_set1_pd(condition);
vCondition= _mm256_cmp_pd(vArr, vCondition, _CMP_LT_OQ); //a < b ordered (non-signalling)
// Use mask to choose between _firstResult and _secondResult for each element
vFirstResult = _mm256_blendv_pd(vSecondResult, vFirstResult, vCondition);
double res[4];
_mm256_storeu_pd(&res[0], vFirstResult);
for (auto i : res) {
std::cout << i << '\t';
}
A possible alternative to BLENDV is a combination of AND, ANDNOT and OR, as sketched below. However, BLENDV is better in both simplicity and performance. Use BLENDV as long as you have at least SSE4.1 and don't have AVX512 yet.
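For reference, the AND/ANDNOT/OR version would look roughly like this (an untested sketch, reusing the variables from the code above in place of the _mm256_blendv_pd line):
// Keep (vCondition & vFirstResult) where the condition is true,
// (~vCondition & vSecondResult) where it is false, then OR the two halves.
__m256d vSelected = _mm256_or_pd(_mm256_and_pd(vCondition, vFirstResult),
                                 _mm256_andnot_pd(vCondition, vSecondResult));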
For information about what _CMP_LT_OQ means, and what the other comparison predicates are, see Dave Dopson's table. You can do whatever comparison you want by changing this predicate accordingly.
There are detailed notes by Peter Cordes about conditional operations in AVX2 and AVX512. There are more examples on conditional vectorization (with SSE and AVX512 examples) in Agner Fog's "Optimizing C++" in chapter 12.4 on pages 121-124.
Maybe you don't want to do any computation in the else-branch, or you explicitly want to zero those elements, so that your expected output looks like
1.75 3.5 0.0 0.0
In that case you can use a slightly faster instruction sequence, since you don't have to handle the else-branch. There are at least 2 ways to achieve the speedup:
Removing the second multiplication but keeping blendv: instead of vSecondResult, just use a zeroed vector (it can be a global constant).
Removing both the second multiplication and blendv, replacing the blend with a bitwise AND against the compare mask; the mask itself zeroes the unselected elements.
The second way is better. For example, according to the uops tables, VBLENDVPD on the Skylake microarchitecture takes 2 uops, has 2 cycles of latency, and can execute only once per clock. Meanwhile VANDPD is 1 uop, 1 cycle of latency, and can execute 3 times per clock.
Worse way, just blending with zero
double arr[4] = { 1.0,2.0,3.0,4.0 };
double condition = 3.0;
__m256d vArr = _mm256_loadu_pd(&arr[0]);
__m256d vMultiplier1 = _mm256_set1_pd(1.75);
__m256d vFirstResult = _mm256_mul_pd(vArr, vMultiplier1); //if-branch
__m256d vZeroes = _mm256_setzero_pd();
__m256d vCondition = _mm256_set1_pd(condition);
vCondition = _mm256_cmp_pd(vArr, vCondition, _CMP_LT_OQ); //a < b ordered (non-signalling)
//Conditionally blenv _firstResult when IF statement satisfied, zeroes otherwise
vFirstResult = _mm256_blendv_pd(vZeroes, vFirstResult, vCondition);
double res[4];
_mm256_storeu_pd(&res[0], vFirstResult);
for (auto i : res) {
std::cout << i << '\t';
}
Better way, bitwise AND with a compare result is a cheaper way to conditionally zero.
double arr[4] = { 1.0,2.0,3.0,4.0 };
double condition = 3.0;
__m256d vArr = _mm256_loadu_pd(&arr[0]);
__m256d vMultiplier1 = _mm256_set1_pd(1.75);
__m256d vFirstResult = _mm256_mul_pd(vArr, vMultiplier1); //if-branch
__m256d vCondition = _mm256_set1_pd(condition);
vCondition = _mm256_cmp_pd(vArr, vCondition, _CMP_LT_OQ); //a < b ordered (non-signalling)
// If result not satisfied condition, after bitwise AND it becomes zero
vFirstResult = _mm256_and_pd(vFirstResult, vCondition);
double res[4] = {0.0,0.0,0.0,0.0};
_mm256_storeu_pd(&res[0], vFirstResult);
for (auto i : res) {
std::cout << i << '\t';
}
This takes advantage of what a compare-result vector really is, and that the bit-pattern for IEEE 0.0 is all bits zeroed.

Can I speed up more than _mm256_i32gather_epi32?

I made gamma-conversion code for 4K video:
/** gamma0
input range : 0 ~ 1,023
output range : 0 ~ ?
*/
v00 = _mm256_unpacklo_epi16(v0, _mm256_setzero_si256());
v01 = _mm256_unpackhi_epi16(v0, _mm256_setzero_si256());
v10 = _mm256_unpacklo_epi16(v1, _mm256_setzero_si256());
v11 = _mm256_unpackhi_epi16(v1, _mm256_setzero_si256());
v20 = _mm256_unpacklo_epi16(v2, _mm256_setzero_si256());
v21 = _mm256_unpackhi_epi16(v2, _mm256_setzero_si256());
v00 = _mm256_i32gather_epi32(csv->gamma0LUT, v00, 4);
v01 = _mm256_i32gather_epi32(csv->gamma0LUT, v01, 4);
v10 = _mm256_i32gather_epi32(csv->gamma0LUTc, v10, 4);
v11 = _mm256_i32gather_epi32(csv->gamma0LUTc, v11, 4);
v20 = _mm256_i32gather_epi32(csv->gamma0LUTc, v20, 4);
v21 = _mm256_i32gather_epi32(csv->gamma0LUTc, v21, 4);
I want to implement a "10-bit input to 10~13-bit output" LUT (look-up table), but AVX2 only supports 32-bit gathers.
So the values were unavoidably widened to 32 bits and the lookup implemented with the _mm256_i32gather_epi32 instruction.
The performance bottleneck here is the most severe; is there any way to improve this?
Since the context of your question is still a bit vague for me, just some general ideas you could try (some may be just slightly better or even worse compared to what you have at the moment, all code below is untested):
LUT with 16 bit values using _mm256_i32gather_epi32
Even though it loads 32-bit values, you can still use a scale of 2 as the last argument of _mm256_i32gather_epi32. You should make sure that the 2 bytes before and after your LUT are readable.
static const int16_t LUT[1024+2] = { 0, val0, val1, ..., val1022, val1023, 0};
__m256i high_idx = _mm256_srli_epi32(v, 16);
__m256i low_idx = _mm256_blend_epi16(v, _mm256_setzero_si256(), 0xAA);
__m256i high_val = _mm256_i32gather_epi32((int const*)(LUT+0), high_idx, 2);
__m256i low_val = _mm256_i32gather_epi32((int const*)(LUT+1), low_idx, 2);
__m256i values = _mm256_blend_epi16(low_val, high_val, 0xAA);
Join two values into one LUT-entry
For small-ish LUTs, you could calculate an index from two neighboring indexes as (idx_hi << 10) + idx_low and look up the corresponding tuple directly. However, instead of 2KiB you would have a 4 MiB LUT in your case, which likely hurts caching -- but you only have half the number of gather instructions.
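A rough sketch of that idea (untested; pairLUT is a hypothetical table of 1024*1024 32-bit entries where pairLUT[(idx_hi << 10) | idx_lo] packs the two 16-bit results of a neighboring pair, and v0 is the packed 16-bit input vector from the question):
__m256i idx_hi = _mm256_srli_epi32(v0, 16);                                  // upper 10-bit index of each pair
__m256i idx_lo = _mm256_and_si256(v0, _mm256_set1_epi32(0x3FF));             // lower 10-bit index of each pair
__m256i idx    = _mm256_or_si256(_mm256_slli_epi32(idx_hi, 10), idx_lo);     // combined 20-bit index
__m256i pairs  = _mm256_i32gather_epi32((int const*)pairLUT, idx, 4);        // one gather yields two results per lane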
Polynomial approximation
Mathematically, all continuous functions on a finite interval can be approximated by a polynomial. You could either convert your values to float, evaluate the polynomial, and convert back, or do it directly with fixed-point multiplications (note that _mm256_mulhi_epi16/_mm256_mulhi_epu16 compute (a * b) >> 16, which is convenient if one factor is actually in [0, 1)).
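A fixed-point sketch of that second option (untested; c2, c1, c0 are hypothetical __m256i coefficient vectors fitted to your gamma curve and scaled so the intermediate sums stay within 16 bits, v0 is the packed 16-bit input vector, and the final output shift depends on the coefficient format):
__m256i x = _mm256_slli_epi16(v0, 6);        // 10-bit input -> unsigned Q0.16 fraction in [0, 1)
__m256i y = _mm256_mulhi_epu16(c2, x);       // c2 * x            (implicit >> 16)
y = _mm256_add_epi16(y, c1);                 // c2 * x + c1
y = _mm256_mulhi_epu16(y, x);                // (c2 * x + c1) * x (implicit >> 16)
y = _mm256_add_epi16(y, c0);                 // quadratic approximation of the LUT value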
8 bit, 16 entry LUT with linear interpolation
SSE/AVX2 provides a pshufb instruction which can be used as an 8-bit LUT with 16 entries (and an implicit 0 entry).
Proof-of-concept implementation:
__m256i idx = _mm256_srli_epi16(v, 6); // shift highest 4 bits to the right
idx = _mm256_mullo_epi16(idx, _mm256_set1_epi16(0x0101)); // duplicate idx, maybe _mm256_shuffle_epi8 is better?
idx = _mm256_sub_epi8(idx, _mm256_set1_epi16(0x0001)); // subtract 1 from lower idx, 0 is mapped to 0xff
__m256i lut_vals = _mm256_shuffle_epi8(LUT, idx); // implicitly: LUT[-1] = 0
// get fractional part of input value:
__m256i dv = _mm256_and_si256(v, _mm256_set1_epi8(0x3f)); // lowest 6 bits
dv = _mm256_mullo_epi16(dv, _mm256_set1_epi16(0xff01)); // dv = [-dv, dv]
dv = _mm256_add_epi8(dv, _mm256_set1_epi16(0x4000)); // dv = [0x40-(v&0x3f), (v&0x3f)];
__m256i res = _mm256_maddubs_epi16(lut_vals, dv); // switch order depending on whether LUT values are (un)signed.
// probably shift res to the right, depending on the scale of your LUT values
You could also combine this with first doing a linear or quadratic approximation and just calculating the difference to your target function.

Generate random numbers in a given range with AVX2, faster than SVML _mm256_rem_epu32 remainder?

I'm currently trying to implement an XOR_SHIFT Random Number Generator using AVX2, it's actually quite easy and very fast. However I need to be able to specify a range. This usually requires modulo.
This is a major problem for me for 2 reasons:
Adding the _mm256_rem_epu32() / _mm256_rem_epi32() SVML function to my code takes the run time of my loop from around 270ms to 1.8 seconds. Ouch!
SVML is only available on MSVC and Intel Compilers
Are there any significantly faster ways to do modulo using AVX2?
Non Vector Code:
std::srand(std::time(nullptr));
std::mt19937_64 e(std::rand());
uint32_t seed = static_cast<uint32_t>(e());
for (; i != end; ++i)
{
seed ^= (seed << 13u);
seed ^= (seed >> 7u);
seed ^= (seed << 17u);
arr[i] = static_cast<T>(low + (seed % ((up + 1u) - low)));
}//End for
Vectorized:
constexpr uint32_t thirteen = 13u;
constexpr uint32_t seven = 7u;
constexpr uint32_t seventeen = 17u;
const __m256i _one = _mm256_set1_epi32(1);
const __m256i _lower = _mm256_set1_epi32(static_cast<uint32_t>(low));
const __m256i _upper = _mm256_set1_epi32(static_cast<uint32_t>(up));
__m256i _temp = _mm256_setzero_si256();
__m256i _res = _mm256_setzero_si256();
__m256i _seed = _mm256_set_epi32(
static_cast<uint32_t>(e()),
static_cast<uint32_t>(e()),
static_cast<uint32_t>(e()),
static_cast<uint32_t>(e()),
static_cast<uint32_t>(e()),
static_cast<uint32_t>(e()),
static_cast<uint32_t>(e()),
static_cast<uint32_t>(e())
);
for (; (i + 8uz) < end; ++i)
{
//Generate Random Numbers
_temp = _mm256_slli_epi32(_seed, thirteen);
_seed = _mm256_xor_si256(_seed, _temp);
_temp = _mm256_srai_epi32(_seed, seven);
_seed = _mm256_xor_si256(_seed, _temp);
_temp = _mm256_slli_epi32(_seed, seventeen);
_seed = _mm256_xor_si256(_seed, _temp);
//Narrow
_temp = _mm256_add_epi32(_upper, _one);
_temp = _mm256_sub_epi32(_temp, _lower);
_temp = _mm256_rem_epu32(_seed, _temp); //Comment this line out for a massive speed up but incorrect results
_res = _mm256_add_epi32(_lower, _temp);
_mm256_store_si256((__m256i*) &arr[i], _res);
}//End for
If your range is smaller than ~16.7 million, and you don't need cryptography-grade quality of the distribution, an easy and relatively fast method of narrowing these random numbers is FP32 math.
Here’s an example, untested.
The function below takes integer vector with random bits, and converts these bits into integer numbers in [ 0 .. range - 1 ] interval.
// Ideally, make sure this function is inlined,
// by applying __forceinline for vc++ or __attribute__((always_inline)) for gcc/clang
inline __m256i narrowRandom( __m256i bits, int range )
{
assert( range > 1 );
// Convert random bits into FP32 number in [ 1 .. 2 ) interval
const __m256i mantissaMask = _mm256_set1_epi32( 0x7FFFFF );
const __m256i mantissa = _mm256_and_si256( bits, mantissaMask );
const __m256 one = _mm256_set1_ps( 1 );
__m256 val = _mm256_or_ps( _mm256_castsi256_ps( mantissa ), one );
// Scale the number from [ 1 .. 2 ) into [ 0 .. range ),
// the formula is ( val * range ) - range
const __m256 rf = _mm256_set1_ps( (float)range );
val = _mm256_fmsub_ps( val, rf, rf );
// Convert to integers
// The instruction below always truncates towards 0 regardless on MXCSR register.
// If you want ranges like [ -10 .. +10 ], use _mm256_add_epi32 afterwards
return _mm256_cvttps_epi32( val );
}
When inlined, it should compile into 4 instructions: vpand, vorps, vfmsub132ps, vcvttps2dq. Probably an order of magnitude faster than _mm256_rem_epu32 in your example.
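Hooked into the loop from the question, the usage would look roughly like this (a sketch, reusing the question's variables and assuming the range up + 1 - low stays well below ~16.7 million):
// Replace the _mm256_rem_epu32 narrowing in the loop body:
__m256i narrowed = narrowRandom(_seed, static_cast<int>(up + 1u - low)); // [0 .. range)
_res = _mm256_add_epi32(_lower, narrowed);                               // [low .. up]
_mm256_store_si256((__m256i*) &arr[i], _res);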

Precise conversion of 32-bit unsigned integer into a float in range (-1;1)

According to articles like this, half of the floating-point numbers are in the interval [-1,1]. Could you suggest how to make use of this fact so to replace the naive conversion of a 32-bit unsigned integer into a floating-point number (while keeping the uniform distribution)?
Naive code:
uint32_t i = /* randomly generated */;
float f = (float)i / (1ui32<<31) - 1.0f;
The problem here is that first the number i is converted into float losing up to 8 lower bits of precision. Only then the number is scaled to [0;2) interval, and then to [-1;1) interval.
Please, suggest the solution in C or C++ for x86_64 CPU or CUDA if you know it.
Update: the solution with a double is good for x86_64, but is too slow in CUDA. Sorry I didn't expect such a response. Any ideas how to achieve this without using double-precision floating-point?
You can do the calculation using double instead so you don't lose any precision on the uint32_t value, then assign the result to a float.
float f = (double)i / (1ui32<<31) - 1.0;
In case you drop the uniform-distribution constraint, it's doable with 32-bit integer arithmetic alone:
//---------------------------------------------------------------------------
float i32_to_f32(int x)
{
int exp;
union _f32 // semi result
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
} y;
// edge cases
if (x== 0x00000000) return 0.0f;
if (x< -0x1FFFFFFF) return -1.0f;
if (x> +0x1FFFFFFF) return +1.0f;
// conversion
y.u=0; // reset bits
if (x<0){ y.u|=0x80000000; x=-x; } // sign (31 bits left)
exp=((x>>23)&63)-64; // upper 6 bits -> exponent -1,...,-64 (not 7bits to avoid denormalized numbers)
y.u|=(exp+127)<<23; // exponent bias and bit position
y.u|=x&0x007FFFFF; // mantissa
return y.f;
}
//---------------------------------------------------------------------------
int f32_to_i32(float x)
{
int exp,man,i;
union _f32 // semi result
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
} y;
// edge cases
if (x== 0.0f) return 0x00000000;
if (x<=-1.0f) return -0x1FFFFFFF;
if (x>=+1.0f) return +0x1FFFFFFF;
// conversion
y.f=x;
exp=(y.u>>23)&255; exp-=127; // exponent bias and bit position
if (exp<-64) return 0.0f;
man=y.u&0x007FFFFF; // mantissa
i =(exp<<23)&0x1F800000;
i|= man;
if (y.u>=0x80000000) i=-i; // sign
return i;
}
//---------------------------------------------------------------------------
I chose to use only 29 bits + sign = ~ 30 bits of integer to avoid denormalized numbers havoc which I am too lazy to encode (it would get you 30 or even 31 bits but much slower and complicated).
But the distribution is not linear nor uniform at all (in the plot, not reproduced here, red is the float in range <-1,+1> and blue is the integer in range <-0x1FFFFFFF,+0x1FFFFFFF>).
On the other hand there is no rounding at all in either conversion ...
PS. I think there might be a way to somewhat linearize the result by using a precomputed LUT for the 6 bit exponent (64 values).
The thing to realize is that while (float)i does lose up to 8 bits of precision (so it has 24 bits of precision), the result only has 24 bits of precision as well. So this precision loss is not necessarily a bad thing (it is actually more complicated, because a smaller i loses fewer than 8 bits, but things work out well).
So we just need to fix the range, so the originally non-negative value gets mapped to INT_MIN..INT_MAX.
This expression works: (float)(int)(value^0x80000000)/0x80000000.
Here's how it works:
The (int)(value^0x80000000) part flips the sign bit, so 0x0 gets mapped to INT_MIN, and 0xffffffff gets mapped to INT_MAX.
Then there is conversion to float. This is where some rounding happens, and we lose precision (but it is not a problem).
Then just divide by 0x80000000 to get into the range [-1..1]. As this division just adjusts the exponent part, this division doesn't lose any precision.
So there is only one rounding; the other operations don't lose precision. This chain of operations has the same effect as calculating the result in infinite precision and then rounding to float (that theoretical rounding coincides with the rounding in step 2).
But, to be absolutely sure, I've verified with brute force checking all the 32-bit values that this expression results in the same value as (float)((double)value/0x80000000-1.0).
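Wrapped up as a function, the expression reads like this (a trivial sketch; the function name is mine):
#include <stdint.h>

// Map a uint32_t to the [-1, 1] range with a single rounding step.
// Note: the uint-to-int cast relies on the usual two's-complement wrap-around.
float u32_to_unit_float(uint32_t value){
    return (float)(int)(value ^ 0x80000000u) / 0x80000000u;
}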
I suggest (if you want to avoid division and use an accurately float-representable start value of 1.0*2^-32):
float e = i * ldexp(1.0,-32) - 1.0;
Any ideas how to achieve this without using double-precision floating-point?
Without assuming too much about the insides of float:
Shift u until the most significant bit is set, halving the float conversion value.
"keeping the uniform distribution"
50% of the uint32_t values will be in the [0.5 ... 1.0)
25% of the uint32_t values will be in the [0.25 ... 0.5)
12.5% of the uint32_t values will be in the [0.125 ... 0.25)
6.25% of the uint32_t values will be in the [0.0625 ... 0.125)
...
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
float ui32to0to1(uint32_t u) {
if (u) {
float band = 1.0f/(1llu<<32);
while ((u & 0x80000000) == 0) {
u <<= 1;
band *= 0.5f;
}
return (float)u * band;
}
return 0.0f;
}
Some test code to show functional equivalence to double.
int test(uint32_t u) {
volatile float f0 = (float) ((double)u / (1llu<<32));
volatile float f1 = ui32to0to1(u);
if (f0 != f1) {
printf("%8lX %.7e %.7e\n", (unsigned long) u, f0, f1);
return 1;
}
return 0;
}
int main(void) {
for (int i=0; i<100000000; i++) {
test(rand()*65535u ^ rand());
}
return 0;
}
Various optimizations are possible, especially with assuming properties of float. Yet for an initial answer, I'll stick to a general approach.
For improved efficiency, the loop needs only to iterate from 32 down to FLT_MANT_DIG which is usually 24.
float ui32to0to1(uint32_t u) {
float band = 1.0f/(1llu<<32);
for (int i = 32; (i>FLT_MANT_DIG && ((u & 0x80000000) == 0)); i--) {
u <<= 1;
band *= 0.5f;
}
return (float)u * band;
}
This answer maps [0 to 2^32-1] to [0.0 to 1.0).
To map [0 to 2^32-1] to (-1.0 to 1.0) instead, wrap it as below. Note it can form -0.0.
float ui32toM1to1(uint32_t u) { // wrapper added for completeness; the name is illustrative
if (u >= 0x80000000) {
return ui32to0to1((u - 0x80000000)*2);
} else
return -ui32to0to1((0x7FFFFFFF - u)*2);
}

How can I add together two SSE registers

I have two SSE registers (128 bits is one register) and I want to add them up. I know how I can add corresponding words in them, for example I can do it with _mm_add_epi16 if I use 16bit words in registers, but what I want is something like _mm_add_epi128 (which does not exist), which would use register as one big word.
Is there any way to perform this operation, even if multiple instructions are needed?
I was thinking about using _mm_add_epi64, detecting overflow in the right word and then adding 1 to the left word in register if needed, but I would also like this approach to work for 256bit registers (AVX2), and this approach seems too complicated for that.
To add two 128-bit numbers x and y to give z with SSE you can do it like this
z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);
This is based on this link how-can-i-add-and-subtract-128-bit-integers-in-c-or-c.
The function unsigned_lessthan is defined below. It's complicated without AMD XOP (actually I found a simpler version for SSE4.2 if XOP is not available - see the end of my answer). Probably some of the other people here can suggest a better method. Here is some code showing this works.
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
inline __m128i unsigned_lessthan(__m128i a, __m128i b) {
#ifdef __XOP__ // AMD XOP instruction set
return _mm_comgt_epu64(b,a);
#else // SSE2 instruction set
__m128i sign32 = _mm_set1_epi32(0x80000000); // sign bit of each dword
__m128i aflip = _mm_xor_si128(b,sign32); // a with sign bits flipped
__m128i bflip = _mm_xor_si128(a,sign32); // b with sign bits flipped
__m128i equal = _mm_cmpeq_epi32(b,a); // a == b, dwords
__m128i bigger = _mm_cmpgt_epi32(aflip,bflip); // a > b, dwords
__m128i biggerl = _mm_shuffle_epi32(bigger,0xA0); // a > b, low dwords copied to high dwords
__m128i eqbig = _mm_and_si128(equal,biggerl); // high part equal and low part bigger
__m128i hibig = _mm_or_si128(bigger,eqbig); // high part bigger or high part equal and low part
__m128i big = _mm_shuffle_epi32(hibig,0xF5); // result copied to low part
return big;
#endif
}
int main() {
__m128i x,y,z,c;
x = _mm_set_epi64x(3,0xffffffffffffffffll);
y = _mm_set_epi64x(1,0x2ll);
z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);
int out[4];
//int64_t out[2];
_mm_storeu_si128((__m128i*)out, z);
printf("%d %d\n", out[2], out[0]);
}
Edit:
The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is with XOP. The only option with AVX would be XOP2 which does not exist yet. And even if you have XOP it may only be efficient to add two 128-bit or 256-numbers in parallel (you could do four with AVX if XOP2 existed) to avoid the horizontal instructions such as mm_unpacklo_epi64.
The best solution in general is to push the registers onto the stack and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4 you can add them like this:
__m256i x4, y4, z4;
uint64_t x[4], y[4], z[4];
_mm256_storeu_si256((__m256i*)x, x4);
_mm256_storeu_si256((__m256i*)y, y4);
add_u256(x,y,z);
z4 = _mm256_loadu_si256((__m256i*)z);
void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
uint64_t c1 = 0, c2 = 0, tmp;
//add low 128-bits
z[0] = x[0] + y[0];
z[1] = x[1] + y[1];
c1 += z[1]<x[1];
tmp = z[1];
z[1] += z[0]<x[0];
c1 += z[1]<tmp;
//add high 128-bits + carry from low 128-bits
z[2] = x[2] + y[2];
c2 += z[2]<x[2];
tmp = z[2];
z[2] += c1;
c2 += z[2]<tmp;
z[3] = x[3] + y[3] + c2;
}
int main() {
uint64_t x[4], y[4], z[4];
x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
y[0] = 1; y[1] = 1; y[2] = 1; y[3] = 1;
//z = x + y (x3,x2,x1,x0) = (2,3,1,0)
//x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
//y[0] = 1; y[1] = 0; y[2] = 1; y[3] = 1;
//z = x + y (x3,x2,x1,x0) = (2,3,0,0)
add_u256(x,y,z);
for(int i=3; i>=0; i--) printf("%llu ", (unsigned long long)z[i]); printf("\n");
}
Edit: based on a comment by Stephen Canon at saturated-substraction-avx-or-sse4-2 I discovered there is a more efficient way to compare unsigned 64-bit numbers with SSE4.2 if XOP is not available.
__m128i a,b;
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
__m128i aflip = _mm_xor_si128(a, sign64);
__m128i bflip = _mm_xor_si128(b, sign64);
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);
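Putting that together with the addition at the top of this answer, a full 128-bit add without XOP could look like this (my own sketch, assuming SSE4.2 for _mm_cmpgt_epi64):
#include <stdint.h>
#include <x86intrin.h>

__m128i add_epi128(__m128i x, __m128i y) {
    __m128i sign64 = _mm_set1_epi64x(INT64_MIN);             // 0x8000000000000000
    __m128i z = _mm_add_epi64(x, y);
    // carry out of the low 64 bits: z_lo < x_lo (unsigned), done as a signed compare after flipping the sign bits
    __m128i carry = _mm_cmpgt_epi64(_mm_xor_si128(x, sign64), _mm_xor_si128(z, sign64));
    carry = _mm_unpacklo_epi64(_mm_setzero_si128(), carry);  // keep only the low-lane mask, moved to the high lane
    return _mm_sub_epi64(z, carry);                          // subtracting -1 in the high lane adds the carry
}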