What is the best way to check whether a AVX intrinsic __m256 (vector of 8 float) contains any inf? I tried
__m256 X=_mm256_set1_ps(1.0f/0.0f);
_mm256_cmp_ps(X,X,_CMP_EQ_OQ);
but this compares to true. Note that this method will find nan (which compare to false). So one way is to check for X!=nan && 0*X==nan:
__m256 Y=_mm256_mul_ps(X,_mm256_setzero_ps()); // 0*X=nan if X=inf
_mm256_andnot_ps(_mm256_cmp_ps(Y,Y,_CMP_EQ_OQ),
_mm256_cmp_ps(X,X,_CMP_EQ_OQ));
However, this appears somewhat lengthy. Is there a faster way?
If you want to check if a vector has any infinities:
#include <limits>
bool has_infinity(__m256 x){
const __m256 SIGN_MASK = _mm256_set1_ps(-0.0);
const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity());
x = _mm256_andnot_ps(SIGN_MASK, x);
x = _mm256_cmp_ps(x, INF, _CMP_EQ_OQ);
return _mm256_movemask_ps(x) != 0;
}
If you want a vector mask of the values that are infinity:
#include <limits>
__m256 is_infinity(__m256 x){
const __m256 SIGN_MASK = _mm256_set1_ps(-0.0);
const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity());
x = _mm256_andnot_ps(SIGN_MASK, x);
x = _mm256_cmp_ps(x, INF, _CMP_EQ_OQ);
return x;
}
I think a better solution is to use vptest rather than vmovmskps.
bool has_infinity(const __m256 &x) {
__m256 s = _mm256_andnot_ps(_mm256_set1_ps(-0.0), x);
__m256 cmp = _mm256_cmp_ps(s,_mm256_set1_ps(1.0f/0.0f),0);
__m256i cmpi = _mm256_castps_si256(cmp);
return !_mm256_testz_si256(cmpi,cmpi);
}
The intrinsic _mm256_castps_si256 is only to make the compiler happy "This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency."
vptest is superior to vmovmskps because it sets the zero flag while vmovmskps does not. With vmovmskps the compiler has to generate test to set the zero flag.
This short one tests whether any of floats in a vector are corrupted (NAN or INFINITY) or not:
int is_corrupted( const __m256 & float_v8 ) {
__m256 self_sub_v8 = _mm256_sub_ps( float_v8, float_v8 );
return _mm256_movemask_epi8( _mm256_castps_si256( self_sub_v8 ) );
}
It's 2 AVX2 instructions only without additional constants, and uses a trick -- any normal "self"-subtraction should end as a zero, so a movemask_epi8 later should extract some of its bits indicating whether it's a zero or a NAN/INFINITY. I haven't tested it on different platforms.
Edit: see Peter's important comments on rounding toward negative.
If you don't mind also detecting NaNs, i.e. to check for numbers that aren't finite, see #gox's answer suggesting subtraction from itself (producing +0.0 in the default rounding mode for finite inputs, else NaN) and then using _mm256_movemask_epi8 to take one bit from each byte, including one from the exponent which will be non-zero for NaNs, or zero for 0.0. Testing movemask & 0x77777777 would let you ignore the sign bit so it works even with FP rounding mode = roundTowardNegative where x-x gives -0.0
If you need to detect infinity specifically, not also NaN
AVX-512F+VL has _mm256_fpclass_ps_mask + _kortestz_mask16_u8. But without AVX-512, it might be most efficient to use AVX2 integer stuff on the bit-pattern.
The IEEE binary32 bit-pattern for infinity is an all-ones exponent field and an all-zero mantissa. And the sign bit indicates whether it's + or - infinity. (NaN is the same exponent but a non-zero mantissa) So there are 2 bit-patterns we want to detect, which differ only in the high bit.
We can do this using AVX2 integer shift + cmpeq operations with only one vector constant, with lower latency than vcmpps even accounting for the bypass latency if the input came from an FP math instruction. And potentially a throughput benefit, as vpslld and/or vpcmpeqd can run on different ports than FP math/compare instructions on some CPUs. (Using a bitwise AND, ANDN, or OR to force the sign bit to a known state, clear or set, could further help with bypass latency on some CPUs, and be even better for throughput, able to execute on a wider choice of back-end execution units on more CPUs.)
(https://uops.info/ / https://agner.org/optimize/)
You could do this with integer operations, like left-shift by 1 to remove the sign bit, then _mm256_cmpeq_epi32 against set1_epi32(0xff000000) (the bit pattern for infinity, left-shifted by 1. All bits set in the exponent, all bits clear in the mantissa, otherwise it's a NaN). Then you'd only need one constant, and the lower latency of integer compare should make up for the possible bypass latency.
int has_infinity_avx2(__m256 v)
{
__m256i bits = _mm256_castps_si256(v);
bits = _mm256_slli_epi32(bits, 1); // shift out sign bits. Requires AVX2
bits = _mm256_cmpeq_epi32(bits, _mm256_set1_epi32(0xff000000)); // infinity << 1
return _mm256_movemask_epi8(bits);
// or cast for _mm256_movemask_ps if you want to std::countr_zero to find out where in terms of elements instead of byte offsets
}
I had an earlier idea, but it ends up only helping if you want to test for ALL elements being infinite. Oops.
With AVX2, you can test for all elements being infinity with PTEST. I got this idea for using xor to compare for equality from EOF's comment on this question, which I used for my answer there. I thought I was going to be able to make a shorter version of a test-for-any-inf, but of course pxor only works as a test for all 256b being equal.
#include <limits>
bool all_infinity(__m256 x){
const __m256i SIGN_MASK = _mm256_set1_epi32(0x7FFFFFFF); // -0.0f inverted
const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity());
x = _mm256_xor_si256(x, INF); // other than sign bit, x will be all-zero only if all the bits match.
return _mm256_testz_si256(x, SIGN_MASK); // flags are ready to branch on directly
}
With AVX512, there's a __mmask8 _mm512_fpclass_pd_mask (__m512d a, int imm8). (vfpclasspd). (See Intel's guide). Its output is a mask register, which you can branch on directly. You can test for any/all of +/- zero, +/- inf, Q/S NaN, Denormal, Negative.
Related
is there any simd that can ceil a float (round up) and than cast it to unsigned int without wrap? (i.e. any negative number become 0)?
maybe _mm_cvttps_epi32? not sure about rounding and wrap though (I think i got ub):
__m128 indices = _mm_mul_ps(phaseIncrement.v, mMinTopFreq.v);
indices = sse_mathfun_log_ps(indices) / std::log(2.0f);
__m128i finalIndices = _mm_cvttps_epi32(indices);
or any fancy way? I'm compiling with -O3 -march=nehalem -funsafe-math-optimizations -fno-omit-frame-pointer (so sse4.1 are allowed).
EDIT
here's a version of the revisited code, after Peter Cordes's suggestions:
__m128 indexesFP = _mm_mul_ps(phaseIncrement.v, mMinTopFreq.v);
indexesFP = sse_mathfun_log_ps(indexesFP) * (1.0f / std::log(2.0f));
indexesFP = _mm_ceil_ps(indexesFP);
__m128i indexes = _mm_cvttps_epi32(indexesFP);
indexes = _mm_max_epi32(indexes, _mm_set_epi32(0, 0, 0, 0));
for (int i = 0; i < 4; i++) {
int waveTableIndex = _mm_extract_epi32(indexes, i);
waveTablesData[i] = &mWaveTables[waveTableIndex];
}
any improvements that can be made?
since the range will be limited (such as [0, 16] in the extreme case
Oh, this doesn't need to work for numbers greater than INT_MAX, up to UINT_MAX? That's much easier than the problem as stated at the top of the question. Yeah just _mm_ceil_ps and use a signed conversion to epi32 (int32_t) and use _mm_min_epi32 for the upper limit, and probably _mm_max_epi32 for the lower limit. (Only one instruction instead of shift/and).
Or possibly _mm_sub_ps to range-shift to -16..0 / _mm_cvttps_epi32 to truncate (upwards towards zero), then integer subtract from zero. _mm_ceil_ps costs 2 uops on most CPUs, so that's about break-even, trading an FP operation for integer though. But requires more setup.
Integer min/max are cheaper (lower latency, and better throughput) than FP, so prefer clamp after conversion. Out-of-range floats convert to INT_MIN (high-bit set, others zero, what Intel calls the "integer indefinite" value) so will clamp to 0.
If you have a lot of this to do in a loop that doesn't do other FP computation, change the MXCSR rounding mode for this loop to round towards +Inf. Use _mm_cvtps_epi32 (which uses the current FP rounding mode, like lrint / (int)nearbyint) instead of ceil + cvtt (truncation).
This use-case: ceil(log2(float))
You could just pull that out of the FP bit pattern directly, and round up based on a non-zero mantissa. Binary floating point already contains a power-of-2 exponent field, so you just need to extract that with a bit of massaging.
Like _mm_and_ps / _mm_cmpeq_epi32 / _mm_add_epi32 to add the -1 compare result for FP values with a zero mantissa, so you treat powers of 2 differently from anything higher.
Should be faster than computing an FP log base e with a fractional part, even if it's only a quick approximation. Values smaller than 1.0 whose biased exponent is negative may need some extra handling.
Also, since you want all four indices, probably faster to just store to an array of 4 uint32_t values and access it, instead of using movd + 3x pextrd.
An even better way to round up to the next exponent for floats with a non-zero mantissa is to simply do an integer add of 0x007fffff to the bit-pattern. (23 set bits: https://en.wikipedia.org/wiki/Single-precision_floating-point_format).
// we round up the exponent separately from unbiasing.
// Might be possible to do better
__m128i ceil_log2_not_fully_optimized(__m128 v)
{
// round up to the next power of 2 (exponent value) by carry-out from mantissa field into exponent
__m128i floatbits = _mm_add_epi32(_mm_castps_si128(v), _mm_set1_epi32(0x007fffff));
__m128i exp = _mm_srai_epi32(floatbits, 23); // arithmetic shift so negative numbers stay negative
exp = _mm_sub_epi32(exp, _mm_set1_epi32(127)); // undo the bias
exp = _mm_max_epi32(exp, _mm_setzero_si128()); // clamp negative numbers to zero.
return exp;
}
If the exponent field was already all-ones, that means +Inf with an all-zero mantissa, else NaN. So it's only possible for carry propagation from the first add to flip the sign bit if the input was already NaN. +Inf gets treated as one exponent higher than FLT_MAX. 0.0 and 0.01 should both come out to 0, if I got this right.
According to GCC on Godbolt, I think so: https://godbolt.org/z/9G9orWj16 GCC doesn't fully constant-propagate through it, so we can actually see the input to pmaxsd, and see that 0.0 and 0.01 come out to max(0, -127) and max(0,-3) = 0 each. And 3.0 and 4.0 both come out to max(0, 2) = 2.
We can maybe even combine that +0x7ff... idea with adding a negative number to the exponent field to undo the bias.
Or to get carry-out into the sign bit correct, subtracting from it, with a 1 in the mantissa field so an all-zero mantissa will propagate a borrow and subtract one more from the exponent field? But small exponents less than the bias could still carry/borrow out and flip the sign bit. But that might be ok if we're going to clamp such values to zero anyway if they come out as small positive instead of negative.
I haven't worked out the details of this yet; if we need to handle the original input being zero, this might be a problem. If we can assume the original sign bit was cleared, but log(x) might be negative (i.e. exponent field below the bias), this should work just fine; carry out of the exponent field into the sign bit is exactly what we want in that case, so srai keeps it negative and max chooses the 0.
// round up to the next power of 2 (exponent value) while undoing the bias
const uint32_t f32_unbias = ((-127)<<23) + 0x007fffffU;
???
profit
With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element.
Is there an efficient way to implement this using AVX and AVX2 instructions only?
Currently I'm using a loop which extracts each element and applies the _lzcnt_u32 function.
Related: to bit-scan one large bitmap, see Count leading zeros in __m256i word which uses pmovmskb -> bitscan to find which byte to do a scalar bitscan on.
This question is about doing 8 separate lzcnts on 8 separate 32-bit elements when you're actually going to use all 8 results, not just select one.
float represents numbers in an exponential format, so int->FP conversion gives us the position of the highest set bit encoded in the exponent field.
We want int->float with magnitude rounded down (truncate the value towards 0), not the default rounding of nearest. That could round up and make 0x3FFFFFFF look like 0x40000000. If you're doing a lot of these conversions without doing any FP math, you could set the rounding mode in the MXCSR1 to truncation then set it back when you're done.
Otherwise you can use v & ~(v>>8) to keep the 8 most-significant bits and zero some or all lower bits, including a potentially-set bit 8 below the MSB. That's enough to ensure all rounding modes never round up to the next power of two. It always keeps the 8 MSB because v>>8 shifts in 8 zeros, so inverted that's 8 ones. At lower bit positions, wherever the MSB is, 8 zeros are shifted past there from higher positions, so it will never clear the most significant bit of any integer. Depending on how set bits below the MSB line up, it might or might not clear more below the 8 most significant.
After conversion, we use an integer shift on the bit-pattern to bring the exponent (and sign bit) to the bottom and undo the bias with a saturating subtract. We use min to set the result to 32 if no bits were set in the original 32-bit input.
__m256i avx2_lzcnt_epi32 (__m256i v) {
// prevent value from being rounded up to the next power of two
v = _mm256_andnot_si256(_mm256_srli_epi32(v, 8), v); // keep 8 MSB
v = _mm256_castps_si256(_mm256_cvtepi32_ps(v)); // convert an integer to float
v = _mm256_srli_epi32(v, 23); // shift down the exponent
v = _mm256_subs_epu16(_mm256_set1_epi32(158), v); // undo bias
v = _mm256_min_epi16(v, _mm256_set1_epi32(32)); // clamp at 32
return v;
}
Footnote 1: fp->int conversion is available with truncation (cvtt), but int->fp conversion is only available with default rounding (subject to MXCSR).
AVX512F introduces rounding-mode overrides for 512-bit vectors which would solve the problem, __m512 _mm512_cvt_roundepi32_ps( __m512i a, int r);. But all CPUs with AVX512F also support AVX512CD so you could just use _mm512_lzcnt_epi32. And with AVX512VL, _mm256_lzcnt_epi32
#aqrit's answer looks like a more-clever use of FP bithacks. My answer below is based on the first place I looked for a bithack which was old and aimed at scalar so it didn't try to avoid double (which is wider than int32 and thus a problem for SIMD).
It uses HW signed int->float conversion and saturating integer subtracts to handle the MSB being set (negative float), instead of stuffing bits into a mantissa for manual uint->double. If you can set MXCSR to round down across a lot of these _mm256_lzcnt_epi32, that's even more efficient.
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogIEEE64Float suggests stuffing integers into the mantissa of a large double, then subtracting to get the FPU hardware to get a normalized double. (I think this bit of magic is doing uint32_t -> double, with the technique #Mysticial explains in How to efficiently perform double/int64 conversions with SSE/AVX? (which works for uint64_t up to 252-1)
Then grab the exponent bits of the double and undo the bias.
I think integer log2 is the same thing as lzcnt, but there might be an off-by-1 at powers of 2.
The Standford Graphics bithack page lists other branchless bithacks you could use that would probably still be better than 8x scalar lzcnt.
If you knew your numbers were always small-ish (like less than 2^23) you could maybe do this with float and avoid splitting and blending.
int v; // 32-bit integer to find the log base 2 of
int r; // result of log_2(v) goes here
union { unsigned int u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = v;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
The code above loads a 64-bit (IEEE-754 floating-point) double with a 32-bit integer (with no paddding bits) by storing the integer in the mantissa while the exponent is set to 252. From this newly minted double, 252 (expressed as a double) is subtracted, which sets the resulting exponent to the log base 2 of the input value, v. All that is left is shifting the exponent bits into position (20 bits right) and subtracting the bias, 0x3FF (which is 1023 decimal).
To do this with AVX2, blend and shift+blend odd/even halves with set1_epi32(0x43300000) and _mm256_castps_pd to get a __m256d. And after subtracting, _mm256_castpd_si256 and shift / blend the low/high halves into place then mask to get the exponents.
Doing integer operations on FP bit-patterns is very efficient with AVX2, just 1 cycle of extra latency for a bypass delay when doing integer shifts on the output of an FP math instruction.
(TODO: write it with C++ intrinsics, edit welcome or someone else could just post it as an answer.)
I'm not sure if you can do anything with int -> double conversion and then reading the exponent field. Negative numbers have no leading zeros and positive numbers give an exponent that depends on the magnitude.
If you did want that, you'd go one 128-bit lane at a time, shuffling to feed xmm -> ymm packed int32_t -> packed double conversion.
The question is also tagged AVX, but there are no instructions for integer processing in AVX, which means one needs to fall back to SSE on platforms that support AVX but not AVX2. I am showing an exhaustively tested, but a bit pedestrian version below. The basic idea here is as in the other answers, in that the count of leading zeros is determined by the floating-point normalization that occurs during integer to floating-point conversion. The exponent of the result has a one-to-one correspondence with the count of leading zeros, except that the result is wrong in the case of an argument of zero. Conceptually:
clz (a) = (158 - (float_as_uint32 (uint32_to_float_rz (a)) >> 23)) + (a == 0)
where float_as_uint32() is a re-interpreting cast and uint32_to_float_rz() is a conversion from unsigned integer to floating-point with truncation. A normal, rounding, conversion could bump up the conversion result to the next power of two, resulting in an incorrect count of leading zero bits.
SSE does not provide truncating integer to floating-point conversion as a single instruction, nor conversions from unsigned integers. This functionality needs to be emulated. The emulation does not need to be exact, as long as it does not change the magnitude of the conversion result. The truncation part is handled by the invert - right shift - andn technique from aqrit's answer. To use signed conversion, we cut the number in half before the conversion, then double and increment after the conversion:
float approximate_uint32_to_float_rz (uint32_t a)
{
float r = (float)(int)((a >> 1) & ~(a >> 2));
return r + r + 1.0f;
}
This approach is translated into SSE intrinsics in sse_clz() below.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "immintrin.h"
/* compute count of leading zero bits using floating-point normalization.
clz(a) = (158 - (float_as_uint32 (uint32_to_float_rz (a)) >> 23)) + (a == 0)
The problematic part here is uint32_to_float_rz(). SSE does not offer
conversion of unsigned integers, and no rounding modes in integer to
floating-point conversion. Since all we need is an approximate version
that preserves order of magnitude:
float approximate_uint32_to_float_rz (uint32_t a)
{
float r = (float)(int)((a >> 1) & ~(a >> 2));
return r + r + 1.0f;
}
*/
__m128i sse_clz (__m128i a)
{
__m128 fp1 = _mm_set_ps1 (1.0f);
__m128i zero = _mm_set1_epi32 (0);
__m128i i158 = _mm_set1_epi32 (158);
__m128i iszero = _mm_cmpeq_epi32 (a, zero);
__m128i lsr1 = _mm_srli_epi32 (a, 1);
__m128i lsr2 = _mm_srli_epi32 (a, 2);
__m128i atrunc = _mm_andnot_si128 (lsr2, lsr1);
__m128 atruncf = _mm_cvtepi32_ps (atrunc);
__m128 atruncf2 = _mm_add_ps (atruncf, atruncf);
__m128 conv = _mm_add_ps (atruncf2, fp1);
__m128i convi = _mm_castps_si128 (conv);
__m128i lsr23 = _mm_srli_epi32 (convi, 23);
__m128i res = _mm_sub_epi32 (i158, lsr23);
return _mm_sub_epi32 (res, iszero);
}
/* Portable reference implementation of 32-bit count of leading zeros */
int clz32 (uint32_t a)
{
uint32_t r = 32;
if (a >= 0x00010000) { a >>= 16; r -= 16; }
if (a >= 0x00000100) { a >>= 8; r -= 8; }
if (a >= 0x00000010) { a >>= 4; r -= 4; }
if (a >= 0x00000004) { a >>= 2; r -= 2; }
r -= a - (a & (a >> 1));
return r;
}
/* Test floating-point based count leading zeros exhaustively */
int main (void)
{
__m128i res;
uint32_t resi[4], refi[4];
uint32_t count = 0;
do {
refi[0] = clz32 (count);
refi[1] = clz32 (count + 1);
refi[2] = clz32 (count + 2);
refi[3] = clz32 (count + 3);
res = sse_clz (_mm_set_epi32 (count + 3, count + 2, count + 1, count));
memcpy (resi, &res, sizeof resi);
if ((resi[0] != refi[0]) || (resi[1] != refi[1]) ||
(resi[2] != refi[2]) || (resi[3] != refi[3])) {
printf ("error # %08x %08x %08x %08x\n",
count, count+1, count+2, count+3);
return EXIT_FAILURE;
}
count += 4;
} while (count);
return EXIT_SUCCESS;
}
In runtime I have 2 ranges defined by their uint32_t borders a..b and c..d. The first range tends to be much greater than the second: 8 < (b - a) / (d - c) < 64.
Exact limits: a >= 0, b <= 2^31 - 1, c >= 0, d <= 2^20 - 1.
I need a routine that performs linear mapping of an integer from the first range onto the second one: f(uint32_t x) -> round_to_uint32_t((float)(x - a) / (b - a) * (d - c) + c).
When b - a >= d - c it is important to mantain the ratio as close to ideal as possible, otherwise in cases when element from [a; b] can be mapped on more than one integer from [c; d] it is okay to return any of these integers.
Sounds like a simple ratio problem and was already answered in many questions like
Convert a number range to another range, maintaining ratio
but here I need a really really fast solution.
This routine is a pivotal part of a specialized sorting algorithm and will be called at least once for every element of a sorted array.
SIMD solution is also acceptable if it doesn't drop overall performance.
Actual runtime division (FP and integer) is very slow so you definitely want to avoid that. The way you wrote that expression probably compiles to include a division because FP math is not associative (without -ffast-math); the compiler can't turn x / foo * bar into x * (bar/foo) for you, even though that's very good with loop-invariant bar/foo. You do need either floating point or 64-bit integers to avoid overflow in a multiply, but only FP lets you reuse a non-integer loop-invariant division result.
_mm256_fmadd_ps looks like the obvious way to go, with a pre-computed loop-invariant value for the multiplier (d - c) / (b - a). If float rounding isn't a problem for doing it strictly in order (multiply then divide), it's probably ok to do this inexact division first, outside the loop. Like
_mm256_set1_ps((d - c) / (double)(b - a)). Using double for this calculation avoids rounding error during conversion to FP of the division operands.
You're reusing the same a,b,c,d for many x, presumably coming from contiguous memory. You're using the result as part of a memory address so you do eventually need the results back from SIMD into integer registers, unfortunately. (Possibly with AVX512 scatter stores you could avoid that.)
Modern x86 CPUs have 2/clock load throughput so probably your best bet for getting 8x uint32_t back into integer registers is a vector store / integer reload, instead of spending 2 uops per element for ALU shuffle stuff. That has some latency so I'd suggest converting into a tmp buffer of maybe 16 or 32 ints (64 or 128 bytes), i.e. 2x or 4x __m256i before looping through that scalar.
Or maybe alternate converting and storing one vector then looping over the 8 elements of another one that you converted earlier. i.e. software pipelining. Out-of-order execution can hide latency but you're already going to be stretching its latency-hiding capability for cache misses for whatever you're doing with memory.
Depending on your CPU (e.g. Haswell or some Skylake), using 256-bit vector instructions might cap your max turbo slightly lower than it would otherwise. You might consider only doing vectors of 4 at once but then you're spending more uops per element.
If not SIMD, then even scalar C++ fma() is still good, for vfmadd213sd, but using intrinsics is a very convenient way to get rounding (instead of truncation) from float -> int (vcvtps2dq rather than vcvttps2dq).
Note that uint32_t <-> float conversion isn't directly available until AVX512. For scalar you can just convert to/from int64_t with truncation / zero-extension for the unsigned low half.
It's very convenient that (as discussed in comments) your inputs are range-limited so if you interpret them as signed integers they have the same value (signed non-negative). Both x and x-a (and b-a) are known to be positive and <= INT32_MAX i.e 0x7FFFFFFF. (Or at least non-negative. Zero is fine.)
Float Rounding
For SIMD, single-precision float is very good for SIMD throughput. Efficient packed-conversion to/from signed int32_t. But not every int32_t can be exactly represented as a float. Larger values get rounded to the nearest even, nearest multiple of 2^2, 2^3, or more the farther above 2^24 the value is.
Using SIMD double is possible but requires some shuffling.
I don't think float is usually a problem for the formula as-written with (float)(x-a). If the b-a input range is large, that means both ranges are large and rounding error isn't going to map all possible x values into the same output. Depending on the multiplier, the input rounding error might be worse than the output rounding error, maybe leaving some representable output floats unused for higher x-a values.
But if we want to factor out the -a * (d - c) / (b - a) part and combine it with the +c at the end, then
We potentially have precision loss from catastrophic cancellation in that value to be added.
We need to do (float)x on the raw input value. If a is huge and b-a is small, i.e. a small range near the top of the possible input range, rounding error can map all possible x values to the same float.
To make best use of FMA, we want to do the +c before converting back to integer, which again risks output rounding error if the d-c is a small output range but c is huge. In your case not a problem; with d <= 2^20 - 1 we know that float can exactly represent every output integer value in that c..d range.
If you didn't have the input range constraint, you could range-shift to/from signed before the scaling by using integer (x-a)+0x80000000U on input and ...+c+0x80000000U on output (after rounding to nearest int32_t). But that would introduce huge float rounding error for small uint32_t inputs (close to 0) which get range-shifted to close to INT_MIN.
We don't need to range-shift for the b-a or d-c because the + or - or XOR with 0x80000000U would cancel out in the subtractions.
Example:
The const vectors should be hoisted out of a loop by the compiler after this inlines,
or you can do that manually.
This requires AVX1 + FMA (e.g. AMD Piledriver or Intel Haswell or later). Untested, sorry I didn't even throw this on Godbolt to see if it compiles.
// fastest but not safe if b-a is small and a > 2^24
static inline
__m256i range_scale_fast_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
// avoid rounding errors when computing the scale factor, but convert double->float on the final result
double scale_scalar = (d - c) / (double)(b - a);
const __m256 scale = _mm256_set1_ps(scale_scalar);
const __m256 add = _m256_set1_ps(-a*scale_scalar + c);
// (x-a) * scale + c
// = x * scale + (-a*scale + c) but with different rounding error from doing -a*scale + c
__m256 in = _mm256_cvtepi32_ps(data);
__m256 out = _mm256_fmadd_ps(in, scale, add);
return _mm256_cvtps_epi32(out); // convert back with round to nearest-even
// _mm256_cvttps_epi32 truncates, matching C rounding; maybe good for scalar testing
}
Or a safer version, doing the input range-shift with integer: You could easily avoid FMA here if necessary for portability (just AVX1) and use an integer add for the output, too. But we know the output range is small enough that it can always exactly represent any integer
static inline
__m256i range_scale_safe_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
// avoid rounding errors when computing the scale factor, but convert double->float on the final result
const __m256 scale = _mm256_set1_ps((d - c) / (double)(b - a));
const __m256 cvec = _m256_set1_ps(c);
__m256i in_offset = _mm256_add_epi32(data, _mm256_set1_epi32(-a)); // add can more easily fold a load of a memory operand than sub because it's commutative. Only some compilers will do this for you.
__m256 in_fp = _mm256_cvtepi32_ps(in_offset);
__m256 out = _mm256_fmadd_ps(in_fp, scale, _mm256_set1_ps(c)); // in*scale + c
return _mm256_cvtps_epi32(out);
}
Without FMA you could still use vmulps. You might as well convert back to integer before adding c if you're doing that, although vaddps would be safe.
You might use this in a loop like
void foo(uint32_t *arr, ptrdiff_t len)
{
if (len < 24) special case;
alignas(32) uint32_t tmpbuf[16];
// peel half of first iteration for software pipelining / loop rotation
__m256i arrdata = _mm256_loadu_si256((const __m256i*)&arr[0]);
__m256i outrange = range_scale_safe_fma(arrdata);
_mm256_store_si256((__m256i*)tmpbuf, outrange);
// could have used an unsigned loop counter
// since we probably just need an if() special case handler anyway for small len which could give len-23 < 0
for (ptrdiff_t i = 0 ; i < len-(15+8) ; i+=16 ) {
// prep next 8 elements
arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+8]);
outrange = range_scale_safe_fma(arrdata);
_mm256_store_si256((__m256i*)&tmpbuf[8], outrange);
// use first 8 elements
for (int j=0 ; j<8 ; j++) {
use tmpbuf[j] which corresponds to arr[i+j]
}
// prep 8 more for next iteration
arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+16]);
outrange = range_scale_safe_fma(arrdata);
_mm256_store_si256((__m256i*)&tmpbuf[0], outrange);
// use 2nd 8 elements
for (int j=8 ; j<16 ; j++) {
use tmpbuf[j] which corresponds to arr[i+j]
}
}
// use tmpbuf[0..7]
// then cleanup: one vector at a time until < 8 or < 4 with 128-bit vectors, then scalar
}
These variable-names sound dumb but I couldn't think of anything better.
This software pipelining is an optimization; you can just get it working / try it out with a single vector at a time used right away. (Optimize the reload of the first element from a reload to vmovd using _mm_cvtsi128_si32(_mm256_castsi256_si128(outrange)) if you want.)
Special cases
If there cases where you know (b - a) is a power of 2, you could bitscan with tzcnt or bsf, then multiply. (There are intrinsics for those, like GNU C __builtin_ctz() to count trailing zeros.)
Or can you ensure that (b - a) is always a power of 2?
Or better, if (b - a) / (d - c) is an exact power of 2 the whole thing can just be sub / right shift / add.
If you can't always ensure that you'd still need the general case sometimes, but maybe possible to do that efficiently.
I am writing a unit test for a math function and I would like to be able to "walk" all possible floats/doubles.
Due to IEEE shenanigans, floating types cannot be incremented (++) at their extremities. See this question for more details. That answer states :
one can only add multiples of 2^(n-N)
But never mentions what little n is.
A solution to iterate all possible values from +0.0 to +infinity is given in this great blog post. The technique involves using a union with an int to walk the different values of a float. This works due to the following properties explained in the post, though they are only valid for positive numbers.
Adjacent floats have adjacent integer representations
Incrementing the integer representation of a float moves to the next representable float, moving away from zero
His solution for +0.0 to +infinity (0.f to std::numeric_limits<float>::max()) :
union Float_t {
int32_t RawExponent() const { return (i >> 23) & 0xFF; }
int32_t i;
float f;
};
Float_t allFloats;
allFloats.f = 0.0f;
while (allFloats.RawExponent() < 255) {
allFloats.i += 1;
}
Is there a solution for -infinity to +0.0 (std::numeric_limits<float>::lowest() to 0.f)?
I've tested std::nextafter and std::nexttoward and couldn't get them to work. Maybe this is an MSVC issue?
I would be ok with any sort of hack since this is a unit test. Thanks!
You can walk all 32-bit bit representations by using all values of a 32-bit unsigned int. Then you will walk really all representations, positive and negative, including both nulls (there are two) and also all the not a number representations (NaN). You may or may not want to filter out the NaN representations, or just filter out the signaling ones and leave the non signaling ones in. This depends on your use case.
Example:
for (uint32_t i = 0;;)
{
float f;
// Type punning: Force the bit representation of i into f.
// Type punning is hard because mostly undefined in C/C++.
// Using memcpy() usually avoids any type punning warning.
memcpy(&f, &i, sizeof(f));
// Use f here.
// Warning: Using signaling NaNs may throw exceptions or raise signals.
i++;
if (i == 0)
break;
}
Instead you can also walk a 32-bit int from -2**31 to +(2**31-1). This makes no difference.
Pascal Cuoq correctly points out std::nextafter is the right solution. I had a problem elsewhere in my code. Sorry for the unnecessary question.
#include <cassert>
#include <cmath>
#include <limits>
float i = std::numeric_limits<float>::lowest();
float hi = std::numeric_limits<float>::max();
float new_i = std::nextafterf(i, hi);
assert(i != new_i);
double d = std::numeric_limits<double>::lowest();
double hi_d = std::numeric_limits<double>::max();
double new_d = std::nextafter(d, hi_d);
assert(d != new_d);
long double ld = std::numeric_limits<long double>::lowest();
long double hi_ld = std::numeric_limits<long double>::max();
long double new_ld = std::nextafterl(ld, hi_ld);
assert(ld != new_ld);
for (float d = std::numeric_limits<float>::lowest();
d < std::numeric_limits<float>::max();
d = std::nextafterf(
d, std::numeric_limits<float>::max())) {
// Wait a lifetime?
}
Iterating through all the float values can be done with simple understanding of the floating-point representation:
The distance between consecutive subnormal values is the minimum normal times the “epsilon”. Simply iterate through all the subnormals using this distance as an increment.
The distance between the normal values at the lowest exponent is the same. Step through them with the same increment.
For each exponent, the distance increases according to the floating-point radix. Simply multiply the increment by the radix and step through all the values for the next exponent.
Repeat until infinity is reached.
Observe that the inner loop in the code below is simply:
for (; x < Limit; x += Increment)
Test(x);
This has the advantage that only normal floating-point arithmetic is used. The inner loop contains only one addition and one comparison (plus any tests you want to perform with each number). No library functions are called in the loop, no representations are dissected or copied to general registers or otherwise manipulated. There is nothing to impede performance.
This code steps through only the non-negative numbers. The negative numbers can be tested separately in the same way or can share this code by inserting a call Test(-x).
#include <limits>
static void Test(float x)
{
// Insert unit test for value x here.
}
int main(void)
{
typedef float T;
static const int Radix = std::numeric_limits<T>::radix;
static const T Infinity = std::numeric_limits<T>::infinity();
/* Increment is the current distance between floating-point numbers. We
start it at distance between subnormal numbers.
*/
T Increment =
std::numeric_limits<T>::min() * std::numeric_limits<T>::epsilon();
/* Limit is the next boundary where the distance between floating-point
numbers changes. We will increment up to that limit and then adjust
the limit and increment. We start it at the top of the first set of
normals, which allows the first loop to increment first through the
subnormals and then through the normals with the lowest exponent.
(These two sets have the same step size between adjacent values.)
*/
T Limit = std::numeric_limits<T>::min() * Radix;
/* Start with zero and continue until we reach infinity.
We execute an inner loop that iterates through all the significands of
one floating-point exponent. Each time it completes, we step up the
limit and increment.
*/
for (T x = 0; x < Infinity; Limit *= Radix, Increment *= Radix)
// Increment x through all the significands with the current exponent.
for (; x < Limit; x += Increment)
// Test with the current value of x.
Test(x);
// Also test infinity.
Test(Infinity);
}
(This code assumes the floating-point type has subnormals, and that they are not flushed to zero. The code can be readily adjusted to support these alternatives as well.)
I want to calculate y = ax + b, where x and y is a pixel value [i.e, byte with value range is 0~255], while a and b is a float
Since I need to apply this formula for each pixel in image, in addition, a and b is different for different pixel. Direct calculation in C++ is slow, so I am kind of interest to know the sse2 instruction in c++..
After searching, I find that the multiplication and addition in float with sse2 is just as _mm_mul_ps and _mm_add_ps. But in the first place I need to convert the x in byte to float (4 byte).
The question is, after I load the data from byte-data source (_mm_load_si128), how can I convert the data from byte to float?
a and b are different for each pixel? That's going to make it difficult to vectorize, unless there's a pattern or you can generate them in vectors.
Is there any way you can efficiently generate a and b in vectors, either as fixed-point or floating point? If not, inserting 4 FP values, or 8 16bit integers, might be worse than just scalar ops.
Fixed point
If a and b can be reused at all, or generated with fixed-point in the first place, this might be a good use-case for fixed-point math. (i.e. integers that represent value * 2^scale). SSE/AVX don't have a 8b*8b->16b multiply; the smallest elements are words, so you have to unpack bytes to words, but not all the way to 32bit. This means you can process twice as much data per instruction.
There's a _mm_maddubs_epi16 instruction which might be useful if b and a change infrequently enough, or you can easily generate a vector with alternating a2^4 and b2^1 bytes. Apparently it's really handy for bilinear interpolation, but it still gets the job done for us with minimal shuffling, if we can prepare an a and b vector.
float a, b;
const int logascale = 4, logbscale=1;
const int ascale = 1<<logascale; // fixed point scale for a: 2^4
const int bscale = 1<<logbscale; // fixed point scale for b: 2^1
const __m128i brescale = _mm_set1_epi8(1<<(logascale-logbscale)); // re-scale b to match a in the 16bit temporary result
for (i=0 ; i<n; i+=16) {
//__m128i avec = get_scaled_a(i);
//__m128i bvec = get_scaled_b(i);
//__m128i ab_lo = _mm_unpacklo_epi8(avec, bvec);
//__m128i ab_hi = _mm_unpackhi_epi8(avec, bvec);
__m128i abvec = _mm_set1_epi16( ((int8_t)(bscale*b) << 8) | (int8_t)(ascale*a) ); // integer promotion rules might do sign-extension in the wrong place here, so check this if you actually write it this way.
__m128i block = _mm_load_si128(&buf[i]); // call this { v[0] .. v[15] }
__m128i lo = _mm_unpacklo_epi8(block, brescale); // {v[0], 8, v[1], 8, ...}
__m128i hi = _mm_unpackhi_epi8(block, brescale); // {v[8], 8, v[9], 8, ...
lo = _mm_maddubs_epi16(lo, abvec); // first arg is unsigned bytes, 2nd arg is signed bytes
hi = _mm_maddubs_epi16(hi, abvec);
// lo = { v[0]*(2^4*a) + 8*(2^1*b), ... }
lo = _mm_srli_epi16(lo, logascale); // truncate from scaled fixed-point to integer
hi = _mm_srli_epi16(hi, logascale);
// and re-pack. Logical, not arithmetic right shift means sign bits can't be set
block = _mm_packuswb(lo, hi);
_mm_store_si128(&buf[i], block);
}
// then a scalar cleanup loop
2^4 is an arbitrary choice. It leaves 3 non-sign bits for the integer part of a, and 4 fraction bits. So it effectively rounds a to the nearest 16th, and overflows if it has a magnitude greater than 8 and 15/16ths. 2^6 would give more fractional bits, and allow a from -2 to +1 and 63/64ths.
Since b is being added, not multiplied, its useful range is much larger, and fractional part much less useful. To represent it in 8 bits, rounding it to the nearest half still keeps a little bit of fractional information, but allows it to be [-64 : 63.5] without overflowing.
For more precision, 16b fixed-point is a good choice. You can scale a and b up by 2^7 or something, to have 7b of fractional precision and still allow the integer part to be [-256 .. 255]. There's no multiply-and-add instruction for this case, so you'd have to do that separately. Good options for doing the multiply include:
_mm_mulhi_epu16: unsigned 16b*16b->high16 (bits [31:16]). Useful if a can't be negative
_mm_mulhi_epi16: signed 16b*16b->high16 (bits [31:16]).
_mm_mulhrs_epi16: signed 16b*16b->bits [30:15] of the 32b temporary, with rounding. With a good choice of scaling factor for a, this should be nicer. As I understand it, SSSE3 introduced this instruction for exactly this kind of use.
_mm_mullo_epi16: signed 16b*16b->low16 (bits [15:0]). This only allows 8 significant bits for a before the low16 result overflows, so I think all you gain over the _mm_maddubs_epi16 8bit solution is more precision for b.
To use these, you'd get scaled 16b vectors of a and b values, then:
unpack your bytes with zero (or pmovzx byte->word), to get signed words still in the [0..255] range
left shift the words by 7.
multiply by your a vector of 16b words, taking the upper half of each 16*16->32 result. (e.g. mul
right shift here if you wanted different scales for a and b, to get more fractional precision for a
add b to that.
right shift to do the final truncation back from fixed point to [0..255].
With a good choice of fixed-point scale, this should be able to handle a wider range of a and b, as well as more fractional precision, than 8bit fixed point.
If you don't left-shift your bytes after unpacking them to words, a has to be full-range just to get 8bits set in the high16 of the result. This would mean a very limited range of a that you could support without truncating your temporary to less than 8 bits during the multiply. Even _mm_mulhrs_epi16 doesn't leave much room, since it starts at bit 30.
expand bytes to floats
If you can't efficiently generate fixed-point a and b values for every pixel, it may be best to convert your pixels to floats. This takes more unpacking/repacking, so latency and throughput are worse. It's worth looking into generating a and b with fixed point.
For packed-float to work, you still have to efficiently build a vector of a values for 4 adjacent pixels.
This is a good use-case for pmovzx (SSE4.1), because it can go directly from 8b elements to 32b. The other options are SSE2 punpck[l/h]bw/punpck[l/h]wd with multiple steps, or SSSE3 pshufb to emulate pmovzx. (You can do one 16B load and shuffle it 4 different ways to unpack it to four vectors of 32b ints.)
char *buf;
// const __m128i zero = _mm_setzero_si128();
for (i=0 ; i<n; i+=16) {
__m128 a = get_a(i);
__m128 b = get_b(i);
// IDK why there isn't an intrinsic for using `pmovzx` as a load, because it takes a m32 or m64 operand, not m128. (unlike punpck*)
__m128i unsigned_dwords = _mm_cvtepu8_epi32( _mm_loadu_si32(buf+i)); // load 4B at once.
// Current GCC has a bug with _mm_loadu_si32, might want to use _mm_load_ss and _mm_castps_si128 until it's fixed.
__m128 floats = _mm_cvtepi32_ps(unsigned_dwords);
floats = _mm_fmadd_ps(floats, a, b); // with FMA available, this might as well be 256b vectors, even with the inconvenience of the different lane-crossing semantics of pmovzx vs. punpck
// or without FMA, do this with _mm_mul_ps and _mm_add_ps
unsigned_dwords = _mm_cvtps_epi32(floats);
// repeat 3 more times for buf+4, buf+8, and buf+12, then:
__m128i packed01 = _mm_packss_epi32(dwords0, dwords1); // SSE2
__m128i packed23 = _mm_packss_epi32(dwords2, dwords3);
// packuswb wants SIGNED input, so do signed saturation on the first step
// saturate into [0..255] range
__m12i8 packedbytes=_mm_packus_epi16(packed01, packed23); // SSE2
_mm_store_si128(buf+i, packedbytes); // or storeu if buf isn't aligned.
}
// cleanup code to handle the odd up-to-15 leftover bytes, if n%16 != 0
(Re: a load that can be a memory source operand for pmovzxbd, see also Loading 8 chars from memory into an __m256 variable as packed single precision floats re: the problems compilers have with this.) And see also GCC bug 99754 - wrong code for _mm_loadu_si32 - reversed vector elements.
The previous version of this answer went from float->uint8 vectors with packusdw/packuswb, and had a whole section on workarounds for without SSE4.1. None of that masking-the-sign-bit after an unsigned pack is needed if you simply stay in the signed integer domain until the last pack. I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. packuswd is only useful if your final goal is uint16_t, rather than further packing.
The last CPU to not have SSE4.1 was Intel Conroe/merom (first gen Core2, from before late 2007), and AMD pre Barcelona (before late 2007). If working-but-slow is acceptable for those CPUs, just write a version for AVX2, and a version for SSE4.1. Or SSSE3 (with 4x pshufb to emulate pmovzxbd of the four 32b elements of a register) pshufb is slow on Conroe, though, so if you care about CPUs without SSE4.1, write a specific version. Actually, Conroe/merom also has slow xmm punpcklbw and so on (except for q->dq). 4x slow pshufb should still beats 6x slow unpacks. Vectorizing is a lot less of a win on pre-Wolfdale, because of the slow shuffles for unpacking and repacking. The fixed point version, with a lot less unpacking/repacking, will have an even bigger advantage there.
See the edit history for an unfinished attempt at using punpck before I realized how many extra instructions it was going to need. Removed it because this answer is long already, and another code block would be confusing.
I guess you're looking fro the __m128 _mm_cvtpi8_ps(__m64 a ) composite intrinsic.
Here is a minimal example:
#include <xmmintrin.h>
#include <stdio.h>
int main() {
unsigned char a[4] __attribute__((aligned(32)))= {1,2,3,4};
float b[4] __attribute__((aligned(32)));
_mm_store_ps(b, _mm_cvtpi8_ps(*(__m64*)a));
printf("%f %f, %f, %f\n", b[0], b[1], b[2], b[3]);
return 0;
}