Map integer range onto another range - c++

At runtime I have 2 ranges defined by their uint32_t borders a..b and c..d. The first range tends to be much greater than the second: 8 < (b - a) / (d - c) < 64.
Exact limits: a >= 0, b <= 2^31 - 1, c >= 0, d <= 2^20 - 1.
I need a routine that performs linear mapping of an integer from the first range onto the second one: f(uint32_t x) -> round_to_uint32_t((float)(x - a) / (b - a) * (d - c) + c).
When b - a >= d - c it is important to maintain the ratio as close to ideal as possible; otherwise, in cases when an element from [a; b] can be mapped to more than one integer from [c; d], it is okay to return any of these integers.
Sounds like a simple ratio problem and was already answered in many questions like
Convert a number range to another range, maintaining ratio
but here I need a really really fast solution.
This routine is a pivotal part of a specialized sorting algorithm and will be called at least once for every element of a sorted array.
SIMD solution is also acceptable if it doesn't drop overall performance.

Actual runtime division (FP and integer) is very slow so you definitely want to avoid that. The way you wrote that expression probably compiles to include a division because FP math is not associative (without -ffast-math); the compiler can't turn x / foo * bar into x * (bar/foo) for you, even though that's very good with loop-invariant bar/foo. You do need either floating point or 64-bit integers to avoid overflow in a multiply, but only FP lets you reuse a non-integer loop-invariant division result.
_mm256_fmadd_ps looks like the obvious way to go, with a pre-computed loop-invariant value for the multiplier (d - c) / (b - a). If float rounding isn't a problem for doing it strictly in order (multiply then divide), it's probably ok to do this inexact division first, outside the loop. Like
_mm256_set1_ps((d - c) / (double)(b - a)). Using double for this calculation avoids rounding error during conversion to FP of the division operands.
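As a scalar illustration of the same idea, here is a minimal sketch that does the division once and leaves only a subtract, a fused multiply-add and a rounding conversion per element. The names RangeMap, make_range_map and map_value are made up for this example, and it assumes b > a. Build with FMA enabled, otherwise std::fma may fall back to a slow library routine.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper type: everything that depends only on a, b, c, d.
struct RangeMap {
    uint32_t a;   // lower bound of the input range
    float scale;  // (d - c) / (b - a), computed once in double
    float c;      // lower bound of the output range
};

// Assumes b > a, so the single division is well-defined.
inline RangeMap make_range_map(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    const double scale = (d - c) / static_cast<double>(b - a);
    return RangeMap{a, static_cast<float>(scale), static_cast<float>(c)};
}

// Per element: integer subtract, fused multiply-add, rounding conversion.
inline uint32_t map_value(const RangeMap& m, uint32_t x) {
    const float y = std::fma(static_cast<float>(x - m.a), m.scale, m.c);
    // lrint rounds in the current rounding mode (nearest-even by default).
    return static_cast<uint32_t>(std::lrint(y));
}
```

The RangeMap would be built once outside the hot loop and map_value called per element.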
You're reusing the same a,b,c,d for many x, presumably coming from contiguous memory. You're using the result as part of a memory address so you do eventually need the results back from SIMD into integer registers, unfortunately. (Possibly with AVX512 scatter stores you could avoid that.)
Modern x86 CPUs have 2/clock load throughput so probably your best bet for getting 8x uint32_t back into integer registers is a vector store / integer reload, instead of spending 2 uops per element for ALU shuffle stuff. That has some latency so I'd suggest converting into a tmp buffer of maybe 16 or 32 ints (64 or 128 bytes), i.e. 2x or 4x __m256i before looping through that scalar.
Or maybe alternate converting and storing one vector then looping over the 8 elements of another one that you converted earlier. i.e. software pipelining. Out-of-order execution can hide latency but you're already going to be stretching its latency-hiding capability for cache misses for whatever you're doing with memory.
Depending on your CPU (e.g. Haswell or some Skylake), using 256-bit vector instructions might cap your max turbo slightly lower than it would otherwise. You might consider only doing vectors of 4 at once but then you're spending more uops per element.
If not SIMD, then even scalar C++ fma() is still good, for vfmadd213sd, but using intrinsics is a very convenient way to get rounding (instead of truncation) from float -> int (vcvtps2dq rather than vcvttps2dq).
Note that uint32_t <-> float conversion isn't directly available until AVX512. For scalar you can just convert to/from int64_t with truncation / zero-extension for the unsigned low half.
It's very convenient that (as discussed in comments) your inputs are range-limited so if you interpret them as signed integers they have the same value (signed non-negative). Both x and x-a (and b-a) are known to be positive and <= INT32_MAX i.e 0x7FFFFFFF. (Or at least non-negative. Zero is fine.)
Float Rounding
For SIMD, single-precision float is very good for SIMD throughput. Efficient packed-conversion to/from signed int32_t. But not every int32_t can be exactly represented as a float. Larger values get rounded to the nearest even, nearest multiple of 2^2, 2^3, or more the farther above 2^24 the value is.
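A quick way to see where float stops being exact is a round-trip check. This is only an illustrative sketch; float_round_trips is a made-up helper name.

```cpp
#include <cstdint>

// True if v survives int32_t -> float -> integer unchanged.
// The comparison is done in int64_t so that inputs which round up to 2^31
// stay in range when converted back.
inline bool float_round_trips(int32_t v) {
    return static_cast<int64_t>(static_cast<float>(v)) == static_cast<int64_t>(v);
}
```

For example, 16777216 (2^24) and 16777218 round-trip, but 16777217 does not, because odd values just above 2^24 are not representable.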
Using SIMD double is possible but requires some shuffling.
I don't think float is usually a problem for the formula as-written with (float)(x-a). If the b-a input range is large, that means both ranges are large and rounding error isn't going to map all possible x values into the same output. Depending on the multiplier, the input rounding error might be worse than the output rounding error, maybe leaving some representable output floats unused for higher x-a values.
But if we want to factor out the -a * (d - c) / (b - a) part and combine it with the +c at the end, then
We potentially have precision loss from catastrophic cancellation in that value to be added.
We need to do (float)x on the raw input value. If a is huge and b-a is small, i.e. a small range near the top of the possible input range, rounding error can map all possible x values to the same float.
To make best use of FMA, we want to do the +c before converting back to integer, which again risks output rounding error if the d-c is a small output range but c is huge. In your case not a problem; with d <= 2^20 - 1 we know that float can exactly represent every output integer value in that c..d range.
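To make the input-rounding hazard concrete, here is a scalar sketch comparing the factored form with the subtract-first form. The function names are invented for the example and both assume b > a.

```cpp
#include <cmath>
#include <cstdint>

// Scalar model of the factored form: x * scale + (c - a * scale).
inline int32_t map_factored(uint32_t x, uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    const double scale = (d - c) / static_cast<double>(b - a);
    const float add = static_cast<float>(c - static_cast<double>(a) * scale);
    const float y = std::fma(static_cast<float>(x), static_cast<float>(scale), add);
    return static_cast<int32_t>(std::lrint(y));
}

// Scalar model of the subtract-first form: (x - a) * scale + c.
inline int32_t map_subtract_first(uint32_t x, uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    const double scale = (d - c) / static_cast<double>(b - a);
    const float y = std::fma(static_cast<float>(x - a), static_cast<float>(scale),
                             static_cast<float>(c));
    return static_cast<int32_t>(std::lrint(y));
}
```

With a = 2^30, b = a + 64, c = 0, d = 8, floats near 2^30 are 128 apart, so (float)x is the same for every x in a..a+63 and the factored form returns 0 for all of them, while the subtract-first form still maps a + 40 to 5.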
If you didn't have the input range constraint, you could range-shift to/from signed before the scaling by using integer (x-a)+0x80000000U on input and ...+c+0x80000000U on output (after rounding to nearest int32_t). But that would introduce huge float rounding error for small uint32_t inputs (close to 0) which get range-shifted to close to INT_MIN.
We don't need to range-shift for the b-a or d-c because the + or - or XOR with 0x80000000U would cancel out in the subtractions.
Example:
The const vectors should be hoisted out of a loop by the compiler after this inlines,
or you can do that manually.
This requires AVX1 + FMA (e.g. AMD Piledriver or Intel Haswell or later). Untested, sorry I didn't even throw this on Godbolt to see if it compiles.
// fastest but not safe if b-a is small and a > 2^24
static inline
__m256i range_scale_fast_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
// avoid rounding errors when computing the scale factor, but convert double->float on the final result
double scale_scalar = (d - c) / (double)(b - a);
const __m256 scale = _mm256_set1_ps(scale_scalar);
const __m256 add = _mm256_set1_ps(-(double)a * scale_scalar + c); // negate as double: -a on a uint32_t would wrap around
// (x-a) * scale + c
// = x * scale + (-a*scale + c) but with different rounding error from doing -a*scale + c
__m256 in = _mm256_cvtepi32_ps(data);
__m256 out = _mm256_fmadd_ps(in, scale, add);
return _mm256_cvtps_epi32(out); // convert back with round to nearest-even
// _mm256_cvttps_epi32 truncates, matching C rounding; maybe good for scalar testing
}
Or a safer version, doing the input range-shift with integer math. You could easily avoid FMA here if necessary for portability and use an integer add for the output, too (although the 256-bit integer add used below, _mm256_add_epi32, itself requires AVX2, not just AVX1). But we know the output range is small enough that float can always exactly represent any integer in it.
static inline
__m256i range_scale_safe_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
// avoid rounding errors when computing the scale factor, but convert double->float on the final result
const __m256 scale = _mm256_set1_ps((d - c) / (double)(b - a));
const __m256 cvec = _mm256_set1_ps(c);
__m256i in_offset = _mm256_add_epi32(data, _mm256_set1_epi32(-a)); // add can more easily fold a load of a memory operand than sub because it's commutative. Only some compilers will do this for you.
__m256 in_fp = _mm256_cvtepi32_ps(in_offset);
__m256 out = _mm256_fmadd_ps(in_fp, scale, cvec); // in*scale + c
return _mm256_cvtps_epi32(out);
}
Without FMA you could still use vmulps. You might as well convert back to integer before adding c if you're doing that, although vaddps would be safe.
You might use this in a loop like
void foo(uint32_t *arr, ptrdiff_t len, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
if (len < 24) { /* special case: handle short arrays with scalar code */ return; }
alignas(32) uint32_t tmpbuf[16];
// peel half of first iteration for software pipelining / loop rotation
__m256i arrdata = _mm256_loadu_si256((const __m256i*)&arr[0]);
__m256i outrange = range_scale_safe_fma(arrdata, a, b, c, d);
_mm256_store_si256((__m256i*)tmpbuf, outrange);
// could have used an unsigned loop counter
// since we probably just need an if() special case handler anyway for small len which could give len-23 < 0
for (ptrdiff_t i = 0 ; i < len-(15+8) ; i+=16 ) {
// prep next 8 elements
arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+8]);
outrange = range_scale_safe_fma(arrdata, a, b, c, d);
_mm256_store_si256((__m256i*)&tmpbuf[8], outrange);
// use first 8 elements
for (int j=0 ; j<8 ; j++) {
// use tmpbuf[j], which corresponds to arr[i+j]
}
// prep 8 more for next iteration
arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+16]);
outrange = range_scale_safe_fma(arrdata, a, b, c, d);
_mm256_store_si256((__m256i*)&tmpbuf[0], outrange);
// use 2nd 8 elements
for (int j=8 ; j<16 ; j++) {
// use tmpbuf[j], which corresponds to arr[i+j]
}
}
// use tmpbuf[0..7]
// then cleanup: one vector at a time until < 8 or < 4 with 128-bit vectors, then scalar
}
These variable-names sound dumb but I couldn't think of anything better.
This software pipelining is an optimization; you can just get it working / try it out with a single vector at a time used right away. (Optimize the reload of the first element from a reload to vmovd using _mm_cvtsi128_si32(_mm256_castsi256_si128(outrange)) if you want.)
Special cases
If there are cases where you know (b - a) is a power of 2, you could bitscan with tzcnt or bsf, then multiply. (There are intrinsics for those, like GNU C __builtin_ctz() to count trailing zeros.)
Or can you ensure that (b - a) is always a power of 2?
Or better, if (b - a) / (d - c) is an exact power of 2 the whole thing can just be sub / right shift / add.
If you can't always ensure that you'd still need the general case sometimes, but maybe possible to do that efficiently.
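For the exact power-of-2 ratio case, a minimal scalar sketch of the sub / right shift / add sequence could look like this. The names are hypothetical, the loop is a portable stand-in for tzcnt or __builtin_ctz, and the result truncates instead of rounding.

```cpp
#include <cstdint>

// Hypothetical helper: log2 of a value assumed to be a power of two.
inline unsigned log2_pow2(uint32_t v) {
    unsigned s = 0;
    while ((v >>= 1) != 0) ++s;
    return s;
}

// Assumes (b - a) / (d - c) is exactly 2^shift. Truncates toward zero.
inline uint32_t map_pow2_ratio(uint32_t x, uint32_t a, uint32_t c, unsigned shift) {
    return ((x - a) >> shift) + c;
}
```

The shift count would be computed once per (a, b, c, d), for example log2_pow2((b - a) / (d - c)), and then reused for every x.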

Related

Vectorized function to count numbers in an array when a number is a specified power

I am attempting to vectorize this fairly expensive function (scalar version now working!):
template<typename N, typename POW>
inline constexpr bool isPower(const N n, const POW p) noexcept
{
double x = std::log(static_cast<double>(n)) / std::log(static_cast<double>(p));
return (x - std::trunc(x)) < 0.000001;
}//End of isPower
Here's what I have so far (for 32-bit int only):
template<typename RETURN_T>
inline RETURN_T count_powers_of(const std::vector<int32_t>& arr, const int32_t power)
{
RETURN_T cnt = 0;
const __m256 _MAGIC = _mm256_set1_ps(0.000001f);
const __m256 _POWER_D = _mm256_set1_ps(static_cast<float>(power));
const __m256 LOG_OF_POWER = _mm256_log_ps(_POWER_D);
__m256i _count = _mm256_setzero_si256();
__m256i _N_INT = _mm256_setzero_si256();
__m256 _N_DBL = _mm256_setzero_ps();
__m256 LOG_OF_N = _mm256_setzero_ps();
__m256 DIVIDE_LOG = _mm256_setzero_ps();
__m256 TRUNCATED = _mm256_setzero_ps();
__m256 CMP_MASK = _mm256_setzero_ps();
for (size_t i = 0uz; (i + 8uz) < arr.size(); i += 8uz)
{
//Set Values
_N_INT = _mm256_load_si256((__m256i*) &arr[i]);
_N_DBL = _mm256_cvtepi32_ps(_N_INT);
LOG_OF_N = _mm256_log_ps(_N_DBL);
DIVIDE_LOG = _mm256_div_ps(LOG_OF_N, LOG_OF_POWER);
TRUNCATED = _mm256_sub_ps(DIVIDE_LOG, _mm256_trunc_ps(DIVIDE_LOG));
CMP_MASK = _mm256_cmp_ps(TRUNCATED, _MAGIC, _CMP_LT_OQ);
_count = _mm256_sub_epi32(_count, _mm256_castps_si256(CMP_MASK));
}//End for
cnt = static_cast<RETURN_T>(util::_mm256_sum_epi32(_count));
return cnt;
}//End of count_powers_of
The scalar version runs in about 14.1 seconds.
The scalar version called from std::count_if with par_unseq runs in 4.5 seconds.
The vectorized version runs in just 155 milliseconds but produces the wrong result, albeit vastly closer now.
Testing:
int64_t count = 0;
for (size_t i = 0; i < vec.size(); ++i)
{
if (isPower(vec[i], 4))
{
++count;
}//End if
}//End for
std::cout << "Counted " << count << " powers of 4.\n";//produces 4,996,215 powers of 4 in a vector of 1 billion 32-bit ints consisting of a uniform distribution of 0 to 1000
std::cout << "Counted " << count_powers_of<int32_t>(vec, 4) << " powers of 4.\n";//produces 4,996,865 powers of 4 on the same array
This new vastly simplified code often produces results that are slightly off the correct number of powers found (usually higher). I think the problem is my reinterpret cast from __m256 to __m256i, but when I try to use a conversion (with floor) instead I get a number that's way off (in the billions again).
It could also be this sum function (based off of code by @PeterCordes):
inline uint32_t _mm_sum_epi32(__m128i& x)
{
__m128i hi64 = _mm_unpackhi_epi64(x, x);
__m128i sum64 = _mm_add_epi32(hi64, x);
__m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
__m128i sum32 = _mm_add_epi32(sum64, hi32);
return _mm_cvtsi128_si32(sum32);
}
inline uint32_t _mm256_sum_epi32(__m256i& v)
{
__m128i sum128 = _mm_add_epi32(
_mm256_castsi256_si128(v),
_mm256_extracti128_si256(v, 1));
return _mm_sum_epi32(sum128);
}
I know this has got to be a floating-point precision/comparison issue; Is there a better way to approach this?
Thanks for all your insights and suggestions thus far.
A more sensible unit test would be non-random: check all powers in a loop to make sure they're all true, like x *= base;, and count how many powers there are <= n. Then check all numbers from 0..n in a loop, once each, to verify the right total. If both those checks succeed, that means it returned false in all the cases it should have, otherwise the count would be wrong.
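That two-step test could be sketched roughly as below. is_power_exact is an exact integer reference written for this example (assuming base >= 2), and passes_power_test is a hypothetical harness that accepts any predicate with the same signature, so the FP version can be plugged in instead.

```cpp
#include <cstdint>

// Exact reference, assuming base >= 2: true if n == base^k for some k >= 0.
inline bool is_power_exact(uint32_t n, uint32_t base) {
    uint64_t p = 1;
    while (p < n) p *= base;  // 64-bit, so the last multiply cannot wrap
    return p == n;
}

template <class Pred>
bool passes_power_test(Pred is_power, uint32_t limit, uint32_t base) {
    // Step 1: every real power up to limit must be accepted; count them.
    uint32_t expected = 0;
    for (uint64_t p = 1; p <= limit; p *= base) {
        if (!is_power(static_cast<uint32_t>(p), base)) return false;
        ++expected;
    }
    // Step 2: scan 0..limit once; a matching total rules out false positives.
    uint32_t counted = 0;
    for (uint64_t n = 0; n <= limit; ++n)
        if (is_power(static_cast<uint32_t>(n), base)) ++counted;
    return counted == expected;
}
```

For a limit of 1000 and base 4, the powers are 1, 4, 16, 64 and 256, so a predicate that accepts anything else fails step 2.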
Re: the original version:
This seems to depend on there being no floating-point rounding error. You do d == (N)d which (if N is an integral type) checks that the ratio of two logs is an exact integer; even 1 bit in the mantissa will make it unequal. Hardly surprising that a different log implementation would give different results, if one has different rounding error.
Except your scalar code at least is even more broken because it takes d = floor(log ratio) so it's already always an exact integer.
I just tried your scalar version for a testcase like return isPower(5, 4) to ask if 5 is a power of 4. It returns true: https://godbolt.org/z/aMT94ro6o . So yeah, your code is super broken, and is in fact only checking that n>0 or something. That would explain why 999 of 1000 of your "random" inputs from 0..999 were counted as powers of 4, which is obviously super broken.
I think it's impossible to achieve correctness with your FP log ratio idea: FP rounding error means you can't expect exact equality, but allowing a range would probably let in non-exact powers.
You might want to special-case integral N, power-of-2 pow. That can go vastly faster by checking that n has a single bit set ((n & (n-1)) == 0, with n != 0) and that it's at a valid position. (e.g. for pow=4, the set bit must be at an even position: (n & 0b...01010101) != 0). You can construct the constant by multiplying and adding until overflow or something. Or 32/pow times? Anyway, one psubd/pand/pcmpeqd, pand/pcmpeqd, and pand/psubd per 8 elements, with maybe some room to optimize that further.
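For base 4 specifically, a scalar sketch of the single-bit plus valid-position check might look like this (is_power_of_4 is just an illustrative name; a SIMD version would apply the same masks per lane).

```cpp
#include <cstdint>

inline bool is_power_of_4(uint32_t n) {
    // Exactly one bit set: clearing the lowest set bit leaves zero.
    const bool single_bit = n != 0 && (n & (n - 1)) == 0;
    // Powers of 4 have their bit at an even position (0, 2, 4, ...).
    const bool even_position = (n & 0x55555555u) != 0;
    return single_bit && even_position;
}
```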
Otherwise, in the general case, you can brute-force check 32-bit integers one at a time against the 32 or fewer possible powers that fit in an int32_t. e.g. broadcast-load, 4x vpcmpeqd / vpsubd into multiple accumulators. (The smallest possible base, 2, can have powers up to 2^31 and still fit in an unsigned int). log_3(2^31) is about 19.6, so you'd only need three AVX2 vectors of powers. Or log_4(2^31) is 15.5 so you'd only need 2 vectors to hold every non-overflowing power.
That only handles 1 input element per vector instead of 4 doubles, but it's probably faster than your current FP attempt, as well as fixing the correctness problems. I could see that running more than 4x the throughput per iteration of what you're doing now, or even 8x, so it should be good for speed. And of course has the advantage that correctness is possible!!
Speed gets even better for bases of 4 or greater, only 2x compare/sub per input element, or 1x for bases of 16 or greater. (<= 8 elements to compare against can fit in one vector).
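A scalar sketch of the compare-against-every-power idea follows. The inner loop plays the role of the vector compares, the names are made up, and base is assumed to be at least 2.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// All powers of base (assumed >= 2) that fit in a non-negative int32_t.
inline std::vector<int32_t> powers_table(int32_t base) {
    std::vector<int32_t> table;
    for (int64_t p = 1; p <= INT32_MAX; p *= base)
        table.push_back(static_cast<int32_t>(p));
    return table;
}

inline std::size_t count_powers(const std::vector<int32_t>& arr, int32_t base) {
    const std::vector<int32_t> table = powers_table(base);
    std::size_t count = 0;
    for (int32_t n : arr)
        for (int32_t p : table)
            count += (n == p);  // branch-free compare-and-accumulate, like vpcmpeqd / vpsubd
    return count;
}
```

The table has 31 entries for base 2, 20 for base 3 and 16 for base 4, which is where the vector counts above come from. Exact integer equality means there is no rounding tolerance to tune.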
Implementation mistakes in the attempt to vectorize this probably-unfixable algorithm:
_mm256_rem_epi32 is a slow library function, but you're using it with a constant divisor of 2! Integer mod 2 is just n & 1 for non-negative. Or if you need to handle negative remainders, you can use the tricks compilers use to implement int % 2: https://godbolt.org/z/b89eWqEzK where it shifts down the sign bit as a correction to do signed division.
Updated version using (x - std::trunc(x)) < 0.000001;
This might work, especially if you limit it to small n. I'd worry that with large n, the difference between an exact power and off-by-1 would be a small ratio. (I haven't really looked at the details, though.)
Your vectorization with __m256 vectors of single-precision float is doomed for large n, but could be ok for small n: float32 can't represent every int32_t, so large odd integers (above 2^24) get rounded to multiples of 2, or multiples of 4 above 2^25, etc.
float has less relative precision in general, so it might not have enough to spare for this algorithm. Or maybe there's something that could be fixed, IDK, I haven't looked closely since the update.
I'd still recommend trying a simple compare-for-equality against all possible powers in the range, broadcast-loading each element. That will definitely work exactly, and if it's as fast then there's no need to try to fix this version using FP logs.
__m256 _N_DBL = _mm256_setzero_ps(); is a confusing name; it's a vector of float, not double. (And it's not part of a standard library header so it shouldn't be using a leading underscore.)
Also, there's zero point initializing it with zero there, since it gets written unconditionally inside the loop. In fact it's only ever used inside the loop, so it could just be declared at that scope, when you're ready to give it a value. Only declare variables in outer scopes if you need them after a loop.

Fast method to multiply integer by proper fraction without floats or overflow

My program frequently requires the following calculation to be performed:
Given:
N is a 32-bit integer
D is a 32-bit integer
abs(N) <= abs(D)
D != 0
X is a 32-bit integer of any value
Find:
X * N / D as a rounded integer that is X scaled to N/D (e.g. 10 * 2 / 3 = 7)
Obviously I could just use r=x*n/d directly but I will often get overflow from the x*n. If I instead do r=x*(n/d) then I only get 0 or x due to integer division dropping the fractional component. And then there's r=x*(float(n)/d) but I can't use floats in this case.
Accuracy would be great but isn't as critical as speed and being a deterministic function (always returning the same value given the same inputs).
N and D are currently signed but I could work around them being always unsigned if it helps.
A generic function that works with any value of X (and N and D, as long as N <= D) is ideal since this operation is used in various different ways but I also have a specific case where the value of X is a known constant power of 2 (2048, to be precise), and just getting that specific call sped up would be a big help.
Currently I am accomplishing this using 64-bit multiply and divide to avoid overflow (essentially int multByProperFraction(int x, int n, int d) { return (__int64)x * n / d; } but with some asserts and extra bit fiddling for rounding instead of truncating).
Unfortunately, my profiler is reporting the 64-bit divide function as taking up way too much CPU (this is a 32-bit application). I've tried to reduce how often I need to do this calculation but am running out of ways around it, so I'm trying to figure out a faster method, if it is even possible. In the specific case where X is a constant 2048, I use a bit shift instead of multiply but that doesn't help much.
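For reference, the 64-bit baseline with rounding might be sketched as follows. This is not the asker's exact code: it assumes round-half-away-from-zero, d != 0, and that the result fits in 32 bits (which abs(N) <= abs(D) guarantees except for the INT32_MIN corner case). mul_div_round is a made-up name.

```cpp
#include <cstdint>

// x * n / d rounded to nearest, halves away from zero, via a 64-bit intermediate.
inline int32_t mul_div_round(int32_t x, int32_t n, int32_t d) {
    int64_t num = static_cast<int64_t>(x) * n;  // cannot overflow in 64 bits
    int64_t den = d;
    if (den < 0) {  // normalize so the denominator is positive
        num = -num;
        den = -den;
    }
    const int64_t half = den / 2;
    const int64_t biased = (num >= 0) ? num + half : num - half;
    return static_cast<int32_t>(biased / den);  // C++ division truncates toward zero
}
```

With the example from the question, mul_div_round(10, 2, 3) gives 7.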
Tolerate imprecision and use the 16 MSBits of n,d,x
Algorithm
while (|n| > 0xffff) n/2, sh++
while (|x| > 0xffff) x/2, sh++
while (|d| > 0xffff) d/2, sh--
r = n*x/d // A 16x16 to 32 multiply followed by a 32/16-bit divide.
shift r by sh.
When a 64-bit divide is expensive, the pre/post processing here may be worth it in order to do a 32-bit divide instead, which will certainly be the big chunk of CPU.
If the compiler cannot be coaxed into doing a 32-bit/16-bit divide, then skip the while (|d| > 0xffff) d/2, sh-- step and do a 32/32 divide.
Use unsigned math where possible.
The basic correct approach to this is simply (uint64_t)x*n/d. That's optimal assuming d is variable and unpredictable. But if d is constant or changes infrequently, you can pre-generate constants such that exact division by d can be performed as a multiplication followed by a bitshift. A good description of the algorithm, which is roughly what GCC uses internally to transform division by a constant into multiplication, is here:
http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html
I'm not sure how easy it is to make it work for a "64/32" division (i.e. dividing the result of (uint64_t)x*n), but you should be able to just break it up into high and low parts if nothing else.
Note that these algorithms are also available as libdivide.
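To illustrate the multiply-and-shift idea for a 32-bit dividend only, here is one known variant: precompute M = ceil(2^64 / d) and take the high 64 bits of M * n. This is a sketch, not necessarily what GCC or libdivide emit, the function names are made up, it assumes d >= 2, and it does not cover the 64-bit product x*n.

```cpp
#include <cstdint>

// Precomputed once per divisor d (assumes d >= 2): M = ceil(2^64 / d).
inline uint64_t reciprocal_for(uint32_t d) {
    return UINT64_MAX / d + 1;
}

// floor(n / d) as the high 64 bits of the 96-bit product M * n,
// assembled from two 32x32->64 multiplies so no 128-bit type is needed.
inline uint32_t div_by_reciprocal(uint32_t n, uint64_t M) {
    const uint64_t hi = (M >> 32) * n;
    const uint64_t lo = (M & 0xFFFFFFFFu) * n;
    return static_cast<uint32_t>((hi + (lo >> 32)) >> 32);
}
```

The one slow division happens in reciprocal_for; every later division by the same d costs two multiplies, an add and two shifts.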
I've now benchmarked several possible solutions, including weird/clever ones from other sources like combining 32-bit div & mod & add or using peasant math, and here are my conclusions:
First, if you are only targeting Windows and using VSC++, just use MulDiv(). It is quite fast (faster than directly using 64-bit variables in my tests) while still being just as accurate and rounding the result for you. I could not find any superior method to do this kind of thing on Windows with VSC++, even taking into account restrictions like unsigned-only and N <= D.
However, in my case having a function with deterministic results even across platforms is even more important than speed. On another platform I was using as a test, the 64-bit divide is much, much slower than the 32-bit one when using the 32-bit libraries, and there is no MulDiv() to use. The 64-bit divide on this platform takes ~26x as long as a 32-bit divide (yet the 64-bit multiply is just as fast as the 32-bit version...).
So if you have a case like me, I will share the best results I got, which turned out to be just optimizations of chux's answer.
Both of the methods I will share below make use of the following function (though the compiler-specific intrinsics only actually helped in speed with MSVC in Windows):
inline u32 bitsRequired(u32 val)
{
#ifdef _MSC_VER
DWORD r = 0;
_BitScanReverse(&r, val | 1);
return r+1;
#elif defined(__GNUC__) || defined(__clang__)
return 32 - __builtin_clz(val | 1);
#else
int r = 1;
while (val >>= 1) ++r;
return r;
#endif
}
Now, if x is a constant that's 16-bit in size or smaller and you can pre-compute the bits required, I found the best results in speed and accuracy from this function:
u32 multConstByPropFrac(u32 x, u32 nMaxBits, u32 n, u32 d)
{
//assert(nMaxBits == 32 - bitsRequired(x));
//assert(n <= d);
const int bitShift = bitsRequired(n) - nMaxBits;
if( bitShift > 0 )
{
n >>= bitShift;
d >>= bitShift;
}
// Remove the + d/2 part if don't need rounding
return (x * n + d/2) / d;
}
On the platform with the slow 64-bit divide, the above function ran ~16.75x as fast as return ((u64)x * n + d/2) / d; and with an average 99.999981% accuracy (comparing difference in return value from expected to range of x, i.e. returning +/-1 from expected when x is 2048 would be 100 - (1/2048 * 100) = 99.95% accurate) when testing it with a million or so randomized inputs where roughly half of them would normally have been an overflow. Worst-case accuracy was 99.951172%.
For the general use case, I found the best results from the following (and without needing to restrict N <= D to boot!):
u32 scaleToFraction(u32 x, u32 n, u32 d)
{
u32 bits = bitsRequired(x);
int bitShift = bits - 16;
if( bitShift < 0 ) bitShift = 0;
int sh = bitShift;
x >>= bitShift;
bits = bitsRequired(n);
bitShift = bits - 16;
if( bitShift < 0 ) bitShift = 0;
sh += bitShift;
n >>= bitShift;
bits = bitsRequired(d);
bitShift = bits - 16;
if( bitShift < 0 ) bitShift = 0;
sh -= bitShift;
d >>= bitShift;
// Remove the + d/2 part if don't need rounding
u32 r = (x * n + d/2) / d;
if( sh < 0 )
r >>= (-sh);
else //if( sh > 0 )
r <<= sh;
return r;
}
On the platform with the slow 64-bit divide, the above function ran ~18.5x as fast as using 64-bit variables and with 99.999426% average and 99.947479% worst-case accuracy.
I was able to get more speed or more accuracy by messing with the shifting, such as trying to not shift all the way down to 16-bit if it wasn't strictly necessary, but any increase in speed came at a high cost in accuracy and vice versa.
None of the other methods I tested came even close to the same speed or accuracy, most being slower than just using the 64-bit method or having huge loss in precision, so not worth going into.
Obviously, no guarantee that anyone else will get similar results on other platforms!
EDIT: Replaced some bit-twiddling hacks with plain code that actually ran faster anyway by letting the compiler do its job.

Operating on two shorts at once by combining them into an integer

I'm using the following code to map two signed 16-bit integers to the upper and lower 16 bits of an unsigned 32 bit integer.
inline uint32_t to_score(int16_t mg, int16_t eg) {
return ((1u * mg) << 16 | (eg & 0xFFFF));
}
inline int16_t extract_mg(uint32_t score) {
return int16_t(score >> 16);
}
inline int16_t extract_eg(uint32_t score) {
return int16_t(score & 0xFFFF);
}
I need to perform various calculations on both the mg and eg parts simultaneously, before interpolating the two parts at the end of a function.
As I understand it, as long as there is no overflow, it should be safe to add two uint32_ts created by to_score, and then extract the int16_ts to find the results of the individual calculations: i.e. the results if I added the values for mg and eg separately.
I'm not sure whether this assumption holds if either mg or eg are negative, or whether this method can be used for subtraction, multiplication and/or division.
Which operations can I expect to function correctly? Are there alternative ways of representing two integers which can be added/subtracted/multiplied quickly?
There will be a problem with a carry going from the low half into the high half, but it can be avoided with extra operations, as detailed on for example chessprogramming.org/SIMD_and_SWAR_Techniques
z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
Where in this case H = 0x80008000.
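Here is a small sketch of that carry-safe addition next to plain addition. pack16, high16 and low16 are illustrative stand-ins for the question's to_score / extract_mg / extract_eg, and the unsigned-to-int16_t conversion is assumed to wrap (guaranteed since C++20, true on mainstream compilers before that).

```cpp
#include <cstdint>

inline uint32_t pack16(int16_t hi, int16_t lo) {
    return (static_cast<uint32_t>(static_cast<uint16_t>(hi)) << 16) |
           static_cast<uint16_t>(lo);
}
inline int16_t high16(uint32_t v) { return static_cast<int16_t>(static_cast<uint16_t>(v >> 16)); }
inline int16_t low16(uint32_t v) { return static_cast<int16_t>(static_cast<uint16_t>(v)); }

// Two independent 16-bit additions inside one 32-bit word.
inline uint32_t swar_add16(uint32_t x, uint32_t y) {
    const uint32_t H = 0x80008000u;  // top bit of each 16-bit lane
    // Add the low 15 bits of each lane (cannot carry out of a lane),
    // then restore the top bits with XOR, which is addition without carry.
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}
```

Adding (0, -1) and (0, 1) with plain + carries out of the low lane and yields a high half of 1, while swar_add16 gives (0, 0).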
As another alternative, it could be done with two additions, but with optimized extraction/recombination:
// low half addition, leaving upper half corrupted but it will be ignored
l = x + y
// high half addition, adding 0 to the bottom so no carry
h = x + (y & 0xFFFF0000)
// recombine
z = (l & 0xFFFF) | (h & 0xFFFF0000)
Subtraction is a minor variation on addition.
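A sketch of the subtraction variant, using the same H mask: set the top bit of each lane in x so a borrow can never leave the lane, subtract the low 15 bits of y, then fix the top bits with XOR. swar_sub16 is a made-up name.

```cpp
#include <cstdint>

// Two independent 16-bit subtractions (x - y per lane) inside one 32-bit word.
inline uint32_t swar_sub16(uint32_t x, uint32_t y) {
    const uint32_t H = 0x80008000u;  // top bit of each 16-bit lane
    // (x | H) - (y & ~H) is at least 1 in every lane, so no borrow crosses lanes.
    // The XOR term then corrects each lane's top bit.
    return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
}
```

For example, lanes (3, -5) minus lanes (-7, 2) is 0x0003FFFB minus 0xFFF90002 per lane, which gives 0x000AFFF9, i.e. (10, -7).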
Multiplication unfortunately cares about absolute bit-positions, so values have to be moved (shifted) to their notional position for it to work. Actual SIMD can still be used though, such as _mm_mullo_epi16 with SSE2.
C++ signed integers are two's complement: C++20 standardized this, and in practice you could already assume it before that.
Some cases of addition and subtraction would work, those cases that don't cause either of following: eg to overflow, mg to overflow, mg to change sign.
The optimization does not make much sense.
If there's a larger array, you can try to get your operations vectorized with proper SIMD instructions, if they are available for your platform, by enabling compiler optimization or by using intrinsics (_mm_adds_pi16 might be the one you need).
If you have just two integers, just compute them one by one.

Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?

I want to calculate y = ax + b, where x and y are pixel values [i.e. bytes with a value range of 0~255], while a and b are floats.
I need to apply this formula to each pixel in the image, and in addition a and b are different for different pixels. Direct calculation in C++ is slow, so I am interested in how to do this with SSE2 instructions in C++.
After searching, I found that multiplication and addition of floats with SSE2 are just _mm_mul_ps and _mm_add_ps. But first I need to convert x from a byte to a float (4 bytes).
The question is, after I load the data from the byte-data source (_mm_load_si128), how can I convert the data from bytes to floats?
a and b are different for each pixel? That's going to make it difficult to vectorize, unless there's a pattern or you can generate them in vectors.
Is there any way you can efficiently generate a and b in vectors, either as fixed-point or floating point? If not, inserting 4 FP values, or 8 16bit integers, might be worse than just scalar ops.
Fixed point
If a and b can be reused at all, or generated with fixed-point in the first place, this might be a good use-case for fixed-point math. (i.e. integers that represent value * 2^scale). SSE/AVX don't have a 8b*8b->16b multiply; the smallest elements are words, so you have to unpack bytes to words, but not all the way to 32bit. This means you can process twice as much data per instruction.
There's a _mm_maddubs_epi16 instruction which might be useful if b and a change infrequently enough, or you can easily generate a vector with alternating a*2^4 and b*2^1 bytes. Apparently it's really handy for bilinear interpolation, but it still gets the job done for us with minimal shuffling, if we can prepare an a and b vector.
float a, b;
const int logascale = 4, logbscale=1;
const int ascale = 1<<logascale; // fixed point scale for a: 2^4
const int bscale = 1<<logbscale; // fixed point scale for b: 2^1
const __m128i brescale = _mm_set1_epi8(1<<(logascale-logbscale)); // re-scale b to match a in the 16bit temporary result
for (i=0 ; i<n; i+=16) {
//__m128i avec = get_scaled_a(i);
//__m128i bvec = get_scaled_b(i);
//__m128i ab_lo = _mm_unpacklo_epi8(avec, bvec);
//__m128i ab_hi = _mm_unpackhi_epi8(avec, bvec);
__m128i abvec = _mm_set1_epi16( (short)(((int8_t)(bscale*b) << 8) | ((int8_t)(ascale*a) & 0xFF)) ); // mask a's byte: integer promotion sign-extends a negative a, which would otherwise overwrite b's byte
__m128i block = _mm_load_si128(&buf[i]); // call this { v[0] .. v[15] }
__m128i lo = _mm_unpacklo_epi8(block, brescale); // {v[0], 8, v[1], 8, ...}
__m128i hi = _mm_unpackhi_epi8(block, brescale); // {v[8], 8, v[9], 8, ...
lo = _mm_maddubs_epi16(lo, abvec); // first arg is unsigned bytes, 2nd arg is signed bytes
hi = _mm_maddubs_epi16(hi, abvec);
// lo = { v[0]*(2^4*a) + 8*(2^1*b), ... }
lo = _mm_srli_epi16(lo, logascale); // truncate from scaled fixed-point to integer
hi = _mm_srli_epi16(hi, logascale);
// and re-pack. Logical, not arithmetic right shift means sign bits can't be set
block = _mm_packuswb(lo, hi);
_mm_store_si128(&buf[i], block);
}
// then a scalar cleanup loop
2^4 is an arbitrary choice. It leaves 3 non-sign bits for the integer part of a, and 4 fraction bits. So it effectively rounds a to the nearest 16th, and overflows if a is outside the range -8 to +7 and 15/16ths. 2^6 would give more fractional bits, and allow a from -2 to +1 and 63/64ths.
Since b is being added, not multiplied, its useful range is much larger, and fractional part much less useful. To represent it in 8 bits, rounding it to the nearest half still keeps a little bit of fractional information, but allows it to be [-64 : 63.5] without overflowing.
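To see what these scale choices do to a single pixel, here is a scalar model of one lane of the arithmetic. It is a sketch only: scale_pixel_fixed is a made-up name, it clamps the result to [0, 255] as an assumption rather than reproducing the SIMD behaviour for negative sums, and it ignores the 16-bit saturation of the intermediate.

```cpp
#include <cstdint>

// Scalar model of one 8-bit fixed-point lane: a carries 4 fraction bits, b carries 1.
inline uint8_t scale_pixel_fixed(uint8_t v, float a, float b) {
    const int a_fixed = static_cast<int8_t>(16 * a);  // assumes -8 <= a < 8
    const int b_fixed = static_cast<int8_t>(2 * b);   // assumes -64 <= b < 64
    int sum = v * a_fixed + 8 * b_fixed;  // the factor 8 re-scales b from 2^1 to 2^4
    if (sum < 0) sum = 0;                 // assumption: clamp negative results to 0
    int y = sum >> 4;                     // drop the 4 fraction bits (truncates)
    if (y > 255) y = 255;                 // saturate like an unsigned pack would
    return static_cast<uint8_t>(y);
}
```

With a = 1.5 and b = 10, a pixel of 100 gives (100*24 + 8*20) >> 4 = 160, matching the exact result, while a = 0.5, b = 0.5 on a pixel of 10 gives 5 because the final shift truncates 5.5.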
For more precision, 16b fixed-point is a good choice. You can scale a and b up by 2^7 or something, to have 7b of fractional precision and still allow the integer part to be [-256 .. 255]. There's no multiply-and-add instruction for this case, so you'd have to do that separately. Good options for doing the multiply include:
_mm_mulhi_epu16: unsigned 16b*16b->high16 (bits [31:16]). Useful if a can't be negative
_mm_mulhi_epi16: signed 16b*16b->high16 (bits [31:16]).
_mm_mulhrs_epi16: signed 16b*16b->bits [30:15] of the 32b temporary, with rounding. With a good choice of scaling factor for a, this should be nicer. As I understand it, SSSE3 introduced this instruction for exactly this kind of use.
_mm_mullo_epi16: signed 16b*16b->low16 (bits [15:0]). This only allows 8 significant bits for a before the low16 result overflows, so I think all you gain over the _mm_maddubs_epi16 8bit solution is more precision for b.
To use these, you'd get scaled 16b vectors of a and b values, then:
unpack your bytes with zero (or pmovzx byte->word), to get signed words still in the [0..255] range
left shift the words by 7.
multiply by your a vector of 16b words, taking the upper half of each 16*16->32 result. (e.g. mul
right shift here if you wanted different scales for a and b, to get more fractional precision for a
add b to that.
right shift to do the final truncation back from fixed point to [0..255].
With a good choice of fixed-point scale, this should be able to handle a wider range of a and b, as well as more fractional precision, than 8bit fixed point.
If you don't left-shift your bytes after unpacking them to words, a has to be full-range just to get 8bits set in the high16 of the result. This would mean a very limited range of a that you could support without truncating your temporary to less than 8 bits during the multiply. Even _mm_mulhrs_epi16 doesn't leave much room, since it starts at bit 30.
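A scalar sketch of one lane of the 16-bit pipeline may help when picking scales. The scales here are my own assumption, not something fixed by the steps above: the byte is shifted left by 7, a has 12 fraction bits (so a must be in [-8, 8)), which leaves 7 + 12 - 16 = 3 fraction bits after taking the high half, and b is stored with those same 3 fraction bits. mulhi_lane and scale_pixel16 are illustrative names.

```c
#include <stdint.h>

/* One lane of _mm_mulhi_epi16: high 16 bits of the signed 32-bit product.
   Assumes >> on a negative int32_t is an arithmetic shift (true for GCC, Clang, MSVC). */
static int16_t mulhi_lane(int16_t x, int16_t y)
{
    return (int16_t)(((int32_t)x * (int32_t)y) >> 16);
}

/* v*a + b for one pixel, with a in 12-fraction-bit and b in 3-fraction-bit fixed point. */
static uint8_t scale_pixel16(uint8_t v, int16_t a_q12, int16_t b_q3)
{
    int16_t w = (int16_t)(v << 7);                      /* 0..32640, still non-negative */
    int32_t sum = (int32_t)mulhi_lane(w, a_q12) + b_q3; /* result has 3 fraction bits */
    if (sum > INT16_MAX) sum = INT16_MAX;               /* model a saturating add (_mm_adds_epi16) */
    if (sum < INT16_MIN) sum = INT16_MIN;
    sum >>= 3;                                          /* arithmetic shift, like _mm_srai_epi16 */
    if (sum < 0) sum = 0;                               /* packuswb saturates to [0..255] */
    if (sum > 255) sum = 255;
    return (uint8_t)sum;
}
```

For a = 1.5 (6144 at 12 fraction bits) and b = 10 (80 at 3 fraction bits), scale_pixel16(100, 6144, 80) gives 160. Using an arithmetic shift at the end means negative results clamp to 0 in the pack instead of wrapping.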
expand bytes to floats
If you can't efficiently generate fixed-point a and b values for every pixel, it may be best to convert your pixels to floats. This takes more unpacking/repacking, so latency and throughput are worse. It's worth looking into generating a and b with fixed point.
For packed-float to work, you still have to efficiently build a vector of a values for 4 adjacent pixels.
This is a good use-case for pmovzx (SSE4.1), because it can go directly from 8b elements to 32b. The other options are SSE2 punpck[l/h]bw/punpck[l/h]wd with multiple steps, or SSSE3 pshufb to emulate pmovzx. (You can do one 16B load and shuffle it 4 different ways to unpack it to four vectors of 32b ints.)
char *buf;
// const __m128i zero = _mm_setzero_si128();
for (i=0 ; i<n; i+=16) {
    __m128 a = get_a(i);
    __m128 b = get_b(i);
    // IDK why there isn't an intrinsic for using `pmovzx` as a load, because it takes a m32 or m64 operand, not m128. (unlike punpck*)
    __m128i unsigned_dwords = _mm_cvtepu8_epi32( _mm_loadu_si32(buf+i)); // load 4B at once.
    // Current GCC has a bug with _mm_loadu_si32, might want to use _mm_load_ss and _mm_castps_si128 until it's fixed.
    __m128 floats = _mm_cvtepi32_ps(unsigned_dwords);
    floats = _mm_fmadd_ps(floats, a, b); // with FMA available, this might as well be 256b vectors, even with the inconvenience of the different lane-crossing semantics of pmovzx vs. punpck
    // or without FMA, do this with _mm_mul_ps and _mm_add_ps
    unsigned_dwords = _mm_cvtps_epi32(floats);
    // repeat 3 more times for buf+4, buf+8, and buf+12 (giving dwords0..dwords3), then:
    __m128i packed01 = _mm_packs_epi32(dwords0, dwords1); // SSE2 packssdw
    __m128i packed23 = _mm_packs_epi32(dwords2, dwords3);
    // packuswb wants SIGNED input, so do signed saturation on the first step
    // saturate into [0..255] range
    __m128i packedbytes = _mm_packus_epi16(packed01, packed23); // SSE2
    _mm_store_si128((__m128i*)(buf+i), packedbytes); // or storeu if buf isn't aligned.
}
// cleanup code to handle the odd up-to-15 leftover bytes, if n%16 != 0
(Re: a load that can be a memory source operand for pmovzxbd, see also Loading 8 chars from memory into an __m256 variable as packed single precision floats re: the problems compilers have with this.) And see also GCC bug 99754 - wrong code for _mm_loadu_si32 - reversed vector elements.
The previous version of this answer went from float->uint8 vectors with packusdw/packuswb, and had a whole section on workarounds for without SSE4.1. None of that masking-the-sign-bit after an unsigned pack is needed if you simply stay in the signed integer domain until the last pack. I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. packusdw is only useful if your final goal is uint16_t, rather than further packing.
The last CPUs without SSE4.1 were Intel Conroe/Merom (first-gen Core2, from before late 2007) and AMD pre-Barcelona (before late 2007). If working-but-slow is acceptable for those CPUs, just write a version for AVX2, and a version for SSE4.1. Or SSSE3 (with 4x pshufb to emulate pmovzxbd of the four 32b elements of a register). pshufb is slow on Conroe, though, so if you care about CPUs without SSE4.1, write a specific version. Actually, Conroe/Merom also has slow xmm punpcklbw and so on (except for q->dq). 4x slow pshufb should still beat 6x slow unpacks. Vectorizing is a lot less of a win on pre-Wolfdale, because of the slow shuffles for unpacking and repacking. The fixed point version, with a lot less unpacking/repacking, will have an even bigger advantage there.
See the edit history for an unfinished attempt at using punpck before I realized how many extra instructions it was going to need. Removed it because this answer is long already, and another code block would be confusing.
I guess you're looking for the __m128 _mm_cvtpi8_ps(__m64 a) composite intrinsic.
Here is a minimal example:
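#include <xmmintrin.h>
#include <stdio.h>
int main() {
    // __m64 is 8 bytes wide, so the source array must hold at least 8 bytes
    unsigned char a[8] __attribute__((aligned(32))) = {1,2,3,4};
    float b[4] __attribute__((aligned(32)));
    // converts the low four bytes, treating them as signed 8-bit integers
    _mm_store_ps(b, _mm_cvtpi8_ps(*(__m64*)a));
    _mm_empty(); // leave MMX state before any other floating-point code runs
    printf("%f, %f, %f, %f\n", b[0], b[1], b[2], b[3]);
    return 0;
}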

Fast fixed point pow, log, exp and sqrt

I've got a fixed point class (10.22) and I have a need of a pow, a sqrt, an exp and a log function.
Alas I have no idea where to even start on this. Can anyone provide me with some links to useful articles or, better yet, provide me with some code?
I'm assuming that once I have an exp function then it becomes relatively easy to implement pow and sqrt, as they just become:
pow( x, y ) => exp( y * log( x ) )
sqrt( x ) => pow( x, 0.5 )
It's just those exp and log functions that I'm finding difficult (though I remember a few of my log rules, I can't remember much else about them).
Presumably there would also be a faster method for sqrt and pow, so any pointers on that front would be appreciated, even if it's just to say use the methods I outline above.
Please note: This HAS to be cross platform and in pure C/C++ code so I cannot use any assembler optimisations.
A very simple solution is to use a decent table-driven approximation. You don't actually need a lot of data if you reduce your inputs correctly. exp(a)==exp(a/2)*exp(a/2), which means you really only need to calculate exp(x) for 1 < x < 2. Over that range, a Runge-Kutta approximation would give reasonable results with ~16 entries IIRC.
Similarly, sqrt(a) == 2 * sqrt(a/4) == sqrt(4*a) / 2 which means you need only table entries for 1 < a < 4. Log(a) is a bit harder: log(a) == 1 + log(a/e). This is a rather slow iteration, but log(1024) is only 6.9 so you won't have many iterations.
You'd use a similar "integer-first" algorithm for pow: pow(x,y)==pow(x, floor(y)) * pow(x, frac(y)). This works because pow(double, int) is trivial (divide and conquer).
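The "trivial" integer-power part can be sketched as square-and-multiply on Q10.22 values. This is only an illustration: the q10_22 typedef, Q_ONE, q_mul and q_powi are names I made up, there is no overflow detection, and it assumes a non-negative base so the right shift in q_mul is well defined.

```c
#include <stdint.h>

typedef int32_t q10_22;                 /* assumed layout: 10 integer bits, 22 fraction bits */
#define Q_ONE ((q10_22)1 << 22)         /* 1.0 in Q10.22 */

/* Fixed-point multiply with a 64-bit intermediate (non-negative operands assumed). */
static q10_22 q_mul(q10_22 a, q10_22 b)
{
    return (q10_22)(((int64_t)a * (int64_t)b) >> 22);
}

/* x^n for a non-negative integer n, by repeated squaring (divide and conquer). */
static q10_22 q_powi(q10_22 x, unsigned n)
{
    q10_22 result = Q_ONE;
    while (n != 0) {
        if (n & 1u)
            result = q_mul(result, x);  /* this bit of n contributes the current power of x */
        n >>= 1;
        if (n != 0)
            x = q_mul(x, x);            /* square only if another bit remains, to avoid a needless overflow */
    }
    return result;
}
```

For example, q_powi applied to 1.5 (6291456) with n = 3 yields 14155776, which is 3.375 in Q10.22. The fractional part pow(x, frac(y)) is then multiplied in separately.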
[edit] For the integral component of log(a), it may be useful to store a table 1, e, e^2, e^3, e^4, e^5, e^6, e^7 so you can reduce log(a) == n + log(a/e^n) by a simple hardcoded binary search of a in that table. The improvement from 7 to 3 steps isn't so big, but it means you only have to divide once by e^n instead of n times by e.
[edit 2]
And for that last log(a/e^n) term, you can use log(a/e^n) = log((a/e^n)^8)/8 - each iteration produces 3 more bits by table lookup. That keeps your code and table size small. This is typically code for embedded systems, and they don't have large caches.
[edit 3]
That's still not too smart on my side. log(a) = log(2) + log(a/2). You can just store the fixed-point value log2=0.6931471805599, count the number of leading zeroes, shift a into the range used for your lookup table, and multiply that shift (integer) by the fixed-point constant log2. Can be as low as 3 instructions.
Using e for the reduction step just gives you a "nice" log(e)=1.0 constant but that's false optimization. 0.6931471805599 is just as good a constant as 1.0; both are 32 bits constants in 10.22 fixed point. Using 2 as the constant for range reduction allows you to use a bit shift for a division.
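A minimal sketch of that base-2 range reduction in Q10.22, assuming the table (not shown) covers mantissas in [1, 2). It uses plain shift loops instead of a count-leading-zeros instruction to stay portable; q_log_reduce, q_log_int_part and LN2_Q22 are illustrative names, and LN2_Q22 is just 0.6931471805599 * 2^22 rounded.

```c
#include <stdint.h>

#define Q_ONE   ((uint32_t)1 << 22)   /* 1.0 in Q10.22 */
#define LN2_Q22 INT32_C(2907270)      /* ln(2) in Q10.22, rounded */

/* Shift a (Q10.22, must be > 0) into [1, 2) and report the shift count n,
   so that a == mantissa * 2^n. Returns the mantissa in Q10.22. */
static uint32_t q_log_reduce(uint32_t a, int *n)
{
    int shift = 0;
    if (a == 0) {                     /* log(0) is undefined; the caller has to handle it */
        *n = 0;
        return 0;
    }
    while (a >= 2u * Q_ONE) { a >>= 1; ++shift; }
    while (a < Q_ONE)       { a <<= 1; --shift; }
    *n = shift;
    return a;
}

/* The integer-shift contribution to ln(a): n * ln(2), in Q10.22. */
static int32_t q_log_int_part(int n)
{
    return (int32_t)n * LN2_Q22;
}
```

ln(a) is then q_log_int_part(n) plus the table-driven log of the returned mantissa. For a = 5.0 this gives n = 2 and a mantissa of 1.25.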
[edit 5]
And since you're storing it in Q10.22, you can better store log(65536)=11.09035488. (16 x log(2)). The "x16" means that we've got 4 more bits of precision available.
You still get the trick from edit 2, log(a/2^n) = log((a/2^n)^8)/8. Basically, this gets you a result (a + b/8 + c/64 + d/512) * 0.6931471805599 - with b,c,d in the range [0,7]. a.bcd really is an octal number. Not a surprise since we used 8 as the power. (The trick works equally well with power 2, 4 or 16.)
[edit 4]
Still had an open end. pow(x, frac(y)) is just pow(sqrt(x), 2 * frac(y)) and we have a decent 1/sqrt(x). That gives us the far more efficient approach. Say frac(y)=0.101 binary, i.e. 1/2 plus 1/8. Then that means x^0.101 is (x^1/2 * x^1/8). But x^1/2 is just sqrt(x) and x^1/8 is sqrt(sqrt(sqrt(x))). Saving one more operation, Newton-Raphson NR(x) gives us 1/sqrt(x) so we calculate 1.0/(NR(x)*NR(NR(NR(x)))). We only invert the end result, don't use the sqrt function directly.
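Here is a rough sketch of the bit-by-bit idea for the fractional exponent in Q10.22. Note the swap: for simplicity it uses a plain integer square root rather than the Newton-Raphson 1/sqrt(x) described above, so it shows the structure, not the fastest variant. isqrt64, q_sqrt, q_mul and q_pow_frac are names I made up for the example.

```c
#include <stdint.h>

#define Q_ONE ((uint32_t)1 << 22)     /* 1.0 in Q10.22 */

/* Floor of the square root of a 64-bit integer (classic bit-by-bit method). */
static uint64_t isqrt64(uint64_t v)
{
    uint64_t res = 0, bit = (uint64_t)1 << 62;
    while (bit > v) bit >>= 2;
    while (bit != 0) {
        if (v >= res + bit) { v -= res + bit; res = (res >> 1) + bit; }
        else                { res >>= 1; }
        bit >>= 2;
    }
    return res;
}

/* sqrt of a Q10.22 value: sqrt(x * 2^22) * 2^11 == isqrt(x << 22). */
static uint32_t q_sqrt(uint32_t x)
{
    return (uint32_t)isqrt64((uint64_t)x << 22);
}

static uint32_t q_mul(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 22);
}

/* x^frac for 0 <= frac < 1, where frac holds 22 fraction bits.
   Each set bit k of frac multiplies in the 2^k-th root of x. */
static uint32_t q_pow_frac(uint32_t x, uint32_t frac)
{
    uint32_t result = Q_ONE, root = x;
    for (uint32_t bit = Q_ONE >> 1; bit != 0 && frac != 0; bit >>= 1) {
        root = q_sqrt(root);          /* x^(1/2), x^(1/4), x^(1/8), ... */
        if (frac & bit) {
            result = q_mul(result, root);
            frac &= ~bit;             /* stop early once no set bits remain */
        }
    }
    return result;
}
```

For x = 16 and frac = 0.101 binary this evaluates sqrt(16) * sqrt(sqrt(sqrt(16))) = 4 * 1.41421..., roughly 5.657, with a small truncation error in the last bits.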
Below is an example C implementation of Clay S. Turner's fixed-point log base 2 algorithm[1]. The algorithm doesn't require any kind of look-up table. This can be useful on systems where memory constraints are tight and the processor lacks an FPU, such as is the case with many microcontrollers. Log base e and log base 10 are then also supported by using the property of logarithms that, for any base n:
logₙ(x) = logₘ(x) / logₘ(n)
where, for this algorithm, m equals 2.
A nice feature of this implementation is that it supports variable precision: the precision can be determined at runtime, at the expense of range. The way I've implemented it, the processor (or compiler) must be capable of doing 64-bit math for holding some intermediate results. It can be easily adapted to not require 64-bit support, but the range will be reduced.
When using these functions, x is expected to be a fixed-point value scaled according to the specified precision. For instance, if precision is 16, then x should be scaled by 2^16 (65536). The result is a fixed-point value with the same scale factor as the input. A return value of INT32_MIN represents negative infinity. A return value of INT32_MAX indicates an error and errno will be set to EINVAL, indicating that the input precision was invalid.
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include "log2fix.h"
#define INV_LOG2_E_Q1DOT31 UINT64_C(0x58b90bfc) // Inverse log base 2 of e
#define INV_LOG2_10_Q1DOT31 UINT64_C(0x268826a1) // Inverse log base 2 of 10
int32_t log2fix (uint32_t x, size_t precision)
{
    int32_t b;
    int32_t y = 0;
    if (precision < 1 || precision > 31) {
        errno = EINVAL;
        return INT32_MAX; // indicates an error
    }
    if (x == 0) {
        return INT32_MIN; // represents negative infinity
    }
    // set b only after precision has been validated, so the shift count is in range
    b = (int32_t)(1U << (precision - 1));
    // normalize x into [1, 2), accumulating the integer part of the result in y
    while (x < 1U << precision) {
        x <<= 1;
        y -= 1U << precision;
    }
    // 64-bit constant so that 2 << 31 doesn't wrap to 0 when precision is 31
    while (x >= UINT64_C(2) << precision) {
        x >>= 1;
        y += 1U << precision;
    }
    uint64_t z = x;
    // each squaring produces one more fractional bit of the result
    for (size_t i = 0; i < precision; i++) {
        z = z * z >> precision;
        if (z >= UINT64_C(2) << precision) {
            z >>= 1;
            y += b;
        }
        b >>= 1;
    }
    return y;
}
int32_t logfix (uint32_t x, size_t precision)
{
    uint64_t t;
    t = log2fix(x, precision) * INV_LOG2_E_Q1DOT31;
    return t >> 31;
}
int32_t log10fix (uint32_t x, size_t precision)
{
    uint64_t t;
    t = log2fix(x, precision) * INV_LOG2_10_Q1DOT31;
    return t >> 31;
}
The code for this implementation also lives on GitHub, along with a sample/test program that illustrates how to use this function to compute and display logarithms from numbers read from standard input.
[1] C. S. Turner, "A Fast Binary Logarithm Algorithm", IEEE Signal Processing Mag., pp. 124,140, Sep. 2010.
A good starting point is Jack Crenshaw's book, "Math Toolkit for Real-Time Programming". It has a good discussion of algorithms and implementations for various transcendental functions.
Check my fixed point sqrt implementation using only integer operations.
It was fun to invent. Quite old now.
https://groups.google.com/forum/?hl=fr%05aacf5997b615c37&fromgroups#!topic/comp.lang.c/IpwKbw0MAxw/discussion
Otherwise check the CORDIC set of algorithms. That's the way to implement all the functions you listed and the trigonometric functions.
EDIT : I published the reviewed source on GitHub here