How do I quickly compute the product of 100-bit numbers - C++

I am trying to calculate the product of two 100-bit numbers. It is supposed to mimic the behavior of multiplication of unsigned integers native to a 100-bit CPU architecture. That is, the program must calculate the actual product, modulo 2^100.
To do this QUICKLY, I have opted to implement 100-bit numbers as uint64_t[2], a two-element array of 64-bit numbers. More precisely, x = 2^64 * a + b. I need to quickly perform arithmetic and logical operations (products, bit shifts, bit rotates, xor, etc.). I have chosen this representation because it allows me to use the fast, native operations on the 64-bit constituents. For example, rotating a 128-bit 'number' is only twice as slow as rotating a 64-bit int. Boost::128bit is MUCH slower, and bitset and valarray don't have arithmetic. I COULD use the arrays for all operations except multiplication, convert the arrays to, say, Boost::128bit, and then just multiply, but that is a last resort and probably slow as hell.
I have tried the following. Let us have two such pairs of 64-bit numbers, say 2^64 a + b and 2^64 x + y. Then the product can be expressed as
2^128 ax + 2^64 (ay + bx) + by
We may ignore the first term, for it is too large. It would be almost sufficient to take the pair
ay + bx, by
to be our answer, but the more significant half is 'missing' the overflow from the b*y operation. I don't know how to calculate this without breaking the numbers b and y into four 32-bit pieces and using a divide-and-conquer approach that ensures that none of the expanded terms of the product overflows.
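A sketch of that 32-bit divide-and-conquer split, for reference (the helper name mul64_high is mine): it computes the exact high 64 bits of the 64x64-bit product b*y, which is precisely the missing overflow.
#include <cstdint>
uint64_t mul64_high(uint64_t b, uint64_t y) {
    uint64_t b_lo = b & 0xFFFFFFFF, b_hi = b >> 32;
    uint64_t y_lo = y & 0xFFFFFFFF, y_hi = y >> 32;
    uint64_t lo_lo = b_lo * y_lo; // bits  0..63 of the product
    uint64_t hi_lo = b_hi * y_lo; // bits 32..95
    uint64_t lo_hi = b_lo * y_hi; // bits 32..95
    uint64_t hi_hi = b_hi * y_hi; // bits 64..127
    // the middle accumulation cannot overflow: its maximum stays below 2^64
    uint64_t mid = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    return hi_hi + (hi_lo >> 32) + (mid >> 32);
}
The 100-bit product would then be the pair (a*y + b*x + mul64_high(b, y), b*y), with the high half masked to 36 bits.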
This is for a 'chess engine' with magic multiplication hashing on a 10x10 board

You only care about the most significant 32 bits of each number in b * y for the overflow it might produce:
struct Num {
    uint64_t low;
    uint64_t high;
    Num &operator*=(const Num &o) {
        high = low * o.high +
               high * o.low +
               (low >> 32u) * (o.low >> 32u); // <- approximates the overflow of low * o.low
        low *= o.low;
        high &= 0xFFFFFFFFF; // keeping the number 100 bits (36 high + 64 low)
        return *this;
    }
};
See if your CPU supports any native 128-bit ints, because that would be optimal (though not portable).
Good luck with your chess engine!

Come to think of it, and borrowing basket's notation: if you're hell-bent on 100 bits, the error would be smaller using 64 bits for high and only 36 for low.
You can compute the most significant 64 bits of low×low as (low >> 4u) * (o.low >> 4u), using the upper 36 bits of this as the overflow into high.
With no effort to coin names for magic literals:
Bits100 &operator*=(const Bits100 &o) {
    high = low * o.high +                         // ignore high * o.high
           high * o.low +
           (((low >> 4u) * (o.low >> 4u)) >> 28); // overflow, in most cases
           // (parentheses matter: '+' binds tighter than '>>' in C++)
    low = low * o.low & 0xFFFFFFFFF;              // keep low to 100-64 = 36 bits
    return *this;
}

Related

How to efficiently perform double/int128 conversions with AVX2?

I'm trying to make a piece of software in which users can move across a wide range (at least 1 Mly in diameter, with at least 0.1 mm of position precision). I'm thinking of a 128-bit fixed-point number to represent position. However, mathematical calculations (e.g. distance, sqrt, divide, integration) are not well suited to fixed-point (or integer) arithmetic, so I use double- or single-precision floating point for the math. (Usually this runs on the result of subtracting two int128 coordinates to get a relative distance, so the value is usually small enough not to lose too much precision, and the big differences don't need that much precision anyway.)
So I encountered a problem when implementing fixed128: how can I do fast int128-double conversion with AVX2 SIMD? (AVX-512 is not widespread, so I can't use it in this software.)
What I've tried (a bit long; feel free to skip):
I've referred to this answer: How to efficiently perform double/int64 conversions with SSE/AVX?
Wim's answer showed that to convert int64 to double efficiently, you split the integer into chunks of fewer than 52 bits, drop each chunk into the significand of a double whose exponent bits come from a magic constant, and then do floating-point math to remove the extra exponent terms.
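In scalar form, the trick for a single chunk looks roughly like this (a minimal sketch, assuming IEEE-754 binary64 and a chunk of at most 52 bits):
#include <cstdint>
#include <cstring>
double chunk_to_double(uint64_t chunk52) {
    const double magic = 0x1.0p52;           // exponent chosen so the significand's ULP is 1
    uint64_t bits;
    std::memcpy(&bits, &magic, sizeof bits); // bit pattern of the magic double
    bits |= chunk52;                         // the chunk drops into the significand
    double d;
    std::memcpy(&d, &bits, sizeof d);        // d == 2^52 + chunk, exactly
    return d - magic;                        // subtract off the exponent offset
}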
So I tried to split a uint128 (consisting of two uint64s: ilow and ihigh) into three parts:
part 1, v_lo: ilow's low 48 bits;
part 2, v_mi: ilow's high 16 bits and ihigh's low 16 bits;
part 3, v_hi: ihigh's high 48 bits.
We can get v_lo and v_hi with almost the same method as wim's uint64_to_double_fast_precise, but part 2, v_mi, became a problem: it costs 4 extra instructions, which is more than v_lo and v_hi combined (1 + 2). (My code follows.)
Maybe there's a faster way via some magical swizzle with permute/shuffle/unpackhi/unpacklo/broadcast/blend or a combination of them? These swizzle intrinsics really swizzled me.
my code for ufixed128-double conversion:
constexpr auto fix128_frac_bits = 32;
__m256d ufixed128_to_double_fast(const __m256i& ihigh, const __m256i& ilow)
{
    // constants
    __m256d magic_d_hm = _mm256_set1_pd(pow(2.0, 52 + 48 - fix128_frac_bits) + pow(2.0, 52 + 80 - fix128_frac_bits));
    __m256d magic_d_lo = _mm256_set1_pd(pow(2.0, 52 - fix128_frac_bits));
    __m256i magic_i_lo = _mm256_castpd_si256(magic_d_lo);
    __m256i magic_i_mi = _mm256_castpd_si256(_mm256_set1_pd(pow(2.0, 52 + 48 - fix128_frac_bits)));
    __m256i magic_i_hi = _mm256_castpd_si256(_mm256_set1_pd(pow(2.0, 52 + 80 - fix128_frac_bits)));
    // majik operations
    __m256i v_lo = _mm256_blend_epi16(ilow, magic_i_lo, 0b10001000);
    __m256i v_mi = _mm256_slli_epi64(ihigh, 16);
    __m256i losr48 = _mm256_srli_epi64(ilow, 48);
    v_mi = _mm256_xor_si256(v_mi, losr48);
    v_mi = _mm256_blend_epi32(magic_i_mi, v_mi, 0b01010101);
    __m256i v_hi = _mm256_srli_epi64(ihigh, 16);
    v_hi = _mm256_xor_si256(v_hi, magic_i_hi);
    // final fp
    __m256d loresult = _mm256_sub_pd(_mm256_castsi256_pd(v_lo), magic_d_lo);
    __m256d result = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_hm);
    result = _mm256_add_pd(result, _mm256_castsi256_pd(v_mi));
    result = _mm256_add_pd(result, loresult);
    return result;
}
Edit: I've successfully made a signed fixed128_to_double: just add 2.0^(127 - fix128_frac_bits) (as fp64) into the constants magic_d_hm and magic_i_hi.
But there's no fast double_to_int128 or double_to_uint128, and I have no idea how to write one. I can beat C++'s static_cast scalar conversion by doing bit operations (mask out the exponent and sign, concatenate the hidden 1, and shift left/right), but that's much slower than those magical ops and uses a lot of registers for constants.
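For reference, the scalar bit-operation route I mean looks roughly like this (a sketch only; it assumes the input is finite, normal, non-negative and fits in 128 bits):
#include <cstdint>
#include <cstring>
void double_to_uint128(double d, uint64_t &hi, uint64_t &lo) { // truncating
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    uint64_t mant = (bits & 0x000FFFFFFFFFFFFFull)
                  | 0x0010000000000000ull;      // concat the hidden 1
    int exp = int((bits >> 52) & 0x7FF) - 1075; // how far to shift mant
    hi = lo = 0;
    if (exp >= 64)      { hi = mant << (exp - 64); }               // high word only
    else if (exp > 0)   { hi = mant >> (64 - exp); lo = mant << exp; }
    else if (exp == 0)  { lo = mant; }
    else if (exp > -53) { lo = mant >> -exp; }                     // truncate the fraction
}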
Can anyone help me?
If I'm in a blind alley and there's a better method than fixed128 or double-double to represent wide-range positions, please tell me. (Except floating-origin or a floating grid (int64 + double): they are unstable for physics, expose a lot of complexity to the layers above, or are hard to accelerate with AVX.)
About double-double: I plan to compare the performance of fixed128 and double-double after heavily optimizing both, and to decide which to use after that. That's separate work I'm doing.
my current codes: https://github.com/Veloctor/Int128

Fast method to multiply integer by proper fraction without floats or overflow

My program frequently requires the following calculation to be performed:
Given:
N is a 32-bit integer
D is a 32-bit integer
abs(N) <= abs(D)
D != 0
X is a 32-bit integer of any value
Find:
X * N / D as a rounded integer that is X scaled to N/D (e.g. 10 * 2 / 3 = 7)
Obviously I could just use r = x*n/d directly, but I will often get overflow from the x*n. If I instead do r = x*(n/d), then I only get 0 or x because integer division drops the fractional component. And then there's r = x*(float(n)/d), but I can't use floats in this case.
Accuracy would be great but isn't as critical as speed and being a deterministic function (always returning the same value given the same inputs).
N and D are currently signed but I could work around them being always unsigned if it helps.
A generic function that works with any value of X (and N and D, as long as N <= D) is ideal, since this operation is used in various different ways. But I also have a specific case where the value of X is a known constant power of 2 (2048, to be precise), and just getting that specific call sped up would be a big help.
Currently I am accomplishing this using 64-bit multiply and divide to avoid overflow (essentially int multByProperFraction(int x, int n, int d) { return (__int64)x * n / d; } but with some asserts and extra bit fiddling for rounding instead of truncating).
Unfortunately, my profiler is reporting the 64-bit divide function as taking up way too much CPU (this is a 32-bit application). I've tried to reduce how often I need to do this calculation but am running out of ways around it, so I'm trying to figure out a faster method, if it is even possible. In the specific case where X is a constant 2048, I use a bit shift instead of multiply but that doesn't help much.
Tolerate imprecision and use the 16 most significant bits of n, d and x.
Algorithm
while (|n| > 0xffff) n/2, sh++
while (|x| > 0xffff) x/2, sh++
while (|d| > 0xffff) d/2, sh--
r = n*x/d // A 16x16 to 32 multiply followed by a 32/16-bit divide.
shift r by sh.
When the 64-bit divide is expensive, the pre/post-processing here may be worth it to get down to a 32-bit divide - which will certainly be the big chunk of the CPU time.
If the compiler cannot be coaxed into doing a 32-bit/16-bit divide, then skip the while (|d| > 0xffff) d/2, sh-- step and do a 32/32 divide.
Use unsigned math where possible.
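In code, the algorithm above might look like this for unsigned inputs (a sketch, assuming d != 0 and n <= d; the function name is mine):
uint32_t mulFrac16(uint32_t x, uint32_t n, uint32_t d) {
    int sh = 0;
    while (n > 0xFFFF) { n >>= 1; sh++; }
    while (x > 0xFFFF) { x >>= 1; sh++; }
    while (d > 0xFFFF) { d >>= 1; sh--; }
    uint32_t r = x * n / d; // 16x16 to 32 multiply, 32/16-bit divide
    return sh >= 0 ? r << sh : r >> -sh;
}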
The basic correct approach to this is simply (uint64_t)x*n/d. That's optimal assuming d is variable and unpredictable. But if d is constant or changes infrequently, you can pre-generate constants such that exact division by d can be performed as a multiplication followed by a bitshift. A good description of the algorithm, which is roughly what GCC uses internally to transform division by a constant into multiplication, is here:
http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html
I'm not sure how easy it is to make it work for a "64/32" division (i.e. dividing the result of (uint64_t)x*n), but you should be able to just break it up into high and low parts if nothing else.
Note that these algorithms are also available as libdivide.
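For example, with libdivide's C++ API the precomputed constants are explicit (a sketch; rebuild the divider only when d changes):
#include "libdivide.h"
#include <cstdint>
int32_t scaleByFraction(int32_t x, int32_t n,
                        const libdivide::divider<int64_t> &d) {
    return (int32_t)((int64_t)x * n / d); // operator/ does multiply + shift, no div
}
// usage:
// libdivide::divider<int64_t> fast_d(d); // precompute once per distinct d
// int32_t r = scaleByFraction(x, n, fast_d);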
I've now benchmarked several possible solutions, including weird/clever ones from other sources like combining 32-bit div & mod & add or using peasant math, and here are my conclusions:
First, if you are only targeting Windows and using VSC++, just use MulDiv(). It is quite fast (faster than directly using 64-bit variables in my tests) while still being just as accurate and rounding the result for you. I could not find any superior method to do this kind of thing on Windows with VSC++, even taking into account restrictions like unsigned-only and N <= D.
However, in my case having a function with deterministic results even across platforms is even more important than speed. On another platform I was using as a test, the 64-bit divide is much, much slower than the 32-bit one when using the 32-bit libraries, and there is no MulDiv() to use. The 64-bit divide on this platform takes ~26x as long as a 32-bit divide (yet the 64-bit multiply is just as fast as the 32-bit version...).
So if you have a case like mine, I will share the best results I got, which turned out to be just optimizations of chux's answer.
Both of the methods I share below make use of the following function (though the compiler-specific intrinsics only actually helped speed-wise with MSVC on Windows):
inline u32 bitsRequired(u32 val)
{
#ifdef _MSC_VER // _BitScanReverse needs <intrin.h>
    DWORD r = 0;
    _BitScanReverse(&r, val | 1);
    return r + 1;
#elif defined(__GNUC__) || defined(__clang__)
    return 32 - __builtin_clz(val | 1);
#else
    int r = 1;
    while (val >>= 1) ++r;
    return r;
#endif
}
Now, if x is a constant that's 16-bit in size or smaller and you can pre-compute the bits required, I found the best results in speed and accuracy from this function:
u32 multConstByPropFrac(u32 x, u32 nMaxBits, u32 n, u32 d)
{
    //assert(nMaxBits == 32 - bitsRequired(x));
    //assert(n <= d);
    const int bitShift = bitsRequired(n) - nMaxBits;
    if( bitShift > 0 )
    {
        n >>= bitShift;
        d >>= bitShift;
    }
    // Remove the + d/2 part if you don't need rounding
    return (x * n + d/2) / d;
}
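For instance, with the constant x = 2048 from the question (a usage sketch: bitsRequired(2048) is 12, so nMaxBits is 32 - 12 = 20):
u32 r = multConstByPropFrac(2048, 20, n, d);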
On the platform with the slow 64-bit divide, the above function ran ~16.75x as fast as return ((u64)x * n + d/2) / d; with an average accuracy of 99.999981% when tested with a million or so randomized inputs, roughly half of which would normally have overflowed. (Accuracy here compares the error in the return value to the range of x: returning +/-1 from the expected value when x is 2048 would be 100 - (1/2048 * 100) = 99.95% accurate.) Worst-case accuracy was 99.951172%.
For the general use case, I found the best results from the following (and without needing to restrict N <= D to boot!):
u32 scaleToFraction(u32 x, u32 n, u32 d)
{
    u32 bits = bitsRequired(x);
    int bitShift = bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    int sh = bitShift;
    x >>= bitShift;
    bits = bitsRequired(n);
    bitShift = bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    sh += bitShift;
    n >>= bitShift;
    bits = bitsRequired(d);
    bitShift = bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    sh -= bitShift;
    d >>= bitShift;
    // Remove the + d/2 part if you don't need rounding
    u32 r = (x * n + d/2) / d;
    if( sh < 0 )
        r >>= (-sh);
    else //if( sh > 0 )
        r <<= sh;
    return r;
}
On the platform with the slow 64-bit divide, the above function ran ~18.5x as fast as using 64-bit variables and with 99.999426% average and 99.947479% worst-case accuracy.
I was able to get more speed or more accuracy by messing with the shifting, such as trying to not shift all the way down to 16-bit if it wasn't strictly necessary, but any increase in speed came at a high cost in accuracy and vice versa.
None of the other methods I tested came even close to the same speed or accuracy, most being slower than just using the 64-bit method or having huge loss in precision, so not worth going into.
Obviously, no guarantee that anyone else will get similar results on other platforms!
EDIT: Replaced some bit-twiddling hacks with plain code that actually ran faster anyway by letting the compiler do its job.

Operating on two shorts at once by combining them into an integer

I'm using the following code to map two signed 16-bit integers to the upper and lower 16 bits of an unsigned 32-bit integer.
inline uint32_t to_score(int16_t mg, int16_t eg) {
    return ((1u * mg) << 16 | (eg & 0xFFFF));
}
inline int16_t extract_mg(uint32_t score) {
    return int16_t(score >> 16);
}
inline int16_t extract_eg(uint32_t score) {
    return int16_t(score & 0xFFFF);
}
I need to perform various calculations on both the mg and eg parts simultaneously, before interpolating the two parts at the end of a function.
As I understand it, as long as there is no overflow, it should be safe to add two uint32_ts created by to_score, and then extract the int16_ts to find the results of the individual calculations: i.e. the results I would get if I added the values for mg and eg separately.
I'm not sure whether this assumption holds if either mg or eg is negative, or whether this method can be used for subtraction, multiplication and/or division.
Which operations can I expect to function correctly? Are there alternative ways of representing two integers which can be added/subtracted/multiplied quickly?
There will be a problem with a carry going from the low half into the high half, but it can be avoided with extra operations, as detailed at, for example, chessprogramming.org/SIMD_and_SWAR_Techniques:
z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
Where in this case H = 0x80008000.
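Wrapped up as a function, the formula looks like this (a minimal sketch; H masks the top bit of each 16-bit half):
#include <cstdint>
uint32_t swar_add(uint32_t x, uint32_t y) {
    const uint32_t H = 0x80008000u;
    // add the low 15 bits of each half, then fix up the top bits with xor,
    // so no carry ever crosses from the low half into the high half
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}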
As another alternative, it could be done with two additions, but with optimized extraction/recombination:
// low half addition, leaving upper half corrupted but it will be ignored
l = x + y
// high half addition, adding 0 to the bottom so no carry
h = x + (y & 0xFFFF0000)
// recombine
z = (l & 0xFFFF) | (h & 0xFFFF0000)
Subtraction is a minor variation on addition.
Multiplication unfortunately cares about absolute bit-positions, so values have to be moved (shifted) to their notional position for it to work. Actual SIMD can still be used though, such as _mm_mullo_epi16 with SSE2.
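A sketch of that SSE2 route, scaling both halves of one packed score at once (the function name is mine; _mm_mullo_epi16 keeps the low 16 bits of each lane's product):
#include <emmintrin.h>
#include <cstdint>
uint32_t scale_both(uint32_t score, int16_t factor) {
    __m128i v = _mm_cvtsi32_si128((int)score); // 16-bit lanes 0..1 hold eg, mg
    __m128i f = _mm_set1_epi16(factor);
    v = _mm_mullo_epi16(v, f);                 // independent per-lane multiplies
    return (uint32_t)_mm_cvtsi128_si32(v);
}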
C++ signed integers are two's complement; this is standardized as of C++20, and in practice you may already assume it.
Some cases of addition and subtraction will work: those that cause none of the following - eg overflowing, mg overflowing, or mg changing sign.
The optimization does not make much sense.
If there's a larger array, you can try to get your operations vectorized with the proper SIMD instructions, if they are available for your platform, by enabling compiler optimizations or by using intrinsics (_mm_adds_pi16 might be the one you need).
If you have just two integers, just compute them one by one.

C++: Binary to Decimal Conversion

I am trying to convert a binary array to decimal in the following way:
uint8_t array[8] = {1,1,1,1,0,1,1,1};
int decimal = 0;
for(int i = 0; i < 8; i++)
    decimal = (decimal << 1) + array[i];
Actually I have to convert a 64-bit binary array to decimal, and I have to do it a million times.
Can anybody help me - is there any faster way to do the above, or is this one fine?
Your method is adequate. To call it nice, I would just not mix bitwise operations with the "mathematical" way of converting to decimal, i.e. use either
decimal = decimal << 1 | array[i];
or
decimal = decimal * 2 + array[i];
It is important, before attempting any optimisation, to profile the code. Time it, look at the code being generated, and optimise only when you understand what is going on.
And as already pointed out, the best optimisation is to not do something, but to make a higher level change that removes the need.
However...
Most changes you might want to trivially make here, are likely to be things the compiler has already done (a shift is the same as a multiply to the compiler). Some may actually prevent the compiler from making an optimisation (changing an add to an or will restrict the compiler - there are more ways to add numbers, and only you know that in this case the result will be the same).
Pointer arithmetic may be better, but the compiler is not stupid - it ought to already be producing decent code for dereferencing the array, so you need to check that you have not in fact made matters worse by introducing an additional variable.
In this case the loop count is well defined and limited, so unrolling probably makes sense.
Further more it depends on how dependent you want the result to be on your target architecture. If you want portability, it is hard(er) to optimise.
For example, the following produces better code here:
unsigned int x0 = *(unsigned int *)array;
unsigned int x1 = *(unsigned int *)(array+4);
int decimal = ((x0 * 0x8040201) >> 20) + ((x1 * 0x8040201) >> 24);
I could probably also roll a 64-bit version that did 8 bits at a time instead of 4.
But it is very definitely not portable code. I might use that locally if I knew what I was running on and I just wanted to crunch numbers quickly. But I probably wouldn't put it in production code. Certainly not without documenting what it did, and without the accompanying unit test that checks that it actually works.
The binary 'compression' can be generalized as a weighted-sum problem -- and for that there are some interesting techniques.
X mod 255 essentially sums all the independent 8-bit bytes, since 256 mod 255 = 1, so every byte has weight 1.
X mod 254 sums each byte with a doubling weight, since 1 mod 254 = 1, 256 mod 254 = 2, 256*256 mod 254 = 2*2 = 4, etc.
If the encoding were big-endian, then *(unsigned long long *)array % 254 would produce a weighted sum (with a truncated range of 0..253). Then removing the value with weight 2 and adding it manually would produce the correct result:
uint64_t a = *(uint64_t *)array;
return (a & ~256) % 254 + ((a>>9) & 2);
Other mechanism to get the weight is to premultiply each binary digit by 255 and masking the correct bit:
uint64_t a = (*(uint64_t *)array * 255) & 0x0102040810204080ULL; // little endian
uint64_t a = (*(uint64_t *)array * 255) & 0x8040201008040201ULL; // big endian
In both cases one can then take the remainder modulo 255 (and correct the digit that now has weight 1):
return (a & 0x00ffffffffffffff) % 255 + (a>>56); // little endian, or
return (a & ~1) % 255 + (a&1);                   // big endian
For the sceptical mind: I actually did profile the modulus version to be (slightly) faster than iteration on x64.
To continue from JasonD's answer, parallel bit selection can be applied iteratively.
But first, expressing the equation in full form helps the compiler remove the artificial dependency created by the accumulating iterative approach:
ret = ((a[0]<<7) | (a[1]<<6) | (a[2]<<5) | (a[3]<<4) |
(a[4]<<3) | (a[5]<<2) | (a[6]<<1) | (a[7]<<0));
vs.
HI = *(uint32_t *)array, LO = *(uint32_t *)&array[4];
LO |= (HI << 4);  // the HI dword has a weight of 16 relative to the LO bytes
LO |= (LO >> 14); // the high word has 4x the weight of the low word
LO |= (LO >> 9);  // the high byte has 2x the weight of the lower byte
return LO & 255;
One more interesting technique would be to utilize crc32 as a compression function; then it just happens that the result would be LookUpTable[crc32(array) & 255]; as there is no collision with this given small subset of 256 distinct arrays. However to apply that, one has already chosen the road of even less portability and could as well end up using SSE intrinsics.
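A sketch of that idea with SSE4.2's hardware CRC (assuming lut has been filled once by hashing all 256 distinct arrays; the no-collision property is as claimed above):
#include <nmmintrin.h> // SSE4.2: _mm_crc32_u64
#include <cstdint>
#include <cstring>
static uint8_t lut[256]; // fill once: lut[hash of arr] = packed value of arr
uint8_t pack_via_crc32(const uint8_t array[8]) {
    uint64_t a;
    std::memcpy(&a, array, sizeof a);      // reinterpret the 8 bytes as one word
    return lut[_mm_crc32_u64(0, a) & 255]; // hash, then look up the packed value
}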
You could use std::accumulate (from <numeric>) with a doubling-and-adding binary operation:
int doubleSumAndAdd(const int& sum, const int& next) {
    return (sum * 2) + next;
}
int decimal = std::accumulate(array, array + ARRAY_SIZE, 0, doubleSumAndAdd);
Like the OP's loop, this treats array[0] as the most significant bit.
Try this; I have converted binary numbers of up to 1020 bits:
#include <sstream>
#include <string>
#include <math.h>
#include <iostream>
using namespace std;
long binary_decimal(string num) /* Function to convert binary to decimal */
{
    long dec = 0, n = 1, exp = 0;
    string bin = num;
    if(bin.length() > 1020){
        cout << "Binary digit too large" << endl;
    }
    else {
        for(int i = bin.length() - 1; i > -1; i--)
        {
            n = pow(2, exp++);
            if(bin.at(i) == '1')
                dec += n;
        }
    }
    return dec;
}
In theory this method would work for a binary number of any length, though in practice dec is a long and overflows once the value exceeds its range, and pow loses exact integer precision past 2^53.

Efficient (bit-wise) division by 24

Coding for an embedded platform with no integer divide (nor multiply) instruction, is there a quick way to perform a 'divide by 24'?
Multiply by 24 is simply
int a;
int b = (a << 4) + (a << 3); // a*16 + a*8
But division? It's a really simple divisor, with only two bits set?
If you don't need the result to be bit-exact, then you could consider multiplying by 1/24:
uint16_t a = ...;
uint32_t b = (uint32_t)a * (65536L / 24);
uint16_t c = b / 65536;
Of course, if your platform doesn't have a hardware multiplier, then you will need to optimise that multiplication. As it turns out, (65536 / 24) is approximately equal to 2730, which is 101010101010 in binary. So that multiply can be achieved with 3 shifts and adds.
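For example, one such decomposition (a sketch; it uses the factorization 2730 = 5 * 273 * 2, where 5 = 101b and 273 = 100010001b, so only three adds are needed):
#include <cstdint>
uint32_t mul2730(uint32_t a) {            // a * 2730 with no multiplier
    uint32_t x = a + (a << 2);            // a * 5   (binary 101)
    uint32_t y = x + (x << 4) + (x << 8); // x * 273 (binary 100010001)
    return y << 1;                        // * 2  ->  a * 2730
}
// then: uint16_t c = (uint16_t)(mul2730(a) >> 16); // approximately a / 24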
Well, first of all you can use the fact that 24 = 8 * 3, so you can divide by 8 using a shift once again: a / 8 == a >> 3. Afterwards you have to divide the result by 3. A discussion of how to do that efficiently can be found here. Of course, if you are coding in C (or any other higher-level language, really), it might be worthwhile to simply look at the compiler output first; it is possible that the compiler already has some tricks for this.
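For the divide-by-3 step on a multiplier-less machine, the well-known shift-and-add routine from Hacker's Delight (divu3) is one option; a sketch:
#include <cstdint>
uint32_t divu3(uint32_t n) {          // n / 3 using only shifts and adds
    uint32_t q = (n >> 2) + (n >> 4); // q ~= n * 0.0101 (binary)
    q += q >> 4;                      // extend the repeating 01 pattern
    q += q >> 8;
    q += q >> 16;
    uint32_t r = n - ((q << 1) + q);  // remainder, 0..15
    return q + (((r << 3) + (r << 1) + r) >> 5); // add floor(r/3) as 11*r/32
}
// then a / 24 == divu3(a >> 3)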