Fast bignum square computation

Fast bignum square computation - c++

To speed up my bignum divisons I need to speed up operation y = x^2 for bigints which are represented as dynamic arrays of unsigned DWORDs. To be clear:
DWORD x[n+1] = { LSW, ......, MSW };
where n+1 is number of used DWORDs
so value of number x = x[0]+x[1]<<32 + ... x[N]<<32*(n)
The question is: How do I compute y = x^2 as fast as possible without precision loss?
- Using C++ and with integer arithmetics (32bit with Carry) at disposal.
My current approach is applying multiplication y = x*x and avoid multiple multiplications.
For example:
x = x[0] + x[1]<<32 + ... x[n]<<32*(n)
For simplicity, let me rewrite it:
x = x0+ x1 + x2 + ... + xn
where index represent the address inside the array, so:
y = x*x
y = (x0 + x1 + x2 + ...xn)*(x0 + x1 + x2 + ...xn)
y = x0*(x0 + x1 + x2 + ...xn) + x1*(x0 + x1 + x2 + ...xn) + x2*(x0 + x1 + x2 + ...xn) + ...xn*(x0 + x1 + x2 + ...xn)
y0 = x0*x0
y1 = x1*x0 + x0*x1
y2 = x2*x0 + x1*x1 + x0*x2
y3 = x3*x0 + x2*x1 + x1*x2
...
y(2n-3) = xn(n-2)*x(n ) + x(n-1)*x(n-1) + x(n )*x(n-2)
y(2n-2) = xn(n-1)*x(n ) + x(n )*x(n-1)
y(2n-1) = xn(n )*x(n )
After a closer look, it is clear that almost all xi*xj appears twice (not the first and last one) which means that N*N multiplications can be replaced by (N+1)*(N/2) multiplications. P.S. 32bit*32bit = 64bit so the result of every mul+add operation is handled as 64+1 bit.
Is there a better way to compute this fast? All I found during searches were sqrts algorithms, not sqr...
Fast sqr
!!! Beware that all numbers in my code are MSW first,... not as in above test (there are LSW first for simplicity of equations, otherwise it would be an index mess).
Current functional fsqr implementation
void arbnum::sqr(const arbnum &x)
{
// O((N+1)*N/2)
arbnum c;
DWORD h, l;
int N, nx, nc, i, i0, i1, k;
c._alloc(x.siz + x.siz + 1);
nx = x.siz - 1;
nc = c.siz - 1;
N = nx + nx;
for (i=0; i<=nc; i++)
c.dat[i]=0;
for (i=1; i<N; i++)
for (i0=0; (i0<=nx) && (i0<=i); i0++)
{
i1 = i - i0;
if (i0 >= i1)
break;
if (i1 > nx)
continue;
h = x.dat[nx-i0];
if (!h)
continue;
l = x.dat[nx-i1];
if (!l)
continue;
alu.mul(h, l, h, l);
k = nc - i;
if (k >= 0)
alu.add(c.dat[k], c.dat[k], l);
k--;
if (k>=0)
alu.adc(c.dat[k], c.dat[k],h);
k--;
for (; (alu.cy) && (k>=0); k--)
alu.inc(c.dat[k]);
}
c.shl(1);
for (i = 0; i <= N; i += 2)
{
i0 = i>>1;
h = x.dat[nx-i0];
if (!h)
continue;
alu.mul(h, l, h, h);
k = nc - i;
if (k >= 0)
alu.add(c.dat[k], c.dat[k],l);
k--;
if (k>=0)
alu.adc(c.dat[k], c.dat[k], h);
k--;
for (; (alu.cy) && (k >= 0); k--)
alu.inc(c.dat[k]);
}
c.bits = c.siz<<5;
c.exp = x.exp + x.exp + ((c.siz - x.siz - x.siz)<<5) + 1;
c.sig = sig;
*this = c;
}
Use of Karatsuba multiplication
(thanks to Calpis)
I implemented Karatsuba multiplication but the results are massively slower even than by use of simple O(N^2) multiplication, probably because of that horrible recursion that I can't see any way to avoid. It's trade-off must be at really large numbers (bigger than hundreds of digits) ... but even then there are a lot of memory transfers. Is there a way to avoid recursion calls (non-recursive variant,... Almost all recursive algorithms can be done that way). Still, I will try to tweak things up and see what happens (avoid normalizations, etc..., also it could be some silly mistake in the code). Anyway, after solving Karatsuba for case x*x there is not much performance gain.
Optimized Karatsuba multiplication
Performance test for y = x^2 looped 1000x times, 0.9 < x < 1 ~ 32*98 bits:
x = 0.98765588997654321000000009876... | 98*32 bits
sqr [ 213.989 ms ] ... O((N+1)*N/2) fast sqr
mul1[ 363.472 ms ] ... O(N^2) classic multiplication
mul2[ 349.384 ms ] ... O(3*(N^log2(3))) optimized Karatsuba multiplication
mul3[ 9345.127 ms] ... O(3*(N^log2(3))) unoptimized Karatsuba multiplication
x = 0.98765588997654321000... | 195*32 bits
sqr [ 883.01 ms ]
mul1[ 1427.02 ms ]
mul2[ 1089.84 ms ]
x = 0.98765588997654321000... | 389*32 bits
sqr [ 3189.19 ms ]
mul1[ 5553.23 ms ]
mul2[ 3159.07 ms ]
After optimizations for Karatsuba, the code is massively faster than before. Still, for smaller numbers it is slightly less than half speed of my O(N^2) multiplication. For bigger numbers, it is faster with the ratio given by the complexities of Booth multiplications. The threshold for multiplication is around 32*98 bits and for sqr around 32*389 bits, so if the sum of input bits cross this threshold then Karatsuba multiplication will be used for speeding up multiplication and that goes similar for sqr too.
BTW, optimizations included:
Minimize heap trashing by too-big recursion argument
Avoidance of any bignum aritmetics (+,-) 32-bit ALU with carry is used instead.
Ignoring 0*y or x*0 or 0*0 cases
Reformatting input x,y number sizes to power of two to avoid reallocating
Implement modulo multiplication for z1 = (x0 + x1)*(y0 + y1) to minimize recursion
Modified Schönhage-Strassen multiplication to sqr implementation
I have tested use of FFT and NTT transforms to speed up sqr computation. The results are these:
FFT
Lose accuracy and therefore need high precision complex numbers. This actually slows things down considerably so no speedup is present. The result is not precise (can be wrongly rounded)so FFT is unusable (for now)
NTT
NTT is finite field DFT and so no accuracy loss occurs. It need modular arithmetics on unsigned integers: modpow, modmul, modadd and modsub.
I use DWORD (32bit unsigned integer numbers). The NTT input/otput vector size is limited because of overflow issues!!! For 32-bit modular arithmetics, N is limited to (2^32)/(max(input[])^2) so bigint must be divided to smaller chunks (I use BYTES so maximum size of bigint processed is
(2^32)/((2^8)^2) = 2^16 bytes = 2^14 DWORDs = 16384 DWORDs)
The sqr uses only 1xNTT + 1xINTT instead of 2xNTT + 1xINTT for multiplication but NTT usage is too slow and the threshold number size is too large for practical use in my implementation (for mul and also for sqr).
Is possible that is even over the overflow limit so 64-bit modular arithmetics should be used which can slow things down even more. So NTT is for my purposes also unusable too.
Some measurements:
a = 0.98765588997654321000 | 389*32 bits
looped 1x times
sqr1[ 3.177 ms ] fast sqr
sqr2[ 720.419 ms ] NTT sqr
mul1[ 5.588 ms ] simpe mul
mul2[ 3.172 ms ] karatsuba mul
mul3[ 1053.382 ms ] NTT mul
My implementation:
void arbnum::sqr_NTT(const arbnum &x)
{
// O(N*log(N)*(log(log(N)))) - 1x NTT
// Schönhage-Strassen sqr
// To prevent NTT overflow: n <= 48K * 8 bit -> result siz <= 12K * 32 bit -> x.siz + y.siz <= 12K!!!
int i, j, k, n;
int s = x.sig*x.sig, exp0 = x.exp + x.exp - ((x.siz+x.siz)<<5) + 2;
i = x.siz;
for (n = 1; n < i; n<<=1)
;
if (n + n > 0x3000) {
_error(_arbnum_error_TooBigNumber);
zero();
return;
}
n <<= 3;
DWORD *xx, *yy, q, qq;
xx = new DWORD[n+n];
#ifdef _mmap_h
if (xx)
mmap_new(xx, (n+n) << 2);
#endif
if (xx==NULL) {
_error(_arbnum_error_NotEnoughMemory);
zero();
return;
}
yy = xx + n;
// Zero padding (and split DWORDs to BYTEs)
for (i--, k=0; i >= 0; i--)
{
q = x.dat[i];
xx[k] = q&0xFF; k++; q>>=8;
xx[k] = q&0xFF; k++; q>>=8;
xx[k] = q&0xFF; k++; q>>=8;
xx[k] = q&0xFF; k++;
}
for (;k<n;k++)
xx[k] = 0;
//NTT
fourier_NTT ntt;
ntt.NTT(yy,xx,n); // init NTT for n
// Convolution
for (i=0; i<n; i++)
yy[i] = modmul(yy[i], yy[i], ntt.p);
//INTT
ntt.INTT(xx, yy);
//suma
q=0;
for (i = 0, j = 0; i<n; i++) {
qq = xx[i];
q += qq&0xFF;
yy[n-i-1] = q&0xFF;
q>>=8;
qq>>=8;
q+=qq;
}
// Merge WORDs to DWORDs and copy them to result
_alloc(n>>2);
for (i = 0, j = 0; i<siz; i++)
{
q =(yy[j]<<24)&0xFF000000; j++;
q |=(yy[j]<<16)&0x00FF0000; j++;
q |=(yy[j]<< 8)&0x0000FF00; j++;
q |=(yy[j] )&0x000000FF; j++;
dat[i] = q;
}
#ifdef _mmap_h
if (xx)
mmap_del(xx);
#endif
delete xx;
bits = siz<<5;
sig = s;
exp = exp0 + (siz<<5) - 1;
// _normalize();
}
Conclusion
For smaller numbers, it is the best option my fast sqr approach, and after
threshold Karatsuba multiplication is better. But I still think there should be something trivial which we have overlooked. Has anyone other ideas?
NTT optimization
After massively-intense optimizations (mostly NTT): Stack Overflow question Modular arithmetics and NTT (finite field DFT) optimizations.
Some values have changed:
a = 0.98765588997654321000 | 1553*32bits
looped 10x times
mul2[ 28.585 ms ] Karatsuba mul
mul3[ 26.311 ms ] NTT mul
So now NTT multiplication is finally faster than Karatsuba after about 1500*32-bit threshold.
Some measurements and bug spotted
a = 0.99991970486 | 1553*32 bits
looped: 10x
sqr1[ 58.656 ms ] fast sqr
sqr2[ 13.447 ms ] NTT sqr
mul1[ 102.563 ms ] simpe mul
mul2[ 28.916 ms ] Karatsuba mul Error
mul3[ 19.470 ms ] NTT mul
I found out that my Karatsuba (over/under)flows the LSB of each DWORD segment of bignum. When I have researched, I will update the code...
Also, after further NTT optimizations the thresholds changed, so for NTT sqr it is 310*32 bits = 9920 bits of operand, and for NTT mul it is 1396*32 bits = 44672 bits of result (sum of bits of operands).
Karatsuba code repaired thanks to #greybeard
//---------------------------------------------------------------------------
void arbnum::_mul_karatsuba(DWORD *z, DWORD *x, DWORD *y, int n)
{
// Recursion for Karatsuba
// z[2n] = x[n]*y[n];
// n=2^m
int i;
for (i=0; i<n; i++)
if (x[i]) {
i=-1;
break;
} // x==0 ?
if (i < 0)
for (i = 0; i<n; i++)
if (y[i]) {
i = -1;
break;
} // y==0 ?
if (i >= 0) {
for (i = 0; i < n + n; i++)
z[i]=0;
return;
} // 0.? = 0
if (n == 1) {
alu.mul(z[0], z[1], x[0], y[0]);
return;
}
if (n< 1)
return;
int n2 = n>>1;
_mul_karatsuba(z+n, x+n2, y+n2, n2); // z0 = x0.y0
_mul_karatsuba(z , x , y , n2); // z2 = x1.y1
DWORD *q = new DWORD[n<<1], *q0, *q1, *qq;
BYTE cx,cy;
if (q == NULL) {
_error(_arbnum_error_NotEnoughMemory);
return;
}
#define _add { alu.add(qq[i], q0[i], q1[i]); for (i--; i>=0; i--) alu.adc(qq[i], q0[i], q1[i]); } // qq = q0 + q1 ...[i..0]
#define _sub { alu.sub(qq[i], q0[i], q1[i]); for (i--; i>=0; i--) alu.sbc(qq[i], q0[i], q1[i]); } // qq = q0 - q1 ...[i..0]
qq = q;
q0 = x + n2;
q1 = x;
i = n2 - 1;
_add;
cx = alu.cy; // =x0+x1
qq = q + n2;
q0 = y + n2;
q1 = y;
i = n2 - 1;
_add;
cy = alu.cy; // =y0+y1
_mul_karatsuba(q + n, q + n2, q, n2); // =(x0+x1)(y0+y1) mod ((2^N)-1)
if (cx) {
qq = q + n;
q0 = qq;
q1 = q + n2;
i = n2 - 1;
_add;
cx = alu.cy;
}// += cx*(y0 + y1) << n2
if (cy) {
qq = q + n;
q0 = qq;
q1 = q;
i = n2 -1;
_add;
cy = alu.cy;
}// +=cy*(x0+x1)<<n2
qq = q + n; q0 = qq; q1 = z + n; i = n - 1; _sub; // -=z0
qq = q + n; q0 = qq; q1 = z; i = n - 1; _sub; // -=z2
qq = z + n2; q0 = qq; q1 = q + n; i = n - 1; _add; // z1=(x0+x1)(y0+y1)-z0-z2
DWORD ccc=0;
if (alu.cy)
ccc++; // Handle carry from last operation
if (cx || cy)
ccc++; // Handle carry from before last operation
if (ccc)
{
i = n2 - 1;
alu.add(z[i], z[i], ccc);
for (i--; i>=0; i--)
if (alu.cy)
alu.inc(z[i]);
else
break;
}
delete[] q;
#undef _add
#undef _sub
}
//---------------------------------------------------------------------------
void arbnum::mul_karatsuba(const arbnum &x, const arbnum &y)
{
// O(3*(N)^log2(3)) ~ O(3*(N^1.585))
// Karatsuba multiplication
//
int s = x.sig*y.sig;
arbnum a, b;
a = x;
b = y;
a.sig = +1;
b.sig = +1;
int i, n;
for (n = 1; (n < a.siz) || (n < b.siz); n <<= 1)
;
a._realloc(n);
b._realloc(n);
_alloc(n + n);
for (i=0; i < siz; i++)
dat[i]=0;
_mul_karatsuba(dat, a.dat, b.dat, n);
bits = siz << 5;
sig = s;
exp = a.exp + b.exp + ((siz-a.siz-b.siz)<<5) + 1;
// _normalize();
}
//---------------------------------------------------------------------------
My arbnum number representation:
// dat is MSDW first ... LSDW last
DWORD *dat; int siz,exp,sig,bits;
dat[siz] is the mantisa. LSDW means least significant DWORD.
exp is the exponent of MSB of dat[0]
The first nonzero bit is present in the mantissa!!!
// |-----|---------------------------|---------------|------|
// | sig | MSB mantisa LSB | exponent | bits |
// |-----|---------------------------|---------------|------|
// | +1 | 0.(0 ... 0) | 2^0 | 0 | +zero
// | -1 | 0.(0 ... 0) | 2^0 | 0 | -zero
// |-----|---------------------------|---------------|------|
// | +1 | 1.(dat[0] ... dat[siz-1]) | 2^exp | n | +number
// | -1 | 1.(dat[0] ... dat[siz-1]) | 2^exp | n | -number
// |-----|---------------------------|---------------|------|
// | +1 | 1.0 | 2^+0x7FFFFFFE | 1 | +infinity
// | -1 | 1.0 | 2^+0x7FFFFFFE | 1 | -infinity
// |-----|---------------------------|---------------|------|

If I understand your algorithm correctly, it seems O(n^2) where n is the number of digits.
Have you looked at Karatsuba Algorithm?
It speeds up multiplication using the divide and conquer approach. It may be worth taking a look at.

Great question you have, thanks!
Decided to implement from scratch a huge C++ solution for you, based on Number Theoretic Transform (NTT) and Discrete Fourier Transform.
To tell in advance, my FFT/NTT code achieves 330x speedup on 2-core old laptop compared to naive school-grade multiplication for the case of array size 2^16 32-bit words. Even bigger arrays above 2^20 in size will give millions times speedup.
Squaring a number with 2^22 words of 32-bit size (i.e. 4 Million words) takes 7 seconds on my NTT and 13 seconds on my FFT, on old 2GHz 2-core laptop with SSE2 only.
To remind, FFT and NTT give multiplication time O(N * Log(N)), while naive school grade algorithm has O(N^2) time. That's why I have so huge speedup described in previous paragraph.
Both together with code are well described in this article, mainly I was inspired by this article when writing below code. Another good article is Nayuki's NTT article.
I was convinced that for quite large numbers these two transforms will beat any other methods, like Karatsuba.
Besides basic approach described in article I also did dozens of optimizations:
For NTT computed set of my own primitive roots and modulos. And used biggest one closest to 2^62.
Used multi threading almost on every loop of NTT and FFT computation. Through OpenMP.
For squaring definitely I used 2 transforms instead of 3 (used for multiply). This gives 33% speed boost.
For NTT used Montgomery Reduction in all arrays when computing modulus. This gave about 2x-3x speedup.
Used constexpr functions and values and templated programming everywhere where I can. Reduction of runtime values to compile time values where possible gives a lot of speedup.
Re-designed swap/shuffle function that is used at every start of FFT/NTT transforms. Used precomputed table and caching for re-using previous results. Also did swapping in blocks to make cache-friendly reads/writes. Also bit twidling is done not in a loop but using pre-computed bit-table.
Inside main loop of transform factored out computation of W multiplier into separate loop together with pre-computation/caching. This gave about 2x speedup.
Used Intel SIMD instruction sets, currently SSE2 and AVX. These are used only for FFT, as NTT uses 128-bit integer division and multiplication and add/sub-with-carry, these are not available in SIMD. Also for SIMD in FFT I designed loop unrolling with special cache-friendly storage of complex numbers in std::array<>.
Did time/performance measurement of NTT/FFT multiplication versus naive.
Did analysis of error rate inside FFT. To remind NTT has no errors at all.
My code is self-contained, if you compile+run it then it will run tests measuring speed. Inside test function you can see how to use my library. Test runs FFT/NTT/Naive multiplication, measures time and compares if all multiplication results are correct, i.e. equal to naive version.
Note: No matter how I struggled to speedup FFT through SIMD, yet my NTT is so optimized that it is 1.3-1.8x times faster than FFT. As you know FFT gives errors which grow with size of big number. And if to take into account a fact that my NTT got faster then NTT is the only option for you!
It appeared that FFT can be used only for array sizes like 2^16 32-bit words, no more, then error size becomes to critical and destructs final result. Or you can decrease size of input 32-bit numbers, to 10-12 bits, this helps to reduce errors, yet you can't go bigger than 2^18 array size with critical error. You have to compute error size experimentally to figure out what is best.
Code can be compiled in CLang/MSVC/GCC. Maybe other compilers too. It has no external libraries dependencies at all, maybe except OpenMP library which is usually shipped with compiler. Only computation of Primitive Roots (NTT modulus) requires Boost library but only for MSVC and uses only 128-bit integer from there.
CODE GOES HERE. Only because code size is 65 KB, I can't inline it inside this post, as StackOverflow post size limit is 30 000 symbols. Hence I'm providing my code in below Github Gist link. Also click Try it online! link to run my code on online server of GodBolt.
Try it online!
Github Gist source code
Example console output:
Using SIMD SSE2
Test FindNttMod
FindNttEntry<T>{.k = 57, .c = 29, .p = 4179340454199820289, .g = 3, .root = 68630377364883, .plog2 = 61.86},
FindNttEntry<T>{.k = 54, .c = 177, .p = 3188548536178311169, .g = 7, .root = 3055434446054240334, .plog2 = 61.47},
FindNttEntry<T>{.k = 54, .c = 163, .p = 2936346957045563393, .g = 3, .root = 83050791888939419, .plog2 = 61.35},
FindNttEntry<T>{.k = 55, .c = 69, .p = 2485986994308513793, .g = 5, .root = 1700750308946223057, .plog2 = 61.11},
FindNttEntry<T>{.k = 54, .c = 127, .p = 2287828610704211969, .g = 3, .root = 878887558841786394, .plog2 = 60.99},
FindNttEntry<T>{.k = 55, .c = 57, .p = 2053641430080946177, .g = 7, .root = 640559856471874596, .plog2 = 60.83},
FindNttEntry<T>{.k = 56, .c = 27, .p = 1945555039024054273, .g = 5, .root = 1613915479851665306, .plog2 = 60.75},
FindNttEntry<T>{.k = 53, .c = 161, .p = 1450159080013299713, .g = 3, .root = 359678689516082930, .plog2 = 60.33},
FindNttEntry<T>{.k = 53, .c = 143, .p = 1288029493427961857, .g = 3, .root = 531113314168589713, .plog2 = 60.16},
FindNttEntry<T>{.k = 55, .c = 35, .p = 1261007895663738881, .g = 6, .root = 397650301651152680, .plog2 = 60.13},
0.025 sec
Test CompareNttMultWithReg
Time NTT 0.035 FFT 0.081 Reg 11.614 Boost_NTT 333.588x (FFT 142.644)
Swap 0.776 (Slow 0.000) ToMontg 0.079 Main 3.056 (0.399, 2.656) Invert 0.000 All 3.911
MidMul 0.110
Swap 0.510 (Slow 0.000) ToMontg 0.000 Main 2.535 (0.336, 2.198) Invert 0.094 All 3.139
AssignComplex 0.495
Swap 1.373 FromComplex 0.309 Main 4.875 (0.382, 4.493) Invert 0.000 ToComplex 0.224 All 6.781
MidMul 0.147
Swap 1.106 FromComplex 0.296 Main 4.209 (0.277, 3.931) Invert 0.166 ToComplex 0.199 All 5.975
Round 0.143
Time NTT 7.457 FFT 14.097 Boost_NTT 1.891x
Run Time: 33.719 sec

If you're looking to write a new better exponent you might have to write it in assembly. This is the code from golang.
https://code.google.com/p/go/source/browse/src/pkg/math/exp_amd64.s

Related

C++ performance optimization for linear combination of large matrices?

I have a large tensor of floating point data with the dimensions 35k(rows) x 45(cols) x 150(slices) which I have stored in an armadillo cube container. I need to linearly combine all the 150 slices together in under 35 ms (a must for my application). The linear combination floating point weights are also stored in an armadillo container. My fastest implementation so far takes 70 ms, averaged over a window of 30 frames, and I don't seem to be able to beat that. Please note I'm allowed CPU parallel computations but not GPU.
I have tried multiple different ways of performing this linear combination but the following code seems to be the fastest I can get (70 ms) as I believe I'm maximizing the cache hit chances by fetching the largest possible contiguous memory chunk at each iteration.
Please note that Armadillo stores data in column major format. So in a tensor, it first stores the columns of the first channel, then the columns of the second channel, then third and so forth.
typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;
int rows = 35000;
int cols = 45;
int slices = 150;
arma::fcube tensor(rows, cols, slices, arma::fill::randu);
arma::fvec w(slices, arma::fill::randu);
double overallTime = 0;
int window = 30;
for (int n = 0; n < window; n++) {
Timer::time_point start = Timer::now();
arma::fmat result(rows, cols, arma::fill::zeros);
for (int i = 0; i < slices; i++)
result += tensor.slice(i) * w(i);
Timer::time_point end = Timer::now();
Duration span = end - start;
double t = span.count();
overallTime += t;
cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << endl;
}
cout << endl << "average time = " << overallTime * 1000.0 / window << " ms" << endl;
I need to optimize this code by at least 2x and I would very much appreciate any suggestions.

First at all I need to admit, I'm not familiar with the arma framework or the memory layout; the least if the syntax result += slice(i) * weight evaluates lazily.
Two primary problem and its solution anyway lies in the memory layout and the memory-to-arithmetic computation ratio.
To say a+=b*c is problematic because it needs to read the b and a, write a and uses up to two arithmetic operations (two, if the architecture does not combine multiplication and accumulation).
If the memory layout is of form float tensor[rows][columns][channels], the problem is converted to making rows * columns dot products of length channels and should be expressed as such.
If it's float tensor[c][h][w], it's better to unroll the loop to result+= slice(i) + slice(i+1)+.... Reading four slices at a time reduces the memory transfers by 50%.
It might even be better to process the results in chunks of 4*N results (reading from all the 150 channels/slices) where N<16, so that the accumulators can be allocated explicitly or implicitly by the compiler to SIMD registers.
There's a possibility of a minor improvement by padding the slice count to multiples of 4 or 8, by compiling with -ffast-math to enable fused multiply accumulate (if available) and with multithreading.
The constraints indicate the need to perform 13.5GFlops, which is a reasonable number in terms of arithmetic (for many modern architectures) but also it means at least 54 Gb/s memory bandwidth, which could be relaxed with fp16 or 16-bit fixed point arithmetic.
EDIT
Knowing the memory order to be float tensor[150][45][35000] or float tensor[kSlices][kRows * kCols == kCols * kRows] suggests to me to try first unrolling the outer loop by 4 (or maybe even 5, as 150 is not divisible by 4 requiring special case for the excess) streams.
void blend(int kCols, int kRows, float const *tensor, float *result, float const *w) {
// ensure that the cols*rows is a multiple of 4 (pad if necessary)
// - allows the auto vectorizer to skip handling the 'excess' code where the data
// length mod simd width != 0
// one could try even SIMD width of 16*4, as clang 14
// can further unroll the inner loop to 4 ymm registers
auto const stride = (kCols * kRows + 3) & ~3;
// try also s+=6, s+=3, or s+=4, which would require a dedicated inner loop (for s+=2)
for (int s = 0; s < 150; s+=5) {
auto src0 = tensor + s * stride;
auto src1 = src0 + stride;
auto src2 = src1 + stride;
auto src3 = src2 + stride;
auto src4 = src3 + stride;
auto dst = result;
for (int x = 0; x < stride; x++) {
// clang should be able to optimize caching the weights
// to registers outside the innerloop
auto add = src0[x] * w[s] +
src1[x] * w[s+1] +
src2[x] * w[s+2] +
src3[x] * w[s+3] +
src4[x] * w[s+4];
// clang should be able to optimize this comparison
// out of the loop, generating two inner kernels
if (s == 0) {
dst[x] = add;
} else {
dst[x] += add;
}
}
}
}
EDIT 2
Another starting point (before adding multithreading) would be consider changing the layout to
float tensor[kCols][kRows][kSlices + kPadding]; // padding is optional
The downside now is that kSlices = 150 can't anymore fit all the weights in registers (and secondly kSlices is not a multiple of 4 or 8). Furthermore the final reduction needs to be horizontal.
The upside is that reduction no longer needs to go through memory, which is a big thing with the added multithreading.
void blendHWC(float const *tensor, float const *w, float *dst, int n, int c) {
// each thread will read from 4 positions in order
// to share the weights -- finding the best distance
// might need some iterations
auto src0 = tensor;
auto src1 = src0 + c;
auto src2 = src1 + c;
auto src3 = src2 + c;
for (int i = 0; i < n/4; i++) {
vec8 acc0(0.0f), acc1(0.0f), acc2(0.0f), acc3(0.0f);
// #pragma unroll?
for (auto j = 0; j < c / 8; c++) {
vec8 w(w + j);
acc0 += w * vec8(src0 + j);
acc1 += w * vec8(src1 + j);
acc2 += w * vec8(src2 + j);
acc3 += w * vec8(src3 + j);
}
vec4 sum = horizontal_reduct(acc0,acc1,acc2,acc3);
sum.store(dst); dst+=4;
}
}
These vec4 and vec8 are some custom SIMD classes, which map to SIMD instructions either through intrinsics, or by virtue of the compiler being able to do compile using vec4 = float __attribute__ __attribute__((vector_size(16))); to efficient SIMD code.

As #hbrerkere suggested in the comment section, by using the -O3 flag and making the following changes, the performance improved by almost 65%. The code now runs at 45 ms as opposed to the initial 70 ms.
int lastStep = (slices / 4 - 1) * 4;
int i = 0;
while (i <= lastStep) {
result += tensor.slice(i) * w_id(i) + tensor.slice(i + 1) * w_id(i + 1) + tensor.slice(i + 2) * w_id(i + 2) + tensor.slice(i + 3) * w_id(i + 3);
i += 4;
}
while (i < slices) {
result += tensor.slice(i) * w_id(i);
i++;
}

Without having the actual code, I'm guessing that
+= tensor.slice(i) * w_id(i)
creates a temporary object and then adds it to the lhs. Yes, overloaded operators look nice, but I would write a function
addto( lhs, slice1, w1, slice2, w2, ....unroll to 4... )
which translates to pure loops over the elements:
for (i=....)
for (j=...)
lhs[i][j] += slice1[i][j]*w1[j] + slice2[i][j] &c
It would surprise me if that doesn't buy you an extra factor.

Using series to approximate log(2)

double k = 0;
int l = 1;
double digits = pow(0.1, 5);
do
{
k += (pow(-1, l - 1)/l);
l++;
} while((log(2)-k)>=digits);
I'm trying to write a little program based on an example I seen using a series of Σ_(l=1) (pow(-1, l - 1)/l) to estimate log(2);
It's supposed to be a guess refinement thing where time it gets closer and closer to the right value until so many digits match.
The above is what I tried but but it's not coming out right. After messing with it for quite a while I can't figure out where I'm messing up.

I assume that you are trying to extimate the natural logarithm of 2 by its Taylor series expansion:
∞ (-1)n + 1
ln(x) = ∑ ――――――――(x - 1)n
n=1 n
One of the problems of your code is the condition choosen to stop the iterations at a specified precision:
do { ... } while((log(2)-k)>=digits);
Besides using log(2) directly (aren't you supposed to find it out instead of using a library function?), at the second iteration (and for every other even iteration) log(2) - k gets negative (-0.3068...) ending the loop.
A possible (but not optimal) fix could be to use std::abs(log(2) - k) instead, or to end the loop when the absolute value of 1.0 / l (which is the difference between two consecutive iterations) is small enough.
Also, using pow(-1, l - 1) to calculate the sequence 1, -1, 1, -1, ... Is really a waste, especially in a series with such a slow convergence rate.
A more efficient series (see here) is:
∞ 1
ln(x) = 2 ∑ ――――――― ((x - 1) / (x + 1))2n + 1
n=0 2n + 1
You can extimate it without using pow:
double x = 2.0; // I want to calculate ln(2)
int n = 1;
double eps = 0.00001,
kpow = (x - 1.0) / (x + 1.0),
kpow2 = kpow * kpow,
dk,
k = 2 * kpow;
do {
n += 2;
kpow *= kpow2;
dk = 2 * kpow / n;
k += dk;
} while ( std::abs(dk) >= eps );

Optimizing Fixed-Point Sqrt

I made what I think is a good fixed-point square root algorithm:
template<int64_t M, int64_t P>
typename enable_if<M + P == 32, FixedPoint<M, P>>::type sqrt(FixedPoint<M, P> f)
{
if (f.num == 0)
return 0;
//Reduce it to the 1/2 to 2 range (based around FixedPoint<2, 30> to avoid left/right shift branching)
int64_t num{ f.num }, faux_half{ 1 << 29 };
ptrdiff_t mag{ 0 };
while (num < (faux_half)) {
num <<= 2;
++mag;
}
int64_t res = (M % 2 == 0 ? SQRT_32_EVEN_LOOKUP : SQRT_32_ODD_LOOKUP)[(num >> (30 - 4)) - (1LL << 3)];
res >>= M / 2 + mag - 1; //Finish making an excellent guess
for (int i = 0; i < 2; ++i)
// \ | /
// \ | /
// _| V L
res = (res + (int64_t(f.num) << P) / res) >> 1; //Use Newton's method to improve greatly on guess
// 7 A r
// / | \
// / | \
// The Infamous Time Eater
return FixedPoint<M, P>(res, true);
}
However, after profiling (in release mode) I found out that the division here takes up 83% of the time this algorithm spends. I can speed it up 6x by replacing the division with multiplication, but that's just wrong. I found out that integer division is much slower than multiplication, unfortunately. Is there any way to optimize this?
In case this table is necessary.
const array<int32_t, 24> SQRT_32_EVEN_LOOKUP = {
0x2d413ccd, //magic numbers calculated by taking 0.5 + 0.5 * i / 8 from i = 0 to 23, multiplying by 2^30, and converting to hex
0x30000000,
0x3298b076,
0x3510e528,
0x376cf5d1,
0x39b05689,
0x3bddd423,
0x3df7bd63,
0x40000000,
0x41f83d9b,
0x43e1db33,
0x45be0cd2,
0x478dde6e,
0x49523ae4,
0x4b0bf165,
0x4cbbb9d6,
0x4e623850,
0x50000000,
0x5195957c,
0x532370b9,
0x54a9fea7,
0x5629a293,
0x57a2b749,
0x59159016
};
SQRT_32_ODD_LOOKUP is just SQRT_32_EVEN_LOOKUP divided by sqrt(2).

Reinventing the wheel, really, and not in a good way. The correct solution is to calculate 1/sqrt(x) using NR, and then multiply once to get x/sqrt(x) - just check for x==0 up front.
The reason why this is so much better is that the NR step for y=1/sqrt(x) is just y = (3-x*y*y)*y/2. That's all straightforward multiplication.

performance of log10 function returning an int

Today I needed a cheap log10 function, of which I only used the int part. Assuming the result is floored, so the log10 of 999 would be 2. Would it be beneficial writing a function myself? And if so, which way would be the best to go. Assuming the code would not be optimized.
The alternatives to log10 I've though of;
use a for loop dividing or multiplying by 10;
use a string parser(probably extremely expensive);
using an integer log2() function multiplying by a constant.
Thank you on beforehand:)

The operation can be done in (fast) constant time on any architecture that has a count-leading-zeros or similar instruction (which is most architectures). Here's a C snippet I have sitting around to compute the number of digits in base ten, which is essentially the same task (assumes a gcc-like compiler and 32-bit int):
unsigned int baseTwoDigits(unsigned int x) {
return x ? 32 - __builtin_clz(x) : 0;
}
static unsigned int baseTenDigits(unsigned int x) {
static const unsigned char guess[33] = {
0, 0, 0, 0, 1, 1, 1, 2, 2, 2,
3, 3, 3, 3, 4, 4, 4, 5, 5, 5,
6, 6, 6, 6, 7, 7, 7, 8, 8, 8,
9, 9, 9
};
static const unsigned int tenToThe[] = {
1, 10, 100, 1000, 10000, 100000,
1000000, 10000000, 100000000, 1000000000,
};
unsigned int digits = guess[baseTwoDigits(x)];
return digits + (x >= tenToThe[digits]);
}
GCC and clang compile this down to ~10 instructions on x86. With care, one can make it faster still in assembly.
The key insight is to use the (extremely cheap) base-two logarithm to get a fast estimate of the base-ten logarithm; at that point we only need to compare against a single power of ten to decide if we need to adjust the guess. This is much more efficient than searching through multiple powers of ten to find the right one.
If the inputs are overwhelmingly biased to one- and two-digit numbers, a linear scan is sometimes faster; for all other input distributions, this implementation tends to win quite handily.

One way to do it would be loop with subtracting powers of 10. This powers could be computed and stored in table. Here example in python:
table = [10**i for i in range(1, 10)]
# [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000]
def fast_log10(n):
for i, k in enumerate(table):
if n - k < 0:
return i
Usage example:
>>> fast_log10(1)
0
>>> fast_log10(10)
1
>>> fast_log10(100)
2
>>> fast_log10(999)
2
fast_log10(1000)
3
Also you may use binary search with this table. Then algorithm complexity would be only O(lg(n)), where n is number of digits.
Here example with binary search in C:
long int table[] = {10, 100, 1000, 10000, 1000000};
#define TABLE_LENGHT sizeof(table) / sizeof(long int)
int bisect_log10(long int n, int s, int e) {
int a = (e - s) / 2 + s;
if(s >= e)
return s;
if((table[a] - n) <= 0)
return bisect_log10(n, a + 1, e);
else
return bisect_log10(n, s, a);
}
int fast_log10(long int n){
return bisect_log10(n, 0, TABLE_LENGHT);
}
Note for small numbers this method would slower then upper method.
Full code here.

Well, there's the old standby - the "poor man's log function".
(If you want to handle more than 63 integer digits, change the first "if" to a "while".)
n = 1;
if (v >= 1e32){n += 32; v /= 1e32;}
if (v >= 1e16){n += 16; v /= 1e16;}
if (v >= 1e8){n += 8; v /= 1e8;}
if (v >= 1e4){n += 4; v /= 1e4;}
if (v >= 1e2){n += 2; v /= 1e2;}
if (v >= 1e1){n += 1; v /= 1e1;}
so if you feed in 123456.7, here's how it goes:
n = 1;
if (v >= 1e32) no
if (v >= 1e16) no
if (v >= 1e8) no
if (v >= 1e4) yes, so n = 5, v = 12.34567
if (v >= 1e2) no
if (v >= 1e1) yes, so n = 6, v = 1.234567
so result is n = 6
Here's a variation that uses multiplication, rather than division:
int n = 1;
double d = 1, temp;
temp = d * 1e32; if (v >= temp){n += 32; d = temp;}
temp = d * 1e16; if (v >= temp){n += 16; d = temp;}
temp = d * 1e8; if (v >= temp){n += 8; d = temp;}
temp = d * 1e4; if (v >= temp){n += 4; d = temp;}
temp = d * 1e2; if (v >= temp){n += 2; d = temp;}
temp = d * 1e1; if (v >= temp){n += 1; d = temp;}
and an execution looks like this
v = 123456.7
n = 1
d = 1
temp = 1e32, if (v >= 1e32) no
temp = 1e16, if (v >= 1e16) no
temp = 1e8, if (v >= 1e8) no
temp = 1e4, if (v >= 1e4) yes, so n = 5, d = 1e4;
temp = 1e6, if (v >= 1e6) no
temp = 1e5, if (v >= 1e5) yes, so n = 6, d = 1e5;

If you want to have a faster log function you need to approximate their result. E.g. the exp function can be approximated using a 'short' taylor approximation. You can find example approximations for exp, log, root and power here
edit:
You can find a short performance comparsion here

Because an unsigned < or >= test is done simply by subtracting and checking the carry flag, it is possible to put both arrays (guess and negated tenToThe) in a single 64-bit value, combine both array lookups into one, and use the carry from 32-bit addition to adjust the guess. The high 32 bits of guess[n] provide the value of log10(2^n*2-1), while the low 32 bits contain -10^log10(2^n*2-1).
static unsigned int baseTwoDigits(unsigned int x) {
return x ? 32 - __builtin_clz(x) : 0;
}
unsigned int baseTenDigits(unsigned int x) {
static uint64_t guess[33] = {
/* 1 */ 0, 0, 0,
/* 8 */ (1ull<<32)-10, (1ull<<32)-10, (1ull<<32)-10,
/* 64 */ (2ull<<32)-100, (2ull<<32)-100, (2ull<<32)-100,
/* 512 */ (3ull<<32)-1000, (3ull<<32)-1000, (3ull<<32)-1000,
(3ull<<32)-1000,
/* 8192 */ (4ull<<32)-10000, (4ull<<32)-10000, (4ull<<32)-10000,
/* 65536 */ (5ull<<32)-100000, (5ull<<32)-100000, (5ull<<32)-100000,
/* 524288 */ (6ull<<32)-1000000, (6ull<<32)-1000000, (6ull<<32)-1000000,
(6ull<<32)-1000000,
/* 8388608 */ (7ull<<32)-10000000, (7ull<<32)-10000000,
(7ull<<32)-10000000,
/* 67108864 */ (8ull<<32)-100000000, (8ull<<32)-100000000,
(8ull<<32)-100000000,
/* 536870912 */ (9ull<<32)-1000000000, (9ull<<32)-1000000000,
(9ull<<32)-1000000000,
};
uint64_t adjust = guess[baseTwoDigits(x)];
return (adjust + x) >> 32;
}

Without any specifications, I will just give a general answer:
The log function will be pretty efficient in most languages as it is such a basic function.
The fact that you are only interested in integers could give you some leverage, but probably this is not enough to easily beat the builtin standard solutions.
One of the few things that I can think of to be faster than a builtin function is a table lookup, so if you are only interested in the numbers upto 10000 for instance, you could simply create a table that you could use to lookup any of these values when you need them.
Obviously this solution will not scale well, but it may be just what you need.
Sidenote: If you are importing the data for example, it may actually be faster to look at the string diecty length (rather than first converting the string to a number and than looking at the value of the string). Of course this will require the input to be stored in just the right format, otherwise it won't gain you anything.

How to compute sum of evenly spaced binomial coefficients

How to find sum of evenly spaced Binomial coefficients modulo M?
ie. (nCa + nCa+r + nCa+2r + nCa+3r + ... + nCa+kr) % M = ?
given: 0 <= a < r, a + kr <= n < a + (k+1)r, n < 105, r < 100
My first attempt was:
int res = 0;
int mod=1000000009;
for (int k = 0; a + r*k <= n; k++) {
res = (res + mod_nCr(n, a+r*k, mod)) % mod;
}
but this is not efficient. So after reading here
and this paper I found out the above sum is equivalent to:
summation[ω-ja * (1 + ωj)n / r], for 0 <= j < r; and ω = ei2π/r is a primitive rth root of unity.
What can be the code to find this sum in Order(r)?
Edit:
n can go upto 105 and r can go upto 100.
Original problem source: https://www.codechef.com/APRIL14/problems/ANUCBC
Editorial for the problem from the contest: https://discuss.codechef.com/t/anucbc-editorial/5113
After revisiting this post 6 years later, I'm unable to recall how I transformed the original problem statement into mine version, nonetheless, I shared the link to the original solution incase anyone wants to have a look at the correct solution approach.

Binomial coefficients are coefficients of the polynomial (1+x)^n. The sum of the coefficients of x^a, x^(a+r), etc. is the coefficient of x^a in (1+x)^n in the ring of polynomials mod x^r-1. Polynomials mod x^r-1 can be specified by an array of coefficients of length r. You can compute (1+x)^n mod (x^r-1, M) by repeated squaring, reducing mod x^r-1 and mod M at each step. This takes about log_2(n)r^2 steps and O(r) space with naive multiplication. It is faster if you use the Fast Fourier Transform to multiply or exponentiate the polynomials.
For example, suppose n=20 and r=5.
(1+x) = {1,1,0,0,0}
(1+x)^2 = {1,2,1,0,0}
(1+x)^4 = {1,4,6,4,1}
(1+x)^8 = {1,8,28,56,70,56,28,8,1}
{1+56,8+28,28+8,56+1,70}
{57,36,36,57,70}
(1+x)^16 = {3249,4104,5400,9090,13380,9144,8289,7980,4900}
{3249+9144,4104+8289,5400+7980,9090+4900,13380}
{12393,12393,13380,13990,13380}
(1+x)^20 = (1+x)^16 (1+x)^4
= {12393,12393,13380,13990,13380}*{1,4,6,4,1}
{12393,61965,137310,191440,211585,203373,149620,67510,13380}
{215766,211585,204820,204820,211585}
This tells you the sums for the 5 possible values of a. For example, for a=1, 211585 = 20c1+20c6+20c11+20c16 = 20+38760+167960+4845.

Something like that, but you have to check a, n and r because I just put anything without regarding about the condition:
#include <complex>
#include <cmath>
#include <iostream>
using namespace std;
int main( void )
{
const int r = 10;
const int a = 2;
const int n = 4;
complex<double> i(0.,1.), res(0., 0.), w;
for( int j(0); j<r; ++j )
{
w = exp( i * 2. * M_PI / (double)r );
res += pow( w, -j * a ) * pow( 1. + pow( w, j ), n ) / (double)r;
}
return 0;
}

the mod operation is expensive, try avoiding it as much as possible
uint64_t res = 0;
int mod=1000000009;
for (int k = 0; a + r*k <= n; k++) {
res += mod_nCr(n, a+r*k, mod);
if(res > mod)
res %= mod;
}
I did not test this code

I don't know if you reached something or not in this question, but the key to implementing this formula is to actually figure out that w^i are independent and therefore can form a ring. In simpler terms you should think of implement
(1+x)^n%(x^r-1) or finding out (1+x)^n in the ring Z[x]/(x^r-1)
If confused I will give you an easy implementation right now.
make a vector of size r . O(r) space + O(r) time
initialization this vector with zeros every where O(r) space +O(r) time
make the first two elements of that vector 1 O(1)
calculate (x+1)^n using the fast exponentiation method. each multiplication takes O(r^2) and there are log n multiplications therefore O(r^2 log(n) )
return first element of the vector.O(1)
Complexity
O(r^2 log(n) ) time and O(r) space.
this r^2 can be reduced to r log(r) using fourier transform.
How is the multiplication done, this is regular polynomial multiplication with mod in the power
vector p1(r,0);
vector p2(r,0);
p1[0]=p1[1]=1;
p2[0]=p2[1]=1;
now we want to do the multiplication
vector res(r,0);
for(int i=0;i<r;i++)
{
for(int j=0;j<r;j++)
{
res[(i+j)%r]+=(p1[i]*p2[j]);
}
}
return res[0];
I have implemented this part before, if you are still cofused about something let me know. I would prefer that you implement the code yourself, but if you need the code let me know.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Fast bignum square computation - c++

If I understand your algorithm correctly, it seems O(n^2) where n is the number of digits. Have you looked at Karatsuba Algorithm? It speeds up multiplication using the divide and conquer approach. It may be worth taking a look at.

If you're looking to write a new better exponent you might have to write it in assembly. This is the code from golang. https://code.google.com/p/go/source/browse/src/pkg/math/exp_amd64.s

Related

C++ performance optimization for linear combination of large matrices?

Using series to approximate log(2)

Optimizing Fixed-Point Sqrt

performance of log10 function returning an int

How to compute sum of evenly spaced binomial coefficients

Categories

Resources