I use two bitsets to store two polynomials. I want to divide one of them by the other and get the remainder of the division. For example, this is how I would do it on paper:
w1= 110011010000000
w2 = 1111001
101000100
110011010000000 : 1111001
1111001
--1111110
1111001
----1110000
1111001
---100100 = remainder
Very few CPUs have builtin instructions for GF(2) division like this, so you'll need to implement it yourself with shifts and xors. Basically, you implement it exactly like you did it on paper -- shift the divisor up until its top bit matches that of dividend, then xor and shift back down, recording each position where you need an xor as a bit of the quotient. If all the polynomials in question fit in a single word, you can just use unsigned integer types for it. Otherwise, you'll need some multiprecision bitset type. The C++ std::bitset can be used for this, despite its problems (no easy way to convert between bitsets of different sizes, no bitscan functions).
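#include <bitset>
#include <cstddef>
#include <stdexcept>

// Index of the highest set bit of a, or -1 if a is zero.
template<std::size_t N> int top_bit_set(const std::bitset<N> &a) {
    int i;
    for (i = static_cast<int>(N) - 1; i >= 0; i--)
        if (a.test(i)) break;
    return i;
}

// Divides dividend by divisor over GF(2); returns the quotient and stores the remainder.
template<std::size_t N>
std::bitset<N> gf2_div(std::bitset<N> dividend, std::bitset<N> divisor, std::bitset<N> &remainder) {
    std::bitset<N> quotient(0);
    int divisor_size = top_bit_set(divisor);
    if (divisor_size < 0) throw std::domain_error("division by zero");
    int bit;
    while ((bit = top_bit_set(dividend)) >= divisor_size) {
        quotient.set(bit - divisor_size);              // record where an xor was needed
        dividend ^= divisor << (bit - divisor_size);   // cancel the current top bit
    }
    remainder = dividend;
    return quotient;
}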
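For the single-word case mentioned above, the same loop can be written on plain unsigned integers. This is only a minimal sketch: the names `top_bit` and `gf2_div_u64` are made up for illustration, and it assumes both polynomials fit in 64 bits.

```cpp
#include <cstdint>
#include <stdexcept>

// Index of the highest set bit of v, or -1 if v == 0.
inline int top_bit(std::uint64_t v) {
    int i = -1;
    while (v) { v >>= 1; ++i; }
    return i;
}

// Hypothetical helper: GF(2) polynomial division on single 64-bit words.
// Returns the quotient and stores the remainder.
inline std::uint64_t gf2_div_u64(std::uint64_t dividend, std::uint64_t divisor,
                                 std::uint64_t &remainder) {
    const int divisor_size = top_bit(divisor);
    if (divisor_size < 0) throw std::domain_error("division by zero polynomial");
    std::uint64_t quotient = 0;
    int bit;
    while ((bit = top_bit(dividend)) >= divisor_size) {
        quotient |= std::uint64_t(1) << (bit - divisor_size);  // xor needed at this position
        dividend ^= divisor << (bit - divisor_size);           // align divisor and cancel top bit
    }
    remainder = dividend;
    return quotient;
}
```

With the example from the question, dividend 0x6680 (110011010000000) and divisor 0x79 (1111001) give quotient 0x144 (101000100) and remainder 0x24 (100100).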
Related
int n_attrs = some_input_from_other_function(); // [2..5000]
vector<int> corr_indexes; // size = n_attrs * n_attrs
vector<char> selected; // size = n_attrs
vector<pair<int,int>> selectedPairs; // size = n_attrs / 2
// vector::reserve everything here
...
// optimize the code below
const int npairs = n_attrs * n_attrs;
selectedPairs.clear();
for (int i = 0; i < npairs; i++) {
const int x = corr_indexes[i] / n_attrs;
const int y = corr_indexes[i] % n_attrs;
if (selected[x] || selected[y]) continue; // fit inside L1 cache
// below lines are called max 2500 times, so they're insignificant
selected[x] = true;
selected[y] = true;
selectedPairs.emplace_back(x, y);
if (selectedPairs.size() == n_attrs / 2) break;
}
I have a function that looks like this. The bottleneck is in
const int x = corr_indexes[i] / n_attrs;
const int y = corr_indexes[i] % n_attrs;
n_attrs is const during the loop, so I wish to find a way to speed up this loop. corr_indexes[i], n_attrs > 0, < max_int32. Edit: please note that n_attrs isn't compile-time const.
How can I optimize this loop? No extra library is allowed.
Also, is there any way to parallelize this loop? (Either CPU or GPU is okay; everything is already in GPU memory before this loop.)
I am restricting my comments to integer division, because to first order the modulo operation in C++ can be viewed and implemented as an integer division plus back-multiply and subtraction, although in some cases there are cheaper ways of computing the modulo directly, e.g. when computing modulo 2^n.
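To make the two cases concrete, here is a small sketch (the function names are mine, chosen for illustration): the general remainder obtained from the quotient by back-multiply and subtract, and the cheaper mask for a power-of-two modulus.

```cpp
#include <cstdint>

// General case: one division, then back-multiply and subtract.
inline std::uint32_t mod_by_division(std::uint32_t a, std::uint32_t d) {
    const std::uint32_t q = a / d;  // d must be nonzero
    return a - q * d;
}

// Special case: modulo 2^n is a bit mask. Assumes n < 32.
inline std::uint32_t mod_pow2(std::uint32_t a, unsigned n) {
    return a & ((std::uint32_t(1) << n) - 1);
}
```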
Integer division is pretty slow on most platforms, being based on either software emulation or an iterative hardware implementation. However, it was widely reported last year that, according to microbenchmarks, Apple's M1 has blazingly fast integer division, presumably by using dedicated circuitry.
Ever since a seminal paper by Torbjörn Granlund and Peter Montgomery almost thirty years ago it has been widely known how to replace integer divisions with constant divisors by using an integer multiply plus possibly a shift and / or other correction steps. This algorithm is often referred to as the magic-multiplier technique. It requires precomputation of some relevant parameters from the integer divisor for use in the multiply-based emulation sequence.
Torbjörn Granlund and Peter L. Montgomery, "Division by invariant integers using multiplication," ACM SIGPLAN Notices, Vol. 29, June 1994, pp. 61-72 (online).
At present, all major toolchains incorporate variants of the Granlund-Montgomery algorithm when dealing with integer divisors that are compile-time constants. The pre-computation occurs at compilation time inside the compiler, which then emits code using the computed parameters. Some toolchains may also use this algorithm for divisions by run-time constant divisors that are used repeatedly. For run-time constant divisors in loops, this could involve emitting a pre-computation block prior to a loop to compute the necessary parameters, and then using those for the division emulation code inside the loop.
If one's toolchain does not optimize divisions with run-time constant divisor one can use the same approach manually as demonstrated by the code below. However, this is unlikely to achieve the same efficiency as a compiler-based solution, because not all machine operations used in the desired emulation sequence can be expressed efficiently at C++ level in a portable manner. This applies in particular to arithmetic right shifts and add-with-carry.
The code below demonstrates the principle of parameter precomputation and integer division emulation via multiplication. It is quite likely that by investing more time into the design than I was willing to expend for this answer more efficient implementations of both parameter precomputation and emulation can be identified.
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#define PORTABLE (1)
uint32_t ilog2 (uint32_t i)
{
uint32_t t = 0;
i = i >> 1;
while (i) {
i = i >> 1;
t++;
}
return (t);
}
/* Based on: Granlund, T.; Montgomery, P.L.: "Division by Invariant Integers
using Multiplication". SIGPLAN Notices, Vol. 29, June 1994, pp. 61-72
*/
void prepare_magic (int32_t divisor, int32_t &multiplier, int32_t &add_mask, int32_t &sign_shift)
{
uint32_t divisoru, d, n, i, j, two_to_31 = uint32_t (1) << 31;
uint64_t m_lower, m_upper, k, msb, two_to_32 = uint64_t (1) << 32;
divisoru = uint32_t (divisor);
d = (divisor < 0) ? (0 - divisoru) : divisoru;
i = ilog2 (d);
j = two_to_31 % d;
msb = two_to_32 << i;
k = msb / (two_to_31 - j);
m_lower = msb / d;
m_upper = (msb + k) / d;
n = ilog2 (uint32_t (m_lower ^ m_upper));
n = (n > i) ? i : n;
m_upper = m_upper >> n;
i = i - n;
multiplier = int32_t (uint32_t (m_upper));
add_mask = (m_upper >> 31) ? (-1) : 0;
sign_shift = int32_t ((divisoru & two_to_31) | i);
}
int32_t arithmetic_right_shift (int32_t a, int32_t s)
{
uint32_t msb = uint32_t (1) << 31;
uint32_t ua = uint32_t (a);
ua = ua >> s;
msb = msb >> s;
return int32_t ((ua ^ msb) - msb);
}
int32_t magic_division (int32_t dividend, int32_t multiplier, int32_t add_mask, int32_t sign_shift)
{
int64_t prod = int64_t (dividend) * multiplier;
int32_t quot = (int32_t)(uint64_t (prod) >> 32);
quot = int32_t (uint32_t (quot) + (uint32_t (dividend) & uint32_t (add_mask)));
#if PORTABLE
const int32_t byte_mask = 0xff;
quot = arithmetic_right_shift (quot, sign_shift & byte_mask);
#else // PORTABLE
quot = quot >> sign_shift; // must mask shift count & use arithmetic right shift
#endif // PORTABLE
quot = int32_t (uint32_t (quot) + (uint32_t (dividend) >> 31));
if (sign_shift < 0) quot = -quot;
return quot;
}
int main (void)
{
int32_t multiplier;
int32_t add_mask;
int32_t sign_shift;
int32_t divisor;
for (divisor = -20; divisor <= 20; divisor++) {
/* avoid division by zero; the loop increment then moves on to divisor 1 */
if (divisor == 0) {
continue;
}
printf ("divisor=%d\n", divisor);
prepare_magic (divisor, multiplier, add_mask, sign_shift);
printf ("multiplier=%d add_mask=%d sign_shift=%d\n",
multiplier, add_mask, sign_shift);
printf ("exhaustive test of dividends ... ");
uint32_t dividendu = 0;
do {
int32_t dividend = (int32_t)dividendu;
/* avoid overflow in signed integer division */
if ((divisor == (-1)) && (dividend == ((-2147483647)-1))) {
dividendu++;
continue;
}
int32_t res = magic_division (dividend, multiplier, add_mask, sign_shift);
int32_t ref = dividend / divisor;
if (res != ref) {
printf ("\nERR dividend=%d (%08x) divisor=%d res=%d ref=%d\n",
dividend, (uint32_t)dividend, divisor, res, ref);
return EXIT_FAILURE;
}
dividendu++;
} while (dividendu);
printf ("PASSED\n");
}
return EXIT_SUCCESS;
}
How can I optimize this loop?
This is a perfect use case for libdivide. This library is designed to speed up division by a run-time constant using the same strategy compilers use at compile time. The library is header-only, so it does not create any run-time dependency. It also supports vectorized division (i.e. using SIMD instructions), which is definitely worth using here to drastically speed up the computation; compilers cannot do this without significantly changing the loop (and in the end the result would not be as efficient because of the run-time-defined divisor). Note that the licence of libdivide is very permissive (zlib), so you can easily include it in your project without strong constraints (you basically just need to mark it as modified if you change it).
If header-only libraries are not OK, then you need to reinvent the wheel. The idea is to transform a division by a constant into a sequence of shifts and multiplications. The very good answer by @njuffa shows how to do that. You can also read the code of libdivide, which is highly optimized.
For small positive divisors and small positive dividends, there is no need for a long sequence of operations. You can cheat with a basic sequence:
uint64_t dividend = corr_indexes[i]; // Must not be too big
uint64_t divider = n_attrs;
uint64_t magic_factor = 4294967296 / n_attrs + 1; // Must be precomputed once
uint32_t result = (dividend * magic_factor) >> 32;
This method should be safe for uint16_t dividends/divisors, but not for much bigger values. In practice it fails for dividend values above ~800_000. Bigger dividends require a more complex sequence, which is also generally slower.
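One possible way to handle bigger dividends while staying with a single multiply is to add a correction step. The sketch below is my own variation, not the snippet above and not what libdivide does: it uses floor(2^32 / d) without the +1, so the estimated quotient is either exact or one too small for any 32-bit dividend, and one compare fixes it. `FastDiv`, `make_fast_div` and `fast_divmod` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical helper type: data precomputed once per run-time divisor d.
struct FastDiv {
    std::uint64_t d;      // divisor, 1 <= d < 2^32
    std::uint64_t recip;  // floor(2^32 / d)
};

inline FastDiv make_fast_div(std::uint32_t d) {
    return FastDiv{d, (std::uint64_t(1) << 32) / d};  // the only real division
}

// Computes n / d and n % d for any 32-bit n.
inline void fast_divmod(std::uint32_t n, const FastDiv &f,
                        std::uint32_t &quot, std::uint32_t &rem) {
    std::uint64_t q = (n * f.recip) >> 32;  // true quotient or one less
    std::uint64_t r = n - q * f.d;          // therefore 0 <= r < 2*d
    if (r >= f.d) { ++q; r -= f.d; }        // single correction step
    quot = static_cast<std::uint32_t>(q);
    rem = static_cast<std::uint32_t>(r);
}
```

In the loop from the question, `make_fast_div(n_attrs)` would be called once before the loop and `fast_divmod` would replace the `/` and `%` pair. Whether this beats the hardware divider depends on the CPU, so it is worth measuring.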
Is there any way to parallelize this loop?
Only the division/modulus can be safely parallelized. There is a loop-carried dependency in the rest of the loop that prevents any parallelization (unless additional assumptions are made). Thus, the loop can be split into two parts: one that computes the divisions and stores the uint16_t results in a temporary array, and one that consumes this array serially afterwards. The array must not be too big; otherwise the computation becomes memory-bound and the resulting parallel code can be slower than the current one. Thus, you need to operate on small chunks that fit in the L3 cache (or a faster one). If chunks are too small, thread synchronization can also become an issue. The best solution is certainly to use a rolling window of chunks. All of this is certainly a bit tedious/tricky to implement.
Note that SIMD instructions can be used for the division part (easy with libdivide). You also need to split the loop and use chunks, but the chunks do not need to be big since there is no synchronization overhead. Something like 64 integers should be enough.
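The chunked split described above could look roughly like this. It is a single-threaded sketch under my own assumptions: the function name `select_pairs_chunked` is hypothetical, the chunk size of 64 follows the suggestion above, plain `/` and `%` stand in for libdivide or a magic multiplier, and n_attrs is assumed to fit in uint16_t.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::pair<int, int>> select_pairs_chunked(const std::vector<int> &corr_indexes,
                                                      int n_attrs) {
    constexpr std::size_t kChunk = 64;
    std::vector<char> selected(n_attrs, 0);
    std::vector<std::pair<int, int>> out;
    const std::size_t want = static_cast<std::size_t>(n_attrs) / 2;
    out.reserve(want);
    std::uint16_t xs[kChunk], ys[kChunk];  // assumes n_attrs <= 65535
    for (std::size_t base = 0; base < corr_indexes.size() && out.size() < want; base += kChunk) {
        const std::size_t len = std::min(kChunk, corr_indexes.size() - base);
        // Pass 1: independent iterations, a candidate for SIMD.
        for (std::size_t k = 0; k < len; ++k) {
            xs[k] = static_cast<std::uint16_t>(corr_indexes[base + k] / n_attrs);
            ys[k] = static_cast<std::uint16_t>(corr_indexes[base + k] % n_attrs);
        }
        // Pass 2: serial because of the dependency through 'selected'.
        for (std::size_t k = 0; k < len; ++k) {
            if (selected[xs[k]] || selected[ys[k]]) continue;
            selected[xs[k]] = 1;
            selected[ys[k]] = 1;
            out.emplace_back(xs[k], ys[k]);
            if (out.size() == want) break;
        }
    }
    return out;
}
```

The first pass may compute a few divisions that the original loop would have skipped after its early break; with chunks of 64 that waste is bounded by one chunk.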
Note that recent processors can compute divisions like this efficiently, especially for 32-bit integers (64-bit ones tend to be significantly more expensive). This is especially the case for the Alder Lake, Zen3 and M1 processors (P-cores). Note that both the modulus and the division are computed in one instruction on x86/x86-64 processors. Also note that while division has a fairly high latency, many processors can pipeline multiple divisions so as to get a reasonable throughput. For example, a 32-bit div instruction has a latency of 23~28 cycles on Skylake but a reciprocal throughput of 4~6 cycles. This is apparently not the case on Zen1/Zen2.
I would optimize the part after // optimize the code below by:
taking n_attrs
generating a function string like this:
void dynamicFunction(MyType & selectedPairs, Foo & selected)
{
const int npairs = ## * ##;
selectedPairs.clear();
for (int i = 0; i < npairs; i++) {
const int x = corr_indexes[i] / ##;
const int y = corr_indexes[i] % ##;
if (selected[x] || selected[y]) continue; // fit inside L1 cache
// below lines are called max 2500 times, so they're insignificant
selected[x] = true;
selected[y] = true;
selectedPairs.emplace_back(x, y);
if (selectedPairs.size() == ## / 2)
break;
}
}
replacing all ## with value of n_attrs
compiling it, generating a DLL
linking and calling the function
So that the n_attrs is a compile-time constant value for the DLL and the compiler can automatically do most of its optimization on the value like:
doing n&(x-1) instead of n%x when x is power-of-2 value
shifting and multiplying instead of dividing
maybe other optimizations too, like unrolling the loop with precalculated indices for x and y (since x is known)
Some integer math operations in tight-loops are easier to SIMDify/vectorize by compiler when more of the parts are known in compile-time.
If your CPU is AMD, you can even try magic floating-point operations in place of unknown/unknown division to get vectorization.
By caching all (or big percentage of) values of n_attrs, you can get rid of latencies of:
string generation
compiling
file(DLL) reading (assuming some object-oriented wrapping of DLLs)
If the part to be optimized runs on the GPU, there is a high chance that the CUDA/OpenCL implementation already performs integer division by means of floating-point operations (to keep the SIMD path occupied instead of serializing on integer division), or is directly capable of SIMD integer operations, so you may be able to use the code as it is on the GPU. The exception is std::vector, which is not supported by all C++ CUDA compilers (and not in OpenCL kernels). These host-environment-related parts could be computed after the kernel has executed (with the kernel either excluding emplace_back or replacing it with a struct that works on the GPU).
So here is the actual best solution in my case.
Instead of representing index = row * n_cols + col, use index = (row << 16) | col for 32-bit indexes, or index = (row << 32) | col for 64-bit indexes. Then row = index >> 16 and col = index & 0xFFFF (or row = index >> 32 and col = index & 0xFFFFFFFF in the 64-bit case). Or even better, just uint16_t* pairs = reinterpret_cast<uint16_t*>(index_array);, then pairs[i], pairs[i+1] for each i % 2 == 0 is a pair.
This is assuming the number of rows/columns is less than 2^16 (or 2^32).
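A minimal sketch of the 32-bit packing, with helper names of my own choosing (`pack_index`, `row_of`, `col_of`); it assumes rows and columns both fit in 16 bits, as stated above.

```cpp
#include <cstdint>

// Packs (row, col) into one 32-bit index: row in the high half, col in the low half.
inline std::uint32_t pack_index(std::uint16_t row, std::uint16_t col) {
    return (std::uint32_t(row) << 16) | col;
}

// Unpacking needs only a shift or a mask, no division.
inline std::uint16_t row_of(std::uint32_t index) {
    return static_cast<std::uint16_t>(index >> 16);
}

inline std::uint16_t col_of(std::uint32_t index) {
    return static_cast<std::uint16_t>(index & 0xFFFFu);
}
```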
I'm still keeping the top answer because it still answers the case where division has to be used.
Given an integer n (1 ≤ n ≤ 10^18), I need to set all the unset bits in this number (i.e. only the bits meaningful for the number, not the padding bits required to fit in an unsigned long long).
My approach: Let the most significant bit be at position p; then n with all bits set will be 2^(p+1) - 1.
All my test cases passed except the one shown below.
Input
288230376151711743
My output
576460752303423487
Expected output
288230376151711743
Code
#include<bits/stdc++.h>
using namespace std;
typedef long long int ll;
int main() {
ll n;
cin >> n;
ll x = log2(n) + 1;
cout << (1ULL << x) - 1;
return 0;
}
The precision of typical double is only about 15 decimal digits.
The value of log2(288230376151711743) is 57.999999999999999994994646087789191106964114967902921472132432244... (calculated using Wolfram Alpha)
Therefore, this value is rounded to 58, which results in a 1 bit being placed one position higher than expected.
As general advice, you should avoid using floating-point values as much as possible when dealing with integer values.
You can solve this with shift and or.
uint64_t n = 36757654654;
int i = 1;
while ((n & (n + 1)) != 0) { // parentheses needed: != binds tighter than &
n |= n >> i;
i *= 2;
}
Any set bit will be duplicated to the next lower bit, then pairs of bits will be duplicated 2 bits lower, then quads, bytes, shorts, int until all meaningful bits are set and (n + 1) becomes the next power of 2.
Just hardcoding the maximum of 6 shifts and ors might be faster than the loop.
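The hardcoded variant might look like the sketch below for a 64-bit value (the function name `fill_below_msb` is made up for illustration).

```cpp
#include <cstdint>

// Sets every bit below the most significant set bit of n; six shift/or steps cover 64 bits.
inline std::uint64_t fill_below_msb(std::uint64_t n) {
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n |= n >> 32;
    return n;
}
```

For the failing input from the question, 288230376151711743 is already 2^58 - 1, so it comes back unchanged.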
If you need to do integer arithmetics and count bits, you'd better count them properly, and avoid introducing floating point uncertainty:
unsigned x=0;
for (;n;x++)
n>>=1;
...
(demo)
The good news is that for n<=1E18, x will never reach the number of bits in an unsigned long long. So the rest of your code is not at risk of being UB and you could stick to your minus 1 approach (although it might in theory not be portable for C++ before C++20) ;-)
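Put together, the counting approach could be wrapped as follows. This is a sketch with a hypothetical name (`all_ones_up_to_msb`); the guard for x == 64 is my addition so that the function is also defined for inputs above the 10^18 limit of the question.

```cpp
#include <cstdint>

// Counts the significant bits of n, then builds a value with that many low bits set.
inline std::uint64_t all_ones_up_to_msb(std::uint64_t n) {
    unsigned x = 0;
    for (std::uint64_t t = n; t; t >>= 1)
        ++x;
    // Shifting a 64-bit value by 64 would be undefined, so handle that case separately.
    return x >= 64 ? ~std::uint64_t(0) : (std::uint64_t(1) << x) - 1;
}
```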
Btw, here are more ways to efficiently find the most significant bit, and the simple log2() is not among them.
I'm trying to implement the restoring division algorithm, but I keep getting incorrect results. The trick is my assignment requires I implement +,-,*,/,% using only bitwise operators, loops, and branches. I've successfully implemented add(a,b), sub(a,b), and mul(a,b), hence their use in my div(a,b,&rem) method. Here's the code,
template<typename T>
T div(T dividend, T divisor, T &remainder){
unsigned q = 1;
unsigned n = mul(sizeof(T), CHAR_BIT);
remainder = dividend;
divisor <<= n;
for(int i=sub(n,1); i>=0; i=sub(i,1)) {
remainder = sub(remainder << 1, divisor);
if(remainder < 0) {
q &= ~(1 << i); // set i-th bit to 0
remainder = add(remainder, divisor);
} else {
q |= 1 << i; // set i-th bit to 1
}
}
return q;
}
I've tested all the edge cases and common examples for add, sub, and mul and I know they work correctly for any integer input.
It appears that for any input I get q = -1 and remainder = 0. I think the problem has something to do with the signing of T, or q and n. I think my implementation is the same, is there a reason why the method is returning -1 and 0?
You need to check the algorithm a bit closer. Your if(q < 0) comparison is using the wrong variable. It should be if (remainder < 0).
You mentioned this could have to do with signing of T, or q and n. You also mentioned in a comment that your implementation of T is a short.
Referring to the https://en.cppreference.com/w/cpp/language/types,
Your type T is always at least 16 bits, but the intermediate values in the div function shown are unsigned or int, which in general could be 32 bits depending on the data model.
So your returned value is being truncated into the potentially smaller size of the T type holder to produce an expected result.
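For comparison, here is a minimal sketch of restoring division that sidesteps the width and sign issues by fixing the types. It is not the asker's code: it assumes unsigned 32-bit operands, keeps the partial remainder in a wider signed type, and uses the ordinary `-` and `+` operators where the assignment would call sub and add. The name `restoring_div` is hypothetical.

```cpp
#include <cstdint>
#include <stdexcept>

// Restoring division for unsigned 32-bit operands.
// Returns the quotient and stores the remainder.
inline std::uint32_t restoring_div(std::uint32_t dividend, std::uint32_t divisor,
                                   std::uint32_t &remainder) {
    if (divisor == 0) throw std::domain_error("division by zero");
    std::int64_t r = 0;   // partial remainder; 64 bits leave room for the shift and the sign
    std::uint32_t q = 0;  // all quotient bits start at 0
    for (int i = 31; i >= 0; --i) {
        r = (r << 1) | ((dividend >> i) & 1u);  // bring down the next dividend bit
        r -= divisor;                           // trial subtraction
        if (r < 0)
            r += divisor;                       // restore; quotient bit stays 0
        else
            q |= std::uint32_t(1) << i;         // subtraction stood; quotient bit is 1
    }
    remainder = static_cast<std::uint32_t>(r);
    return q;
}
```

Starting q at 0 and shifting dividend bits into a separate remainder register avoids shifting the divisor left by the full word width, which is undefined behaviour in C++.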
I am managing some big (128~256-bit) integers with gmp. I have reached a point where I would like to multiply them by a double close to 1 (0.1 < double < 10), with the result still being an approximate integer. A good example of the operation I need to do is the following:
int i = 1000000000000000000 * 1.23456789
I searched in the gmp documentation but I didn't find a function for this, so I ended up writing this code which seems to work well:
void mpz_mult_d(mpz_class & r, const mpz_class & i, double d, int prec=10) {
if (prec > 15) prec=15; //avoids overflows
uint_fast64_t m = (uint_fast64_t) floor(d);
r = i * m;
uint_fast64_t pos=1;
for (uint_fast8_t j=0; j<prec; j++) {
const double posd = (double) pos;
m = ((uint_fast64_t) floor(d * posd * 10.)) -
((uint_fast64_t) floor(d * posd)) * 10;
pos*=10;
r += (i * m) /pos;
}
}
Can you please tell me what you think? Do you have any suggestions to make it more robust or faster?
this is what you wanted:
// BYTE lint[_N] ... lint[0]=MSB, lint[_N-1]=LSB
void mul(BYTE *c,BYTE *a,double b) // c[_N]=a[_N]*b
{
int i; DWORD cc;
double q[_N+1],aa,bb;
for (q[0]=0.0,i=0;i<_N;) // mul,carry down
{
bb=double(a[i])*b; aa=floor(bb); bb-=aa;
q[i]+=aa; i++;
q[i]=bb*256.0;
}
cc=0; if (q[_N]>127.0) cc=1.0; // round
for (i=_N-1;i>=0;i--) // carry up
{
double aa,bb;
cc+=q[i];
c[i]=cc&255;
cc>>=8;
}
}
_N is the number of bits/8 per large int; a large int is an array of _N BYTEs where the first BYTE is the MSB (most significant BYTE) and the last BYTE is the LSB (least significant BYTE).
The function does not handle the sign, but that takes only one if and some xor/inc to add.
The trouble is that double has low precision even for your number 1.23456789! Due to precision loss the result is not exactly what it should be (1234387129122386944 instead of 1234567890000000000). I think my code is much quicker and even more precise than yours because I do not need to mul/mod/div numbers by 10; instead I use bit shifting where possible and work not in base-10 digits but in base-256 digits (8 bits). If you need more precision, then use long arithmetic. You can speed up this code by using larger digits (16, 32, ... bits).
My long arithmetic for precise astro computations usually uses fixed-point 256.256-bit numbers consisting of 2*8 DWORDs + sign, but of course it is much slower and some goniometric functions are really tricky to implement. If you want just the basic functions, though, coding your own long arithmetic is not that hard.
Also, if you often need the numbers in readable form, it is good to compromise between speed/size and consider using BCD-coded numbers instead of binary-coded ones.
I am not familiar enough with either C++ or GMP to suggest source code free of syntax errors, but what you are doing is more complicated than it should be and can introduce unnecessary approximation.
Instead, I suggest you write function mpz_mult_d() like this:
mpz_mult_d(mpz_class & r, const mpz_class & i, double d) {
d = ldexp(d, 52); /* exact, no overflow because 1 <= d <= 10 */
unsigned long long l = d; /* exact because d is an integer */
p = l * i; /* exact, in GMP */
(quotient, remainder) = p / 2^52; /* in GMP */
And now the next step depends on the kind of rounding you wish. If you wish the multiplication of d by i to give a result rounded toward -inf, just return quotient as result of the function. If you wish a result rounded to the nearest integer, you must look at remainder:
assert(0 <= remainder); /* proper Euclidean division */
assert(remainder < 2^52);
if (remainder < 2^51) return quotient;
if (remainder > 2^51) return quotient + 1; /* in GMP */
if (remainder == 2^51) return quotient + (quotient & 1); /* in GMP, round to “even” */
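The rounding cases above can be tried out on plain 64-bit integers. This sketch leaves GMP out to stay self-contained, and the name `shift_round_half_even` is made up. Note that `2^52` in the pseudocode means a power of two; in real C++ the `^` operator is XOR, so the sketch uses shifts instead.

```cpp
#include <cstdint>

// Divides p by 2^k (1 <= k <= 63), rounding to nearest with ties going to even.
inline std::uint64_t shift_round_half_even(std::uint64_t p, unsigned k) {
    const std::uint64_t quotient = p >> k;
    const std::uint64_t remainder = p & ((std::uint64_t(1) << k) - 1);
    const std::uint64_t half = std::uint64_t(1) << (k - 1);
    if (remainder < half) return quotient;      // below the midpoint: round down
    if (remainder > half) return quotient + 1;  // above the midpoint: round up
    return quotient + (quotient & 1);           // exactly halfway: round to even
}
```

With k = 52 and p equal to the product of the integer and the scaled double, this is the last step of the function sketched above, provided the product fits in 64 bits; for 128~256-bit integers the same comparisons would be done on GMP values.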
PS: I found your question by random browsing but if you had tagged it “floating-point”, people more competent than me could have answered it quickly.
Try this strategy:
Convert integer value to big float
Convert double value to big float
Make product
Convert result to integer
mpf_set_z(...)
mpf_set_d(...)
mpf_mul(...)
mpz_set_f(...)
I develop software for an embedded platform and need a single-word division algorithm.
The problem is as follows:
given a large integer represented by a sequence of 32-bit words (can be many),
we need to divide it by another 32-bit word, i.e. compute the quotient (also large integer)
and the remainder (32-bits).
Certainly, if I were developing this algorithm on x86, I could simply take GNU MP,
but this library is way too large for an embedded platform. Furthermore, our processor
does not have a hardware integer divider (integer division is performed in software).
However, the processor has quite a fast FPU, so the trick is to use floating-point arithmetic wherever possible.
Any ideas on how to implement this?
Sounds like a classic optimization. Instead of dividing by D, multiply by 0x100000000/D and then divide by 0x100000000. The latter is just a wordshift, i.e. trivial. Calculating the multiplier is a bit harder, but not a lot.
See also this article for far more detailed background.
Take a look at this one: the algorithm divides an integer a[0..n-1] by a single word 'c'
using floating-point for 64x32->32 division. The limbs of the quotient 'q' are just printed in a loop; you can save them in an array if you like. Note that you don't need GMP to run the algorithm; I use it just to compare the results.
#include <math.h>   // floor
#include <stdio.h>  // printf
#include <gmp.h>
// divides a multi-precision integer a[0..n-1] by a single word c
void div_by_limb(const unsigned *a, unsigned n, unsigned c) {
typedef unsigned long long uint64;
unsigned c_norm = c, sh = 0;
while((c_norm & 0xC0000000) == 0) { // make sure the 2 MSB are set
c_norm <<= 1; sh++;
}
// precompute the inverse of 'c'
double inv1 = 1.0 / (double)c_norm, inv2 = 1.0 / (double)c;
unsigned i, r = 0;
printf("\nquotient: "); // quotient is printed in a loop
for(i = n - 1; (int)i >= 0; i--) { // start from the most significant digit
unsigned u1 = r, u0 = a[i];
union {
struct { unsigned u0, u1; };
uint64 x;
} s = {u0, u1}; // treat [u1, u0] as 64-bit int
// divide a 2-word number [u1, u0] by 'c_norm' using floating-point
unsigned q = floor((double)s.x * inv1), q2;
r = u0 - q * c_norm;
// divide again: this time by 'c'
q2 = floor((double)r * inv2);
q = (q << sh) + q2; // reconstruct the quotient
printf("%x", q);
}
r %= c; // adjust the residue after normalization
printf("; residue: %x\n", r);
}
int main() {
mpz_t z, quo, rem;
mpz_init(z); // this is a dividend
mpz_set_str(z, "9999999999999999999999999999999", 10);
unsigned div = 9; // this is a divisor
div_by_limb((unsigned *)z->_mp_d, mpz_size(z), div);
mpz_init(quo); mpz_init(rem);
mpz_tdiv_qr_ui(quo, rem, z, div); // divide 'z' by 'div'
gmp_printf("compare: Quo: %Zx; Rem %Zx\n", quo, rem);
mpz_clear(quo);
mpz_clear(rem);
mpz_clear(z);
return 0;
}
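When experimenting with floating-point variants like the one above, a plain integer reference helps to check the results without pulling in GMP. The sketch below is the schoolbook limb-by-limb division using the built-in 64-bit `/` and `%`, which on the asker's platform would fall back to software division, so it is meant for validation rather than speed. The name `div_by_word` and the little-endian limb order are my assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Divides the multi-word integer a (least significant limb first) by the word c != 0.
// Stores the quotient limbs in q and returns the remainder.
inline std::uint32_t div_by_word(const std::vector<std::uint32_t> &a, std::uint32_t c,
                                 std::vector<std::uint32_t> &q) {
    q.assign(a.size(), 0);
    std::uint64_t r = 0;  // running remainder, always < c
    for (std::size_t i = a.size(); i-- > 0;) {  // start from the most significant limb
        const std::uint64_t cur = (r << 32) | a[i];  // two-word value [r, a[i]]
        q[i] = static_cast<std::uint32_t>(cur / c);  // fits in 32 bits because r < c
        r = cur % c;
    }
    return static_cast<std::uint32_t>(r);
}
```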
I believe that a look-up table plus Newton-Raphson successive approximation is the canonical choice used by hardware designers (who generally can't afford the gates for a full hardware divider). You get to choose the trade-off between accuracy and execution time.
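As a rough illustration of the Newton-Raphson part, the sketch below approximates a reciprocal in double precision. It swaps the look-up table for a simple linear initial estimate, which is my simplification, and the name `nr_reciprocal` is hypothetical. It assumes a finite d > 0; the number of iterations is the accuracy versus time knob.

```cpp
#include <cmath>

// Approximates 1/d for finite d > 0 with Newton-Raphson, without dividing by d.
inline double nr_reciprocal(double d) {
    int e;
    const double m = std::frexp(d, &e);          // d = m * 2^e with 0.5 <= m < 1
    // Linear first guess for 1/m; the two constants are fixed and can be precomputed.
    double x = 48.0 / 17.0 - (32.0 / 17.0) * m;
    for (int k = 0; k < 4; ++k)
        x = x * (2.0 - m * x);                   // each step roughly squares the relative error
    return std::ldexp(x, -e);                    // 1/d = (1/m) * 2^(-e)
}
```

The quotient for a division by d would then be obtained by multiplying with `nr_reciprocal(d)`, followed by whatever correction step the required precision calls for.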