single-word division algorithm - c++

I develop software for an embedded platform and need a single-word division algorithm.
The problem is as follows:
given a large integer represented by a sequence of 32-bit words (possibly many),
we need to divide it by another 32-bit word, i.e. compute the quotient (also a large integer)
and the remainder (32 bits).
Certainly, if I were developing this algorithm on x86, I could simply use GNU MP,
but this library is way too large for an embedded platform. Furthermore, our processor
does not have a hardware integer divider (integer division is performed in software).
However, the processor has quite a fast FPU, so the trick is to use floating-point arithmetic wherever possible.
Any ideas how to implement this?

Sounds like a classic optimization. Instead of dividing by D, multiply by 0x100000000/D and then divide by 0x100000000. The latter is just a wordshift, i.e. trivial. Calculating the multiplier is a bit harder, but not a lot.
See also this article for far more detailed background.

Take a look at this one: the algorithm divides an integer a[0..n-1] by a single word 'c'
using floating-point for 64x32->32 division. The limbs of the quotient 'q' are just printed in a loop; you can save them in an array if you like. Note that you don't need GMP to run the algorithm - I use it just to compare the results.
#include <cstdio> // printf
#include <cmath> // floor
#include <gmp.h>
// divides a multi-precision integer a[0..n-1] by a single word c
void div_by_limb(const unsigned *a, unsigned n, unsigned c) {
typedef unsigned long long uint64;
unsigned c_norm = c, sh = 0;
while((c_norm & 0xC0000000) == 0) { // make sure the 2 MSB are set
c_norm <<= 1; sh++;
}
// precompute the inverse of 'c'
double inv1 = 1.0 / (double)c_norm, inv2 = 1.0 / (double)c;
unsigned i, r = 0;
printf("\nquotient: "); // quotient is printed in a loop
for(i = n - 1; (int)i >= 0; i--) { // start from the most significant digit
unsigned u1 = r, u0 = a[i];
union {
struct { unsigned u0, u1; };
uint64 x;
} s = {u0, u1}; // treat [u1, u0] as 64-bit int
// divide a 2-word number [u1, u0] by 'c_norm' using floating-point
unsigned q = floor((double)s.x * inv1), q2;
r = u0 - q * c_norm;
// divide again: this time by 'c'
q2 = floor((double)r * inv2);
q = (q << sh) + q2; // reconstruct the quotient
printf("%x", q);
}
r %= c; // adjust the residue after normalization
printf("; residue: %x\n", r);
}
int main() {
mpz_t z, quo, rem;
mpz_init(z); // this is a dividend
mpz_set_str(z, "9999999999999999999999999999999", 10);
unsigned div = 9; // this is a divisor
div_by_limb((unsigned *)z->_mp_d, mpz_size(z), div);
mpz_init(quo); mpz_init(rem);
mpz_tdiv_qr_ui(quo, rem, z, div); // divide 'z' by 'div'
gmp_printf("compare: Quo: %Zx; Rem %Zx\n", quo, rem);
mpz_clear(quo);
mpz_clear(rem);
mpz_clear(z);
return 0;
}
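For checking results on a host machine without GMP, a plain integer baseline can be handy. The following is only a sketch, not the floating-point method from the answer above: it assumes a 64-bit integer type is available (on the target this division would itself be emulated in software), and the name div_by_limb_ref is made up for illustration. It stores the quotient limbs in an array instead of printing them.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (not from the answer above): divides the
// little-endian limb array a[0..n-1] by the single word c, stores the
// quotient limbs in q and returns the remainder. Requires c != 0.
uint32_t div_by_limb_ref(const std::vector<uint32_t>& a, uint32_t c,
                         std::vector<uint32_t>& q)
{
    q.assign(a.size(), 0);
    uint64_t r = 0;                               // running remainder, always < c
    for (std::size_t i = a.size(); i-- > 0; ) {   // most significant limb first
        uint64_t cur = (r << 32) | a[i];          // two-word partial dividend [r, a[i]]
        q[i] = static_cast<uint32_t>(cur / c);    // fits in 32 bits because r < c
        r = cur % c;
    }
    return static_cast<uint32_t>(r);
}
```

The loop has the same structure as the one above (one 64-by-32 division per limb, carrying the remainder down), so it can serve as a comparison when testing a floating-point variant.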

I believe that a look-up table and Newton-Raphson successive approximation is the canonical choice used by hardware designers (who generally can't afford the gates for a full hardware divide). You get to choose the trade-off between accuracy and execution time.
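As a rough illustration of the Newton-Raphson idea, here is a sketch that approximates 1/d in double precision. It is an assumption-laden example: the function name reciprocal_newton is made up, and a linear initial estimate replaces the look-up table that a hardware design would use. The iteration count is the accuracy/time knob mentioned above.

```cpp
#include <cmath>

// Hypothetical helper: approximates 1/d for a finite d > 0 using
// Newton-Raphson iterations on the mantissa.
double reciprocal_newton(double d, int iterations)
{
    int e;
    double m = std::frexp(d, &e);                 // d = m * 2^e with m in [0.5, 1)
    double x = 48.0 / 17.0 - 32.0 / 17.0 * m;     // linear initial guess for 1/m
    for (int k = 0; k < iterations; ++k)
        x = x * (2.0 - m * x);                    // error is squared at every step
    return std::ldexp(x, -e);                     // 1/d = (1/m) * 2^-e
}
```

Because the relative error is squared at each step, a handful of iterations is enough to reach double precision from this initial guess; fewer iterations trade accuracy for speed.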

Related

Fast integer division and modulo with a const runtime divisor

int n_attrs = some_input_from_other_function(); // [2..5000]
vector<int> corr_indexes; // size = n_attrs * n_attrs
vector<char> selected; // size = n_attrs
vector<pair<int,int>> selectedPairs; // size = n_attrs / 2
// vector::reserve everything here
...
// optimize the code below
const int npairs = n_attrs * n_attrs;
selectedPairs.clear();
for (int i = 0; i < npairs; i++) {
const int x = corr_indexes[i] / n_attrs;
const int y = corr_indexes[i] % n_attrs;
if (selected[x] || selected[y]) continue; // fit inside L1 cache
// below lines are called max 2500 times, so they're insignificant
selected[x] = true;
selected[y] = true;
selectedPairs.emplace_back(x, y);
if (selectedPairs.size() == n_attrs / 2) break;
}
I have a function that looks like this. The bottleneck is in
const int x = corr_indexes[i] / n_attrs;
const int y = corr_indexes[i] % n_attrs;
n_attrs is const during the loop, so I wish to find a way to speed up this loop. corr_indexes[i], n_attrs > 0, < max_int32. Edit: please note that n_attrs isn't compile-time const.
How can I optimize this loop? No extra library is allowed.
Also, is there any way to parallelize this loop (either CPU or GPU is okay, everything is already in GPU memory before this loop)?
I am restricting my comments to integer division, because to first order the modulo operation in C++ can be viewed and implemented as an integer division plus back-multiply and subtraction, although in some cases, there are cheaper ways of computing the modulo directly, e.g. when computing modulo 2^n.
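To make the two cases concrete, here is a small sketch for unsigned 32-bit operands. The helper names mod_via_div and mod_pow2 are made up for illustration; signed operands would need extra care for negative values.

```cpp
#include <cstdint>

// Remainder expressed as division, back-multiply and subtraction (d != 0).
uint32_t mod_via_div(uint32_t a, uint32_t d)
{
    uint32_t q = a / d;      // the only expensive step
    return a - q * d;        // back-multiply and subtract
}

// Remainder modulo the power of two 2^n (n < 32): a single mask.
uint32_t mod_pow2(uint32_t a, unsigned n)
{
    return a & ((uint32_t(1) << n) - 1);
}
```

For example, 77 mod 16 is 13 either way, but the second form needs no division at all.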
Integer division is pretty slow on most platforms, based on either software emulation or an iterative hardware implementation. But it was widely reported last year, based on microbenchmarking, that Apple's M1 has blazingly fast integer division, presumably by using dedicated circuitry.
Ever since a seminal paper by Torbjörn Granlund and Peter Montgomery almost thirty years ago it has been widely known how to replace integer divisions with constant divisors by using an integer multiply plus possibly a shift and / or other correction steps. This algorithm is often referred to as the magic-multiplier technique. It requires precomputation of some relevant parameters from the integer divisor for use in the multiply-based emulation sequence.
Torbjörn Granlund and Peter L. Montgomery, "Division by invariant integers using multiplication," ACM SIGPLAN Notices, Vol. 29, June 1994, pp. 61-72 (online).
At present, all major toolchains incorporate variants of the Granlund-Montgomery algorithm when dealing with integer divisors that are compile-time constant. The pre-computation occurs at compilation time inside the compiler, which then emits code using the computed parameters. Some toolchains may also use this algorithm for divisions by run-time constant divisors that are used repeatedly. For run-time constant divisors in loops, this could involve emitting a pre-computation block prior to a loop to compute the necessary parameters, and then using those for the division emulation code inside the loop.
If one's toolchain does not optimize divisions with run-time constant divisor one can use the same approach manually as demonstrated by the code below. However, this is unlikely to achieve the same efficiency as a compiler-based solution, because not all machine operations used in the desired emulation sequence can be expressed efficiently at C++ level in a portable manner. This applies in particular to arithmetic right shifts and add-with-carry.
The code below demonstrates the principle of parameter precomputation and integer division emulation via multiplication. It is quite likely that by investing more time into the design than I was willing to expend for this answer more efficient implementations of both parameter precomputation and emulation can be identified.
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#define PORTABLE (1)
uint32_t ilog2 (uint32_t i)
{
uint32_t t = 0;
i = i >> 1;
while (i) {
i = i >> 1;
t++;
}
return (t);
}
/* Based on: Granlund, T.; Montgomery, P.L.: "Division by Invariant Integers
using Multiplication". SIGPLAN Notices, Vol. 29, June 1994, pp. 61-72
*/
void prepare_magic (int32_t divisor, int32_t &multiplier, int32_t &add_mask, int32_t &sign_shift)
{
uint32_t divisoru, d, n, i, j, two_to_31 = uint32_t (1) << 31;
uint64_t m_lower, m_upper, k, msb, two_to_32 = uint64_t (1) << 32;
divisoru = uint32_t (divisor);
d = (divisor < 0) ? (0 - divisoru) : divisoru;
i = ilog2 (d);
j = two_to_31 % d;
msb = two_to_32 << i;
k = msb / (two_to_31 - j);
m_lower = msb / d;
m_upper = (msb + k) / d;
n = ilog2 (uint32_t (m_lower ^ m_upper));
n = (n > i) ? i : n;
m_upper = m_upper >> n;
i = i - n;
multiplier = int32_t (uint32_t (m_upper));
add_mask = (m_upper >> 31) ? (-1) : 0;
sign_shift = int32_t ((divisoru & two_to_31) | i);
}
int32_t arithmetic_right_shift (int32_t a, int32_t s)
{
uint32_t msb = uint32_t (1) << 31;
uint32_t ua = uint32_t (a);
ua = ua >> s;
msb = msb >> s;
return int32_t ((ua ^ msb) - msb);
}
int32_t magic_division (int32_t dividend, int32_t multiplier, int32_t add_mask, int32_t sign_shift)
{
int64_t prod = int64_t (dividend) * multiplier;
int32_t quot = (int32_t)(uint64_t (prod) >> 32);
quot = int32_t (uint32_t (quot) + (uint32_t (dividend) & uint32_t (add_mask)));
#if PORTABLE
const int32_t byte_mask = 0xff;
quot = arithmetic_right_shift (quot, sign_shift & byte_mask);
#else // PORTABLE
quot = quot >> sign_shift; // must mask shift count & use arithmetic right shift
#endif // PORTABLE
quot = int32_t (uint32_t (quot) + (uint32_t (dividend) >> 31));
if (sign_shift < 0) quot = -quot;
return quot;
}
int main (void)
{
int32_t multiplier;
int32_t add_mask;
int32_t sign_shift;
int32_t divisor;
for (divisor = -20; divisor <= 20; divisor++) {
/* avoid division by zero */
if (divisor == 0) {
divisor++;
continue;
}
printf ("divisor=%d\n", divisor);
prepare_magic (divisor, multiplier, add_mask, sign_shift);
printf ("multiplier=%d add_mask=%d sign_shift=%d\n",
multiplier, add_mask, sign_shift);
printf ("exhaustive test of dividends ... ");
uint32_t dividendu = 0;
do {
int32_t dividend = (int32_t)dividendu;
/* avoid overflow in signed integer division */
if ((divisor == (-1)) && (dividend == ((-2147483647)-1))) {
dividendu++;
continue;
}
int32_t res = magic_division (dividend, multiplier, add_mask, sign_shift);
int32_t ref = dividend / divisor;
if (res != ref) {
printf ("\nERR dividend=%d (%08x) divisor=%d res=%d ref=%d\n",
dividend, (uint32_t)dividend, divisor, res, ref);
return EXIT_FAILURE;
}
dividendu++;
} while (dividendu);
printf ("PASSED\n");
}
return EXIT_SUCCESS;
}
How can I optimize this loop?
This is a perfect use case for libdivide. This library has been designed to speed up division by a run-time constant by using the strategy compilers use at compile time. The library is header-only, so it does not create any run-time dependency. It also supports vectorized division (i.e. using SIMD instructions), which is definitely something to use in this case to drastically speed up the computation; compilers cannot do this without significantly changing the loop (and in the end it would not be as efficient because of the run-time-defined divisor). Note that the license of libdivide is very permissive (zlib), so you can easily include it in your project without strong constraints (you basically just need to mark it as modified if you change it).
If header-only libraries are not OK, then you need to reinvent the wheel. The idea is to transform a division by a constant into a sequence of shifts and multiplications. The very good answer of @njuffa specifies how to do that. You can also read the code of libdivide, which is highly optimized.
For small positive divisors and small positive dividends, there is no need for a long sequence of operations. You can cheat with a basic sequence:
uint64_t dividend = corr_indexes[i]; // Must not be too big
uint64_t divider = n_attrs;
uint64_t magic_factor = 4294967296 / n_attrs + 1; // Must be precomputed once
uint32_t result = (dividend * magic_factor) >> 32;
This method should be safe for uint16_t dividends/divisors, but not for much bigger values. In practice, it fails for dividend values above ~800,000. Bigger dividends require a more complex sequence, which is also generally slower.
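The basic sequence can be wrapped so that the magic factor is computed once per divisor. This is a sketch only, restricted to the range stated above (divisor and dividend both below 65536); the names SmallDivider, make_small_divider, small_div and small_mod are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical wrapper around the cheap magic-factor sequence.
struct SmallDivider {
    uint64_t magic;   // floor(2^32 / d) + 1, precomputed once
    uint32_t d;       // the divisor itself, kept for the remainder
};

// d must be in [1, 65535].
SmallDivider make_small_divider(uint32_t d)
{
    return { (uint64_t(1) << 32) / d + 1, d };
}

// n must be <= 65535; returns n / d.
uint32_t small_div(uint32_t n, const SmallDivider& s)
{
    return static_cast<uint32_t>((n * s.magic) >> 32);
}

// Returns n % d using back-multiply and subtract.
uint32_t small_mod(uint32_t n, const SmallDivider& s)
{
    return n - small_div(n, s) * s.d;
}
```

In the loop from the question, make_small_divider(n_attrs) would be called once before the loop; whether the inputs actually stay within the safe range has to be checked for the data at hand.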
Is there any way to parallelize this loop?
Only the division/modulus can be safely parallelized. There is a loop-carried dependency in the rest of the loop that prevents any parallelization (unless additional assumptions are made). Thus, the loop can be split in two parts: one that computes the divisions and puts the uint16_t results in a temporary array, and one that processes this array serially afterwards. The array must not be too big, since the computation would otherwise be memory bound and the resulting parallel code could be slower than the current one. Thus, you need to operate on small chunks that fit in at least the L3 cache. If chunks are too small, then thread synchronization can also be an issue. The best solution is certainly to use a rolling window of chunks. All of this is certainly a bit tedious/tricky to implement.
Note that SIMD instructions can be used for the division part (easy with libdivide). You also need to split the loop and use chunks but chunks do not need to be big since there is no synchronization overhead. Something like 64 integers should be enough.
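The split into a division phase and a serial phase might look roughly like the sketch below. It is single-threaded and uses plain division in the first phase; the point is only that phase 1 has independent iterations, so it is the part a compiler, libdivide or SIMD code could speed up. The function name select_pairs_chunked and the local selected array are assumptions for illustration, and indices are assumed to be below n_attrs * n_attrs with n_attrs <= 65535.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical restructuring of the loop from the question.
std::vector<std::pair<int, int>> select_pairs_chunked(
    const std::vector<int>& corr_indexes, int n_attrs)
{
    constexpr std::size_t kChunk = 64;            // chunk size suggested in the text
    std::vector<char> selected(static_cast<std::size_t>(n_attrs), 0);
    std::vector<std::pair<int, int>> pairs;
    const std::size_t wanted = static_cast<std::size_t>(n_attrs) / 2;
    pairs.reserve(wanted);
    uint16_t xs[kChunk], ys[kChunk];

    for (std::size_t base = 0; base < corr_indexes.size(); base += kChunk) {
        const std::size_t len = std::min(kChunk, corr_indexes.size() - base);
        // Phase 1: divisions only, no dependency between iterations.
        for (std::size_t j = 0; j < len; ++j) {
            const int v = corr_indexes[base + j];
            const int x = v / n_attrs;
            xs[j] = static_cast<uint16_t>(x);
            ys[j] = static_cast<uint16_t>(v - x * n_attrs);
        }
        // Phase 2: the loop-carried part, processed serially.
        for (std::size_t j = 0; j < len; ++j) {
            const int x = xs[j], y = ys[j];
            if (selected[x] || selected[y]) continue;
            selected[x] = 1;
            selected[y] = 1;
            pairs.emplace_back(x, y);
            if (pairs.size() == wanted) return pairs;
        }
    }
    return pairs;
}
```

Note that phase 1 may compute a few divisions that the original loop would have skipped after its early break; with chunks of 64 that waste is bounded by one chunk.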
Note that recent processors can compute divisions like this efficiently, especially for 32-bit integers (64-bit ones tend to be significantly more expensive). This is especially the case for Alder Lake, Zen 3 and M1 processors (P-cores). Note that both the modulus and the division are computed in one instruction on x86/x86-64 processors. Also note that while division has a pretty big latency, many processors can pipeline multiple divisions so as to get a reasonable throughput. For example, a 32-bit div instruction has a latency of 23~28 cycles on Skylake but a reciprocal throughput of 4~6. This is apparently not the case on Zen1/Zen2.
I would optimize the part after // optimize the code below by:
taking n_attrs
generating a function string like this:
void dynamicFunction(MyType & selectedPairs, Foo & selected)
{
const int npairs = ## * ##;
selectedPairs.clear();
for (int i = 0; i < npairs; i++) {
const int x = corr_indexes[i] / ##;
const int y = corr_indexes[i] % ##;
if (selected[x] || selected[y]) continue; // fit inside L1 cache
// below lines are called max 2500 times, so they're insignificant
selected[x] = true;
selected[y] = true;
selectedPairs.emplace_back(x, y);
if (selectedPairs.size() == ## / 2)
break;
}
}
replacing all ## with value of n_attrs
compiling it, generating a DLL
linking and calling the function
So that the n_attrs is a compile-time constant value for the DLL and the compiler can automatically do most of its optimization on the value like:
doing n&(x-1) instead of n%x when x is power-of-2 value
shifting and multiplying instead of dividing
maybe other optimizations too, like unrolling the loop with precalculated indices for x and y (since x is known)
Some integer math operations in tight-loops are easier to SIMDify/vectorize by compiler when more of the parts are known in compile-time.
If your CPU is AMD, you can even try magic floating-point operations in place of unknown/unknown division to get vectorization.
By caching all (or big percentage of) values of n_attrs, you can get rid of latencies of:
string generation
compiling
file(DLL) reading (assuming some object-oriented wrapping of DLLs)
If the part to be optimized will run on a GPU, there is a high chance that the CUDA/OpenCL implementation already performs integer division by means of floating-point operations (to keep the SIMD path occupied instead of being serialized on integer division), or is directly capable of SIMD integer operations, so you may just use it as it is on the GPU. The exception is std::vector, which is not supported by all C++ CUDA compilers (and not in OpenCL kernels). These host-environment-related parts could be computed after the kernel is executed (with the kernel excluding emplace_back, or using a struct that works on the GPU instead).
So here is the actual best solution in my case.
Instead of representing index = row * n_cols + col, do index = (row << 16) | col for 32 bits, or index = (row << 32) | col for 64 bits. Then row = index >> 16, col = index & 0xFFFF for 32 bits (or row = index >> 32, col = index & 0xFFFFFFFF for 64 bits). Or even better, just uint16_t* pairs = reinterpret_cast<uint16_t*>(index_array);, then pairs[i], pairs[i+1] for each i % 2 == 0 is a pair.
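A minimal sketch of the 32-bit packing, assuming both row and col are below 2^16; the helper names pack_index, unpack_row and unpack_col are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical helpers: row in the high 16 bits, col in the low 16 bits.
inline uint32_t pack_index(uint32_t row, uint32_t col)
{
    return (row << 16) | col;
}

inline uint32_t unpack_row(uint32_t index) { return index >> 16; }
inline uint32_t unpack_col(uint32_t index) { return index & 0xFFFFu; }
```

Decoding then costs one shift and one mask instead of a division and a modulo.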
This is assuming the number of rows/columns is less than 2^16 (or 2^32).
I'm still keeping the top answer because it still answers the case where division has to be used.

Count leading zero bits for each element in AVX2 vector, emulate _mm256_lzcnt_epi32

With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element.
Is there an efficient way to implement this using AVX and AVX2 instructions only?
Currently I'm using a loop which extracts each element and applies the _lzcnt_u32 function.
Related: to bit-scan one large bitmap, see Count leading zeros in __m256i word which uses pmovmskb -> bitscan to find which byte to do a scalar bitscan on.
This question is about doing 8 separate lzcnts on 8 separate 32-bit elements when you're actually going to use all 8 results, not just select one.
float represents numbers in an exponential format, so int->FP conversion gives us the position of the highest set bit encoded in the exponent field.
We want int->float with magnitude rounded down (truncate the value towards 0), not the default rounding of nearest. That could round up and make 0x3FFFFFFF look like 0x40000000. If you're doing a lot of these conversions without doing any FP math, you could set the rounding mode in the MXCSR (see footnote 1) to truncation, then set it back when you're done.
Otherwise you can use v & ~(v>>8) to keep the 8 most-significant bits and zero some or all lower bits, including a potentially-set bit 8 below the MSB. That's enough to ensure all rounding modes never round up to the next power of two. It always keeps the 8 MSB because v>>8 shifts in 8 zeros, so inverted that's 8 ones. At lower bit positions, wherever the MSB is, 8 zeros are shifted past there from higher positions, so it will never clear the most significant bit of any integer. Depending on how set bits below the MSB line up, it might or might not clear more below the 8 most significant.
After conversion, we use an integer shift on the bit-pattern to bring the exponent (and sign bit) to the bottom and undo the bias with a saturating subtract. We use min to set the result to 32 if no bits were set in the original 32-bit input.
__m256i avx2_lzcnt_epi32 (__m256i v) {
// prevent value from being rounded up to the next power of two
v = _mm256_andnot_si256(_mm256_srli_epi32(v, 8), v); // keep 8 MSB
v = _mm256_castps_si256(_mm256_cvtepi32_ps(v)); // convert an integer to float
v = _mm256_srli_epi32(v, 23); // shift down the exponent
v = _mm256_subs_epu16(_mm256_set1_epi32(158), v); // undo bias
v = _mm256_min_epi16(v, _mm256_set1_epi32(32)); // clamp at 32
return v;
}
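A scalar model of the same steps can help when checking the vector code on a machine without AVX2. This is a sketch under assumptions: the name lzcnt32_via_float is made up, the default round-to-nearest mode is in effect, and the unsigned-to-signed cast wraps modulo 2^32 (guaranteed since C++20, and the behavior of mainstream compilers before that).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of one 32-bit lane of avx2_lzcnt_epi32.
unsigned lzcnt32_via_float(uint32_t v)
{
    uint32_t t = v & ~(v >> 8);                  // keep 8 MSBs, block round-up
    // Signed int -> float conversion, like _mm256_cvtepi32_ps.
    float f = static_cast<float>(static_cast<int32_t>(t));
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);         // reinterpret the bit pattern
    uint32_t e = bits >> 23;                     // biased exponent, sign in bit 8
    uint32_t r = (e >= 158) ? 0 : 158 - e;       // saturating subtract (undo bias)
    return r < 32 ? r : 32;                      // clamp: input 0 gives 158 -> 32
}
```

An input with the top bit set converts to a negative float, so the shifted value is at least 256 and the saturating subtract yields 0, matching the vector version.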
Footnote 1: fp->int conversion is available with truncation (cvtt), but int->fp conversion is only available with default rounding (subject to MXCSR).
AVX512F introduces rounding-mode overrides for 512-bit vectors which would solve the problem, __m512 _mm512_cvt_roundepi32_ps( __m512i a, int r);. But all CPUs with AVX512F also support AVX512CD so you could just use _mm512_lzcnt_epi32. And with AVX512VL, _mm256_lzcnt_epi32.
@aqrit's answer looks like a more-clever use of FP bithacks. My answer below is based on the first place I looked for a bithack which was old and aimed at scalar so it didn't try to avoid double (which is wider than int32 and thus a problem for SIMD).
It uses HW signed int->float conversion and saturating integer subtracts to handle the MSB being set (negative float), instead of stuffing bits into a mantissa for manual uint->double. If you can set MXCSR to round down across a lot of these _mm256_lzcnt_epi32, that's even more efficient.
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogIEEE64Float suggests stuffing integers into the mantissa of a large double, then subtracting to get the FPU hardware to get a normalized double. (I think this bit of magic is doing uint32_t -> double, with the technique @Mysticial explains in How to efficiently perform double/int64 conversions with SSE/AVX? (which works for uint64_t up to 2^52-1).)
Then grab the exponent bits of the double and undo the bias.
I think integer log2 is the same thing as lzcnt, but there might be an off-by-1 at powers of 2.
The Stanford Graphics bithack page lists other branchless bithacks you could use that would probably still be better than 8x scalar lzcnt.
If you knew your numbers were always small-ish (like less than 2^23) you could maybe do this with float and avoid splitting and blending.
int v; // 32-bit integer to find the log base 2 of
int r; // result of log_2(v) goes here
union { unsigned int u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = v;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
The code above loads a 64-bit (IEEE-754 floating-point) double with a 32-bit integer (with no padding bits) by storing the integer in the mantissa while the exponent is set to 2^52. From this newly minted double, 2^52 (expressed as a double) is subtracted, which sets the resulting exponent to the log base 2 of the input value, v. All that is left is shifting the exponent bits into position (20 bits right) and subtracting the bias, 0x3FF (which is 1023 decimal).
To do this with AVX2, blend and shift+blend odd/even halves with set1_epi32(0x43300000) and _mm256_castps_pd to get a __m256d. And after subtracting, _mm256_castpd_si256 and shift / blend the low/high halves into place then mask to get the exponents.
Doing integer operations on FP bit-patterns is very efficient with AVX2, just 1 cycle of extra latency for a bypass delay when doing integer shifts on the output of an FP math instruction.
(TODO: write it with C++ intrinsics, edit welcome or someone else could just post it as an answer.)
I'm not sure if you can do anything with int -> double conversion and then reading the exponent field. Negative numbers have no leading zeros and positive numbers give an exponent that depends on the magnitude.
If you did want that, you'd go one 128-bit lane at a time, shuffling to feed xmm -> ymm packed int32_t -> packed double conversion.
The question is also tagged AVX, but there are no instructions for integer processing in AVX, which means one needs to fall back to SSE on platforms that support AVX but not AVX2. I am showing an exhaustively tested, but a bit pedestrian version below. The basic idea here is as in the other answers, in that the count of leading zeros is determined by the floating-point normalization that occurs during integer to floating-point conversion. The exponent of the result has a one-to-one correspondence with the count of leading zeros, except that the result is wrong in the case of an argument of zero. Conceptually:
clz (a) = (158 - (float_as_uint32 (uint32_to_float_rz (a)) >> 23)) + (a == 0)
where float_as_uint32() is a re-interpreting cast and uint32_to_float_rz() is a conversion from unsigned integer to floating-point with truncation. A normal, rounding, conversion could bump up the conversion result to the next power of two, resulting in an incorrect count of leading zero bits.
SSE does not provide truncating integer to floating-point conversion as a single instruction, nor conversions from unsigned integers. This functionality needs to be emulated. The emulation does not need to be exact, as long as it does not change the magnitude of the conversion result. The truncation part is handled by the invert - right shift - andn technique from aqrit's answer. To use signed conversion, we cut the number in half before the conversion, then double and increment after the conversion:
float approximate_uint32_to_float_rz (uint32_t a)
{
float r = (float)(int)((a >> 1) & ~(a >> 2));
return r + r + 1.0f;
}
This approach is translated into SSE intrinsics in sse_clz() below.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "immintrin.h"
/* compute count of leading zero bits using floating-point normalization.
clz(a) = (158 - (float_as_uint32 (uint32_to_float_rz (a)) >> 23)) + (a == 0)
The problematic part here is uint32_to_float_rz(). SSE does not offer
conversion of unsigned integers, and no rounding modes in integer to
floating-point conversion. Since all we need is an approximate version
that preserves order of magnitude:
float approximate_uint32_to_float_rz (uint32_t a)
{
float r = (float)(int)((a >> 1) & ~(a >> 2));
return r + r + 1.0f;
}
*/
__m128i sse_clz (__m128i a)
{
__m128 fp1 = _mm_set_ps1 (1.0f);
__m128i zero = _mm_set1_epi32 (0);
__m128i i158 = _mm_set1_epi32 (158);
__m128i iszero = _mm_cmpeq_epi32 (a, zero);
__m128i lsr1 = _mm_srli_epi32 (a, 1);
__m128i lsr2 = _mm_srli_epi32 (a, 2);
__m128i atrunc = _mm_andnot_si128 (lsr2, lsr1);
__m128 atruncf = _mm_cvtepi32_ps (atrunc);
__m128 atruncf2 = _mm_add_ps (atruncf, atruncf);
__m128 conv = _mm_add_ps (atruncf2, fp1);
__m128i convi = _mm_castps_si128 (conv);
__m128i lsr23 = _mm_srli_epi32 (convi, 23);
__m128i res = _mm_sub_epi32 (i158, lsr23);
return _mm_sub_epi32 (res, iszero);
}
/* Portable reference implementation of 32-bit count of leading zeros */
int clz32 (uint32_t a)
{
uint32_t r = 32;
if (a >= 0x00010000) { a >>= 16; r -= 16; }
if (a >= 0x00000100) { a >>= 8; r -= 8; }
if (a >= 0x00000010) { a >>= 4; r -= 4; }
if (a >= 0x00000004) { a >>= 2; r -= 2; }
r -= a - (a & (a >> 1));
return r;
}
/* Test floating-point based count leading zeros exhaustively */
int main (void)
{
__m128i res;
uint32_t resi[4], refi[4];
uint32_t count = 0;
do {
refi[0] = clz32 (count);
refi[1] = clz32 (count + 1);
refi[2] = clz32 (count + 2);
refi[3] = clz32 (count + 3);
res = sse_clz (_mm_set_epi32 (count + 3, count + 2, count + 1, count));
memcpy (resi, &res, sizeof resi);
if ((resi[0] != refi[0]) || (resi[1] != refi[1]) ||
(resi[2] != refi[2]) || (resi[3] != refi[3])) {
printf ("error # %08x %08x %08x %08x\n",
count, count+1, count+2, count+3);
return EXIT_FAILURE;
}
count += 4;
} while (count);
return EXIT_SUCCESS;
}

Multiplication between big integers and doubles

I am managing some big (128~256 bits) integers with gmp. I have come to a point where I would like to multiply them by a double close to 1 (0.1 < double < 10), the result still being an approximated integer. A good example of the operation I need to do is the following:
int i = 1000000000000000000 * 1.23456789
I searched in the gmp documentation but I didn't find a function for this, so I ended up writing this code which seems to work well:
void mpz_mult_d(mpz_class & r, const mpz_class & i, double d, int prec=10) {
if (prec > 15) prec=15; //avoids overflows
uint_fast64_t m = (uint_fast64_t) floor(d);
r = i * m;
uint_fast64_t pos=1;
for (uint_fast8_t j=0; j<prec; j++) {
const double posd = (double) pos;
m = ((uint_fast64_t) floor(d * posd * 10.)) -
((uint_fast64_t) floor(d * posd)) * 10;
pos*=10;
r += (i * m) /pos;
}
}
Can you please tell me what you think? Do you have any suggestions to make it more robust or faster?
this is what you wanted:
// BYTE lint[_N] ... lint[0]=MSB, lint[_N-1]=LSB
void mul(BYTE *c,BYTE *a,double b) // c[_N]=a[_N]*b
{
int i; DWORD cc;
double q[_N+1],aa,bb;
for (q[0]=0.0,i=0;i<_N;) // mul,carry down
{
bb=double(a[i])*b; aa=floor(bb); bb-=aa;
q[i]+=aa; i++;
q[i]=bb*256.0;
}
cc=0; if (q[_N]>127.0) cc=1.0; // round
for (i=_N-1;i>=0;i--) // carry up
{
double aa,bb;
cc+=q[i];
c[i]=cc&255;
cc>>=8;
}
}
_N is the number of bits/8 per large int; a large int is an array of _N BYTEs where the first byte is the MSB (most significant BYTE) and the last BYTE is the LSB (least significant BYTE).
The function does not handle the sign, but that takes only one if and some xor/inc to add.
The trouble is that double has low precision even for your number 1.23456789! Due to precision loss the result is not exactly what it should be (1234387129122386944 instead of 1234567890000000000). I think my code is much quicker and even more precise than yours, because I do not need to mul/mod/div numbers by 10; instead I use bit shifting where possible and work not in base-10 digits but in base-256 digits (8 bits). If you need more precision, then use long arithmetic. You can speed up this code by using larger digits (16, 32, ... bits).
My long arithmetic for precise astro computations usually uses fixed-point 256.256-bit numbers consisting of 2*8 DWORDs + sign, but of course it is much slower and some goniometric functions are really tricky to implement. If you want just basic functions, then coding your own long arithmetic is not that hard.
Also, if you often want to have numbers in readable form, it is good to compromise between speed/size and consider using BCD-coded numbers instead of binary-coded numbers.
I am not familiar enough with either C++ or GMP to suggest source code without syntax errors, but what you are doing is more complicated than it should be and can introduce unnecessary approximation.
Instead, I suggest you write function mpz_mult_d() like this:
mpz_mult_d(mpz_class & r, const mpz_class & i, double d) {
d = ldexp(d, 52); /* exact, no overflow because 1 <= d <= 10 */
unsigned long long l = d; /* exact because d is an integer */
p = l * i; /* exact, in GMP */
(quotient, remainder) = p / 2^52; /* in GMP */
And now the next step depends on the kind of rounding you wish. If you wish the multiplication of d by i to give a result rounded toward -inf, just return quotient as result of the function. If you wish a result rounded to the nearest integer, you must look at remainder:
assert(0 <= remainder); /* proper Euclidean division */
assert(remainder < 2^52);
if (remainder < 2^51) return quotient;
if (remainder > 2^51) return quotient + 1; /* in GMP */
if (remainder == 2^51) return quotient + (quotient & 1); /* in GMP, round to “even” */
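The same scheme can be tried out without GMP on a small scale. The sketch below is a stand-in, not the GMP version: it assumes i fits in 64 bits, holds the product in unsigned __int128 (a GCC/Clang extension), and keeps the answer's assumption that 1 <= d < 10. The name mul_by_double is made up for illustration.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical small-scale model of the steps above: returns i * d rounded
// to the nearest integer, ties to even. Requires 1 <= d < 10.
unsigned __int128 mul_by_double(uint64_t i, double d)
{
    const uint64_t half = uint64_t(1) << 51;                    // 2^51
    const uint64_t mask = (uint64_t(1) << 52) - 1;              // 2^52 - 1
    uint64_t l = static_cast<uint64_t>(std::ldexp(d, 52));      // exact integer d * 2^52
    unsigned __int128 p = static_cast<unsigned __int128>(l) * i; // exact product
    unsigned __int128 quotient = p >> 52;                        // p / 2^52
    uint64_t remainder = static_cast<uint64_t>(p & mask);        // p mod 2^52
    if (remainder < half) return quotient;
    if (remainder > half) return quotient + 1;
    return quotient + (quotient & 1);                            // tie: round to even
}
```

With GMP the shift and mask would presumably be replaced by the library's own division by a power of two; only the rounding logic carries over unchanged.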
PS: I found your question by random browsing but if you had tagged it “floating-point”, people more competent than me could have answered it quickly.
Try this strategy:
Convert integer value to big float
Convert double value to big float
Make product
Convert result to integer
mpf_set_z(...)
mpf_set_d(...)
mpf_mul(...)
mpz_set_f(...)

Generating random floating-point values based on random bit stream

Given a random source (a generator of a random bit stream), how do I generate a uniformly distributed random floating-point value in a given range?
Assume that my random source looks something like:
unsigned int GetRandomBits(char* pBuf, int nLen);
And I want to implement
double GetRandomVal(double fMin, double fMax);
Notes:
I don't want the result precision to be limited (for example only 5 digits).
Strict uniform distribution is a must
I'm not asking for a reference to an existing library. I want to know how to implement it from scratch.
For pseudo-code / code, C++ would be most appreciated
I don't think I'll ever be convinced that you actually need this, but it was fun to write.
#include <stdint.h>
#include <cmath>
#include <cstdio>
FILE* devurandom;
bool geometric(int x) {
// returns true with probability min(2^-x, 1)
if (x <= 0) return true;
while (1) {
uint8_t r;
fread(&r, sizeof r, 1, devurandom);
if (x < 8) {
return (r & ((1 << x) - 1)) == 0;
} else if (r != 0) {
return false;
}
x -= 8;
}
}
double uniform(double a, double b) {
// requires IEEE doubles and 0.0 < a < b < inf and a normal
// implicitly computes a uniform random real y in [a, b)
// and returns the greatest double x such that x <= y
union {
double f;
uint64_t u;
} convert;
convert.f = a;
uint64_t a_bits = convert.u;
convert.f = b;
uint64_t b_bits = convert.u;
uint64_t mask = b_bits - a_bits;
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask |= mask >> 32;
int b_exp;
frexp(b, &b_exp);
while (1) {
// sample uniform x_bits in [a_bits, b_bits)
uint64_t x_bits;
fread(&x_bits, sizeof x_bits, 1, devurandom);
x_bits &= mask;
x_bits += a_bits;
if (x_bits >= b_bits) continue;
double x;
convert.u = x_bits;
x = convert.f;
// accept x with probability proportional to 2^x_exp
int x_exp;
frexp(x, &x_exp);
if (geometric(b_exp - x_exp)) return x;
}
}
int main() {
devurandom = fopen("/dev/urandom", "r");
for (int i = 0; i < 100000; ++i) {
printf("%.17g\n", uniform(1.0 - 1e-15, 1.0 + 1e-15));
}
}
Here is one way of doing it.
The IEEE Std 754 double format is as follows:
[s][ e ][ f ]
where s is the sign bit (1 bit), e is the biased exponent (11 bits) and f is the fraction (52 bits).
Beware that the layout in memory will be different on little-endian machines.
For 0 < e < 2047, the number represented is
(-1)**(s) * 2**(e - 1023) * (1.f)
By setting s to 0, e to 1023 and f to 52 random bits from your bit stream, you get a random double in the interval [1.0, 2.0). This interval is unique in that it contains 2 ** 52 doubles, and these doubles are equidistant. If you then subtract 1.0 from the constructed double, you get a random double in the interval [0.0, 1.0). Moreover, the property about being equidistant is preserved.
From there you should be able to scale and translate as needed.
I'm surprised that for a question this old, nobody had actual code for the best answer. User515430's answer got it right: you can take advantage of the IEEE-754 double format to put 52 bits directly into a double with no math at all. But he didn't give code. So here it is, from my public domain ojrandlib:
double ojr_next_double(ojr_generator *g) {
uint64_t r = (OJR_NEXT64(g) & 0xFFFFFFFFFFFFFull) | 0x3FF0000000000000ull;
return *(double *)(&r) - 1.0;
}
OJR_NEXT64() gets a 64-bit random number. If you have a more efficient way of getting only 52 bits, use that instead.
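A minimal sketch of the same bit trick with the "scale and translate" step added, assuming IEEE 754 binary64 doubles. The names unit_from_bits and scale_to_range are hypothetical, and the 64 random bits are passed in by the caller so no particular generator is implied. It uses memcpy rather than a pointer cast, which sidesteps strict-aliasing problems.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Builds a double in [0.0, 1.0) from the low 52 bits of `bits`.
// Assumption: double is IEEE 754 binary64.
double unit_from_bits(std::uint64_t bits)
{
    // Sign 0, biased exponent 1023, random fraction: a value in [1.0, 2.0).
    std::uint64_t r = (bits & 0xFFFFFFFFFFFFFull) | 0x3FF0000000000000ull;
    double d;
    std::memcpy(&d, &r, sizeof d);   // well-defined alternative to *(double *)(&r)
    return d - 1.0;
}

// Hypothetical wrapper: scales and translates the unit value to the
// requested range. The final addition rounds, so fMax itself can
// occasionally be returned.
double scale_to_range(std::uint64_t bits, double fMin, double fMax)
{
    assert(fMin < fMax);
    return fMin + unit_from_bits(bits) * (fMax - fMin);
}
```

Note that this only ever produces 2 ** 52 distinct values, so for wide ranges many representable doubles can never be returned.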
This is easy, as long as you have an integer type with as many bits of precision as a double. For instance, an IEEE double-precision number has 53 bits of precision, so a 64-bit integer type is enough:
#include <limits.h>
double GetRandomVal(double fMin, double fMax) {
unsigned long long n ;
GetRandomBits ((char*)&n, sizeof(n)) ;
return fMin + (n * (fMax - fMin))/ULLONG_MAX ;
}
This is probably not the answer you want, but the specification here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3225.pdf
in sections [rand.util.canonical] and [rand.dist.uni.real], contains sufficient information to implement what you want, though with slightly different syntax. It isn't easy, but it is possible. I speak from personal experience. A year ago I knew nothing about random numbers, and I was able to do it. Though it took me a while... :-)
The question is ill-posed. What does uniform distribution over floats even mean?
Taking our cue from discrepancy, one way to operationalize your question is to define that you want the distribution that minimizes the following value:
Where x is the random variable you are sampling with your GetRandomVal(double fMin, double fMax) function, and P(x <= t) means the probability that a random x is smaller than or equal to t.
And now you can go on and try to evaluate e.g. a dabbler's answer. (Hint: all the answers that fail to use the whole precision and stick to e.g. 52 bits will fail this minimization criterion.)
However, if you just want to be able to generate all float bit patterns that fall into your specified range with equal probability, even if that means that e.g. asking for GetRandomVal(0,1000) will create more values between 0 and 1.5 than between 1.5 and 1000, that's easy: any interval of IEEE floating point numbers, when interpreted as bit patterns, maps easily to a very small number of intervals of unsigned int64. See e.g. this question. Generating equally distributed random values of unsigned int64 in any given interval is easy.
I may be misunderstanding the question, but what stops you from simply sampling the next n bits from the random bit stream and converting that to a base 10 number in the range 0 to 2^n - 1?
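A rough sketch of the bit-pattern approach described above, restricted to non-negative finite doubles, where the ordering of values matches the ordering of their bit patterns. The helper names (bits_of, double_of, pattern_count, nth_pattern) are made up for illustration; the caller is assumed to supply an index n that is uniform in [0, pattern_count).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a double as its 64-bit pattern and back (assumes IEEE 754 binary64).
std::uint64_t bits_of(double d)
{
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

double double_of(std::uint64_t u)
{
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}

// Number of distinct doubles in [fMin, fMax], for 0 <= fMin <= fMax < inf.
std::uint64_t pattern_count(double fMin, double fMax)
{
    assert(!std::signbit(fMin) && fMin <= fMax && std::isfinite(fMax));
    return bits_of(fMax) - bits_of(fMin) + 1;
}

// The n-th representable double at or above fMin (n = 0 gives fMin itself).
double nth_pattern(double fMin, std::uint64_t n)
{
    assert(!std::signbit(fMin));
    return double_of(bits_of(fMin) + n);
}
```

With these, pattern_count(0.0, 1.5) comes out far larger than pattern_count(1.5, 1000.0), which is exactly the skew toward small values mentioned above. A range that spans zero would need the negative half handled separately.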
To get a random value in [0..1[ you could do something like:
double value = 0;
for (int i=0;i<53;i++)
value = 0.5 * (value + random_bit()); // Insert 1 random bit
// or value = ldexp(value+random_bit(),-1);
// or group several bits into one single ldexp
return value;

Converting from unsigned long long to float with round to nearest even

I need to write a function that rounds from unsigned long long to float, and the rounding should be toward nearest even.
I cannot just do a C++ type-cast, since AFAIK the standard does not specify the rounding.
I was thinking of using boost::numeric, but i could not find any useful lead after reading the documentation. Can this be done using that library?
Of course, if there is an alternative, i would be glad to use it.
Any help would be much appreciated.
EDIT: Adding an example to make things a bit clearer.
Suppose i want to convert 0xffffff7fffffffff to its floating point representation. The C++ standard permits either one of:
0x5f7fffff ~ 1.9999999*2^63
0x5f800000 = 2^64
Now if you add the restriction of round to nearest even, only the first result is acceptable.
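To make the example concrete, here is a small check of the arithmetic, assuming a 32-bit IEEE 754 float. The helper float_of and the function check_example are hypothetical names; the code decodes the two candidate bit patterns and compares the distances from the source value to each of them.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float (assumes IEEE 754 binary32).
float float_of(std::uint32_t u)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

void check_example()
{
    const unsigned long long v     = 0xffffff7fffffffffull;
    const unsigned long long lower = 0xffffffull << 40;   // integer value of 0x5f7fffff

    // 0x5f7fffff decodes to (2^24 - 1) * 2^40, and 0x5f800000 to 2^64.
    assert(float_of(0x5f7fffffu) == std::ldexp(16777215.0f, 40));
    assert(float_of(0x5f800000u) == std::ldexp(1.0f, 64));

    const unsigned long long dist_down = v - lower;  // distance to the lower candidate
    const unsigned long long dist_up   = 0ull - v;   // 2^64 - v, via unsigned wraparound
    assert(dist_down == 0x7fffffffffull);
    assert(dist_up   == 0x8000000001ull);
    assert(dist_down < dist_up);                     // so nearest is 0x5f7fffff
}
```

The source value sits just below the midpoint of the two candidates, so this particular input is not a tie; rounding to nearest alone already selects the lower one.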
Since you have so many bits in the source that can't be represented in the float and you can't (apparently) rely on the language's conversion, you'll have to do it yourself.
I devised a scheme that may or may not help you. Basically, a float has a 24-bit significand (23 stored fraction bits plus the implicit leading 1), so I pick up the 24 most significant bits of the source number, counting from its highest set bit. Then I save off and mask away all the lower bits. Then, based on the value of those lower bits, I round the "new" LSB up or down (an exact tie goes to the even value) and finally use static_cast to create a float. By that point the value fits in 24 bits, so the cast is exact and no longer depends on how the implementation rounds.
I left in some couts that you can remove as desired.
#include <climits>
#include <iostream>
const int mask_bit_count = 23; // fraction bits kept below the leading 1 bit
float ull_to_float2(unsigned long long val)
{
// How many bits are needed?
int b = sizeof(unsigned long long) * CHAR_BIT - 1;
for(; b >= 0; --b)
{
if(val & (1ull << b))
{
break;
}
}
std::cout << "Need " << (b + 1) << " bits." << std::endl;
// If there are few enough significant bits, the normal cast is exact.
if(b <= mask_bit_count)
{
return static_cast<float>(val);
}
// Weight of the lowest bit that survives in the float:
const unsigned long long lsb = 1ull << (b - mask_bit_count);
// Save off the low-order bits that do not fit:
unsigned long long low_bits = val & (lsb - 1);
std::cout << "Saved low bits=" << low_bits << std::endl;
std::cout << val << "->mask->";
// Now mask away those low bits:
val &= ~(lsb - 1);
std::cout << val << std::endl;
// Finally, decide how to round the new LSB:
// up if above the halfway point, or exactly halfway with an odd LSB.
const unsigned long long half = lsb / 2ull;
if(low_bits > half || (low_bits == half && (val & lsb)))
{
std::cout << "Rounding up " << val;
val += lsb;
std::cout << " to " << val << std::endl;
// The carry can run off the top of 64 bits; the result is then 2^64.
if(val == 0)
{
return 18446744073709551616.0f;
}
}
// val now has at most 24 significant bits, so this cast is exact.
return static_cast<float>(val);
}
I did this in Smalltalk for arbitrary precision integer (LargeInteger), implemented and tested in Squeak/Pharo/Visualworks/Gnu Smalltalk/Dolphin Smalltalk, and even blogged about it if you can read Smalltalk code http://smallissimo.blogspot.fr/2011/09/clarifying-and-optimizing.html .
The trick for accelerating the algorithm is this one: an IEEE 754 compliant FPU will correctly round the result of an inexact operation. So we can afford 1 inexact operation and let the hardware round correctly for us. That lets us easily handle the first 48 bits. But we cannot afford two inexact operations, so we sometimes have to take care of the lowest bits differently...
Hope the code is documented enough:
#include <math.h>
#include <float.h>
float ull_to_float3(unsigned long long val)
{
int prec=FLT_MANT_DIG ; // 24 bits, the float precision
unsigned long long high=val>>prec; // the high bits above float precision
unsigned long long mask=(1ull<<prec) - 1 ; // 0xFFFFFFull a mask for extracting significant bits
unsigned long long tmsk=(1ull<<(prec - 1)) - 1; // 0x7FFFFFull same but tie bit
// handle trivial cases, 48 bits or less,
// let FPU apply correct rounding after exactly 1 inexact operation
if( high <= mask )
return ldexpf((float) high,prec) + (float) (val & mask);
// more than 48 bits,
// what scaling s is needed to isolate highest 48 bits of val?
int s = 0;
for( ; high > mask ; high >>= 1) ++s;
// high now contains highest 24 bits
float f_high = ldexpf( (float) high , prec + s );
// store next 24 bits in mid
unsigned long long mid = (val >> s) & mask;
// care of rare case when trailing low bits can change the rounding:
// can mid bits be a case of perfect tie or perfect zero?
if( (mid & tmsk) == 0ull )
{
// if low bits are zero, mid is either an exact tie or an exact zero
// else just increment mid to distinguish from such case
unsigned long long low = val & ((1ull << s) - 1);
if(low > 0ull) mid++;
}
return f_high + ldexpf( (float) mid , s );
}
Bonus: this code should round according to your FPU rounding mode, whatever it may be, since we implicitly used the FPU to perform rounding with the + operation.
However, beware of aggressive optimizations in standards < C99, who knows when the compiler will use extended precision... (unless you force something like -ffloat-store).
If you always want to round to nearest even, whatever the current rounding mode, then you'll have to increment high bits when:
mid bits > tie, where tie=1ull<<(prec-1);
mid bits == tie and (low bits > 0 or high bits is odd).
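The rule above can be sketched with integer arithmetic only, so that the result does not depend on the current FPU rounding mode. This is an illustrative variant rather than the code from this answer: the name ull_to_float_rne is hypothetical, and the mid and low bits are merged into a single "rest" field, which gives the same decision. It assumes float is IEEE 754 binary32.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: unsigned 64-bit integer to float, always rounding to
// nearest with ties to even, regardless of the FPU rounding mode.
float ull_to_float_rne(std::uint64_t val)
{
    const int prec = 24;                        // float significand bits (assumed binary32)
    if (val < (1ull << prec))
        return static_cast<float>(val);         // fits exactly, nothing to round

    int top = 63;                               // index of the highest set bit
    while (!(val >> top)) --top;

    const int s = top + 1 - prec;               // number of discarded bits, at least 1
    std::uint64_t high = val >> s;              // the 24 bits that are kept
    const std::uint64_t rest = val & ((1ull << s) - 1);
    const std::uint64_t tie  = 1ull << (s - 1); // exactly half of the kept LSB

    // Round up above the tie, or on an exact tie when the kept part is odd.
    if (rest > tie || (rest == tie && (high & 1)))
        ++high;
    assert(high <= (1ull << prec));             // at most 2^24, still exact as a float

    return std::ldexp(static_cast<float>(high), s);  // exact scaling by 2^s
}
```

For the value from the question, ull_to_float_rne(0xffffff7fffffffffull) keeps high = 0xffffff unchanged because the discarded bits are below the tie.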
EDIT:
If you stick to round-to-nearest-even tie breaking, then another solution is to use Shewchuk's EXPANSION-SUM of non-adjacent parts (fhigh,flow) and (fmid), see http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps :
#include <math.h>
#include <float.h>
float ull_to_float4(unsigned long long val)
{
int prec=FLT_MANT_DIG ; // 24 bits, the float precision
unsigned long long mask=(1ull<<prec) - 1 ; // 0xFFFFFFull a mask for extracting significant bits
unsigned long long high=val>>(2*prec); // the high bits
unsigned long long mid=(val>>prec) & mask; // the mid bits
unsigned long long low=val & mask; // the low bits
float fhigh = ldexpf((float) high,2*prec);
float fmid = ldexpf((float) mid,prec);
float flow = (float) low;
float sum1 = fmid + flow;
float residue1 = flow - (sum1 - fmid);
float sum2 = fhigh + sum1;
float residue2 = sum1 - (sum2 - fhigh);
return (residue1 + residue2) + sum2;
}
This makes a branch-free algorithm with a bit more ops. It may work with other rounding modes, but I let you analyze the paper to make sure...
What is possible between 8-byte integers and the float format is straightforward to explain but less so to implement!
The next paragraph concerns what is representable in 8 byte signed integers.
All positive integers between 1 (2^0) and 16777215 (2^24-1) are exactly representable in IEEE 754 single precision (float). Or, to be precise, all numbers between 2^0 and 2^24-2^0 in increments of 2^0. The next range of exactly representable positive integers is 2^1 to 2^25-2^1 in increments of 2^1 and so on up to 2^39 to 2^63-2^39 in increments of 2^39.
Unsigned 8-byte integer values can be expressed up to 2^64-2^40 in increments of 2^40.
The single precision format doesn't stop here but goes on all the way up to the range 2^103 to 2^127-2^103 in increments of 2^103.
For 4-byte integers (long) the highest float range is 2^7 to 2^31-2^7 in 2^7 increments.
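The increments listed above can be observed directly by asking for the next representable float. A small sketch, assuming IEEE 754 single precision; spacing_above is a made-up helper name.

```cpp
#include <cassert>
#include <cmath>

// Gap between x and the next representable float above it.
float spacing_above(float x)
{
    assert(std::isfinite(x) && x >= 0.0f);
    return std::nextafter(x, INFINITY) - x;
}
```

For instance, spacing_above(8388608.0f) (2^23) gives 1, spacing_above(16777216.0f) (2^24) gives 2, and spacing_above(std::ldexp(1.0f, 63)) gives 2^40, matching the increments quoted for the top of the unsigned 8-byte range.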
On the x86 architecture the largest integer type supported by the floating point instruction set is the 8 byte signed integer. 2^64-1 cannot be loaded by conventional means.
This means that for a given range increment expressed as "2^i where i is an integer >0", all integers that end with the bit pattern 0x1 up to 2^i-1 will not be exactly representable within that range in a float.
This means that what you call rounding upwards is actually dependent on what range you are working in. It is of no use to try to round up by 1 (2^0) or 16 (2^4) if the granularity of the range you are in is 2^19.
An additional consequence of what you propose to do (rounding 2^63-1 to 2^63) could result in an (long integer format) overflow if you attempt the following conversion: longlong_int=(long long) ((float) 2^63).
Check out this small program I wrote (in C) which should help illustrate what is possible and what isn't.
int main (void)
{
__int64 basel=1,baseh=16777215,src,dst,j;
float cnvl,cnvh,range;
int i=0;
while (i<40)
{
src=basel<<i;
cnvl=(float) src;
dst=(__int64) cnvl; /* compare dst with basel */
src=baseh<<i;
cnvh=(float) src;
dst=(__int64) cnvh; /* compare dst with baseh */
j=basel;
while (j<=baseh)
{
range=(float) j;
dst=(__int64) range;
if (j!=dst) dst/=0;
j+=basel;
}
++i;
}
return i;
}
This program shows the representable integer value ranges. There is overlap between them: for example 2^5 is representable in all ranges with a lower boundary 2^b where 1=