Dividing a 2x32 bit big integer by 1000 - c++

I have big number, time (micro seconds) stored in two 32bit variables.
I need a help, how to change micro seconds time into millisecond, so I can store result of difference in 32bit number.
More details:
I have one time in two 32bit variables. Where one variable have more significant bits and other have less significant bits. This time have microseconds resolution so I want to change it to milliseconds. So how to divide number that is stored in two variables.

If you don't have a 64-bit type, you can do it like the following:
uint32_t higher, lower; // your input
lower /= 1000;
lower += (higher % 1000) * 4294967L; // approximate 2^32 / 1000
higher /= 1000;
If the result fitted in lower itself, higher should be 0.
Just note that as #Mikhail pointed out, this solution is approximate, and has an error of 0.296 * higher + 2 ms (unless I'm missing something).
If you really want a better precision and don't care about efficiency, you can use a bit of floating-point arithmetic in the middle, and round the results correctly. I doubt if it's worth the effort:
uint32_t higher, lower; // your input
// simpler without a helper variable
if (lower % 1000 >= 500)
{
lower /= 1000;
++lower;
}
else
lower /= 1000;
lower += round((higher % 1000) * 4294967.296); // 2^32 / 1000
higher /= 1000;
You'll need to include <cmath> for round().
As a note, #Mikhail's solution in this case is probably better and may be faster. Though it's too complex for me.
If you have a 64-bit type, you can convert the split value to it:
uint64_t whole_number = higher;
whole_number <<= 32;
whole_number |= lower;
And then you can use whole_number as usual.
Note that if you only need a difference, it will be faster to subtract the values before actually dividing.
Assuming that you know which value is bigger:
uint32_t higher1, lower1; // smaller value
uint32_t higher2, lower2; // bigger value
uint32_t del_high = higher2 - higher1;
uint32_t del_low = lower2 - lower1;
if (lower2 < lower1)
--del_high;
And now you can convert the result like explained before. Or with a bit luck, del_high will be 0 (if the difference is smaller than 2^32 μs), and you will have the result in del_low (in μs).

The simplest way is to use 64-bit integer type, but I assume you cannot do this. Since you want your answer in 32-bit integer, the high-order value of microseconds cannot be greater than 999, or it would not fit in 32-bit after division by 1000. So the bigger number of microseconds you're operating with is 999 * 2^32 + (2^32 - 1) = 4294967295999. It gives you 13 decimal digits and you can use double to handle precise division.
If you are forced for some reason to use only 32-bit integers, the answer of Michał Górny gives you an approximate solution. E.g. for whole_number = 1234567890123 it will give a result of 1234567805. Because dividing of max 32-bit int on 1000 have a reminder.
The only way to have an exact answer with 32-bit integer is by using long arithmetic. It requires long digits to be stored in a type which can be extended to store a reminder. You have to split your two 32-bit integers in four 16-bit digits. After that you can divide it as on paper and you have enough bits to store a reminder. See the code of micro2milli:
#include <iostream>
typedef unsigned __int32 uint32;
typedef unsigned __int64 uint64;
const uint32 MAX_INT = 0xFFFFFFFF;
uint32 micro2milli(uint32 hi, uint32 lo)
{
if (hi >= 1000)
{
throw std::runtime_error("Cannot store milliseconds in uint32!");
}
uint32 r = (lo >> 16) + (hi << 16);
uint32 ans = r / 1000;
r = ((r % 1000) << 16) + (lo & 0xFFFF);
ans = (ans << 16) + r / 1000;
return ans;
}
uint32 micro2milli_simple(uint32 hi, uint32 lo)
{
lo /= 1000;
return lo + (hi % 1000) * 4294967L;
}
void main()
{
uint64 micro = 1234567890123;
uint32 micro_high = micro >> 32;
uint32 micro_low = micro & MAX_INT;
// 1234567805
std::cout << micro2milli_simple(micro_high, micro_low) << std::endl;
// 1234567890
std::cout << micro2milli(micro_high, micro_low) << std::endl;
}

First, put your two variables into 3 with 22 significant bits each.
uint32_t x0 = l & 0x3FFFFF;
uint32_t x1 = ((l >> 22) | (h << 10)) & 0x3FFFFF;
uint32_t x2 = h >> 12;
Now do the division (there is 10 available bits per x?, and 1000 < 2^10 = 1024 so there is no overflow possible)
uint32_t t2 = x2 / 1000;
x1 |= (x2 % 1000) << 22;
uint32_t t1 = x1 / 1000;
x0 |= (x1 % 1000) << 22;
uint32_t t0 = (x0 + 500) / 1000;
/* +0 for round down, +500 for round to nearest, +999 for round up */
Now put back things together.
uint32_t r0 = t0 + t1 << 22;
uint32_t r1 = (t1 >> 10) + (t2 << 12) + (r0 < t0);
Using the same technique but with four variables holding 16 bits, you can do it for divisor up to 65535. Then it becomes harder to do it with 32 bits arithmetic.

Assuming you can not use a 64 bit int for this, I would suggest using a multiple precision library, like GMP.

Related

Multiplying two 64-bit numbers using 64-bit registers, by splitting in to 32-bit. Getting unexpected behaviour

I would like to multiply 371.1608054189 (using mantissa 3711608054189) by 0.002699633523001229 (using mantissa 2699633523001229). The high precision answer is 1.0019981527329986544145598281.
However, I can only use 64-bit registers.
Earlier I created a Stackoverflow question and a solution was kindly provided, which worked with smaller mantissas:
Stackoverflow answer to multiply two uint64_ts and store result to uint64_t doesnt seem to work?
I have pasted the solution below.
However, when I insert the aforementioned mantissas I get result 180588304518940489, not close to the expected answer. Then, interestingly if I truncate my two mantissas to 3711608054 and 2699633523 I get 10019981526815194242, close to the expected result.
Unfortunately I don't understand why this is?
Is it possible to multiply these two 64-bit mantissas using 64-bit registers and get a mantissa representative of 1.00199815273299865 (18 digits)?
https://godbolt.org/z/9v47c51zW
#include <iostream>
int main()
{
uint64_t a = 3711608054189; // Input 1 representing 371.1608054189
uint64_t b = 2699633523001229; // Input 2 representing 0.002699633523001229
uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;
uint64_t a_x_b_hi = a_hi * b_hi;
uint64_t a_x_b_mid = a_hi * b_lo;
uint64_t b_x_a_mid = b_hi * a_lo;
uint64_t a_x_b_lo = a_lo * b_lo;
/*
This is implementing schoolbook multiplication:
x1 x0
X y1 y0
-------------
00 LOW PART
-------------
00
10 10 MIDDLE PART
+ 01
-------------
01
+ 11 11 HIGH PART
-------------
*/
// 64-bit product + two 32-bit values
uint64_t middle = a_x_b_mid + (a_x_b_lo >> 32) + uint32_t(b_x_a_mid);
// 64-bit product + two 32-bit values
uint64_t hi = a_x_b_hi + (middle >> 32) + (b_x_a_mid >> 32);
// Add LOW PART and lower half of MIDDLE PART
uint64_t result = (middle << 32) | uint32_t(a_x_b_lo);
std::cout << result << std::endl;
}

Multiply two uint64_ts and store result to uint64_t doesnt seem to work?

I am trying to multiply two uint64_ts and store the result to uint64_t. I found an existing answer on Stackoverflow which splits the inputs in to their four uint32_ts and joins the result later:
https://stackoverflow.com/a/28904636/1107474
I have created a full example using the code and pasted it below.
However, for 37 x 5 I am getting the result 0 instead of 185?
#include <iostream>
int main()
{
uint64_t a = 37; // Input 1
uint64_t b = 5; // Input 2
uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;
uint64_t a_x_b_hi = a_hi * b_hi;
uint64_t a_x_b_mid = a_hi * b_lo;
uint64_t b_x_a_mid = b_hi * a_lo;
uint64_t a_x_b_lo = a_lo * b_lo;
uint64_t carry_bit = ((uint64_t)(uint32_t)a_x_b_mid +
(uint64_t)(uint32_t)b_x_a_mid +
(a_x_b_lo >> 32) ) >> 32;
uint64_t multhi = a_x_b_hi +
(a_x_b_mid >> 32) + (b_x_a_mid >> 32) +
carry_bit;
std::cout << multhi << std::endl; // Outputs 0 instead of 185?
}
I'm merging your code with another answer in the original link.
#include <iostream>
int main()
{
uint64_t a = 37; // Input 1
uint64_t b = 5; // Input 2
uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;
uint64_t a_x_b_hi = a_hi * b_hi;
uint64_t a_x_b_mid = a_hi * b_lo;
uint64_t b_x_a_mid = b_hi * a_lo;
uint64_t a_x_b_lo = a_lo * b_lo;
/*
This is implementing schoolbook multiplication:
x1 x0
X y1 y0
-------------
00 LOW PART
-------------
00
10 10 MIDDLE PART
+ 01
-------------
01
+ 11 11 HIGH PART
-------------
*/
// 64-bit product + two 32-bit values
uint64_t middle = a_x_b_mid + (a_x_b_lo >> 32) + uint32_t(b_x_a_mid);
// 64-bit product + two 32-bit values
uint64_t carry = a_x_b_hi + (middle >> 32) + (b_x_a_mid >> 32);
// Add LOW PART and lower half of MIDDLE PART
uint64_t result = (middle << 32) | uint32_t(a_x_b_lo);
std::cout << result << std::endl;
std::cout << carry << std::endl;
}
This results in
Program stdout
185
0
Godbolt link: https://godbolt.org/z/97xhMvY53
Or you could use __uint128_t which is non-standard but widely available.
static inline void mul64(uint64_t a, uint64_t b, uint64_t& result, uint64_t& carry) {
__uint128_t va(a);
__uint128_t vb(b);
__uint128_t vr = va * vb;
result = uint64_t(vr);
carry = uint64_t(vr >> 64);
}
In the title of this question, you said you wanted to multiply two integers. But the code you found on that other Q&A (Getting the high part of 64 bit integer multiplication) isn't trying to do that, it's only trying to get the high half of the full product. For a 64x64 => 128-bit product, the high half is product >> 64.
37 x 5 = 185
185 >> 64 = 0
It's correctly emulating multihi = (37 * (unsigned __int128)5) >> 64, and you're forgetting about the >>64 part.
__int128 is a GNU C extension; it's much more efficient than emulating it manually with pure ISO C, but only supported on 64-bit targets by current compilers. See my answer on the same question. (ISO C23 is expected to have _BitInt(128) or whatever width you specify.)
In comments you were talking about floating-point mantissas. In an FP multiply, you have two n-bit mantissas (usually with their leading bits set), so the high half of the 2n-bit product will have n significant bits (more or less; maybe actually one place to the right IIRC).
Something like 37 x 5 would only happen with tiny subnormal floats, where the product would indeed underflow to zero. But in that case, it would be because you only get subnormals at the limits of the exponent range, and (37 * 2^-1022) * (5 * 2^-1022) would be 186 * 2^-2044, an exponent way too small to be represented in an FP format like IEEE binary64 aka double where -1022 was the minimum exponent.
You're using integers where a>>63 isn't 1, in fact they're both less than 2^32 so there are no significant bits outside the low 64 bits of the full 128-bit product.

Compute product of two integers as lower and higher half [duplicate]

I am looking for an efficient (optionally standard, elegant and easy to implement) solution to multiply relatively large numbers, and store the result into one or several integers :
Let say I have two 64 bits integers declared like this :
uint64_t a = xxx, b = yyy;
When I do a * b, how can I detect if the operation results in an overflow and in this case store the carry somewhere?
Please note that I don't want to use any large-number library since I have constraints on the way I store the numbers.
1. Detecting the overflow:
x = a * b;
if (a != 0 && x / a != b) {
// overflow handling
}
Edit: Fixed division by 0 (thanks Mark!)
2. Computing the carry is quite involved. One approach is to split both operands into half-words, then apply long multiplication to the half-words:
uint64_t hi(uint64_t x) {
return x >> 32;
}
uint64_t lo(uint64_t x) {
return ((1ULL << 32) - 1) & x;
}
void multiply(uint64_t a, uint64_t b) {
// actually uint32_t would do, but the casting is annoying
uint64_t s0, s1, s2, s3;
uint64_t x = lo(a) * lo(b);
s0 = lo(x);
x = hi(a) * lo(b) + hi(x);
s1 = lo(x);
s2 = hi(x);
x = s1 + lo(a) * hi(b);
s1 = lo(x);
x = s2 + hi(a) * hi(b) + hi(x);
s2 = lo(x);
s3 = hi(x);
uint64_t result = s1 << 32 | s0;
uint64_t carry = s3 << 32 | s2;
}
To see that none of the partial sums themselves can overflow, we consider the worst case:
x = s2 + hi(a) * hi(b) + hi(x)
Let B = 1 << 32. We then have
x <= (B - 1) + (B - 1)(B - 1) + (B - 1)
<= B*B - 1
< B*B
I believe this will work - at least it handles Sjlver's test case. Aside from that, it is untested (and might not even compile, as I don't have a C++ compiler at hand anymore).
The idea is to use following fact which is true for integral operation:
a*b > c if and only if a > c/b
/ is integral division here.
The pseudocode to check against overflow for positive numbers follows:
if (a > max_int64 / b) then "overflow" else "ok".
To handle zeroes and negative numbers you should add more checks.
C code for non-negative a and b follows:
if (b > 0 && a > 18446744073709551615 / b) {
// overflow handling
}; else {
c = a * b;
}
Note, max value for 64 type:
18446744073709551615 == (1<<64)-1
To calculate carry we can use approach to split number into two 32-digits and multiply them as we do this on the paper. We need to split numbers to avoid overflow.
Code follows:
// split input numbers into 32-bit digits
uint64_t a0 = a & ((1LL<<32)-1);
uint64_t a1 = a >> 32;
uint64_t b0 = b & ((1LL<<32)-1);
uint64_t b1 = b >> 32;
// The following 3 lines of code is to calculate the carry of d1
// (d1 - 32-bit second digit of result, and it can be calculated as d1=d11+d12),
// but to avoid overflow.
// Actually rewriting the following 2 lines:
// uint64_t d1 = (a0 * b0 >> 32) + a1 * b0 + a0 * b1;
// uint64_t c1 = d1 >> 32;
uint64_t d11 = a1 * b0 + (a0 * b0 >> 32);
uint64_t d12 = a0 * b1;
uint64_t c1 = (d11 > 18446744073709551615 - d12) ? 1 : 0;
uint64_t d2 = a1 * b1 + c1;
uint64_t carry = d2; // needed carry stored here
Although there have been several other answers to this question, I several of them have code that is completely untested, and thus far no one has adequately compared the different possible options.
For that reason, I wrote and tested several possible implementations (the last one is based on this code from OpenBSD, discussed on Reddit here). Here's the code:
/* Multiply with overflow checking, emulating clang's builtin function
*
* __builtin_umull_overflow
*
* This code benchmarks five possible schemes for doing so.
*/
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>
#ifndef BOOL
#define BOOL int
#endif
// Option 1, check for overflow a wider type
// - Often fastest and the least code, especially on modern compilers
// - When long is a 64-bit int, requires compiler support for 128-bits
// ints (requires GCC >= 3.0 or Clang)
#if LONG_BIT > 32
typedef __uint128_t long_overflow_t ;
#else
typedef uint64_t long_overflow_t;
#endif
BOOL
umull_overflow1(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
long_overflow_t prod = (long_overflow_t)lhs * (long_overflow_t)rhs;
*result = (unsigned long) prod;
return (prod >> LONG_BIT) != 0;
}
// Option 2, perform long multiplication using a smaller type
// - Sometimes the fastest (e.g., when mulitply on longs is a library
// call).
// - Performs at most three multiplies, and sometimes only performs one.
// - Highly portable code; works no matter how many bits unsigned long is
BOOL
umull_overflow2(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
const unsigned long HALFSIZE_MAX = (1ul << LONG_BIT/2) - 1ul;
unsigned long lhs_high = lhs >> LONG_BIT/2;
unsigned long lhs_low = lhs & HALFSIZE_MAX;
unsigned long rhs_high = rhs >> LONG_BIT/2;
unsigned long rhs_low = rhs & HALFSIZE_MAX;
unsigned long bot_bits = lhs_low * rhs_low;
if (!(lhs_high || rhs_high)) {
*result = bot_bits;
return 0;
}
BOOL overflowed = lhs_high && rhs_high;
unsigned long mid_bits1 = lhs_low * rhs_high;
unsigned long mid_bits2 = lhs_high * rhs_low;
*result = bot_bits + ((mid_bits1+mid_bits2) << LONG_BIT/2);
return overflowed || *result < bot_bits
|| (mid_bits1 >> LONG_BIT/2) != 0
|| (mid_bits2 >> LONG_BIT/2) != 0;
}
// Option 3, perform long multiplication using a smaller type (this code is
// very similar to option 2, but calculates overflow using a different but
// equivalent method).
// - Sometimes the fastest (e.g., when mulitply on longs is a library
// call; clang likes this code).
// - Performs at most three multiplies, and sometimes only performs one.
// - Highly portable code; works no matter how many bits unsigned long is
BOOL
umull_overflow3(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
const unsigned long HALFSIZE_MAX = (1ul << LONG_BIT/2) - 1ul;
unsigned long lhs_high = lhs >> LONG_BIT/2;
unsigned long lhs_low = lhs & HALFSIZE_MAX;
unsigned long rhs_high = rhs >> LONG_BIT/2;
unsigned long rhs_low = rhs & HALFSIZE_MAX;
unsigned long lowbits = lhs_low * rhs_low;
if (!(lhs_high || rhs_high)) {
*result = lowbits;
return 0;
}
BOOL overflowed = lhs_high && rhs_high;
unsigned long midbits1 = lhs_low * rhs_high;
unsigned long midbits2 = lhs_high * rhs_low;
unsigned long midbits = midbits1 + midbits2;
overflowed = overflowed || midbits < midbits1 || midbits > HALFSIZE_MAX;
unsigned long product = lowbits + (midbits << LONG_BIT/2);
overflowed = overflowed || product < lowbits;
*result = product;
return overflowed;
}
// Option 4, checks for overflow using division
// - Checks for overflow using division
// - Division is slow, especially if it is a library call
BOOL
umull_overflow4(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
*result = lhs * rhs;
return rhs > 0 && (SIZE_MAX / rhs) < lhs;
}
// Option 5, checks for overflow using division
// - Checks for overflow using division
// - Avoids division when the numbers are "small enough" to trivially
// rule out overflow
// - Division is slow, especially if it is a library call
BOOL
umull_overflow5(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
const unsigned long MUL_NO_OVERFLOW = (1ul << LONG_BIT/2) - 1ul;
*result = lhs * rhs;
return (lhs >= MUL_NO_OVERFLOW || rhs >= MUL_NO_OVERFLOW) &&
rhs > 0 && SIZE_MAX / rhs < lhs;
}
#ifndef umull_overflow
#define umull_overflow2
#endif
/*
* This benchmark code performs a multiply at all bit sizes,
* essentially assuming that sizes are logarithmically distributed.
*/
int main()
{
unsigned long i, j, k;
int count = 0;
unsigned long mult;
unsigned long total = 0;
for (k = 0; k < 0x40000000 / LONG_BIT / LONG_BIT; ++k)
for (i = 0; i != LONG_MAX; i = i*2+1)
for (j = 0; j != LONG_MAX; j = j*2+1) {
count += umull_overflow(i+k, j+k, &mult);
total += mult;
}
printf("%d overflows (total %lu)\n", count, total);
}
Here are the results, testing with various compilers and systems I have (in this case, all testing was done on OS X, but results should be similar on BSD or Linux systems):
+------------------+----------+----------+----------+----------+----------+
| | Option 1 | Option 2 | Option 3 | Option 4 | Option 5 |
| | BigInt | LngMult1 | LngMult2 | Div | OptDiv |
+------------------+----------+----------+----------+----------+----------+
| Clang 3.5 i386 | 1.610 | 3.217 | 3.129 | 4.405 | 4.398 |
| GCC 4.9.0 i386 | 1.488 | 3.469 | 5.853 | 4.704 | 4.712 |
| GCC 4.2.1 i386 | 2.842 | 4.022 | 3.629 | 4.160 | 4.696 |
| GCC 4.2.1 PPC32 | 8.227 | 7.756 | 7.242 | 20.632 | 20.481 |
| GCC 3.3 PPC32 | 5.684 | 9.804 | 11.525 | 21.734 | 22.517 |
+------------------+----------+----------+----------+----------+----------+
| Clang 3.5 x86_64 | 1.584 | 2.472 | 2.449 | 9.246 | 7.280 |
| GCC 4.9 x86_64 | 1.414 | 2.623 | 4.327 | 9.047 | 7.538 |
| GCC 4.2.1 x86_64 | 2.143 | 2.618 | 2.750 | 9.510 | 7.389 |
| GCC 4.2.1 PPC64 | 13.178 | 8.994 | 8.567 | 37.504 | 29.851 |
+------------------+----------+----------+----------+----------+----------+
Based on these results, we can draw a few conclusions:
Clearly, the division-based approach, although simple and portable, is slow.
No technique is a clear winner in all cases.
On modern compilers, the use-a-larger-int approach is best, if you can use it
On older compilers, the long-multiplication approach is best
Surprisingly, GCC 4.9.0 has performance regressions over GCC 4.2.1, and GCC 4.2.1 has performance regressions over GCC 3.3
A version that also works when a == 0:
x = a * b;
if (a != 0 && x / a != b) {
// overflow handling
}
Easy and fast with clang and gcc:
unsigned long long t a, b, result;
if (__builtin_umulll_overflow(a, b, &result)) {
// overflow!!
}
This will use hardware support for overflow detection where available. By being compiler extensions it can even handle signed integer overflow (replace umul with smul), eventhough that is undefined behavior in C++.
If you need not just to detect overflow but also to capture the carry, you're best off breaking your numbers down into 32-bit parts. The code is a nightmare; what follows is just a sketch:
#include <stdint.h>
uint64_t mul(uint64_t a, uint64_t b) {
uint32_t ah = a >> 32;
uint32_t al = a; // truncates: now a = al + 2**32 * ah
uint32_t bh = b >> 32;
uint32_t bl = b; // truncates: now b = bl + 2**32 * bh
// a * b = 2**64 * ah * bh + 2**32 * (ah * bl + bh * al) + al * bl
uint64_t partial = (uint64_t) al * (uint64_t) bl;
uint64_t mid1 = (uint64_t) ah * (uint64_t) bl;
uint64_t mid2 = (uint64_t) al * (uint64_t) bh;
uint64_t carry = (uint64_t) ah * (uint64_t) bh;
// add high parts of mid1 and mid2 to carry
// add low parts of mid1 and mid2 to partial, carrying
// any carry bits into carry...
}
The problem is not just the partial products but the fact that any of the sums can overflow.
If I had to do this for real, I would write an extended-multiply routine in the local assembly language. That is, for example, multiply two 64-bit integers to get a 128-bit result, which is stored in two 64-bit registers. All reasonable hardware provides this functionality in a single native multiply instruction—it's not just accessible from C.
This is one of those rare cases where the solution that's most elegant and easy to program is actually to use assembly language. But it's certainly not portable :-(
The GNU Portability Library (Gnulib) contains a module intprops, which has macros that efficiently test whether arithmetic operations would overflow.
For example, if an overflow in multiplication would occur, INT_MULTIPLY_OVERFLOW (a, b) would yield 1.
Perhaps the best way to solve this problem is to have a function, which multiplies two UInt64 and results a pair of UInt64, an upper part and a lower part of the UInt128 result. Here is the solution, including a function, which displays the result in hex. I guess you perhaps prefer a C++ solution, but I have a working Swift-Solution which shows, how to manage the problem:
func hex128 (_ hi: UInt64, _ lo: UInt64) -> String
{
var s: String = String(format: "%08X", hi >> 32)
+ String(format: "%08X", hi & 0xFFFFFFFF)
+ String(format: "%08X", lo >> 32)
+ String(format: "%08X", lo & 0xFFFFFFFF)
return (s)
}
func mul64to128 (_ multiplier: UInt64, _ multiplicand : UInt64)
-> (result_hi: UInt64, result_lo: UInt64)
{
let x: UInt64 = multiplier
let x_lo: UInt64 = (x & 0xffffffff)
let x_hi: UInt64 = x >> 32
let y: UInt64 = multiplicand
let y_lo: UInt64 = (y & 0xffffffff)
let y_hi: UInt64 = y >> 32
let mul_lo: UInt64 = (x_lo * y_lo)
let mul_hi: UInt64 = (x_hi * y_lo) + (mul_lo >> 32)
let mul_carry: UInt64 = (x_lo * y_hi) + (mul_hi & 0xffffffff)
let result_hi: UInt64 = (x_hi * y_hi) + (mul_hi >> 32) + (mul_carry >> 32)
let result_lo: UInt64 = (mul_carry << 32) + (mul_lo & 0xffffffff)
return (result_hi, result_lo)
}
Here is an example to verify, that the function works:
var c: UInt64 = 0
var d: UInt64 = 0
(c, d) = mul64to128(0x1234567890123456, 0x9876543210987654)
// 0AD77D742CE3C72E45FD10D81D28D038 is the result of the above example
print(hex128(c, d))
(c, d) = mul64to128(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF)
// FFFFFFFFFFFFFFFE0000000000000001 is the result of the above example
print(hex128(c, d))
There is a simple (and often very fast solution) which has not been mentioned yet. The solution is based on the fact that n-Bit times m-Bit multiplication does never overflow for a product width of n+m-bit or higher but overflows for all result widths smaller than n+m-1.
Because my old description might have been too difficult to read for some people, I try it again:
What you need is checking the sum of leading-zeroes of both operands. It would be very easy to prove mathematically.
Let x be n-Bit and y be m-Bit. z = x * y is k-Bit. Because the product can be n+m bit large at most it can overflow. Let's say. x*y is p-Bit long (without leading zeroes). The leading zeroes of the product are clz(x * y) = n+m - p. clz behaves similar to log, hence:
clz(x * y) = clz(x) + clz(y) + c with c = either 1 or 0.
(thank you for the c = 1 advice in the comment!)
It overflows when k < p <= n+m <=> n+m - k > n+m - p = clz(x * y).
Now we can use this algorithm:
if max(clz(x * y)) = clz(x) + clz(y) +1 < (n+m - k) --> overflow
if max(clz(x * y)) = clz(x) + clz(y) +1 == (n+m - k) --> overflow if c = 0
else --> no overflow
How to check for overflow in the middle case? I assume, you have a multiplication instruction. Then we easily can use it to see the leading zeroes of the result, i.e.:
if clz(x * y / 2) == (n+m - k) <=> msb(x * y/2) == 1 --> overflow
else --> no overflow
You do the multiplication by treating x/2 as fixed point and y as normal integer:
msb(x * y/2) = msb(floor(x * y / 2))
floor(x * y/2) = floor(x/2) * y + (lsb(x) * floor(y/2)) = (x >> 1)*y + (x & 1)*(y >> 1)
(this result never overflows in case of clz(x)+clz(y)+1 == (n+m -k))
The trick is using builtins/intrinsics. In GCC it looks this way:
static inline int clz(int a) {
if (a == 0) return 32; //only needed for x86 architecture
return __builtin_clz(a);
}
/**#fn static inline _Bool chk_mul_ov(uint32_t f1, uint32_t f2)
* #return one, if a 32-Bit-overflow occurs when unsigned-unsigned-multipliying f1 with f2 otherwise zero. */
static inline _Bool chk_mul_ov(uint32_t f1, uint32_t f2) {
int lzsum = clz(f1) + clz(f2); //leading zero sum
return
lzsum < sizeof(f1)*8-1 || ( //if too small, overflow guaranteed
lzsum == sizeof(f1)*8-1 && //if special case, do further check
(int32_t)((f1 >> 1)*f2 + (f1 & 1)*(f2 >> 1)) < 0 //check product rightshifted by one
);
}
...
if (chk_mul_ov(f1, f2)) {
//error handling
}
...
Just an example for n = m = k = 32-Bit unsigned-unsigned-multiplication. You can generalize it to signed-unsigned- or signed-signed-multiplication. And even no multiple-bit-shift is required (because some microcontrollers implement one-bit-shifts only but sometimes support product divided by two with a single instruction like Atmega!). However, if no count-leading-zeroes instruction exists but long multiplication, this might not be better.
Other compilers have their own way of specifying intrinsics for CLZ operations.
Compared to checking upper half of the multiplication the clz-method should scale better (in worst case) than using a highly optimized 128-Bit multiplication to check for 64-Bit overflow. Multiplication needs over linear overhead while count bits needs only linear overhead.
This code worked out-of-the box for me when tried.
I've been working with this problem this days and I have to say that it has impressed me the number of times I have seen people saying the best way to know if there has been an overflow is to divide the result, thats totally inefficient and unnecessary. The point for this function is that it must be as fast as possible.
There are two options for the overflow detection:
1º- If possible create the result variable twice as big as the multipliers, for example:
struct INT32struct {INT16 high, low;};
typedef union
{
struct INT32struct s;
INT32 ll;
} INT32union;
INT16 mulFunction(INT16 a, INT16 b)
{
INT32union result.ll = a * b; //32Bits result
if(result.s.high > 0)
Overflow();
return (result.s.low)
}
You will know inmediately if there has been an overflow, and the code is the fastest possible without writing it in machine code. Depending on the compiler this code can be improved in machine code.
2º- Is impossible to create a result variable twice as big as the multipliers variable:
Then you should play with if conditions to determine the best path. Continuing with the example:
INT32 mulFunction(INT32 a, INT32 b)
{
INT32union s_a.ll = abs(a);
INT32union s_b.ll = abs(b); //32Bits result
INT32union result;
if(s_a.s.hi > 0 && s_b.s.hi > 0)
{
Overflow();
}
else if (s_a.s.hi > 0)
{
INT32union res1.ll = s_a.s.hi * s_b.s.lo;
INT32union res2.ll = s_a.s.lo * s_b.s.lo;
if (res1.hi == 0)
{
result.s.lo = res1.s.lo + res2.s.hi;
if (result.s.hi == 0)
{
result.s.ll = result.s.lo << 16 + res2.s.lo;
if ((a.s.hi >> 15) ^ (b.s.hi >> 15) == 1)
{
result.s.ll = -result.s.ll;
}
return result.s.ll
}else
{
Overflow();
}
}else
{
Overflow();
}
}else if (s_b.s.hi > 0)
{
//Same code changing a with b
}else
{
return (s_a.lo * s_b.lo);
}
}
I hope this code helps you to have a quite efficient program and I hope the code is clear, if not I'll put some coments.
best regards.
Here is a trick for detecting whether multiplication of two unsigned integers overflows.
We make the observation that if we multiply an N-bit-wide binary number with an M-bit-wide binary number, the product does not have more than N + M bits.
For instance, if we are asked to multiply a three-bit number with a twenty-nine bit number, we know that this doesn't overflow thirty-two bits.
#include <stdlib.h>
#include <stdio.h>
int might_be_mul_oflow(unsigned long a, unsigned long b)
{
if (!a || !b)
return 0;
a = a | (a >> 1) | (a >> 2) | (a >> 4) | (a >> 8) | (a >> 16) | (a >> 32);
b = b | (b >> 1) | (b >> 2) | (b >> 4) | (b >> 8) | (b >> 16) | (b >> 32);
for (;;) {
unsigned long na = a << 1;
if (na <= a)
break;
a = na;
}
return (a & b) ? 1 : 0;
}
int main(int argc, char **argv)
{
unsigned long a, b;
char *endptr;
if (argc < 3) {
printf("supply two unsigned long integers in C form\n");
return EXIT_FAILURE;
}
a = strtoul(argv[1], &endptr, 0);
if (*endptr != 0) {
printf("%s is garbage\n", argv[1]);
return EXIT_FAILURE;
}
b = strtoul(argv[2], &endptr, 0);
if (*endptr != 0) {
printf("%s is garbage\n", argv[2]);
return EXIT_FAILURE;
}
if (might_be_mul_oflow(a, b))
printf("might be multiplication overflow\n");
{
unsigned long c = a * b;
printf("%lu * %lu = %lu\n", a, b, c);
if (a != 0 && c / a != b)
printf("confirmed multiplication overflow\n");
}
return 0;
}
A smattering of tests: (on 64 bit system):
$ ./uflow 0x3 0x3FFFFFFFFFFFFFFF
3 * 4611686018427387903 = 13835058055282163709
$ ./uflow 0x7 0x3FFFFFFFFFFFFFFF
might be multiplication overflow
7 * 4611686018427387903 = 13835058055282163705
confirmed multiplication overflow
$ ./uflow 0x4 0x3FFFFFFFFFFFFFFF
might be multiplication overflow
4 * 4611686018427387903 = 18446744073709551612
$ ./uflow 0x5 0x3FFFFFFFFFFFFFFF
might be multiplication overflow
5 * 4611686018427387903 = 4611686018427387899
confirmed multiplication overflow
The steps in might_be_mul_oflow are almost certainly slower than just doing the division test, at least on mainstream processors used in desktop workstations, servers and mobile devices. On chips without good division support, it could be useful.
It occurs to me that there is another way to do this early rejection test.
We start with a pair of numbers arng and brng which are initialized to 0x7FFF...FFFF and 1.
If a <= arng and b <= brng we can conclude that there is no overflow.
Otherwise, we shift arng to the right, and shift brng to the left, adding one bit to brng, so that they are 0x3FFF...FFFF and 3.
If arng is zero, finish; otherwise repeat at 2.
The function now looks like:
int might_be_mul_oflow(unsigned long a, unsigned long b)
{
if (!a || !b)
return 0;
{
unsigned long arng = ULONG_MAX >> 1;
unsigned long brng = 1;
while (arng != 0) {
if (a <= arng && b <= brng)
return 0;
arng >>= 1;
brng <<= 1;
brng |= 1;
}
return 1;
}
}
When your using e.g. 64 bits variables, implement 'number of significant bits' with nsb(var) = { 64 - clz(var); }.
clz(var) = count leading zeros in var, a builtin command for GCC and Clang, or probably available with inline assembly for your CPU.
Now use the fact that nsb(a * b) <= nsb(a) + nsb(b) to check for overflow. When smaller, it is always 1 smaller.
Ref GCC: Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.
I was thinking about this today and stumbled upon this question, my thoughts led me to this result. TLDR, while I find it "elegant" in that it only uses a few lines of code (could easily be a one liner), and has some mild math that simplifies to something relatively simple conceptually, this is mostly "interesting" and I haven't tested it.
If you think of an unsigned integer as being a single digit with radix 2^n where n is the number of bits in the integer, then you can map those numbers to radians around the unit circle, e.g.
radians(x) = x * (2 * pi * rad / 2^n)
When the integer overflows, it is equivalent to wrapping around the circle. So calculating the carry is equivalent to calculating the number of times multiplication would wrap around the circle. To calculate the number of times we wrap around the circle we divide radians(x) by 2pi radians. e.g.
wrap(x) = radians(x) / (2*pi*rad)
= (x * (2*pi*rad / 2^n)) / (2*pi*rad / 1)
= (x * (2*pi*rad / 2^n)) * (1 / 2*pi*rad)
= x * 1 / 2^n
= x / 2^n
Which simplifies to
wrap(x) = x / 2^n
This makes sense. The number of times a number, for example, 15 with radix 10, wraps around is 15 / 10 = 1.5, or one and a half times. However, we can't use 2 digits here (assuming we are limited to a single 2^64 digit).
Say we have a * b, with radix R, we can calculate the carry with
Consider that: wrap(a * b) = a * wrap(b)
wrap(a * b) = (a * b) / R
a * wrap(b) = a * (b / R)
a * (b / R) = (a * b) / R
carry = floor(a * wrap(b))
Take for example a = 9 and b = 5, which are factors of 45 (i.e. 9 * 5 = 45).
wrap(5) = 5 / 10 = 0.5
a * wrap(5) = 9 * 0.5 = 4.5
carry = floor(9 * wrap(5)) = floor(4.5) = 4
Note that if the carry was 0, then we would not have had overflow, for example if a = 2, b=2.
In C/C++ (if the compiler and architecture supports it) we have to use long double.
Thus we have:
long double wrap = b / 18446744073709551616.0L; // this is b / 2^64
unsigned long carry = (unsigned long)(a * wrap); // floor(a * wrap(b))
bool overflow = carry > 0;
unsigned long c = a * b;
c here is the lower significant "digit", i.e. in base 10 9 * 9 = 81, carry = 8, and c = 1.
This was interesting to me in theory, so I thought I'd share it, but one major caveat is with the floating point precision in computers. Using long double, there may be rounding errors for some numbers when we calculate the wrap variable depending on how many significant digits your compiler/arch uses for long doubles, I believe it should be 20 more more to be sure. Another issue with this result, is that it may not perform as well as some of the other solutions simply by using floating points and division.
If you just want to detect overflow, how about converting to double, doing the multiplication and if
|x| < 2^53, convert to int64
|x| < 2^63, make the multiplication using int64
otherwise produce whatever error you want?
This seems to work:
int64_t safemult(int64_t a, int64_t b) {
double dx;
dx = (double)a * (double)b;
if ( fabs(dx) < (double)9007199254740992 )
return (int64_t)dx;
if ( (double)INT64_MAX < fabs(dx) )
return INT64_MAX;
return a*b;
}

Generating random floating-point values based on random bit stream

Given a random source (a generator of random bit stream), how do I generate a uniformly distributed random floating-point value in a given range?
Assume that my random source looks something like:
unsigned int GetRandomBits(char* pBuf, int nLen);
And I want to implement
double GetRandomVal(double fMin, double fMax);
Notes:
I don't want the result precision to be limited (for example only 5 digits).
Strict uniform distribution is a must
I'm not asking for a reference to an existing library. I want to know how to implement it from scratch.
For pseudo-code / code, C++ would be most appreciated
I don't think I'll ever be convinced that you actually need this, but it was fun to write.
#include <stdint.h>
#include <cmath>
#include <cstdio>
FILE* devurandom;
bool geometric(int x) {
// returns true with probability min(2^-x, 1)
if (x <= 0) return true;
while (1) {
uint8_t r;
fread(&r, sizeof r, 1, devurandom);
if (x < 8) {
return (r & ((1 << x) - 1)) == 0;
} else if (r != 0) {
return false;
}
x -= 8;
}
}
double uniform(double a, double b) {
// requires IEEE doubles and 0.0 < a < b < inf and a normal
// implicitly computes a uniform random real y in [a, b)
// and returns the greatest double x such that x <= y
union {
double f;
uint64_t u;
} convert;
convert.f = a;
uint64_t a_bits = convert.u;
convert.f = b;
uint64_t b_bits = convert.u;
uint64_t mask = b_bits - a_bits;
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask |= mask >> 32;
int b_exp;
frexp(b, &b_exp);
while (1) {
// sample uniform x_bits in [a_bits, b_bits)
uint64_t x_bits;
fread(&x_bits, sizeof x_bits, 1, devurandom);
x_bits &= mask;
x_bits += a_bits;
if (x_bits >= b_bits) continue;
double x;
convert.u = x_bits;
x = convert.f;
// accept x with probability proportional to 2^x_exp
int x_exp;
frexp(x, &x_exp);
if (geometric(b_exp - x_exp)) return x;
}
}
int main() {
devurandom = fopen("/dev/urandom", "r");
for (int i = 0; i < 100000; ++i) {
printf("%.17g\n", uniform(1.0 - 1e-15, 1.0 + 1e-15));
}
}
Here is one way of doing it.
The IEEE Std 754 double format is as follows:
[s][ e ][ f ]
where s is the sign bit (1 bit), e is the biased exponent (11 bits) and f is the fraction (52 bits).
Beware that the layout in memory will be different on little-endian machines.
For 0 < e < 2047, the number represented is
(-1)**(s) * 2**(e – 1023) * (1.f)
By setting s to 0, e to 1023 and f to 52 random bits from your bit stream, you get a random double in the interval [1.0, 2.0). This interval is unique in that it contains 2 ** 52 doubles, and these doubles are equidistant. If you then subtract 1.0 from the constructed double, you get a random double in the interval [0.0, 1.0). Moreover, the property about being equidistant is preserve.
From there you should be able to scale and translate as needed.
I'm surprised that for question this old, nobody had actual code for the best answer. User515430's answer got it right--you can take advantage of IEEE-754 double format to directly put 52 bits into a double with no math at all. But he didn't give code. So here it is, from my public domain ojrandlib:
double ojr_next_double(ojr_generator *g) {
uint64_t r = (OJR_NEXT64(g) & 0xFFFFFFFFFFFFFull) | 0x3FF0000000000000ull;
return *(double *)(&r) - 1.0;
}
NEXT64() gets a 64-bit random number. If you have a more efficient way of getting only 52 bits, use that instead.
This is easy, as long as you have an integer type with as many bits of precision as a double. For instance, an IEEE double-precision number has 53 bits of precision, so a 64-bit integer type is enough:
#include <limits.h>
double GetRandomVal(double fMin, double fMax) {
unsigned long long n ;
GetRandomBits ((char*)&n, sizeof(n)) ;
return fMin + (n * (fMax - fMin))/ULLONG_MAX ;
}
This is probably not the answer you want, but the specification here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3225.pdf
in sections [rand.util.canonical] and [rand.dist.uni.real], contains sufficient information to implement what you want, though with slightly different syntax. It isn't easy, but it is possible. I speak from personal experience. A year ago I knew nothing about random numbers, and I was able to do it. Though it took me a while... :-)
The question is ill-posed. What does uniform distribution over floats even mean?
Taking our cue from discrepancy, one way to operationalize your question is to define that you want the distribution that minimizes the following value:
Where x is the random variable you are sampling with your GetRandomVal(double fMin, double fMax) function, and means the probability that a random x is smaller or equal to t.
And now you can go on and try to evaluate eg a dabbler's answer. (Hint all the answers that fail to use the whole precision and stick to eg 52 bits will fail this minimization criterion.)
However, if you just want to be able to generate all float bit patterns that fall into your specified range with equal possibility, even if that means that eg asking for GetRandomVal(0,1000) will create more values between 0 and 1.5 than between 1.5 and 1000, that's easy: any interval of IEEE floating point numbers when interpreted as bit patterns map easily to a very small number of intervals of unsigned int64. See eg this question. Generating equally distributed random values of unsigned int64 in any given interval is easy.
I may be misunderstanding the question, but what stops you simply sampling the next n bits from the random bit stream and converting that to a base 10 number number ranged 0 to 2^n - 1.
To get a random value in [0..1[ you could do something like:
double value = 0;
for (int i=0;i<53;i++)
value = 0.5 * (value + random_bit()); // Insert 1 random bit
// or value = ldexp(value+random_bit(),-1);
// or group several bits into one single ldexp
return value;

Converting from unsigned long long to float with round to nearest even

I need to write a function that rounds from unsigned long long to float, and the rounding should be toward nearest even.
I cannot just do a C++ type-cast, since AFAIK the standard does not specify the rounding.
I was thinking of using boost::numeric, but i could not find any useful lead after reading the documentation. Can this be done using that library?
Of course, if there is an alternative, i would be glad to use it.
Any help would be much appreciated.
EDIT: Adding an example to make things a bit clearer.
Suppose i want to convert 0xffffff7fffffffff to its floating point representation. The C++ standard permits either one of:
0x5f7fffff ~ 1.9999999*2^63
0x5f800000 = 2^64
Now if you add the restriction of round to nearest even, only the first result is acceptable.
Since you have so many bits in the source that can't be represented in the float and you can't (apparently) rely on the language's conversion, you'll have to do it yourself.
I devised a scheme that may or may not help you. Basically, there are 31 bits to represent positive numbers in a float so I pick up the 31 most significant bits in the source number. Then I save off and mask away all the lower bits. Then based on the value of the lower bits I round the "new" LSB up or down and finally use static_cast to create a float.
I left in some couts that you can remove as desired.
const unsigned long long mask_bit_count = 31;
float ull_to_float2(unsigned long long val)
{
// How many bits are needed?
int b = sizeof(unsigned long long) * CHAR_BIT - 1;
for(; b >= 0; --b)
{
if(val & (1ull << b))
{
break;
}
}
std::cout << "Need " << (b + 1) << " bits." << std::endl;
// If there are few enough significant bits, use normal cast and done.
if(b < mask_bit_count)
{
return static_cast<float>(val & ~1ull);
}
// Save off the low-order useless bits:
unsigned long long low_bits = val & ((1ull << (b - mask_bit_count)) - 1);
std::cout << "Saved low bits=" << low_bits << std::endl;
std::cout << val << "->mask->";
// Now mask away those useless low bits:
val &= ~((1ull << (b - mask_bit_count)) - 1);
std::cout << val << std::endl;
// Finally, decide how to round the new LSB:
if(low_bits > ((1ull << (b - mask_bit_count)) / 2ull))
{
std::cout << "Rounding up " << val;
// Round up.
val |= (1ull << (b - mask_bit_count));
std::cout << " to " << val << std::endl;
}
else
{
// Round down.
val &= ~(1ull << (b - mask_bit_count));
}
return static_cast<float>(val);
}
I did this in Smalltalk for arbitrary precision integer (LargeInteger), implemented and tested in Squeak/Pharo/Visualworks/Gnu Smalltalk/Dolphin Smalltalk, and even blogged about it if you can read Smalltalk code http://smallissimo.blogspot.fr/2011/09/clarifying-and-optimizing.html .
The trick for accelerating the algorithm is this one: IEEE 754 compliant FPU will round exactly the result of an inexact operation. So we can afford 1 inexact operation and let the hardware rounds correctly for us. That let us handle easily first 48 bits. But we cannot afford two inexact operations, so we sometimes have to care of the lowest bits differently...
Hope the code is documented enough:
#include <math.h>
#include <float.h>
float ull_to_float3(unsigned long long val)
{
int prec=FLT_MANT_DIG ; // 24 bits, the float precision
unsigned long long high=val>>prec; // the high bits above float precision
unsigned long long mask=(1ull<<prec) - 1 ; // 0xFFFFFFull a mask for extracting significant bits
unsigned long long tmsk=(1ull<<(prec - 1)) - 1; // 0x7FFFFFull same but tie bit
// handle trivial cases, 48 bits or less,
// let FPU apply correct rounding after exactly 1 inexact operation
if( high <= mask )
return ldexpf((float) high,prec) + (float) (val & mask);
// more than 48 bits,
// what scaling s is needed to isolate highest 48 bits of val?
int s = 0;
for( ; high > mask ; high >>= 1) ++s;
// high now contains highest 24 bits
float f_high = ldexpf( (float) high , prec + s );
// store next 24 bits in mid
unsigned long long mid = (val >> s) & mask;
// care of rare case when trailing low bits can change the rounding:
// can mid bits be a case of perfect tie or perfect zero?
if( (mid & tmsk) == 0ull )
{
// if low bits are zero, mid is either an exact tie or an exact zero
// else just increment mid to distinguish from such case
unsigned long long low = val & ((1ull << s) - 1);
if(low > 0ull) mid++;
}
return f_high + ldexpf( (float) mid , s );
}
Bonus: this code should round according to your FPU rounding mode whatever it may be, since we implicitely used the FPU to perform rounding with + operation.
However, beware of aggressive optimizations in standards < C99, who knows when the compiler will use extended precision... (unless you force something like -ffloat-store).
If you always want to round to nearest even, whatever the current rounding mode, then you'll have to increment high bits when:
mid bits > tie, where tie=1ull<<(prec-1);
mid bits == tie and (low bits > 0 or high bits is odd).
EDIT:
If you stick to round-to-nearest-even tie breaking, then another solution is to use Shewchuck EXPANSION-SUM of non adjacent parts (fhigh,flow) and (fmid) see http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps :
#include <math.h>
#include <float.h>
float ull_to_float4(unsigned long long val)
{
int prec=FLT_MANT_DIG ; // 24 bits, the float precision
unsigned long long mask=(1ull<<prec) - 1 ; // 0xFFFFFFull a mask for extracting significant bits
unsigned long long high=val>>(2*prec); // the high bits
unsigned long long mid=(val>>prec) & mask; // the mid bits
unsigned long long low=val & mask; // the low bits
float fhigh = ldexpf((float) high,2*prec);
float fmid = ldexpf((float) mid,prec);
float flow = (float) low;
float sum1 = fmid + flow;
float residue1 = flow - (sum1 - fmid);
float sum2 = fhigh + sum1;
float residue2 = sum1 - (sum2 - fhigh);
return (residue1 + residue2) + sum2;
}
This makes a branch-free algorithm with a bit more ops. It may work with other rounding modes, but I let you analyze the paper to make sure...
What is possible between between 8-byte integers and the float format is straightforward to explain but less so to implement!
The next paragraph concerns what is representable in 8 byte signed integers.
All positive integers between 1 (2^0) and 16777215 (2^24-1) are exactly representable in iEEE754 single precision (float). Or, to be precise, all numbers between 2^0 and 2^24-2^0 in increments of 2^0. The next range of exactly representable positive integers is 2^1 to 2^25-2^1 in increments of 2^1 and so on up to 2^39 to 2^63-2^39 in increments of 2^39.
Unsigned 8-byte integer values can be expressed up to 2^64-2^40 in increments of 2^40.
The single precison format doesn't stop here but goes on all the way up to the range 2^103 to 2^127-2^103 in increments of 2^103.
For 4-byte integers (long) the highest float range is 2^7 to 2^31-2^7 in 2^7 increments.
On the x86 architecture the largest integer type supported by the floating point instruction set is the 8 byte signed integer. 2^64-1 cannot be loaded by conventional means.
This means that for a given range increment expressed as "2^i where i is an integer >0" all integers that end with the bit pattern 0x1 up to 2^i-1 will not be exactly representable within that range in a float
This means that what you call rounding upwards is actually dependent on what range you are working in. It is of no use to try to round up by 1 (2^0) or 16 (2^4) if the granularity of the range you are in is 2^19.
An additional consequence of what you propose to do (rounding 2^63-1 to 2^63) could result in an (long integer format) overflow if you attempt the following conversion: longlong_int=(long long) ((float) 2^63).
Check out this small program I wrote (in C) which should help illustrate what is possible and what isn't.
int main (void)
{
__int64 basel=1,baseh=16777215,src,dst,j;
float cnvl,cnvh,range;
int i=0;
while (i<40)
{
src=basel<<i;
cnvl=(float) src;
dst=(__int64) cnvl; /* compare dst with basel */
src=baseh<<i;
cnvh=(float) src;
dst=(__int64) cnvh; /* compare dst with baseh */
j=basel;
while (j<=baseh)
{
range=(float) j;
dst=(__int64) range;
if (j!=dst) dst/=0;
j+=basel;
}
++i;
}
return i;
}
This program shows the representable integer value ranges. There is overlap beteen them: for example 2^5 is representable in all ranges with a lower boundary 2^b where 1=