Count leading zero bits for each element in AVX2 vector, emulate _mm256_lzcnt_epi32 - bit-manipulation

With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element.
Is there an efficient way to implement this using AVX and AVX2 instructions only?
Currently I'm using a loop which extracts each element and applies the _lzcnt_u32 function.
Related: to bit-scan one large bitmap, see Count leading zeros in __m256i word which uses pmovmskb -> bitscan to find which byte to do a scalar bitscan on.
This question is about doing 8 separate lzcnts on 8 separate 32-bit elements when you're actually going to use all 8 results, not just select one.

float represents numbers in an exponential format, so int->FP conversion gives us the position of the highest set bit encoded in the exponent field.
We want int->float with magnitude rounded down (truncate the value towards 0), not the default rounding of nearest. That could round up and make 0x3FFFFFFF look like 0x40000000. If you're doing a lot of these conversions without doing any FP math, you could set the rounding mode in the MXCSR1 to truncation then set it back when you're done.
Otherwise you can use v & ~(v>>8) to keep the 8 most-significant bits and zero some or all lower bits, including a potentially-set bit 8 below the MSB. That's enough to ensure all rounding modes never round up to the next power of two. It always keeps the 8 MSB because v>>8 shifts in 8 zeros, so inverted that's 8 ones. At lower bit positions, wherever the MSB is, 8 zeros are shifted past there from higher positions, so it will never clear the most significant bit of any integer. Depending on how set bits below the MSB line up, it might or might not clear more below the 8 most significant.
After conversion, we use an integer shift on the bit-pattern to bring the exponent (and sign bit) to the bottom and undo the bias with a saturating subtract. We use min to set the result to 32 if no bits were set in the original 32-bit input.
__m256i avx2_lzcnt_epi32 (__m256i v) {
// prevent value from being rounded up to the next power of two
v = _mm256_andnot_si256(_mm256_srli_epi32(v, 8), v); // keep 8 MSB
v = _mm256_castps_si256(_mm256_cvtepi32_ps(v)); // convert an integer to float
v = _mm256_srli_epi32(v, 23); // shift down the exponent
v = _mm256_subs_epu16(_mm256_set1_epi32(158), v); // undo bias
v = _mm256_min_epi16(v, _mm256_set1_epi32(32)); // clamp at 32
return v;
}
Footnote 1: fp->int conversion is available with truncation (cvtt), but int->fp conversion is only available with default rounding (subject to MXCSR).
AVX512F introduces rounding-mode overrides for 512-bit vectors which would solve the problem, __m512 _mm512_cvt_roundepi32_ps( __m512i a, int r);. But all CPUs with AVX512F also support AVX512CD so you could just use _mm512_lzcnt_epi32. And with AVX512VL, _mm256_lzcnt_epi32

#aqrit's answer looks like a more-clever use of FP bithacks. My answer below is based on the first place I looked for a bithack which was old and aimed at scalar so it didn't try to avoid double (which is wider than int32 and thus a problem for SIMD).
It uses HW signed int->float conversion and saturating integer subtracts to handle the MSB being set (negative float), instead of stuffing bits into a mantissa for manual uint->double. If you can set MXCSR to round down across a lot of these _mm256_lzcnt_epi32, that's even more efficient.
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogIEEE64Float suggests stuffing integers into the mantissa of a large double, then subtracting to get the FPU hardware to get a normalized double. (I think this bit of magic is doing uint32_t -> double, with the technique #Mysticial explains in How to efficiently perform double/int64 conversions with SSE/AVX? (which works for uint64_t up to 252-1)
Then grab the exponent bits of the double and undo the bias.
I think integer log2 is the same thing as lzcnt, but there might be an off-by-1 at powers of 2.
The Standford Graphics bithack page lists other branchless bithacks you could use that would probably still be better than 8x scalar lzcnt.
If you knew your numbers were always small-ish (like less than 2^23) you could maybe do this with float and avoid splitting and blending.
int v; // 32-bit integer to find the log base 2 of
int r; // result of log_2(v) goes here
union { unsigned int u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = v;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
The code above loads a 64-bit (IEEE-754 floating-point) double with a 32-bit integer (with no paddding bits) by storing the integer in the mantissa while the exponent is set to 252. From this newly minted double, 252 (expressed as a double) is subtracted, which sets the resulting exponent to the log base 2 of the input value, v. All that is left is shifting the exponent bits into position (20 bits right) and subtracting the bias, 0x3FF (which is 1023 decimal).
To do this with AVX2, blend and shift+blend odd/even halves with set1_epi32(0x43300000) and _mm256_castps_pd to get a __m256d. And after subtracting, _mm256_castpd_si256 and shift / blend the low/high halves into place then mask to get the exponents.
Doing integer operations on FP bit-patterns is very efficient with AVX2, just 1 cycle of extra latency for a bypass delay when doing integer shifts on the output of an FP math instruction.
(TODO: write it with C++ intrinsics, edit welcome or someone else could just post it as an answer.)
I'm not sure if you can do anything with int -> double conversion and then reading the exponent field. Negative numbers have no leading zeros and positive numbers give an exponent that depends on the magnitude.
If you did want that, you'd go one 128-bit lane at a time, shuffling to feed xmm -> ymm packed int32_t -> packed double conversion.

The question is also tagged AVX, but there are no instructions for integer processing in AVX, which means one needs to fall back to SSE on platforms that support AVX but not AVX2. I am showing an exhaustively tested, but a bit pedestrian version below. The basic idea here is as in the other answers, in that the count of leading zeros is determined by the floating-point normalization that occurs during integer to floating-point conversion. The exponent of the result has a one-to-one correspondence with the count of leading zeros, except that the result is wrong in the case of an argument of zero. Conceptually:
clz (a) = (158 - (float_as_uint32 (uint32_to_float_rz (a)) >> 23)) + (a == 0)
where float_as_uint32() is a re-interpreting cast and uint32_to_float_rz() is a conversion from unsigned integer to floating-point with truncation. A normal, rounding, conversion could bump up the conversion result to the next power of two, resulting in an incorrect count of leading zero bits.
SSE does not provide truncating integer to floating-point conversion as a single instruction, nor conversions from unsigned integers. This functionality needs to be emulated. The emulation does not need to be exact, as long as it does not change the magnitude of the conversion result. The truncation part is handled by the invert - right shift - andn technique from aqrit's answer. To use signed conversion, we cut the number in half before the conversion, then double and increment after the conversion:
float approximate_uint32_to_float_rz (uint32_t a)
{
float r = (float)(int)((a >> 1) & ~(a >> 2));
return r + r + 1.0f;
}
This approach is translated into SSE intrinsics in sse_clz() below.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "immintrin.h"
/* compute count of leading zero bits using floating-point normalization.
clz(a) = (158 - (float_as_uint32 (uint32_to_float_rz (a)) >> 23)) + (a == 0)
The problematic part here is uint32_to_float_rz(). SSE does not offer
conversion of unsigned integers, and no rounding modes in integer to
floating-point conversion. Since all we need is an approximate version
that preserves order of magnitude:
float approximate_uint32_to_float_rz (uint32_t a)
{
float r = (float)(int)((a >> 1) & ~(a >> 2));
return r + r + 1.0f;
}
*/
__m128i sse_clz (__m128i a)
{
__m128 fp1 = _mm_set_ps1 (1.0f);
__m128i zero = _mm_set1_epi32 (0);
__m128i i158 = _mm_set1_epi32 (158);
__m128i iszero = _mm_cmpeq_epi32 (a, zero);
__m128i lsr1 = _mm_srli_epi32 (a, 1);
__m128i lsr2 = _mm_srli_epi32 (a, 2);
__m128i atrunc = _mm_andnot_si128 (lsr2, lsr1);
__m128 atruncf = _mm_cvtepi32_ps (atrunc);
__m128 atruncf2 = _mm_add_ps (atruncf, atruncf);
__m128 conv = _mm_add_ps (atruncf2, fp1);
__m128i convi = _mm_castps_si128 (conv);
__m128i lsr23 = _mm_srli_epi32 (convi, 23);
__m128i res = _mm_sub_epi32 (i158, lsr23);
return _mm_sub_epi32 (res, iszero);
}
/* Portable reference implementation of 32-bit count of leading zeros */
int clz32 (uint32_t a)
{
uint32_t r = 32;
if (a >= 0x00010000) { a >>= 16; r -= 16; }
if (a >= 0x00000100) { a >>= 8; r -= 8; }
if (a >= 0x00000010) { a >>= 4; r -= 4; }
if (a >= 0x00000004) { a >>= 2; r -= 2; }
r -= a - (a & (a >> 1));
return r;
}
/* Test floating-point based count leading zeros exhaustively */
int main (void)
{
__m128i res;
uint32_t resi[4], refi[4];
uint32_t count = 0;
do {
refi[0] = clz32 (count);
refi[1] = clz32 (count + 1);
refi[2] = clz32 (count + 2);
refi[3] = clz32 (count + 3);
res = sse_clz (_mm_set_epi32 (count + 3, count + 2, count + 1, count));
memcpy (resi, &res, sizeof resi);
if ((resi[0] != refi[0]) || (resi[1] != refi[1]) ||
(resi[2] != refi[2]) || (resi[3] != refi[3])) {
printf ("error # %08x %08x %08x %08x\n",
count, count+1, count+2, count+3);
return EXIT_FAILURE;
}
count += 4;
} while (count);
return EXIT_SUCCESS;
}

Related

SSE2 packed 8-bit integer signed multiply (high-half): Decomposing a m128i (16x8 bit) into two m128i (8x16 each) and repack

I'm trying to multiply two m128i byte per byte (8 bit signed integers).
The problem here is overflow. My solution is to store these 8 bit signed integers into 16 bit signed integers, multiply, then pack the whole thing into a m128i of 16 x 8 bit integers.
Here is the __m128i mulhi_epi8(__m128i a, __m128i b) emulation I made:
inline __m128i mulhi_epi8(__m128i a, __m128i b)
{
auto a_decomposed = decompose_epi8(a);
auto b_decomposed = decompose_epi8(b);
__m128i r1 = _mm_mullo_epi16(a_decomposed.first, b_decomposed.first);
__m128i r2 = _mm_mullo_epi16(a_decomposed.second, b_decomposed.second);
return _mm_packs_epi16(_mm_srai_epi16(r1, 8), _mm_srai_epi16(r2, 8));
}
decompose_epi8 is implemented in a non-simd way:
inline std::pair<__m128i, __m128i> decompose_epi8(__m128i input)
{
std::pair<__m128i, __m128i> result;
// result.first => should contain 8 shorts in [-128, 127] (8 first bytes of the input)
// result.second => should contain 8 shorts in [-128, 127] (8 last bytes of the input)
for (int i = 0; i < 8; ++i)
{
result.first.m128i_i16[i] = input.m128i_i8[i];
result.second.m128i_i16[i] = input.m128i_i8[i + 8];
}
return result;
}
This code works well. My goal now is to implement a simd version of this for loop. I looked at the Intel Intrinsics Guide but I can't find a way to do this. I guess shuffle could do the trick but I have trouble conceptualising this.
As you want to do signed multiplication, you need to sign-extend each byte to 16bit words, or move them into the upper half of each 16bit word. Since you pack the results back together afterwards, you can split the input into odd and even bytes, instead of the higher and lower half. Then sign-extension of the odd bytes can be done by arithmetically shifting all 16bit parts to the right You can extract the odd bytes by masking out the even bytes, and to get the even bytes, you can shift all 16bit parts to the left (both need to be multiplied by _mm_mulhi_epi16).
The following should work with SSE2:
__m128i mulhi_epi8(__m128i a, __m128i b)
{
__m128i mask = _mm_set1_epi16(0xff00);
// mask higher bytes:
__m128i a_hi = _mm_and_si128(a, mask);
__m128i b_hi = _mm_and_si128(b, mask);
__m128i r_hi = _mm_mulhi_epi16(a_hi, b_hi);
// mask out garbage in lower half:
r_hi = _mm_and_si128(r_hi, mask);
// shift lower bytes to upper half
__m128i a_lo = _mm_slli_epi16(a,8);
__m128i b_lo = _mm_slli_epi16(b,8);
__m128i r_lo = _mm_mulhi_epi16(a_lo, b_lo);
// shift result to the lower half:
r_lo = _mm_srli_epi16(r_lo,8);
// join result and return:
return _mm_or_si128(r_hi, r_lo);
}
Note: a previous version used shifts to sign-extend the odd bytes. On most Intel CPUs this would increase P0 usage (which needs to be used for multiplication as well). Bit-logic can operate on more ports, so this version should have better throughput.

Is there a good way of finding modulus of two variables using SSE? (without SVML)

I'm trying to learn to use SSE, one of the programs I was making required the use of modulus division and so I wrote this to do it (sorry it's overcommented):
__m128i SSEModDiv(__m128i input, __m128i divisors)
{
//Error Checking (div by zero)
/*__m128i zeros = _mm_set1_epi32(0);
__m128i error = _mm_set1_epi32(-1);
__m128i zerocheck = _mm_cmpeq_epi32(zeros, divisors);
if (_mm_extract_epi16(zerocheck, 0) != 0)
return error;
if (_mm_extract_epi16(zerocheck, 2) != 0)
return error;
if (_mm_extract_epi16(zerocheck, 4) != 0)
return error;
if (_mm_extract_epi16(zerocheck, 6) != 0)
return error;*/
//Now for the real work
__m128 inputf = _mm_cvtepi32_ps(input);
__m128 divisorsf = _mm_cvtepi32_ps(divisors);
/*__m128 recip = _mm_rcp_ps(divisorsf); //Takes reciprocal
__m128 divided = _mm_mul_ps(inputf, recip); //multiplies by reciprical values*/
__m128 divided = _mm_div_ps(inputf, divisorsf);
__m128i intermediateint = _mm_cvttps_epi32(divided); //makes an integer version truncated
__m128 intermediate = _mm_cvtepi32_ps(intermediateint);
__m128 multiplied = _mm_mul_ps(intermediate, divisorsf); //multiplies the intermediate with the divisors
__m128 mods = _mm_sub_ps(inputf, multiplied); //subtracts to get moduli
return _mm_cvtps_epi32(mods);
}
The problem is this is about as fast as taking the modulus of each element of the four 32-bit integers individually in release and about 10 times slower in debug (found by profiling).
Could anyone give me any pointers on how to make this function faster?
-I can't use SVML be cause I'm using Visual Studio-
For general values of of input and divisors there are no useful SIMD x86 instructions for integer division or modulus so it's best to use scalar integer division. However, there are special cases where SIMD integer modulus can be done faster.
For example if you want to do (a+b)%c and a and b are already reduced (i.e. a<c and b<c) then you can use a comparison and a subtraction like this:
z = a + b
if(z>=c) z-=c;
I did this in this example vectorizing-modular-arithmetic
Another example is if the divisors are not compile time constants but are nevertheless constant within a loop then you can use an analogous idea from floating point division. A common trick in floating point division is to pre-compute the reciprocal of the divisor and do multiplication like this:
float fact = 1.0/x;
for(int i=0; i<n; i++) {
z[i] = fact*y[i]; //z[i] = y[i]/x;
}
You can use a similar technique for integer division which converts the integer division into integer multiplication with a shift.
y / x ≈ y * (2 n / x) >> n
There are several different techniques to determine the factor (aka magic number) (2 n / x) and the shift n. In fact most compilers will already do this for compile time constants and division. If you try for example x/7 and look at the assembly output from GCC or MSVC you will see that they don't actually do integer division, they do multiplication and a shift using the same magic numbers and shifts as calculated from http://www.hackersdelight.org/magic.htm.
I do this at runtime using Agner Fog's Vector Class Library or his Subroutine library. Both libraries will calculate the magic numbers and shifts for you at runtime for SSE and AVX integer division.
But as I said at the start of this answer if you want to do
for(int i=0; i<n; i++) {
z[i] = y[i]%x[i];
}
and x[i] is not a constant within the loop it's best to stick to scalar division/modulus.

Storing non-negative floating point values

Is there an efficient way to store non-negative floating point values using the existing float32 and float64 formats?
Imagine the default float32 behaviour which allows negative/positive:
val = bytes.readFloat32();
Is it possible to allow for greater positive values if negative values are not necessary?
val = bytes.readFloat32() + 0xFFFFFFFF;
Edit: Essentially when I know I'm storing only positive values, the float format could be modified a bit to allow for greater range or precision for the same amount of bits.
Eg. The float32 format is defined as 1 bit for sign, 8 bits for exponent, 23 bits for fraction
What if I don't need the sign bit, can we have 8 bits for exponent, 24 bits for fraction to give greater precision for the same 32 bits?
Floating-point numbers (float32 and float64) have an explicit sign bit. The equivalent of unsigned integers doesn't exist for floating-point numbers.
So there is no easy way to double the range of positive floating-point numbers.
No, not for free.
You can extend the range/accuracy in many ways using other numeric representations. The intent won't be clear, and the performance will typically be poor if you want the range and accuracy of float or double using another numeric representation (of equal size).
Just stick with float or double unless performance/storage is very very important, and you can represent your values well (or better!) using another numeric representation.
There's almost no support for unsigned float in hardware so you won't have such off-the-shelf feature but you can still have quite efficient unsigned float by storing the least significant bit in the sign bit. This way you can utilize the available floating-point hardware support instead of writing a software float solution. To do that you can
manipulate it manually after each operation
This way you need some small correction to the lsb (A.K.A sign bit), for example 1 more long division step, or a 1-bit adder for the addition
or by doing the math in higher precision if available
For example for float you can do operations in double then cast back to float when storing if sizof(float) < sizeof(double)
Here's a simple PoC implementation:
#include <cmath>
#include <cfenv>
#include <bit>
#include <type_traits>
// Does the math in double precision when hardware double is available
#define HAS_NATIVE_DOUBLE
class UFloat
{
public:
UFloat(double d) : UFloat(0.0f)
{
if (d < 0)
throw std::range_error("Value must be non-negative!");
uint64_t dbits = std::bit_cast<uint64_t>(d);
bool lsb = dbits & lsbMask;
dbits &= ~lsbMask; // turn off the lsb
d = std::bit_cast<double>(dbits);
value = lsb ? -(float)d : (float)d;
}
UFloat(const UFloat &rhs) : UFloat(rhs.value) {}
// =========== Operators ===========
UFloat &operator+=(const UFloat &rhs)
{
#ifdef HAS_NATIVE_DOUBLE
// Calculate in higher precision then round back
setValue((double)value + rhs.value);
#else
// Calculate the least significant bit manually
bool lhsLsb = std::signbit(value);
bool rhsLsb = std::signbit(rhs.value);
// Clear the sign bit to get the higher significant bits
// then get the sum
value = std::abs(value);
value += std::abs(rhs.value);
if (std::isfinite(value))
{
if (lhsLsb ^ rhsLsb) // Only ONE of the 2 least significant bits is 1
{
// The sum's lsb is 1, so we'll set its sign bit
value = -value;
}
else if (lhsLsb)
{
// BOTH least significant bits are 1s,
// so we'll add the carry to the next bit
value = std::nextafter(value, INFINITY);
// The lsb of the sum is 0, so the sign bit isn't changed
}
}
#endif
return *this;
}
UFloat &operator*=(const UFloat &rhs)
{
#ifdef HAS_NATIVE_DOUBLE
// Calculate in higher precision then round back
setValue((double)value * rhs.value);
#else
// Calculate the least significant bit manually
bool lhsLsb = std::signbit(value);
bool rhsLsb = std::signbit(rhs.value);
// Clear the sign bit to get the higher significant bits
// then get the product
float lhsMsbs = std::abs(value);
float rhsMsbs = std::abs(rhs.value);
// Suppose we have X.xPm with
// X: the high significant bits
// x: the least significant one
// and m: the exponent. Same to Y.yPn
// X.xPm * Y.yPn = (X + 0.x)*2^m * (Y + 0.y)*2^n
// = (X + x/2)*2^m * (Y + y/2)*2^n
// = (X*Y + X*y/2 + Y*x/2 + x*y/4)*2^(m + n)
value = lhsMsbs * rhsMsbs; // X*Y
if (std::isfinite(value))
{
uint32_t rhsMsbsBits = std::bit_cast<uint32_t>(rhsMsb);
value += rhsMsbs*lhsLsb / 2; // X*y/2
uint32_t lhsMsbsBits = std::bit_cast<uint32_t>(lhsMsbs);
value += lhsMsbs*rhsLsb / 2; // Y*x/2
int lsb = (rhsMsbsBits | lhsMsbsBits) & 1; // the product's lsb
lsb += lhsLsb & rhsLsb;
if (lsb & 1)
value = -value; // set the lsb
if (lsb > 1) // carry to the next bit
value = std::nextafter(value, INFINITY);
}
#endif
return *this;
}
UFloat &operator/=(const UFloat &rhs)
{
#ifdef HAS_NATIVE_DOUBLE
// Calculate in higher precision then round back
setValue((double)value / rhs.value);
#else
// Calculate the least significant bit manually
// Do just one more step of long division,
// since we only have 1 bit left to divide
throw std::runtime_error("Not Implemented yet!");
#endif
return *this;
}
double getUnsignedValue() const
{
if (!std::signbit(value))
{
return value;
}
else
{
double result = std::abs(value);
uint64_t doubleValue = std::bit_cast<uint64_t>(result);
doubleValue |= lsbMask; // turn on the least significant bit
result = std::bit_cast<double>(doubleValue);
return result;
}
}
private:
// The unsigned float value, with the least significant bit (lsb)
// being stored in the sign bit
float value;
// the first bit after the normal mantissa bits
static const uint64_t lsbMask = 1ULL << (DBL_MANT_DIG - FLT_MANT_DIG - 1);
// =========== Private Constructor ===========
UFloat(float rhs) : value(rhs)
{
std::fesetround(FE_TOWARDZERO); // We'll round the value ourselves
#ifdef HAS_NATIVE_DOUBLE
static_assert(sizeof(float) < sizeof(double));
#endif
}
void setValue(double d)
{
// get the bit pattern of the double value
auto bits = std::bit_cast<std::uint64_t>(d);
bool lsb = bits & lsbMask;
// turn off the lsb to avoid rounding when converting to float
bits &= ~lsbMask;
d = std::bit_cast<double>(bits);
value = (float)d;
if (lsb)
value = -value;
}
}
Some more tuning may be necessary to get the correct lsb
Either way you'll need more operations than normal so this may only be good for big arrays where cache footprint is a concern. In that case I suggest to use this as a storage format only, like how FP16 is treated on most current architectures: there are only load/store instructions for it which expands to float or double and convert back. All arithmetic operations are done in float or double only
So the unsigned float should exist in memory only, and will be decoded to the full double on load. This way you work on the native double type and won't need the correction after each operator
Alternatively this can be used with SIMD to operate on multiple unsigned floats at the same time

Generating random floating-point values based on random bit stream

Given a random source (a generator of random bit stream), how do I generate a uniformly distributed random floating-point value in a given range?
Assume that my random source looks something like:
unsigned int GetRandomBits(char* pBuf, int nLen);
And I want to implement
double GetRandomVal(double fMin, double fMax);
Notes:
I don't want the result precision to be limited (for example only 5 digits).
Strict uniform distribution is a must
I'm not asking for a reference to an existing library. I want to know how to implement it from scratch.
For pseudo-code / code, C++ would be most appreciated
I don't think I'll ever be convinced that you actually need this, but it was fun to write.
#include <stdint.h>
#include <cmath>
#include <cstdio>
FILE* devurandom;
bool geometric(int x) {
// returns true with probability min(2^-x, 1)
if (x <= 0) return true;
while (1) {
uint8_t r;
fread(&r, sizeof r, 1, devurandom);
if (x < 8) {
return (r & ((1 << x) - 1)) == 0;
} else if (r != 0) {
return false;
}
x -= 8;
}
}
double uniform(double a, double b) {
// requires IEEE doubles and 0.0 < a < b < inf and a normal
// implicitly computes a uniform random real y in [a, b)
// and returns the greatest double x such that x <= y
union {
double f;
uint64_t u;
} convert;
convert.f = a;
uint64_t a_bits = convert.u;
convert.f = b;
uint64_t b_bits = convert.u;
uint64_t mask = b_bits - a_bits;
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask |= mask >> 32;
int b_exp;
frexp(b, &b_exp);
while (1) {
// sample uniform x_bits in [a_bits, b_bits)
uint64_t x_bits;
fread(&x_bits, sizeof x_bits, 1, devurandom);
x_bits &= mask;
x_bits += a_bits;
if (x_bits >= b_bits) continue;
double x;
convert.u = x_bits;
x = convert.f;
// accept x with probability proportional to 2^x_exp
int x_exp;
frexp(x, &x_exp);
if (geometric(b_exp - x_exp)) return x;
}
}
int main() {
devurandom = fopen("/dev/urandom", "r");
for (int i = 0; i < 100000; ++i) {
printf("%.17g\n", uniform(1.0 - 1e-15, 1.0 + 1e-15));
}
}
Here is one way of doing it.
The IEEE Std 754 double format is as follows:
[s][ e ][ f ]
where s is the sign bit (1 bit), e is the biased exponent (11 bits) and f is the fraction (52 bits).
Beware that the layout in memory will be different on little-endian machines.
For 0 < e < 2047, the number represented is
(-1)**(s) * 2**(e – 1023) * (1.f)
By setting s to 0, e to 1023 and f to 52 random bits from your bit stream, you get a random double in the interval [1.0, 2.0). This interval is unique in that it contains 2 ** 52 doubles, and these doubles are equidistant. If you then subtract 1.0 from the constructed double, you get a random double in the interval [0.0, 1.0). Moreover, the property about being equidistant is preserve.
From there you should be able to scale and translate as needed.
I'm surprised that for question this old, nobody had actual code for the best answer. User515430's answer got it right--you can take advantage of IEEE-754 double format to directly put 52 bits into a double with no math at all. But he didn't give code. So here it is, from my public domain ojrandlib:
double ojr_next_double(ojr_generator *g) {
uint64_t r = (OJR_NEXT64(g) & 0xFFFFFFFFFFFFFull) | 0x3FF0000000000000ull;
return *(double *)(&r) - 1.0;
}
NEXT64() gets a 64-bit random number. If you have a more efficient way of getting only 52 bits, use that instead.
This is easy, as long as you have an integer type with as many bits of precision as a double. For instance, an IEEE double-precision number has 53 bits of precision, so a 64-bit integer type is enough:
#include <limits.h>
double GetRandomVal(double fMin, double fMax) {
unsigned long long n ;
GetRandomBits ((char*)&n, sizeof(n)) ;
return fMin + (n * (fMax - fMin))/ULLONG_MAX ;
}
This is probably not the answer you want, but the specification here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3225.pdf
in sections [rand.util.canonical] and [rand.dist.uni.real], contains sufficient information to implement what you want, though with slightly different syntax. It isn't easy, but it is possible. I speak from personal experience. A year ago I knew nothing about random numbers, and I was able to do it. Though it took me a while... :-)
The question is ill-posed. What does uniform distribution over floats even mean?
Taking our cue from discrepancy, one way to operationalize your question is to define that you want the distribution that minimizes the following value:
Where x is the random variable you are sampling with your GetRandomVal(double fMin, double fMax) function, and means the probability that a random x is smaller or equal to t.
And now you can go on and try to evaluate eg a dabbler's answer. (Hint all the answers that fail to use the whole precision and stick to eg 52 bits will fail this minimization criterion.)
However, if you just want to be able to generate all float bit patterns that fall into your specified range with equal possibility, even if that means that eg asking for GetRandomVal(0,1000) will create more values between 0 and 1.5 than between 1.5 and 1000, that's easy: any interval of IEEE floating point numbers when interpreted as bit patterns map easily to a very small number of intervals of unsigned int64. See eg this question. Generating equally distributed random values of unsigned int64 in any given interval is easy.
I may be misunderstanding the question, but what stops you simply sampling the next n bits from the random bit stream and converting that to a base 10 number number ranged 0 to 2^n - 1.
To get a random value in [0..1[ you could do something like:
double value = 0;
for (int i=0;i<53;i++)
value = 0.5 * (value + random_bit()); // Insert 1 random bit
// or value = ldexp(value+random_bit(),-1);
// or group several bits into one single ldexp
return value;

Converting from unsigned long long to float with round to nearest even

I need to write a function that rounds from unsigned long long to float, and the rounding should be toward nearest even.
I cannot just do a C++ type-cast, since AFAIK the standard does not specify the rounding.
I was thinking of using boost::numeric, but i could not find any useful lead after reading the documentation. Can this be done using that library?
Of course, if there is an alternative, i would be glad to use it.
Any help would be much appreciated.
EDIT: Adding an example to make things a bit clearer.
Suppose i want to convert 0xffffff7fffffffff to its floating point representation. The C++ standard permits either one of:
0x5f7fffff ~ 1.9999999*2^63
0x5f800000 = 2^64
Now if you add the restriction of round to nearest even, only the first result is acceptable.
Since you have so many bits in the source that can't be represented in the float and you can't (apparently) rely on the language's conversion, you'll have to do it yourself.
I devised a scheme that may or may not help you. Basically, there are 31 bits to represent positive numbers in a float so I pick up the 31 most significant bits in the source number. Then I save off and mask away all the lower bits. Then based on the value of the lower bits I round the "new" LSB up or down and finally use static_cast to create a float.
I left in some couts that you can remove as desired.
const unsigned long long mask_bit_count = 31;
float ull_to_float2(unsigned long long val)
{
// How many bits are needed?
int b = sizeof(unsigned long long) * CHAR_BIT - 1;
for(; b >= 0; --b)
{
if(val & (1ull << b))
{
break;
}
}
std::cout << "Need " << (b + 1) << " bits." << std::endl;
// If there are few enough significant bits, use normal cast and done.
if(b < mask_bit_count)
{
return static_cast<float>(val & ~1ull);
}
// Save off the low-order useless bits:
unsigned long long low_bits = val & ((1ull << (b - mask_bit_count)) - 1);
std::cout << "Saved low bits=" << low_bits << std::endl;
std::cout << val << "->mask->";
// Now mask away those useless low bits:
val &= ~((1ull << (b - mask_bit_count)) - 1);
std::cout << val << std::endl;
// Finally, decide how to round the new LSB:
if(low_bits > ((1ull << (b - mask_bit_count)) / 2ull))
{
std::cout << "Rounding up " << val;
// Round up.
val |= (1ull << (b - mask_bit_count));
std::cout << " to " << val << std::endl;
}
else
{
// Round down.
val &= ~(1ull << (b - mask_bit_count));
}
return static_cast<float>(val);
}
I did this in Smalltalk for arbitrary precision integer (LargeInteger), implemented and tested in Squeak/Pharo/Visualworks/Gnu Smalltalk/Dolphin Smalltalk, and even blogged about it if you can read Smalltalk code http://smallissimo.blogspot.fr/2011/09/clarifying-and-optimizing.html .
The trick for accelerating the algorithm is this one: IEEE 754 compliant FPU will round exactly the result of an inexact operation. So we can afford 1 inexact operation and let the hardware rounds correctly for us. That let us handle easily first 48 bits. But we cannot afford two inexact operations, so we sometimes have to care of the lowest bits differently...
Hope the code is documented enough:
#include <math.h>
#include <float.h>
float ull_to_float3(unsigned long long val)
{
int prec=FLT_MANT_DIG ; // 24 bits, the float precision
unsigned long long high=val>>prec; // the high bits above float precision
unsigned long long mask=(1ull<<prec) - 1 ; // 0xFFFFFFull a mask for extracting significant bits
unsigned long long tmsk=(1ull<<(prec - 1)) - 1; // 0x7FFFFFull same but tie bit
// handle trivial cases, 48 bits or less,
// let FPU apply correct rounding after exactly 1 inexact operation
if( high <= mask )
return ldexpf((float) high,prec) + (float) (val & mask);
// more than 48 bits,
// what scaling s is needed to isolate highest 48 bits of val?
int s = 0;
for( ; high > mask ; high >>= 1) ++s;
// high now contains highest 24 bits
float f_high = ldexpf( (float) high , prec + s );
// store next 24 bits in mid
unsigned long long mid = (val >> s) & mask;
// care of rare case when trailing low bits can change the rounding:
// can mid bits be a case of perfect tie or perfect zero?
if( (mid & tmsk) == 0ull )
{
// if low bits are zero, mid is either an exact tie or an exact zero
// else just increment mid to distinguish from such case
unsigned long long low = val & ((1ull << s) - 1);
if(low > 0ull) mid++;
}
return f_high + ldexpf( (float) mid , s );
}
Bonus: this code should round according to your FPU rounding mode whatever it may be, since we implicitely used the FPU to perform rounding with + operation.
However, beware of aggressive optimizations in standards < C99, who knows when the compiler will use extended precision... (unless you force something like -ffloat-store).
If you always want to round to nearest even, whatever the current rounding mode, then you'll have to increment high bits when:
mid bits > tie, where tie=1ull<<(prec-1);
mid bits == tie and (low bits > 0 or high bits is odd).
EDIT:
If you stick to round-to-nearest-even tie breaking, then another solution is to use Shewchuck EXPANSION-SUM of non adjacent parts (fhigh,flow) and (fmid) see http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps :
#include <math.h>
#include <float.h>
float ull_to_float4(unsigned long long val)
{
int prec=FLT_MANT_DIG ; // 24 bits, the float precision
unsigned long long mask=(1ull<<prec) - 1 ; // 0xFFFFFFull a mask for extracting significant bits
unsigned long long high=val>>(2*prec); // the high bits
unsigned long long mid=(val>>prec) & mask; // the mid bits
unsigned long long low=val & mask; // the low bits
float fhigh = ldexpf((float) high,2*prec);
float fmid = ldexpf((float) mid,prec);
float flow = (float) low;
float sum1 = fmid + flow;
float residue1 = flow - (sum1 - fmid);
float sum2 = fhigh + sum1;
float residue2 = sum1 - (sum2 - fhigh);
return (residue1 + residue2) + sum2;
}
This makes a branch-free algorithm with a bit more ops. It may work with other rounding modes, but I let you analyze the paper to make sure...
What is possible between between 8-byte integers and the float format is straightforward to explain but less so to implement!
The next paragraph concerns what is representable in 8 byte signed integers.
All positive integers between 1 (2^0) and 16777215 (2^24-1) are exactly representable in iEEE754 single precision (float). Or, to be precise, all numbers between 2^0 and 2^24-2^0 in increments of 2^0. The next range of exactly representable positive integers is 2^1 to 2^25-2^1 in increments of 2^1 and so on up to 2^39 to 2^63-2^39 in increments of 2^39.
Unsigned 8-byte integer values can be expressed up to 2^64-2^40 in increments of 2^40.
The single precison format doesn't stop here but goes on all the way up to the range 2^103 to 2^127-2^103 in increments of 2^103.
For 4-byte integers (long) the highest float range is 2^7 to 2^31-2^7 in 2^7 increments.
On the x86 architecture the largest integer type supported by the floating point instruction set is the 8 byte signed integer. 2^64-1 cannot be loaded by conventional means.
This means that for a given range increment expressed as "2^i where i is an integer >0" all integers that end with the bit pattern 0x1 up to 2^i-1 will not be exactly representable within that range in a float
This means that what you call rounding upwards is actually dependent on what range you are working in. It is of no use to try to round up by 1 (2^0) or 16 (2^4) if the granularity of the range you are in is 2^19.
An additional consequence of what you propose to do (rounding 2^63-1 to 2^63) could result in an (long integer format) overflow if you attempt the following conversion: longlong_int=(long long) ((float) 2^63).
Check out this small program I wrote (in C) which should help illustrate what is possible and what isn't.
int main (void)
{
__int64 basel=1,baseh=16777215,src,dst,j;
float cnvl,cnvh,range;
int i=0;
while (i<40)
{
src=basel<<i;
cnvl=(float) src;
dst=(__int64) cnvl; /* compare dst with basel */
src=baseh<<i;
cnvh=(float) src;
dst=(__int64) cnvh; /* compare dst with baseh */
j=basel;
while (j<=baseh)
{
range=(float) j;
dst=(__int64) range;
if (j!=dst) dst/=0;
j+=basel;
}
++i;
}
return i;
}
This program shows the representable integer value ranges. There is overlap beteen them: for example 2^5 is representable in all ranges with a lower boundary 2^b where 1=