How to quantize floating point to unsigned byte in GLSL - c++

I used floating point texture as data buffer in GLSL and need to save the data on a normal texture (each pixel's color has 1 byte). In my situation, floating point is [-2048.0, 2048.0] and so I have to quantize [-2048.0, 2048.0] to [0, 255]. I think the C++ code for this problem is like :
float fvalue = ... ; // floating point data
fvalue /= 16.0f; // [-128.0, 128.0]
fvalue = roundf(fvalue); // [-128, 128]
if(fvalue > 127.0f) fvalue = 127.0f;
else if(fvalue < -128.0f) fvalue = -128.0f;
u_char byte = (int)fvalue + 128; // [0, 255]
//*inverse quantization*
u_char byte = ...; // [0, 255]
float fvalue = byte - 128; // [-128, 127]
fvalue *= 16.0f; // [-2048, 2032] (it can't be helped?)
I'm not certain this code is good, and I'm even less sure what the right approach is in GLSL (GLSL handles a byte value [0, 255] as a floating-point value [0.0, 1.0]). My code is:
vec3 F = ...; //F is floating vector [-2048.0, 2048.0]
F /= 16; // [-128.0, 128.0]
F /= 256; // [-0.5, 0.5]
F += vec3(0.50f); // [0.0, 1.0]
gl_FragData[0] = vec4(F, 1.0);
//*inverse quantization*
vec3 F = texture2D(...); //byte data [0.0, 1.0]
F -= vec3(0.50f); //byte data [-0.5, 0.5]
F *= 256; //[-128, 128]
F *= 16; //[-2048, 2048]
This didn't work well. However, if I change F += vec3(0.50f); to F += vec3(0.51f); and F -= vec3(0.50f); to F -= vec3(0.51f);, it seems to work well. But I don't think the value 0.51f is reasonable. In fact, this works well on one piece of hardware but not on another.
I want to know the good way to quantize (also inv-quantize) float values.

I found a way that works "well". I'm afraid I can't explain why it works, so I don't know whether it is a general method.
vec3 F = ...; //F is floating vector [-2048.0, 2048.0]
F += 2048;
F /= 16;
F /= 255;
gl_FragData[0] = vec4(F, 1.0);
//*inverse quantization*
vec3 F = texture2D(...); //byte data [0.0, 1.0]
F *= 255.0;
F *= 16.0;
F -= vec3(2048 + 8); //adding bias -16.0/2.0
F = 2.0 * F * qp * Q / 16.0;

First of all, each pixel having 1 byte does not adequately convey what you are trying to describe. This so-called "normal texture" is more accurately referred to as "unsigned normalized" (often shortened to unorm).
You want an 8-bit unorm texture here (ideally with multiple components); these are textures that store fixed-point data and are treated like floating-point (in the range [0.0,1.0]) when sampled by normalizing the data to its intrinsic range (e.g. promoting to floating-point and dividing by 255.0).
Given what was just described, you simply need to transform the original data [-2048.0,2048.0] into [0.0,1.0] and then multiply by 255.
This is rather undesirable though, because you will lose the ability to represent the original range without severe aliasing. Instead, multiply by 4294967295 (256^4 - 1) and pack 8 bits into R, 8 bits into G, 8 bits into B and 8 bits into A. You have made no attempt to pack the components in the shader shown.
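As a rough illustration of the two options described above, the following C++ sketch mimics on the CPU what the shader and an unorm texture would do together. The function names (`quantize8`, `dequantize8`, `pack32`, `unpack32`), the clamping, and the byte order are assumptions made for this example; they are not part of GLSL or of any API.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Input range of the data, taken from the question.
constexpr double kMin = -2048.0;
constexpr double kMax = 2048.0;

// Map [-2048, 2048] to [0, 1], clamping out-of-range input.
inline double normalize(float v) {
    assert(!std::isnan(v));
    const double n = (static_cast<double>(v) - kMin) / (kMax - kMin);
    return n < 0.0 ? 0.0 : (n > 1.0 ? 1.0 : n);
}

// One 8-bit unorm channel: the stored byte is round(n * 255).
inline std::uint8_t quantize8(float v) {
    return static_cast<std::uint8_t>(std::lround(normalize(v) * 255.0));
}

// Sampling an 8-bit unorm channel yields byte / 255 in [0, 1].
inline float dequantize8(std::uint8_t b) {
    return static_cast<float>(b / 255.0 * (kMax - kMin) + kMin);
}

// Four 8-bit channels: spread a 32-bit fixed-point value over R, G, B, A.
inline std::array<std::uint8_t, 4> pack32(float v) {
    const auto q = static_cast<std::uint32_t>(std::llround(normalize(v) * 4294967295.0));
    return { static_cast<std::uint8_t>(q >> 24), static_cast<std::uint8_t>(q >> 16),
             static_cast<std::uint8_t>(q >> 8),  static_cast<std::uint8_t>(q) };
}

// Reassemble the 32-bit value and map it back to [-2048, 2048].
inline float unpack32(const std::array<std::uint8_t, 4>& c) {
    const std::uint32_t q = (std::uint32_t{c[0]} << 24) | (std::uint32_t{c[1]} << 16) |
                            (std::uint32_t{c[2]} << 8)  |  std::uint32_t{c[3]};
    return static_cast<float>(q / 4294967295.0 * (kMax - kMin) + kMin);
}
```

With a single 8-bit channel the step between representable values is 4096 / 255, roughly 16 units, whereas the packed 32-bit variant has a step far smaller than the precision of the original float.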


Seeded Random Uniform float generator using SIMD? [duplicate]

I have a __m256 value that holds random bits.
I would like to "interpret" it to obtain another __m256 that holds float
values in a uniform [0.0f, 1.0f] range.
Planning to do it using:
__m256 randomBits = /* generated random bits, uniformly distributed */;
__m256 invFloatRange = _mm256_set1_ps( numeric_limits<float>::min() ); //min is a smallest increment of float precision
__m256 float01 = _mm256_mul_ps(randomBits, invFloatRange);
//float01 is now ready to be used
Question 1:
However, will this cause a problem in very rare cases where randomBits has all bits as 1 and is therefore NAN?
What can I do to protect myself from this?
I want the float01 to always be a usable number
Question 2:
Will the [0 to 1] range remain uniform after I obtain it using the above approach? I know float has varying precision at different magnitudes
By reinterpreting an int32_t as a float, one can write:
auto const one = _mm256_set1_epi32(0x3f800000); // bit pattern of 1.0f
a = _mm256_and_si256(a, _mm256_set1_epi32(0x007fffff));
a = _mm256_or_si256(a, one);
return _mm256_sub_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(one));
The and/or sequence will reuse the 23 LSBs of the input sequence to produce a uniform distribution of values between 1.0f <= a < 2.0f. And then the bias of 1.0f is removed.
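The same and/or/subtract sequence can be checked one lane at a time in plain C++. This is only a scalar sketch: the name `bits_to_float01` is made up for the example, and it assumes that `float` is an IEEE-754 binary32 value, whose bit pattern for 1.0f is 0x3f800000.

```cpp
#include <cstdint>
#include <cstring>

// Scalar version of the and/or/subtract sequence for a single 32-bit lane.
inline float bits_to_float01(std::uint32_t bits) {
    const std::uint32_t one = 0x3f800000u;              // IEEE-754 bit pattern of 1.0f
    const std::uint32_t u = (bits & 0x007fffffu) | one; // keep 23 mantissa bits, force the exponent of 1.0f
    float f;
    std::memcpy(&f, &u, sizeof f);                      // reinterpret the bits as a float in [1, 2)
    return f - 1.0f;                                    // remove the bias: result in [0, 1)
}
```

Because the exponent field is always overwritten, no input bit pattern can produce a NaN or an infinity, and the largest possible result is 1 - 2^-23.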
As @Soonts has pointed out, floats can be created uniformly in the [0, 1] range:
I ended up using the answer below:
// Converts __m256i values into __m256 values that contain floats in the [0,1) range.
inline void int_rand_int_toFloat01( const __m256i* m256i_vals,
                                    __m256* m256f_vals){ //<-- stores here.
    const __m256 c = _mm256_set1_ps(0x1.0p-24f); // or (1.0f / (uint32_t(1) << 24));
    // Shift right by 8 to keep the top 24 bits of each lane, so the value fits a float exactly.
    // Remember that '_mm256_cvtepi32_ps' converts 32-bit ints into 32-bit floats.
    __m256 converted = _mm256_cvtepi32_ps(_mm256_srli_epi32(*m256i_vals, 8));
    *m256f_vals = _mm256_mul_ps( converted, c);
}
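For reference, the shift-and-scale conversion can be written for a single value in plain C++. This is a scalar sketch with a made-up name (`top24_to_float01`), not a replacement for the SIMD code.

```cpp
#include <cstdint>

// Keep the top 24 bits of the random word and scale by 2^-24.
// Both the integer-to-float conversion and the multiplication are exact here,
// because a 24-bit integer fits into the 24-bit significand of a float.
inline float top24_to_float01(std::uint32_t bits) {
    return static_cast<float>(bits >> 8) * (1.0f / 16777216.0f); // 16777216 = 2^24
}
```

This variant produces 2^24 equally spaced values in [0, 1), with a largest result of 1 - 2^-24, while the mantissa trick produces 2^23 values.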

Precise conversion of 32-bit unsigned integer into a float in range (-1;1)

According to articles like this, half of the floating-point numbers are in the interval [-1,1]. Could you suggest how to make use of this fact so to replace the naive conversion of a 32-bit unsigned integer into a floating-point number (while keeping the uniform distribution)?
Naive code:
uint32_t i = /* randomly generated */;
float f = (float)i / (1ui32<<31) - 1.0f;
The problem here is that first the number i is converted into float losing up to 8 lower bits of precision. Only then the number is scaled to [0;2) interval, and then to [-1;1) interval.
Please, suggest the solution in C or C++ for x86_64 CPU or CUDA if you know it.
Update: the solution with a double is good for x86_64, but is too slow in CUDA. Sorry I didn't expect such a response. Any ideas how to achieve this without using double-precision floating-point?
You can do the calculation using double instead so you don't lose any precision on the uint32_t value, then assign the result to a float.
float f = (double)i / (1ui32<<31) - 1.0;
If you drop the uniform distribution constraint, it's doable with 32-bit integer arithmetic alone:
float i32_to_f32(int x)
{
    int exp;
    union _f32 // semi result
    {
        float f; // 32bit floating point
        DWORD u; // 32 bit uint
    } y;
    // edge cases
    if (x== 0x00000000) return 0.0f;
    if (x< -0x1FFFFFFF) return -1.0f;
    if (x> +0x1FFFFFFF) return +1.0f;
    // conversion
    y.u=0; // reset bits
    if (x<0){ y.u|=0x80000000; x=-x; } // sign (31 bits left)
    exp=((x>>23)&63)-64; // upper 6 bits -> exponent -1,...,-64 (not 7bits to avoid denormalized numbers)
    y.u|=(exp+127)<<23; // exponent bias and bit position
    y.u|=x&0x007FFFFF; // mantissa
    return y.f;
}
int f32_to_i32(float x)
{
    int exp,man,i;
    union _f32 // semi result
    {
        float f; // 32bit floating point
        DWORD u; // 32 bit uint
    } y;
    // edge cases
    if (x== 0.0f) return 0x00000000;
    if (x<=-1.0f) return -0x1FFFFFFF;
    if (x>=+1.0f) return +0x1FFFFFFF;
    // conversion
    y.f=x; // load the float so that its bits can be read through y.u
    exp=(y.u>>23)&255; exp-=127; // exponent bias and bit position
    if (exp<-64) return 0; // too small to be represented
    man=y.u&0x007FFFFF; // mantissa
    i =(exp<<23)&0x1F800000;
    i|= man;
    if (y.u>=0x80000000) i=-i; // sign
    return i;
}
I chose to use only 29 bits + sign = ~ 30 bits of integer to avoid denormalized numbers havoc which I am too lazy to encode (it would get you 30 or even 31 bits but much slower and complicated).
But the distribution is not linear nor uniform at all:
[Plot not shown: red is the float in the range <-1,+1> and blue is the integer in the range <-1FFFFFFF,+1FFFFFFF>.]
On the other hand there is no rounding at all in both conversions ...
PS. I think there might be a way to somewhat linearize the result by using a precomputed LUT for the 6 bit exponent (64 values).
The thing to realize is that while (float)i does lose 8 bits of precision (so it has 24 bits of precision), the result only has 24 bits of precision as well. So this precision loss is not necessarily a bad thing (it is actually more complicated, because if i is smaller, it will lose fewer than 8 bits, but things still work out well).
So we just need to fix the range, so the originally non-negative value gets mapped to INT_MIN..INT_MAX.
This expression works: (float)(int)(value^0x80000000)/0x80000000.
Here's how it works:
The (int)(value^0x80000000) part flips the sign bit, so 0x0 gets mapped to INT_MIN, and 0xffffffff gets mapped to INT_MAX.
Then there is conversion to float. This is where some rounding happens, and we lose precision (but it is not a problem).
Then just divide by 0x80000000 to get into the range [-1..1]. As this division just adjusts the exponent part, this division doesn't lose any precision.
So there is only one rounding; the other operations don't lose precision. This chain of operations should have the same effect as calculating the result in infinite precision and then rounding to float (this theoretical rounding has the same effect as the rounding in step 2).
But, to be absolutely sure, I've verified with brute force checking all the 32-bit values that this expression results in the same value as (float)((double)value/0x80000000-1.0).
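A sampled version of that check might look like the sketch below. The function names and the `stride` parameter are assumptions made for this example. Converting the XOR result to `int32_t` relies on modulo-2^32 wrapping, which is guaranteed since C++20 and is what mainstream compilers did before that.

```cpp
#include <cstdint>

// The expression from the answer: flip the top bit, reinterpret as signed,
// convert to float (the only rounding step), then scale by 2^-31 (exact).
inline float u32_to_unit_xor(std::uint32_t value) {
    return static_cast<float>(static_cast<std::int32_t>(value ^ 0x80000000u)) / 2147483648.0f;
}

// Reference result computed in double and rounded to float once at the end.
inline float u32_to_unit_double(std::uint32_t value) {
    return static_cast<float>(static_cast<double>(value) / 2147483648.0 - 1.0);
}

// Compare both conversions on every `stride`-th input (stride must be nonzero)
// and return the number of mismatches found.
inline std::uint64_t count_mismatches(std::uint32_t stride) {
    std::uint64_t bad = 0;
    for (std::uint64_t v = 0; v <= 0xffffffffull; v += stride) {
        const auto u = static_cast<std::uint32_t>(v);
        if (u32_to_unit_xor(u) != u32_to_unit_double(u)) ++bad;
    }
    return bad;
}
```

Calling `count_mismatches(1)` repeats the exhaustive comparison; a larger stride gives a quick spot check. Note that inputs close to 0xffffffff round up to exactly 1.0f in both versions, so the upper end of the range is inclusive.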
I suggest (if you want to avoid division and use a scale factor that is exactly representable as a float, 1.0*2^-31):
float e = i * ldexp(1.0,-31) - 1.0;
Any ideas how to achieve this without using double-precision floating-point?
Without assuming too much about the insides of float:
Shift u until the most significant bit is set, halving the float conversion value.
"keeping the uniform distribution"
50% of the uint32_t values will be in the [0.5 ... 1.0)
25% of the uint32_t values will be in the [0.25 ... 0.5)
12.5% of the uint32_t values will be in the [0.125 ... 0.25)
6.25% of the uint32_t values will be in the [0.0625 ... 0.125)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
float ui32to0to1(uint32_t u) {
    if (u) {
        float band = 1.0f/(1llu<<32);
        while ((u & 0x80000000) == 0) {
            u <<= 1;
            band *= 0.5f;
        }
        return (float)u * band;
    }
    return 0.0f;
}
Some test code to show functional equivalence to double.
int test(uint32_t u) {
    volatile float f0 = (float) ((double)u / (1llu<<32));
    volatile float f1 = ui32to0to1(u);
    if (f0 != f1) {
        printf("%8lX %.7e %.7e\n", (unsigned long) u, f0, f1);
        return 1;
    }
    return 0;
}

int main(void) {
    for (int i=0; i<100000000; i++) {
        test(rand()*65535u ^ rand());
    }
    return 0;
}
Various optimizations are possible, especially with assuming properties of float. Yet for an initial answer, I'll stick to a general approach.
For improved efficiency, the loop needs only to iterate from 32 down to FLT_MANT_DIG which is usually 24.
float ui32to0to1(uint32_t u) {
    float band = 1.0f/(1llu<<32);
    for (int i = 32; (i>FLT_MANT_DIG && ((u & 0x80000000) == 0)); i--) {
        u <<= 1;
        band *= 0.5f;
    }
    return (float)u * band;
}
This answer maps [0 to 2^32-1] to [0.0 to 1.0).
To map [0 to 2^32-1] to (-1.0 to 1.0), use the following. It can form -0.0.
if (u >= 0x80000000) {
return ui32to0to1((u - 0x80000000)*2);
} else
return -ui32to0to1((0x7FFFFFFF - u)*2);

sum of weights should be exactly 1.0 no matter on which platform it is

I have such a function that calculates weights according to Gaussian distribution:
const float dx = 1.0f / static_cast<float>(points - 1);
const float sigma = 1.0f / 3.0f;
const float norm = 1.0f / (sqrtf(2.0f * static_cast<float>(M_PI)) * sigma);
const float divsigma2 = 0.5f / (sigma * sigma);
m_weights[0] = 1.0f;
for (int i = 1; i < points; i++)
{
    float x = static_cast<float>(i) * dx;
    m_weights[i] = norm * expf(-x * x * divsigma2) * dx;
    m_weights[0] -= 2.0f * m_weights[i];
}
The specific numbers in the calculation above do not matter. The only thing that matters is that I start with m_weights[0] = 1.0f; and, each time I calculate m_weights[i], I subtract it twice from m_weights[0] like this:
m_weights[0] -= 2.0f * m_weights[i];
to ensure that w[0] + 2 * w[i] (1..N) will sum to exactly 1.0f. But it does not. This assert fails:
float wSum = 0.0f;
for (size_t i = 0; i < m_weights.size(); ++i)
{
    float w = m_weights[i];
    if (i == 0) {
        wSum += w;
    } else {
        wSum += (w + w);
    }
}
assert(wSum == 1.0 && "Weights sum is not 1.");
How can I ensure the sum to be 1.0f on all platforms?
You can't. Floating point isn't like that. Even adding the same values can produce different results according to the cpu used.
All you can do is define some accuracy value and ensure that you end up with 1.0 +/- that value.
Because a float has only 24 bits of precision (23 of them stored explicitly), rounding error quickly accumulates; therefore, even if the rest of the code is correct, your sum becomes something like 1.0000001 or 0.9999999 (have you watched it in the debugger or tried printing it to the console, by the way?). To improve precision you can replace float with double, but the sum will still not be exactly 1.0: the error will just be smaller, something like 1e-16 instead of 1e-7.
The second thing to do is to replace strict comparison to 1.0 with a range comparison, like:
assert(fabs(wSum - 1.0) <= 1e-13 && "Weights sum is not 1.");
Here 1e-13 is the epsilon within which you consider two floating-point numbers equal. If you choose to go with float (not double), you may need epsilon like 1e-6 .
Depending on how large your weights are and how many points there are, accumulated error can become larger than that epsilon. In that case you would need special algorithms for keeping the precision higher, such as sorting the numbers by their absolute values prior to summing them up starting with the smallest numbers.
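The sorting approach mentioned above could be sketched as follows. The helper name `sum_smallest_first` is hypothetical, and the function takes its argument by value so that the caller's data is left in its original order.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sum floats after sorting them by magnitude, smallest first, so that small
// terms can accumulate before they meet a large partial sum.
inline float sum_smallest_first(std::vector<float> values) {
    std::sort(values.begin(), values.end(),
              [](float a, float b) { return std::fabs(a) < std::fabs(b); });
    float sum = 0.0f;
    for (float v : values) sum += v;
    return sum;
}
```

For example, for the values 16777216, 1, 1, 1, 1 this returns 16777220. Adding them in the given order with strict single-precision arithmetic would typically leave the sum at 16777216, because 16777216 + 1 rounds back to 16777216 each time. This reduces the error but still does not guarantee an exact result in general.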
How can I ensure the sum to be 1.0f on all platforms?
As the other answers (and comments) have stated, you can't achieve this, due to the inexactness of floating point calculations.
One solution is that, instead of using double, use a fixed point or multi-precision library such as GMP, Boost Multiprecision Library, or one of the many others out there.
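As a minimal sketch of the fixed-point idea, the weights can be stored as integers in which 65536 stands for 1.0. Integer addition is exact, so the identity w[0] + 2 * (w[1] + ... + w[N]) == 65536 holds on every platform. The Q16 scale, the function names, and the use of double for the intermediate Gaussian values are choices made for this example, not something taken from the question.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Gaussian weights in Q16 fixed point (65536 represents 1.0).
// The centre weight absorbs whatever the rounded side weights leave over.
inline std::vector<std::int32_t> gaussian_weights_q16(int points) {
    assert(points >= 2);
    const std::int32_t one = 1 << 16;
    const double dx = 1.0 / (points - 1);
    const double sigma = 1.0 / 3.0;
    const double norm = 1.0 / (std::sqrt(2.0 * 3.14159265358979323846) * sigma);
    const double divsigma2 = 0.5 / (sigma * sigma);
    std::vector<std::int32_t> w(points);
    std::int32_t centre = one;
    for (int i = 1; i < points; ++i) {
        const double x = i * dx;
        w[i] = static_cast<std::int32_t>(std::lround(norm * std::exp(-x * x * divsigma2) * dx * one));
        centre -= 2 * w[i]; // exact integer subtraction
    }
    w[0] = centre;
    return w;
}

// Total weight in Q16: w[0] + 2 * (w[1] + ... + w[N]).
inline std::int64_t total_q16(const std::vector<std::int32_t>& w) {
    std::int64_t total = w.empty() ? 0 : w[0];
    for (std::size_t i = 1; i < w.size(); ++i) total += 2 * static_cast<std::int64_t>(w[i]);
    return total;
}
```

The individual weights may still differ by one unit between platforms if `std::exp` rounds differently, but the total is 65536 by construction.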

Code for normal distribution returns unexpected values [duplicate]

From this question: Random number generator which gravitates numbers to any given number in range? I did some research since I've come across such a random number generator before. All I remember was the name "Mueller", so I guess I found it, here:
Box-Mueller transform
I can find numerous implementations of it in other languages, but I can't seem to implement it correctly in C#.
This page, for instance, The Box-Muller Method for Generating Gaussian Random Numbers says that the code should look like this (this is not C#):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
double gaussian(void)
{
    static double v, fac;
    static int phase = 0;
    double S, Z, U1, U2, u;
    if (phase)
        Z = v * fac;
    else
    {
        do
        {
            U1 = (double)rand() / RAND_MAX;
            U2 = (double)rand() / RAND_MAX;
            u = 2. * U1 - 1.;
            v = 2. * U2 - 1.;
            S = u * u + v * v;
        } while (S >= 1);
        fac = sqrt (-2. * log(S) / S);
        Z = u * fac;
    }
    phase = 1 - phase;
    return Z;
}
Now, here's my implementation of the above in C#. Note that the transform produces 2 numbers, hence the trick with the "phase" above. I simply discard the second value and return the first.
public static double NextGaussianDouble(this Random r)
{
    double u, v, S;
    do
    {
        u = 2.0 * r.NextDouble() - 1.0;
        v = 2.0 * r.NextDouble() - 1.0;
        S = u * u + v * v;
    }
    while (S >= 1.0);
    double fac = Math.Sqrt(-2.0 * Math.Log(S) / S);
    return u * fac;
}
My question is with the following specific scenario, where my code doesn't return a value in the range of 0-1, and I can't understand how the original code can either.
u = 0.5, v = 0.1
S becomes 0.5*0.5 + 0.1*0.1 = 0.26
fac becomes ~3.22
the return value is thus ~0.5 * 3.22 or ~1.6
That's not within 0 .. 1.
What am I doing wrong/not understanding?
If I modify my code so that instead of multiplying fac with u, I multiply by S, I get a value that ranges from 0 to 1, but it has the wrong distribution (seems to have a maximum distribution around 0.7-0.8 and then tapers off in both directions.)
Your code is fine. Your mistake is thinking that it should return values exclusively within [0, 1]. The (standard) normal distribution is a distribution with nonzero weight on the entire real line. That is, values outside of [0, 1] are possible. In fact, values within [-1, 0] are just as likely as values within [0, 1], and moreover, the complement of [0, 1] has about 66% of the weight of the normal distribution. Therefore, 66% of the time we expect a value outside of [0, 1].
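The 66% figure can be reproduced from the standard normal CDF, which is expressible through the error function. The helper names below are made up for this short sketch.

```cpp
#include <cmath>

// Standard normal CDF expressed through the error function.
inline double normal_cdf(double z) {
    return 0.5 * (1.0 + std::erf(z / std::sqrt(2.0)));
}

// Probability that a standard normal variate falls inside [a, b].
inline double normal_prob(double a, double b) {
    return normal_cdf(b) - normal_cdf(a);
}
```

Here `normal_prob(0.0, 1.0)` is about 0.341, so roughly 66% of the samples are expected to fall outside [0, 1].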
Also, I think this is not the Box-Mueller transform, but is actually the Marsaglia polar method.
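For comparison, the basic (trigonometric) form of the Box-Muller transform looks roughly like the sketch below. It takes the two uniform variates as arguments so that it stays deterministic; the function name is hypothetical. The polar method in the question instead draws a point inside the unit disc by rejection, which avoids the sine and cosine calls.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Basic Box-Muller transform: two independent uniform variates in,
// two independent standard normal variates out.
// Assumes 0 < u1 <= 1 (log(0) is not finite) and 0 <= u2 < 1.
inline std::pair<double, double> box_muller(double u1, double u2) {
    assert(u1 > 0.0 && u1 <= 1.0);
    const double two_pi = 6.283185307179586476925286766559;
    const double r = std::sqrt(-2.0 * std::log(u1)); // radius
    return { r * std::cos(two_pi * u2), r * std::sin(two_pi * u2) };
}
```

As with the polar method, the outputs are not confined to [0, 1]: the radius grows without bound as `u1` approaches 0.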
I am no mathematician, or statistician, but if I think about this I would not expect a Gaussian distribution to return numbers in an exact range. Given your implementation the mean is 0 and the standard deviation is 1 so I would expect values distributed on the bell curve with 0 at the center and then reducing as the numbers deviate from 0 on either side. So the sequence would definitely cover both +/- numbers.
Then since it is statistical, why would it be hard limited to -1..1 just because the standard deviation is 1? There can statistically be some play on either side and still fulfill the statistical requirement.
The uniform random variate is indeed within 0..1, but the gaussian random variate (which is what Box-Muller algorithm generates) can be anywhere on the real line. See wiki/NormalDistribution for details.
I think the function returns polar coordinates. So you need both values to get correct results.
Also, Gaussian distribution is not between 0 .. 1. It can easily end up as 1000, but probability of such occurrence is extremely low.
This is a monte carlo method so you can't clamp the result, but what you can do is ignore samples.
// return random value in the range [0,1].
double gaussian_random()
{
    double sigma = 1.0/8.0; // or whatever works.
    while ( 1 ) {
        double z = gaussian() * sigma + 0.5;
        if (z >= 0.0 && z <= 1.0)
            return z;
    }
}