I am writing a unit test for a math function and I would like to be able to "walk" all possible floats/doubles.
Due to IEEE shenanigans, floating-point types cannot be incremented with ++ near their extremities. See this question for more details. That answer states:
one can only add multiples of 2^(n-N)
But never mentions what little n is.
A solution to iterate all possible values from +0.0 to +infinity is given in this great blog post. The technique involves using a union with an int to walk the different values of a float. This works due to the following properties explained in the post, though they are only valid for positive numbers.
Adjacent floats have adjacent integer representations
Incrementing the integer representation of a float moves to the next representable float, moving away from zero
His solution for +0.0 to +infinity (0.f to std::numeric_limits<float>::max()):
#include <cstdint>

union Float_t {
    int32_t RawExponent() const { return (i >> 23) & 0xFF; }
    int32_t i;
    float f;
};

Float_t allFloats;
allFloats.f = 0.0f;
while (allFloats.RawExponent() < 255) {
    // Use allFloats.f here.
    allFloats.i += 1;
}
Is there a solution for -infinity to +0.0 (std::numeric_limits<float>::lowest() to 0.f)?
I've tested std::nextafter and std::nexttoward and couldn't get them to work. Maybe this is an MSVC issue?
I would be ok with any sort of hack since this is a unit test. Thanks!
You can walk all 32-bit bit representations by using all values of a 32-bit unsigned int. Then you will walk really all representations, positive and negative, including both zeros (there are two: +0.0 and -0.0) and also all the not-a-number (NaN) representations. You may or may not want to filter out the NaN representations, or just filter out the signaling ones and leave the quiet (non-signaling) ones in. This depends on your use case.
Example:
for (uint32_t i = 0;;)
{
    float f;
    // Type punning: force the bit representation of i into f.
    // Type punning is tricky because it is mostly undefined behavior in C/C++;
    // using memcpy() usually avoids any type punning warning.
    memcpy(&f, &i, sizeof(f));
    // Use f here.
    // Warning: using signaling NaNs may throw exceptions or raise signals.
    i++;
    if (i == 0)
        break;
}
Alternatively, you can also walk a signed 32-bit int from -2^31 to +(2^31 - 1); it makes no difference (a sketch follows below).
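For illustration, a minimal sketch of that signed variant (the function name is just for the example; the bound is checked before the increment so the signed counter never overflows):

#include <cstdint>
#include <cstring>

void walk_all_floats_signed()
{
    for (int32_t i = INT32_MIN;;)
    {
        float f;
        std::memcpy(&f, &i, sizeof(f)); // same memcpy type punning as above
        // Use f here.
        if (i == INT32_MAX)
            break;
        ++i;
    }
}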
Pascal Cuoq correctly points out std::nextafter is the right solution. I had a problem elsewhere in my code. Sorry for the unnecessary question.
#include <cassert>
#include <cmath>
#include <limits>
float i = std::numeric_limits<float>::lowest();
float hi = std::numeric_limits<float>::max();
float new_i = std::nextafterf(i, hi);
assert(i != new_i);
double d = std::numeric_limits<double>::lowest();
double hi_d = std::numeric_limits<double>::max();
double new_d = std::nextafter(d, hi_d);
assert(d != new_d);
long double ld = std::numeric_limits<long double>::lowest();
long double hi_ld = std::numeric_limits<long double>::max();
long double new_ld = std::nextafterl(ld, hi_ld);
assert(ld != new_ld);
for (float d = std::numeric_limits<float>::lowest();
     d < std::numeric_limits<float>::max();
     d = std::nextafterf(d, std::numeric_limits<float>::max())) {
    // Wait a lifetime?
}
Iterating through all the float values can be done with a simple understanding of the floating-point representation:
The distance between consecutive subnormal values is the minimum normal times the “epsilon”. Simply iterate through all the subnormals using this distance as an increment.
The distance between the normal values at the lowest exponent is the same. Step through them with the same increment.
For each exponent, the distance increases according to the floating-point radix. Simply multiply the increment by the radix and step through all the values for the next exponent.
Repeat until infinity is reached.
Observe that the inner loop in the code below is simply:
for (; x < Limit; x += Increment)
    Test(x);
This has the advantage that only normal floating-point arithmetic is used. The inner loop contains only one addition and one comparison (plus any tests you want to perform with each number). No library functions are called in the loop, no representations are dissected or copied to general registers or otherwise manipulated. There is nothing to impede performance.
This code steps through only the non-negative numbers. The negative numbers can be tested separately in the same way or can share this code by inserting a call Test(-x).
#include <limits>

static void Test(float x)
{
    // Insert unit test for value x here.
}

int main(void)
{
    typedef float T;
    static const int Radix = std::numeric_limits<T>::radix;
    static const T Infinity = std::numeric_limits<T>::infinity();

    /* Increment is the current distance between floating-point numbers. We
       start it at the distance between subnormal numbers.
    */
    T Increment =
        std::numeric_limits<T>::min() * std::numeric_limits<T>::epsilon();

    /* Limit is the next boundary where the distance between floating-point
       numbers changes. We will increment up to that limit and then adjust
       the limit and increment. We start it at the top of the first set of
       normals, which allows the first loop to increment first through the
       subnormals and then through the normals with the lowest exponent.
       (These two sets have the same step size between adjacent values.)
    */
    T Limit = std::numeric_limits<T>::min() * Radix;

    /* Start with zero and continue until we reach infinity.
       We execute an inner loop that iterates through all the significands of
       one floating-point exponent. Each time it completes, we step up the
       limit and increment.
    */
    for (T x = 0; x < Infinity; Limit *= Radix, Increment *= Radix)
        // Increment x through all the significands with the current exponent.
        for (; x < Limit; x += Increment)
            // Test with the current value of x.
            Test(x);

    // Also test infinity.
    Test(Infinity);
}
(This code assumes the floating-point type has subnormals, and that they are not flushed to zero. The code can be readily adjusted to support these alternatives as well.)
I have two integers n and d. These can be exactly represented by double dn(n) and double dd(d). Is there a reliable way in C++ to check if
double result = dn/dd
contains a rounding error? If it were just integer division, checking whether (n/d) * d == n would work, but doing that with double-precision arithmetic could hide rounding errors.
Edit: Shortly after posting this, it struck me that changing the rounding mode to round-down would make the (n/d)*d == n test work for double (a sketch of that idea follows below). But if there is a simpler solution, I'd still like to hear it.
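A minimal sketch of that round-down idea, assuming positive n and d, that dn and dd are their exact double representations, and that the platform supports fesetround and FE_DOWNWARD (the function name is just for the example):

#include <cfenv>

#pragma STDC FENV_ACCESS ON

bool division_is_exact_rounddown(double dn, double dd)
{
    const int old_mode = std::fegetround();
    std::fesetround(FE_DOWNWARD);   // both operations now round toward -infinity
    volatile double q = dn / dd;    // volatile discourages compile-time constant folding
    const bool exact = q * dd == dn;
    std::fesetround(old_mode);      // restore the previous rounding mode
    return exact;
}

With round-down, an inexact quotient is strictly below n/d, so q*dd stays strictly below dn and the equality fails; an exact quotient reproduces dn exactly.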
If a hardware FMA is available, then, in most cases (cases where n is expected not to be small, per below), the fastest test may be:
#include <cmath>
…
double q = dn/dd;
if (std::fma(-q, dd, dn))
std::cout << "Quotient was not exact.\n";
This can fail if dn - q*dd is so small it is rounded to zero, which occurs in round-to-nearest-ties-to-even mode if its magnitude is smaller than half the smallest representable positive value (commonly 2^-1074). That can happen only if dn itself is small. I expect I could calculate some bound on dn for that if desired, and, given that dn = n and n is an integer, that should not occur.
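A self-contained version of that test, with hypothetical example inputs:

#include <cmath>
#include <iostream>

int main()
{
    const double dn = 3.0, dd = 7.0; // example inputs; any exactly representable integers work
    const double q = dn / dd;
    // fma computes dn - q*dd with a single rounding, so a nonzero result
    // means the division was inexact (subject to the underflow caveat above).
    if (std::fma(-q, dd, dn) != 0.0)
        std::cout << "Quotient was not exact.\n";
}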
Ignoring the exponent bounds, a way to test the significands for divisibility is:
#include <cfloat>
#include <cmath>
…
int sink; // Needed for frexp argument but will be ignored.
double fn = std::ldexp(std::frexp(n, &sink), DBL_MANT_DIG);
double fd = std::frexp(d, &sink);
if (std::fmod(fn, fd))
std::cout << "Quotient will not be exact.\n";
Given that n and d are integers that are exactly representable in the floating-point type, I think we could show their exponents cannot be such that the above test would fail. There are cases where n is a small integer and d is large (a value from 2^1023 to 2^1024 - 2^972, inclusive) that I need to think about.
If you ignore overflow and underflow (which you should be able to do unless the integer types representing d and n are very wide), then the (binary) floating-point division dn/dd is exact iff d is a divisor of n times a power of two (equivalently, iff the largest odd divisor of d divides n).
An algorithm to check for this may look like:
assert(d != 0);
while ((d & 1) == 0) d >>= 1; // extract the largest odd divisor of d
int exact = n % d == 0;
This is cheaper than changing the FPU rounding mode if you want the rounding mode to be "to nearest" the rest of the time, and there probably exist bit-twiddling tricks that can speed up the extraction of the largest odd divisor of d (one such trick is sketched below).
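One such trick, as a sketch (using an unsigned type so that d & -d is well defined; the function name is just for the example):

#include <cassert>

bool division_would_be_exact(unsigned n, unsigned d)
{
    assert(d != 0);
    // d & -d isolates the lowest set bit, so dividing by it strips all trailing zero bits in one step.
    unsigned odd = d / (d & -d);
    return n % odd == 0;
}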
Is there a reliable way in C++ to check if double result = dn/dd contains a rounding error?
Should your system allow access to the various FP flags, test for FE_INEXACT after the division.
If FP code is expensive, then at least this code can be used to check integer-only solutions.
A C solution follows (I do not have access to a compliant C++ compiler to test right now):
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

// Return 0: no rounding error
// Return 1: rounding error
// Return -1: uncertain
int Rounding_error_detection(int n, int d) {
    double dn = n;
    double dd = d;
    if (feclearexcept(FE_INEXACT)) return -1;
    volatile double result = dn / dd;
    (void) result;
    int set_excepts = fetestexcept(FE_INEXACT);
    return set_excepts != 0;
}
Test code
void Rounding_error_detection_Test(int n, int d) {
    printf("Rounding_error_detection(%d, %d) --> %d\n",
           n, d, Rounding_error_detection(n, d));
}

int main(void) {
    Rounding_error_detection_Test(3, 6);
    Rounding_error_detection_Test(3, 7);
}
Output
Rounding_error_detection(3, 6) --> 0
Rounding_error_detection(3, 7) --> 1
If the quotient q=dn/dd is exact, it will divide dn exactly dd times.
Since you have dd being integer, you could test exactness with integer division.
Instead of testing the quotient multiplied by dd with (dn/dd)*dd==dn where round off errors can compensate, you should rather test the remainder.
Indeed, std::remainder is always exact:
if (std::remainder(dn, dn/dd) != 0)
    std::cout << "Quotient was not exact." << std::endl;
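A self-contained sketch of that test, with hypothetical example inputs:

#include <cmath>
#include <iostream>

int main()
{
    const double dn = 3.0, dd = 7.0; // example inputs
    const double q = dn / dd;
    // std::remainder is computed exactly, so a nonzero result means dn is not an exact multiple of q.
    if (std::remainder(dn, q) != 0.0)
        std::cout << "Quotient was not exact." << std::endl;
}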
I want to determine (in C++) if one float number is the multiplicative inverse of another float number. The problem is that I have to use a third variable to do it. For instance, this code:
float x=5,y=0.2;
if(x==(1/y)) cout<<"They are the multiplicative inverse of eachother"<<endl;
else cout<<"They are NOT the multiplicative inverse of eachother"<<endl;
will output: "they are not..." which is wrong and this code:
float x=5,y=0.2,z;
z=1/y;
if(x==z) cout<<"They are the multiplicative inverse of eachother"<<endl;
else cout<<"They are NOT the multiplicative inverse of eachother"<<endl;
will output: "they are..." which is right.why is this happening?
The Float Precision Problem
You have two problems here, but both come from the same root:
You can't compare floats precisely. You can't subtract or divide them precisely. You can't compute anything with them precisely. Any operation on them could (and almost always does) bring some error into the result. Even a = 0.2f is not a precise operation. The deeper reasons for that are very well explained by the authors of the other answers here. (My thanks and votes to them for that.)
Here comes your first and simpler error. You should never, never, never, never, NEVER use == on them, or its equivalent in any language.
Instead of a == b, use Abs(a - b) < HighestPossibleError.
But this is not the sole problem in your task.
Abs(1/y-x)<HighestPossibleError won't work, either. At least, it won't work often enough. Why?
Let's take the pair x = 1000 and y = 0.001, and take the "starting" relative error of y to be 10^-6.
(Relative error = error/value).
Relative errors add under multiplication and division.
1/y is about 1000. Its relative error is the same 10^-6. ("1" has no error.)
That makes the absolute error 1000 * 10^-6 = 0.001. When you subtract x later, that error will be all that remains. (Absolute errors add under addition and subtraction, and the error of x is negligibly small.) Surely you are not counting on such large errors; HighestPossibleError would surely be set lower, and your program would reject a good pair of x, y.
So, the next two rules for float operations: try not to divide a greater value by a lesser one, and God save you from subtracting close values after that.
There are two simple ways to escape this problem.
Find which of x, y has the greater absolute value, divide 1 by the greater one, and only later subtract the lesser one.
If you want to compare 1/y against x, then while you are still working with symbols, not values, and your operations make no errors, multiply both sides of the comparison by y,
and you have 1 against x*y. (Usually you should check signs in that operation, but here we use absolute values, so it is clean.) The resulting comparison has no division at all.
In a shorter way:
1/y V x <=> y*(1/y) V x*y <=> 1 V x*y
We already know that a comparison such as 1 against x*y should be done like this:
const float HighestPossibleError=1e-10;
if(Abs(x*y-1.0)<HighestPossibleError){...
That is all.
P.S. If you really need it all on one line, use:
if(Abs(x*y-1.0)<1e-10){...
But it is bad style. I wouldn't advise it.
P.P.S. In your second example the compiler optimizes the code so that it sets z to 5 before running any code. So checking 5 against 5 works even for floats.
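A runnable form of the comparison above, for reference (Abs is spelled std::fabs in C++; the tolerance is illustrative, and a looser bound than 1e-10 suits float better):

#include <cmath>
#include <iostream>

int main()
{
    const float x = 5.0f, y = 0.2f;
    const float HighestPossibleError = 1e-6f;
    if (std::fabs(x * y - 1.0f) < HighestPossibleError)
        std::cout << "They are (approximately) the multiplicative inverse of each other\n";
    else
        std::cout << "They are NOT the multiplicative inverse of each other\n";
}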
The problem is that 0.2 cannot be represented exactly in binary, because its binary expansion has an infinite number of digits:
1/5: 0.0011001100110011001100110011001100110011...
This is similar to how 1/3 cannot be represented exactly in decimal. Since x is stored in a float which has a finite number of bits, these digits will get cut off at some point, for example:
x: 0.0011001100110011001100110011001
The problem arises because CPUs often use a higher precision internally, so when you've just calculated 1/y, the result will have more digits, and when you load x to compare them, x will get extended to match the internal precision of the CPU.
1/y: 0.0011001100110011001100110011001100110011001100110011
x: 0.0011001100110011001100110011001000000000000000000000
So when you do a direct bit-by-bit comparison, they are different.
In your second example, however, storing the result into a variable means it gets truncated before doing the comparison, so comparing them at this precision, they're equal:
x: 0.0011001100110011001100110011001
z: 0.0011001100110011001100110011001
Many compilers have switches you can enable to force intermediate values to be truncated at every step for consistency; however, the usual advice is to avoid doing direct comparisons between floating-point values and instead check whether they differ by less than some epsilon value, which is what Gangnus is suggesting.
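One common shape of such a check, as a sketch (the tolerance is an assumption you have to tune for your data):

#include <cmath>

bool nearly_equal(float a, float b, float relative_tolerance = 1e-5f)
{
    // Scale the tolerance by the magnitudes involved so the check works across exponents.
    return std::fabs(a - b) <= relative_tolerance * std::fmax(std::fabs(a), std::fabs(b));
}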
You will have to precisely define what it means for two approximations to be multiplicative inverses. Otherwise, you won't know what it is you're supposed to be testing.
0.2 has no exact binary representation. If you store numbers that have no exact representation with limited precision, you won't get answers that are exactly correct.
The same things happens in decimal. For example, 1/3 has no exact decimal representation. You can store it as .333333. But then you have a problem. Are 3 and .333333 multiplicative inverses? If you multiply them, you get .999999. If you want the answer to be "yes" you'll have to create a test for multiplicative inverses that isn't as simple as multiplying and testing for equality to 1.
The same thing happens with binary.
The discussions in other replies are great and so I won't repeat any of them, but there's no code. Here's a little bit of code to actually check if a pair of floats gives exactly 1.0 when multiplied.
The code makes a few assumptions/assertions (which are normally met on the x86 platform):
- floats are 32-bit binary (AKA single precision) IEEE-754
- either ints or longs are 32-bit (I decided not to rely on the availability of uint32_t)
- memcpy() copies floats to ints/longs such that 8873283.0f becomes 0x4B076543 (i.e. certain "endianness" is expected)
One extra assumption is this:
- it receives the actual floats that * would multiply (i.e. multiplication of floats wouldn't use higher precision values that the math hardware/library can use internally)
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <assert.h>
#define C_ASSERT(expr) extern char CAssertExtern[(expr)?1:-1]
#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned int uint32;
#else
typedef unsigned long uint32;
#endif
typedef unsigned long long uint64;
C_ASSERT(CHAR_BIT == 8);
C_ASSERT(sizeof(uint32) == 4);
C_ASSERT(sizeof(float) == 4);
int ProductIsOne(float f1, float f2)
{
    uint32 m1, m2;
    int e1, e2, s1, s2;
    int e;
    uint64 m;

    // Make sure floats are 32-bit IEEE-754 and
    // reinterpreted as integers as we expect
    {
        static const float testf = 8873283.0f;
        uint32 testi;
        memcpy(&testi, &testf, sizeof(testf));
        assert(testi == 0x4B076543);
    }

    memcpy(&m1, &f1, sizeof(f1));
    s1 = m1 >= 0x80000000;
    m1 &= 0x7FFFFFFF;
    e1 = m1 >> 23;
    m1 &= 0x7FFFFF;
    if (e1 > 0) m1 |= 0x800000;

    memcpy(&m2, &f2, sizeof(f2));
    s2 = m2 >= 0x80000000;
    m2 &= 0x7FFFFFFF;
    e2 = m2 >> 23;
    m2 &= 0x7FFFFF;
    if (e2 > 0) m2 |= 0x800000;

    if (e1 == 0xFF || e2 == 0xFF || s1 != s2) // Inf, NaN, different signs
        return 0;

    m = (uint64)m1 * m2;
    if (!m || (m & (m - 1))) // not a power of 2
        return 0;

    e = e1 + !e1 - 0x7F - 23 + e2 + !e2 - 0x7F - 23;
    while (m > 1) m >>= 1, e++;
    return e == 0;
}
const float testData[][2] =
{
    { .1f, 10.0f },
    { 0.5f, 2.0f },
    { 0.25f, 2.0f },
    { 4.0f, 0.25f },
    { 0.33333333f, 3.0f },
    { 0.00000762939453125f, 131072.0f }, // 2^-17 * 2^17
    { 1.26765060022822940E30f, 7.88860905221011805E-31f }, // 2^100 * 2^-100
    { 5.87747175411143754E-39f, 1.70141183460469232E38f }, // 2^-127 (denormalized) * 2^127
};

int main(void)
{
    int i;
    for (i = 0; i < sizeof(testData) / sizeof(testData[0]); i++)
        printf("%g * %g %c= 1\n",
               testData[i][0], testData[i][1],
               "!="[ProductIsOne(testData[i][0], testData[i][1])]);
    return 0;
}
Output (as run at ideone.com):
0.1 * 10 != 1
0.5 * 2 == 1
0.25 * 2 != 1
4 * 0.25 == 1
0.333333 * 3 != 1
7.62939e-06 * 131072 == 1
1.26765e+30 * 7.88861e-31 == 1
5.87747e-39 * 1.70141e+38 == 1
What is striking is that whatever the rounding rule is, you expect the outcome of the two versions to be the same (either twice wrong or twice right)!
Most probably, in the first case a promotion to higher accuracy in the FPU registers takes place when evaluating x == 1/y, whereas z = 1/y really stores the single-precision result.
Other contributors have explained why 5 == 1/0.2 can fail; I needn't repeat that.
Possible Duplicate:
Most effective way for float and double comparison
I have two values (floats) I am attempting to add together and average. The issue is that occasionally these values add up to zero, thus not requiring them to be averaged.
The situation I am in specifically contains the values "-1" and "1", yet when added together I am given the value "-1.19209e-007" which is clearly not 0. Any information on this?
I'm sorry but this doesn't make sense to me.
Two floating-point values, if they are exactly the same magnitude but with opposite sign, will always produce 0 when added together. This is how floating-point operations work.
float a = 0.2f;
float b = -0.2f;
float f = (a + b) / 2;
printf("%f %d\n", f, f != 0); // will print out 0.000000 0
It will always be 0, even if the compiler doesn't optimize the code.
There is no rounding error to take into account if a and b have the same magnitude but opposite sign! That is, if the highest bit of a is 0 and the highest bit of b is 1 and all other bits are the same, the result cannot be anything other than 0.
But if a and b are slightly different, of course, the result can be non-zero.
One possible solution to avoid this is to use a tolerance...
float f = (a + b) / 2;
if (fabs(f) < 0.000001f)
    f = 0;
We are using a simple tolerance to see if our value is near to zero.
A nice example code to show this is...
#include <stdio.h>

int main(int argc, char **argv)
{
    (void)argv; // unused; argc alone is the "trick" mentioned below
    for (int i = -10000000; i <= 10000000 * argc; ++i)
    {
        if (i != 0)
        {
            float a = 3.14159265f / i;
            float b = -a + (argc - 1);
            float f = (a + b) / 2;
            if (f != 0)
                printf("%f %f\n", a, f);
        }
    }
    printf("completed\n");
    return 0;
}
I'm using "argc" here as a trick to force the compiler to not optimize out our code.
At least right off, this sounds like typical floating point imprecision.
The usual way to deal with it is to round your numbers to the correct number of significant digits. In this case, your average would be -1.19209e-08 (i.e., 0.00000001192). To (say) six or seven significant digits, that is zero.
Take the sum of all your numbers, divide by your count. Round off your answer to something reasonable before you do prints, reports, comparisons, or whatever you're doing (one such rounding helper is sketched below).
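For example, a small rounding helper along those lines (the number of decimal places is whatever you consider reasonable):

#include <cmath>

double round_to(double value, int decimals)
{
    const double scale = std::pow(10.0, decimals);
    return std::round(value * scale) / scale;
}

With decimals = 6, the -1.19209e-007 sum from the question rounds to zero.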
Again, do some searching on this, but here is the basic explanation...
The computer approximates floating-point numbers in base 2 instead of base 10. This means that, for example, 0.2 (when converted to binary) is actually 0.001100110011... on forever. Since the computer cannot store these digits forever, it must approximate.
Because of these approximations, we lose "precision" in calculations, hence "single" and "double" precision floating-point numbers. This is why you never test whether a float is exactly 0; instead, you test whether it is below some threshold which you want to treat as zero.
Given a random source (a generator of random bit stream), how do I generate a uniformly distributed random floating-point value in a given range?
Assume that my random source looks something like:
unsigned int GetRandomBits(char* pBuf, int nLen);
And I want to implement
double GetRandomVal(double fMin, double fMax);
Notes:
I don't want the result precision to be limited (for example only 5 digits).
Strict uniform distribution is a must
I'm not asking for a reference to an existing library. I want to know how to implement it from scratch.
For pseudo-code / code, C++ would be most appreciated
I don't think I'll ever be convinced that you actually need this, but it was fun to write.
#include <stdint.h>
#include <cmath>
#include <cstdio>
FILE* devurandom;
bool geometric(int x) {
    // returns true with probability min(2^-x, 1)
    if (x <= 0) return true;
    while (1) {
        uint8_t r;
        fread(&r, sizeof r, 1, devurandom);
        if (x < 8) {
            return (r & ((1 << x) - 1)) == 0;
        } else if (r != 0) {
            return false;
        }
        x -= 8;
    }
}

double uniform(double a, double b) {
    // requires IEEE doubles and 0.0 < a < b < inf and a normal
    // implicitly computes a uniform random real y in [a, b)
    // and returns the greatest double x such that x <= y
    union {
        double f;
        uint64_t u;
    } convert;
    convert.f = a;
    uint64_t a_bits = convert.u;
    convert.f = b;
    uint64_t b_bits = convert.u;
    uint64_t mask = b_bits - a_bits;
    mask |= mask >> 1;
    mask |= mask >> 2;
    mask |= mask >> 4;
    mask |= mask >> 8;
    mask |= mask >> 16;
    mask |= mask >> 32;
    int b_exp;
    frexp(b, &b_exp);
    while (1) {
        // sample uniform x_bits in [a_bits, b_bits)
        uint64_t x_bits;
        fread(&x_bits, sizeof x_bits, 1, devurandom);
        x_bits &= mask;
        x_bits += a_bits;
        if (x_bits >= b_bits) continue;
        double x;
        convert.u = x_bits;
        x = convert.f;
        // accept x with probability proportional to 2^x_exp
        int x_exp;
        frexp(x, &x_exp);
        if (geometric(b_exp - x_exp)) return x;
    }
}

int main() {
    devurandom = fopen("/dev/urandom", "r");
    for (int i = 0; i < 100000; ++i) {
        printf("%.17g\n", uniform(1.0 - 1e-15, 1.0 + 1e-15));
    }
}
Here is one way of doing it.
The IEEE Std 754 double format is as follows:
[s][ e ][ f ]
where s is the sign bit (1 bit), e is the biased exponent (11 bits) and f is the fraction (52 bits).
Beware that the layout in memory will be different on little-endian machines.
For 0 < e < 2047, the number represented is
(-1)**s * 2**(e - 1023) * (1.f)
By setting s to 0, e to 1023 and f to 52 random bits from your bit stream, you get a random double in the interval [1.0, 2.0). This interval is unique in that it contains 2**52 doubles, and these doubles are equidistant. If you then subtract 1.0 from the constructed double, you get a random double in the interval [0.0, 1.0). Moreover, the property about being equidistant is preserved.
From there you should be able to scale and translate as needed.
I'm surprised that for a question this old, nobody had actual code for the best answer. User515430's answer got it right: you can take advantage of the IEEE-754 double format to directly put 52 bits into a double with no math at all. But he didn't give code. So here it is, from my public-domain ojrandlib:
double ojr_next_double(ojr_generator *g) {
    uint64_t r = (OJR_NEXT64(g) & 0xFFFFFFFFFFFFFull) | 0x3FF0000000000000ull;
    return *(double *)(&r) - 1.0;
}
NEXT64() gets a 64-bit random number. If you have a more efficient way of getting only 52 bits, use that instead.
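If the pointer cast bothers you (it breaks strict-aliasing rules in C++), the same bit trick can be written with memcpy; random_bits here is a placeholder for whatever 64-bit generator you use:

#include <cstdint>
#include <cstring>

double bits_to_double(uint64_t random_bits)
{
    // Keep 52 random fraction bits and force the exponent field of [1.0, 2.0).
    uint64_t r = (random_bits & 0xFFFFFFFFFFFFFull) | 0x3FF0000000000000ull;
    double d;
    std::memcpy(&d, &r, sizeof d);
    return d - 1.0; // shift from [1.0, 2.0) down to [0.0, 1.0)
}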
This is easy, as long as you have an integer type with as many bits of precision as a double. For instance, an IEEE double-precision number has 53 bits of precision, so a 64-bit integer type is enough:
#include <limits.h>
double GetRandomVal(double fMin, double fMax) {
    unsigned long long n;
    GetRandomBits((char*)&n, sizeof(n));
    return fMin + (n * (fMax - fMin)) / ULLONG_MAX;
}
This is probably not the answer you want, but the specification here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3225.pdf
in sections [rand.util.canonical] and [rand.dist.uni.real], contains sufficient information to implement what you want, though with slightly different syntax. It isn't easy, but it is possible. I speak from personal experience. A year ago I knew nothing about random numbers, and I was able to do it. Though it took me a while... :-)
The question is ill-posed. What does uniform distribution over floats even mean?
Taking our cue from discrepancy, one way to operationalize your question is to define that you want the distribution that minimizes the following value:
sup over t of | Pr(x <= t) - (t - fMin) / (fMax - fMin) |
where x is the random variable you are sampling with your GetRandomVal(double fMin, double fMax) function, and Pr(x <= t) means the probability that a random x is smaller than or equal to t.
And now you can go on and try to evaluate, e.g., a dabbler's answer. (Hint: all the answers that fail to use the whole precision and stick to, e.g., 52 bits will fail this minimization criterion.)
However, if you just want to be able to generate all float bit patterns that fall into your specified range with equal probability, even if that means that, e.g., asking for GetRandomVal(0, 1000) will create more values between 0 and 1.5 than between 1.5 and 1000, that's easy: any interval of IEEE floating-point numbers, when interpreted as bit patterns, maps easily to a very small number of intervals of unsigned int64. See, e.g., this question. Generating equally distributed random values of unsigned int64 in any given interval is easy.
I may be misunderstanding the question, but what stops you from simply sampling the next n bits from the random bit stream and converting that to a number in the range 0 to 2^n - 1?
To get a random value in [0, 1) you could do something like:
double value = 0;
for (int i = 0; i < 53; i++)
    value = 0.5 * (value + random_bit()); // insert 1 random bit
    // or value = ldexp(value + random_bit(), -1);
    // or group several bits into one single ldexp
return value;
I have a double value f and would like a way to nudge it very slightly larger (or smaller) to get a new value that will be as close as possible to the original but still strictly greater than (or less than) the original.
It doesn't have to be close down to the last bit—it's more important that whatever change I make is guaranteed to produce a different value and not round back to the original.
Check your math.h file. If you're lucky you have the nextafter and nextafterf functions defined. They do exactly what you want in a portable and platform independent way and are part of the C99 standard.
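A minimal sketch of using them for the nudge described in the question:

#include <cmath>
#include <limits>

double nudge_up(double f)
{
    return std::nextafter(f, std::numeric_limits<double>::infinity());
}

double nudge_down(double f)
{
    return std::nextafter(f, -std::numeric_limits<double>::infinity());
}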
Another way to do it (could be a fallback solution) is to decompose your float into the mantissa and exponent part. Incrementing is easy: Just add one to the mantissa. If you get an overflow you have to handle this by incrementing your exponent. Decrementing works the same way.
EDIT: As pointed out in the comments, it is sufficient to just increment the float in its binary representation. The mantissa overflow will increment the exponent, and that's exactly what we want.
That's in a nutshell the same thing that nextafter does.
This won't be completely portable, though. You would have to deal with endianness and the fact that not all machines have IEEE floats (ok, the last reason is more academic).
Also, handling NaNs and infinities can be a bit tricky. You cannot simply increment them, as they are by definition not numbers.
u64 &x = *(u64*)(&f); // u64: an unsigned 64-bit integer type such as uint64_t
x++;
Yes, seriously.
Edit: As someone pointed out, this does not deal with negative numbers, Inf, NaN or overflow properly. A safer version of the above is
u64 &x = *(u64*)(&f);
if (((x >> 52) & 2047) != 2047) // if the exponent is all 1's then f is a NaN or Inf
{
    x += f > 0 ? 1 : -1;
}
In absolute terms, the smallest amount you can add to a floating-point value to make a new distinct value will depend on the current magnitude of the value; it will be the type's machine epsilon scaled by the value's current binary exponent (in other words, one unit in the last place).
Check out the IEEE spec for floating-point representation. The simplest way would be to reinterpret the value as an integer type, add 1, then check (if you care) that you haven't flipped the sign or generated a NaN by examining the sign and exponent bits.
Alternatively, you could use frexp to obtain the current mantissa and exponent, and hence calculate a value to add.
I needed to do the exact same thing and came up with this code:
#include <cfloat>
#include <cmath>

double DoubleIncrement(double value)
{
    int exponent;
    double mantissa = frexp(value, &exponent);
    if (mantissa == 0)
        return DBL_MIN;

    // frexp gives a mantissa in [0.5, 1), where the spacing between adjacent
    // doubles is DBL_EPSILON/2, so this adds exactly one step.
    mantissa += DBL_EPSILON / 2.0;
    value = ldexp(mantissa, exponent);
    return value;
}
For what it's worth, the value for which standard ++ incrementing ceases to function is 9,007,199,254,740,992.
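That value is 2^53; a quick demonstration of why ++ stops working there:

#include <cstdio>

int main()
{
    double d = 9007199254740992.0;  // 2^53
    // 2^53 + 1 is not representable: it lies halfway between 2^53 and 2^53 + 2,
    // and round-to-nearest-even sends it back down to 2^53.
    std::printf("%.0f\n", d + 1.0); // prints 9007199254740992 again
}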
This may not be exactly what you want, but you still might find numeric_limits in <limits> of use. Particularly the members min() and epsilon().
I don't believe that something like mydouble + numeric_limits<double>::epsilon() will do what you want, unless mydouble is already close to epsilon. If it is, then you're in luck.
I found this code a while back; maybe it will help you determine the smallest amount you can push the value up by, and then you can just increment it by that amount. Unfortunately I can't remember the reference for this code:
#include <stdio.h>

int main()
{
    double number1, number2;  /* two numbers to work with */
    double result;            /* result of calculation */
    int counter;              /* loop counter and accuracy check */

    number1 = 1.0;
    number2 = 1.0;
    counter = 0;
    while (number1 + number2 != number1) {
        ++counter;
        number2 = number2 / 10;
    }
    printf("%2d digits accuracy in calculations\n", counter);

    number2 = 1.0;
    counter = 0;
    while (1) {
        result = number1 + number2;
        if (result == number1)
            break;
        ++counter;
        number2 = number2 / 10.0;
    }
    printf("%2d digits accuracy in storage\n", counter);

    return (0);
}