How to check dependencies of floats - c++

I want to determine (in C++) if one float is the multiplicative inverse of another float. The problem is that I have to use a third variable to do it. For instance, this code:
float x=5,y=0.2;
if(x==(1/y)) cout<<"They are the multiplicative inverse of eachother"<<endl;
else cout<<"They are NOT the multiplicative inverse of eachother"<<endl;
will output "They are NOT...", which is wrong, and this code:
float x=5,y=0.2,z;
z=1/y;
if(x==z) cout<<"They are the multiplicative inverse of eachother"<<endl;
else cout<<"They are NOT the multiplicative inverse of eachother"<<endl;
will output "They are...", which is right. Why is this happening?

The Float Precision Problem
You have two problems here, but both come from the same root.
You can't compare floats exactly. You can't subtract or divide them exactly. You can't compute anything with them exactly. Any operation on them can (and almost always does) introduce some error into the result. Even a=0.2f is not an exact operation. The deeper reasons for that are very well explained by the authors of the other answers here. (My thanks and votes to them for that.)
Here is your first and simpler error. You should never, never, never, never, NEVER use == or its equivalent on floats, in any language.
Instead of a==b, use Abs(a-b)<HighestPossibleError.
But this is not the only problem in your task.
Abs(1/y-x)<HighestPossibleError won't work either. At least, it won't work often enough. Why?
Take the pair x=1000 and y=0.001, and let the "starting" relative error of y be 10^-6.
(Relative error = error/value).
Relative errors add under multiplication and division.
1/y is about 1000. Its relative error is the same 10^-6. ("1" has no error.)
That makes the absolute error 1000*10^-6=0.001. When you subtract x later, that error is all that remains. (Absolute errors add under addition and subtraction, and the error of x is negligibly small.) Surely you are not counting on errors that large: HighestPossibleError would be set lower, and your program would reject a good pair x,y.
So, the next two rules for float operations: try not to divide a greater value by a lesser one, and God save you from subtracting close values after that.
There are two simple ways to escape this problem.
Find which of x and y has the greater absolute value, divide 1 by that one, and only then subtract the lesser one.
If you want to compare 1/y against x, then while you are still working with letters, not values, and your operations introduce no errors, multiply both sides of the comparison by y
and you have 1 against x*y. (Usually you should check signs in that operation, but here we use abs values, so it is clean.) The resulting comparison has no division at all.
In short:
1/y V x <=> y*(1/y) V x*y <=> 1 V x*y
We already know that a comparison such as 1 against x*y should be done like this:
const float HighestPossibleError=1e-6f; // 1e-10 would be below float resolution near 1.0
if(Abs(x*y-1.0)<HighestPossibleError){...
That is all.
P.S. If you really need it all on one line, use:
if(Abs(x*y-1.0)<1e-6f){...
But it is bad style. I wouldn't advise it.
P.P.S. In your second example the compiler optimizes the code so that z is set to 5 before any code runs. So checking 5 against 5 works even for floats.
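The product-based test from this answer can be sketched as a small helper. This is a minimal sketch, not the answerer's code: the name `IsInversePair` and the default tolerance `1e-6f` are assumptions chosen for illustration, and the tolerance should be picked to fit the magnitude of your own data.

```cpp
#include <cmath>

// Hypothetical helper: returns true when x * y is within `tolerance` of 1.
// The default tolerance is an assumption (a few float steps around 1.0).
bool IsInversePair(float x, float y, float tolerance = 1e-6f) {
    // Compare the product against 1 instead of dividing and then subtracting.
    return std::fabs(x * y - 1.0f) < tolerance;
}
```

Because the check multiplies instead of dividing, a pair such as 1000 and 0.001f is accepted without the absolute error being scaled up by the division.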

The problem is that 0.2 cannot be represented exactly in binary, because its binary expansion has an infinite number of digits:
1/5: 0.0011001100110011001100110011001100110011...
This is similar to how 1/3 cannot be represented exactly in decimal. Since x is stored in a float which has a finite number of bits, these digits will get cut off at some point, for example:
x: 0.0011001100110011001100110011001
The problem arises because CPUs often use a higher precision internally, so when you've just calculated 1/y, the result will have more digits, and when you load x to compare them, x will get extended to match the internal precision of the CPU.
1/y: 0.0011001100110011001100110011001100110011001100110011
x: 0.0011001100110011001100110011001000000000000000000000
So when you do a direct bit-by-bit comparison, they are different.
In your second example, however, storing the result into a variable means it gets truncated before doing the comparison, so comparing them at this precision, they're equal:
x: 0.0011001100110011001100110011001
z: 0.0011001100110011001100110011001
Many compilers have switches you can enable to force intermediate values to be truncated at every step for consistency, however the usual advice is to avoid doing direct comparisons between floating-point values and instead check if they differ by less than some epsilon value, which is what Gangnus is suggesting.
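The effect of intermediate precision can be imitated portably by doing the division once in double and once in float. This is only a sketch: it assumes float and double are IEEE-754 binary32 and binary64, the function names are made up for the example, and double merely stands in for the wider register precision described above.

```cpp
// Quotient kept at higher precision, standing in for a wide FPU register.
double InverseWide(float y) {
    return 1.0 / static_cast<double>(y);
}

// Quotient rounded to float, as happens when it is stored in a float variable.
float InverseNarrow(float y) {
    return static_cast<float>(1.0f / y);
}
```

With y = 0.2f, InverseNarrow(y) compares equal to 5.0f, while InverseWide(y) is slightly below 5.0 because the stored 0.2f is slightly larger than 0.2.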

You will have to precisely define what it means for two approximations to be multiplicative inverses. Otherwise, you won't know what it is you're supposed to be testing.
0.2 has no exact binary representation. If you store numbers that have no exact representation with limited precision, you won't get answers that are exactly correct.
The same thing happens in decimal. For example, 1/3 has no exact decimal representation. You can store it as .333333. But then you have a problem. Are 3 and .333333 multiplicative inverses? If you multiply them, you get .999999. If you want the answer to be "yes", you'll have to create a test for multiplicative inverses that isn't as simple as multiplying and testing for equality to 1.
The same thing happens with binary.

The discussions in other replies are great and so I won't repeat any of them, but there's no code. Here's a little bit of code to actually check if a pair of floats gives exactly 1.0 when multiplied.
The code makes a few assumptions/assertions (which are normally met on the x86 platform):
- floats are 32-bit binary (AKA single precision) IEEE-754
- either ints or longs are 32-bit (I decided not to rely on the availability of uint32_t)
- memcpy() copies floats to ints/longs such that 8873283.0f becomes 0x4B076543 (i.e. certain "endianness" is expected)
One extra assumption is this:
- it receives the actual floats that * would multiply (i.e. multiplication of floats wouldn't use higher precision values that the math hardware/library can use internally)
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <assert.h>
#define C_ASSERT(expr) extern char CAssertExtern[(expr)?1:-1]
#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned int uint32;
#else
typedef unsigned long uint32;
#endif
typedef unsigned long long uint64;
C_ASSERT(CHAR_BIT == 8);
C_ASSERT(sizeof(uint32) == 4);
C_ASSERT(sizeof(float) == 4);
int ProductIsOne(float f1, float f2)
{
uint32 m1, m2;
int e1, e2, s1, s2;
int e;
uint64 m;
// Make sure floats are 32-bit IEEE 754 and
// reinterpreted as integers as we expect
{
static const float testf = 8873283.0f;
uint32 testi;
memcpy(&testi, &testf, sizeof(testf));
assert(testi == 0x4B076543);
}
memcpy(&m1, &f1, sizeof(f1));
s1 = m1 >= 0x80000000;
m1 &= 0x7FFFFFFF;
e1 = m1 >> 23;
m1 &= 0x7FFFFF;
if (e1 > 0) m1 |= 0x800000;
memcpy(&m2, &f2, sizeof(f2));
s2 = m2 >= 0x80000000;
m2 &= 0x7FFFFFFF;
e2 = m2 >> 23;
m2 &= 0x7FFFFF;
if (e2 > 0) m2 |= 0x800000;
if (e1 == 0xFF || e2 == 0xFF || s1 != s2) // Inf, NaN, different signs
return 0;
m = (uint64)m1 * m2;
if (!m || (m & (m - 1))) // not a power of 2
return 0;
e = e1 + !e1 - 0x7F - 23 + e2 + !e2 - 0x7F - 23;
while (m > 1) m >>= 1, e++;
return e == 0;
}
const float testData[][2] =
{
{ .1f, 10.0f },
{ 0.5f, 2.0f },
{ 0.25f, 2.0f },
{ 4.0f, 0.25f },
{ 0.33333333f, 3.0f },
{ 0.00000762939453125f, 131072.0f }, // 2^-17 * 2^17
{ 1.26765060022822940E30f, 7.88860905221011805E-31f }, // 2^100 * 2^-100
{ 5.87747175411143754E-39f, 1.70141183460469232E38f }, // 2^-127 (denormalized) * 2^127
};
int main(void)
{
int i;
for (i = 0; i < sizeof(testData) / sizeof(testData[0]); i++)
printf("%g * %g %c= 1\n",
testData[i][0], testData[i][1],
"!="[ProductIsOne(testData[i][0], testData[i][1])]);
return 0;
}
Output (see at ideone.com):
0.1 * 10 != 1
0.5 * 2 == 1
0.25 * 2 != 1
4 * 0.25 == 1
0.333333 * 3 != 1
7.62939e-06 * 131072 == 1
1.26765e+30 * 7.88861e-31 == 1
5.87747e-39 * 1.70141e+38 == 1

What is striking is that whatever the rounding rule is, you expect the outcome of the two versions to be the same (either twice wrong or twice right)!
Most probably, in the first case a promotion to higher accuracy in the FPU registers takes place when evaluating x==1/y, whereas z= 1/y really stores the single-precision result.
Other contributors have explained why 5==1/0.2 can fail, so I needn't repeat that.

Related

Does exist two numbers that multiplied (or divided) each other introduce error?

Here's the bank of tests I'm doing, learning how FP basic ops (+, -, *, /) would introduce errors:
#include <iostream>
#include <math.h>
int main() {
std::cout.precision(100);
double a = 0.499999999999999944488848768742172978818416595458984375;
double original = 47.9;
double target = original * a;
double back = target / a;
std::cout << original << std::endl;
std::cout << back << std::endl;
std::cout << fabs(original - back) << std::endl; // its always 0.0 for the test I did
}
Can you show to me two values (original and a) that, once * (or /), due to FP math, introduce error?
And if they exist, is it possible to establish if that error is introduced by * or /? And how? (since you need both for coming back to the value; 80 bit?)
With + is easy (just add 0.499999999999999944488848768742172978818416595458984375 to 0.5, and you get 1.0, as for 0.5 + 0.5).
But I'm not able to do the same with * or /.
The output of:
#include <cstdio>
int main(void)
{
double a = 1000000000000.;
double b = 1000000000000.;
std::printf("a = %.99g.\n", a);
std::printf("b = %.99g.\n", b);
std::printf("a*b = %.99g.\n", a*b);
}
is:
a = 1000000000000.
b = 1000000000000.
a*b = 999999999999999983222784.
assuming IEEE-754 basic 64-bit binary floating-point with correct rounding to nearest, ties to even.
Obviously, 999999999999999983222784 differs from the exact mathematical result of 1000000000000•1000000000000, 1000000000000000000000000.
Multiply any two large† numbers, and there is likely going to be error because representable values have great distances in the high range of values.
While this error can be great in absolute terms, it is still small in relation to the size of the number itself, so if you perform the reverse division, the error of the first operation is scaled down in the same ratio, and disappears completely. As such, this sequence of operations is stable.
If the result of the multiplication would be greater than the maximum representable value, it overflows to infinity (this may depend on configuration), in which case the reverse division won't yield the original value; the result stays infinity.
Similarly, if you divide by a great number, you may underflow below the smallest representable value, resulting in either zero or a subnormal value.
† Numbers do not necessarily have to be huge. It's just easier to perceive the issue when considering huge values. The problem applies to quite small values as well. For example:
2.100000000000000088817841970012523233890533447265625 ×
2.100000000000000088817841970012523233890533447265625
Correct result:
4.410000000000000373034936274052605470949292688633679117285...
Example floating point result:
4.410000000000000142108547152020037174224853515625
Error:
2.30926389122032568296724439173008679117285652827862296732064351090230047702789306640625 × 10^-16
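The overflow and underflow cases described in this answer can be sketched with a multiply-then-divide round trip. This is a minimal sketch assuming IEEE-754 binary64 doubles with default (non-trapping, no flush-to-zero) behaviour; `RoundTrip` is a name invented for the example.

```cpp
// Multiplies by `factor`, then divides by it again, and returns what comes back.
double RoundTrip(double value, double factor) {
    // volatile forces the product to be rounded to a 64-bit double, so it
    // may overflow to infinity or underflow to zero as described above.
    volatile double product = value * factor;
    return product / factor;
}
```

For ordinary magnitudes the original value comes back; for 1e200 * 1e200 the product is infinity and stays infinity, and for 1e-200 * 1e-200 the product underflows to zero and stays zero.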
Does exist two numbers that multiplied (or divided) each other introduce error?
This is much easier to see with "%a".
When the precision of the result is insufficient, rounding occurs. Typically double has 53 bits of binary precision. Multiplying the two 27-bit numbers below results in an exact 53-bit answer, but the product of the two 28-bit ones would need a 55-bit significand, which does not fit.
Division is easy to demo, just try 1.0/n*n.
#include <math.h>
#include <stdio.h>
int main(void) {
double a = 1 + 1.0/pow(2,26);
printf("%.15a, %.17e\n", a, a);
printf("%.15a, %.17e\n", a*a, a*a);
double b = 1 + 1.0/pow(2,27);
printf("%.15a, %.17e\n", b, b);
printf("%.15a, %.17e\n", b*b, b*b);
for (int n = 47; n < 52; n += 2) {
volatile double frac = 1.0/n;
printf("%.15a, %.17e %d\n", frac, frac, n);
printf("%.15a, %.17e\n", frac*n, frac*n);
}
return 0;
}
Output
//v-------v 27 significant bits.
0x1.000000400000000p+0, 1.00000001490116119e+00
//v-------------v 53 significant bits.
0x1.000000800000100p+0, 1.00000002980232261e+00
//v-------v 28 significant bits.
0x1.000000200000000p+0, 1.00000000745058060e+00
//v--------------v not 55 significant bits.
0x1.000000400000000p+0, 1.00000001490116119e+00
// ^^^ all zeros here, not the expected mathematical answer.
0x1.5c9882b93105700p-6, 2.12765957446808505e-02 47
0x1.000000000000000p+0, 1.00000000000000000e+00
0x1.4e5e0a72f053900p-6, 2.04081632653061208e-02 49
0x1.fffffffffffff00p-1, 9.99999999999999889e-01 <==== Not 1.0
0x1.414141414141400p-6, 1.96078431372549017e-02 51
0x1.000000000000000p+0, 1.00000000000000000e+00
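The 1.0/n*n check from the loop above can be isolated into a predicate. This is a small sketch assuming IEEE-754 binary64 doubles with round-to-nearest; `ReciprocalRoundTrips` is a hypothetical name.

```cpp
// Returns true when (1.0 / n) * n comes back as exactly 1.0 in double arithmetic.
bool ReciprocalRoundTrips(int n) {
    volatile double frac = 1.0 / n;   // volatile forces rounding to double
    volatile double back = frac * n;
    return back == 1.0;
}
```

Consistent with the output above, the predicate holds for 47 and 51 but not for 49.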

Iterate through all possible floating-point values, starting from lowest

I am writing a unit test for a math function and I would like to be able to "walk" all possible floats/doubles.
Due to IEEE shenanigans, floating types cannot be incremented (++) at their extremities. See this question for more details. That answer states :
one can only add multiples of 2^(n-N)
But never mentions what little n is.
A solution to iterate all possible values from +0.0 to +infinity is given in this great blog post. The technique involves using a union with an int to walk the different values of a float. This works due to the following properties explained in the post, though they are only valid for positive numbers.
Adjacent floats have adjacent integer representations
Incrementing the integer representation of a float moves to the next representable float, moving away from zero
His solution for +0.0 to +infinity (0.f to std::numeric_limits<float>::max()) :
union Float_t {
int32_t RawExponent() const { return (i >> 23) & 0xFF; }
int32_t i;
float f;
};
Float_t allFloats;
allFloats.f = 0.0f;
while (allFloats.RawExponent() < 255) {
allFloats.i += 1;
}
Is there a solution for -infinity to +0.0 (std::numeric_limits<float>::lowest() to 0.f)?
I've tested std::nextafter and std::nexttoward and couldn't get them to work. Maybe this is an MSVC issue?
I would be ok with any sort of hack since this is a unit test. Thanks!
You can walk all 32-bit representations by using all values of a 32-bit unsigned int. That way you really do walk all representations, positive and negative, including both zeros (there are two) and all the not-a-number (NaN) representations. You may or may not want to filter out the NaN representations, or filter out only the signaling ones and leave the non-signaling ones in. This depends on your use case.
Example:
for (uint32_t i = 0;;)
{
float f;
// Type punning: Force the bit representation of i into f.
// Type punning is hard because mostly undefined in C/C++.
// Using memcpy() usually avoids any type punning warning.
memcpy(&f, &i, sizeof(f));
// Use f here.
// Warning: Using signaling NaNs may throw exceptions or raise signals.
i++;
if (i == 0)
break;
}
Instead you can also walk a 32-bit int from -2**31 to +(2**31-1). This makes no difference.
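For the negative range asked about in the question, it may help to see what a single increment of the bit pattern does. The sketch below assumes a 32-bit IEEE-754 float; the helper names are invented for the example. Because the format is sign-magnitude, adding 1 to the bits of a negative float moves it away from zero, so walking from the lowest value up to -0.0 means decrementing the pattern.

```cpp
#include <cstdint>
#include <cstring>

// Copies the bit pattern of a float into a 32-bit unsigned integer.
std::uint32_t FloatToBits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Copies a 32-bit pattern back into a float.
float BitsToFloat(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Next representable float away from zero; meant for finite, non-NaN input.
float StepAwayFromZero(float f) {
    return BitsToFloat(FloatToBits(f) + 1u);
}
```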
Pascal Cuoq correctly points out std::nextafter is the right solution. I had a problem elsewhere in my code. Sorry for the unnecessary question.
#include <cassert>
#include <cmath>
#include <limits>
int main() {
float i = std::numeric_limits<float>::lowest();
float hi = std::numeric_limits<float>::max();
float new_i = std::nextafterf(i, hi);
assert(i != new_i);
double d = std::numeric_limits<double>::lowest();
double hi_d = std::numeric_limits<double>::max();
double new_d = std::nextafter(d, hi_d);
assert(d != new_d);
long double ld = std::numeric_limits<long double>::lowest();
long double hi_ld = std::numeric_limits<long double>::max();
long double new_ld = std::nextafterl(ld, hi_ld);
assert(ld != new_ld);
for (float d = std::numeric_limits<float>::lowest();
d < std::numeric_limits<float>::max();
d = std::nextafterf(
d, std::numeric_limits<float>::max())) {
// Wait a lifetime?
}
}
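A bounded version of the same std::nextafter walk finishes quickly and makes a handy sanity check. This is a sketch assuming a 32-bit IEEE-754 float; `CountFloats` is a hypothetical helper, not part of any library.

```cpp
#include <cmath>
#include <cstdint>

// Counts the representable floats in [lo, hi) by stepping with std::nextafter.
std::uint64_t CountFloats(float lo, float hi) {
    std::uint64_t count = 0;
    for (float f = lo; f < hi; f = std::nextafter(f, hi)) {
        ++count;
    }
    return count;
}
```

Under that assumption, each binade such as [1, 2) or [-2, -1) contains 2^23 = 8388608 floats, which the loop visits one by one.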
Iterating through all the float values can be done with simple understanding of the floating-point representation:
The distance between consecutive subnormal values is the minimum normal times the “epsilon”. Simply iterate through all the subnormals using this distance as an increment.
The distance between the normal values at the lowest exponent is the same. Step through them with the same increment.
For each exponent, the distance increases according to the floating-point radix. Simply multiply the increment by the radix and step through all the values for the next exponent.
Repeat until infinity is reached.
Observe that the inner loop in the code below is simply:
for (; x < Limit; x += Increment)
Test(x);
This has the advantage that only normal floating-point arithmetic is used. The inner loop contains only one addition and one comparison (plus any tests you want to perform with each number). No library functions are called in the loop, no representations are dissected or copied to general registers or otherwise manipulated. There is nothing to impede performance.
This code steps through only the non-negative numbers. The negative numbers can be tested separately in the same way or can share this code by inserting a call Test(-x).
#include <limits>
static void Test(float x)
{
// Insert unit test for value x here.
}
int main(void)
{
typedef float T;
static const int Radix = std::numeric_limits<T>::radix;
static const T Infinity = std::numeric_limits<T>::infinity();
/* Increment is the current distance between floating-point numbers. We
start it at distance between subnormal numbers.
*/
T Increment =
std::numeric_limits<T>::min() * std::numeric_limits<T>::epsilon();
/* Limit is the next boundary where the distance between floating-point
numbers changes. We will increment up to that limit and then adjust
the limit and increment. We start it at the top of the first set of
normals, which allows the first loop to increment first through the
subnormals and then through the normals with the lowest exponent.
(These two sets have the same step size between adjacent values.)
*/
T Limit = std::numeric_limits<T>::min() * Radix;
/* Start with zero and continue until we reach infinity.
We execute an inner loop that iterates through all the significands of
one floating-point exponent. Each time it completes, we step up the
limit and increment.
*/
for (T x = 0; x < Infinity; Limit *= Radix, Increment *= Radix)
// Increment x through all the significands with the current exponent.
for (; x < Limit; x += Increment)
// Test with the current value of x.
Test(x);
// Also test infinity.
Test(Infinity);
}
(This code assumes the floating-point type has subnormals, and that they are not flushed to zero. The code can be readily adjusted to support these alternatives as well.)

Can float values add to a sum of zero? [duplicate]

Possible Duplicate:
Most effective way for float and double comparison
I have two values(floats) I am attempting to add together and average. The issue I have is that occasionally these values would add up to zero, thus not requiring them to be averaged.
The situation I am in specifically contains the values "-1" and "1", yet when added together I am given the value "-1.19209e-007" which is clearly not 0. Any information on this?
I'm sorry but this doesn't make sense to me.
Two floating-point values that are exactly the same but with opposite signs will always produce 0 when added. This is how floating-point operations work.
float a = 0.2f;
float b = -0.2f;
float f = (a + b) / 2;
printf("%f %d\n", f, f != 0); // will print out 0.000000 0
It will always be 0, even if the compiler doesn't optimize the code.
There is no rounding error to take into account if a and b have the same value but opposite signs! That is, if the highest (sign) bit of a is 0, the highest bit of b is 1, and all other bits are the same, the result cannot be anything other than 0.
But if a and b are slightly different, of course, the result can be non-zero.
One possible solution to avoid this is to use a tolerance...
float f = (a + b) / 2;
if (fabsf(f) < 0.000001f) // fabsf from <math.h>; plain abs() may pick the int version
f = 0;
We are using a simple tolerance to see if our value is near zero.
A nice example code to show this is...
#include <stdio.h>
int main(int argc, char **argv)
{
(void)argv; // unused
for (int i = -10000000; i <= 10000000 * argc; ++i)
{
if (i != 0)
{
float a = 3.14159265f / i;
float b = -a + (argc - 1);
float f = (a + b) / 2;
if (f != 0)
printf("%f %g\n", a, f); // f is a float, so %g rather than %d
}
}
printf("completed\n");
return 0;
}
I'm using "argc" here as a trick to force the compiler to not optimize out our code.
At least right off, this sounds like typical floating point imprecision.
The usual way to deal with it is to round your numbers to a sensible number of digits. In this case, the value you got is -1.19209e-07 (i.e., about -0.000000119). Rounded to six decimal places, that is zero.
Take the sum of all your numbers and divide by your count. Round off your answer to something reasonable before you do prints, reports, comparisons, or whatever you're doing.
Again, do some searching on this, but here is the basic explanation...
The computer represents floating-point numbers in base 2 instead of base 10. This means that, for example, 0.2 (when converted to binary) is actually 0.001100110011... on forever. Since the computer cannot store these digits forever, it must approximate.
Because of these approximations, we lose "precision" in calculations; hence "single" and "double" precision floating-point numbers. This is why you never test whether a float is exactly 0. Instead, you test whether its magnitude is below some threshold that you want to treat as zero.
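The threshold test described in these answers can be sketched as follows. The names `NearlyZero` and `AverageOrZero` and the default threshold of 1e-6f are assumptions made for the example; a suitable threshold depends on the magnitude of the values being added.

```cpp
#include <cmath>

// Treats values whose magnitude is below `threshold` as zero.
// The default threshold is an assumption; choose one that suits your data.
bool NearlyZero(float value, float threshold = 1e-6f) {
    return std::fabs(value) < threshold;
}

// Averages two floats, snapping a near-zero sum to exactly zero.
float AverageOrZero(float a, float b) {
    float sum = a + b;
    return NearlyZero(sum) ? 0.0f : sum / 2.0f;
}
```

A sum such as -1.19209e-07, which is one float step away from cancelling at magnitude 1, is then treated as zero instead of being averaged.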

Fast fixed point pow, log, exp and sqrt

I've got a fixed point class (10.22) and I have a need of a pow, a sqrt, an exp and a log function.
Alas I have no idea where to even start on this. Can anyone provide me with some links to useful articles or, better yet, provide me with some code?
I'm assuming that once I have an exp function, it becomes relatively easy to implement pow and sqrt, as they just become:
pow( x, y ) => exp( y * log( x ) )
sqrt( x ) => pow( x, 0.5 )
It's just those exp and log functions that I'm finding difficult (though I remember a few of my log rules, I can't remember much else about them).
Presumably there would also be a faster method for sqrt and pow, so any pointers on that front would be appreciated, even if it's just to say use the methods I outline above.
Please note: This HAS to be cross platform and in pure C/C++ code so I cannot use any assembler optimisations.
A very simple solution is to use a decent table-driven approximation. You don't actually need a lot of data if you reduce your inputs correctly. exp(a)==exp(a/2)*exp(a/2), which means you really only need to calculate exp(x) for 1 < x < 2. Over that range, a Runge-Kutta approximation would give reasonable results with ~16 entries IIRC.
Similarly, sqrt(a) == 2 * sqrt(a/4) == sqrt(4*a) / 2 which means you need only table entries for 1 < a < 4. Log(a) is a bit harder: log(a) == 1 + log(a/e). This is a rather slow iteration, but log(1024) is only 6.9 so you won't have many iterations.
You'd use a similar "integer-first" algorithm for pow: pow(x,y)==pow(x, floor(y)) * pow(x, frac(y)). This works because pow(double, int) is trivial (divide and conquer).
[edit] For the integral component of log(a), it may be useful to store a table 1, e, e^2, e^3, e^4, e^5, e^6, e^7 so you can reduce log(a) == n + log(a/e^n) by a simple hardcoded binary search of a in that table. The improvement from 7 to 3 steps isn't so big, but it means you only have to divide once by e^n instead of n times by e.
[edit 2]
And for that last log(a/e^n) term, you can use log(a/e^n) = log((a/e^n)^8)/8 - each iteration produces 3 more bits by table lookup. That keeps your code and table size small. This is typically code for embedded systems, and they don't have large caches.
[edit 3]
That's still not too smart on my side. log(a) = log(2) + log(a/2). You can just store the fixed-point value log2=0.6931471805599, count the number of leading zeroes, shift a into the range used for your lookup table, and multiply that shift (integer) by the fixed-point constant log2. Can be as low as 3 instructions.
Using e for the reduction step just gives you a "nice" log(e)=1.0 constant but that's false optimization. 0.6931471805599 is just as good a constant as 1.0; both are 32 bits constants in 10.22 fixed point. Using 2 as the constant for range reduction allows you to use a bit shift for a division.
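The power-of-two range reduction for log can be sketched like this. It is an illustration under stated assumptions, not the answerer's code: the value is taken as an unsigned Q10.22 raw integer, the helper names are invented, and std::log on the reduced mantissa stands in for the table lookup that a real fixed-point version would use.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kFracBits = 22;                  // Q10.22: 22 fractional bits
constexpr std::uint32_t kOne = 1u << kFracBits;

struct Reduced {
    int shift;               // power of two that was factored out
    std::uint32_t mantissa;  // raw value normalized into [1.0, 2.0)
};

// Shifts raw into [1.0, 2.0) and records the shift count; raw must be non-zero.
Reduced ReduceToOneTwo(std::uint32_t raw) {
    assert(raw > 0);
    int shift = 0;
    while (raw < kOne) { raw <<= 1; --shift; }
    while (raw >= 2 * kOne) { raw >>= 1; ++shift; }
    return {shift, raw};
}

// log(a) = shift * log(2) + log(mantissa); the second term is the part a
// lookup table would supply (std::log is only a stand-in here).
double LogViaReduction(std::uint32_t raw) {
    const Reduced r = ReduceToOneTwo(raw);
    const double mantissa = static_cast<double>(r.mantissa) / kOne;
    return r.shift * 0.6931471805599 + std::log(mantissa);
}
```

For example, 10.0 in Q10.22 reduces to a shift of 3 and a mantissa of 1.25, so the result is 3 * log(2) + log(1.25).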
[edit 5]
And since you're storing it in Q10.22, you can better store log(65536)=11.09035488. (16 x log(2)). The "x16" means that we've got 4 more bits of precision available.
You still get the trick from edit 2, log(a/2^n) = log((a/2^n)^8)/8. Basically, this gets you a result (a + b/8 + c/64 + d/512) * 0.6931471805599 - with b,c,d in the range [0,7]. a.bcd really is an octal number. Not a surprise since we used 8 as the power. (The trick works equally well with power 2, 4 or 16.)
[edit 4]
Still had an open end. pow(x, frac(y)) is just pow(sqrt(x), 2 * frac(y)) and we have a decent 1/sqrt(x). That gives us the far more efficient approach. Say frac(y)=0.101 binary, i.e. 1/2 plus 1/8. Then x^0.101 is (x^(1/2) * x^(1/8)). But x^(1/2) is just sqrt(x) and x^(1/8) is sqrt(sqrt(sqrt(x))). Saving one more operation, Newton-Raphson NR(x) gives us 1/sqrt(x), so we calculate 1.0/(NR(x)*NR(NR(NR(x)))). We only invert the end result and don't use the sqrt function directly.
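The binary-fraction decomposition of the exponent can be sketched in plain double arithmetic. This is a simplified illustration: it uses std::sqrt directly instead of the Newton-Raphson reciprocal square root mentioned above, and `PowFraction` and its `bits` parameter are names invented for the example.

```cpp
#include <cmath>

// Computes x^frac for 0 <= frac < 1 by walking the binary digits of frac:
// digit k (weight 2^-k) contributes the k-times-repeated square root of x.
double PowFraction(double x, double frac, int bits) {
    double result = 1.0;
    double root = x;
    for (int i = 0; i < bits; ++i) {
        root = std::sqrt(root);   // now x^(1/2), x^(1/4), x^(1/8), ...
        frac *= 2.0;              // shift the next binary digit above the point
        if (frac >= 1.0) {        // digit is 1: multiply this root in
            result *= root;
            frac -= 1.0;
        }
    }
    return result;
}
```

With frac = 0.625 (0.101 in binary), the result is sqrt(x) * sqrt(sqrt(sqrt(x))), matching the example in the text.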
Below is an example C implementation of Clay S. Turner's fixed-point log base 2 algorithm[1]. The algorithm doesn't require any kind of look-up table. This can be useful on systems where memory constraints are tight and the processor lacks an FPU, such as is the case with many microcontrollers. Log base e and log base 10 are then also supported by using the property of logarithms that, for any base n:
logₙ(x) = logₘ(x) / logₘ(n)
where, for this algorithm, m equals 2.
A nice feature of this implementation is that it supports variable precision: the precision can be determined at runtime, at the expense of range. The way I've implemented it, the processor (or compiler) must be capable of doing 64-bit math for holding some intermediate results. It can be easily adapted to not require 64-bit support, but the range will be reduced.
When using these functions, x is expected to be a fixed-point value scaled according to the specified precision. For instance, if precision is 16, then x should be scaled by 2^16 (65536). The result is a fixed-point value with the same scale factor as the input. A return value of INT32_MIN represents negative infinity. A return value of INT32_MAX indicates an error and errno will be set to EINVAL, indicating that the input precision was invalid.
#include <errno.h>
#include <stddef.h>
#include "log2fix.h"
#define INV_LOG2_E_Q1DOT31 UINT64_C(0x58b90bfc) // Inverse log base 2 of e
#define INV_LOG2_10_Q1DOT31 UINT64_C(0x268826a1) // Inverse log base 2 of 10
int32_t log2fix (uint32_t x, size_t precision)
{
int32_t b;
int32_t y = 0;
if (precision < 1 || precision > 31) {
errno = EINVAL;
return INT32_MAX; // indicates an error
}
b = 1U << (precision - 1); // computed only after precision has been validated
if (x == 0) {
return INT32_MIN; // represents negative infinity
}
while (x < 1U << precision) {
x <<= 1;
y -= 1U << precision;
}
while (x >= 2U << precision) {
x >>= 1;
y += 1U << precision;
}
uint64_t z = x;
for (size_t i = 0; i < precision; i++) {
z = z * z >> precision;
if (z >= 2U << (uint64_t)precision) {
z >>= 1;
y += b;
}
b >>= 1;
}
return y;
}
int32_t logfix (uint32_t x, size_t precision)
{
uint64_t t;
t = log2fix(x, precision) * INV_LOG2_E_Q1DOT31;
return t >> 31;
}
int32_t log10fix (uint32_t x, size_t precision)
{
uint64_t t;
t = log2fix(x, precision) * INV_LOG2_10_Q1DOT31;
return t >> 31;
}
The code for this implementation also lives at Github, along with a sample/test program that illustrates how to use this function to compute and display logarithms from numbers read from standard input.
[1] C. S. Turner, "A Fast Binary Logarithm Algorithm", IEEE Signal Processing Mag., pp. 124,140, Sep. 2010.
A good starting point is Jack Crenshaw's book, "Math Toolkit for Real-Time Programming". It has a good discussion of algorithms and implementations for various transcendental functions.
Check my fixed point sqrt implementation using only integer operations.
It was fun to invent. Quite old now.
https://groups.google.com/forum/?hl=fr%05aacf5997b615c37&fromgroups#!topic/comp.lang.c/IpwKbw0MAxw/discussion
Otherwise check the CORDIC set of algorithms. That's the way to implement all the functions you listed and the trigonometric functions.
EDIT : I published the reviewed source on GitHub here

Generating random floating-point values based on random bit stream

Given a random source (a generator of random bit stream), how do I generate a uniformly distributed random floating-point value in a given range?
Assume that my random source looks something like:
unsigned int GetRandomBits(char* pBuf, int nLen);
And I want to implement
double GetRandomVal(double fMin, double fMax);
Notes:
I don't want the result precision to be limited (for example only 5 digits).
Strict uniform distribution is a must
I'm not asking for a reference to an existing library. I want to know how to implement it from scratch.
For pseudo-code / code, C++ would be most appreciated
I don't think I'll ever be convinced that you actually need this, but it was fun to write.
#include <stdint.h>
#include <cmath>
#include <cstdio>
FILE* devurandom;
bool geometric(int x) {
// returns true with probability min(2^-x, 1)
if (x <= 0) return true;
while (1) {
uint8_t r;
fread(&r, sizeof r, 1, devurandom);
if (x < 8) {
return (r & ((1 << x) - 1)) == 0;
} else if (r != 0) {
return false;
}
x -= 8;
}
}
double uniform(double a, double b) {
// requires IEEE doubles and 0.0 < a < b < inf and a normal
// implicitly computes a uniform random real y in [a, b)
// and returns the greatest double x such that x <= y
union {
double f;
uint64_t u;
} convert;
convert.f = a;
uint64_t a_bits = convert.u;
convert.f = b;
uint64_t b_bits = convert.u;
uint64_t mask = b_bits - a_bits;
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask |= mask >> 32;
int b_exp;
frexp(b, &b_exp);
while (1) {
// sample uniform x_bits in [a_bits, b_bits)
uint64_t x_bits;
fread(&x_bits, sizeof x_bits, 1, devurandom);
x_bits &= mask;
x_bits += a_bits;
if (x_bits >= b_bits) continue;
double x;
convert.u = x_bits;
x = convert.f;
// accept x with probability proportional to 2^x_exp
int x_exp;
frexp(x, &x_exp);
if (geometric(b_exp - x_exp)) return x;
}
}
int main() {
devurandom = fopen("/dev/urandom", "r");
for (int i = 0; i < 100000; ++i) {
printf("%.17g\n", uniform(1.0 - 1e-15, 1.0 + 1e-15));
}
}
Here is one way of doing it.
The IEEE Std 754 double format is as follows:
[s][ e ][ f ]
where s is the sign bit (1 bit), e is the biased exponent (11 bits) and f is the fraction (52 bits).
Beware that the layout in memory will be different on little-endian machines.
For 0 < e < 2047, the number represented is
(-1)**(s) * 2**(e – 1023) * (1.f)
By setting s to 0, e to 1023 and f to 52 random bits from your bit stream, you get a random double in the interval [1.0, 2.0). This interval is unique in that it contains 2 ** 52 doubles, and these doubles are equidistant. If you then subtract 1.0 from the constructed double, you get a random double in the interval [0.0, 1.0). Moreover, the property of being equidistant is preserved.
From there you should be able to scale and translate as needed.
I'm surprised that for question this old, nobody had actual code for the best answer. User515430's answer got it right--you can take advantage of IEEE-754 double format to directly put 52 bits into a double with no math at all. But he didn't give code. So here it is, from my public domain ojrandlib:
double ojr_next_double(ojr_generator *g) {
uint64_t r = (OJR_NEXT64(g) & 0xFFFFFFFFFFFFFull) | 0x3FF0000000000000ull;
return *(double *)(&r) - 1.0;
}
OJR_NEXT64() gets a 64-bit random number. If you have a more efficient way of getting only 52 bits, use that instead.
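The same bit construction can also be written with memcpy, which avoids reading a uint64_t through a double pointer. This is a hedged sketch assuming a 64-bit IEEE-754 double; `UnitDoubleFromBits` is a hypothetical name, and the caller is expected to supply the random bits from whatever source is available.

```cpp
#include <cstdint>
#include <cstring>

// Builds a double in [0.0, 1.0) from the low 52 bits of `random_bits`.
double UnitDoubleFromBits(std::uint64_t random_bits) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects 64-bit double");
    // Sign 0, biased exponent 1023, fraction = 52 random bits: a value in [1.0, 2.0).
    const std::uint64_t pattern =
        (random_bits & 0xFFFFFFFFFFFFFull) | 0x3FF0000000000000ull;
    double result;
    std::memcpy(&result, &pattern, sizeof result);  // well-defined bit copy
    return result - 1.0;
}
```

All-zero bits give 0.0, setting only the top fraction bit gives 0.5, and all-one bits give the largest value below 1.0.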
This is easy, as long as you have an integer type with as many bits of precision as a double. For instance, an IEEE double-precision number has 53 bits of precision, so a 64-bit integer type is enough:
#include <limits.h>
double GetRandomVal(double fMin, double fMax) {
unsigned long long n ;
GetRandomBits ((char*)&n, sizeof(n)) ;
return fMin + (n * (fMax - fMin))/ULLONG_MAX ;
}
This is probably not the answer you want, but the specification here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3225.pdf
in sections [rand.util.canonical] and [rand.dist.uni.real], contains sufficient information to implement what you want, though with slightly different syntax. It isn't easy, but it is possible. I speak from personal experience. A year ago I knew nothing about random numbers, and I was able to do it. Though it took me a while... :-)
The question is ill-posed. What does uniform distribution over floats even mean?
Taking our cue from discrepancy, one way to operationalize your question is to define that you want the distribution that minimizes the following value:
Where x is the random variable you are sampling with your GetRandomVal(double fMin, double fMax) function, and means the probability that a random x is smaller or equal to t.
And now you can go on and try to evaluate eg a dabbler's answer. (Hint all the answers that fail to use the whole precision and stick to eg 52 bits will fail this minimization criterion.)
However, if you just want to be able to generate all float bit patterns that fall into your specified range with equal possibility, even if that means that eg asking for GetRandomVal(0,1000) will create more values between 0 and 1.5 than between 1.5 and 1000, that's easy: any interval of IEEE floating point numbers when interpreted as bit patterns map easily to a very small number of intervals of unsigned int64. See eg this question. Generating equally distributed random values of unsigned int64 in any given interval is easy.
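One common way to realize the mapping from doubles to unsigned 64-bit integers is an order-preserving key, sketched below. It assumes a 64-bit IEEE-754 double and non-NaN input; `OrderedKey` is a name chosen for this example and is not taken from the linked question.

```cpp
#include <cstdint>
#include <cstring>

// Maps a non-NaN double to a uint64 key whose unsigned order matches the
// numeric order of the doubles (with -0.0 sorting just below +0.0).
std::uint64_t OrderedKey(double value) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const std::uint64_t sign = 0x8000000000000000ull;
    // Negative values: flip every bit so larger magnitudes sort lower.
    // Non-negative values: set the top bit so they sort above all negatives.
    return (bits & sign) ? ~bits : (bits | sign);
}
```

With such a key, the doubles in [fMin, fMax] correspond to one contiguous integer interval, so picking a uniform integer in [OrderedKey(fMin), OrderedKey(fMax)] and inverting the mapping selects every bit pattern in the range with equal probability.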
I may be misunderstanding the question, but what stops you from simply sampling the next n bits from the random bit stream and converting that to a base 10 number in the range 0 to 2^n - 1?
To get a random value in [0..1[ you could do something like:
double value = 0;
for (int i=0;i<53;i++)
value = 0.5 * (value + random_bit()); // Insert 1 random bit
// or value = ldexp(value+random_bit(),-1);
// or group several bits into one single ldexp
return value;