Sorting/comparison of doubles in C(++) not stable?

I've run into a pretty weird problem with doubles. I have a list of floating-point numbers (double) that are sorted in decreasing order. Later in my program I find, however, that they are not exactly sorted anymore. For example:
0.65801139819
0.6545651031 <-- a
0.65456513001 <-- b
0.64422968678
The two numbers in the middle are flipped. One might think that the problem lies in the representation of the numbers, and that they are just printed out wrong. But I compare each number with the previous one using the same operator I use to sort them; there is no conversion to base 10 or similar going on:
double last_pt = 0;
for (int i = 0; i < npoints; i++) {
    if (last_pt && last_pt < track[i]->Pt()) {
        cout << "ERROR: new value " << track[i]->Pt()
             << " is higher than previous value " << last_pt << endl;
    }
    last_pt = track[i]->Pt();
}
The values are compared during sorting by
bool moreThan(const Track& a, const Track& b) {
    return a.Pt() > b.Pt();
}
and I made sure that they are always double, and not converted to float. Pt() returns a double. There are no NaNs in the list, and I don't touch the list after sorting.
Why is this, what's wrong with these numbers, and (how) can I sort the numbers so that they stay sorted?

Are you sure you're not converting double to float at some point? Let's take a look at the binary representations of these two numbers:
0 01111111110 0100111100100011001010000011110111010101101100010101
0 01111111110 0100111100100011001010010010010011111101011010001001
A double has 1 sign bit, 11 exponent bits and 52 explicit mantissa bits (53 significant bits counting the implicit leading 1), while a float has 1 sign bit, 8 exponent bits and 23 mantissa bits. Notice that the mantissas of the two numbers are identical in their first 23 bits.
Depending on the rounding method, the behaviour differs. If the bits beyond the 23rd are simply truncated, the two numbers become identical as floats:
0 011111110 01001111001000110010100 (trim: 00011110111010101101100010101)
0 011111110 01001111001000110010100 (trim: 10010010011111101011010001001)
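To see the collapse in action, here is a minimal check (assuming an IEEE-754 platform; with round-to-nearest the two doubles may or may not land on the same float, but with the truncation shown above they certainly do):
#include <iostream>

int main() {
    double a = 0.6545651031;    // the two middle values from the question
    double b = 0.65456513001;
    std::cout << std::boolalpha
              << "as double: a < b  -> " << (a < b) << '\n'
              << "as float:  a == b -> "
              << (static_cast<float>(a) == static_cast<float>(b)) << '\n';
    return 0;
}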

You're comparing the return value of a function. Floating point return values are returned in a floating point register, which has higher precision than a double. When comparing two such values (e.g. a.Pt() > b.Pt()), the compiler will call one of the functions and store its return value in an unnamed temporary of type double (thus rounding the result to double), then call the other function and compare its result (still in the floating point register, not rounded to double) with the stored value. This means you can end up with cases where a.Pt() > b.Pt() and b.Pt() > a.Pt(), or even a.Pt() > a.Pt(), which will get sort more than a little confused. (Formally, if we're talking about std::sort here, this results in undefined behavior, and I've heard of cases where it did cause a core dump.)
On the other hand, you say that Pt() "just returns a double field". If Pt() does no calculation whatsoever, i.e. it's only
double Pt() const { return someDouble; }
then this shouldn't be an issue (provided someDouble has type double). The extended precision can represent all possible double values exactly.
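A minimal workaround sketch based on this explanation (not part of the original answer): force both return values out of the register and into genuine doubles before comparing. Depending on the compiler and flags, volatile (or an option such as GCC's -ffloat-store) may be needed to keep the values from staying in extended precision:
bool moreThan(const Track& a, const Track& b) {
    volatile double pa = a.Pt();   // rounded to double here
    volatile double pb = b.Pt();   // and here
    return pa > pb;                // a consistent strict weak ordering
}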


Why two float type variables have different values [duplicate]

I have two integer vectors of size nearly 1000, and I want to check whether the sums of the squared elements of the two vectors are the same or not. So I wrote the following code:
std::vector<int> array1;
std::vector<int> array2;
... // initialize array1 and array2; in the experiment all elements
    // in the two vectors are the same but the sequence of elements may differ.
    // For example: array1 = {1001, 2002, 3003, ...}
    //              array2 = {2002, 3003, 1001, ...}
assert(array1.size() == array2.size());
float sum_array1 = 0;
float sum_array2 = 0;
for (int i = 0; i < array1.size(); i++)
    sum_array1 += array1[i] * array1[i];
for (int i = 0; i < array2.size(); i++)
    sum_array2 += array2[i] * array2[i];
I expected sum_array1 to equal sum_array2, but in my application they turned out different: sum_array1 = 1.2868639e+009 while sum_array2 = 1.2868655e+009. I then changed the type of sum_array1 and sum_array2 to double, as the following code shows:
double sum_array1 = 0;
double sum_array2 = 0;
for (int i = 0; i < array1.size(); i++)
    sum_array1 += array1[i] * array1[i];
for (int i = 0; i < array2.size(); i++)
    sum_array2 += array2[i] * array2[i];
This time sum_array1 equals sum_array2: sum_array1 = sum_array2 = 1286862225.0000000. My question is why this happens. Thanks.
Floating point values have a finite size, and can therefore only represent real values with a finite precision. This leads to rounding errors when you need more precision than they can store.
In particular, when adding a small number (such as those you're summing) to a much larger number (such as your accumulator), the loss of precision can be quite large compared with the small number, giving a significant error; and the errors will be different depending on the order.
Typically, float has 24 bits of precision, corresponding to about 7 decimal places. Your accumulator requires 10 decimal places (around 30 bits), so you will experience this loss of precision. Typically, double has 53 bits (about 16 decimal places), so your result can be represented exactly.
A 64-bit integer may be the best option here, since all the inputs are integers. Using an integer avoids loss of precision, but introduces a danger of overflow if the inputs are too many or too large.
To minimise the error if you can't use a wide enough accumulator, you could sort the input so that the smallest values are accumulated first; or you could use more complicated methods such as Kahan summation.
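For illustration, here is a minimal sketch of Kahan summation applied to the sums above (the function name is mine):
#include <vector>

float kahan_sum_of_squares(const std::vector<int>& values) {
    float sum = 0.0f;
    float c = 0.0f;                  // running compensation for lost low-order bits
    for (int v : values) {
        float term = float(v) * float(v);
        float y = term - c;          // corrected term
        float t = sum + y;           // big + small: low-order bits of y are lost here...
        c = (t - sum) - y;           // ...and recovered here for the next iteration
        sum = t;
    }
    return sum;
}
With this, the two orderings should agree far more closely (and often exactly) than with naive summation.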
In the two loops you're adding the same numbers but in different orders. As soon as sums exceed the integer value that can be exactly represented by a float, you're going to start losing precision, and the sums may end up slightly different.
An experiment for you to try:
float n = 0;
while (n != n + 1)
    n = n + 1;
// Will this terminate? If so, what is n now?
If you run this, you'll find that the loop actually terminates, which seems completely counter-intuitive but is correct behavior according to the definition of IEEE single-precision floating-point arithmetic.
You can try the same experiment, replacing float with double. You'll witness the same strange behavior, but this time the loop will terminate at a much larger n, because IEEE double-precision floating-point numbers provide much finer precision.
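Here is the float experiment as a complete program (a sketch; it assumes float arithmetic is actually performed in single precision rather than kept in a wider register):
#include <iostream>

int main() {
    float n = 0;
    while (n != n + 1)
        n = n + 1;
    // With IEEE-754 single precision the loop stops at 16777216 (2^24),
    // the first integer where n + 1 rounds back to n.
    std::cout << "stopped at n = " << n << '\n';
    return 0;
}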
Floating-point representations (commonly IEEE 754) use a finite number of bits to represent decimals, so operations on floating-point numbers can lose precision.
Contrary to common sense, a comparison like a == ((a+1)-1) can evaluate to false when a is a floating-point variable.
Solution:
To compare two floating-point numbers, you have to allow for a "precision loss range". That is, if one number differs from the other by less than that range, you consider the numbers equal:
#include <cmath>
#include <limits>

// Supposing we could overload operator== for floats (for built-in types
// we can't, so in practice this would be a named function):
bool almostEqual(float lhs, float rhs)
{
    float epsilon = std::numeric_limits<float>::epsilon();
    return std::abs(lhs - rhs) < epsilon;
}
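A fixed epsilon like this only behaves well for values near 1.0, though. A relative variant (my addition, not part of the answer above) scales the tolerance with the magnitude of the operands:
#include <algorithm>
#include <cmath>
#include <limits>

bool nearlyEqual(float lhs, float rhs)
{
    float scale = std::max(std::abs(lhs), std::abs(rhs));
    return std::abs(lhs - rhs) <= std::numeric_limits<float>::epsilon() * scale;
}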
A double has more bits and thus holds more information than a float. When you add values to a float accumulator, the rounding happens at different points for sum_array1 and sum_array2.
Depending on the input values, you can end up with the same issue using a double as with a float (if the values are large enough).
A web search for "everything you need to know about floating point numbers" will give you a good overview of the limitations and how best to deal with them.

Weird Rounding Occurs in C++ Function

I am writing a function in C++ that is supposed to find the largest single digit in the number passed (inputValue). For example, the answer for .345 is 5. However, after a while the program changes the inputValue to something along the lines of .3449 (and the largest digit is then set to 9). I have no idea why this is happening. Any help resolving this problem would be greatly appreciated.
This is the function in my .hpp file
void LargeInput(const double inputValue)
//Function to find the largest value of the input
{
    int tempMax = 0,    //Value that the temporary max number is in loop
        digit = 0,      //Value of numbers after the decimal place
        test = 0,
        powerOten = 10; //Number multiplied by so that the next digit can be checked
    double number = inputValue; //A variable that can be changed in the function
    cout << "The number is still " << number << endl;
    for (int k = 1; k <= 6; k++)
    {
        test = (number * powerOten);
        cout << "test: " << test << endl;
        digit = test % 10;
        cout << (static_cast<int>(number * powerOten)) << endl;
        if (tempMax < digit)
            tempMax = digit;
        powerOten *= 10;
    }
    return;
}
You cannot represent most real numbers (doubles) precisely in a computer; they have to be approximated. If you change your function to work on longs or ints there won't be any inaccuracies. That seems natural enough in the context of your question: you're looking at the digits, not the number itself, so .345 can be 345 and give the same result.
Try this:
int get_largest_digit(int n) {
    int largest = 0;
    while (n > 0) {
        int x = n % 10;
        if (x > largest) largest = x;
        n /= 10;
    }
    return largest;
}
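A quick usage sketch (my addition), treating .345 as the integer 345 as the answer suggests:
#include <iostream>

int main() {
    std::cout << get_largest_digit(345) << '\n';   // prints 5
    return 0;
}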
This is because the fractional part of a binary floating-point number is built from negative powers of 2 (halves, quarters, eighths, ...). As a result you can get values very close to what you want, but you can never hit exact values like 1/3.
It's common to instead use integers with a scaling convention (like 1000 = 1), so if you had the number 1333 you would do printf("%d.%03d", 1333 / 1000, 1333 % 1000) to print out 1.333 (the %03d keeps leading zeros in the fractional part).
By the way, the first sentence is a simplification of how floating point numbers are actually represented. For more information check out: http://en.wikipedia.org/wiki/Floating_point#Representable_numbers.2C_conversion_and_rounding
This is how floating point numbers work, unfortunately. The core of the problem is that there are infinitely many real numbers: there are infinitely many values between 0.1 and 0.2, and infinitely many between 0.01 and 0.02. Computers, however, have a finite number of bits to represent a floating point number (64 bits for a double-precision number). Therefore most floating point values have to be approximated, and after any floating point operation the processor has to round the result to a value it can represent in 64 bits.
Another property of floating point numbers is that as numbers get bigger they get less and less precise. This is because the same 64 bits have to be able to represent very big numbers (1,000,000,000) and very small numbers (0.000,000,000,001). Therefore the rounding error gets larger when working with bigger numbers.
The other issue here is that you are converting from floating point to integer. This introduces even more rounding error. It appears that when 0.345 * 10000 is converted to an integer, the result is closer to 3449 than 3450.
I suggest you don't convert your numbers to integers. Write your program in terms of floating point numbers. You can't use the modulus (%) operator on floating point numbers to extract a digit; use the fmod function from the C math library (<cmath>) instead.
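For illustration, a sketch of that fmod-based scan (the function name and the digit count are my own). Note that the binary rounding error discussed in this thread can still flip a digit near the end of the scan:
#include <cmath>

int largestFractionDigit(double number, int digits) {
    double frac = number - std::floor(number);   // keep only the fractional part
    int largest = 0;
    for (int i = 0; i < digits; ++i) {
        frac *= 10.0;                            // shift the next digit into the ones place
        int digit = static_cast<int>(std::fmod(std::floor(frac), 10.0));
        if (digit > largest)
            largest = digit;
    }
    return largest;
}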
As other answers have indicated, binary floating-point is incapable of representing most decimal numbers exactly. Therefore, you must reconsider your problem statement. Some alternatives are:
The number is passed as a double (specifically, a 64-bit IEEE-754 binary floating-point value), and you wish to find the largest digit in the decimal representation of the exact value passed. In this case, the solution suggested by user millimoose will work (provided the asprintf or snprintf function used is of good quality, so that it does not incur rounding errors that prevent it from producing correctly rounded output).
The number is passed as a double but is intended to represent a number that is exactly representable as a decimal numeral with a known number of digits. In this case, the solution suggested by user millimoose again works, with the format specification altered to convert the double to decimal with the desired number of digits (e.g., instead of “%.64f”, you could use “%.6f”).
The function is changed to pass the number in another way, such as with decimal floating-point, as a scaled integer, or as a string containing a decimal numeral.
Once you have clarified the problem statement, it may be interesting to consider how to solve it with floating-point arithmetic, rather than calling library functions for formatted output. This is likely to have pedagogical value (and incidentally might produce a solution that is computationally more efficient than calling a library function).

Why do simple doubles like 1.82 end up being 1.819999999645634565360? [duplicate]

So I'm making a function to return the number of digits in the number passed, but I'm having some trouble with doubles.
I figure out how many digits are in it by multiplying it by 10 billion and then taking away digits one by one until the double ends up being 0. However, when putting in a double of value, say, .7904, I never exit the function: it keeps taking away digits, which never end up being 0, because the result of .7904 ends up being 7,903,999,988 and not 7,904,000,000.
How can I solve this problem? Thanks! Oh, and any other feedback on my code is WELCOME!
Here's the code of my function:
/////////////////////// Numb_Digits() ////////////////////////////////////////////////////
enum { DECIMALS = 10, WHOLE_NUMBS = 20, ALL = 30 };

template<typename T>
unsigned long int Numb_Digits(T numb, int scope)
{
    unsigned long int length = 0;
    switch (scope) {
    case DECIMALS:
        numb -= (int)numb; numb *= 10000000000; // 10 bil (10 zeros)
        for (; numb != 0; length++)
            numb -= ((int)(numb / pow((double)10, (double)(9 - length)))) * pow((double)10, (double)(9 - length));
        break;
    case WHOLE_NUMBS:
        numb = (int)numb; numb *= 10000000000;
        for (; numb != 0; length++)
            numb -= ((int)(numb / pow((double)10, (double)(9 - length)))) * pow((double)10, (double)(9 - length));
        break;
    case ALL:
        numb = numb; numb *= 10000000000;
        for (; numb != 0; length++)
            numb -= ((int)(numb / pow((double)10, (double)(9 - length)))) * pow((double)10, (double)(9 - length));
        break;
    default:
        break;
    }
    return length;
};

int main()
{
    double test = 345.6457;
    cout << Numb_Digits(test, ALL) << endl;
    cout << Numb_Digits(test, DECIMALS) << endl;
    cout << Numb_Digits(test, WHOLE_NUMBS) << endl;
    return 0;
}
It's because of their binary representation, which is discussed in depth here:
http://en.wikipedia.org/wiki/IEEE_754-2008
Basically, when a number can't be represented exactly, an approximation is used instead.
To compare floats for equality, check whether their difference is less than an arbitrary epsilon.
The easy summary of floating point arithmetic:
http://floating-point-gui.de/
Read this and you'll see the light.
If you're more on the math side, Goldberg's paper is always nice:
http://cr.yp.to/2005-590/goldberg.pdf
Long story short: real numbers are stored with a fixed number of bits, which makes their precision finite and unevenly distributed, leading to non-obvious behaviors. This is unrelated to the language; it is a design choice of how to handle real numbers as a whole.
This is because C++ (like most other languages) cannot store floating point numbers with infinite precision.
Floating point values are stored like this:
sign * coefficient * 2^exponent (or 10^exponent if you're using base 10).
The problem is that both the coefficient and the exponent are stored as finite integers.
This is a common problem with storing floating point in computer programs; you usually get a tiny rounding error.
The most common ways of dealing with this are:
Store the number as a fraction x/y (sketched below).
Use a delta that allows small deviations (if abs(x-y) < delta).
Use a third-party library such as GMP that can store numbers with arbitrary precision.
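A minimal sketch of the fraction idea (the type and function names are mine): storing values as integer ratios sidesteps binary rounding entirely.
struct Fraction {
    long long num, den;
};

// Exact: 1/10 + 2/10 == 3/10 here, unlike 0.1 + 0.2 in double.
Fraction add(Fraction a, Fraction b) {
    return { a.num * b.den + b.num * a.den, a.den * b.den };   // result is not reduced
}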
Regarding your question about counting decimals: there is no way of dealing with this if you get a double as input. You cannot be sure whether the user actually sent 1.819999999645634565360 or 1.82. Either you have to change your input or change the way your function works.
More info on floating point can be found here: http://en.wikipedia.org/wiki/Floating_point
This is because of the way the IEEE floating point standard represents numbers: most decimal values can only be approximated, and how the error surfaces depends on the operations performed. Never use logic like if (float == float), ever!
Float numbers are represented in the form significand * base^exponent (IEEE 754). In your case, float 1.82 = 1 + 0.5 + 0.25 + 0.0625 + ...
Since only a limited number of digits can be stored, there will be a rounding error whenever the number cannot be represented as a terminating expansion in the relevant base (base 2 in this case).
You should always check relative differences with floating point numbers, not absolute values.
You need to read this, too.
Computers don't store floating point numbers exactly. To accomplish what you are doing, you could store the original input as a string, and count the number of characters.

Heuristic to identify if a series of 4 bytes chunks of data are integers or floats

What's the best heuristic I can use to identify whether a chunk of X 4-byte values holds integers or floats? A human can do this easily, but I wanted to do it programmatically.
I realize that since every combination of bits results in a valid integer and (almost?) every combination also results in a valid float, there is no way to know for sure. But I would still like to identify the most likely candidate (which will virtually always be correct; or at least, a human could do it).
For example, let's take a series of 4-bytes raw data and print them as integers first and then as floats:
1 1.4013e-45
10 1.4013e-44
44 6.16571e-44
5000 7.00649e-42
1024 1.43493e-42
0 0
0 0
-5 -nan
11 1.54143e-44
Obviously they will be integers.
Now, another example:
1065353216 1
1084227584 5
1085276160 5.5
1068149391 1.33333
1083179008 4.5
1120403456 100
0 0
-1110651699 -0.1
1195593728 50000
These will obviously be floats.
PS: I'm using C++ but you can answer in any language, pseudo code or just in english.
The "common sense" heuristic from your example seems to basically amount to a range check. If one interpretation is very large (or a tiny fraction, close to zero), that is probably wrong. Check the exponent of the float interpretation and compare it to the exponent that results from a proper static cast of the integer interpretation to a float.
Looks like a Kolmogorov complexity issue. Basically, from your examples, whichever interpretation yields the shorter number (when printed as a string to be read by a human), be it integer or float, is the right answer for your heuristic.
Also, obviously, if the value is an invalid float, it is an integer :-)
Seems direct enough to implement.
You can probably "detect" it by looking at the high bits: with floats they'd generally be non-zero; with integers, they would be zero unless you're dealing with a very large number. So you could check whether (number & (1 << 30)) is 0 or not.
If both numbers are positive, your floats are reasonably large (greater than 10^-42), and your ints are reasonably small (less than 8*10^6), then the check is pretty simple. Treat the data as a float and compare to the least normalized float.
#include <cstdint>
#include <limits>

union float_or_int {
    float f;
    int32_t i;
};

bool is_positive_normalized_float(float_or_int& u) {
    return u.f >= std::numeric_limits<float>::min();
}
This assumes IEEE float and the same endianness between the CPU and the FPU.
A human can do this easily
A human can't do it at all. Ergo neither can a computer. There are 2^32 valid int values. A large number of them are also valid float values. There is no way of distinguishing the intent of the data other than by tagging it or by not getting into such a mess in the first place.
Don't attempt this.
You are going to be looking at the upper 8 or 9 bits. That's where the sign and exponent of a floating point value are. Values of 0x00, 0x80 and 0xFF here are pretty uncommon for valid float data.
In particular, if the upper 9 bits are all 0, then this is likely to be a valid floating point value only if all 32 bits are 0. Another way to say this is that if the exponent is 0, the mantissa should also be zero. If the upper bit is 1 and the next 8 bits are 0, this is legal but not likely to be valid: it represents -0.0, which is a legal floating point value, but a meaningless one.
To put this into numerical terms: if the upper byte is 0x00 (or 0x80), then the value has a magnitude of at most 2.35e-38. Planck's constant is 6.62e-34 m^2 kg/s; that's 4 orders of magnitude larger. The estimated diameter of a proton is much, much larger than that (about 1.6e-15 meters). The smallest non-zero value for audio data is about 2.3e-10. You aren't likely to see floating point values that are legitimate measurements of anything real and are smaller than 2.35e-38 but not zero.
Going the other direction, if the upper byte is 0xFF then the value is either infinite, a NaN, or larger in magnitude than 3.4e+38. The age of the universe is estimated to be 1.3e+10 years (about 4e+32 femtoseconds). The observable universe has roughly 1e+23 stars; Avogadro's number is 6.02e+23. Once again, float values larger than e+38 rarely show up in legitimate measurements.
This is not to say that the FPU can't load or produce such values, and you will certainly see them in intermediate values of calculations with modern FPUs. A modern FPU will load a floating point value that has an exponent of 0 but other bits that are not 0. These are called denormalized values. This is why you are seeing small positive integers show up as float values in the range of e-42 even though the normal range of a float only goes down to e-38.
An exponent of all 1s represents infinity. You probably won't find infinities in your data, but you would know better than I. -Infinity is 0xFF800000, +Infinity is 0x7F800000; any non-zero mantissa combined with the all-1s exponent is malformed, and malformed infinities are used as NaNs.
Loading a NaN into a float register can cause it to throw an exception, so you want to use integer math to do your guessing about whether your data is float or int until you are fairly certain of the answer.
If you know that your floats are all going to be actual values (no NaNs, INFs, denormals or other aberrant values) then you can use this as a criterion. In general an array of ints will have a high probability of containing "bad" float values.
I assume the following:
that you mean IEEE 754 single precision floating point numbers.
that the sign bit of the float is saved in the MSB of an int.
So here we go:
#include <cassert>
#include <cstdint>

static bool probablyFloat(uint32_t bits) {
    bool sign = (bits & 0x80000000U) != 0;
    int exp = ((bits & 0x7f800000U) >> 23) - 127;
    uint32_t mant = bits & 0x007fffff;
    (void)sign; // the heuristic is symmetric in the sign
    // +- 0.0
    if (exp == -127 && mant == 0)
        return true;
    // +- 1 billionth to 1 billion
    if (-30 <= exp && exp <= 30)
        return true;
    // some value with only a few binary digits
    if ((mant & 0x0000ffff) == 0)
        return true;
    return false;
}

int main() {
    assert(probablyFloat(1065353216));
    assert(probablyFloat(1084227584));
    assert(probablyFloat(1085276160));
    assert(probablyFloat(1068149391));
    assert(probablyFloat(1083179008));
    assert(probablyFloat(1120403456));
    assert(probablyFloat(0));
    assert(probablyFloat(-1110651699));
    assert(probablyFloat(1195593728));
    return 0;
}
Simplifying what Alan said, I'd look ONLY at the integer form and say: if the number is bigger than 99999999, it's almost definitely a float.
This has the advantage that it's fast, easy, and avoids NaN issues.
It has the disadvantage that it's pretty much full of crap... I didn't actually look at what floats these would represent or anything, but it looks reasonable from your examples...
In any case, this is a heuristic, so it's GONNA be full of crap and not always work anyway...
Measure with a micrometer, mark with chalk, cut with an axe.
Here is a heuristic I came up with, based on #kriss' idea. After a brief look at some of my data, it seems to work fairly well.
I am using it in a disassembler to detect if a 32-bit value was likely originally an integer or float literal.
public class FloatUtil {
    private static final int canonicalFloatNaN = Float.floatToRawIntBits(Float.NaN);
    private static final int maxFloat = Float.floatToRawIntBits(Float.MAX_VALUE);
    private static final int piFloat = Float.floatToRawIntBits((float)Math.PI);
    private static final int eFloat = Float.floatToRawIntBits((float)Math.E);
    private static final DecimalFormat format = new DecimalFormat("0.####################E0");

    public static boolean isLikelyFloat(int value) {
        // Check for some common named float values
        if (value == canonicalFloatNaN ||
            value == maxFloat ||
            value == piFloat ||
            value == eFloat) {
            return true;
        }

        // Check for some named integer values
        if (value == Integer.MAX_VALUE || value == Integer.MIN_VALUE) {
            return false;
        }

        // a non-canonical NaN is more likely to be an integer
        float floatValue = Float.intBitsToFloat(value);
        if (Float.isNaN(floatValue)) {
            return false;
        }

        // Otherwise, whichever has a shorter scientific notation representation is more likely.
        // Integer wins the tie.
        String asInt = format.format(value);
        String asFloat = format.format(floatValue);

        // try to strip off any small imprecision near the end of the mantissa
        int decimalPoint = asFloat.indexOf('.');
        int exponent = asFloat.indexOf("E");
        int zeros = asFloat.indexOf("000");
        if (zeros > decimalPoint && zeros < exponent) {
            asFloat = asFloat.substring(0, zeros) + asFloat.substring(exponent);
        } else {
            int nines = asFloat.indexOf("999");
            if (nines > decimalPoint && nines < exponent) {
                asFloat = asFloat.substring(0, nines) + asFloat.substring(exponent);
            }
        }

        return asFloat.length() < asInt.length();
    }
}
And here are some of the values it works for (and a couple it doesn't)
@Test
public void isLikelyFloatTest() {
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(1.23f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(1.0f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(Float.NaN)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(Float.NEGATIVE_INFINITY)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(Float.POSITIVE_INFINITY)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(1e-30f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(1000f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(1f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(-1f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(-5f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(1.3333f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(4.5f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(.1f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(50000f)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(Float.MAX_VALUE)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits((float)Math.PI)));
    Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits((float)Math.E)));

    // Float.MIN_VALUE is equivalent to integer value 1. this should be detected as an integer
    // Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(Float.MIN_VALUE)));

    // This one doesn't quite work. It has a series of 2 0's, but we only strip 3 0's or more
    // Assert.assertTrue(FloatUtil.isLikelyFloat(Float.floatToRawIntBits(1.33333f)));

    Assert.assertFalse(FloatUtil.isLikelyFloat(0));
    Assert.assertFalse(FloatUtil.isLikelyFloat(1));
    Assert.assertFalse(FloatUtil.isLikelyFloat(10));
    Assert.assertFalse(FloatUtil.isLikelyFloat(100));
    Assert.assertFalse(FloatUtil.isLikelyFloat(1000));
    Assert.assertFalse(FloatUtil.isLikelyFloat(1024));
    Assert.assertFalse(FloatUtil.isLikelyFloat(1234));
    Assert.assertFalse(FloatUtil.isLikelyFloat(-5));
    Assert.assertFalse(FloatUtil.isLikelyFloat(-13));
    Assert.assertFalse(FloatUtil.isLikelyFloat(-123));
    Assert.assertFalse(FloatUtil.isLikelyFloat(20000000));
    Assert.assertFalse(FloatUtil.isLikelyFloat(2000000000));
    Assert.assertFalse(FloatUtil.isLikelyFloat(-2000000000));
    Assert.assertFalse(FloatUtil.isLikelyFloat(Integer.MAX_VALUE));
    Assert.assertFalse(FloatUtil.isLikelyFloat(Integer.MIN_VALUE));
    Assert.assertFalse(FloatUtil.isLikelyFloat(Short.MIN_VALUE));
    Assert.assertFalse(FloatUtil.isLikelyFloat(Short.MAX_VALUE));
}

How to detect if a base 10 decimal can be represented exactly in base 2

As part of a numerical library test I need to choose base 10 decimal numbers that can be represented exactly in base 2. How do you detect in C++ if a base 10 decimal number can be represented exactly in base 2?
My first guess is as follows:
bool canBeRepresentedInBase2(const double &pNumberInBase10)
{
    //check if a number in base 10 can be represented exactly in base 2
    //reference: http://en.wikipedia.org/wiki/Binary_numeral_system
    bool funcResult = false;
    int nbOfDoublings = 16 * 3;
    double doubledNumber = pNumberInBase10;
    for (int i = 0; i < nbOfDoublings; i++)
    {
        doubledNumber = 2 * doubledNumber;
        double intPart;
        double fracPart = modf(doubledNumber / 2, &intPart);
        if (fracPart == 0) //number can be represented exactly in base 2
        {
            funcResult = true;
            break;
        }
    }
    return funcResult;
}
I tested this function with the following values: -1.0/4.0, 0.0, 0.1, 0.2, 0.205, 1.0/3.0, 7.0/8.0, 1.0, 256.0/255.0, 1.02, 99.005. It returns true for -1.0/4.0, 0.0, 7.0/8.0, 1.0, 99.005 which is correct.
Any better ideas?
I think what you are looking for is a number whose fractional portion is a sum of negative powers of 2 (that is, of terms of the form 1 over a power of 2). I believe such numbers should always be representable exactly in IEEE floats/doubles.
For example:
0.375 = 1/4 + 1/8, which should have an exact representation.
If you want to generate such numbers, you could try something like this:
#include <iostream>
#include <cstdlib>
#include <ctime>

int main() {
    srand(time(0));
    double value = 0.0;
    for (int i = 1; i < 256; i *= 2) {
        // doesn't matter, some random probability of including this
        // fraction in our sequence..
        if ((rand() % 3) == 0) {
            value += (1.0 / static_cast<double>(i));
        }
    }
    std::cout << value << std::endl;
}
EDIT: I believe your function has a broken interface. It would be better if you had this:
bool canBeRepresentedExactly(int numerator, int denominator);
because not all fractions have exact representations, but the moment you shove it into a double, you've chosen a representation in binary... defeating the purpose of the test.
If you're checking whether it's representable in binary, the answer will always be true: if your method takes a double as the parameter, the number is already represented in binary (double is a binary type, usually 64 bits). Looking at your code, I think you're actually trying to see if it can be represented exactly as an integer, in which case you can just cast to int, then back to double, and compare with the original. Any integer stored in a double that's within the range representable by an int should be exact, IIRC, because a 64-bit double has 53 bits of mantissa (and I'm assuming a 32-bit int). So if they're equal, it's an integer.
If you're passing in a double then, by definition, it has already been represented in binary, and if it couldn't be represented exactly, the accuracy is already lost.
Maybe try passing the numerator and denominator of the fraction to the function instead. Then you haven't lost accuracy, and you can check whether a binary representation equal to the fraction you passed in can be constructed.
As rmeador has pointed out, it might not be a good idea to accept the double, because the number has already been converted to a double, a possible approximation of the number that you're trying to check.
So, in a very abstract way, you should split your check into integers and decimals. Integers should not be so large that the mantissa cannot express them all (e.g. 9007199254740993 cannot be represented properly by a 64-bit fp).
Decimal parts may be a bit easier to reason about: if the fractional part (e.g. yyy in xxx.yyy), written as a fraction in lowest terms, has any factor other than 2 in its denominator, the binary expansion repeats forever trying to represent it. It's the same reason why 1/3 cannot be represented with finitely many digits in base 10 = base (2*5)... see Recurring Decimal.
EDIT: As the comments pointed out, phrasing it in terms of the factors of the reduced denominator is the mathematically correct way to say it...
As others have mentioned, your method doesn't do what you mean, since you pass a number already represented as a (binary) double. What the method actually detects is whether the number you passed is of the form integer/2^48. This should fail for numbers like 1 + 2^-50, which is binary, and 259/255, which isn't.
If you really want to test a number for being exactly representable by a finite binary string, you have to pass the number in an exact form.
You can't pass in a double because it has already lost precision. You should be able to use the toString() method of Double to check for this (example in Java):
public static boolean canBeRepresentedInBase2(String thenumber)
{
    // Returns true if the parsed double did not lose precision.
    // Only works for numbers that are not converted into scientific notation by toString.
    return thenumber.equals(Double.toString(Double.parseDouble(thenumber)));
}
You asked for C++ but maybe this algorithm will help. I use "EE" to mean "exactly expressible as a float."
Start with a decimal representation of the number you want to test. Remove any trailing zeroes (that is, 0.123450000 becomes 0.12345).
1) If the number is not an integer, check to see if the rightmost digit is 5. If it's not, then stop -- the number is not EE.
2) Multiply the number by 2. If the result is an integer, then stop -- the number is EE. Otherwise, go back to step 1.
I don't have rigorous proof for this but a "warm fuzzy." Fire up Calculator and enter your favorite fractional power of 2, like 0.0000152587890625. Add it to itself a few dozen times (I just hit "+" once then "=" a bunch of times). If there are any non-zero digits to the right of the decimal point, the last digit is always 5.
Here is the code in C#, and it works because it operates on decimal data: there are none of the inherent rounding errors that show up in the original code, which uses double. (decimal in C# stores numbers in base 10 instead of base 2, which is what double uses.)
static bool canBeRepresentedInBase2(decimal pNumberInBase10)
{
    //check if a number in base 10 can be represented exactly in base 2
    //reference: http://en.wikipedia.org/wiki/Binary_numeral_system
    bool funcResult = false;
    int nbOfDoublings = 16 * 3;
    decimal doubledNumber = pNumberInBase10;
    for (int i = 0; i < nbOfDoublings; i++)
    {
        doubledNumber = 2 * doubledNumber;
        decimal intPart;
        decimal fracPart = ModF(doubledNumber / 2, out intPart);
        if (fracPart == 0) //number can be represented exactly in base 2
        {
            funcResult = true;
            break;
        }
    }
    return funcResult;
}

static decimal ModF(decimal number, out decimal intPart)
{
    intPart = Math.Floor(number);
    decimal fractional = number - (intPart);
    return fractional;
}
Tested with the following code (where WL does a Console.WriteLine; SnippetCompiler):
WL(canBeRepresentedInBase2(-1.0M/4.0M)); //true
WL(canBeRepresentedInBase2(0.0M)); //true
WL(canBeRepresentedInBase2(0.1M)); //false
WL(canBeRepresentedInBase2(0.2M)); //false
WL(canBeRepresentedInBase2(0.205M)); //false
WL(canBeRepresentedInBase2(1.0M/3.0M)); //false
WL(canBeRepresentedInBase2(7.0M/8.0M)); //true
WL(canBeRepresentedInBase2(1.0M)); //true
WL(canBeRepresentedInBase2(256.0M/255.0M)); //false
WL(canBeRepresentedInBase2(1.02M)); //false
WL(canBeRepresentedInBase2(99.005M)); //false
WL(canBeRepresentedInBase2(2.53M)); //false
Or even easier:
return pNumber == floor(pNumber);
On the other hand, if you have some weird fractional representation (numerator denominator pair, or string with a decimal in it, or something), and you really do want to know if the value can be exactly represented as a double, it's a bit harder.
But you would need a different parameter(s) for that...
Given a number r, it can be represented exactly with finite precision in base 2 iff r can be written as r = m/2^n, where m and n are integers and n >= 0.
For example, 1/7 doesn't have a finite binary expansion, and neither do 1/6 and 1/10.
But 1/4 + 1/32 + 1/1024 has a finite expansion in base 2.
PS: In general, you can express a number r with finitely many digits in a base b iff r = m/b^n, where m and n are integers and n >= 0.
PPS: As almost everybody has stated previously, using a double as input is a bad idea, because you are losing precision and will end up testing a different number.
I don't think this is what he's asking... I think he's looking for a solution that will tell him whether a number can be represented EXACTLY in binary form. For example, 33.3 repeating: that's a number that cannot be represented exactly in binary, because the expansion goes on forever, so depending on your FPU settings it will be represented as something like "33.333333333333336". So it looks like his method will do the job. I don't know of a better way off the top of my head.
Ignoring the general criticism of using a double...
For a general finite decimal, you can determine if it has a finite representation in binary with the following algorithm:
Extract the fractional part of the decimal, f.
Determine f x 10^b = c, where b and c are integers.
Determine 2^d >= 10^b, where d is an integer.
If c x 2^d / 10^b is an integer, then the decimal has a finite representation in binary. Otherwise, it doesn't.
You can generalize this to any two bases.
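Following this recipe, here is a sketch that works on a decimal string so that no precision is lost up front (the function name is mine). It reduces the test to divisibility of the fraction digits c by 5^b, which is equivalent to the last step above; being a sketch, it overflows for very long fractions:
#include <cstdint>
#include <string>

bool finiteInBase2(const std::string& decimal) {
    std::string::size_type dot = decimal.find('.');
    if (dot == std::string::npos)
        return true;                                 // integers always qualify
    std::string frac = decimal.substr(dot + 1);      // the b digits of c
    while (!frac.empty() && frac.back() == '0')
        frac.pop_back();                             // trailing zeros don't count
    if (frac.empty())
        return true;
    uint64_t c = std::stoull(frac);
    uint64_t p = 1;                                  // becomes 5^b
    for (std::string::size_type i = 0; i < frac.size(); ++i)
        p *= 5;
    return c % p == 0;                               // c/10^b = m/2^n iff 5^b divides c
}
For example, finiteInBase2("0.375") is true (0.375 = 3/8), while finiteInBase2("99.005") is false.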