Comparing floats in their bit representations


Say I want a function that takes two floats (x and y), and I want to compare them using not their float representation but rather their bitwise representation as a 32-bit unsigned int. That is, a number like -459.5 has bit representation 0b11000011111001011100000000000000 or 0xC3E5C000 as a float, and I have an unsigned int with the same bit representation (corresponding to a decimal value 3286614016, which I don't care about). Is there any easy way for me to perform an operation like <= on these floats using only the information contained in their respective unsigned int counterparts?

You must do a signed compare unless you ensure that all the original values were positive. You must use an integer type that is the same size as the original floating point type. Each chip may have a different internal format, so comparing values from different chips as integers is most likely to give misleading results.
Most float formats look something like this: sxxxmmmm
s is a sign bit
xxx is an exponent
mmmm is the mantissa
The value represented will then be something like: 1mmm << (xxx-k)
1mmm because there is an implied leading 1 bit unless the value is zero.
If xxx < k then it will be a right shift. k is near but not equal to half the largest value that could be expressed by xxx. It is adjusted for the size of the mantissa.
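All to say that, disregarding NaN, comparing floating point values as signed integers of the same size will yield meaningful results as long as at least one of the two values is non-negative. Negative values are stored as sign and magnitude rather than two's complement, so two negative values compare in the reverse order as integers. The formats are designed that way so that floating point comparisons are little more costly than integer comparisons. There are compiler optimizations to turn off NaN checks so that the comparisons are straight integer comparisons if the floating point format of the chip supports it.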
As an integer, NaN is greater than infinity is greater than finite values. If you try an unsigned compare, all the negative values will be larger than the positive values, just like signed integers cast to unsigned.
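The behaviour described above can be checked with a small sketch. It assumes a 32-bit IEEE 754 float, and the helper names bits_as_signed and raw_signed_less are made up for illustration. It copies the bits with std::memcpy and compares them as signed integers:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copy the bits of a float into a signed 32-bit integer.
// Assumption: float is a 32-bit IEEE 754 type on this platform.
std::int32_t bits_as_signed(float x) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "float must be 32 bits");
    std::int32_t i;
    std::memcpy(&i, &x, sizeof i);
    return i;
}

// Integer "less than" applied to the raw bit patterns.
bool raw_signed_less(float a, float b) {
    return bits_as_signed(a) < bits_as_signed(b);
}
```

With these helpers, raw_signed_less(1.0f, 2.0f) and raw_signed_less(-1.0f, 1.0f) agree with the float ordering, while raw_signed_less(-1.0f, -2.0f) is also true, which is the reversed ordering for two negative values.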

If you truly truly don't care about what the conversion yields, it isn't too hard. But the results are extremely non-portable, and you almost certainly won't get an ordering that at all resembles what you'd get by comparing the floats directly.
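typedef unsigned int TypeWithSameSizeAsFloat; // Fix this for your platform
bool compare1(float one, float two) {
    union Convert {
        float f;
        TypeWithSameSizeAsFloat i;
    };
    Convert lhs, rhs;
    lhs.f = one;
    rhs.f = two;
    // Reads a union member other than the one last written: many compilers
    // allow this, but standard C++ does not define the result.
    return lhs.i < rhs.i;
}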
bool compare2(float one, float two) {
return reinterpret_cast<TypeWithSameSizeAsFloat&>(one)
< reinterpret_cast<TypeWithSameSizeAsFloat&>(two);
}
Just understand the caveats, and choose your second type carefully. It's a near worthless exercise at any rate.

In a word, no. IEEE 754 might allow some kinds of hacks like this, but they do not work all the time and handle all cases, and some platforms do not use that floating point standard (such as doubles on x87 having 80 bit precision internally).
If you're doing this for performance reasons I suggest you strongly reconsider -- if it's faster to use the integer comparison the compiler will probably do it for you, and if it is not, you pay for a float to int conversion multiple times, when a simple comparison may be possible without moving the floats out of registers.

Maybe I'm misreading the question, but I suppose you could do this:
bool compare(float a, float b)
{
return *((unsigned int*)&a) < *((unsigned int*)&b);
}
But this assumes all kinds of things and also warrants the question of why you'd want to compare the bitwise representations of two floats.
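As a hedged alternative to the pointer cast, the bits can be copied with std::memcpy, which avoids the aliasing problem. This is only a sketch assuming a 32-bit IEEE 754 float. The names float_bits and bits_less are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Return the bit pattern of a float as an unsigned 32-bit integer.
// Assumption: float is a 32-bit IEEE 754 type on this platform.
std::uint32_t float_bits(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// Unsigned comparison of the raw bit patterns. This is NOT the float ordering:
// every negative float has its top bit set and so compares as "larger".
bool bits_less(float a, float b) {
    return float_bits(a) < float_bits(b);
}
```

The pattern 0xC3E5C000 quoted in the question is what float_bits(-459.5f) returns under these assumptions. bits_less(1.0f, 2.0f) matches the float ordering, but bits_less(1.0f, -1.0f) is also true.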

Related

Floating point Arithmetics

Today in my C++ programming lessons, my professor told me that one should never compare two floating point values directly.
So I tried this piece of code and found out the reason for his statement.
double l_Value=94.9;
printf("%.20lf",l_Value);
And I found the result to be 94.89999999 (some relative error).
I understand that floating numbers are not stored in the way one presents it to the code. Squeezing those ones and zeros in binary form involves some relative rounding errors.
I am looking for solutions to two problems.
1. Efficient way to compare two floating values.
2. How to add a floating value to another one. Example. Add 0.1111 to 94.4345 to get the exact value as 94.5456
Thanks in advance.

Efficient way to compare two floating values.
A simple double a,b; if (a == b) is an efficient way to compare two floating values. Yet as OP noticed, this may not meet the overall coding goal. Better ways depend on the context of the compare, something not supplied by OP. See far below.
How to add a floating value to another one. Example. Add 0.1111 to 94.4345 to get the exact value as 94.5456
Floating values as source code have effectively unlimited range and precision such as 1.23456789012345678901234567890e1234567. Conversion of this text to a double is limited typically to one of 2^64 different values. The closest is selected, but that may not be an exact match.
None of 0.1111, 94.4345, or 94.5456 can be represented exactly as a typical double.
OP has choices:
1) Use a type other than double or float. Various libraries offer decimal floating point types.
2) Limit code to rare platforms that support double in a base 10 form such that FLT_RADIX == 10.
3) Write your own code to handle user input like "0.1111" into a structure/string and perform the needed operations.
4) Treat user input as strings and then convert to some integer type, again with supporting routines to read, compute, and write.
5) Accept that floating point operations are not mathematically exact and handle round-off error.
double a = 0.1111;
printf("a: %.*e\n", DBL_DECIMAL_DIG -1 , a);
double b = 94.4345;
printf("b: %.*e\n", DBL_DECIMAL_DIG -1 , b);
double sum = a + b;
printf("sum: %.*e\n", DBL_DECIMAL_DIG -1 , sum);
printf("%.4f\n", sum);
Output
a: 1.1110000000000000e-01
b: 9.4434500000000000e+01
sum: 9.4545599999999993e+01
94.5456 // Desired textual output based on a rounded `sum` to the nearest 0.0001
More on #1
If an exact compare is not sought but some sort of "are the two values close enough?", a definition of "close enough" is needed - of which there are many.
The following "close enough" compares the distance by examining the ULP of the two numbers. It is a linear difference when the values are in the same power-of-two and becomes logarithmic otherwise. Of course, change of sign is an issue.
float example:
Consider all finite float ordered from most negative to most positive. The following, somewhat-portable code, returns an integer for each float with that same order.
uint32_t sequence_f(float x) {
union {
float f;
uint32_t u32;
} u;
assert(sizeof(float) == sizeof(uint32_t));
u.f = x;
if (u.u32 & 0x80000000) {
u.u32 ^= 0x80000000;
return 0x80000000 - u.u32;
}
// Non-negative values are placed above all negative ones so the order is kept.
return u.u32 + 0x80000000;
}
Now, to determine if two float are "close enough", simply compare two integers.
static bool close_enough(float x, float y, uint32_t ULP_delta) {
uint32_t ullx = sequence_f(x);
uint32_t ully = sequence_f(y);
if (ullx > ully) return (ullx - ully) <= ULP_delta;
return (ully - ullx) <= ULP_delta;
}
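A hedged C++ variant of the same idea is sketched below. It assumes a 32-bit IEEE 754 float and uses std::memcpy rather than a union. The names ordered_key and ulp_distance are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Map a float to an unsigned integer that sorts in the same order as the float.
// Assumption: float is a 32-bit IEEE 754 type; NaN is not handled.
std::uint32_t ordered_key(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    if (u & 0x80000000u) {
        // Negative: larger magnitude must give a smaller key.
        return 0x80000000u - (u ^ 0x80000000u);
    }
    // Non-negative: shift above every negative key.
    return u + 0x80000000u;
}

// Number of representable floats between a and b (0 for +0.0f and -0.0f).
std::uint32_t ulp_distance(float a, float b) {
    const std::uint32_t ka = ordered_key(a);
    const std::uint32_t kb = ordered_key(b);
    return ka > kb ? ka - kb : kb - ka;
}
```

Under these assumptions, 1.0f and std::nextafter(1.0f, 2.0f) are one step apart, and +0.0f and -0.0f map to the same key.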

The way I've usually done this is to have a custom equality comparison function. The basic idea is that you have a certain tolerance, say 0.0001 or something. Then you subtract your two numbers and take the absolute value of the difference, and if it is less than your tolerance you treat them as equal. There are other strategies that may be more appropriate for certain situations, of course.
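A minimal sketch of such a comparison function might look like the following. The name nearly_equal and its parameter names are only illustrative, and the tolerance is absolute, so it suits values of a known, moderate magnitude.

```cpp
#include <cassert>
#include <cmath>

// Treat two doubles as equal when they differ by less than an absolute tolerance.
// The caller chooses the tolerance; 0.0001 is just the example value from the text.
bool nearly_equal(double a, double b, double tolerance) {
    return std::fabs(a - b) < tolerance;
}
```

For example, nearly_equal(0.1111 + 94.4345, 94.5456, 0.0001) is true even though the sum is not stored exactly.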

Define for yourself a tolerance level e (for example, e=.0001) and check if abs(a-b) <= e
You aren't going to get an "exact" value with floating point. Ever. If you know in advance that you are using four decimals, and you want "exact", then you need to internally treat your numbers as integers and only display them as decimals. 944345 + 1111 = 945456
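The integer approach can be sketched as follows, assuming values are held as a count of ten-thousandths and are non-negative. The function name format_ten_thousandths is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Format a non-negative count of ten-thousandths as a decimal string,
// for example 945456 becomes "94.5456". All arithmetic stays in integers.
std::string format_ten_thousandths(std::int64_t units) {
    std::string fraction = std::to_string(units % 10000);
    // Left-pad the fractional part with zeros to exactly four digits.
    fraction.insert(0, 4 - fraction.size(), '0');
    return std::to_string(units / 10000) + "." + fraction;
}
```

Adding 944345 and 1111 as integers and formatting the result gives the exact text "94.5456" with no rounding step.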

Using scientific notation in for loops

I've recently come across some code which has a loop of the form
for (int i = 0; i < 1e7; i++){
}
I question the wisdom of doing this since 1e7 is a floating point type, and will cause i to be promoted when evaluating the stopping condition. Should this be a cause for concern?

The elephant in the room here is that the range of an int could be as small as -32767 to +32767, and the behaviour on assigning a larger value than this to such an int is undefined.
But, as for your main point, indeed it should concern you as it is a very bad habit. Things could go wrong as yes, 1e7 is a floating point double type.
The fact that i will be converted to a floating point due to type promotion rules is somewhat moot: the real damage is done if there is unexpected truncation of the apparent integral literal. By the way of a "proof by example", consider first the loop
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 18446744073709551615ULL; ){
std::cout << i << "\n";
}
This outputs every consecutive value of i in the range, as you'd expect. Note that std::numeric_limits<std::uint64_t>::max() is 18446744073709551615ULL, which is 1 less than the 64th power of 2. (Here I'm using a slide-like "operator" ++< which is useful when working with unsigned types. Many folk consider --> and ++< as obfuscating but in scientific programming they are common, particularly -->.)
Now on my machine, a double is an IEEE754 64 bit floating point. (Such a scheme is particularly good at representing powers of 2 exactly: IEEE754 can represent every power of 2 up to 2^1023 exactly.) So 18,446,744,073,709,551,616 (the 64th power of 2) can be represented exactly as a double. The nearest representable number before that is 18,446,744,073,709,549,568 (which is 2048 less), so 18,446,744,073,709,550,592 (which is 1024 less) lies exactly halfway between the two and rounds up to the 64th power of 2 under the usual round-to-nearest-even rule.
So now let's write the loop as
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 1.8446744073709551615e19; ){
std::cout << i << "\n";
}
On my machine that will only output one value of i: 18,446,744,073,709,550,592 (the number that we've already seen). This proves that 1.8446744073709551615e19 is a floating point type. If the compiler was allowed to treat the literal as an integral type then the output of the two loops would be equivalent.

It will work, assuming that your int is at least 32 bits.
However, if you really want to use exponential notation, it is better to define an integer constant outside the loop and use proper casting, like this:
const int MAX_INDEX = static_cast<int>(1.0e7);
...
for (int i = 0; i < MAX_INDEX; i++) {
...
}
Considering this, I'd say it is much better to write
const int MAX_INDEX = 10000000;
or if you can use C++14
const int MAX_INDEX = 10'000'000;

1e7 is a literal of type double, and usually double is 64-bit IEEE 754 format with a 52-bit mantissa. Roughly every tenth power of 2 corresponds to a third power of 10, so double should be able to represent integers up to at least 10^(5*3) = 10^15, exactly. And if int is 32-bit then int has roughly 10^(3*3) = 10^9 as max value (asking Google search it says that "2**31 - 1" = 2 147 483 647, i.e. twice the rough estimate).
So, in practice it's safe on current desktop systems and larger.
But C++ allows int to be just 16 bits, and on e.g. an embedded system with that small int, one would have Undefined Behavior.
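One way to make that check explicit is sketched below: a hypothetical helper, to_int_bound, that accepts a double such as 1e7 only when it is a whole number that fits in int. It assumes int is no wider than 53 bits, so that the int limits convert to double exactly.

```cpp
#include <cassert>
#include <limits>

// Convert a double such as 1e7 to an int loop bound.
// Returns false (leaving `out` untouched) when the value is out of range for int,
// is NaN, or is not a whole number.
bool to_int_bound(double value, int& out) {
    const double lo = static_cast<double>(std::numeric_limits<int>::min());
    const double hi = static_cast<double>(std::numeric_limits<int>::max());
    if (!(value >= lo && value <= hi)) {
        return false;  // also rejects NaN, for which both comparisons are false
    }
    const int candidate = static_cast<int>(value);  // in range, so well defined
    if (static_cast<double>(candidate) != value) {
        return false;  // value had a fractional part
    }
    out = candidate;
    return true;
}
```

On a platform with a 32-bit int, to_int_bound(1e7, n) succeeds and sets n to 10000000. With a 16-bit int it would return false instead of invoking undefined behaviour.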

If the intention is to loop for an exact integer number of iterations, for example when iterating over exactly all the elements in an array, then comparing against a floating point value is maybe not such a good idea, solely for accuracy reasons. Since the implicit cast of an integer to float will truncate integers toward zero there's no real danger of out-of-bounds access; it will just abort the loop short.
Now the question is: When do these effects actually kick in? Will your program experience them? The floating point representation usually used these days is IEEE 754. As long as the exponent is 0 a floating point value is essentially an integer. C double precision floats have 52 bits for the mantissa, which gives you integer precision to a value of up to 2^52, which is in the order of about 1e15. Unless you specify with a suffix f that you want a floating point literal to be interpreted as single precision, the literal will be double precision and the implicit conversion will target that as well. So as long as your loop end condition is less than 2^52 it will work reliably!
Now one question you have to think about on the x86 architecture is efficiency. The very first 80x87 FPUs came in a different package, and later a different chip, and as a result getting values into the FPU registers is a bit awkward on the x86 assembly level. Depending on what your intentions are it might make the difference in runtime for a realtime application; but that's premature optimization.
TL;DR: Is it safe to do? Most certainly yes. Will it cause trouble? It could cause numerical problems. Could it invoke undefined behavior? Depends on how you use the loop end condition, but if i is used to index an array and for some reason the array length ended up in a floating point variable, always truncating toward zero, it's not going to cause a logical problem. Is it a smart thing to do? Depends on the application.

Rounding in C++ and round-tripping numbers

I have a class that internally represents some quantity in fixed point as 32-bit integer with somewhat arbitrary denominator (it is neither power of 2 nor power of 10).
For communicating with other applications the quantity is converted to plain old double on output and back on input. As code inside the class it looks like:
int32_t quantity;
double GetValue() { return double(quantity) / DENOMINATOR; }
void SetValue(double x) { quantity = x * DENOMINATOR; }
Now I need to ensure that if I output some value as double and read it back, I will always get the same value back. I.e. that
x.SetValue(x.GetValue());
will never change x.quantity (x is arbitrary instance of the class containing the above code).
The double representation has more digits of precision, so it should be possible. But it will almost certainly not be the case with the simplistic code above.
What rounding do I need to use and
How can I find the critical would-be corner cases to test that the rounding is indeed correct?

Any 32-bit integer will be represented exactly when you convert it to a double, but when you divide then multiply by an arbitrary value you will get a similar value but not exactly the same. You should lose at most one bit per operation, which means your double will be almost the same, prior to casting back to an int.
However, since int casts are truncations, you will get the wrong result when very minor errors turn 2.000 into 1.999, thus what you need to do is a simple rounding task prior to casting back.
You can use std::lround() for this if you have C++11, else you can write your own rounding function.
You probably don't care about fairness much here, so the common int(doubleVal+0.5) will work for positives. If as seems likely, you have negatives, try this:
int round(double d) { return d<0?d-0.5:d+0.5; }
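Applied to the class from the question, the rounding fix could be sketched like this. The denominator 360.0 is an arbitrary stand-in for the question's unspecified DENOMINATOR, and the free functions to_double, from_double and round_trips are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Arbitrary example denominator (neither a power of 2 nor of 10), assumed for illustration.
constexpr double kDenominator = 360.0;

// Fixed-point quantity to double, as in GetValue().
double to_double(std::int32_t quantity) {
    return static_cast<double>(quantity) / kDenominator;
}

// Double back to fixed-point, rounding to nearest instead of truncating.
std::int32_t from_double(double x) {
    return static_cast<std::int32_t>(std::lround(x * kDenominator));
}

// True when converting out and back leaves the quantity unchanged.
bool round_trips(std::int32_t quantity) {
    return from_double(to_double(quantity)) == quantity;
}
```

The error accumulated by one division and one multiplication is far below 0.5 for any 32-bit quantity, so rounding to nearest recovers the original integer. The extremes of the int32_t range are natural corner cases to test.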

The problem you describe is the same problem which exists with converting between binary and decimal representation just with different bases. At least it exists if you want to have the double representation to be a good approximation of the original value (otherwise you could just multiply the 32 bit value you have with your fixed denominator and store the result in a double).
Assuming you want the double representation be a good approximation of your actual value the conversions are nontrivial! The conversion from your internal representation to double can be done using Dragon4 ("How to print floating point numbers accurately", Steele & White) or Grisu ("How to print floating point numbers quickly and accurately", Loitsch; I'm not sure if this algorithm is independent from the base, though). The reverse can be done using Bellerophon ("How to read floating point numbers accurately", Clinger). These algorithms aren't entirely trivial, though...

Float to int number conversion in c++

The following C++ code:
union float2bin{
float f;
int i;
};
float2bin obj;
obj.f=2.243;
cout<<obj.i;
gives some garbage value as output.
But
union float2bin{
float f;
float i;
};
float2bin obj;
obj.f=2.243;
cout<<obj.i;
gives output same as the value of f i.e 2.243
Compiler GCC has int & float of same size i.e 4 but then what's the reason behind this output behaviour?

The reason is because it is undefined behavior. In practice,
you'll get away with reading an int from something that was
stored as a float on most machines, but you'll read garbage
values unless you know what to expect. Doing it in the other
direction will likely cause the program to crash for certain
values of int.
Under the hood, of course, integral values and floating point
values have different representations, at least on most
machines. (On some Unisys mainframes, your code would do what
you expect. But they're not the most common systems around, and
you probably don't have one on your desktop.) Basically,
regardless of the type, you have a sequence of bits, which will
be interpreted by the hardware in some way. C++ requires
integers to use a pure binary representation, which constrains
the representation somewhat. It also requires a very large
range for floating point values, and more or less requires some
form of exponential notation, with some bits representing the
exponent, and others the mantissa. With different encodings for
each.

The reason is because floating point values are stored in a more complicated way, partitioning the 32 bits into a sign, an exponent and a fraction. If these bits are read as an integer straight off, it will look like a very different value.
The important point here is that if you create a union, you are saying that it is one contiguous block of memory that can be interpreted in two different ways. Nowhere in this mechanism does it account for a safe conversion between float and int, in which case some kind of rounding occurs.
Update: What you might want is
float f = 10.25f;
int i = (int)f;
// Will give you i = 10
However, the union approach is closer to this:
float f = 10.25f;
int i = *((int *)&f);
// Will give you some seemingly arbitrary value
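The two behaviours can be put side by side in a small sketch. It assumes a 32-bit IEEE 754 float and uses std::memcpy for the bit copy instead of the pointer cast. The function names truncate_to_int and bit_pattern are illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Value conversion: the fractional part is discarded.
int truncate_to_int(float f) {
    return static_cast<int>(f);
}

// Bit reinterpretation: the same 32 bits read as an unsigned integer.
// Assumption: float is a 32-bit IEEE 754 type on this platform.
std::uint32_t bit_pattern(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
```

For 10.25f the first function returns 10, while the second returns 0x41240000 (sign 0, biased exponent 130, fraction 0x240000), which is why the union output looks arbitrary when printed as an integer.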

How to convert 32 bit integers to 32 bit floats so that the ordering is preserved?

I have two 32 bit integers i1,i2 which I need to convert to floats f1,f2 in such a way that their relative ordering is preserved (i.e. i1 < i2 => f1 < f2)
Will a reinterpret_cast do the trick? Is there some better way?

If the integer values are less than 2^24, just convert the values:
float f1 = i1, f2 = i2;
For larger values, you will lose precision and two distinct integers may convert to the same floating point value.
On the other hand, you could copy the bit pattern. If your floats are IEEE754, then this requires that the sign bits agree and that neither integer represents some form of NaN. (If the sign bits do not agree, you must beware of -0.f == +0.f.) To copy the binary representation:
float f1;
std::copy(reinterpret_cast<const char*>(&i1),
reinterpret_cast<const char*>(&i1) + 4,
reinterpret_cast<char*>(&f1));
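Both options from this answer can be sketched as small helpers, assuming a 32-bit IEEE 754 float. The names to_float_by_value and to_float_by_bits are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Value conversion: order is preserved, but distinct integers at or above 2^24
// can collapse onto the same float.
float to_float_by_value(std::int32_t i) {
    return static_cast<float>(i);
}

// Bit copy: the integer's bits become the float's bits.
// Assumption: float is a 32-bit IEEE 754 type; NaN patterns are the caller's problem.
float to_float_by_bits(std::int32_t i) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "float must be 32 bits");
    float f;
    std::memcpy(&f, &i, sizeof f);
    return f;
}
```

With value conversion, 16777216 (2^24) and 16777217 map to the same float, so strict ordering is lost there. With the bit copy, 0x3F800000 and 0x40000000 become 1.0f and 2.0f, which keeps their order.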

Integer inherently stores more information in the same bit width than a float on a 32-bit machine, because of values that are reserved for NaN space and infinities. So in short, cannot be done.
int range: -2,147,483,648 to 2,147,483,647
float precision: 7 digits
I think that it would be possible if the nature of the problem limits the range of integer values somehow. Otherwise use a double-precision value. It has 15-16 digits in mantissa.
Keep in mind that in C++ the int type can have different range depending on your native pointer size. On a 16-bit machine, int range is -32k to +32k.
Also, keep in mind that there's no promise of correctness for two (binary) least-significant bits, even in a cast-to-float scenario.
http://steve.hollasch.net/cgindex/coding/ieeefloat.html

When you're casting a int to a float the value is not changed in general, therefore the relative order is preserved.
The reinterpret_cast cannot be used for this purpose, since it is only usable for pointers, e. g. converting an object to a kind of "flat" memory representation, i. e. it copies the bit pattern.
