Float to int number conversion in c++ - c++

The following C++ code:
union float2bin{
float f;
int i;
};
float2bin obj;
obj.f=2.243;
cout<<obj.i;
gives output as some garbage value .
But
union float2bin{
float f;
float i;
};
float2bin obj;
obj.f=2.243;
cout<<obj.i;
gives output same as the value of f i.e 2.243
Compiler GCC has int & float of same size i.e 4 but then what's the reason behind this output behaviour?

The reason is because it is undefined behavior. In practice,
you'll get away with reading an int from something that was
stored as a float on most machines, but you'll read garbage
values unless you know what to expect. Doing it in the other
direction will likely cause the program to crash for certain
values of int.
Under the hood, of course, integral values and floating point
values have different representations, at least on most
machines. (On some Unisys mainframes, your code would do what
you expect. But they're not the most common systems around, and
you probably don't have one on your desktop.) Basically,
regardless of the type, you have a sequence of bits, which will
be interpreted by the hardware in some way. C++ requires
integers to use a pure binary representation, which constrains
the representation somewhat. It also requires a very large
range for floating point values, and more or less requires some
form of exponential notation, with some bits representing the
exponent, and others the mantissa. With different encodings for
each.

The reason is because floating point values are stored in a more complicated way, partitioning the 32 bits into a sign, an exponent and a fraction. If these bits are read as an integer straight off, it will look like a very different value.
The important point here is that if you create a union, you are saying that it is one contiguous block of memory that can be interpreted in two different ways. No where in this mechanism does it account for a safe conversion between float and int, in which case some kind of rounding occurs.
Update: What you might want is
float f = 10.25f;
int i = (int)f;
// Will give you i = 10
However, the union approach is closer to this:
float f = 10.25f;
int i = *((int *)&f);
// Will give you some seemingly arbitrary value

Related

Using scientific notation in for loops

I've recently come across some code which has a loop of the form
for (int i = 0; i < 1e7; i++){
}
I question the wisdom of doing this since 1e7 is a floating point type, and will cause i to be promoted when evaluating the stopping condition. Should this be of cause for concern?
The elephant in the room here is that the range of an int could be as small as -32767 to +32767, and the behaviour on assigning a larger value than this to such an int is undefined.
But, as for your main point, indeed it should concern you as it is a very bad habit. Things could go wrong as yes, 1e7 is a floating point double type.
The fact that i will be converted to a floating point due to type promotion rules is somewhat moot: the real damage is done if there is unexpected truncation of the apparent integral literal. By the way of a "proof by example", consider first the loop
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 18446744073709551615ULL; ){
std::cout << i << "\n";
}
This outputs every consecutive value of i in the range, as you'd expect. Note that std::numeric_limits<std::uint64_t>::max() is 18446744073709551615ULL, which is 1 less than the 64th power of 2. (Here I'm using a slide-like "operator" ++< which is useful when working with unsigned types. Many folk consider --> and ++< as obfuscating but in scientific programming they are common, particularly -->.)
Now on my machine, a double is an IEEE754 64 bit floating point. (Such as scheme is particularly good at representing powers of 2 exactly - IEEE754 can represent powers of 2 up to 1022 exactly.) So 18,446,744,073,709,551,616 (the 64th power of 2) can be represented exactly as a double. The nearest representable number before that is 18,446,744,073,709,550,592 (which is 1024 less).
So now let's write the loop as
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 1.8446744073709551615e19; ){
std::cout << i << "\n";
}
On my machine that will only output one value of i: 18,446,744,073,709,550,592 (the number that we've already seen). This proves that 1.8446744073709551615e19 is a floating point type. If the compiler was allowed to treat the literal as an integral type then the output of the two loops would be equivalent.
It will work, assuming that your int is at least 32 bits.
However, if you really want to use exponential notation, you should better define an integer constant outside the loop and use proper casting, like this:
const int MAX_INDEX = static_cast<int>(1.0e7);
...
for (int i = 0; i < MAX_INDEX; i++) {
...
}
Considering this, I'd say it is much better to write
const int MAX_INDEX = 10000000;
or if you can use C++14
const int MAX_INDEX = 10'000'000;
1e7 is a literal of type double, and usually double is 64-bit IEEE 754 format with a 52-bit mantissa. Roughly every tenth power of 2 corresponds to a third power of 10, so double should be able to represent integers up to at least 105*3 = 1015, exactly. And if int is 32-bit then int has roughly 103*3 = 109 as max value (asking Google search it says that "2**31 - 1" = 2 147 483 647, i.e. twice the rough estimate).
So, in practice it's safe on current desktop systems and larger.
But C++ allows int to be just 16 bits, and on e.g. an embedded system with that small int, one would have Undefined Behavior.
If the intention to loop for a exact integer number of iterations, for example if iterating over exactly all the elements in an array then comparing against a floating point value is maybe not such a good idea, solely for accuracy reasons; since the implicit cast of an integer to float will truncate integers toward zero there's no real danger of out-of-bounds access, it will just abort the loop short.
Now the question is: When do these effects actually kick in? Will your program experience them? The floating point representation usually used these days is IEEE 754. As long as the exponent is 0 a floating point value is essentially an integer. C double precision floats 52 bits for the mantissa, which gives you integer precision to a value of up to 2^52, which is in the order of about 1e15. Without specifying with a suffix f that you want a floating point literal to be interpreted single precision the literal will be double precision and the implicit conversion will target that as well. So as long as your loop end condition is less 2^52 it will work reliably!
Now one question you have to think about on the x86 architecture is efficiency. The very first 80x87 FPUs came in a different package, and later a different chip and as aresult getting values into the FPU registers is a bit awkward on the x86 assembly level. Depending on what your intentions are it might make the difference in runtime for a realtime application; but that's premature optimization.
TL;DR: Is it safe to to? Most certainly yes. Will it cause trouble? It could cause numerical problems. Could it invoke undefined behavior? Depends on how you use the loop end condition, but if i is used to index an array and for some reason the array length ended up in a floating point variable always truncating toward zero it's not going to cause a logical problem. Is it a smart thing to do? Depends on the application.

Rounding in C++ and round-tripping numbers

I have a class that internally represents some quantity in fixed point as 32-bit integer with somewhat arbitrary denominator (it is neither power of 2 nor power of 10).
For communicating with other applications the quantity is converted to plain old double on output and back on input. As code inside the class it looks like:
int32_t quantity;
double GetValue() { return double(quantity) / DENOMINATOR; }
void SetValue(double x) { quantity = x * DENOMINATOR; }
Now I need to ensure that if I output some value as double and read it back, I will always get the same value back. I.e. that
x.SetValue(x.GetValue());
will never change x.quantity (x is arbitrary instance of the class containing the above code).
The double representation has more digits of precision, so it should be possible. But it will almost certainly not be the case with the simplistic code above.
What rounding do I need to use and
How can I find the critical would-be corner cases to test that the rounding is indeed correct?
Any 32 bits will be represented exactly when you convert to a double, but when you divide then multiply by an arbitrary value you will get a similar value but not exactly the same. You should lose at most one bit per operations, which means your double will be almost the same, prior to casting back to an int.
However, since int casts are truncations, you will get the wrong result when very minor errors turn 2.000 into 1.999, thus what you need to do is a simple rounding task prior to casting back.
You can use std::lround() for this if you have C++11, else you can write you own rounding function.
You probably don't care about fairness much here, so the common int(doubleVal+0.5) will work for positives. If as seems likely, you have negatives, try this:
int round(double d) { return d<0?d-0.5:d+0.5; }
The problem you describe is the same problem which exists with converting between binary and decimal representation just with different bases. At least it exists if you want to have the double representation to be a good approximation of the original value (otherwise you could just multiply the 32 bit value you have with your fixed denominator and store the result in a double).
Assuming you want the double representation be a good approximation of your actual value the conversions are nontrivial! The conversion from your internal representation to double can be done using Dragon4 ("How to print floating point numbers accurately", Steele & White) or Grisu ("How to print floating point numbers quickly and accurately", Loitsch; I'm not sure if this algorithm is independent from the base, though). The reverse can be done using Bellerophon ("How to read floating point numbers accurately", Clinger). These algorithms aren't entirely trivial, though...

double and float comparison [duplicate]

This question already has answers here:
Comparing float and double
(3 answers)
Closed 7 years ago.
According to this post, when comparing a float and a double, the float should be treated as double.
The following program, does not seem to follow this statement. The behaviour looks quite unpredictable.
Here is my program:
void main(void)
{
double a = 1.1; // 1.5
float b = 1.1; // 1.5
printf("%X %X\n", a, b);
if ( a == b)
cout << "success " <<endl;
else
cout << "fail" <<endl;
}
When I run the following program, I get "fail" displayed.
However, when I change a and b to 1.5, it displays "success".
I have also printed the hex notations of the values. They are different in both the cases. My compiler is Visual Studio 2005
Can you explain this output ? Thanks.
float f = 1.1;
double d = 1.1;
if (f == d)
In this comparison, the value of f is promoted to type double. The problem you're seeing isn't in the comparison, but in the initialization. 1.1 can't be represented exactly as a floating-point value, so the values stored in f and d are the nearest value that can be represented. But float and double are different sizes, so have a different number of significant bits. When the value in f is promoted to double, there's no way to get back the extra bits that were lost when the value was stored, so you end up with all zeros in the extra bits. Those zero bits don't match the bits in d, so the comparison is false. And the reason the comparison succeeds with 1.5 is that 1.5 can be represented exactly as a float and as a double; it has a bunch of zeros in its low bits, so when the promotion adds zeros the result is the same as the double representation.
I found a decent explanation of the problem you are experiencing as well as some solutions.
See How dangerous is it to compare floating point values?
Just a side note, remember that some values can not be represented EXACTLY in IEEE 754 floating point representation. Your same example using a value of say 1.5 would compare as you expect because there is a perfect representation of 1.5 without any loss of data. However, 1.1 in 32-bit and 64-bit are in fact different values because the IEEE 754 standard can not perfectly represent 1.1.
See http://www.binaryconvert.com
double a = 1.1 --> 0x3FF199999999999A
Approximate representation = 1.10000000000000008881784197001
float b = 1.1 --> 0x3f8ccccd
Approximate representation = 1.10000002384185791015625
As you can see, the two values are different.
Also, unless you are working in some limited memory type environment, it's somewhat pointless to use floats. Just use doubles and save yourself the headaches.
If you are not clear on why some values can not be accurately represented, consult a tutorial on how to covert a decimal to floating point.
Here's one: http://class.ece.iastate.edu/arun/CprE281_F05/ieee754/ie5.html
I would regard code which directly performs a comparison between a float and a double without a typecast to be broken; even if the language spec says that the float will be implicitly converted, there are two different ways that the comparison might sensibly be performed, and neither is sufficiently dominant to really justify a "silent" default behavior (i.e. one which compiles without generating a warning). If one wants to perform a conversion by having both operands evaluated as double, I would suggest adding an explicit type cast to make one's intentions clear. In most cases other than tests to see whether a particular double->float conversion will be reversible without loss of precision, however, I suspect that comparison between float values is probably more appropriate.
Fundamentally, when comparing floating-point values X and Y of any sort, one should regard comparisons as indicating that X or Y is larger, or that the numbers are "indistinguishable". A comparison which shows X is larger should be taken to indicate that the number that Y is supposed to represent is probably smaller than X or close to X. A comparison that says the numbers are indistinguishable means exactly that. If one views things in such fashion, comparisons performed by casting to float may not be as "informative" as those done with double, but are less likely to yield results that are just plain wrong. By comparison, consider:
double x, y;
float f = x;
If one compares f and y, it's possible that what one is interested in is how y compares with the value of x rounded to a float, but it's more likely that what one really wants to know is whether, knowing the rounded value of x, whether one can say anything about the relationship between x and y. If x is 0.1 and y is 0.2, f will have enough information to say whether x is larger than y; if y is 0.100000001, it will not. In the latter case, if both operands are cast to double, the comparison will erroneously imply that x was larger; if they are both cast to float, the comparison will report them as indistinguishable. Note that comparison results when casting both operands to double may be erroneous not only when values are within a part per million; they may be off by hundreds of orders of magnitude, such as if x=1e40 and y=1e300. Compare f and y as float and they'll compare indistinguishable; compare them as double and the smaller value will erroneously compare larger.
The reason why the rounding error occurs with 1.1 and not with 1.5 is due to the number of bits required to accurately represent a number like 0.1 in floating point format. In fact an accurate representation is not possible.
See How To Represent 0.1 In Floating Point Arithmetic And Decimal for an example, particularly the answer by #paxdiablo.

Converting variable type (or workaround)

The class below is supposed to represent a musical note. I want to be able to store the length of the note (e.g. 1/2 note, 1/4 note, 3/8 note, etc.) using only integers. However, I also want to be able to store the length using a floating point number for the rare case that I deal with notes of irregular lengths.
class note{
string tone;
int length_numerator;
int length_denominator;
public:
set_length(int numerator, int denominator){
length_numerator=numerator;
length_denominator=denominator;
}
set_length(double d){
length_numerator=d; // unfortunately truncates everything past decimal point
length_denominator=1;
}
}
The reason it is important for me to be able to use integers rather than doubles to store the length is that in my past experience with floating point numbers, sometimes the values are unexpectedly inaccurate. For example, a number that is supposed to be 16 occasionally gets mysteriously stored as 16.0000000001 or 15.99999999999 (usually after enduring some operations) with floating point, and this could cause problems when testing for equality (because 16!=15.99999999999).
Is it possible to convert a variable from int to double (the variable, not just its value)? If not, then what else can I do to be able to store the note's length using either an integer or a double, depending on the what I need the type to be?
If your only problem is comparing floats for equality, then I'd say to use floats, but read "Comparing floating point numbers" / Bruce Dawson first. It's not long, and it explains how to compare two floating numbers correctly (by checking the absolute and relative difference).
When you have more time, you should also look at "What Every Computer Scientist Should Know About Floating Point Arithmetic" to understand why 16 occasionally gets "mysteriously" stored as 16.0000000001 or 15.99999999999.
Attempts to use integers for rational numbers (or for fixed point arithmetic) are rarely as simple as they look.
I see several possible solutions: the first is just to use double. It's
true that extended computations may result in inaccurate results, but in
this case, your divisors are normally powers of 2, which will give exact
results (at least on all of the machines I've seen); you only risk
running into problems when dividing by some unusual value (which is the
case where you'll have to use double anyway).
You could also scale the results, e.g. representing the notes as
multiples of, say 64th notes. This will mean that most values will be
small integers, which are guaranteed exact in double (again, at least
in the usual representations). A number that is supposed to be 16 does
not get stored as 16.000000001 or 15.99999999 (but a number that is
supposed to be .16 might get stored as .1600000001 or .1599999999).
Before the appearance of long long, decimal arithmetic classes often
used double as a 52 bit integral type, ensuring at each step that the
actual value was exactly an integer. (Only division might cause a problem.)
Or you could use some sort of class representing rational numbers.
(Boost has one, for example, and I'm sure there are others.) This would
allow any strange values (5th notes, anyone?) to remain exact; it could
also be advantageous for human readable output, e.g. you could test the
denominator, and then output something like "3 quarter notes", or the
like. Even something like "a 3/4 note" would be more readable to a
musician than "a .75 note".
It is not possible to convert a variable from int to double, it is possible to convert a value from int to double. I'm not completely certain which you are asking for but maybe you are looking for a union
union DoubleOrInt
{
double d;
int i;
};
DoubleOrInt length_numerator;
DoubleOrInt length_denominator;
Then you can write
set_length(int numerator, int denominator){
length_numerator.i=numerator;
length_denominator.i=denominator;
}
set_length(double d){
length_numerator.d=d;
length_denominator.d=1.0;
}
The problem with this approach is that you absolutely must keep track of whether you are currently storing ints or doubles in your unions. Bad things will happen if you store an int and then try to access it as a double. Preferrably you would do this inside your class.
This is normal behavior for floating point variables. They are always rounded and the last digits may change valued depending on the operations you do. I suggest reading on floating points somewhere (e.g. http://floating-point-gui.de/) - especially about comparing fp values.
I normally subtract them, take the absolute value and compare this against an epsilon, e.g. if (abs(x-y)
Given you have a set_length(double d), my guess is that you actually need doubles. Note that the conversion from double to a fraction of integer is fragile and complexe, and will most probably not solve your equality problems (is 0.24999999 equal to 1/4 ?). It would be better for you to either choose to always use fractions, or always doubles. Then, just learn how to use them. I must say, for music, it make sense to have fractions as it is even how notes are being described.
If it were me, I would just use an enum. To turn something into a note would be pretty simple using this system also. Here's a way you could do it:
class Note {
public:
enum Type {
// In this case, 16 represents a whole note, but it could be larger
// if demisemiquavers were used or something.
Semiquaver = 1,
Quaver = 2,
Crotchet = 4,
Minim = 8,
Semibreve = 16
};
static float GetNoteLength(const Type &note)
{ return static_cast<float>(note)/16.0f; }
static float TieNotes(const Type &note1, const Type &note2)
{ return GetNoteLength(note1)+GetNoteLength(note2); }
};
int main()
{
// Make a semiquaver
Note::Type sq = Note::Semiquaver;
// Make a quaver
Note::Type q = Note::Quaver;
// Dot it with the semiquaver from before
float dottedQuaver = Note::TieNotes(sq, q);
std::cout << "Semiquaver is equivalent to: " << Note::GetNoteLength(sq) << " beats\n";
std::cout << "Dotted quaver is equivalent to: " << dottedQuaver << " beats\n";
return 0;
}
Those 'Irregular' notes you speak of can be retrieved using TieNotes

Comparing floats in their bit representations

Say I want a function that takes two floats (x and y), and I want to compare them using not their float representation but rather their bitwise representation as a 32-bit unsigned int. That is, a number like -495.5 has bit representation 0b11000011111001011100000000000000 or 0xC3E5C000 as a float, and I have an unsigned int with the same bit representation (corresponding to a decimal value 3286614016, which I don't care about). Is there any easy way for me to perform an operation like <= on these floats using only the information contained in their respective unsigned int counterparts?
You must do a signed compare unless you ensure that all the original values were positive. You must use an integer type that is the same size as the original floating point type. Each chip may have a different internal format, so comparing values from different chips as integers is most likely to give misleading results.
Most float formats look something like this: sxxxmmmm
s is a sign bit
xxx is an exponent
mmmm is the mantissa
The value represented will then be something like: 1mmm << (xxx-k)
1mmm because there is an implied leading 1 bit unless the value is zero.
If xxx < k then it will be a right shift. k is near but not equal to half the largest value that could be expressed by xxx. It is adjusted for the size of the mantissa.
All to say that, disregarding NaN, comparing floating point values as signed integers of the same size will yield meaningful results. They are designed that way so that floating point comparisons are no more costly than integer comparisons. There are compiler optimizations to turn off NaN checks so that the comparisons are straight integer comparisons if the floating point format of the chip supports it.
As an integer, NaN is greater than infinity is greater than finite values. If you try an unsigned compare, all the negative values will be larger than the positive values, just like signed integers cast to unsigned.
If you truly truly don't care about what the conversion yields, it isn't too hard. But the results are extremely non-portable, and you almost certainly won't get an ordering that at all resembles what you'd get by comparing the floats directly.
typedef unsigned int TypeWithSameSizeAsFloat; //Fix this for your platform
bool compare1(float one, float two)
union Convert {
float f;
TypeWithSameSizeAsFloat i;
}
Convert lhs, rhs;
lhs.f = one;
rhs.f = two;
return lhs.i < rhs.i;
}
bool compare2(float one, float two) {
return reinterpret_cast<TypeWithSameSizeAsFloat&>(one)
< reinterpret_cast<TypeWithSameSizeAsFloat&>(two);
}
Just understand the caveats, and chose your second type carefully. Its a near worthless excersize at any rate.
In a word, no. IEEE 754 might allow some kinds of hacks like this, but they do not work all the time and handle all cases, and some platforms do not use that floating point standard (such as doubles on x87 having 80 bit precision internally).
If you're doing this for performance reasons I suggest you strongly reconsider -- if it's faster to use the integer comparison the compiler will probably do it for you, and if it is not, you pay for a float to int conversion multiple times, when a simple comparison may be possible without moving the floats out of registers.
Maybe I'm misreading the question, but I suppose you could do this:
bool compare(float a, float b)
{
return *((unsigned int*)&a) < *((unsigned int*)&b);
}
But this assumes all kinds of things and also warrants the question of why you'd want to compare the bitwise representations of two floats.