Implicit conversion from long long to float yields unexpected result - c++

In an attempt to verify (using VS2012) a book's claim (2nd sentence) that
When we assign an integral value to an object of floating-point type, the fractional part is zero.
Precision may be lost if the integer has more bits than the floating-point object can accommodate.
I wrote the following wee prog:
#include <iostream>
#include <iomanip>
using std::cout;
using std::setprecision;
int main()
{
long long i = 4611686018427387905; // 2^62 + 2^0
float f = i;
std::streamsize prec = cout.precision();
cout << i << " " << setprecision(20) << f << setprecision(prec) << std::endl;
return 0;
}
The output is
4611686018427387905 4611686018427387900
I expected output of the form
4611686018427387905 4611690000000000000
How is a 4-byte float able to retain so much info about an 8-byte integer? Is there a value for i that actually demonstrates the claim?

Floats don't store their data in base 10, they store it in base 2. Thus, 4611690000000000000 isn't actually a very round number. Its binary representation is:
100000000000000000000111001111100001000001110001010000000000000.
As you can see, that would take a lot of data to precisely record. The number that's actually printed, however, has the following binary representation:
11111111111111111111111111111111111111111111111111111111111100
As you can see, that's a much rounder number. The float itself holds exactly 2^62 = 4611686018427387904; the fact that the printed value is off by 4 from that power of two is likely due to rounding in the convert-to-base-10 algorithm rather than to the float.
As an example of a number that won't fit in a float properly, try the number you expected:
4611690000000000000
You'll notice that that will come out very differently.
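For a smaller value that shows the loss directly, here is a minimal sketch. It assumes float is an IEEE 754 single-precision type (24 significant bits) with the default round-to-nearest mode; the helper name through_float is made up for illustration.

```cpp
#include <cstdint>

// Convert an integer to float and back again.
// Assumption: float is IEEE 754 binary32 with a 24-bit significand.
// through_float is a hypothetical helper name.
std::int64_t through_float(std::int64_t v) {
    float f = static_cast<float>(v);       // may round: only 24 significant bits survive
    return static_cast<std::int64_t>(f);   // exact as long as f is within the int64_t range
}
```

Under those assumptions, through_float(16777216) returns 16777216 unchanged, while through_float(16777217), i.e. 2^24 + 1, also returns 16777216: the lowest bit has nowhere to go.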

The float retains so much information because you're working with a number that is so close to a power of 2.
The float format stores numbers in basically binary scientific notation. In your case, it gets stored as something like
1.0000000...[61 zeroes]...00000001 * 2^62.
A float has only 24 significant bits, so it can't store 62 binary places and the final 1 gets cut off... but we're left with 2^62, which is almost exactly equal to the number you're trying to store.
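To see how coarse a float is at this magnitude, one can measure the gap to the next representable value. This is only a sketch, assuming an IEEE 754 binary32 float; gap_above is a hypothetical helper name.

```cpp
#include <cmath>
#include <limits>

// Distance from f to the next representable float above it.
// gap_above is a hypothetical helper name.
double gap_above(float f) {
    float next = std::nextafter(f, std::numeric_limits<float>::infinity());
    // Both values are exactly representable as double, so the subtraction is exact.
    return static_cast<double>(next) - static_cast<double>(f);
}
```

With that assumption, gap_above(std::ldexp(1.0f, 62)) is 2^39 (549755813888), so any integer within about half that distance of 2^62 collapses to the same float.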
I'm bad at manufacturing examples, but CERT isn't; you can view an example of what happens with bungled number conversions here. Note that the example is in Java, but C++ uses the same floating point types; additionally, the first example is a conversion between a 4-byte int and a 4-byte float, but this further proves your point (there's less integer information that needs to be stored than there is in your example, yet it still fails).

Related

Are doubles able to represent every int64_t value? [duplicate]

My question is whether all integer values are guaranteed to have a perfect double representation.
Consider the following code sample that prints "Same":
// Example program
#include <iostream>
#include <string>
int main()
{
int a = 3;
int b = 4;
double d_a(a);
double d_b(b);
double int_sum = a + b;
double d_sum = d_a + d_b;
if (double(int_sum) == d_sum)
{
std::cout << "Same" << std::endl;
}
}
Is this guaranteed to be true for any architecture, any compiler, any values of a and b? Will any integer i converted to double, always be represented as i.0000000000000 and not, for example, as i.000000000001?
I tried it for some other numbers and it always was true, but was unable to find anything about whether this is coincidence or by design.
Note: This is different from this question (aside from the language) since I am adding the two integers.
Disclaimer (as suggested by Toby Speight): Although IEEE 754 representations are quite common, an implementation is permitted to use any other representation that satisfies the requirements of the language.
The doubles are represented in the form mantissa * 2^exponent, i.e. some of the bits are used for the non-integer part of the double number.
type         bits  range                   precision
float        32    1.5E-45 .. 3.4E38       7-8 digits
double       64    5.0E-324 .. 1.7E308     15-16 digits
long double  80    1.9E-4951 .. 1.1E4932   19-20 digits
The part in the fraction can also be used to represent an integer by using an exponent which removes all the digits after the dot.
E.g. 2.9979 · 10^4 = 29979.
Since a common int is usually 32 bit you can represent all ints as double, but for 64 bit integers of course this is no longer true. To be more precise (as LThode noted in a comment): IEEE 754 double-precision can guarantee this for up to 53 bits (52 bits of significand + the implicit leading 1 bit).
Answer: yes for 32 bit ints, no for 64 bit ints.
(This is correct for server/desktop general-purpose CPU environments, but other architectures may behave differently.)
Practical Answer as Malcom McLean puts it: 64 bit doubles are an adequate integer type for almost all integers that are likely to count things in real life.
For the empirically inclined, try this:
#include <iostream>
#include <limits>
using namespace std;
int main() {
double test;
volatile int test_int;
for(int i=0; i< std::numeric_limits<int>::max(); i++) {
test = i;
test_int = test;
// compare int with int:
if (test_int != i)
std::cout<<"found integer i="<<i<<", test="<<test<<std::endl;
}
return 0;
}
Success time: 0.85 memory: 15240 signal:0
Subquestion:
Regarding the question about fractional differences: is it possible to have an integer which converts to a double that is just off the correct value by a fraction, but which converts back to the same integer due to rounding?
The answer is no, because any integer which converts back and forth to the same value actually represents the same integer value in double. For me the simplest explanation (suggested by ilkkachu) for this is that, because of the 2^exponent factor, the step width must always be a power of two. Therefore, beyond the largest 53-bit integer (52 stored bits plus the implicit leading bit), there are never two double values with a distance smaller than 2, which solves the rounding issue.
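The step-width argument can be checked on a given machine. The following is a sketch that assumes IEEE 754 binary64 doubles; step_above and through_double are hypothetical helper names.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Spacing between a double and its upper neighbour.
// step_above is a hypothetical helper name.
double step_above(double d) {
    return std::nextafter(d, std::numeric_limits<double>::infinity()) - d;
}

// Convert to double and back. Only meant for values well inside the
// int64_t range, so the conversion back cannot overflow.
std::int64_t through_double(std::int64_t v) {
    return static_cast<std::int64_t>(static_cast<double>(v));
}
```

Assuming binary64, step_above(4503599627370496.0) (2^52) is 1 and step_above(9007199254740992.0) (2^53) is 2, so through_double(9007199254740993) comes back as 9007199254740992: a whole integer is lost, never a fraction of one.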
No. Suppose you have a 64-bit integer type and a 64-bit floating-point type (which is typical for a double). There are 2^64 possible values for that integer type and there are 2^64 possible values for that floating-point type. But some of those floating-point values (in fact, most of them) do not represent integer values, so the floating-point type can represent fewer integer values than the integer type can.
The answer is no. This only works if ints are 32 bit, which, while true on most platforms, isn't guaranteed by the standard.
The two integers can share the same double representation.
For example, this
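#include <cstdint>   // std::int64_t
#include <iostream>
int main() {
    std::int64_t n = 2397083434877565865;
    // Both values round to the same double, so the comparison is true.
    if (static_cast<double>(n) == static_cast<double>(n - 1)) {
        std::cout << "n and (n-1) share the same double representation\n";
    }
}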
will print
n and (n-1) share the same double representation
I.e. both 2397083434877565865 and 2397083434877565864 will convert to the same double.
Note that I used int64_t here to guarantee 64-bit integers, which - depending on your platform - might also be what int is.
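If you want to test a particular value rather than hunt for collisions, a bit-level check is possible. This is a sketch only, assuming an IEEE 754 binary64 double (53-bit significand); exactly_representable is a hypothetical name.

```cpp
#include <cstdint>

// True if v converts to double without rounding.
// Assumption: double is IEEE 754 binary64 (53 significant bits).
// exactly_representable is a hypothetical helper name.
bool exactly_representable(std::int64_t v) {
    // Magnitude computed in unsigned arithmetic so INT64_MIN does not overflow.
    std::uint64_t mag = v < 0 ? 0 - static_cast<std::uint64_t>(v)
                              : static_cast<std::uint64_t>(v);
    if (mag == 0) return true;
    // Trailing zero bits are absorbed by the exponent, so strip them.
    while ((mag & 1u) == 0) mag >>= 1;
    // What remains must fit in the 53-bit significand.
    return mag < (std::uint64_t{1} << 53);
}
```

The check avoids the round trip through double on purpose: converting an out-of-range double back to int64_t is undefined behaviour, which a value near INT64_MAX would trigger.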
You have 2 different questions:
Are all integer values perfectly represented as doubles?
That was already answered by other people (TL;DR: it depends on the precision of int and double).
Consider the following code sample that prints "Same": [...] Is this guaranteed to be true for any architecture, any compiler, any values of a and b?
Your code adds two ints and then converts the result to double. The sum of ints will overflow for certain values, but the sum of the two separately-converted doubles will not (typically). For those values the results will differ.
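A sketch of that difference, written so that the overflowing int addition is never actually executed (signed overflow is undefined behaviour). The helper names are hypothetical.

```cpp
#include <limits>

// Sum computed in double: each int is converted first, then added.
double sum_as_double(int a, int b) {
    return static_cast<double>(a) + static_cast<double>(b);
}

// True if a + b stays within the range of int. The test is done without
// performing the addition itself.
bool int_sum_is_safe(int a, int b) {
    const int hi = std::numeric_limits<int>::max();
    const int lo = std::numeric_limits<int>::min();
    return b >= 0 ? a <= hi - b : a >= lo - b;
}
```

For a = std::numeric_limits<int>::max() and b = 1, int_sum_is_safe reports false, while sum_as_double still returns a value one above the largest int (assuming double has more significant bits than int), so the two sums in the question cannot be expected to agree.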
The short answer is "possibly". The portable answer is "not everywhere".
It really depends on your platform, and in particular, on
the size and representation of double
the range of int
For platforms using IEEE-754 doubles, it may be true if int is 53-bit or smaller. For platforms where int is larger than double, it's obviously false.
You may be able to investigate the properties on your runtime host, using std::numeric_limits and std::nextafter.
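As a rough sketch of the std::numeric_limits route: compare the number of value bits of the integer type with the number of significand bits of double. The function name is hypothetical, and the check only makes sense for a binary floating-point format.

```cpp
#include <cstdint>
#include <limits>

// True when every value of integer type Int converts to double without
// rounding. all_values_fit_in_double is a hypothetical name.
template <typename Int>
constexpr bool all_values_fit_in_double() {
    return std::numeric_limits<double>::radix == 2 &&
           // digits: value bits of Int (sign bit excluded) vs. significand bits of double
           std::numeric_limits<Int>::digits <= std::numeric_limits<double>::digits;
}
```

On a platform with IEEE 754 doubles this would report true for std::int32_t (31 value bits against 53) and false for std::int64_t (63 against 53).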

I was trying to make an output/input calculator with C++

I am teaching myself C++,
but I am having trouble making a simple input/output calculator.
Here is what I have come up with:
#include <iostream>
using namespace std;
int main()
{
cout << "THIS IS TEST CALCULATOR!" << endl;
int fn,sn,dvi; //fn- FIRST NUMBER, sn - SECOND NUMBER
cout << "Please type in the first number: " << endl;
cin >> fn;
cout << "Please type in the second number: " << endl;
cin >> sn;
cout << "Please choose type ( type in one: division, multiplication, subtraction, addition or modulus): "<< endl; //Asks for the type on calculation; needs to type in one.
std::string tp;
cin>>tp;
dvi = (fn/sn); // division
if (tp == "division"){
cout<<fn<<" divide by "<<sn<<" is "<<dvi<< endl;
}
}
result:
8/6 = 1.333333... not 1
You are performing integer division, and therefore 8/6 is 1. If you want floating point arithmetic, you could change fn,sn and dvi to float instead of int, or you could simply cast one or both of the arguments to a float (or double).
See this question for more in-depth answers.
Since the declared variable type is int, the fractional part of the quotient is always discarded.
All three should be of type float or double. Otherwise, dvi alone can be of that type, with fn and sn cast to the same type (float/double) during the division.
When dividing integers you get integer values not decimal ones. To solve your problem you either declare them float or double or cast the operation.
double dvi = ((double)fn / (double)sn); // division
Instead of declaring fn,sn,dvi as ints, you should declare them as doubles:
double fn,sn,dvi;
Otherwise, dividing integers by integers in C++ will truncate the result to a whole number, discarding the fractional part. For example: 5/3 would equal 1, even though in reality it equals 1.6667. C++ will not round up to 2; it truncates toward zero and gives 1.
Another workaround for this issue if you're not using variables is to add the decimal to whole numbers so that the compiler recognizes them as doubles. For example: 5.0/2.0 = 2.5
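A small sketch of the rule, with hypothetical helper names. Truncation toward zero for integer division is guaranteed since C++11, which matters once negative operands are involved.

```cpp
// Integer division discards the fractional part (truncation toward zero).
int quotient(int a, int b) { return a / b; }

// The remainder operator returns what the division discarded.
int remainder_of(int a, int b) { return a % b; }

// Same operands, but the division is carried out in double.
double real_quotient(int a, int b) { return static_cast<double>(a) / b; }
```

For example, quotient(5, 3) is 1 and quotient(-5, 3) is -1 (not -2), while real_quotient(5, 2) is 2.5.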
The problem in your code is that you are dividing 2 variables that are of type int, and storing it in a variable that is also of type int. All three variables:
fn
sn
dvi
are of type int in your code. An int variable only stores whole numbers within a fixed range (typically about -2.1 billion to 2.1 billion for a 32-bit int). This means that when you divide 2 numbers that do not produce an integer, the fractional part is discarded so that the result can be stored in the int variable, which in your case is dvi. In addition, even if dvi alone is of data type float, your division will still produce a whole number as you have divided two variables of data type int, namely fn and sn, and in C++, when two integers are divided, the result is always truncated regardless of the data type of the variable it will be stored in.
The easiest way to get around the problem is to change the declaration of these variables to:
double fn, sn, dvi;
NOTE: I use double here, as it can store a more precise value since it has 8 bytes allocated to it in contrast to 4 for a float.
This is the easiest way to get around your problem. Alternatively, we can use casting in the mathematical step of your code; in that case we would declare the variables as:
int fn, sn;
double dvi;
Then in the step where you divide the numbers, you can do the following:
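dvi = (double)fn / sn;
This is called casting; the compiler here is "forced" to treat fn as a double, so the division is carried out in floating point. Note that the cast must apply to an operand and not to the finished quotient: (double)(fn/sn) would still divide the two ints first and only then convert the already-truncated result. In your case it will work, as you are dividing numbers; however, as a side note, it cannot be done in all cases (you cannot cast a string to a float, for example). Here is a link if you want to learn more about casting: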
Typecasting in C and C++
A third way, simpler than casting would be to do the following:
dvi = (fn*1.0)/sn;
This can be done only if dvi is of type float or double. Here, the mathematical operation involves a floating-point value before the division takes place, so the operands are not all of type int. As a result, the fractional part is preserved; in your case, dvi will store 1.33333333333... instead of 1.
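Where the cast (or the multiplication by 1.0) is placed decides whether the division happens in int or in double. A sketch with hypothetical function names, using the question's fn and sn:

```cpp
// Cast applied after the division: fn / sn is evaluated in int first.
double cast_result(int fn, int sn) { return (double)(fn / sn); }

// Cast applied to an operand: the division itself is done in double.
double cast_operand(int fn, int sn) { return (double)fn / sn; }

// Same idea with a multiplication by 1.0, after and before dividing.
double scale_result(int fn, int sn) { return (fn / sn) * 1.0; }
double scale_operand(int fn, int sn) { return (fn * 1.0) / sn; }
```

With fn = 8 and sn = 6, cast_result and scale_result both give 1, while cast_operand and scale_operand give 1.3333...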
Sorry for the late reply, I couldn't get time to answer your question yesterday.

Converting variable type (or workaround)

The class below is supposed to represent a musical note. I want to be able to store the length of the note (e.g. 1/2 note, 1/4 note, 3/8 note, etc.) using only integers. However, I also want to be able to store the length using a floating point number for the rare case that I deal with notes of irregular lengths.
class note{
    string tone;
    int length_numerator;
    int length_denominator;
public:
    void set_length(int numerator, int denominator){
        length_numerator=numerator;
        length_denominator=denominator;
    }
    void set_length(double d){
        length_numerator=d; // unfortunately truncates everything past decimal point
        length_denominator=1;
    }
};
The reason it is important for me to be able to use integers rather than doubles to store the length is that in my past experience with floating point numbers, sometimes the values are unexpectedly inaccurate. For example, a number that is supposed to be 16 occasionally gets mysteriously stored as 16.0000000001 or 15.99999999999 (usually after enduring some operations) with floating point, and this could cause problems when testing for equality (because 16!=15.99999999999).
Is it possible to convert a variable from int to double (the variable, not just its value)? If not, then what else can I do to be able to store the note's length using either an integer or a double, depending on the what I need the type to be?
If your only problem is comparing floats for equality, then I'd say to use floats, but read "Comparing floating point numbers" / Bruce Dawson first. It's not long, and it explains how to compare two floating numbers correctly (by checking the absolute and relative difference).
When you have more time, you should also look at "What Every Computer Scientist Should Know About Floating Point Arithmetic" to understand why 16 occasionally gets "mysteriously" stored as 16.0000000001 or 15.99999999999.
Attempts to use integers for rational numbers (or for fixed point arithmetic) are rarely as simple as they look.
I see several possible solutions: the first is just to use double. It's
true that extended computations may result in inaccurate results, but in
this case, your divisors are normally powers of 2, which will give exact
results (at least on all of the machines I've seen); you only risk
running into problems when dividing by some unusual value (which is the
case where you'll have to use double anyway).
You could also scale the results, e.g. representing the notes as
multiples of, say 64th notes. This will mean that most values will be
small integers, which are guaranteed exact in double (again, at least
in the usual representations). A number that is supposed to be 16 does
not get stored as 16.000000001 or 15.99999999 (but a number that is
supposed to be .16 might get stored as .1600000001 or .1599999999).
Before the appearance of long long, decimal arithmetic classes often
used double as a 52 bit integral type, ensuring at each step that the
actual value was exactly an integer. (Only division might cause a problem.)
Or you could use some sort of class representing rational numbers.
(Boost has one, for example, and I'm sure there are others.) This would
allow any strange values (5th notes, anyone?) to remain exact; it could
also be advantageous for human readable output, e.g. you could test the
denominator, and then output something like "3 quarter notes", or the
like. Even something like "a 3/4 note" would be more readable to a
musician than "a .75 note".
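To give an idea of the rational-number option, here is a bare-bones sketch. Fraction and its members are hypothetical names, not the Boost interface; it assumes a non-zero denominator and values small enough that the products do not overflow.

```cpp
// Minimal rational number kept in lowest terms.
// Hypothetical type; assumes den != 0 and no overflow checks.
struct Fraction {
    long num;
    long den;

    Fraction(long n, long d) : num(n), den(d) {
        if (den < 0) { num = -num; den = -den; }   // keep the sign in the numerator
        long g = gcd(num < 0 ? -num : num, den);
        if (g != 0) { num /= g; den /= g; }        // reduce to lowest terms
    }

    static long gcd(long a, long b) {              // Euclid's algorithm
        while (b != 0) { long t = a % b; a = b; b = t; }
        return a;
    }
};

// Addition over a common denominator; the constructor reduces the result.
inline Fraction operator+(Fraction a, Fraction b) {
    return Fraction(a.num * b.den + b.num * a.den, a.den * b.den);
}

// Exact comparison is safe because both sides are in lowest terms.
inline bool operator==(Fraction a, Fraction b) {
    return a.num == b.num && a.den == b.den;
}
```

With this, Fraction(1, 4) + Fraction(1, 8) compares equal to Fraction(3, 8) exactly, with no epsilon involved, and the denominator is available for output such as "a 3/8 note".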
It is not possible to convert a variable from int to double, it is possible to convert a value from int to double. I'm not completely certain which you are asking for but maybe you are looking for a union
union DoubleOrInt
{
double d;
int i;
};
DoubleOrInt length_numerator;
DoubleOrInt length_denominator;
Then you can write
void set_length(int numerator, int denominator){
    length_numerator.i=numerator;
    length_denominator.i=denominator;
}
void set_length(double d){
    length_numerator.d=d;
    length_denominator.d=1.0;
}
The problem with this approach is that you absolutely must keep track of whether you are currently storing ints or doubles in your unions. Bad things will happen if you store an int and then try to access it as a double. Preferably you would do this inside your class.
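One way to do that bookkeeping inside the class is to store a tag next to the union. This is only a sketch under my own naming (NoteLength, from_fraction, from_double and as_double are all hypothetical), not a drop-in replacement for the class in the question.

```cpp
// A union plus a flag recording which member is currently in use.
// All names here are hypothetical.
class NoteLength {
public:
    static NoteLength from_fraction(int numerator, int denominator) {
        NoteLength n;
        n.is_fraction_ = true;
        n.value_.frac.num = numerator;
        n.value_.frac.den = denominator;
        return n;
    }
    static NoteLength from_double(double d) {
        NoteLength n;
        n.is_fraction_ = false;
        n.value_.d = d;
        return n;
    }
    bool is_fraction() const { return is_fraction_; }
    // Read the length as a double, whichever form is stored.
    double as_double() const {
        return is_fraction_
            ? static_cast<double>(value_.frac.num) / value_.frac.den
            : value_.d;
    }
private:
    NoteLength() : is_fraction_(false) { value_.d = 0.0; }
    struct Frac { int num; int den; };
    union Value { double d; Frac frac; };
    bool is_fraction_;   // tells which union member is valid
    Value value_;
};
```

Because the flag is private and only set by the two factory functions, code outside the class cannot read the wrong member by accident.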
This is normal behavior for floating point variables. Results are rounded to the nearest representable value, and the last digits may change value depending on the operations you do. I suggest reading up on floating point somewhere (e.g. http://floating-point-gui.de/) - especially about comparing fp values.
I normally subtract them, take the absolute value and compare this against an epsilon, e.g. if (abs(x-y) < epsilon)
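A sketch of such a comparison, combining an absolute and a relative tolerance as the linked articles suggest. nearly_equal is a hypothetical name and the default tolerances are placeholders, not recommendations; suitable values depend on the computation.

```cpp
#include <algorithm>
#include <cmath>

// True when a and b differ by at most abs_eps, or by at most rel_eps
// relative to the larger magnitude. Hypothetical helper; the default
// tolerances are placeholders only.
bool nearly_equal(double a, double b,
                  double abs_eps = 1e-12, double rel_eps = 1e-9) {
    double diff = std::fabs(a - b);
    if (diff <= abs_eps) return true;                        // handles values near zero
    double largest = std::max(std::fabs(a), std::fabs(b));
    return diff <= largest * rel_eps;                        // scales with magnitude
}
```

With these placeholder tolerances, nearly_equal(16.0, 15.99999999999) is true, whereas nearly_equal(1.0, 1.1) is false.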
Given you have a set_length(double d), my guess is that you actually need doubles. Note that the conversion from a double to a fraction of integers is fragile and complex, and will most probably not solve your equality problems (is 0.24999999 equal to 1/4?). It would be better for you to choose either to always use fractions or to always use doubles. Then, just learn how to use them. I must say, for music, it makes sense to have fractions, as that is how notes are described anyway.
If it were me, I would just use an enum. To turn something into a note would be pretty simple using this system also. Here's a way you could do it:
#include <iostream>
class Note {
public:
enum Type {
// In this case, 16 represents a whole note, but it could be larger
// if demisemiquavers were used or something.
Semiquaver = 1,
Quaver = 2,
Crotchet = 4,
Minim = 8,
Semibreve = 16
};
static float GetNoteLength(const Type &note)
{ return static_cast<float>(note)/16.0f; }
static float TieNotes(const Type &note1, const Type &note2)
{ return GetNoteLength(note1)+GetNoteLength(note2); }
};
int main()
{
// Make a semiquaver
Note::Type sq = Note::Semiquaver;
// Make a quaver
Note::Type q = Note::Quaver;
// Dot it with the semiquaver from before
float dottedQuaver = Note::TieNotes(sq, q);
std::cout << "Semiquaver is equivalent to: " << Note::GetNoteLength(sq) << " beats\n";
std::cout << "Dotted quaver is equivalent to: " << dottedQuaver << " beats\n";
return 0;
}
Those 'Irregular' notes you speak of can be retrieved using TieNotes

Rounding to use for int -> float -> int round trip conversion

I'm writing a set of numeric type conversion functions for a database engine, and I'm concerned about the behavior of converting large integral floating-point values to integer types with greater precision.
Take for example converting a 32-bit int to a 32-bit single-precision float. The 24-bit significand of the float (23 stored bits plus an implicit leading bit) yields about 7 decimal digits of precision, so converting any int with more than about 7 digits will result in a loss of precision (which is fine and expected). However, when you convert such a float back to an int, you end up with artifacts of its binary representation in the low-order digits:
#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
int a = 2147483000;
cout << a << endl;
float f = (float)a;
cout << setprecision(10) << f << endl;
int b = (int)f;
cout << b << endl;
return 0;
}
This prints:
2147483000
2147483008
2147483008
The trailing 008 is beyond the precision of the float, and therefore seems undesirable to retain in the int, since in a database application, users are primarily concerned with decimal representation, and trailing 0's are used to indicate insignificant digits.
So my questions are: Are there any well-known existing systems that perform decimal significant digit rounding in float -> int (or double -> long long) conversions, and are there any well-known, efficient algorithms for doing so?
(Note: I'm aware that some systems have decimal floating-point types, such as those defined by IEEE 754-2008. However, they don't have mainstream hardware support and aren't built into C/C++. I might want to support them down the road, but I still need to handle binary floats intuitively.)
std::numeric_limits<float>::digits10 says you only get 6 precise digits for float.
Pick an efficient algorithm for your language, processor, and data distribution to calculate-the-decimal-length-of-an-integer (or here). Then subtract the number of digits that digits10 says are precise to get the number of digits to cull. Use that as an index to lookup a power of 10 to use as a modulus. Etc.
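Roughly, the steps above could look like the sketch below. The function names are hypothetical, a simple loop stands in for the power-of-ten lookup table, and rounding halves away from zero is my choice, not something the question requires.

```cpp
// Number of decimal digits in a non-negative value (0 counts as one digit).
int decimal_length(unsigned long long v) {
    int n = 1;
    while (v >= 10) { v /= 10; ++n; }
    return n;
}

// Round v to `sig` significant decimal digits, halves away from zero.
// Hypothetical helper; assumes sig >= 1 and that the rounded value still
// fits in long long.
long long round_to_significant(long long v, int sig) {
    unsigned long long mag = v < 0 ? 0ULL - static_cast<unsigned long long>(v)
                                   : static_cast<unsigned long long>(v);
    int cull = decimal_length(mag) - sig;        // number of digits to zero out
    if (cull <= 0) return v;                     // already short enough
    unsigned long long p = 1;
    for (int i = 0; i < cull; ++i) p *= 10;      // power of ten used as the modulus
    unsigned long long rounded = (mag + p / 2) / p * p;
    return v < 0 ? -static_cast<long long>(rounded)
                 : static_cast<long long>(rounded);
}
```

For the value in the question, round_to_significant(2147483008, 7) gives 2147483000, and with the 6 digits reported by digits10 it gives 2147480000. Doing the rounding in long long rather than int sidesteps the case where rounding up would exceed the int range.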
One concern: Let's say you convert a float to a decimal and perform this sort of rounding or truncation. Then convert that "adjusted" decimal to a float and back to a decimal with the same rounding/truncation scheme. Do you get the same decimal value? Hopefully yes.
This isn't really what you're looking for but may be interesting reading: A Proposal to add a max significant decimal digits value to the C++ Standard Library Numeric limits
Naturally, 2147483008 has trailing zeros if you write it in binary (1111111111111111111110110000000) or hexadecimal (0x7FFFFD80). The most "correct" thing to do would be to track insignificant digits in any of those forms instead.
Alternatively, you could just zero all digits after the first seven significant ones in the int (ideally by rounding) after converting to it from a float, since the float contains approximately seven significant digits.