I saw 1/3.f in a program and wondered what the .f was for. So I tried my own program:
#include <iostream>
int main()
{
    std::cout << (float) 1/3 << std::endl;
    std::cout << 1/3.f << std::endl;
    std::cout << 1/3 << std::endl;
}
Is the .f used like a cast? Is there a place where I can read more about this interesting syntax?
3. is equivalent to 3.0; it's a double.
f following a number literal makes it a float.
Without the .f the number gets interpreted as an integer, hence 1/3 is (int)1/(int)3 => (int)0 instead of the desired (float)0.333333. The .f tells the compiler to interpret the literal as a floating point number of type float. There are other such constructs such as for example 0UL which means a (unsigned long)0, whereas a plain 0 would be an (int)0.
The .f is actually two components, the . which indicates that the literal is a floating point number rather than an integer, and the f suffix which tells the compiler the literal should be of type float rather than the default double type used for floating point literals.
Disclaimer: the "cast construct" used in the above explanation is not an actual cast, but just a way to indicate the type of the literal.
If you want to know all about literals and the suffixes you can use in them, you can read the C++ standard, (1997 draft, C++11 draft, C++14 draft, C++17 draft) or alternatively, have a look at a decent textbook, such as Stroustrup's The C++ Programming Language.
As an aside, in your example (float)1/3 the literals 1 and 3 are actually integers, but the 1 is first cast to a float by your cast, and then the 3 gets implicitly converted to a float because it is the right-hand operand of a floating point operator. (The operator is floating point because its left-hand operand is floating point.)
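If you want to see the literal types spelled out, here is a minimal sketch (C++11, using decltype and static_assert; the messages are just labels I added) that the compiler accepts only if the suffixes produce the stated types:

#include <type_traits>

// Each assertion checks the type the compiler gives the literal.
static_assert(std::is_same<decltype(3),   int>::value,           "3 is an int");
static_assert(std::is_same<decltype(3.),  double>::value,        "3. is a double");
static_assert(std::is_same<decltype(3.f), float>::value,         "3.f is a float");
static_assert(std::is_same<decltype(0UL), unsigned long>::value, "0UL is an unsigned long");

int main() {}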
By default 3.2 is treated as a double, so to force the compiler to treat it as a float you need to write f at the end.
Just see this interesting demonstration:
#include <iostream>
using namespace std;

int main()
{
    float a = 3.2;
    if ( a == 3.2 )
        cout << "a is equal to 3.2" << endl;
    else
        cout << "a is not equal to 3.2" << endl;

    float b = 3.2f;
    if ( b == 3.2f )
        cout << "b is equal to 3.2f" << endl;
    else
        cout << "b is not equal to 3.2f" << endl;
}
Output:
a is not equal to 3.2
b is equal to 3.2f
You can experiment with this at ideone: http://www.ideone.com/WS1az
Try changing the type of the variable a from float to double and see the result again!
3.f is short for 3.0f - the number 3.0 as a floating point literal of type float.
The decimal point and the f serve different purposes, so it is not really a single .f.
You have to understand that in C and C++ everything is typed, including literals.
3 is a literal integer.
3. is a literal double.
3.f is a literal float.
An IEEE float has less precision than a double: float uses only 32 bits, with a 23-bit mantissa and an 8-bit exponent (plus a sign bit).
double gives you more accuracy, but sometimes you do not need it, for example when you are doing calculations on figures that are only estimates in the first place. If you are storing large numbers of values (e.g. processing a lot of time-series data), saving space can matter more than the extra accuracy.
Thus float is still a useful type.
You should not confuse this with the notation used by printf and equivalent statements.
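If you want to check the precision figures mentioned above on your own implementation, here is a small sketch using std::numeric_limits (the numbers in the comments are what an IEEE-754 implementation typically reports, not a guarantee):

#include <iostream>
#include <limits>

int main()
{
    // digits counts significand bits including the implicit leading bit,
    // so an IEEE float with a 23-bit stored mantissa reports 24 here.
    std::cout << "float:  " << std::numeric_limits<float>::digits
              << " significand bits, ~" << std::numeric_limits<float>::digits10
              << " decimal digits\n";   // typically 24 bits, ~6 digits
    std::cout << "double: " << std::numeric_limits<double>::digits
              << " significand bits, ~" << std::numeric_limits<double>::digits10
              << " decimal digits\n";   // typically 53 bits, ~15 digits
}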
Related
There are several posts here about floating point numbers and their nature. It is clear that comparing floats and doubles must always be done cautiously. Asking for equality has also been discussed and the recommendation is clearly to stay away from it.
But what if there is a direct assignment:
double a = 5.4;
double b = a;
Assuming a is any non-NaN value, can a == b ever be false?
It seems that the answer is obviously no, yet I can't find any standard defining this behaviour in a C++ environment. IEEE-754 states that two floating point numbers with equal (non-NaN) bitset patterns are equal. Does that mean I can keep comparing my doubles this way without having to worry about maintainability? Do I have to worry about other compilers / operating systems and their implementations regarding these lines? Or maybe a compiler that optimizes some bits away and ruins their equality?
I wrote a little program that generates and compares non-NaN random doubles forever - until it finds a case where a == b yields false. Can I compile/run this code anywhere and anytime in the future without having to expect a halt? (ignoring endianness and assuming sign, exponent and mantissa bit sizes / positions stay the same).
#include <cstdint>
#include <cstring>
#include <iostream>
#include <random>

struct double_content {
    std::uint64_t mantissa : 52;
    std::uint64_t exponent : 11;
    std::uint64_t sign : 1;
};
static_assert(sizeof(double) == sizeof(double_content), "must be equal");

void set_double(double& n, std::uint64_t sign, std::uint64_t exponent, std::uint64_t mantissa) {
    double_content convert;
    std::memcpy(&convert, &n, sizeof(double));
    convert.sign = sign;
    convert.exponent = exponent;
    convert.mantissa = mantissa;
    std::memcpy(&n, &convert, sizeof(double_content));
}

void print_double(double& n) {
    double_content convert;
    std::memcpy(&convert, &n, sizeof(double));
    std::cout << "sign: " << convert.sign << ", exponent: " << convert.exponent
              << ", mantissa: " << convert.mantissa << " --- " << n << '\n';
}

int main() {
    std::random_device rd;
    std::mt19937_64 engine(rd());
    std::uniform_int_distribution<std::uint64_t> mantissa_distribution(0ull, (1ull << 52) - 1);
    std::uniform_int_distribution<std::uint64_t> exponent_distribution(0ull, (1ull << 11) - 1);
    std::uniform_int_distribution<std::uint64_t> sign_distribution(0ull, 1ull);

    double a = 0.0;
    double b = 0.0;
    bool found = false;

    while (!found) {
        auto sign = sign_distribution(engine);
        auto exponent = exponent_distribution(engine);
        auto mantissa = mantissa_distribution(engine);

        // re-assign exponent for NaN cases
        if (mantissa) {
            while (exponent == (1ull << 11) - 1) {
                exponent = exponent_distribution(engine);
            }
        }
        // force -0.0 to be 0.0
        if (mantissa == 0u && exponent == 0u) {
            sign = 0u;
        }

        set_double(a, sign, exponent, mantissa);
        b = a;
        // here could be more (unmodifying) code to delay the next comparison
        if (b != a) { // not equal!
            print_double(a);
            print_double(b);
            found = true;
        }
    }
}
I'm using Visual Studio Community 2017, Version 15.9.5.
The C++ standard clearly specifies in [basic.types]#3:
For any trivially copyable type T, if two pointers to T point to distinct T objects obj1 and obj2, where neither obj1 nor obj2 is a potentially-overlapping subobject, if the underlying bytes ([intro.memory]) making up obj1 are copied into obj2, obj2 shall subsequently hold the same value as obj1.
It gives this example:
T* t1p;
T* t2p;
// provided that t2p points to an initialized object ...
std::memcpy(t1p, t2p, sizeof(T));
// at this point, every subobject of trivially copyable type in *t1p contains
// the same value as the corresponding subobject in *t2p
The remaining question is what a value is. We find in [basic.fundamental]#12 (emphasis mine):
There are three floating-point types: float, double, and long double.
The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The value representation of floating-point types is implementation-defined.
Since the C++ standard has no further requirements on how floating point values are represented, this is all you will find as guarantee from the standard, as assignment is only required to preserve values ([expr.ass]#2):
In simple assignment (=), the object referred to by the left operand is modified by replacing its value with the result of the right operand.
As you correctly observed, IEEE-754 requires that non-NaN, non-zero floats compare equal if and only if they have the same bit pattern. So if your compiler uses IEEE-754-compliant floats, you should find that assignment of non-NaN, non-zero floating point numbers preserves bit patterns.
And indeed, your code
double a = 5.4;
double b = a;
should never allow (a == b) to return false. But as soon as you replace 5.4 with a more complicated expression, most of this nicety vanishes. It's not the exact subject of the article, but https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/ mentions several possible ways in which innocent looking code can yield different results (which breaks "identical to the bit pattern" assertions). In particular, you might be comparing an 80 bit intermediate result with a 64 bit rounded result, possibly yielding inequality.
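A quick way to check both notions of sameness for the simple assignment case, value equality and identical object bytes, is a small sketch like the following (it only demonstrates the expected outcome on an IEEE-754 implementation; it proves nothing beyond what is argued above):

#include <cstring>
#include <iostream>

int main()
{
    double a = 5.4;
    double b = a;

    // Value comparison, then byte-for-byte comparison of the object representations.
    std::cout << std::boolalpha
              << (a == b) << '\n'
              << (std::memcmp(&a, &b, sizeof(double)) == 0) << '\n';
}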
There are some complications here. First, note that the title asks a different question than the question. The title asks:
is assigning two doubles guaranteed to yield the same bitset patterns?
while the question asks:
can a == b ever be false?
The first of these asks whether different bits might occur from an assignment (which could be due to either the assignment not recording the same value as its right operand or due to the assignment using a different bit pattern that represents the same value), while the second asks whether, whatever bits are written by an assignment, the stored value must compare equal to the operand.
In full generality, the answer to the first question is no. Using IEEE-754 binary floating-point formats, there is a one-to-one map between non-zero numeric values and their encodings in bit patterns. However, this admits several cases where an assignment could produce a different bit pattern:
The right operand is the IEEE-754 −0 entity, but +0 is stored. This is not a proper IEEE-754 operation, but C++ is not required to conform to IEEE 754. Both −0 and +0 represent mathematical zero and would satisfy C++ requirements for assignment, so a C++ implementation could do this.
IEEE-754 decimal formats have one-to-many maps between numeric values and their encodings. By way of illustration, three hundred could be represented with bits whose direct meaning is 3•10^2 or bits whose direct meaning is 300•10^0. Again, since these represent the same mathematical value, it would be permissible under the C++ standard to store one in the left operand of an assignment when the right operand is the other.
IEEE-754 includes many non-numeric entities called NaNs (for Not a Number), and a C++ implementation might store a NaN different from the right operand. This could include either replacing any NaN with a “canonical” NaN for the implementation or, upon assignment of a signaling NaN, indicating the signal in some way and then converting the signaling NaN to a quiet NaN and storing that.
Non-IEEE-754 formats may have similar issues.
Regarding the latter question, can a == b be false after a = b, where both a and b have type double, the answer is no. The C++ standard does require that an assignment replace the value of the left operand with the value of the right operand. So, after a = b, a must have the value of b, and therefore they are equal.
Note that the C++ standard does not impose any restrictions on the accuracy of floating-point operations (although I see this only stated in non-normative notes). So, theoretically, one might interpret assignment or comparison of floating-point values to be floating-point operations and say that they do not need to be accurate, so the assignment could change the value or the comparison could return an inaccurate result. I do not believe this is a reasonable interpretation of the standard; the lack of restrictions on floating-point accuracy is intended to allow latitude in expression evaluation and library routines, not simple assignment or comparison.
One should note the above applies specifically to a double object that is assigned from a simple double operand. This should not lull readers into complacency. Several similar but different situations can result in failure of what might seem intuitive mathematically, such as:
After float x = 3.4;, the expression x == 3.4 will generally evaluate as false, since 3.4 is a double and has to be converted to a float for the assignment. That conversion reduces precision and alters the value.
After double x = 3.4 + 1.2;, the expression x == 3.4 + 1.2 is permitted by the C++ standard to evaluate to false. This is because the standard permits floating-point expressions to be evaluated with more precision than the nominal type requires. Thus, 3.4 + 1.2 might be evaluated with the precision of long double. When the result is assigned to x, the standard requires that the excess precision be “discarded,” so the value is converted to a double. As with the float example above, this conversion may change the value. Then the comparison x == 3.4 + 1.2 may compare a double value in x to what is essentially a long double value produced by 3.4 + 1.2.
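The first of those pitfalls is easy to reproduce; a small sketch (the results shown in the comments are what an IEEE-754 implementation typically produces):

#include <iostream>

int main()
{
    float x = 3.4;              // 3.4 is a double literal; initializing x narrows it to float
    std::cout << std::boolalpha
              << (x == 3.4)     // x widened back to double and compared to 3.4: typically false
              << '\n'
              << (x == 3.4f)    // compared against the float literal: true
              << '\n';
}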
In an attempt to verify (using VS2012) a book's claim (2nd sentence) that
When we assign an integral value to an object of floating-point type, the fractional part is zero.
Precision may be lost if the integer has more bits than the floating-point object can accommodate.
I wrote the following wee prog:
#include <iostream>
#include <iomanip>

using std::cout;
using std::setprecision;

int main()
{
    long long i = 4611686018427387905; // 2^62 + 2^0
    float f = i;
    std::streamsize prec = cout.precision();
    cout << i << " " << setprecision(20) << f << setprecision(prec) << std::endl;
    return 0;
}
The output is
4611686018427387905 4611686018427387900
I expected output of the form
4611686018427387905 4611690000000000000
How is a 4-byte float able to retain so much info about an 8-byte integer? Is there a value for i that actually demonstrates the claim?
Floats don't store their data in base 10; they store it in base 2. Thus, 4611690000000000000 isn't actually a very round number. Its binary representation is:
100000000000000000000111001111100001000001110001010000000000000.
As you can see, that would take a lot of data to precisely record. The number that's actually printed, however, has the following binary representation:
11111111111111111111111111111111111111111111111111111111111100
As you can see, that's a much rounder number, and the fact that it's off by 4 from a power of two is likely due to rounding in the convert-to-base-10 algorithm.
As an example of a number that won't fit in a float properly, try the number you expected:
4611690000000000000
You'll notice that that will come out very differently.
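One way to see what the float actually stores, without going through the decimal formatting at all, is to convert it back to an integer; a small sketch (assuming an IEEE-754 float and a 64-bit long long):

#include <iostream>

int main()
{
    long long i = 4611686018427387905LL; // 2^62 + 1
    float f = i;                         // rounds to the nearest float, which is exactly 2^62
    long long back = (long long)f;       // 2^62 fits in a long long, so this is well-defined

    std::cout << i << '\n' << back << '\n';
    // Expect 4611686018427387904 (2^62) on the second line; the trailing "...900"
    // in the question's output is an artifact of the library's float-to-decimal
    // conversion, as noted above.
}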
The float retains so much information because you're working with a number that is so close to a power of 2.
The float format stores numbers in basically binary scientific notation. In your case, it gets stored as something like
1.0000000...[61 zeroes]...00000001 * 2^62.
The float format can't store that many significand bits, so the final 1 gets cut off... but we're left with 2^62, which is almost exactly equal to the number you're trying to store.
I'm bad at manufacturing examples, but CERT isn't; you can view an example of what happens with bungled number conversions here. Note that the example is in Java, but C++ uses the same floating point types; additionally, the first example is a conversion between a 4-byte int and a 4-byte float, but this further proves your point (there's less integer information that needs to be stored than there is in your example, yet it still fails).
Assuming IEEE-754 conformance, is a float guaranteed to be preserved when transported through a double?
In other words, will the following assert always be satisfied?
int main()
{
    float f = some_random_float();
    assert(f == (float)(double)f);
}
Assume that f could acquire any of the special values defined by IEEE, such as NaN and Infinity.
According to IEEE, is there a case where the assert will be satisfied, but the exact bit-level representation is not preserved after the transportation through double?
The code snippet is valid in both C and C++.
You don't even need to assume IEEE. C89 says in 3.1.2.5:
The set of values of the type float is a subset of the set of values
of the type double
And every other C and C++ standard says equivalent things. As far as I know, NaNs and infinities are "values of the type float", albeit values with some special-case rules when used as operands.
The fact that the float -> double -> float conversion restores the original value of the float follows (in general) from the fact that numeric conversions all preserve the value if it's representable in the destination type.
Bit-level representations are a slightly different matter. Imagine that there's a value of float that has two distinct bitwise representations. Then nothing in the C standard prevents the float -> double -> float conversion from switching one to the other. In IEEE that won't happen for "actual values" unless there are padding bits, but I don't know whether IEEE rules out a single NaN having distinct bitwise representations. NaNs don't compare equal to themselves anyway, so there's also no standard way to tell whether two NaNs are "the same NaN" or "different NaNs" other than maybe converting them to strings. The issue may be moot.
One thing to watch out for is non-conforming modes of compilers, in which they keep super-precise values "under the covers", for example intermediate results left in floating-point registers and reused without rounding. I don't think that would cause your example code to fail, but as soon as you're doing floating-point == it's the kind of thing you start worrying about.
From C99:
6.3.1.5 Real floating types
1 When a float is promoted to double or long double, or a double is promoted to long double, its value is unchanged.
2 When a double is demoted to float, a long double is demoted to double or float, or a value being represented in greater precision and range than required by its semantic type (see 6.3.1.8) is explicitly converted to its semantic type, if the value being converted can be represented exactly in the new type, it is unchanged...
I think, this guarantees you that a float->double->float conversion is going to preserve the original float value.
The standard also defines the macros INFINITY and NAN in 7.12 Mathematics <math.h>:
4 The macro INFINITY expands to a constant expression of type float representing positive or unsigned infinity, if available; else to a positive constant of type float that overflows at translation time.
5 The macro NAN is defined if and only if the implementation supports quiet NaNs for the float type. It expands to a constant expression of type float representing a quiet NaN.
So, there's provision for such special values and conversions may just work for them as well (including for the minus infinity and negative zero).
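Putting those two quoted passages together, a sketch of the round-trip check for ordinary and special values could look like the following (NaN is handled separately, since NaN never compares equal to anything, including itself):

#include <cassert>
#include <cmath>
#include <limits>

int main()
{
    float values[] = {
        0.0f, -0.0f, 1.0f / 3.0f,
        std::numeric_limits<float>::min(),
        std::numeric_limits<float>::max(),
        std::numeric_limits<float>::infinity(),
        -std::numeric_limits<float>::infinity()
    };

    // Each float should survive the trip through double unchanged in value.
    for (float f : values) {
        assert(f == (float)(double)f);
    }

    // For NaN, value equality cannot hold; the most we can check portably
    // is that the result is still a NaN.
    float nan = std::numeric_limits<float>::quiet_NaN();
    assert(std::isnan((float)(double)nan));
}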
The assertion will fail in flush-to-zero and/or denormals-are-zero mode (e.g. code compiled with -mfpmath=sse, -ffast-math, etc., but also by default on plenty of compilers and architectures, such as Intel's C++ compiler) if f is denormalized.
You cannot produce a denormalized float in that mode yourself, but the scenario is still possible:
a) Denormalized float comes from external source.
b) Some libraries tamper with FPU modes but forget (or intentionally avoid) to set them back after each call, making it possible for the caller to end up with mismatched normalization modes.
A practical example, which prints the following:
f = 5.87747e-39
f2 = 5.87747e-39
f = 5.87747e-39
f2 = 0
error, f != f2!
The example works with both VC2010 and GCC 4.3, but it assumes that VC uses SSE for math by default and GCC uses the x87 FPU by default. The example may fail to illustrate the problem otherwise.
#include <limits>
#include <iostream>
#include <cmath>
#ifdef _MSC_VER
#include <xmmintrin.h>
#endif

template <class T> bool normal(T t)
{
    return (t != 0 || fabsf(t) >= std::numeric_limits<T>::min());
}

void csr_flush_to_zero()
{
#ifdef _MSC_VER
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
#else
    unsigned csr = __builtin_ia32_stmxcsr();
    csr |= (1 << 15);
    __builtin_ia32_ldmxcsr(csr);
#endif
}

void test_cast(float f)
{
    std::cout << "f = " << f << "\n";
    double d = double(f);
    float f2 = float(d);
    std::cout << "f2 = " << f2 << "\n";

    if (f != f2)
        std::cout << "error, f != f2!\n";

    std::cout << "\n";
}

int main()
{
    float f = std::numeric_limits<float>::min() / 2.0;
    test_cast(f);
    csr_flush_to_zero();
    test_cast(f);
}
I'm currently working on a C++ project which does numerical calculations. The vast, vast majority of the code uses single precision floating point values and works perfectly fine with that. Because of this I use compiler flags to make basic floating point literals single precision instead of double precision, which is the default. I find that this makes expressions easier to read and I don't have to worry about forgetting an 'f' somewhere. However, every now and then I need the extra precision offered by double precision calculations, and my question is how I can get a double precision literal into such an expression. Every way I've tried so far first stores the value in a single precision variable and then converts the truncated value to a double precision value. Not what I want.
Some ways I've tried so far is given below.
#include <iostream>

int main()
{

  std::cout << sizeof(1.0E200) << std::endl;
  std::cout << 1.0E200 << std::endl;

  std::cout << sizeof(1.0E200L) << std::endl;
  std::cout << 1.0E200L << std::endl;

  std::cout << sizeof(double(1.0E200)) << std::endl;
  std::cout << double(1.0E200) << std::endl;

  std::cout << sizeof(static_cast<double>(1.0E200)) << std::endl;
  std::cout << static_cast<double>(1.0E200) << std::endl;

  return 0;
}
A run with single precision constants gives the following results.
~/path$ g++ test.cpp -fsingle-precision-constant && ./a.out
test.cpp:6:3: warning: floating constant exceeds range of ‘float’ [-Woverflow]
test.cpp:7:3: warning: floating constant exceeds range of ‘float’ [-Woverflow]
test.cpp:12:3: warning: floating constant exceeds range of ‘float’ [-Woverflow]
test.cpp:13:3: warning: floating constant exceeds range of ‘float’ [-Woverflow]
test.cpp:15:3: warning: floating constant exceeds range of ‘float’ [-Woverflow]
test.cpp:16:3: warning: floating constant exceeds range of ‘float’ [-Woverflow]
4
inf
16
1e+200
8
inf
8
inf
It is my understanding that the 8 bytes provided by the last two cases should be enough to hold 1.0E200, a theory supported by the following output, where the same program is compiled without -fsingle-precision-constant.
~/path$ g++ test.cpp && ./a.out
8
1e+200
16
1e+200
8
1e+200
8
1e+200
A possible workaround suggested by the above examples is to use quadruple precision floating point literals everywhere I originally intended to use double precision, and cast to double precision whenever required by libraries and such. However, this feels a bit wasteful.
What else can I do?
Like Mark said, the standard says that it's a double unless it's followed by an f.
There are good reasons behind the standard and using compiler flags to get around it for convenience is bad practice.
So, the correct approach would be:
Remove the compiler flag
Fix all the warnings about loss of precision when storing double values in float variables (add in all the f suffixes)
When you need double, omit the f suffix.
It's probably not the answer you were looking for, but it is the approach you should use if you care about the longevity of your code base.
If you read 2.13.3/1 you'll see:
The type of a floating literal is double unless explicitly specified
by a suffix. The suffixes f and F specify float, the suffixes l and L
specify long double.
In other words, there is no suffix to specify double for a floating point literal, so once you change the default to float there is no way to mark a literal as double. Unfortunately you can't have the best of both worlds in this case.
If you can afford GCC 4.7 or Clang 3.1, use a user-defined literal:
double operator "" _d(long double v) { return v; }
Usage:
std::cout << sizeof(1.0E200_d) << std::endl;
std::cout << 1.0E200_d << std::endl;
Result:
8
1e+200
Short of a user-defined literal, you can't define your own suffix, but maybe a macro like
#define D(x) (double(x##L))
would work for you. The compiler ought to just emit a double constant, and appears to with -O2 on my system.
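For completeness, a usage sketch alongside the earlier test program (the output in the comments is what one would expect even with -fsingle-precision-constant, since the L-suffixed constant is unaffected by the flag):

#include <iostream>

#define D(x) (double(x##L))  // paste an L suffix, then narrow the long double constant to double

int main()
{
    std::cout << sizeof(D(1.0E200)) << std::endl;  // 8
    std::cout << D(1.0E200) << std::endl;          // 1e+200
}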
I define a float variable a and then convert a to float & and to int &. What does this mean? After the conversion, is a a reference to itself? And why are the two results different?
#include <iostream>
using namespace std;
int
main(void)
{
    float a = 1.0;
    cout << (float &)a << endl;
    cout << (int &)a << endl;
    return 0;
}
thinkpad ~ # ./a.out
1
1065353216
cout << (float &)a <<endl;
cout << (int &)a << endl;
The first one treats the bits in a like it's a float. The second one treats the bits in a like it's an int. The bits for float 1.0 just happen to be the bits for integer 1065353216.
It's basically the equivalent of:
float a = 1.0;
int* b = (int*) &a;
cout << a << endl;
cout << *b << endl;
(int &) a casts a to a reference to an integer. In other words, an integer reference to a. (Which, as I said, treats the contents of a as an integer.)
Edit: I'm looking around now to see if this is valid. I suspect that it's not. It depends on the target type being no larger than the actual size.
It means undefined behavior:-).
Seriously, it is a form of type punning. a is a float, but a is also a block of memory (typically four bytes) with bits in it. (float&)a means to treat that block of memory as if it were a float (in other words, what it actually is); (int&)a means to treat it as an int. Formally, accessing an object (such as a) through an lvalue expression with a type other than the actual type of the object is undefined behavior, unless the type is a character type. Practically, if the two types have the same size, I would expect the results to be a reinterpretation of the bit pattern.
In the case of a float, the bit pattern contains bits for the sign, an exponent and a mantissa. Typically, the exponent will use some excess-n notation, and only 0.0 will have 0 as an exponent. (Some representations, including the one used on PCs, will not store the high order bit of the mantissa, since in a normalized form in base 2, it must always be 1. In such cases, the stored mantissa for 1.0 will have all bits 0.) Also typically (and I don't know of any exceptions here), the exponent will be stored in the high order bits. The result is that when you "type pun" a floating point value to an integer of the same size, the value will be fairly large, regardless of the floating point value.
The values are different because interpreting a float as an int & (reference to int) throws the doors wide open. a is not an int, so pretty much anything could actually happen when you do that. As it happens, looking at that float like it's an int gives you 1065353216, but depending on the underlying machine architecture it could be 42 or an elephant in a pink tutu or even crash.
Note that this is not the same as casting to an int, which understands how to convert from float to int. Casting to int & just looks at bits in memory without understanding what the original meaning is.
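If you want the bit pattern without the undefined behaviour, the usual well-defined route is to copy the bytes with memcpy (or std::bit_cast in C++20); a minimal sketch:

#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    float a = 1.0f;

    // Copy the object representation into an integer of the same size;
    // unlike (int &)a, this does not access a through the wrong type.
    std::uint32_t bits;
    static_assert(sizeof(a) == sizeof(bits), "float must be 32 bits here");
    std::memcpy(&bits, &a, sizeof bits);

    std::cout << bits << '\n';              // 1065353216 on an IEEE-754 system
    std::cout << std::hex << bits << '\n';  // 3f800000: sign 0, exponent 127, mantissa 0
}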