I have recently analyzed an old piece of code compiled with VS2005 because of a different numerical behaviour in "debug" (no optimizations) and "release" (/O2 /Oi /Ot options) compilations. The (reduced) code looks like:
#include <math.h>
#include <stdio.h>

void f(double x1, double y1, double x2, double y2)
{
    double a1, a2, d;
    a1 = atan2(y1, x1);
    a2 = atan2(y2, x2);
    d = a1 - a2;
    if (d == 0.0) { // NOTE: I know that == on reals is "evil"!
        printf("EQUAL!\n");
    }
}
The function f is expected to print "EQUAL!" when invoked with identical pairs of values (e.g. f(1,2,1,2)), but this doesn't always happen in "release". The compiler optimized the code as if it were d = a1 - atan2(y2,x2) and completely removed the assignment to the intermediate variable a2. It also took advantage of the fact that the second atan2() result was already on the FPU stack, so it reloaded a1 onto the FPU and subtracted the values. The problem is that the FPU works at extended precision (80 bits) while a1 was "only" a double (64 bits), so saving the first atan2() result to memory lost precision. As a result, d contains the "conversion error" between extended and double precision.
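A minimal sketch of one workaround for the behaviour described above: route each intermediate through a 64-bit store so that neither operand of the subtraction keeps extended precision. The helper name same_angle is hypothetical, and the sketch assumes that a volatile double store makes the compiler write and reload the value at double precision. It illustrates the rounding step and is not something the standard requires you to write.

```cpp
#include <cmath>

// Hypothetical helper, not from the original code: returns true when both
// points yield the same atan2 result after rounding to double.
bool same_angle(double x1, double y1, double x2, double y2)
{
    // volatile forces each result to be stored to memory as a 64-bit double,
    // so neither operand of the subtraction keeps x87 excess precision.
    volatile double a1 = std::atan2(y1, x1);
    volatile double a2 = std::atan2(y2, x2);
    return (a1 - a2) == 0.0;
}
```

With this version, same_angle(1, 2, 1, 2) should report equality in both debug and release builds, at the cost of two extra memory round trips.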
I know perfectly that identity (== operator) with float/double should be avoided. My question is not about how to check proximity between doubles. My question is about how "contractual" an assignment to a local variable should be considered. By my "naive" point of view, an assignment should force the compiler to convert a value to the precision represented by the variable's type (double, in my case). What if the variables were "float"? What if they were "int" (weird, but legal)?
So, in short, what does the C standard say about these cases?
By my "naive" point of view, an assignment should force the compiler to convert a value to the precision represented by the variable's type (double, in my case).
Yes, this is what the C99 standard says. See below.
So, in short, what does the C standard say about these cases?
The C99 standard allows, in some circumstances, floating-point operations to be computed at a higher precision than that implied by the type: look for FLT_EVAL_METHOD and FP_CONTRACT in the standard, these are the two constructs related to excess precision. But I am not aware of any words that could be interpreted as meaning that the compiler is allowed to arbitrarily reduce the precision of a floating-point value from the computing precision to the type precision. This should, in a strict interpretation of the standard, only happen in specific spots, such as assignments and casts, in a deterministic fashion.
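To see which evaluation format an implementation reports, FLT_EVAL_METHOD can be inspected directly. A small sketch follows; the function name describe_eval_method is an assumption made for illustration, and the descriptions paraphrase the C99 meanings of the values 0, 1 and 2.

```cpp
#include <cfloat>
#include <string>

// Describes the evaluation format the implementation reports through
// FLT_EVAL_METHOD (made available to C++ by <cfloat>).
std::string describe_eval_method()
{
    switch (FLT_EVAL_METHOD) {
    case 0:  return "operations evaluated in the precision of their type";
    case 1:  return "float and double evaluated as double";
    case 2:  return "all operations evaluated as long double";
    default: return "indeterminable or implementation-defined";
    }
}
```

Note that the reported value only describes what the implementation claims. As the discussion here shows, a compiler may report 2 and still drop excess precision at other points.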
The best is to read Joseph S. Myers's analysis of the parts relevant to FLT_EVAL_METHOD:
C99 allows evaluation with excess range and precision following
certain rules. These are outlined in 5.2.4.2.2 paragraph 8:
Except for assignment and cast (which remove all extra range and
precision), the values of operations with floating operands and
values subject to the usual arithmetic conversions and of floating
constants are evaluated to a format whose range and precision may
be greater than required by the type. The use of evaluation
formats is characterized by the implementation-defined value of
FLT_EVAL_METHOD:
Joseph S. Myers goes on to describe the situation in GCC before the patch that accompanies his post. The situation was just as bad as it is in your compiler (and countless others):
GCC defines FLT_EVAL_METHOD to 2 when using x87 floating point. Its
implementation, however, does not conform to the C99 requirements for
FLT_EVAL_METHOD == 2, since it is implemented by the back end
pretending that the processor supports operations on SFmode and
DFmode:
Sometimes, depending on optimization, a value may be spilled to
memory in SFmode or DFmode, so losing excess precision unpredictably
and in places other than when C99 specifies that it is lost.
An assignment will not generally lose excess precision, although
-ffloat-store may make it more likely that it does.
…
The C++ standard inherits the definition of math.h from C99, and math.h is the header that defines FLT_EVAL_METHOD. For this reason you might expect C++ compilers to follow suit, but they do not seem to be taking the issue as seriously. Even G++ still does not support -fexcess-precision=standard, although it uses the same back-end as GCC (which has supported this option since Joseph S. Myers' post and accompanying patch).
Related
I am working on floating point determinism and having already studied so many surprising potential causes of indeterminism, I am starting to get paranoid about copying floats:
Does anything in the C++ standard or in general guarantee me that a float lvalue, after being copied to another float variable or when used as a const-ref or by-value parameter, will always be bitwise equivalent to the original value?
Can anything cause a copied float to be bitwise inequivalent to the original value, such as changing the floating point environment or passing it into a different thread?
Here is some sample code based on what I use to check for equivalence of floating point values in my test-cases, this one will fail because it expects FE_TONEAREST:
#include <cfenv>
#include <cstdint>
// MSVC-specific pragmas for floating point control
#pragma float_control(precise, on)
#pragma float_control(except, on)
#pragma fenv_access(on)
#pragma fp_contract(off)
// May make a copy of the floats
bool compareFloats(float resultValue, float comparisonValue)
{
// I was originally doing a bit-wise comparison here but I was made
// aware in the comments that this might not actually be what I want
// so I only check against the equality of the values here now
// (NaN values etc. have to be handled extra)
bool areEqual = (resultValue == comparisonValue);
// Additional outputs if not equal
// ...
return areEqual;
}
int main()
{
std::fesetround(FE_TOWARDZERO);
float value = 1.f / 10;
float expectedResult = 0x1.99999ap-4;
compareFloats(value, expectedResult);
}
Do I have to be worried that if I pass a float by-value into the comparison function it might come out differently on the other side, even though it is an lvalue?
No, there is no such guarantee.
Subnormal, non-normalised floating points, and NaN are all cases where the bit patterns may differ.
I believe that signed negative zero is allowed to become a signed positive zero on assignment, although IEEE754 disallows that.
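The difference between "same value" and "same bits" can be made concrete. The sketch below assumes a 32-bit float (checked at compile time) and IEEE 754 signed zeros; the helper names bits_of, same_value and same_bits are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Returns the raw object representation of a float as a 32-bit integer.
// Assumes float is 32 bits wide (checked at compile time).
std::uint32_t bits_of(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Value comparison: what operator== sees.
bool same_value(float a, float b) { return a == b; }

// Representation comparison: what a bitwise test sees.
bool same_bits(float a, float b) { return bits_of(a) == bits_of(b); }
```

On a typical IEEE 754 platform, 0.0f and -0.0f compare equal by value but differ in their bits, while a NaN compares unequal to itself by value even when the bits match.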
The C++ standard itself has virtually no guarantees on floating point math because it does not mandate IEEE-754 but leaves it up to the implementation (emphasis mine):
[basic.fundamental/12]
There are three floating-point types: float, double, and long double.
The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The value representation of floating-point types is implementation-defined.
[ Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note ]
The C++ code you write is a high-level abstract description of what you want the abstract machine to do, and it is fully in the hands of the compiler what this gets translated to. "Assignments" is an aspect of the C++ standard, and as shown above, the C++ standard does not mandate the behavior of floating point operations. To verify the statement "assignments leave floating point values unchanged" your compiler would have to specify its floating point behavior in terms of the C++ abstract machine, and I've not seen any such documentation (especially not for MSVC).
In other words: Without nailing down the exact compiler, compiler version, compilation flags etc., it is impossible to say for sure what the floating point semantics of a C++ program are (especially regarding the difficult cases like rounding, NaNs or signed zero). Most compilers differentiate between strict IEEE conformance and relaxing some of those restrictions, but even then you are not necessarily guaranteed that the program has the same outputs in non-optimized vs optimized builds due to, say, constant folding, precision of intermediate results and so on.
Case in point: For gcc, even with -O0, your program in question does not compute 1.f / 10 at run-time but at compile-time and thus your rounding mode settings are ignored: https://godbolt.org/z/U8B6bc
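A sketch of how the division could be kept at run time so that the rounding mode has a chance to apply. The function name divide_with_rounding is hypothetical. The volatile operands are an assumption about how to defeat constant folding, and whether the mode is actually honoured still depends on the compiler and its flags (for example MSVC's fenv_access pragma or GCC's -frounding-math).

```cpp
#include <cfenv>

// Divides at run time under the given rounding mode, then restores the
// previous mode. The volatile operands keep the compiler from folding the
// division into a compile-time constant.
float divide_with_rounding(float numerator, float denominator, int mode)
{
    const int previous = std::fegetround();
    std::fesetround(mode);
    volatile float n = numerator;
    volatile float d = denominator;
    volatile float result = n / d;   // stored before the mode is restored
    std::fesetround(previous);
    return result;
}
```

Calling it with 1.0f, 10.0f and FE_TOWARDZERO should never give a larger result than the same call with FE_TONEAREST; on hardware that honours the mode the two results differ in the last bit.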
You should not be paranoid about copying floats in particular but paranoid of compiler optimizations for floating point in general.
It is common knowledge that one has to be careful when comparing floating point values. Usually, instead of using ==, we use some epsilon or ULP based equality testing.
However, I wonder, are there any cases, when using == is perfectly fine?
Look at this simple snippet, which cases are guaranteed to succeed?
void fn(float a, float b) {
float l1 = a/b;
float l2 = a/b;
if (l1==l1) { } // case a)
if (l1==l2) { } // case b)
if (l1==a/b) { } // case c)
if (l1==5.0f/3.0f) { } // case d)
}
int main() {
fn(5.0f, 3.0f);
}
Note: I've checked this and this, but they don't cover (all of) my cases.
Note2: It seems that I have to add some plus information, so answers can be useful in practice: I'd like to know:
what the C++ standard says
what happens, if a C++ implementation follows IEEE-754
This is the only relevant statement I found in the current draft standard:
The value representation of floating-point types is implementation-defined. [ Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note ]
So, does this mean, that even "case a)" is implementation defined? I mean, l1==l1 is definitely a floating-point operation. So, if an implementation is "inaccurate", then could l1==l1 be false?
I think this question is not a duplicate of Is floating-point == ever OK?. That question doesn't address any of the cases I'm asking. Same subject, different question. I'd like to have answers specifically to case a)-d), for which I cannot find answers in the duplicated question.
However, I wonder, are there any cases, when using == is perfectly fine?
Sure there are. One category of examples are usages that involve no computation, e.g. setters that should only execute on changes:
void setRange(float min, float max)
{
if(min == m_fMin && max == m_fMax)
return;
m_fMin = min;
m_fMax = max;
// Do something with min and/or max
emit rangeChanged(min, max);
}
See also Is floating-point == ever OK?.
Contrived cases may "work". Practical cases may still fail. One additional issue is that often optimisation will cause small variations in the way the calculation is done so that symbolically the results should be equal but numerically they are different. The example above could, theoretically, fail in such a case. Some compilers offer an option to produce more consistent results at a cost to performance. I would advise "always" avoiding the equality of floating point numbers.
Equality of physical measurements, as well as digitally stored floats, is often meaningless. So if you're comparing floats for equality in your code, you are probably doing something wrong. You usually want greater than, less than, or within a tolerance. Often code can be rewritten so these types of issues are avoided.
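For the "within a tolerance" case, a minimal sketch of a combined relative and absolute test is shown below. The name almost_equal and the default tolerances are assumptions chosen for illustration; suitable values depend on the application.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b differ by at most rel_tol times the
// larger magnitude, or by at most abs_tol (which matters near zero).
bool almost_equal(double a, double b, double rel_tol = 1e-9, double abs_tol = 1e-12)
{
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(rel_tol * scale, abs_tol);
}
```

Special values such as NaN and infinities are not treated as equal by this sketch and would need separate handling.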
Only a) and b) are guaranteed to succeed in any sane implementation (see the legalese below for details), as they compare two values that have been derived in the same way and rounded to float precision. Consequently, both compared values are guaranteed to be identical to the last bit.
Case c) and d) may fail because the computation and subsequent comparison may be carried out with higher precision than float. The different rounding of double should be enough to fail the test.
Note that cases a) and b) may still fail if infinities or NaNs are involved, though.
Legalese
Using the N3242 C++11 working draft of the standard, I find the following:
In the text describing the assignment expression, it is explicitly stated that type conversion takes place, [expr.ass] 3:
If the left operand is not of class type, the expression is implicitly converted (Clause 4) to the cv-unqualified type of the left operand.
Clause 4 refers to the standard conversions [conv], which contain the following on floating point conversions, [conv.double] 1:
A prvalue of floating point type can be converted to a prvalue of another floating point type. If the
source value can be exactly represented in the destination type, the result of the conversion is that exact
representation. If the source value is between two adjacent destination values, the result of the conversion
is an implementation-defined choice of either of those values. Otherwise, the behavior is undefined.
(Emphasis mine.)
So we have the guarantee that the result of the conversion is actually defined, unless we are dealing with values outside the representable range (like float a = 1e300, which is UB).
When people think about "internal floating point representation may be more precise than visible in code", they think about the following sentence in the standard, [expr] 11:
The values of the floating operands and the results of floating expressions may be represented in greater
precision and range than that required by the type; the types are not changed thereby.
Note that this applies to operands and results, not to variables. This is emphasized by the attached footnote 60:
The cast and assignment operators must still perform their specific conversions as described in 5.4, 5.2.9 and 5.17.
(I guess, this is the footnote that Maciej Piechotka meant in the comments - the numbering seems to have changed in the version of the standard he's been using.)
So, when I say float a = some_double_expression;, I have the guarantee that the result of the expression is actually rounded to be representable by a float (invoking UB only if the value is out-of-bounds), and a will refer to that rounded value afterwards.
An implementation could indeed specify that the result of the rounding is random, and thus break the cases a) and b). Sane implementations won't do that, though.
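The conversion guarantee discussed above can be observed with a round trip. This is a sketch under the assumption that the input lies within float's range (otherwise the conversion is undefined, as noted); the function name narrowing_changes_value is hypothetical.

```cpp
// Returns true when converting d to float and back changes its value,
// i.e. when d is not exactly representable as a float.
bool narrowing_changes_value(double d)
{
    float narrowed = static_cast<float>(d);   // floating point conversion to float
    return static_cast<double>(narrowed) != d;
}
```

A value such as 0.5 survives the round trip unchanged, while 0.1 does not, because the nearest float to 0.1 differs from the nearest double.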
Assuming IEEE 754 semantics, there are definitely some cases where you can do this. Conventional floating point number computations are exact whenever they can be, which for example includes (but is not limited to) all basic operations where the operands and the results are integers.
So if you know for a fact that you don't do anything that would result in something unrepresentable, you are fine. For example
float a = 1.0f;
float b = 1.0f;
float c = 2.0f;
assert(a + b == c); // you can safely expect this to succeed
The situation only really gets bad if you have computations with results that aren't exactly representable (or that involve operations which aren't exact) and you change the order of operations.
Note that the C++ standard itself doesn't guarantee IEEE 754 semantics, but that's what you can expect to be dealing with most of the time.
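As a further illustration of exact integer arithmetic, the sketch below repeatedly adds 1.0f. It assumes IEEE 754 single precision, where every integer up to 2^24 in magnitude is exactly representable; the function name count_up is hypothetical.

```cpp
// Adds 1.0f to a running total `count` times. As long as the total stays
// at or below 2^24, every partial sum is an integer that float represents
// exactly, so no rounding error accumulates.
float count_up(int count)
{
    float total = 0.0f;
    for (int i = 0; i < count; ++i) {
        total += 1.0f;
    }
    return total;
}
```

Under those assumptions, count_up(1000) == 1000.0f holds exactly. The same loop with a step of 0.1f would not give such a guarantee.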
Case (a) fails if a == b == 0.0. In this case, the operation yields NaN, and by definition (IEEE, not C) NaN ≠ NaN.
Cases (b) and (c) can fail in parallel computation when floating-point rounding modes (or other computation modes) are changed in the middle of this thread's execution. Seen this one in practice, unfortunately.
Case (d) can be different because the compiler (on some machine) may choose to constant-fold the computation of 5.0f/3.0f and replace it with the constant result (of unspecified precision), whereas a/b must be computed at runtime on the target machine (which might be radically different). In fact, intermediate calculations may be performed in arbitrary precision. I've seen differences on old Intel architectures when intermediate computation was performed in 80-bit floating-point, a format that the language didn't even directly support.
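The NaN failure of case (a) can be reproduced with a small sketch. It assumes an IEEE 754 implementation (the C++ standard itself leaves floating-point division by zero undefined) and a build without fast-math style options; the function name quotient_equals_itself is hypothetical.

```cpp
// Mirrors case (a) from the question: compares a quotient with itself.
// Under IEEE 754, 0.0f / 0.0f is NaN, and NaN never compares equal to itself.
bool quotient_equals_itself(float a, float b)
{
    volatile float divisor = b;   // keeps the division at run time
    float l1 = a / divisor;
    return l1 == l1;
}
```

With the arguments 5.0f and 3.0f the function returns true; with 0.0f and 0.0f it returns false.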
In my humble opinion, you should not rely on the == operator because it has many corner cases. The biggest problem is rounding and extended precision. On x86, floating point operations can be done with more precision than you can store in variables (if you use the x87 coprocessor; IIRC SSE operations use the same precision as storage).
This is usually a good thing, but it causes problems like:
1./3 != 1./3 because one value comes from a variable and the other from a floating point register. In the simplest cases it will work, but if you add other floating point operations the compiler could decide to spill some variables to the stack, changing their values and thus changing the result of the comparison.
To have 100% certainty you need to look at the assembly and see what operations were done before on both values. Even the order can change the result in non-trivial cases.
Overall, what is the point of using ==? You should use algorithms that are stable. This means they work even if values are not equal, but they still give the same results. The only place I know where == could be useful is serializing/deserializing, where you know exactly what result you want and you can alter the serialization to achieve your goal.
In the following code:
#include <cstdint>
#include <cinttypes>
#include <cstdio>
using namespace std;
int main() {
double xd = 1.18;
int64_t xi = 1000000000;
int64_t res1 = (double)(xi * xd);
double d = xi * xd;
int64_t res2 = d;
printf("%" PRId64"\n", res1);
printf("%" PRId64"\n", res2);
}
Using v4.9.3 g++ -std=c++14 targeting 32-bit Windows I get output:
1179999999
1180000000
Are these values allowed to be different?
I expected that, even if the compiler uses a higher internal precision than double for the computation of xi * xd, it should do this consistently. Loss of precision in floating conversion is implementation-defined, and the precision of this calculation is also implementation-defined - [c.limits]/3 says that FLT_EVAL_METHOD should be imported from C99. IOW I expected that it should not be allowed to use a different precision for xi * xd on one line than it does on another line.
Note: This is intentionally a C++ question and not a C question - I believe the two languages have different rules in this area.
even if the compiler uses a higher internal precision than double for the computation of xi * xd, it should do this consistently
Whether required or not (discussed below), this clearly doesn't happen: Stack Overflow is littered with questions from people who've seen similar-seeming calculations change for no ostensible reason within the same program.
The C++ Standard draft n3690 says (emphasis mine):
The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby. [footnote 62]
62) The cast and assignment operators must still perform their specific conversions as described in 5.4, 5.2.9 and 5.17.
So - in agreement with M.M.'s comment and contrary to my earlier edit - it's the version with the (double) cast that must be rounded to a 64-bit double - which evidently happens to be >= 1180000000 in the run documented in the question - before truncation to integer. The more general case sans 62) leaves the compiler freedom not to round early in the other case.
[c.limits]/3 says that FLT_EVAL_METHOD should be imported from C99. IOW I expected that it should not be allowed to use a different precision for xi * xd on one line than it does on another line.
Check the cppreference page:
Regardless of the value of FLT_EVAL_METHOD, any floating-point expression may be contracted, that is, calculated as if all intermediate results have infinite range and precision (unless #pragma STDC FP_CONTRACT is off)
As tmyklebu comments, it continues:
Cast and assignment strip away any extraneous range and precision: this models the action of storing a value from an extended-precision FPU register into a standard-sized memory location.
This last agrees with the "62)" part of the Standard.
M.M. comments:
STDC FP_CONTRACT does not seem to appear in the C++ Standard and also it's not clear to me exactly to what extent the C99 behaviour is 'imported'
Doesn't appear in the draft I looked at. That suggests C++ doesn't guarantee its availability, leaving the default mentioned above of "any floating-point expression may be contracted", but we know per M.M. comments and the Standard and cppreference quotes above the (double) cast is an exception forcing rounding to 64 bits.
The C++ Standard draft mentioned above says of <cfloat>:
The contents are the same as the Standard C library header <float.h>.
See also: ISO C 7.1.5, 5.2.4.2.2, 5.2.4.2.1.
If one of those C Standards required STDC FP_CONTRACT there's more chance of it being portable for use by C++ programs, but I've not surveyed implementations for support.
Depending on FLT_EVAL_METHOD, xi * xd may be calculated with higher precision than double. If xi were so large that it cannot be represented exactly in double, then I'm not even sure if the compiler would be allowed to convert it exactly to long double or not - probably not, because that conversion happens before anything covered by FLT_EVAL_METHOD. There is no requirement that higher precision must be used consistently.
There are two places where conversion to double must happen: At the point of the cast (double) and at the point of assignment to a double. There have been gcc versions where the cast to double was "optimised" away if a value was already "officially" a double (like xi * xd here) even if in reality it was higher precision; that "optimisation" was always a bug because a cast must convert.
So you may have run into this bug where a cast to double wasn't performed (if the bug is still there), you may have run into inconsistent use of higher precision, which is legal if FLT_EVAL_METHOD allows it, and you may even have run into inconsistent use of higher precision when FLT_EVAL_METHOD didn't allow it at all, which would again be a bug (not the inconsistency, but the use of higher precision in the first place).
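A sketch of how the rounding to double could be forced explicitly before the truncation, so that the result does not depend on whether the compiler keeps excess precision. The helper name scale_to_int is hypothetical, and the volatile store is an assumption about how to make the compiler round on implementations where a plain cast or assignment is not honoured.

```cpp
#include <cstdint>

// Hypothetical helper: multiplies, forces the product to be rounded to a
// 64-bit double by storing it through a volatile, then truncates.
std::int64_t scale_to_int(std::int64_t factor, double value)
{
    volatile double rounded = static_cast<double>(factor) * value;
    return static_cast<std::int64_t>(rounded);
}
```

Assuming IEEE 754 doubles and round-to-nearest, scale_to_int(1000000000, 1.18) yields 1180000000, because the exact product lies closer to 1180000000 than to any other double.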
Say I have the following:
int i = 23;
float f = 3.14;
if (i == f) // do something
i will be promoted to a float and the two float numbers will be compared, but can a float represent all int values? Why not promote both the int and the float to a double?
When int is converted to unsigned by the usual arithmetic conversions, negative values are also lost (which leads to such fun as 0u < -1 being true).
Like most mechanisms in C (that are inherited in C++), the usual arithmetic conversions should be understood in terms of hardware operations. The makers of C were very familiar with the assembly language of the machines with which they worked, and they wrote C to make immediate sense to themselves and people like themselves when writing things that would until then have been written in assembly (such as the UNIX kernel).
Now, processors, as a rule, do not have mixed-type instructions (add float to double, compare int to float, etc.) because it would be a huge waste of real estate on the wafer -- you'd have to implement as many times more opcodes as you want to support different types. That you only have instructions for "add int to int," "compare float to float", "multiply unsigned with unsigned" etc. makes the usual arithmetic conversions necessary in the first place -- they are a mapping of two types to the instruction family that makes most sense to use with them.
From the point of view of someone who's used to writing low-level machine code, if you have mixed types, the assembler instructions you're most likely to consider in the general case are those that require the least conversions. This is particularly the case with floating points, where conversions are runtime-expensive, and particularly back in the early 1970s, when C was developed, computers were slow, and when floating point calculations were done in software. This shows in the usual arithmetic conversions -- only one operand is ever converted (with the single exception of long/unsigned int, where the long may be converted to unsigned long, which does not require anything to be done on most machines. Perhaps not on any where the exception applies).
So, the usual arithmetic conversions are written to do what an assembly coder would do most of the time: you have two types that don't fit, convert one to the other so that it does. This is what you'd do in assembler code unless you had a specific reason to do otherwise, and to people who are used to writing assembler code and do have a specific reason to force a different conversion, explicitly requesting that conversion is natural. After all, you can simply write
if((double) i < (double) f)
It is interesting to note in this context, by the way, that unsigned is higher in the hierarchy than int, so that comparing int with unsigned will end in an unsigned comparison (hence the 0u < -1 bit from the beginning). I suspect this to be an indicator that people in olden times considered unsigned less as a restriction on int than as an extension of its value range: We don't need the sign right now, so let's use the extra bit for a larger value range. You'd use it if you had reason to expect that an int would overflow -- a much bigger worry in a world of 16-bit ints.
Even double may not be able to represent all int values, depending on how many bits int contains.
Why not promote both the int and the float to a double?
Probably because it's more costly to convert both types to double than use one of the operands, which is already a float, as float. It would also introduce special rules for comparison operators incompatible with rules for arithmetic operators.
There's also no guarantee how floating point types will be represented, so it would be a blind shot to assume that converting int to double (or even long double) for comparison will solve anything.
The type promotion rules are designed to be simple and to work in a predictable manner. The types in C/C++ are naturally "sorted" by the range of values they can represent. See this for details. Although floating point types cannot represent all integers represented by integral types because they can't represent the same number of significant digits, they might be able to represent a wider range.
To have predictable behavior, when requiring type promotions, the numeric types are always converted to the type with the larger range to avoid overflow in the smaller one. Imagine this:
int i = 23464364; // more digits than float can represent!
float f = 123.4212E36f; // larger range than int can represent!
if (i == f) { /* do something */ }
If the conversion was done towards the integral type, the float f would certainly overflow when converted to int, leading to undefined behavior. On the other hand, converting i to f only causes a loss of precision which is irrelevant since f has the same precision so it's still possible that the comparison succeeds. It's up to the programmer at that point to interpret the result of the comparison according to the application requirements.
Finally, besides the fact that double precision floating point numbers suffer from the same problem representing integers (limited number of significant digits), using promotion on both types would lead to having a higher precision representation for i, while f is doomed to have the original precision, so the comparison will not succeed if i has more significant digits than f to begin with. That would also be unpredictable behavior: the comparison might succeed for some pairs (i,f) but not for others.
can a float represent all int values?
For a typical modern system where both int and float are stored in 32 bits, no. Something's gotta give. 32 bits' worth of integers doesn't map 1-to-1 onto a same-sized set that includes fractions.
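The loss can be observed with a round trip through float. This sketch assumes a 32-bit int and IEEE 754 single precision with round-to-nearest, where 2^24 + 1 is the first positive integer that float cannot represent; the function name round_trip_through_float is hypothetical.

```cpp
// Converts an int to float and back. Assumes the float result is within
// int's range, otherwise the conversion back is undefined behaviour.
int round_trip_through_float(int value)
{
    volatile float as_float = static_cast<float>(value);  // rounds to float precision
    return static_cast<int>(as_float);
}
```

Under those assumptions, 16777216 survives the round trip, while 16777217 comes back as 16777216.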
The i will be promoted to a float and the two float numbers will be compared…
Not necessarily. You don't really know what precision will apply. C++14 §5/12:
The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
Although i after promotion has nominal type float, the value may be represented using double hardware. C++ doesn't guarantee floating-point precision loss or overflow. (This is not new in C++14; it's inherited from C since olden days.)
Why not promote both the int and the float to a double?
If you want optimal precision everywhere, use double instead and you'll never see a float. Or long double, but that might run slower. The rules are designed to be relatively sensible for the majority of use-cases of limited-precision types, considering that one machine may offer several alternative precisions.
Most of the time, fast and loose is good enough, so the machine is free to do whatever is easiest. That might mean a rounded, single-precision comparison, or double precision and no rounding.
But, such rules are ultimately compromises, and sometimes they fail. To precisely specify arithmetic in C++ (or C), it helps to make conversions and promotions explicit. Many style guides for extra-reliable software prohibit using implicit conversions altogether, and most compilers offer warnings to help you expunge them.
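Making the conversions explicit might look like the following sketch. Both function names are hypothetical, and the example values assume a 32-bit int with IEEE 754 float and double.

```cpp
// Compares after converting the int to float, which is what the usual
// arithmetic conversions nominally do. The volatile store makes sure the
// converted value really is rounded to float precision.
bool equal_as_float(int i, float f)
{
    volatile float converted = static_cast<float>(i);
    return converted == f;
}

// Compares with both operands explicitly widened to double.
bool equal_as_double(int i, float f)
{
    return static_cast<double>(i) == static_cast<double>(f);
}
```

For i = 16777217 and f = 16777216.0f the float comparison reports equality, because i is rounded during the conversion, while the double comparison does not.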
To learn about how these compromises came about, you can peruse the C rationale document. (The latest edition covers up to C99.) It is not just senseless baggage from the days of the PDP-11 or K&R.
It is fascinating that a number of answers here argue from the origin of the C language, explicitly naming K&R and historical baggage as the reason that an int is converted to a float when combined with a float.
This is pointing the blame to the wrong parties. In K&R C, there was no such thing as a float calculation. All floating point operations were done in double precision. For that reason, an integer (or anything else) was never implicitly converted to a float, but only to a double. A float also could not be the type of a function argument: you had to pass a pointer to float if you really, really, really wanted to avoid conversion into a double. For that reason, the functions
int x(float a)
{ ... }
and
int y(a)
float a;
{ ... }
have different calling conventions. The first gets a float argument, the second (by now no longer permissible as syntax) gets a double argument.
Single-precision floating point arithmetic and function arguments were only introduced with ANSI C. Kernighan/Ritchie is innocent.
Now with the newly available single float expressions (single float previously was only a storage format), there also had to be new type conversions. Whatever the ANSI C team picked here (and I would be at a loss for a better choice) is not the fault of K&R.
Q1: Can a float represent all int values?
IEEE 754 single precision can represent all integers exactly as floats up to 2^24 in magnitude, as mentioned in this answer.
Q2: Why not promote both the int and the float to a double?
The rules in the Standard for these conversions are slight modifications of those in K&R: the modifications accommodate the added types and the value preserving rules. Explicit license was added to perform calculations in a “wider” type than absolutely necessary, since this can sometimes produce smaller and faster code, not to mention the correct answer more often. Calculations can also be performed in a “narrower” type by the as if rule so long as the same end result is obtained. Explicit casting can always be used to obtain a value in a desired type.
Source
Performing calculations in a wider type means that given float f1; and float f2;, f1 + f2 might be calculated in double precision. And it means that given int i; and float f;, i == f might be calculated in double precision. But it isn't required to calculate i == f in double precision, as hvd stated in the comment.
The C standard also says so. These are known as the usual arithmetic conversions. The following description is taken straight from the ANSI C standard.
...if either operand has type float, the other operand is converted to type float.
Source and you can see it in the ref too.
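To make the difference concrete, here is a small sketch (helper names are hypothetical, and it assumes a 32-bit int and IEEE 754 binary32/binary64 for float/double). One helper spells out the int-to-float conversion that the quoted rule asks for, the other compares in double. Whether a plain i == f behaves exactly like the first helper can depend on FLT_EVAL_METHOD, which is why the conversion is stored in a float object here.

```c
/* Comparison with the int operand rounded to float first,
   as the usual arithmetic conversions describe for int == float. */
static int equal_as_float(int i, float f)
{
    volatile float fi = (float)i;   /* force rounding to float precision */
    return fi == f;
}

/* Comparison with both operands explicitly widened to double. */
static int equal_as_double(int i, float f)
{
    return (double)i == (double)f;
}
```

With i = 16777217 and f = 16777216.0f, equal_as_float returns 1 (the int rounds to 16777216.0f) while equal_as_double returns 0.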
A relevant link is this answer. A more analytic source is here.
Here is another way to explain this: The usual arithmetic conversions are performed implicitly to bring the operands to a common type. The compiler first performs integer promotion; if the operands still have different types, they are converted to the type that appears highest in the following hierarchy:
Source.
When a programming language is created some decisions are made intuitively.
For instance, why not convert int+float to int+int instead of float+float or double+double? Why call int->float a promotion if it holds the same amount of bits? Why not call float->int a promotion?
If you rely on implicit type conversions you should know how they work, otherwise just convert manually.
Some language could have been designed without any automatic type conversions at all. And not every decision during a design phase could have been made logically with a good reason.
JavaScript with its duck typing has even more obscure decisions under the hood. Designing an absolutely logical language is impossible; I think it goes back to Gödel's incompleteness theorem. You have to balance logic, intuition, practice and ideals.
The question is why: Because it is fast, easy to explain, easy to compile, and these were all very important reasons at the time when the C language was developed.
You could have had a different rule: That for every comparison of arithmetic values, the result is that of comparing the actual numerical values. That would be somewhere between trivial if one of the expressions compared is a constant, one additional instruction when comparing signed and unsigned int, and quite difficult if you compare long long and double and want correct results when the long long cannot be represented as double. (0u < -1 would be false, because it would compare the numerical values 0 and -1 without considering their types).
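As a rough sketch of the "one additional instruction" case for signed versus unsigned int, the two helpers below (names are hypothetical) contrast what C does today with a comparison by numerical value. The cast in the first helper is written out only to make the implicit conversion visible.

```c
/* What C does for unsigned < int: the int is converted to unsigned first. */
static int less_builtin(unsigned a, int b)
{
    return a < (unsigned)b;             /* -1 becomes UINT_MAX */
}

/* Comparison of the actual numerical values: a negative b is never greater than a. */
static int less_by_value(unsigned a, int b)
{
    return b >= 0 && a < (unsigned)b;   /* one extra sign test */
}
```

For the pair (0u, -1), less_builtin returns 1 and less_by_value returns 0.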
In Swift, the problem is solved easily by disallowing operations between different types.
The rules are written for 16 bit ints (smallest required size). Your compiler with 32 bit ints surely converts both sides to double. There are no float registers in modern hardware anyway so it has to convert to double. Now if you have 64 bit ints I'm not too sure what it does. long double would be appropriate (normally 80 bits but it's not even standard).
This is what I am trying to do:
//Let Bin2Float be a magic macro that packages specified bit pattern into float as a constant
const float MyInf = Bin2Float(01111111,10000000,00000000,00000000);
We all know how to package the bit patterns into integers ("binary constant" hacks) and the input to this magic prototype macro is the same as would be for corresponding 32-bit integer binary constant macro. Packaging the bits into integer constant is not a problem. But, after playing with pointer and union punning, I realized that type-punning the integer into float, however, leads to many issues (some on MSVC side, some on gcc side). So here is the list of requirements:
Must compile under gcc (C mode), g++, MSVC (even if I have to use conditional compiling to do two separate versions)
Must compile for both C and C++
In the resulting assembly code, must compile into a hardcoded constant, not be dynamically computed
Must not use memcpy
Must not use static or global variables
Must not use pointer-based type punning, to avoid issues with strict aliasing
First, there is rarely a need to specify floating-point constants in this way. For infinity, use INFINITY. For a NaN, use either NAN or nanf(string). These are defined in <math.h>. The compiler is likely to compile INFINITY and NAN to some sort of assembly-language constant (could be in the read-only data section, could be formed in immediate fields of instructions, et cetera). However, this cannot be guaranteed except by the compiler implementors, since the C standard does not guarantee it. nanf is likely to result in a function call, although the compiler is free to optimize it to a constant, if the string is a constant. For finite numbers, use hexadecimal floating-point constants (e.g., “0x3.4p5”). The only IEEE 754 floating-point object you cannot completely specify this way, down to the last bit, is a NaN. The nan and nanf functions are not fully specified by the C standard, so you do not have full control of the significand bits unless the implementation provides it.
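A short sketch of these alternatives, assuming a C99 (or later) compiler; the function names are placeholders for illustration. The hexadecimal constant 0x3.4p5 means 3.25 × 2^5, that is, 104.

```c
#include <math.h>

/* Infinity and a quiet NaN via the <math.h> macros. */
static float my_inf(void) { return INFINITY; }
static float my_nan(void) { return NAN; }

/* A finite value written as a hexadecimal floating-point constant:
   0x3.4 is 3 + 4/16 = 3.25, and p5 scales it by 2^5. */
static float my_hex(void) { return 0x3.4p5f; }
```

Hexadecimal floating-point constants are a C99 feature; standard C++ only gained them in C++17, so the last function may need a different spelling there.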
I am unfamiliar with the binary constant hacks you allude to. Supposing you have a macro Bin2Unsigned that provides an unsigned int, then you can use this:
const float MyInf = (union { unsigned u; float f; }) { Bin2Unsigned(…) } .f;
That is, believe it or not, standard C syntax and semantics up to the point where the bits are reinterpreted as a float. Obviously, the interpretation of the bits depends on the implementation. However, the compound literal and reinterpreting through a union are specified by the C standard.
I tested with gcc version 4.2.1 (Apple Inc. build 5666), targeting x86_64, with -O3 and default options otherwise, and the resulting assembly code used a constant, .long 2139095040.
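The constant 2139095040 is 0x7F800000, the IEEE 754 binary32 bit pattern for +infinity. A small sketch to check the same union trick at run time, assuming a 32-bit unsigned and a binary32 float; bits_to_float is a hypothetical helper, not part of the original macro:

```c
#include <math.h>

/* Reinterpret a 32-bit pattern as a float through a union compound literal.
   Assumes unsigned and float are both 32 bits wide with the same byte order. */
static float bits_to_float(unsigned u)
{
    return (union { unsigned u; float f; }){ u }.f;
}
```

Under those assumptions, bits_to_float(0x7F800000u) is +infinity and bits_to_float(0x3F800000u) is 1.0f. Compound literals are a C feature and are not part of standard C++, so this covers only the C half of the requirements.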