What happens in C++ when an integer type is cast to a floating point type or vice-versa? - c++

Do the underlying bits just get "reinterpreted" as a floating point value? Or is there a run-time conversion to produce the nearest floating point value?
Is endianness a factor on any platforms (i.e., endianness of floats differs from ints)?
How do different width types behave (e.g., int to float vs. int to double)?
What does the language standard guarantee about the safety of such casts/conversions? By cast, I mean a static_cast or C-style cast.
What about the inverse float to int conversion (or double to int)? If a float holds a small magnitude value (e.g., 2), does the bit pattern have the same meaning when interpreted as an int?

Do the underlying bits just get "reinterpreted" as a floating point value?
No, the value is converted according to the rules in the standard.
is there a run-time conversion to produce the nearest floating point value?
Yes there's a run-time conversion.
For floating point -> integer, the value is truncated, provided that the source value is in range of the integer type. If it is not, behaviour is undefined. At least I think that it's the source value, not the result, that matters. I'd have to look it up to be sure. The boundary case if the target type is char, say, would be CHAR_MAX + 0.5. I think it's undefined to cast that to char, but as I say I'm not certain.
For integer -> floating point, the result is the exact same value if possible, or else is one of the two floating point values either side of the integer value. Not necessarily the nearer of the two.
Is endianness a factor on any platforms (i.e., endianness of floats differs from ints)?
No, never. The conversions are defined in terms of values, not storage representations.
How do different width types behave (e.g., int to float vs. int to double)?
All that matters is the ranges and precisions of the types. Assuming 32 bit ints and IEEE 32 bit floats, it's possible for an int->float conversion to be imprecise. Assuming also 64 bit IEEE doubles, it is not possible for an int->double conversion to be imprecise, because all int values can be exactly represented as a double.
What does the language standard guarantee about the safety of such casts/conversions? By cast, I mean a static_cast or C-style cast.
As indicated above, it's safe except in the case where a floating point value is converted to an integer type, and the value is outside the range of the destination type.
If a float holds a small magnitude value (e.g., 2), does the bit pattern have the same meaning when interpreted as an int?
No, it does not. The IEEE 32 bit representation of 2 is 0x40000000.

For reference, this is what ISO-IEC 14882-2003 says
4.9 Floating-integral conversions
An rvalue of a floating point type can be converted to an rvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [Note:If the destination type is `bool, see 4.12. ]
An rvalue of an integer type or of an enumeration type can be converted to an rvalue of a floating point type. The result is exact if possible. Otherwise, it is an implementation-defined choice of either the next lower or higher representable value. [Note:loss of precision occurs if the integral value cannot be represented exactly as a value of the floating type. ] If the source type is bool, the value falseis converted to zero and the value true is converted to one.
Reference: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Other highly valuable references on the subject of fast float to int conversions:
Know your FPU
Let's Go To The (Floating) Point
Know your FPU: Fixing Floating Fast
Have a good read!

There are normally run-time conversions, as the bit representations are not generally compatible (with the exception that binary 0 is normally both 0 and 0.0). The C and C++ standards deal only with value, not representation, and specify generally reasonable behavior. Remember that a large int value will not normally be exactly representable in a float, and a large float value cannot be represented by an int.
Therefore:
All conversions are by value, not bit patterns. Don't worry about the bit patterns.
Don't worry about endianness, either, since that's a matter of bitwise representation.
Converting int to float can lose precision if the integer value is large in absolute value; it is less likely to with double, since double is more precise, and can represent many more exact numbers. (The details depend on what representations the system is actually using.)
The language definitions say nothing about bit patterns.
Converting from float to int is also a matter of values, not bit patterns. An exact floating-point 2.0 will convert to an integral 2 because that's how the implementation is set up, not because of bit patterns.

When you convert an integer to a float, you are not liable to loose any precision unless you are dealing with extremely large integers.
When you convert a float to an int you are essentially performing the floor() operation. so it just drops the bits after the decimal
For more information on floating point read: http://www.lahey.com/float.htm
The IEEE single-precision format has 24 bits of mantissa, 8 bits of exponent, and a sign bit. The internal floating-point registers in Intel microprocessors such as the Pentium have 64 bits of mantissa, 15 bits of exponent and a sign bit. This allows intermediate calculations to be performed with much less loss of precision than many other implementations. The down side of this is that, depending upon how intermediate values are kept in registers, calculations that look the same can give different results.
So if your integer uses more than 24 bits (excluding the hidden leading bit), then you are likely to loose some precision in conversion.

Reinterpreted? The term "reinterpretation" usually refers to raw memory reinterpretation. It is, of course, impossible to meaningfully reinterpret an integer value as a floating-point value (and vice versa) since their physical representations are generally completely different.
When you cast the types, a run-time conversion is being performed (as opposed to reinterpretation). The conversion is normally not just conceptual, it requires an actual run-time effort, because of the difference in physical representation. There are no language-defined relationships between the bit patterns of source and target values. Endianness plays no role in it either.
When you convert an integer value to a floating-point type, the original value is converted exactly if it can be represented exactly by the target type. Otherwise, the value will be changed by the conversion process.
When you convert a floating-point value to integer type, the fractional part is simply discarded (i.e. not the nearest value is taken, but the number is rounded towards zero). If the result does not fit into the target integer type, the behavior is undefined.
Note also, that floating-point to integer conversions (and the reverse) are standard conversions and formally require no explicit cast whatsoever. People might sometimes use an explicit cast to suppress compiler warnings.

If you cast the value itself, it will get converted (so in a float -> int conversion 3.14 becomes 3).
But if you cast the pointer, then you will actually 'reinterpret' the underlying bits. So if you do something like this:
double d = 3.14;
int x = *reinterpret_cast<int *>(&d);
x will have a 'random' value that is based on the representation of floating point.

Converting FP to integral type is nontrivial and not even completely defined.
Typically your FPU implements a hardware instruction to convert from IEEE format to int. That instruction might take parameters (implemented in hardware) controlling rounding. Your ABI probably specifies round-to-nearest-even. If you're on X86+SSE, it's "probably not too slow," but I can't find a reference with one Google search.
As with anything FP, there are corner cases. It would be nice if infinity were mapped to (TYPE)_MAX but that is alas not typically the case — the result of int x = INFINITY; is undefined.

Related

Is there a function that can convert every double to a unique uint64_t, maintaining precision and ORDER? (Why can't I find one?)

My understanding is that
Doubles in C++ are (at least conceptually) encoded as double-precision IEEE 754-encoded floating point numbers.
IEEE 754 says that such numbers can be represented with 64 bits.
So I should expect there exists a function f that can map every double to a unique uint64_t, and that the order should be maintained -- namely, for all double lhs, rhs, lhs < rhs == f(lhs) < f(rhs), except when (lhs or rhs is NaN).
I haven't been able to find such a function in a library or StackOverflow answer, even though such a function is probably useful to avoid instantiating an extra template for doubles in sort algorithms where double is rare as a sort-key.
I know that simply dividing by EPSILON would not work because the precision actually decreases as the numbers get larger (and improves as numbers get very close to zero); I haven't quite worked out the exact details of that scaling, though.
Surely there exists such a function in principle.
Have I not found it because it cannot be written in standard C++? That it would be too slow? That it's not as useful to people as I think?
If the representations of IEEE-754 64-bit floats are treated as 64-bit twos-complement values those values have the same order as the corresponding floating point values. The only adjustment involved is the mental one to see the pattern of bits as representing either a floating-point value or an integer value. In the CPU that’s easy: you have 64 bits of data stored in memory, and if you apply a floating-point operation to those bits you’re doing floating-point operations and if you apply integer operations to those bits you’re doing integer operations.
In C++ the type of the data determines the type of operations you can do. To apply floating-point operations to a 64-bit data object that object has to be a floating-point type. To apply integral operations it has to be an integral type.
To convert bit patterns from floating-point to integral:
std::int64_t to_int(double d) {
std::int64_t res
std::memcpy(&res, &d, sizeof(std::int64_t));
return res;
}
Converting in the other direction is left as an exercise for the reader.

Z3 - Floating point arithmetic API function Z3_mk_fpa_to_ubv

I am playing around with Z3-4.6.0 C++ for the first time. Sorry for the noob questions.
My question has 2 parts.
If I have a floating point number, and I use Z3_mk_fpa_to_ubv(...) function to create an unsigned bit-vector.
How much precision is lost?
If the precision is not lost, can I use this new unsigned bit-vector as a regular bit-vector and apply all operations defined for it for e.g., Z3_mk_bvadd(....)?
I know I can use Z3_mk_fpa_to_ieee_bv(....) for graceful, and IEEE-754 compliant conversion. Afterwards I can add,sub etc the bit-vectors.
Just being curious.
Thank you very much.
I'm afraid you're misinterpreting the role of these functions. A good reference to keep open while working with SMTLib floats is: http://smtlib.cs.uiowa.edu/papers/BTRW15.pdf
mk_fpa_to_ubv
This function corresponds to the FPToUInt function in the cited paper. It's defined as follows:
(The NaN choice above is misleading: It should be read as "undefined.")
Note that the precision loss can be huge here, depending on what the FP value is and the bit-width of the vector. Imagine converting a double-precision floating point value to an 8-bit word: You're smashing values in the range ±2.23×10^−308 to ±1.80×10^308 to a mere 256 different values. This means a large number of conversions simply will go through massive rounding cancelations.
You should think of this as "casting" in C like languages:
unsigned char c;
double f;
c = (char) f;
This is the essence of conversion from double-precision to unsigned byte, which will suffer major precision loss. In the other direction, if you convert to a really large bit-vector (say one that has a thousand bits), then your conversion will still be losing precision per the rounding mode, though you'll be able to cover all the integer values precisely in the range. So, it really depends on what BV-type you convert to and the rounding mode you choose.
mk_fpa_to_ieee_bv
This function has nothing to do with "preserving" the value. So asking "precision loss" here is irrelevant. What it does is that it gives you the underlying bit-vector representation of the floating-point value, per the IEEE-754 spec. The wikipedia article has a good discussion on this representation: https://en.wikipedia.org/wiki/Double-precision_floating-point_format#IEEE_754_double-precision_binary_floating-point_format:_binary64
In particular, if you interpret the output of this function as a two's complement integer value, you'll get a completely irrelevant value that has nothing to do with the value of the floating-point number itself. (Also, this conversion is not unique since NaN has multiple corresponding bit-vector patterns.)
Summary
Long story short, conversions from floats to bit-vectors will suffer from precision loss not only due to losing the "fractional" part due to rounding, but also due to the limited range, unless you pick a very-large bit-vector size. The IEEE-754 representation conversion does not preserve value, and thus doing arithmetic on values converted via this function is more or less meaningless.

Why does C++ promote an int to a float when a float cannot represent all int values?

Say I have the following:
int i = 23;
float f = 3.14;
if (i == f) // do something
i will be promoted to a float and the two float numbers will be compared, but can a float represent all int values? Why not promote both the int and the float to a double?
When int is promoted to unsigned in the integral promotions, negative values are also lost (which leads to such fun as 0u < -1 being true).
Like most mechanisms in C (that are inherited in C++), the usual arithmetic conversions should be understood in terms of hardware operations. The makers of C were very familiar with the assembly language of the machines with which they worked, and they wrote C to make immediate sense to themselves and people like themselves when writing things that would until then have been written in assembly (such as the UNIX kernel).
Now, processors, as a rule, do not have mixed-type instructions (add float to double, compare int to float, etc.) because it would be a huge waste of real estate on the wafer -- you'd have to implement as many times more opcodes as you want to support different types. That you only have instructions for "add int to int," "compare float to float", "multiply unsigned with unsigned" etc. makes the usual arithmetic conversions necessary in the first place -- they are a mapping of two types to the instruction family that makes most sense to use with them.
From the point of view of someone who's used to writing low-level machine code, if you have mixed types, the assembler instructions you're most likely to consider in the general case are those that require the least conversions. This is particularly the case with floating points, where conversions are runtime-expensive, and particularly back in the early 1970s, when C was developed, computers were slow, and when floating point calculations were done in software. This shows in the usual arithmetic conversions -- only one operand is ever converted (with the single exception of long/unsigned int, where the long may be converted to unsigned long, which does not require anything to be done on most machines. Perhaps not on any where the exception applies).
So, the usual arithmetic conversions are written to do what an assembly coder would do most of the time: you have two types that don't fit, convert one to the other so that it does. This is what you'd do in assembler code unless you had a specific reason to do otherwise, and to people who are used to writing assembler code and do have a specific reason to force a different conversion, explicitly requesting that conversion is natural. After all, you can simply write
if((double) i < (double) f)
It is interesting to note in this context, by the way, that unsigned is higher in the hierarchy than int, so that comparing int with unsigned will end in an unsigned comparison (hence the 0u < -1 bit from the beginning). I suspect this to be an indicator that people in olden times considered unsigned less as a restriction on int than as an extension of its value range: We don't need the sign right now, so let's use the extra bit for a larger value range. You'd use it if you had reason to expect that an int would overflow -- a much bigger worry in a world of 16-bit ints.
Even double may not be able to represent all int values, depending on how much bits does int contain.
Why not promote both the int and the float to a double?
Probably because it's more costly to convert both types to double than use one of the operands, which is already a float, as float. It would also introduce special rules for comparison operators incompatible with rules for arithmetic operators.
There's also no guarantee how floating point types will be represented, so it would be a blind shot to assume that converting int to double (or even long double) for comparison will solve anything.
The type promotion rules are designed to be simple and to work in a predictable manner. The types in C/C++ are naturally "sorted" by the range of values they can represent. See this for details. Although floating point types cannot represent all integers represented by integral types because they can't represent the same number of significant digits, they might be able to represent a wider range.
To have predictable behavior, when requiring type promotions, the numeric types are always converted to the type with the larger range to avoid overflow in the smaller one. Imagine this:
int i = 23464364; // more digits than float can represent!
float f = 123.4212E36f; // larger range than int can represent!
if (i == f) { /* do something */ }
If the conversion was done towards the integral type, the float f would certainly overflow when converted to int, leading to undefined behavior. On the other hand, converting i to f only causes a loss of precision which is irrelevant since f has the same precision so it's still possible that the comparison succeeds. It's up to the programmer at that point to interpret the result of the comparison according to the application requirements.
Finally, besides the fact that double precision floating point numbers suffer from the same problem representing integers (limited number of significant digits), using promotion on both types would lead to having a higher precision representation for i, while f is doomed to have the original precision, so the comparison will not succeed if i has a more significant digits than f to begin with. Now that is also undefined behavior: the comparison might succeed for some couples (i,f) but not for others.
can a float represent all int values?
For a typical modern system where both int and float are stored in 32 bits, no. Something's gotta give. 32 bits' worth of integers doesn't map 1-to-1 onto a same-sized set that includes fractions.
The i will be promoted to a float and the two float numbers will be compared…
Not necessarily. You don't really know what precision will apply. C++14 §5/12:
The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
Although i after promotion has nominal type float, the value may be represented using double hardware. C++ doesn't guarantee floating-point precision loss or overflow. (This is not new in C++14; it's inherited from C since olden days.)
Why not promote both the int and the float to a double?
If you want optimal precision everywhere, use double instead and you'll never see a float. Or long double, but that might run slower. The rules are designed to be relatively sensible for the majority of use-cases of limited-precision types, considering that one machine may offer several alternative precisions.
Most of the time, fast and loose is good enough, so the machine is free to do whatever is easiest. That might mean a rounded, single-precision comparison, or double precision and no rounding.
But, such rules are ultimately compromises, and sometimes they fail. To precisely specify arithmetic in C++ (or C), it helps to make conversions and promotions explicit. Many style guides for extra-reliable software prohibit using implicit conversions altogether, and most compilers offer warnings to help you expunge them.
To learn about how these compromises came about, you can peruse the C rationale document. (The latest edition covers up to C99.) It is not just senseless baggage from the days of the PDP-11 or K&R.
It is fascinating that a number of answers here argue from the origin of the C language, explicitly naming K&R and historical baggage as the reason that an int is converted to a float when combined with a float.
This is pointing the blame to the wrong parties. In K&R C, there was no such thing as a float calculation. All floating point operations were done in double precision. For that reason, an integer (or anything else) was never implicitly converted to a float, but only to a double. A float also could not be the type of a function argument: you had to pass a pointer to float if you really, really, really wanted to avoid conversion into a double. For that reason, the functions
int x(float a)
{ ... }
and
int y(a)
float a;
{ ... }
have different calling conventions. The first gets a float argument, the second (by now no longer permissable as syntax) gets a double argument.
Single-precision floating point arithmetic and function arguments were only introduced with ANSI C. Kernighan/Ritchie is innocent.
Now with the newly available single float expressions (single float previously was only a storage format), there also had to be new type conversions. Whatever the ANSI C team picked here (and I would be at a loss for a better choice) is not the fault of K&R.
Q1: Can a float represent all int values?
IEE754 can represent all integers exactly as floats, up to about 223, as mentioned in this answer.
Q2: Why not promote both the int and the float to a double?
The rules in the Standard for these conversions are slight modifications of those in K&R: the modifications accommodate the added types and the value preserving rules. Explicit license was added to perform calculations in a “wider” type than absolutely necessary, since this can sometimes produce smaller and faster code, not to mention the correct answer more often. Calculations can also be performed in a “narrower” type by the as if rule so long as the same end result is obtained. Explicit casting can always be used to obtain a value in a desired type.
Source
Performing calculations in a wider type means that given float f1; and float f2;, f1 + f2 might be calculated in double precision. And it means that given int i; and float f;, i == f might be calculated in double precision. But it isn't required to calculate i == f in double precision, as hvd stated in the comment.
Also C standard says so. These are known as the usual arithmetic conversions . The following description is taken straight from the ANSI C standard.
...if either operand has type float , the other operand is converted to type float .
Source and you can see it in the ref too.
A relevant link is this answer. A more analytic source is here.
Here is another way to explain this: The usual arithmetic conversions are implicitly performed to cast their values in a common type. Compiler first performs integer promotion, if operands still have different types then they are converted to the type that appears highest in the following hierarchy:
Source.
When a programming language is created some decisions are made intuitively.
For instance why not convert int+float to int+int instead of float+float or double+double? Why call int->float a promotion if it holds the same about of bits? Why not call float->int a promotion?
If you rely on implicit type conversions you should know how they work, otherwise just convert manually.
Some language could have been designed without any automatic type conversions at all. And not every decision during a design phase could have been made logically with a good reason.
JavaScript with it's duck typing has even more obscure decisions under the hood. Designing an absolutely logical language is impossible, I think it goes to Godel incompleteness theorem. You have to balance logic, intuition, practice and ideals.
The question is why: Because it is fast, easy to explain, easy to compile, and these were all very important reasons at the time when the C language was developed.
You could have had a different rule: That for every comparison of arithmetic values, the result is that of comparing the actual numerical values. That would be somewhere between trivial if one of the expressions compared is a constant, one additional instruction when comparing signed and unsigned int, and quite difficult if you compare long long and double and want correct results when the long long cannot be represented as double. (0u < -1 would be false, because it would compare the numerical values 0 and -1 without considering their types).
In Swift, the problem is solved easily by disallowing operations between different types.
The rules are written for 16 bit ints (smallest required size). Your compiler with 32 bit ints surely converts both sides to double. There are no float registers in modern hardware anyway so it has to convert to double. Now if you have 64 bit ints I'm not too sure what it does. long double would be appropriate (normally 80 bits but it's not even standard).

What does the C++ standard say about results of casting value of a type that lies outside the range of the target type?

Recently I had to perform some data type conversions from float to 16 bit integer. Essentially my code reduces to the following
float f_val = 99999.0;
short int si_val = static_cast<short int>(f_val);
// si_val is now -32768
This input value was a problem and in my code I had neglected to check the limits of the float value so I can see my fault, but it made me wonder about the exact rules of the language when one has to do this kind of ungainly cast. I was slightly surprised to find that value of the cast was -32768. Furthermore, this is the value I get whenever the value of the float exceeds the limits of a 16 bit integer. I have googled this but found a surprising lack of detailed info about it. The best I could find was the following from cplusplus.com
Converting to int from some smaller integer type, or to double from
float is known as promotion, and is guaranteed to produce the exact
same value in the destination type. Other conversions between
arithmetic types may not always be able to represent the same value
exactly:
If the conversion is from a floating-point type to an integer type, the value
is truncated (the decimal part is removed).
The conversions from/to bool consider false equivalent to zero (for numeric
types) and to null pointer (for pointer types); and true equivalent to all
other values.
Otherwise, when the destination type cannot represent the value, the conversion
is valid between numerical types, but the value is
implementation-specific (and may not be portable).
This suggestion that the results are implementation defined does not surprise me, but I have heard that cplusplus.com is not always reliable.
Finally, when performing the same cast from a 32 bit integer to 16 bit integer (again with a value outisde of 16 bit range) I saw results clearly indicating integer overflow. Although I was not surprised by this it has added to my confusion due to the inconsistency with the cast from float type.
I have no access to the C++ standard, but a lot of C++ people here do so I was wondering what the standard says on this issue? Just for completeness, I am using g++ version 4.6.3.
You're right to question what you've read. The conversion has no defined behaviour, which contradicts what you quoted in your question.
4.9 Floating-integral conversions [conv.fpint]
1 A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates;
that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be
represented in the destination type. [ Note: If the destination type is bool, see 4.12. -- end note ]
One potentially useful permitted result that you might get is a crash.

std::numeric_limits::is_exact ... what is a usable definition?

As I interpret it, MSDN's definition of numeric_limits::is_exactis almost always false:
[all] calculations done on [this] type are free of rounding errors.
And IBM's definition is almost always true: (Or a circular definition, depending on how you read it)
a type that has exact representations for all its values
What I'm certain of is that I could store a 2 in both a double and a long and they would both be represented exactly.
I could then divide them both by 10 and neither would hold the mathematical result exactly.
Given any numeric data type T, what is the correct way to define std::numeric_limits<T>::is_exact?
Edit:
I've posted what I think is an accurate answer to this question from details supplied in many answers. This answer is not a contender for the bounty.
The definition in the standard (see NPE's answer) isn't very exact, is it? Instead, it's circular and vague.
Given that the IEC floating point standard has a concept of "inexact" numbers (and an inexact exception when a computation yields an inexact number), I suspect that this is the origin of the name is_exact. Note that of the standard types, is_exact is false only for float, double, and long double.
The intent is to indicate whether the type exactly represents all of the numbers of the underlying mathematical type. For integral types, the underlying mathematical type is some finite subset of the integers. Since each integral types exactly represents each and every one of the members of the subset of the integers targeted by that type, is_exact is true for all of the integral types. For floating point types, the underlying mathematical type is some finite range subset of the real numbers. (An example of a finite range subset is "all real numbers between 0 and 1".) There's no way to represent even a finite range subset of the reals exactly; almost all are uncomputable. The IEC/IEEE format makes matters even worse. With that format, computers can't even represent a finite range subset of the rational numbers exactly (let alone a finite range subset of the computable numbers).
I suspect that the origin of the term is_exact is the long-standing concept of "inexact" numbers in various floating point representation models. Perhaps a better name would have been is_complete.
Addendum
The numeric types defined by the language aren't the be-all and end-all of representations of "numbers". A fixed point representation is essentially the integers, so they too would be exact (no holes in the representation). Representing the rationals as a pair of standard integral types (e.g., int/int) would not be exact, but a class that represented the rationals as a Bignum pair would, at least theoretically, be "exact".
What about the reals? There's no way to represent the reals exactly because almost all of the reals are not computable. The best we could possibly do with computers is the computable numbers. That would require representing a number as some algorithm. While this might be useful theoretically, from a practical standpoint, it's not that useful at all.
Second Addendum
The place to start is with the standard. Both C++03 and C++11 define is_exact as being
True if the type uses an exact representation.
That is both vague and circular. It's meaningless. Not quite so meaningless is that integer types (char, short, int, long, etc.) are "exact" by fiat:
All integer types are exact, ...
What about other arithmetic types? The first thing to note is that the only other arithmetic types are the floating point types float, double, and long double (3.9.1/8):
There are three floating point types: float, double, and long double. ... The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types.
The meaning of the floating point types in C++ is markedly murky. Compare with Fortran:
A real datum is a processor approximation to the value of a real number.
Compare with ISO/IEC 10967-1, Language independent arithmetic (which the C++ standards reference in footnotes, but never as a normative reference):
A floating point type F shall be a finite subset of ℝ.
C++ on the other hand is moot with regard to what the floating point types are supposed to represent. As far as I can tell, an implementation could get away with making float a synonym for int, double a synonym for long, and long double a synonym for long long.
Once more from the standards on is_exact:
... but not all exact types are integer. For example, rational and fixed-exponent representations are exact but not integer.
This obviously doesn't apply to user-developed extensions for the simple reason that users are not allowed to define std::whatever<MyType>. Do that and you're invoking undefined behavior. This final clause can only pertain to implementations that
Define float, double, and long double in some peculiar way, or
Provide some non-standard rational or fixed point type as an arithmetic type and decide to provide a std::numeric_limits<non_standard_type> for these non-standard extensions.
I suggest that is_exact is true iff all literals of that type have their exact value. So is_exact is false for the floating types because the value of literal 0.1 is not exactly 0.1.
Per Christian Rau's comment, we can instead define is_exact to be true when the results of the four arithmetic operations between any two values of the type are either out of range or can be represented exactly, using the definitions of the operations for that type (i.e., truncating integer division, unsigned wraparound). With this definition you can cavil that floating-point operations are defined to produce the nearest representable value. Don't :-)
The problem of exactnes is not restricted to C, so lets look further.
Germane dicussion about redaction of standards apart, inexact has to apply to mathematical operations that require rounding for representing the result with the same type. For example, Scheme has such kind of definition of exactness/inexactness by mean of exact operations and exact literal constants see R5RS §6. standard procedures from http://www.schemers.org/Documents/Standards/R5RS/HTML
For case of double x=0.1 we either consider that 0.1 is a well defined double literal, or as in Scheme, that the literal is an inexact constant formed by an inexact compile time operation (rounding to the nearest double the result of operation 1/10 which is well defined in Q). So we always end up on operations.
Let's concentrate on +, the others can be defined mathematically by mean of + and group property.
A possible definition of inexactness could then be:
If there exists any pair of values (a,b) of a type such that a+b-a-b != 0,
then this type is inexact (in the sense that + operation is inexact).
For every floating point representation we know of (trivial case of nan and inf apart) there obviously exist such pair, so we can tell that float (operations) are inexact.
For well defined unsigned arithmetic model, + is exact.
For signed int, we have the problem of UB in case of overflow, so no warranty of exactness... Unless we refine the rule to cope with this broken arithmetic model:
If there exists any pair (a,b) such that (a+b) is well defined
and a+b-a-b != 0,
then the + operation is inexact.
Above well definedness could help us extend to other operations as well, but it's not really necessary.
We would then have to consider the case of / as false polymorphism rather than inexactness
(/ being defined as the quotient of Euclidean division for int).
Of course, this is not an official rule, validity of this answer is limited to the effort of rational thinking
The definition given in the C++ standard seems fairly unambiguous:
static constexpr bool is_exact;
True if the type uses an exact representation. All integer types are exact, but not all exact types are
integer. For example, rational and fixed-exponent representations are exact but not integer.
Meaningful for all specializations.
In C++ the int type is used to represent a mathematical integer type (i.e. one of the set of {..., -1, 0, 1, ...}). Due to the practical limitation of implementation, the language defines the minimum range of values that should be held by that type, and all valid values in that range must be represented without ambiguity on all known architectures.
The standard also defines types that are used to hold floating point numbers, each with their own range of valid values. What you won't find is the list of valid floating point numbers. Again, due to practical limitations the standard allows for approximations of these types. Many people try to say that only numbers that can be represented by the IEEE floating point standard are exact values for those types, but that's not part of the standard. Though it is true that the implementation of the language on binary computers has a standard for how double and float are represented, there is nothing in the language that says it has to be implemented on a binary computer. In other words float isn't defined by the IEEE standard, the IEEE standard is just an acceptable implementation. As such, if there were an implementation that could hold any value in the range of values that define double and float without rounding rules or estimation, you could say that is_exact is true for that platform.
Strictly speaking, T can't be your only argument to tell whether a type "is_exact", but we can infer some of the other arguments. Because you're probably using a binary computer with standard hardware and any publicly available C++ compiler, when you assign a double the value of .1 (which is in the acceptable range for the floating point types), that's not the number the computer will use in calculations with that variable. It uses the closest approximation as defined by the IEEE standard. Granted, if you compare a literal with itself your compiler should return true, because the IEEE standard is pretty explicit. We know that computers don't have infinite precision and therefore calculations that we expect to have a value of .1 won't necessarily end up with the same approximate representation that the literal value has. Enter the dreaded epsilon comparison.
To practically answer your question, I would say that for any type which requires an epsilon comparison to test for approximate equality, is_exact should return false. If strict comparison is sufficient for that type, it should return true.
std::numeric_limits<T>::is_exact should be false if and only if T's definition allows values that may be unstorable.
C++ considers any floating point literal to be a valid value for its type. And implementations are allowed to decide which values have exact stored representation.
So for every real number in the allowed range (such as 2.0 or 0.2), C++ always promises that the number is a valid double and never promises that the value can be stored exactly.
This means that two assumptions made in the question - while true for the ubiquitous IEEE floating point standard - are incorrect for the C++ definition:
I'm certain that I could store a 2 in a double exactly.
I could then divide [it] by 10 and [the double would not] hold the
mathematical result exactly.