The following C/C++ code:
long long foo = -9223372036854775808LL; // -2^63
compiles (g++) with the warning
integer constant is so large that it is unsigned.
clang++ gives a similar warning.
Thanks to this bug report, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52661, I now have some idea of why GCC gives this warning. Unfortunately, the response to the bug report doesn't explain the reason for this behaviour very well.
Questions:
Why is no warning given for the equivalent code for a 32/16/8-bit signed integer constant?
GCC and Clang both give this warning, so it is clearly intentional behaviour and not just 'to make it easier to parse,' as is suggested in response to the bug report. Why?
Is this behaviour mandated by the C/C++ standard? Some other standard?
This has to do with how the type of integer constants is defined.
First, as mentioned in the gcc bug report, -9223372036854775808LL is actually two tokens: the unary - operator and the integer constant 9223372036854775808LL. So the warning applies only to the latter.
Section 6.4.4.1p5 of the C standard states:
The type of an integer constant is the first of the corresponding list in which its value can be represented.
Based on this, a decimal integer constant with no suffix will have type int, long, or long long depending on its value; these are all signed types. So any value small enough to fit in an 8-bit or 16-bit type still has type int, and a value too large for a 32-bit signed int will have type long or long long depending on the sizes of those types on that system. The same goes for a constant with the LL suffix, except that only the long long type is tried.
The warning comes up because the value you're using doesn't fit in any type in that list. Any smaller value would end up with a signed type, and no conversion to unsigned would occur.
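As a rough illustration of that type ladder, here is a minimal compile-time sketch using decltype and static_assert. It assumes a typical LP64 platform (32-bit int, 64-bit long and long long); on a platform with a 32-bit long, 2147483648 would instead get type long long.
#include <type_traits>
// Unsuffixed decimal constants take the first signed type in the list that can hold them:
static_assert(std::is_same<decltype(32767), int>::value, "small enough: int");
static_assert(std::is_same<decltype(2147483648), long>::value, "too big for 32-bit int: long");
static_assert(std::is_same<decltype(9223372036854775807LL), long long>::value, "LL suffix: long long");
// 9223372036854775808LL fits in none of the signed types in the list,
// which is exactly the case that triggers the warning.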
As various more or less confused people in the bug report said, the integer constant 9223372036854775808LL is too large to fit inside a long long.
For decimal constants, the standard has a list in 6.4.4.1 (see the answer by @dbush) describing which types the compiler will try to give to an integer constant. In this case, the only valid option is (signed) long long, and the value won't fit there. Then §6 under that table kicks in:
If an integer constant cannot be represented by any type in its list, it may have an extended integer type, if the extended integer type can represent its value. /--/ If the list contains both signed and unsigned types, the extended integer type may be signed or unsigned.
Extended integer type is a fuzzy but formal term in the standard. In this case the compiler apparently tries to squeeze the constant into an unsigned long long "extended integer type", where it fits. This isn't really guaranteed behaviour but implementation-defined.
Then the unary - operator is applied to the unsigned long long which produces the warning.
This is the reason why library headers such as limits.h like to define LLONG_MIN as
#define LLONG_MIN (-9223372036854775807LL - 1)
You could do something similar to avoid this warning. Or better yet, use LLONG_MIN.
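For what it's worth, here is a minimal sketch of both spellings; the commented-out line is the one from the question that triggers the warning.
#include <climits>
long long a = LLONG_MIN;                    // preferred: no arithmetic, no warning
long long b = -9223372036854775807LL - 1;   // same value, built only from constants that fit in long long
// long long c = -9223372036854775808LL;    // warns: 9223372036854775808LL fits no signed type in the list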
Why is no warning given for the equivalent code for a 32/16/8-bit signed integer constant?
A constant is not limited to 8, 16, or 32 bits. It gets the first type in the list that fits, and an unsuffixed decimal constant can therefore use up to at least 63 value bits (the minimum range of long long).
9223372036854775808LL is outside OP's long long range, since representing 9223372036854775808 requires 64 value bits.
The - is applied after the constant is made.
On an implementation where int, long, and long long are 32, 32, and 64 bits: -2147483648 has type long long, not int, because 2147483648 fits neither int nor long.
GCC and Clang both give this warning, so it is clearly intentional behavior and not just 'to make it easier to parse,' as is suggested in response to the bug report. Why?
No comment; the linked bug report was not informative. Best to stick to the standard text here.
Is this behavior mandated by the C/C++ standard? Some other standard?
Yes, by the C standard.
Related
The C and C++ standards stipulate that, in binary operations between a signed and an unsigned integer of the same rank, the signed integer is cast to unsigned. There are many questions on SO caused by this... let's call it strange behavior: unsigned to signed conversion, C++ Implicit Conversion (Signed + Unsigned), A warning - comparison between signed and unsigned integer expressions, % (mod) with mixed signedness, etc.
But none of these give any reasons as to why the standard goes this way, rather than casting towards signed ints. I did find a self-proclaimed guru who says it's the obvious right thing to do, but he doesn't give a reasoning either: http://embeddedgurus.com/stack-overflow/2009/08/a-tutorial-on-signed-and-unsigned-integers/.
Looking through my own code, wherever I combine signed and unsigned integers, I always need to cast from unsigned to signed. There are places where it doesn't matter, but I haven't found a single example of code where it makes sense to cast the signed integer to unsigned.
What are the cases where converting to unsigned is the correct thing to do? Why is the standard the way it is?
Casting from unsigned to signed results in implementation-defined behaviour if the value cannot be represented. Casting from signed to unsigned is always modulo two to the power of the unsigned's bitsize, so it is always well-defined.
The standard conversion is to the signed type if every possible unsigned value is representable in the signed type. Otherwise, the unsigned type is chosen. This guarantees that the conversion is always well-defined.
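A minimal sketch of the conversion rules described above, assuming a 32-bit int and unsigned int (the exact value printed for UINT_MAX is of course platform-dependent):
#include <climits>
#include <cstdio>

int main() {
    int s = -1;
    unsigned u = s;          // well-defined: -1 + (UINT_MAX + 1) == UINT_MAX
    std::printf("%u\n", u);  // e.g. 4294967295
    unsigned big = UINT_MAX;
    int back = big;          // implementation-defined before C++20 (value doesn't fit);
                             // in practice -1 on 2's-complement machines
    std::printf("%d\n", back);
    return 0;
}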
Notes
As indicated in comments, the conversion algorithm for C++ was inherited from C to maintain compatibility, which is technically the reason it is so in C++.
When this note was written, the C++ standard allowed three binary representations, including sign-magnitude and ones' complement. That's no longer the case, and there's every reason to believe that it won't be the case for C either in the reasonably near future. I'm leaving the footnote as a historical relic, but it says nothing relevant to the current language.
It has been suggested that the decision in the standard to define signed-to-unsigned conversions and not unsigned-to-signed conversions is somehow arbitrary, and that the other possible decision would be symmetric. However, the possible conversions are not symmetric.
In both of the non-2's-complement representations contemplated by the standard, an n-bit signed representation can represent only 2^n − 1 values, whereas an n-bit unsigned representation can represent 2^n values. Consequently, a signed-to-unsigned conversion is lossless and can be reversed (although one unsigned value can never be produced). The unsigned-to-signed conversion, on the other hand, must collapse two different unsigned values onto the same signed result.
In a comment, the formula sint = uint > sint_max ? uint - uint_max : uint is proposed. This coalesces the values uint_max and 0; both are mapped to 0. That's a little weird even for non-2s-complement representations, but for 2's-complement it's unnecessary and, worse, it requires the compiler to emit code to laboriously compute this unnecessary conflation. By contrast the standard's signed-to-unsigned conversion is lossless and in the common case (2's-complement architectures) it is a no-op.
If conversion to signed had been chosen, then a simple a+1 would always result in a signed type (unless the constant was written as 1U).
Assume a is an unsigned int; then this seemingly innocent a+1 could lead to things like undefined signed overflow or an "index out of bounds" in the case of arr[a+1].
Thus, "unsigned casting" seems like a safer approach because people probably don't even expect casting to be happening in the first place, when simply adding a constant.
This is sort of a half-answer, because I don't really understand the committee's reasoning.
From the C90 committee's rationale document: https://www.lysator.liu.se/c/rat/c2.html#3-2-1-1
Since the publication of K&R, a serious divergence has occurred among implementations of C in the evolution of integral promotion rules. Implementations fall into two major camps, which may be characterized as unsigned preserving and value preserving. The difference between these approaches centers on the treatment of unsigned char and unsigned short, when widened by the integral promotions, but the decision has an impact on the typing of constants as well (see §3.1.3.2).
... and apparently also on the conversions done to match the two operands for any operator. It continues:
Both schemes give the same answer in the vast majority of cases, and both give the same effective result in even more cases in implementations with twos-complement arithmetic and quiet wraparound on signed overflow --- that is, in most current implementations.
It then specifies a case where ambiguity of interpretation arises, and states:
The result must be dubbed questionably signed, since a case can be made for either the signed or unsigned interpretation. Exactly the same ambiguity arises whenever an unsigned int confronts a signed int across an operator, and the signed int has a negative value. (Neither scheme does any better, or any worse, in resolving the ambiguity of this confrontation.) Suddenly, the negative signed int becomes a very large unsigned int, which may be surprising --- or it may be exactly what is desired by a knowledgable programmer. Of course, all of these ambiguities can be avoided by a judicious use of casts.
and:
The unsigned preserving rules greatly increase the number of situations where unsigned int confronts signed int to yield a questionably signed result, whereas the value preserving rules minimize such confrontations. Thus, the value preserving rules were considered to be safer for the novice, or unwary, programmer. After much discussion, the Committee decided in favor of value preserving rules, despite the fact that the UNIX C compilers had evolved in the direction of unsigned preserving.
Thus, they consider the case of int + unsigned an unwanted situation, and chose conversion rules for char and short that yield as few of those situations as possible, even though most compilers at the time followed a different approach. If I understand right, this choice then forced them to follow the current choice of int + unsigned yielding an unsigned operation.
I still find all of this truly bizarre.
Why does C++ standard specify signed integer be cast to unsigned in binary operations with mixed signedness?
I suppose that you mean converted rather than "cast". A cast is an explicit conversion.
As I'm not the author nor have I encountered documentation about this decision, I cannot promise that my explanation is the truth. However, there is a fairly reasonable potential explanation: because that's how C works, and C++ was based on C. Unless there was an opportunity to improve upon the rules, there would be no reason to change what works and what programmers have been used to. I don't know if the committee even deliberated changing this.
I know what you may be thinking: "Why does the C standard specify signed integer...". Well, I'm also not the author of the C standard, but there is at least a fairly extensive document titled "Rationale for American National Standard for Information Systems - Programming Language - C". As extensive as it is, it unfortunately doesn't cover this question (it does cover the very similar question of how to promote integer types narrower than int, in which regard the standard differs from some of the C implementations that pre-date it).
I don't have access to a pre-standard K&R documents, but I did find a passage from book "Expert C Programming: Deep C Secrets" which quotes rules from the pre-standard K&R C (in context of comparing the rule with the standardised ones):
Section 6.6 Arithmetic Conversions
A great many operators cause conversions and yield result types in a similar way. This pattern will be called the "usual arithmetic conversions."
First, any operands of type char or short are converted to int, and any of type float are converted to double. Then if either operand is double, the other is converted to double and that is the type of the result. Otherwise, if either operand is long, the other is converted to long and that is the type of the result. Otherwise, if either operand is unsigned, the other is converted to unsigned and that is the type of the result. Otherwise, both operands must be int, and that is the type of the result.
So, it appears that this has been the rule since before the standardisation of C and was presumably chosen by the language's designer himself. Unless someone can find a written rationale, we may never know the answer.
What are the cases where converting to unsigned is the correct thing to do?
Here is an extremely simple case:
unsigned u = INT_MAX;  // INT_MAX from <climits>
u + 42;                // unsigned arithmetic, well-defined even though the mathematical result exceeds INT_MAX
The type of the literal 42 is signed, so under the proposed "convert to signed" rule, u + 42 would also be signed. That would be quite surprising and would give the program shown above undefined behaviour due to signed integer overflow.
Basically, implicit conversion to signed and to unsigned each have their problems.
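For balance, here is a minimal sketch of a well-known surprise caused by the actual "convert to unsigned" rule (the mixed-signedness comparison that several of the questions linked earlier are about); most compilers will warn about the comparison, but it compiles:
int main() {
    int s = -1;
    unsigned u = 1;
    // s is converted to unsigned and becomes a huge value, so the comparison is
    // false even though mathematically -1 < 1.
    return (s < u) ? 0 : 1;  // returns 1
}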
In code compiled on i386 Linux using g++, I have used a static_cast<char>() cast on a value that might exceed the valid range of [-128, 127] for a char. There were no errors or exceptions, and so I used the code in production.
The problem is that now I don't know how this code might behave when a value outside this range is thrown at it. There is no problem if the data is modified or truncated; I only need to know how this modification behaves on this particular platform.
Also, what would happen if a C-style cast ((char)value) had been used? Would it behave differently?
In your case this would be an explicit type conversion, or to be more precise an integral conversion.
The standard says about this (§4.7):
If the destination type is signed, the value is unchanged if it can be represented in the destination type (and bit-field width); otherwise, the value is implementation-defined.
So your problem is implementation-defined. On the other hand, I have not yet seen a compiler that does anything other than truncate the larger value to the smaller one, so in practice you can expect plain truncation.
So it should be fairly safe to just cast your integer/short to char.
I don't know the rules for a C cast by heart, and I really try to avoid them because it is not easy to say which rule will kick in.
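A minimal sketch of what this typically looks like on a 2's-complement platform with an 8-bit signed char (such as the i386 Linux target from the question); the result is implementation-defined, so the value in the comment is an expectation, not a guarantee:
#include <cstdio>

int main() {
    int value = 300;                    // outside the char range [-128, 127]
    char c = static_cast<char>(value);  // implementation-defined value
    // Common implementations keep the low 8 bits: 300 mod 256 == 44.
    std::printf("%d\n", static_cast<int>(c));  // typically prints 44
    return 0;
}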
This is dealt with in §4.7 of the standard (integral conversions).
The answer depends on whether, in the implementation in question, char is signed or unsigned. If it is unsigned, then modulo arithmetic is applied. §4.7/2 of C++11 states: "If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type)." This means that if the input integer is not negative, the normal bit truncation you expect will occur. If it is negative, the same will apply if negative numbers are represented by 2's complement; otherwise, the conversion will be bit-altering.
If char is signed, §4.7/3 of C++11 applies: "If the destination type is signed, the value is unchanged if it can be represented in the destination type (and bit-field width); otherwise, the value is implementation-defined." So it is up to the documentation for the particular implementation you use. Having said that, on 2's-complement systems (i.e. all those in normal use) I have not seen a case where anything other than normal bit truncation occurs for char types: apart from anything else, by virtue of §3.9.1/1 of the C++11 standard all character types (char, unsigned char and signed char) must have the same object representation and alignment.
The effect of a C-style cast, an explicit static_cast, and an implicit narrowing conversion is the same.
Technically, the language specs agree in imposing a plain base-2 representation for unsigned types, and for unsigned plain base-2 it's pretty obvious what extension and truncation do.
For signed types, however, the specs are more "tolerant", allowing potentially different kinds of processors to use different ways to represent signed numbers. And since the same number may have different representations on different platforms, it is practically impossible to give a single description of what happens to it when bits are added or removed.
For this reason, the language specification stays vaguer, saying only that "the value is unchanged if it can be represented in the destination type (and bit-field width); otherwise, the value is implementation-defined".
In other words, compiler manufacturers are required to do the best they can to preserve the numeric value, but when that cannot be done, they are free to do whatever is most efficient for them.
Recently I had to perform some data type conversions from float to 16 bit integer. Essentially my code reduces to the following
float f_val = 99999.0;
short int si_val = static_cast<short int>(f_val);
// si_val is now -32768
This input value was a problem, and in my code I had neglected to check the limits of the float value, so I can see my fault, but it made me wonder about the exact rules of the language when one has to do this kind of ungainly cast. I was slightly surprised to find that the value after the cast was -32768. Furthermore, this is the value I get whenever the value of the float exceeds the limits of a 16-bit integer. I have googled this but found a surprising lack of detailed information about it. The best I could find was the following from cplusplus.com:
Converting to int from some smaller integer type, or to double from float, is known as promotion, and is guaranteed to produce the exact same value in the destination type. Other conversions between arithmetic types may not always be able to represent the same value exactly:
If the conversion is from a floating-point type to an integer type, the value is truncated (the decimal part is removed).
The conversions from/to bool consider false equivalent to zero (for numeric types) and to null pointer (for pointer types); and true equivalent to all other values.
Otherwise, when the destination type cannot represent the value, the conversion is valid between numerical types, but the value is implementation-specific (and may not be portable).
This suggestion that the results are implementation defined does not surprise me, but I have heard that cplusplus.com is not always reliable.
Finally, when performing the same cast from a 32-bit integer to a 16-bit integer (again with a value outside the 16-bit range), I saw results clearly indicating integer overflow. Although I was not surprised by this, it has added to my confusion due to the inconsistency with the cast from the float type.
I have no access to the C++ standard, but a lot of C++ people here do, so I was wondering what the standard says on this issue. Just for completeness, I am using g++ version 4.6.3.
You're right to question what you've read. The conversion has no defined behaviour, which contradicts what you quoted in your question.
4.9 Floating-integral conversions [conv.fpint]
1 A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [ Note: If the destination type is bool, see 4.12. -- end note ]
One potentially useful permitted result that you might get is a crash.
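Given that, the usual fix is to check the range before converting. Here is a minimal sketch assuming a 16-bit short; the helper name clamp_to_short is made up for this example:
#include <cmath>
#include <limits>

// Clamp out-of-range (and NaN) inputs so the final static_cast is always well-defined.
short clamp_to_short(float f) {
    if (std::isnan(f)) return 0;  // pick whatever policy makes sense for NaN
    if (f >= static_cast<float>(std::numeric_limits<short>::max()))
        return std::numeric_limits<short>::max();
    if (f <= static_cast<float>(std::numeric_limits<short>::min()))
        return std::numeric_limits<short>::min();
    return static_cast<short>(f);  // now in range: truncation toward zero, well-defined
}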
I am modifying legacy code that utilizes a "long long" (LL) data type definition for a hard-coded constant, as follows:
0xFFFFFFFFFFFFFFFFLL
I trust that the LL appended to the constant guarantees that this constant will be interpreted as a long long.
However, I do not want to depend on long long having any particular compiler-dependent interpretation in terms of the number of bits.
Therefore, I would like my variable declaration to do without the LL in the constant, and instead use:
uint64_t a = static_cast<uint64_t>(0xFFFFFFFFFFFFFFFF);
I would like to think that the constant 0xFFFFFFFFFFFFFFFF is not interpreted by the compiler as a 32-bit integer BEFORE the cast to uint64_t, which would result in a being a 64-bit integer that contained the value 0xFFFFFFFF, rather than the desired value.
(My current 64-bit compilers of interest are VS 2010, and Ubuntu 12.04 LTS GCC. However, I would hope that this code behaves in the desired way for any modern compiler.)
Will the above code work as desired for most or all modern compilers, so that the value of a is properly set to include all digits, as desired, from the constant 0xFFFFFFFFFFFFFFFF, WITHOUT including the LL at the end of the constant?
(Note: Including I64 at the end of the constant gives a compiler error. Perhaps there is another token that needs (or can) be included at the end of the constant to tell the compiler to interpret the constant as a 64-bit integer?)
(Also: Perhaps even the static_cast<uint64_t> is unnecessary, since the variable is explicitly being defined as uint64_t?)
To reduce what Andy says to the essentials: if the implementation has one or more standard integer types that is capable of representing 0xFFFFFFFFFFFFFFFF, then the literal 0xFFFFFFFFFFFFFFFF has one of those types.
It doesn't really matter to you which one, since no matter which it is, the result of the conversion to uint64_t is the same.
If the (pre-C++11) implementation doesn't have any integer type big enough, then (a) the program is ill-formed, so you should get a diagnostic; and (b) it probably won't have uint64_t anyway.
You are correct that the static_cast is unnecessary. It does the same conversion that assigning to uint64_t would do anyway. Sometimes a cast will suppress compiler warnings that you get for certain implicit integer conversions, but I think it's unlikely that any compiler would warn for an implicit conversion in this case. Often there won't be one, since 0xFFFFFFFFFFFFFFFF will commonly have type uint64_t already.
As an aside, it's probably better to write static_cast<uint64_t>(-1), or just uint64_t a = -1;. It's guaranteed to be equal to 0xFFFFFFFFFFFFFFFF, but it's much easier for a reader to see the difference between -1 and 0xFFFFFFFFFFFFFFF than it is to see the difference between 0xFFFFFFFFFFFFFFFF and 0xFFFFFFFFFFFFFFF.
Per Paragraph 2.14.2/2 of the C++11 Standard:
The type of an integer literal is the first of the corresponding list in Table 6 in which its value can be represented.
Table 6 specifies that for hexadecimal literal constants, the type of the literal should be:
int; or (if it doesn't fit)
unsigned int; or (if it doesn't fit)
long int; or (if it doesn't fit)
unsigned long int; or (if it doesn't fit)
long long int; or (if it doesn't fit)
unsigned long long int.
If 0xFFFFFFFFFFFFFFFF does not fit in any of the first five types in that list, its type will be unsigned long long int; if unsigned long is already 64 bits wide (as with GCC on a 64-bit Linux target), the literal will instead stop earlier at unsigned long. Either way, as long as you are working with a 64-bit compiler, the constant is interpreted as a 64-bit unsigned value containing all the digits, as you hoped.
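A small compile-time sketch of the point made by both answers; the static_asserts assume an implementation where uint64_t exists and one of the 64-bit unsigned standard types holds the constant:
#include <cstdint>
#include <type_traits>

// The literal already has an unsigned 64-bit type (unsigned long on LP64 Linux,
// unsigned long long with MSVC), so no truncation happens and no cast is needed.
static_assert(std::is_same<decltype(0xFFFFFFFFFFFFFFFF), unsigned long>::value
           || std::is_same<decltype(0xFFFFFFFFFFFFFFFF), unsigned long long>::value,
              "the hex literal takes the first unsigned type in the list that fits");
uint64_t a = 0xFFFFFFFFFFFFFFFF;  // all 64 bits set
uint64_t b = -1;                  // equivalent, as suggested above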
I was having a look over this page: http://www.devbistro.com/tech-interview-questions/Cplusplus.jsp, and didn't understand this question:
What’s potentially wrong with the following code?
long value;
//some stuff
value &= 0xFFFF;
Note: Hint to the candidate about the base platform they’re developing for. If the person still doesn’t find anything wrong with the code, they are not experienced with C++.
Can someone elaborate on it?
Thanks!
Several answers here state that if an int has a width of 16 bits, 0xFFFF is negative. This is not true. 0xFFFF is never negative.
A hexadecimal literal is represented by the first of the following types that is large enough to contain it: int, unsigned int, long, and unsigned long.
If int has a width of 16 bits, then 0xFFFF is larger than the maximum value representable by an int. Thus, 0xFFFF is of type unsigned int, which is guaranteed to be large enough to represent 0xFFFF.
When the usual arithmetic conversions are performed for evaluation of the &, the unsigned int is converted to a long. The conversion of a 16-bit unsigned int to long is well-defined because every value representable by a 16-bit unsigned int is also representable by a 32-bit long.
There's no sign extension needed because the initial type is not signed, and the result of using 0xFFFF is the same as the result of using 0xFFFFL.
Alternatively, if int is wider than 16 bits, then 0xFFFF is of type int. It is a signed, but positive, number. In this case both operands are signed, and long has the greater conversion rank, so the int is again promoted to long by the usual arithmetic conversions.
As others have said, you should avoid performing bitwise operations on signed operands because the numeric result is dependent upon how signedness is represented.
Aside from that, there's nothing particularly wrong with this code. I would argue that it's a style concern that value is not initialized when it is declared, but that's probably a nit-pick level comment and depends upon the contents of the //some stuff section that was omitted.
It's probably also preferable to use a fixed-width integer type (like uint32_t) instead of long for greater portability, but really that too depends on the code you are writing and what your basic assumptions are.
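A minimal sketch of that more portable spelling, with the variable initialized (unlike the original snippet); the value 0xDEADBEEF is just an example:
#include <cstdint>

int main() {
    uint32_t value = 0xDEADBEEF;  // fixed width, and initialized before use
    value &= 0xFFFFu;             // unambiguously keeps the low 16 bits: 0xBEEF
    return value == 0xBEEFu ? 0 : 1;  // returns 0
}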
I think depending on the size of a long the 0xffff literal (-1) could be promoted to a larger size and being a signed value it will be sign extended, potentially becoming 0xffffffff (still -1).
I'll assume it's because there's no predefined size for a long, other than it must be at least as big as the preceding size (int). Thus, depending on the size, you might either truncate value to a subset of bits (if long is more than 32 bits) or overflow (if it's less than 32 bits).
Yeah, longs (per the spec, and thanks for the reminder in the comments) must be able to hold at least -2147483647 to 2147483647 (LONG_MIN and LONG_MAX).
For one, value isn't initialized before doing the AND, so I think the behaviour is undefined; value could be anything.
The size of the long type is platform/compiler specific.
What you can say here is:
It is signed.
We can't know the result of value &= 0xFFFF;, since for example it could effectively be value &= 0x0000FFFF; (on a 32-bit long) and not do what is expected.
While one could argue that since it's not a buffer-overflow or some other error that's likely to be exploitable, it's a style thing and not a bug, I'm 99% confident that the answer that the question-writer is looking for is that value is operated on before it's assigned to. The value is going to be arbitrary garbage, and that's unlikely to be what was meant, so it's "potentially wrong".
Using MSVC I think that the statement would perform what was most likely intended - that is: clear all but the least significant 16 bits of value, but I have encountered other platforms which would interpret the literal 0xffff as equivalent to (short)-1, then sign extend to convert to long, in which case the statement "value &= 0xFFFF" would have no effect.
"value &= 0x0FFFF" is more explicit and robust.