Is masking before unsigned left shift in C/C++ too paranoid? - c++

This question is motivated by me implementing cryptographic algorithms (e.g. SHA-1) in C/C++, writing portable platform-agnostic code, and thoroughly avoiding undefined behavior.
Suppose that a standardized crypto algorithm asks you to implement this:
b = (a << 31) & 0xFFFFFFFF
where a and b are unsigned 32-bit integers. Notice that in the result, we discard any bits above the least significant 32 bits.
As a first naive approximation, we might assume that int is 32 bits wide on most platforms, so we would write:
unsigned int a = (...);
unsigned int b = a << 31;
We know this code won't work everywhere because int is 16 bits wide on some systems, 64 bits on others, and possibly even 36 bits. But using stdint.h, we can improve this code with the uint32_t type:
uint32_t a = (...);
uint32_t b = a << 31;
So we are done, right? That's what I thought for years. ... Not quite. Suppose that on a certain platform, we have:
// stdint.h
typedef unsigned short uint32_t;
The rule for performing arithmetic operations in C/C++ is that if the type (such as short) is narrower than int, then it gets widened to int if all values can fit, or unsigned int otherwise.
Let's say that the compiler defines short as 32 bits (signed) and int as 48 bits (signed). Then these lines of code:
uint32_t a = (...);
uint32_t b = a << 31;
will effectively mean:
unsigned short a = (...);
unsigned short b = (unsigned short)((int)a << 31);
Note that a is promoted to int because all of ushort (i.e. uint32) fits into int (i.e. int48).
But now we have a problem: shifting non-zero bits left into the sign bit of a signed integer type is undefined behavior. This problem happened because our uint32 was promoted to int48 - instead of being promoted to uint48 (where left-shifting would be okay).
Here are my questions:
Is my reasoning correct, and is this a legitimate problem in theory?
Is this problem safe to ignore because on every platform the next integer type is double the width?
Is it a good idea to defend against this pathological situation by pre-masking the input, like this: b = (a & 1) << 31;? (This will necessarily be correct on every platform, but it could make a speed-critical crypto algorithm slower than necessary.)
Clarifications/edits:
I'll accept answers for C or C++ or both. I want to know the answer for at least one of the languages.
The pre-masking logic may hurt bit rotation. For example, GCC will compile b = (a << 31) | (a >> 1); to a single 32-bit rotate instruction in assembly language. But if we pre-mask the left shift, the new logic may no longer be translated into a bit rotation, which means 4 operations are performed instead of 1.
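For concreteness, here is the rotation idiom as a stand-alone function (my sketch; the function name is made up):

#include <stdint.h>

/* The unmasked idiom that GCC recognizes as a single rotate instruction.
   Note this is exactly the form with the theoretical promotion hazard
   discussed above. */
uint32_t rotate_left_31(uint32_t a)
{
    return (a << 31) | (a >> 1);
}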

Speaking to the C side of the problem,
Is my reasoning correct, and is this a legitimate problem in theory?
It is a problem that I had not considered before, but I agree with your analysis. C defines the behavior of the << operator in terms of the type of the promoted left operand, and it is conceivable that the integer promotions result in that being (signed) int when the original type of that operand is uint32_t. I don't expect to see that in practice on any modern machine, but I'm all for programming to the actual standard as opposed to my personal expectations.
Is this problem safe to ignore because on every platform the next integer type is double the width?
C does not require such a relationship between integer types, though it is ubiquitous in practice. If you are determined to rely only on the standard, however -- that is, if you are taking pains to write strictly conforming code -- then you cannot rely on such a relationship.
Is it a good idea to defend against this pathological situation by pre-masking the input, like this: b = (a & 1) << 31;? (This will necessarily be correct on every platform, but it could make a speed-critical crypto algorithm slower than necessary.)
The type unsigned long is guaranteed to have at least 32 value bits, and it is not subject to promotion to any other type under the integer promotions. On many common platforms it has exactly the same representation as uint32_t, and may even be the same type. Thus, I would be inclined to write the expression like this:
uint32_t a = (...);
uint32_t b = (unsigned long) a << 31;
Or if you need a only as an intermediate value in the computation of b, then declare it as an unsigned long to begin with.
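For instance, here is a minimal sketch of a rotation helper built on this approach (the name rotl32 is mine, not part of the answer; it assumes 0 < n < 32):

#include <stdint.h>

/* unsigned long has at least 32 value bits and is never promoted, so the
   left shift is well-defined; converting the result back to uint32_t
   discards any bits above the low 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (uint32_t)(((unsigned long)x << n) | (x >> (32 - n)));
}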

Q1: Yes, masking before the shift does prevent the undefined behavior the OP is concerned about.
Q2: "... because on every platform the next integer type is double the width?" --> No. The "next" integer type could be less than twice the width, or even the same size.
The following is well defined for all compliant C compilers that have uint32_t.
uint32_t a;
uint32_t b = (a & 1) << 31;
Q3: uint32_t a; uint32_t b = (a & 1) << 31; is not expected to produce code that actually performs a mask - the mask is needed only in the source, not in the executable. If a mask does occur, get a better compiler should speed be an issue.
As suggested, it is better to emphasize the unsignedness in these shifts.
uint32_t b = (a & 1U) << 31;
@John Bollinger's good answer details well how to handle OP's specific problem.
The general problem is how to form a number that is at least n bits wide, of a given signedness, and not subject to surprising integer promotions - the core of OP's dilemma. The snippets below accomplish this by invoking an unsigned operation that does not change the value - effectively a no-op except for its effect on the type. The result will be at least as wide as unsigned or uint32_t. Casting, in general, may narrow the type, so casting should be avoided unless narrowing is certain not to occur. An optimizing compiler will not create unnecessary code for these no-ops.
uint32_t a;
uint32_t b = (a + 0u) << 31;
uint32_t b = (a*1u) << 31;
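For example, the idiom can be wrapped in a small helper; this is only a sketch, and the name shl32 is mine:

#include <stdint.h>

/* (a + 0u) yields an unsigned type of rank at least that of int without
   changing the value, so the shift can never happen in a signed type.
   Assumes n < 32. */
static inline uint32_t shl32(uint32_t a, unsigned n)
{
    return (uint32_t)((a + 0u) << n);
}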

Taking a clue from this question about possible UB in uint32 * uint32 arithmetic, the following simple approach should work in C and C++:
uint32_t a = (...);
uint32_t b = (uint32_t)((a + 0u) << 31);
The integer constant 0u has type unsigned int. This promotes the addition a + 0u to uint32_t or unsigned int, whichever is wider. Because the type has rank int or higher, no more promotion occurs, and the shift can be applied with the left operand being uint32_t or unsigned int.
The final cast back to uint32_t will just suppress potential warnings about a narrowing conversion (say if int is 64 bits).
A decent C compiler should be able to see that adding zero is a no-op, which is less onerous than seeing that a pre-mask has no effect after an unsigned shift.

To avoid the unwanted promotion, you may use the wider type via a typedef, as in:
using my_uint_at_least32 = std::conditional_t<(sizeof(std::uint32_t) < sizeof(unsigned)),
                                              unsigned,
                                              std::uint32_t>;
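A possible usage sketch (the function name is mine):

// Shift in the promotion-proof type, then narrow back to std::uint32_t.
std::uint32_t shift31(std::uint32_t a)
{
    return static_cast<std::uint32_t>(static_cast<my_uint_at_least32>(a) << 31);
}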

For this segment of code:
uint32_t a = (...);
uint32_t b = a << 31;
To promote a to an unsigned type instead of a signed type, use:
uint32_t b = a << 31u;
When both sides of the << operator are unsigned types, this line in 6.3.1.8 (C standard draft n1570) applies:
Otherwise, if both operands have signed integer types or both have unsigned integer types, the operand with the type of lesser integer conversion rank is converted to the type of the operand with greater rank.
The problem you are describing is caused by using 31, which has signed int type, so another line in 6.3.1.8
Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, then the operand with unsigned integer type is converted to the type of the operand with signed integer type.
forces a to be promoted to a signed type.
Update:
This answer is not correct, because of 6.3.1.1(2) (emphasis mine):
... If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions.58) All other types are unchanged by the integer promotions.
and footnote 58 (emphasis mine):
58) The integer promotions are applied only: as part of the usual arithmetic conversions, to certain argument expressions, to the operands of the unary +, -, and ~ operators, and to both operands of the shift operators, as specified by their respective subclauses.
Since only the integer promotions happen here, and not the usual arithmetic conversions, using 31u does not guarantee that a is converted to unsigned int, as stated above.
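To illustrate the point, a sketch (assuming a 32-bit int):

#include <stdio.h>

int main(void)
{
    unsigned short a = 1;
    /* The result type of << is the promoted type of the LEFT operand
       only; the 'u' suffix on the shift count changes nothing about a. */
    printf("%d\n", a << 15);    /* a promoted to int by the integer promotions */
    printf("%d\n", a << 15u);   /* identical promotion; only the count's type differs */
    return 0;
}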

Related

What's the difference between casting a long to int versus using a bitwise AND in order to get the 4 least significant bytes?

I know that in order to get the 4 least significant bytes of a number of type long I can cast it to int/unsigned int or use a bitwise AND (& 0xFFFFFFFF).
This code produces the following output:
#include <stdio.h>

int main()
{
    long n = 0x8899AABBCCDDEEFF;
    printf("0x%016lX\n", n);
    printf("0x%016X\n", (int)n);
    printf("0x%016X\n", (unsigned int)n);
    printf("0x%016lX\n", n & 0xFFFFFFFF);
}
Output:
0x8899AABBCCDDEEFF
0x00000000CCDDEEFF
0x00000000CCDDEEFF
0x00000000CCDDEEFF
Does that mean that the two methods used are equivalent? If so, do they always produce the same output regardless of the platform/compiler?
Also, is there any catch or pitfall while casting to unsigned int rather than int for the purpose of this question?
Finally, why is the output the same if you change the number n to be an unsigned long instead?
The methods are definitely different.
According to the integral conversion rules (cf., for example, this online C++11 standard), a conversion (e.g. through an explicit cast) from one integral type to another depends on whether the destination type is signed or unsigned. If the destination type is unsigned, one can rely on "modulo 2^n" truncation, whereas with signed destination types one can run into implementation-defined behaviour:
4.7 Integral conversions [conv.integral]
2 If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type). [ Note: In a two's complement representation, this conversion is conceptual and there is no change in the bit pattern (if there is no truncation). — end note ]
3 If the destination type is signed, the value is unchanged if it can be represented in the destination type (and bit-field width); otherwise, the value is implementation-defined.
For your first question: as others have pointed out, the sizes of int and long are platform-dependent, so the methods are not equivalent. In C, data types are only guaranteed to be "at least XX bits in size".
For the second question, it comes down to this: long and int are signed, meaning that one bit is reserved for the sign (take a look also at two's complement). If you were the compiler, what could you do with negative values (especially the long ones)? As Stephan Lechner mentioned, this is implementation-defined (that is, up to the compiler).
Finally, in the spirit of "your code must do what it says it does", the best thing to do if you need masks is to use masks (and, if you use masks, use unsigned types). Don't try to use clever tricks. Believe me, they always bite you in the rear. I've dealt with enough legacy code to know that by heart.
What's the difference between casting a long to int versus using a bitwise AND in order to get the 4 least significant bytes?
Type. Casting makes the value an int. And'ing does not change the type.
Range. Depending on the ranges of int and long, a cast may not change the value at all.
IDB and UB. Implementation-defined behavior and undefined behavior are both in play when mixing signedness.
To "get" the 4 LSBytes, use & 0xFFFFFFFFu or cast to uint32_t.
OP's question is unnecessarily convoluted.
long n = 0x8899AABBCCDDEEFF; --> Converting a value outside the range of a signed integer type is implementation-defined.
Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised. C11 §6.3.1.3 3
printf("0x%016lX\n", n); --> Printing a long with "%lX" outside the common range of long/unsigned long is undefined behavior.
Let's go forward with unsigned long:
unsigned long n = 0x8899AABBCCDDEEFF; // no problem,
printf("0x%016lX\n", n); // no problem,
printf("0x%016X\n", (int)n); // problem, C11 6.3.1.3 3
printf("0x%016X\n", (unsigned int)n); // no problem,
printf("0x%016lX\n", n & 0xFFFFFFFF); // no problem,
The "no problem" lines are OK whether unsigned long is 32-bit or 64-bit. The output will differ, yet it is OK.
Recall that int,long are not always 32,64 bit. (16,32), (32,32), (32,64) are common.
int is at least 16 bit.
long is at least that of int and at least 32 bit.
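Putting that together, here is a sketch of the program with the implementation-defined and undefined parts removed, relying only on what the standard guarantees about unsigned long:

#include <stdio.h>

int main(void)
{
    unsigned long n = 0x8899AABBCCDDEEFFu;  /* unsigned: conversion is well-defined */
    printf("0x%016lX\n", n);                /* matching format and type */
    printf("0x%08X\n", (unsigned int)n);    /* well-defined modulo truncation */
    printf("0x%016lX\n", n & 0xFFFFFFFFu);  /* the mask keeps the type unsigned long */
    return 0;
}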

When does implicit type conversion occur in C++?

Let's say we have the following code, meant to shift the bits of a to the left by i and replace the i LSBs of a with the i MSBs of b
unsigned short a;
unsigned short b;
auto i = 4;
a = (a << i) | (b >> (16 - i));
This works as expected. Now, let's consider the situation where we want to be able to do this for pairs of different types. i.e. we want to shift the bits of a to the left by i and replace the i LSBs of a with the i MSBs of b, except now we have no guarantee on the size of a and b, only that they are unsigned. Let's consider the case where the size of a is less than that of b
unsigned short a;
unsigned int b;
auto i = 4;
a = (a << i) | (b >> (32 - i));
I'm worried about "overshifting" a. Depending on when the promotion of a to unsigned int occurs, a given value of i may or may not be an overshift. Consider i = 24: this would cause undefined behavior if the type conversion happens after the shift, but not if it happens before. Thus, my question is: when does the type conversion occur in this case? I suppose I could explicitly cast a to the larger type, but I'd like to avoid that if possible.
Here's the relevant quote from the standard, [expr.shift/1]:
[...] The operands shall be of integral or unscoped enumeration type and integral promotions are performed. The type of the result is that of the promoted left operand. [...]
This means in your case that a will be promoted to int (or unsigned int) before doing the shift.
Note that with current 32/64-bit compilers, the promoted type is int, not unsigned int. The promoted type is unsigned int only if int cannot represent the whole range of unsigned short (for example, on old 16-bit compilers, where sizeof(int) is 2 and sizeof(short) == sizeof(int)).
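If you do want to be explicit anyway, here is a minimal sketch (assuming a 32-bit unsigned int and 0 < i < 32; the function name is mine) that pins down the width before the shift:

// Force the shift to happen at unsigned int width, so the result cannot
// depend on whether `a` was promoted to int or to unsigned int.
unsigned short merge_top_bits(unsigned short a, unsigned int b, unsigned i)
{
    return static_cast<unsigned short>(
        (static_cast<unsigned int>(a) << i) | (b >> (32 - i)));
}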

Why does this if condition fail for comparison of negative and positive integers [duplicate]

This question already has answers here:
sizeof() operator in if-statement
(5 answers)
Closed 4 years ago.
#include <stdio.h>

int arr[] = {1,2,3,4,5,6,7,8};
#define SIZE (sizeof(arr)/sizeof(int))

int main()
{
    printf("SIZE = %d\n", SIZE);
    if ((-1) < SIZE)
        printf("less");
    else
        printf("more");
}
The output after compiling with gcc is "more". Why does the if condition fail even though -1 < 8?
The problem is in your comparison:
if ((-1) < SIZE)
The result of sizeof has type size_t, an unsigned type (typically unsigned long), so SIZE will be unsigned, whereas -1 is just an int. The rules for promotion in C and related languages mean that -1 will be converted to size_t before the comparison, so -1 will become a very large positive value (the maximum value of the unsigned type).
One way to fix this is to change the comparison to:
if (-1 < (long long)SIZE)
although it's actually a pointless comparison, since an unsigned value will always be >= 0 by definition, and the compiler may well warn you about this.
As subsequently noted by @Nobilis, you should always enable compiler warnings and take notice of them: if you had compiled with e.g. gcc -Wall ..., the compiler would have warned you of your bug.
TL;DR
Be careful with mixed signed/unsigned operations (use the -Wall compiler warnings). The Standard has a long section about it. In particular, it is often but not always true that the signed operand is value-converted to unsigned (although it is in your particular example). See the explanation below (taken from this Q&A).
Relevant quote from the C++ Standard:
5 Expressions [expr]
10 Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:
[2 clauses about equal types or types of equal sign omitted]
— Otherwise, if the operand that has unsigned integer type has rank greater than or equal to the rank of the type of the other operand, the operand with signed integer type shall be converted to the type of the operand with unsigned integer type.
— Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, the operand with unsigned integer type shall be converted to the type of the operand with signed integer type.
— Otherwise, both operands shall be converted to the unsigned integer type corresponding to the type of the operand with signed integer type.
Your actual example
To see into which of the 3 cases your program falls, modify it slightly to this
#include <stdio.h>

int arr[] = {1,2,3,4,5,6,7,8};
#define SIZE (sizeof(arr)/sizeof(int))

int main()
{
    printf("SIZE = %zu, sizeof(-1) = %zu, sizeof(SIZE) = %zu \n", SIZE, sizeof(-1), sizeof(SIZE));
    if ((-1) < SIZE)
        printf("less");
    else
        printf("more");
}
On the Coliru online compiler, this prints 4 and 8 for the sizeof() of -1 and SIZE, respectively, and selects the "more" branch (live example).
The reason is that the unsigned type is of greater rank than the signed type. Hence, clause 1 applies and the signed type is value-converted to the unsigned type (on most implementations, typically by preserving the bit representation, so wrapping around to a very large unsigned number), and the comparison then proceeds to select the "more" branch.
Variations on a theme
Rewriting the condition to if ((long long)(-1) < (unsigned)SIZE) would take the "less" branch (live example).
The reason is that the signed type is of greater rank than the unsigned type and can also accommodate all the unsigned values. Hence, clause 2 applies and the unsigned type is converted to the signed type, and the comparison then proceeds to select the "less" branch.
Of course, you would never write such a contrived if() statement with explicit casts, but the same effect could happen if you compare variables of types long long and unsigned. So it illustrates the point that mixed signed/unsigned arithmetic is very subtle and depends on the relative sizes ("rank" in the words of the Standard). In particular, there is no fixed rule saying that signed will always be converted to unsigned.
When you do a comparison between signed and unsigned where the unsigned type has at least the rank of the signed type (see TemplateRex's answer for the exact rules), the signed value is converted to the type of the unsigned one.
With regard to your case, on a 32-bit machine the bit pattern of -1, reinterpreted as unsigned, is 4294967295. So in effect you are comparing whether 4294967295 is smaller than 8 (it isn't).
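A stripped-down sketch of what is going on (assuming 32-bit unsigned):

#include <stdio.h>

int main(void)
{
    unsigned size = 8;
    /* -1 is converted to unsigned before the comparison: */
    printf("%u\n", (unsigned)-1);                  /* 4294967295 */
    printf("%s\n", (-1 < size) ? "less" : "more"); /* prints "more" */
    return 0;
}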
If you had enabled warnings, you would have been warned by the compiler that something fishy is going on:
warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
Since the discussion has shifted a bit on how appropriate the use of unsigned is, let me put a quote by James Gosling with regards to the lack of unsigned types in Java (and I will shamelessly link to another post of mine on the subject):
Gosling: For me as a language designer, which I don't really count myself as these days, what "simple" really ended up meaning was could I expect J. Random Developer to hold the spec in his head. That definition says that, for instance, Java isn't -- and in fact a lot of these languages end up with a lot of corner cases, things that nobody really understands. Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple. The libraries you have to look up.
This is an historical design bug of C that was also repeated in C++.
It dates back to 16-bit computers, and the error was deciding to use all 16 bits to represent sizes up to 65535, giving up the possibility of representing negative sizes.
This in itself wouldn't have been an error if unsigned meant "non-negative integer" (a size cannot logically be negative), but it is a problem in combination with the conversion rules of the language.
Given the conversion rules of the language the unsigned type in C doesn't represent a non-negative number, but it's instead more like a bitmask (the mathematical term is actually "a member of the ℤ/n ring"). To see why consider that for the C and C++ language
unsigned - unsigned gives an unsigned result
signed + unsigned gives an unsigned result
both of them clearly make no sense at all if you read unsigned as "non-negative number".
Of course saying that the size of an object is a member of ℤ/n ring doesn't make any sense at all and here it's where the error resides.
Practical implications:
Every time you deal with the size of an object be careful because the value is unsigned and that type in C/C++ has a lot of properties that are illogical for a number. Please always remember that unsigned doesn't mean "non-negative integer" but "member of ℤ/n algebraic ring" and that, most dangerous, in case of a mixed operation an int is converted to unsigned int and not the opposite.
For example:
void drawPolyline(const std::vector<P2d>& pts) {
    for (int i=0; i<pts.size()-1; i++) {
        drawLine(pts[i], pts[i+1]);
    }
}
is buggy, because if passed an empty vector of points it will do illegal (UB) operations. The reason is that pts.size() is an unsigned.
The rules of the language will convert 1 (an integer) to 1{mod n}, will perform the subtraction in ℤ/n resulting in (size-1){mod n}, will convert i also to a {mod n} representation and will do the comparison in ℤ/n.
C/C++ actually defines a < operator in ℤ/n (rarely done in math) and you will end up accessing pts[0], pts[1] ... and so on until huge numbers even if the input vector was empty.
A correct loop could be
void drawPolyline(const std::vector<P2d>& pts) {
    for (int i=1; i<pts.size(); i++) {
        drawLine(pts[i-1], pts[i]);
    }
}
but I normally prefer
void drawPolyline(const std::vector<P2d>& pts) {
    for (int i=0,n=pts.size(); i<n-1; i++) {
        drawLine(pts[i], pts[i+1]);
    }
}
in other words getting rid of unsigned as soon as possible, and just working with regular ints.
Never use unsigned to represent the size of containers or counters, because unsigned means "member of ℤ/n" and the size of a container is not one of those things. Unsigned types are useful, but NOT for representing the size of objects.
The standard C/C++ library unfortunately made this wrong choice, and it's too late to fix it. You are not forced to do the same mistake however.
In the words of Bjarne Stroustrup:
Using an unsigned instead of an int to gain one more bit to represent positive integers is almost never a good idea. Attempts to ensure that some values are positive by declaring variables unsigned will typically be defeated by the implicit conversion rules.
Well, I'm not going to repeat the strong words Paul R said, but when you compare unsigned and signed integers you are going to experience some bad things.
Do
if ((-1) < (int)SIZE)
instead of your if condition, i.e. convert the unsigned value returned from the sizeof operator to signed.
When you compare an unsigned number and a signed number, the compiler implicitly converts the signed one to unsigned.
The signed representation of -1 in a 4-byte int is 11111111 11111111 11111111 11111111; converted to unsigned, this bit pattern represents 2^32-1.
So basically you are testing whether 2^32-1 > SIZE, which is true (hence the < comparison is false).
You have to override that by explicitly casting the unsigned value to signed. Since the sizeof operator returns size_t, an unsigned type, you can cast it to signed long long:
if ((-1) < (signed long long)SIZE)
Use this if condition in your code.

How does casting to "signed int" and back to "signed short" work for values larger than 32,767?

Code:
typedef signed short SIGNED_SHORT; //16 bit
typedef signed int SIGNED_INT; //32 bit
SIGNED_SHORT x;
x = (SIGNED_SHORT)(SIGNED_INT) 45512; //or any value over 32,767
Here is what I know:
Signed 16 bits:
Signed: From −32,768 to 32,767
Unsigned: From 0 to 65,535
Don't expect 45512 to fit into x as x is declared a 16 bit signed integer.
How and what does the double casting above do?
Thank You!
typedef signed short SIGNED_SHORT; //16 bit
typedef signed int SIGNED_INT; //32 bit
These typedefs are not particularly useful. A typedef does nothing more than provide a new name for an existing type. Type signed short already has a perfectly good name: "signed short"; calling it SIGNED_SHORT as well doesn't buy you anything. (It would make sense if it abstracted away some information about the type, or if the type were likely to change -- but using the name SIGNED_SHORT for a type other than signed short would be extremely confusing.)
Note also that short and int are both guaranteed to be at least 16 bits wide, and int is at least as wide as short, but different sizes are possible. For example, a compiler could make both short and int 16 bits -- or 64 bits for that matter. But I'll assume the sizes for your compiler are as you state.
In addition, signed short and short are names for the same type, as are signed int and int.
SIGNED_SHORT x;
x = (SIGNED_SHORT)(SIGNED_INT) 45512; //or any value over 32,767
A cast specifies a conversion to a specified type. Two casts specify two such conversions. The value 45512 is converted to signed int, and then to signed short.
The constant 45512 is already of type int (another name for signed int), so the innermost cast is fairly pointless. (Note that if int is only 16 bits, then 45512 will be of type long.)
When you assign a value of one numeric type to an object of another numeric type, the value is implicitly converted to the object's type, so the outermost cast is also redundant.
So the above code snippet is exactly equivalent to:
short x = 45512;
Given the ranges of int and short on your system, the mathematical value 45512 cannot be represented in type short. The language rules state that the result of such a conversion is implementation-defined, which means that it's up to each implementation to determine what the result is, and it must document that choice, but different implementations can do it differently. (Actually that's not quite the whole story; the 1999 ISO C standard added permission for such a conversion to raise an implementation-defined signal. I don't know of any compiler that does this.)
The most common semantics for this kind of conversion is that the result gets the low-order bits of the source value. This will probably result in the value -20024 being assigned to x. But you shouldn't depend on that if you want your program to be maximally portable.
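For what it's worth, here is a sketch of that usual outcome, assuming a 16-bit short with wraparound conversion semantics:

#include <stdio.h>

int main(void)
{
    /* 45512 mod 2^16 = 45512; reinterpreting that bit pattern as a
       signed 16-bit value gives 45512 - 65536 = -20024. */
    short x = (short)45512;   /* implementation-defined conversion */
    printf("%d\n", x);        /* commonly prints -20024 */
    return 0;
}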
When you cast twice, the casts are applied in sequence.
int a = 45512;
int b = (int) a;
short x = (short) b;
Since 45512 does not fit in a short on most (but not all!) platforms, the cast overflows on those platforms. This will either raise an implementation-defined signal or result in an implementation-defined value.
In practice, many platforms define the result as the truncated value, which is -20024 in this case. However, there are platforms which raise a signal, which will probably terminate your program if uncaught.
Citation: n1525 §6.3.1.3
Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
The double casting is equivalent to:
short x = static_cast<short>(static_cast<int>(45512));
which is equivalent to:
short x = 45512;
which will likely wrap around so x equals -20024, but technically it's implementation defined behavior if a short has a maximum value less than 45512 on your platform. The literal 45512 is of type int.
You can assume it performs two type conversions (although signed int and int are the same type in C, so the inner cast does nothing).
If SIGNED_SHORT is too small to handle 45512, the result is either implementation-defined or an implementation-defined signal is raised. (In C++ only the former applies.)

How do promotion rules work when the signedness on either side of a binary operator differ? [duplicate]

This question already has answers here:
Implicit type conversion rules in C++ operators
(9 answers)
Closed 4 years ago.
Consider the following programs:
// http://ideone.com/4I0dT
#include <limits>
#include <iostream>

int main()
{
    int max = std::numeric_limits<int>::max();
    unsigned int one = 1;
    unsigned int result = max + one;
    std::cout << result;
}
and
// http://ideone.com/UBuFZ
#include <limits>
#include <iostream>

int main()
{
    unsigned int us = 42;
    int neg = -43;
    int result = us + neg;
    std::cout << result;
}
How does the + operator "know" which is the correct type to return? The general rule is to convert all of the arguments to the widest type, but here there's no clear "winner" between int and unsigned int. In the first case, unsigned int must be being chosen as the result of operator+, because I get a result of 2147483648. In the second case, it must be choosing int, because I get a result of -1. Yet I don't see in the general case how this is decidable. Is this undefined behavior I'm seeing or something else?
This is outlined explicitly in §5/9:
Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:
If either operand is of type long double, the other shall be converted to long double.
Otherwise, if either operand is double, the other shall be converted to double.
Otherwise, if either operand is float, the other shall be converted to float.
Otherwise, the integral promotions shall be performed on both operands.
Then, if either operand is unsigned long the other shall be converted to unsigned long.
Otherwise, if one operand is a long int and the other unsigned int, then if a long int can represent all the values of an unsigned int, the unsigned int shall be converted to a long int; otherwise both operands shall be converted to unsigned long int.
Otherwise, if either operand is long, the other shall be converted to long.
Otherwise, if either operand is unsigned, the other shall be converted to unsigned.
[Note: otherwise, the only remaining case is that both operands are int]
In both of your scenarios, the result of operator+ is unsigned. Consequently, the second scenario is effectively:
int result = static_cast<int>(us + static_cast<unsigned>(neg));
Because in this case the value of us + neg is not representable by int, the value of result is implementation-defined – §4.7/3:
If the destination type is signed, the value is unchanged if it can be represented in the destination type (and bit-field width); otherwise, the value is implementation-defined.
Before C was standardized, there were differences between compilers -- some followed "value preserving" rules, and others "sign preserving" rules. Sign preserving meant that if either operand was unsigned, the result was unsigned. This was simple, but at times gave rather surprising results (especially when a negative number was converted to an unsigned).
C standardized on the rather more complex "value preserving" rules. Under the value preserving rules, promotion can/does depend on the actual ranges of the types, so you can get different results on different compilers. For example, on most MS-DOS compilers, int is the same size as short and long is different from either. On many current systems int is the same size as long, and short is different from either. With value preserving rules, these can lead to the promoted type being different between the two.
The basic idea of value preserving rules is that it'll promote to a larger signed type if that can represent all the values of the smaller type. For example, a 16-bit unsigned short can be promoted to a 32-bit signed int, because every possible value of unsigned short can be represented as a signed int. The types will be promoted to an unsigned type if and only if that's necessary to preserve the values of the smaller type (e.g., if unsigned short and signed int are both 16 bits, then a signed int can't represent all possible values of unsigned short, so an unsigned short will be promoted to unsigned int).
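A sketch of the promotion in action, assuming a 16-bit unsigned short and a 32-bit int:

#include <stdio.h>

int main(void)
{
    unsigned short us = 65535;
    /* Under the value-preserving rules, us promotes to (signed) int here,
       because a 32-bit int can represent every unsigned short value; with
       a 16-bit int it would promote to unsigned int instead. */
    printf("%d\n", us + 0);   /* prints 65535 on a 32-bit-int platform */
    return 0;
}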
When you assign the result as you have, the result will get converted to the destination type anyway, so most of this makes relatively little difference -- at least in most typical cases, where it'll just copy the bits into the result, and it's up to you to decide whether to interpret that as signed or unsigned.
When you don't assign the result such as in a comparison, things can get pretty ugly though. For example:
unsigned int a = 5;
signed int b = -5;

if (a > b)
    printf("Of course");
else
    printf("What!");
Under sign preserving rules, b would be promoted to unsigned, and in the process become equal to UINT_MAX - 4, so the "What!" leg of the if would be taken. With value preserving rules, you can manage to produce some strange results a bit like this as well, but 1) primarily on the DOS-like systems where int is the same size as short, and 2) it's generally harder to do it anyway.
It's choosing whatever type you put your result into, or at least cout is honoring that type during output.
I don't remember for sure, but I think C++ compilers generate the same arithmetic code for both; it's only comparisons and output that care about the sign.