Is it implementation-defined or undefined behaviour to do things like -1 ^ mask and other bitwise operations like signed.bitOp(unsigned) and signed.bitOp(signed) in C++17?
Before the various bitwise operations in C++17, the two operands undergo the "usual arithmetic conversions" to make them have the same type. Depending on how the two types differ, you can get a signed or unsigned type. Those conversions dictate whether you have well-defined behavior or not.
If the "usual arithmetic conversions" causes a negative number to be converted to an unsigned type, then you trigger [conv.integral]/2, which causes negative numbers to be mapped to "the least unsigned integer congruent to the source integer".
The actual operation is... bitwise. The standard requires that implementations provide some binary representation of signed integers. So a bitwise operation on two signed integers is whatever you get by doing a bitwise operation on that binary representation. Since the actual representation is implementation-defined, the result is allowed to vary based on that representation. However, since the standard requires that non-negative signed values have the same representation as the corresponding values of the unsigned type, bitwise operations have reliable results for positive values stored in signed integers.
The results are not undefined; you will get a value from them. But the results can be different on different implementations.
C++20 standardized 2's complement signed integer representations (since pretty much every C++ compiler already did this), so the results are consistent across implementations.
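As a concrete sketch of both cases from the question (the values here are my own; pre-C++20, the signed result is the one that may vary):

int main() {
    unsigned int mask = 0x0Fu;
    // signed op unsigned: the usual arithmetic conversions turn -1 into
    // UINT_MAX ([conv.integral]/2), so this XOR is entirely unsigned and well-defined
    unsigned int a = -1 ^ mask;
    // signed op signed: valid and yields a value, but that value followed the
    // implementation's signed representation until C++20 (-16 on two's complement)
    int b = -1 ^ static_cast<int>(mask);
    (void)a; (void)b;
}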
Related
The purpose of making signed integer overflow undefined behavior is to permit compiler optimizations. But isn't this an equally valid argument to make unsigned integer overflow undefined behavior as well?
The purpose for keeping signed integer overflow undefined may be compiler optimization¹. But the original reason was that the bit representation of signed integers was not defined by the standard. Different implementations offered different signed integer representations, and their overflow characteristics would be different. This was only permissible because the standard did not define what those overflow characteristics were.
By contrast, unsigned integer bit representations were always well-defined (otherwise, you couldn't effectively do a lot of bitwise operations), and therefore their overflow behavior could also be well-defined.
Arithmetic on unsigned integers of a specific size, under that value representation, works modulo the maximum unsigned value + 1. Therefore, the standard has a way to say what the result of any math operation will be: it's the expected numerical result modulo the max unsigned value + 1.
That is, if you have a 16-bit unsigned integer holding 65535, and you add 1 to it, the numerical result is 65536. However, the result you get in a 16-bit number is 0. This is how the standard defines it, because this is how binary arithmetic works for a specific bit-depth. The representation defines the behavior.
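For instance, a minimal sketch using the fixed-width types from <cstdint> (the cast back to uint16_t stands in for storing the result in a 16-bit object):

#include <cassert>
#include <cstdint>

int main() {
    std::uint16_t x = 65535;               // the maximum 16-bit unsigned value
    x = static_cast<std::uint16_t>(x + 1); // numerically 65536, reduced modulo 65536
    assert(x == 0);
}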
By contrast, different signed integer forms have different overflow characteristics. If the standard defined a specific meaning for 32767 + 1 in a 16-bit signed integer, then if the particular signed integer representation did not naturally provide that answer, the compiler would have to change how it adds those values to produce that answer. For sign/magnitude, this addition results in -0. If the standard made that the required behavior, then every two's complement implementation would be unable to just add numbers. It would have to check for overflow and fudge the results.
On every math operation.
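For example, at the bit level 0x7FFF + 1 is 0x8000 in 16 bits; two's complement reads that pattern as -32768, while sign/magnitude reads it as -0. A standard that mandated either answer would force the other camp to emit extra instructions.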
¹ There are other reasons, such as the fact that most code wouldn't benefit from it being well-defined anyway. That is, most code in which signed integers overflow would be just as broken if the overflow were well-defined.
When I use the >> bitwise operator on (binary) 1000 in C++ it gives this result: 1100. I want the result to be 0100. When the 1 is in any other position this is exactly what happens, but with a leading 1 it goes wrong. Why is that and how can it be avoided?
The behavior you describe is consistent with what happens on some platforms when right-shifting a signed integer with the high bit set (i.e., holding a negative value).
In this case, on many platforms compilers will emit code to perform an arithmetic shift, which propagates the sign bit; on platforms with 2's complement representation for negative integers (= virtually every current platform) this has the effect of giving the x >> i == floor(x / 2^i) behavior even on negative values. Notice that this is not contractual: as far as the C++ standard is concerned, shifting negative integers is implementation-defined behavior, so any compiler is free to implement different semantics for it.
To come to your question: to obtain the "regular" shift behavior (generally called a "logical shift"), you have to make sure you are working on unsigned integers. This can be achieved either by making sure that the variable you are shifting is of unsigned type (e.g. unsigned int) or, if it's a literal, by putting a U suffix on it (e.g. 1 is an int, 1U is an unsigned int).
If the data you have is of a signed type (e.g. int) you may cast it to the corresponding unsigned type before shifting without risks (conversion from a signed int to an unsigned one is well-defined by the standard, and doesn't change the bit values on 2's complement machines).
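A short sketch of both routes, assuming a 32-bit int/unsigned and a two's complement machine (as hedged above, the signed-shift result is only the typical one, not guaranteed):

#include <cstdio>

int main() {
    int s = -8;                            // bit pattern ...11111000 on two's complement
    std::printf("%d\n", s >> 1);           // implementation-defined pre-C++20; typically -4 (arithmetic shift)
    unsigned u = static_cast<unsigned>(s); // same bits; the conversion itself is well-defined
    std::printf("%u\n", u >> 1);           // logical shift: a 0 comes in from the left
    std::printf("%u\n", 1U << 31 >> 31);   // the U suffix keeps the literal unsigned; prints 1
}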
Historically, this comes from the fact that C strove to support even machines that didn't have "cheap" arithmetic shift functionality at hardware level and/or didn't use 2's complement representation.
As mentioned by others, when right shifting on a signed int, it is implementation defined whether you will get 1s or 0s. In your case, because the left most bit in 1000 is a 1, the "replacement bits" are also 1. Assuming you must work with signed ints, in order to get rid of it, you can apply a bitmask.
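For example, for a shift by one, a sketch assuming the usual arithmetic-shift behavior (the function name is mine):

#include <climits>

int logical_shift_right_1(int x) {
    // an arithmetic shift fills the vacated top bit with a copy of the sign bit;
    // masking with INT_MAX clears exactly that bit, matching a logical shift by 1
    return (x >> 1) & INT_MAX;
}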
Say I want to send 4 byte integer over network. The integer has fixed size, due to using types from stdint.
My question is: does it matter whether I send a signed or an unsigned integer in those 4 bytes (assuming I use the same method to serialize/deserialize the integer to/from bytes, on both the client and the server side)? Can there be any other problems? (I'm not referring to endianness issues here.)
This issue seldom gets the attention it deserves.
As Floris observes, only the bytes of the representation get sent. C and C++ define the bitwise representation* of unsigned numbers, but not signed ones, so sending signed numbers as bytes opens a compatibility gap.
It's easy to "fix" the format for transmission. Casting a signed int to its corresponding unsigned type is guaranteed to generate two's complement representation. But how to convert back? Casting an unsigned integer to its signed counterpart generates signed integer overflow when you want a negative number, which produces an unspecified result — you could get anything.
To be really safe, use a branch:
#include <limits.h> /* INT_MAX */

signed int deserialize_sint( unsigned int nonnegative ) {
    if ( nonnegative <= INT_MAX ) return nonnegative; /* fits as-is */
    else return - (int) ( - nonnegative ); /* only cast an unsigned number <= INT_MAX */
}
With luck, the compiler will see that both cases are the same and eliminate the branch.
The above function is written in C; apologies to the C++ crowd.
If you want to be extra paranoid, you could check - nonnegative <= INT_MAX before performing the cast, because the most negative two's complement number would still overflow on a ones' complement machine. The best you can do for the case of nonnegative == - nonnegative is to return a wider type, or if that's impossible, flag a runtime error.
* Endianness becomes ambiguous when the bits are divided into a byte sequence, though.
Because the standard does not mandate a particular representation for signed types:
3.9.1 Fundamental types [basic.fundamental] Paragraph 7 of n3936
Types bool, char, char16_t, char32_t, wchar_t, and the signed and unsigned integer types are collectively called integral types. A synonym for integral type is integer type. The representations of integral types shall define values by use of a pure binary numeration system. [ Example: this International Standard permits 2’s complement, 1’s complement and signed magnitude representations for integral types. —end example ]
Sending signed integer values in a binary representation is not well defined (unless you explicitly specify this as part of your protocol and do some manual work to make sure you know how to read/write that binary representation).
There are a couple of solutions depending on the exact requirements.
If speed is not the primary concern then you could use an English (substitute the language of your choice) representation and serialize integers to/from text. For a lot of problems this is not a bad solution, as the major speed bump is usually not the serialization cost but network latency, which dominates in most (though not all) situations.
Alternatively, if you need a binary representation (because you timed it and the volume/density of your numbers requires it), the endianness problem is not hard to solve thanks to htonl() and family, which cover the unsigned integral types (at least the 16/32-bit values).
That leaves the representation of signed values to solve. Pick one (use the most common representation among the machines you target, and the translation will then usually be a no-op). If you know the on-the-wire representation (because it is specified in your protocol), then you can translate to/from that representation on machines that do not natively support it, and the cost is usually small (a conditional addition). A sketch of this binary route is shown below.
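A minimal sketch of that binary path (assuming a POSIX system for htonl/ntohl and two's complement as the wire representation; the function names are my own):

#include <cstdint>
#include <cstring>
#include <arpa/inet.h> // htonl/ntohl (POSIX)

void write_i32(unsigned char out[4], std::int32_t value) {
    // signed -> unsigned conversion is well-defined (modulo 2^32),
    // which yields the two's complement bit pattern for negative values
    std::uint32_t wire = htonl(static_cast<std::uint32_t>(value));
    std::memcpy(out, &wire, 4);
}

std::int32_t read_i32(const unsigned char in[4]) {
    std::uint32_t wire;
    std::memcpy(&wire, in, 4);
    // unsigned -> signed is implementation-defined when the value doesn't fit
    // (well-defined modulo 2^32 only since C++20); see the branching trick above
    return static_cast<std::int32_t>(ntohl(wire));
}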
When you send a number over a socket, it's just bytes.
Now if you want to send a negative number, and the representation of negative numbers is different at the receiving end, then you might have a problem. Otherwise, it's just bytes.
So if there is a chance that the binary representation of the negative number would be misunderstood at the receiving end, then you need to do some translating (maybe send a sign byte followed by four magnitude bytes, and put it all together at the other end).
That's quite unlikely though.
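If you did want to guard against it, a sketch of the sign-byte scheme could look like this (the wire format and names here are hypothetical):

#include <cstdint>

// hypothetical wire format: 1 sign byte followed by 4 big-endian magnitude bytes
void encode_i32(std::int32_t n, unsigned char out[5]) {
    // 0u - x negates in unsigned arithmetic, which is safe even for the most negative value
    std::uint32_t mag = n < 0 ? 0u - static_cast<std::uint32_t>(n)
                              : static_cast<std::uint32_t>(n);
    out[0] = n < 0 ? 1 : 0;
    out[1] = static_cast<unsigned char>(mag >> 24);
    out[2] = static_cast<unsigned char>(mag >> 16);
    out[3] = static_cast<unsigned char>(mag >> 8);
    out[4] = static_cast<unsigned char>(mag);
}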
Does the C++ standard mandate that the non-negative range of a standard signed integer type is at least as big as the negative range?
EDIT: Please note that I am referring to the non-negative range here, not the positive range which is obviously one smaller than the non-negative range.
EDIT: If we assume C++11, the answer is "Yes". See my clarifications below. From the point of view of C++03, the answer is probably "No".
The same question can be posed as follows: Does the standard guarantee that the result of a - b is representable in a standard signed integer type T assuming that both a and b are negative values of type T, and that a ≥ b?
I know that the standard allows for two's complement, ones' complement, and sign magnitude representation of negative values (see C++11 section 3.9.1 [basic.fundamental] paragraph 7), but I am not sure whether it demands the use of one of those three representations. Probably not.
If we assume one of these three representations, and we assume that there are no "spurious" restrictions on either of the two ranges (negative and non-negative), then it is indeed true that the non-negative range is at least as big as the negative one. In fact, with two's complement the size of the two ranges will be equal, and with the two other representations, the size of the non-negative range will be one greater than the size of the negative one.
However, even if we assume one of the mentioned representations, it is really not enough to guarantee anything about the size of either range.
What I am seeking here, is a section (or set of sections) that unambiguously provides the desired guarantee.
Any help will be appreciated.
Note that something like the following would suffice: Every bit within the "storage slot" of the integer has one, and only one of the following functions:
Unused
Sign bit (exclusively, or mixed sign/value bit)
Value bit (participating in the value)
I have a vague memory that C99 says something along those lines. Anyone that knows anything about that?
Alright, C99 (with TC3) does provide the necessary guarantees in section 6.2.6.2 "Integer types" paragraph 2:
For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; there shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed type and N in the unsigned type, then M ≤ N). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:
the corresponding value with sign bit 0 is negated (sign and magnitude);
the sign bit has the value −(2^N) (two's complement);
the sign bit has the value −(2^N − 1) (ones' complement).
Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones’ complement), is a trap representation or a normal value. In the case of sign and magnitude and ones’ complement, if this representation is a normal value it is called a negative zero.
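As a worked illustration of those three rules (my own example, not standard text): take an 8-bit type with one sign bit and seven value bits, and the pattern 10000101 (sign bit 1, value bits worth 5). Sign and magnitude reads it as -5; two's complement reads it as 5 − 2^7 = -123; ones' complement reads it as 5 − (2^7 − 1) = -122.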
Can someone confirm that this part of C99 is also a binding part of C++11?
I have taken another careful look at both the C99 and the C++11 standards, and it is clear that the guarantees in C99 section 6.2.6.2 paragraph 2 are binding in C++11 too.
C89/C90 does not provide the same guarantees, so we do need C99, which means that we do need C++11.
In summary, C++11 (and C99) provides the following guarantees:
Negative values in fundamental signed integers types (standard + extended) must be represented using one of the following three representations: Two's complement, ones' complement, or sign magnitude.
The size of the non-negative range is one greater than, or equal to the size of the negative range for all fundamental signed integers types (standard + extended).
The second guarantee can be restated as follows:
-1 ≤ min<T> + max<T> ≤ 0
for any fundamental signed integers type T (standard + extended) where min<T> and max<T> are shorthands for std::numeric_limits<T>::min() and std::numeric_limits<T>::max() respectively.
Also, if we assume that a and b are values of the same, or of different fundamental signed integer types (standard or extended), then it follows that a - b is well defined and representable in decltype(a - b) as long as a and b are either both negative or both non-negative.
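That restated guarantee can even be checked mechanically; a small C++11 sketch (the helper name is mine), relying on the fact that min<T> + max<T> itself cannot overflow:

#include <limits>

template <class T>
constexpr bool range_guarantee() {
    // min() + max() is -1 under two's complement and 0 under the other two
    // representations, so the sum is always representable
    return -1 <= std::numeric_limits<T>::min() + std::numeric_limits<T>::max()
        && std::numeric_limits<T>::min() + std::numeric_limits<T>::max() <= 0;
}

static_assert(range_guarantee<signed char>(), "");
static_assert(range_guarantee<int>(), "");
static_assert(range_guarantee<long long>(), "");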
The standard does seem to not mandate such a thing although I may be missing key passages. All we know about fundamental signed integral types is in 3.9.1/2:
There are five standard signed integer types : “signed char”, “short
int”, “int”, “long int”, and “long long int”. In this list, each type
provides at least as much storage as those preceding it in the list.
And in 3.9.1/7:
Types bool, char, char16_t, char32_t, wchar_t, and the signed and
unsigned integer types are collectively called integral types.48 A
synonym for integral type is integer type. The representations of
integral types shall define values by use of a pure binary numeration
system.
Neither of these passages seems to say anything about the respective positive and negative ranges. Even so, I can't conceive of a binary representation that wouldn't meet your needs.
Unsigned integer overflow is well defined by both the C and C++ standards. For example, the C99 standard (§6.2.5/9) states
A computation involving unsigned operands can never overflow,
because a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type.
However, both standards state that signed integer overflow is undefined behavior. Again, from the C99 standard (§3.4.3/1)
An example of undefined behavior is the behavior on integer overflow
Is there an historical or (even better!) a technical reason for this discrepancy?
The historical reason is that most C implementations (compilers) just used whatever overflow behaviour was easiest to implement with the integer representation it used. C implementations usually used the same representation used by the CPU - so the overflow behavior followed from the integer representation used by the CPU.
In practice, it is only the representations for signed values that may differ according to the implementation: one's complement, two's complement, sign-magnitude. For an unsigned type there is no reason for the standard to allow variation because there is only one obvious binary representation (the standard only allows binary representation).
Relevant quotes:
C99 6.2.6.1:3:
Values stored in unsigned bit-fields and objects of type unsigned char shall be represented using a pure binary notation.
C99 6.2.6.2:2:
If the sign bit is one, the value shall be modified in one of the following ways:
— the corresponding value with sign bit 0 is negated (sign and magnitude);
— the sign bit has the value −(2^N) (two's complement);
— the sign bit has the value −(2^N − 1) (one's complement).
Nowadays, all processors use two's complement representation, but signed arithmetic overflow remains undefined and compiler makers want it to remain undefined because they use this undefinedness to help with optimization. See for instance this blog post by Ian Lance Taylor or this complaint by Agner Fog, and the answers to his bug report.
Aside from Pascal's good answer (which I'm sure is the main motivation), it is also possible that some processors cause an exception on signed integer overflow, which of course would cause problems if the compiler had to "arrange for another behaviour" (e.g. use extra instructions to check for potential overflow and calculate differently in that case).
It is also worth noting that "undefined behaviour" doesn't mean "doesn't work". It means that the implementation is allowed to do whatever it likes in that situation. This includes doing "the right thing" as well as "calling the police" or "crashing". Most compilers, when possible, will choose "do the right thing", assuming that is relatively easy to define (in this case, it is). However, if you are having overflows in the calculations, it is important to understand what that actually results in, and that the compiler MAY do something other than what you expect (and that this may vary depending on compiler version, optimisation settings, etc.).
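A classic illustration of the compiler exploiting that freedom (a sketch; whether the fold actually happens depends on the compiler and optimisation level):

// a compiler may assume signed overflow never happens, so this test
// can legally be folded to 'return false;'
bool wraps_signed(int x) { return x + 1 < x; }

// unsigned wrap-around is defined, so this must be evaluated honestly;
// it is true exactly when x == UINT_MAX
bool wraps_unsigned(unsigned x) { return x + 1 < x; }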
First of all, please note that C11 3.4.3, like all examples and footnotes, is not normative text and is therefore not relevant to cite!
The relevant text that states that overflow of integers and floats is undefined behavior is this:
C11 6.5/5
If an exceptional condition occurs during the evaluation of an
expression (that is, if the result is not mathematically defined or
not in the range of representable values for its type), the behavior
is undefined.
A clarification regarding the behavior of unsigned integer types specifically can be found here:
C11 6.2.5/9
The range of nonnegative values of a signed integer type is a subrange
of the corresponding unsigned integer type, and the representation of
the same value in each type is the same. A computation involving
unsigned operands can never overflow, because a result that cannot be
represented by the resulting unsigned integer type is reduced modulo
the number that is one greater than the largest value that can be
represented by the resulting type.
This makes unsigned integer types a special case.
Also note that there is an exception if any type is converted to a signed type and the old value can no longer be represented. The behavior is then merely implementation-defined, although a signal may be raised.
C11 6.3.1.3
6.3.1.3 Signed and unsigned integers
When a value with integer
type is converted to another integer type other than _Bool, if the
value can be represented by the new type, it is unchanged.
Otherwise, if the new type is unsigned, the value is converted by
repeatedly adding or subtracting one more than the maximum value that
can be represented in the new type until the value is in the range of
the new type.
Otherwise, the new type is signed and the value
cannot be represented in it; either the result is
implementation-defined or an implementation-defined signal is raised.
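Those three cases can be seen side by side in a short sketch (the last result shown is only what common two's complement implementations produce, not something the standard promises):

int main() {
    unsigned char u = 200;
    int a = u;                   // case 1: the value fits the new type, so it is unchanged (a == 200)
    unsigned char b = -56;       // case 2: target is unsigned, 256 is added once (b == 200)
    unsigned int big = 4294967096u;
    int c = big;                 // case 3: target is signed and the value doesn't fit:
                                 // implementation-defined (commonly -200 on two's complement)
    (void)a; (void)b; (void)c;
}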
In addition to the other issues mentioned, having unsigned math wrap makes the unsigned integer types behave as abstract algebraic groups (meaning that, among other things, for any pair of values X and Y, there will exist some other value Z such that X+Z will, if properly cast, equal Y and Y-Z will, if properly cast, equal X). If unsigned values were merely storage-location types and not intermediate-expression types (e.g. if there were no unsigned equivalent of the largest integer type, and arithmetic operations on unsigned types behaved as though they were first converted to larger signed types), then there wouldn't be as much need for defined wrapping behavior; but it's difficult to do calculations in a type which doesn't have, e.g., an additive inverse.
This helps in situations where wrap-around behavior is actually useful - for example with TCP sequence numbers or certain algorithms, such as hash calculation. It may also help in situations where it's necessary to detect overflow, since performing calculations and checking whether they overflowed is often easier than checking in advance whether they would overflow, especially if the calculations involve the largest available integer type.
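The post-hoc check mentioned here is a one-liner precisely because wrap-around is defined; a sketch (the function name is mine):

#include <climits>

// returns true and stores a+b in *out, or returns false on overflow;
// the wrapped sum itself is always well-defined for unsigned types
bool checked_add(unsigned a, unsigned b, unsigned* out) {
    unsigned sum = a + b; // wraps modulo UINT_MAX + 1, never undefined behaviour
    *out = sum;
    return sum >= a;      // the addition wrapped if and only if the sum came out smaller than a
}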
Perhaps another reason why unsigned arithmetic is defined is that unsigned numbers form the integers modulo 2^n, where n is the width of the unsigned number. Unsigned numbers are simply integers represented using binary digits instead of decimal digits. Performing the standard operations in a modulus system is well understood.
The OP's quote refers to this fact, but also highlights that there is only one unambiguous, logical way to represent unsigned integers in binary. By contrast, signed numbers are most often represented using two's complement, but other choices are possible, as described in the standard (section 6.2.6.2).
Two's complement representation allows certain operations to make more sense in binary format. E.g., incrementing negative numbers works the same as for positive numbers (except under overflow conditions). Some operations at the machine level can be the same for signed and unsigned numbers. However, when interpreting the result of those operations, some cases don't make sense: positive and negative overflow. Furthermore, the overflow results differ depending on the underlying signed representation.
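For instance (my own illustration): the same 8-bit addition 11111111 + 1 = 00000000 serves both interpretations under two's complement. Read as unsigned it is 255 + 1 wrapping to 0; read as signed it is -1 + 1 = 0. A single hardware adder handles both, and only the overflow interpretation differs.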
The most technical reason of all is simply that trying to capture overflow in an unsigned integer requires more moving parts from you (exception handling) and the processor (exception throwing).
C and C++ won't make you pay for that unless you ask for it by using a signed integer. This isn't a hard-and-fast rule, as you'll see near the end, but it is how they proceed for unsigned integers. In my opinion, this makes signed integers the odd one out, not unsigned, but it's fine that they offer this fundamental difference, as the programmer can still perform well-defined signed operations with overflow. But to do so, you must cast for it.
Because:
unsigned integers have well defined overflow and underflow
casts from signed -> unsigned int are well defined: [uint's name]_MAX + 1 is conceptually added to negative values, to map them to the extended positive number range
casts from unsigned -> signed int: [uint's name]_MAX + 1 is conceptually subtracted from positive values beyond the signed type's max, to map them to negative numbers (note that this direction is only fully well defined since C++20; before that it is implementation-defined, though two's complement implementations all behave this way)
You can always perform arithmetic operations with well-defined overflow and underflow behavior, where signed integers are your starting point, albeit in a round-about way, by casting to unsigned integer first then back once finished.
#include <cstdint>

int32_t x = 10;
int32_t y = -50;
// the subtraction happens in unsigned arithmetic, which wraps and never overflows;
// converting back stores -60 into z (guaranteed since C++20; before that the
// back-conversion is implementation-defined, but two's complement platforms all give -60)
int32_t z = int32_t(uint32_t(y) - uint32_t(x));
Casts between signed and unsigned integer types of the same width are free if the CPU uses 2's complement (nearly all do). If for some reason the platform you're targeting doesn't use 2's complement for signed integers, you will pay a small conversion price when casting between uint32 and int32.
But be wary when using bit widths smaller than int
Usually if you are relying on unsigned overflow, you are using a smaller word width, 8-bit or 16-bit. These will promote to signed int at the drop of a hat (C has absolutely insane implicit integer conversion rules; this is one of C's biggest hidden gotchas). Consider:
#include <cstdio>
int main() {
    unsigned char a = 0;
    unsigned char b = 1;
    printf("%i", a - b); // both promote to (signed) int: outputs -1, not 255 as you'd expect
}
To avoid this, you should always cast to the type you want when you are relying on that type's width, even in the middle of an operation where you think it's unnecessary. This will cast the temporary and get you the signedness AND truncate the value so you get what you expected. It's almost always free to cast, and in fact, your compiler might thank you for doing so as it can then optimize on your intentions more aggressively.
#include <cstdio>
int main() {
    unsigned char a = 0;
    unsigned char b = 1;
    printf("%i", (unsigned char)(a - b)); // the cast wraps -1 back to 255, outputs 255
}