std::uint8_t x = 256; //Implicitly converts to 0
std::uint8_t y = 255;
y++;
For x, I assume everything is handled because 100000000 gets converted to 00000000 using some defined conversion from int to uint8_t. x's memory should be 0 00000000 not 1 00000000.
However with y I believe the overflow stays in memory. y is initially 11111111. After adding 1, it becomes 1 00000000. This wraps around back to 0 because y only looks at the 8 LSB.
Does the 1 after y++; stay in memory, or is it discarded when the addition is done?
If it is there, could it corrupt data before y?
Does arithmetic overflow overwrite data?
The behaviour of signed arithmetic overflow is undefined. It's neither guaranteed to overwrite data, nor guaranteed to not overwrite data.
std::uint8_t y = 255;
y++;
Unsigned overflow is well defined. y will be 0, and there are no other side-effects.
Citation from the C++ standard (latest draft):
[basic.fundamental]
... The range of representable values for the unsigned type is 0 to 2^N − 1 (inclusive); arithmetic for the unsigned type is performed modulo 2^N.
[Note 2: Unsigned arithmetic does not overflow.
Overflow for signed arithmetic yields undefined behavior ([expr.pre]).
— end note]
[expr.pre]
If during the evaluation of an expression, the result is not mathematically defined or not in the range of representable values for its type, the behavior is undefined.
Since unsigned arithmetic is modular, the result can never be outside of representable values.
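As a minimal sketch of that modular behaviour (std::uint8_t, where it exists, is exactly 8 bits):
#include <cstdint>
#include <iostream>

int main() {
    std::uint8_t y = 255;
    ++y;                                      // 255 + 1 == 256, reduced modulo 256
    std::cout << static_cast<int>(y) << '\n'; // prints 0, no other side effects

    std::uint8_t z = 200;
    z += 100;                                 // 300 modulo 256 == 44
    std::cout << static_cast<int>(z) << '\n'; // prints 44
}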
When using gcc with optimizations enabled, unless one uses the -fwrapv compiler option, integer overflow may arbitrarily disrupt program behavior, leading to memory corruption, even in cases where the result of the computation that overflowed would not be used. Reading through the published Rationale, it seems unlikely that the authors of the Standard would have expected a general-purpose compiler for a commonplace platform to behave in such fashion, but the Standard makes no attempt to anticipate and forbid all the gratuitously nonsensical ways implementations might process things.
While studying the behavior of casts in C++, I discovered that reinterpret_cast-ing from float* to int* only works for 0:
float x = 0.0f;
printf("%i\n", *reinterpret_cast<int*>(&x));
prints 0, whereas
float x = 1.0f;
printf("%i\n", *reinterpret_cast<int*>(&x));
prints 1065353216.
Why? I expected the latter to print 1 just like static_cast would.
A reinterpret_cast says to reinterpret the bits of its operand. In your example, the operand, &x, is the address of a float, so it is a float *. reinterpret_cast<int *> asks to reinterpret these bits as an int *, so you get an int *. If pointers to int and pointers to float have the same representation in your C++ implementation, this may¹ work to give you an int * that points to the memory of the float.
However, the reinterpret_cast does not change the bits that are pointed to. They still have their same values. In the case of a float with value zero, the bits used to represent it are all zeros. When you access these through a dereferenced int *, they are read and interpreted as an int. Bits that are all zero represent an int value of zero.
In the case of a float with value one, the bits used to represent it in your C++ implementation are, using hexadecimal to show them, 3f800000₁₆. This is because the exponent field of the format is stored with an offset, so there are some non-zero bits to show the value of the exponent. (That is part of how the floating-point format is encoded. Conceptually, 1.0f is represented as + 1.00000000000000000000000₂ • 2⁰. Then the + sign and the bits after the “1.” are stored literally as zero bits. However, the exponent is stored by adding 127 and storing the result as an eight-bit integer. So an exponent of 0 is stored as 127. The represented value of the exponent is zero, but the bits that represent it are not zero.) When you access these bits through a dereferenced int *, they are read and interpreted as an int. These bits represent an int value of 1065353216 (which equals 3f800000₁₆).
Footnote
¹ The C++ standard does not guarantee this, and what actually happens is dependent on other factors.
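If the goal is merely to inspect the bits of the float, a sketch that avoids the pointer reinterpretation is to copy the object representation into an int of the same size with std::memcpy (assuming sizeof(float) == sizeof(int), which is common but not guaranteed):
#include <cstdio>
#include <cstring>

int main() {
    static_assert(sizeof(float) == sizeof(int), "sketch assumes float and int are the same size");
    float x = 1.0f;
    int bits;
    std::memcpy(&bits, &x, sizeof bits);  // copies the bytes of x; no pointer aliasing involved
    std::printf("%i\n", bits);            // 1065353216 (3f800000 hex) on IEEE 754 implementations
}
C++20 offers std::bit_cast<int>(x) for the same purpose.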
In both cases the behaviour of the program is undefined, because you access an object through a glvalue that doesn't refer to an object of the same or a compatible type.
What you have observed is one possible behaviour. The behaviour could have been different, but there is no guarantee that it would have been, and it wasn't. Whether you expected one result or another has no guaranteed effect on the behaviour.
I expected the latter to print 1 just like static_cast would.
It is unreasonable to expect reinterpret_cast to behave as static_cast would. They are wildly different and one cannot be substituted for the other. Using static_cast to convert the pointers would make the program ill-formed.
reinterpret_cast should not be used unless one knows what it does, and knows that its use is correct. The practical use cases are rare.
Here are a few examples that have well defined behaviour, and are guaranteed to print 1:
int i = x;
printf("%i\n", i);
printf("%i\n", static_cast<int>(x));
printf("%g\n", x);
printf("%.0f\n", x);
Given that we've concluded that behaviour is undefined, there is no need for further analysis.
But we can consider why the behaviour may have happened to be what we observed. It is however important to understand that these considerations will not be useful in controlling what the result will be while the behaviour is undefined.
The binary representations of 32 bit IEEE 754 floating point number for 1.0f and +0.0f happen to be:
0b00111111100000000000000000000000
0b00000000000000000000000000000000
Which also happen to be the binary representations of the integers 1065353216 and 0. Is it a coincidence that the outputs of the programs were these specific integers whose binary representations match the representations of the float values? It could be, in theory, but it probably isn't.
float has a different representation than int, so you cannot treat a float's representation as an int's. That's undefined behaviour in C++.
It so happens that on modern architectures an all-zero bit pattern represents the value 0 for any fundamental type (so one can memset a float, a double, an integer type or a pointer type with zeroes and get a 0-valued object; that is what the calloc function relies on). This is why that cast-and-dereference "works" for the value 0, but it is still undefined behaviour. The C++ standard doesn't require an all-zero bit pattern to represent a 0 floating-point value, nor does it require any specific representation of floating-point numbers.
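As an illustration only (the all-zero result relies on the platform representing 0.0f with all-zero bits, which IEEE 754 does, not on any guarantee from the standard):
#include <cstdio>
#include <cstring>

int main() {
    float f;
    std::memset(&f, 0, sizeof f);  // give f an all-zero object representation
    std::printf("%g\n", f);        // prints 0 on IEEE 754 platforms; the standard does not guarantee it
}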
A conversion of float to int is implicit and no cast is required.
A solution:
float x = 1.0f;
int x2 = x;
printf("%i\n", x2);
// or
printf("%i\n", static_cast<int>(x));
Here is the code:
unsigned int a; // a is indeterminate
unsigned long long b = 1; // b is initialized to 1
std::memcpy(&a, &b, sizeof(unsigned int));
unsigned int c = a; // Is this not undefined behavior? (Implementation-defined behavior?)
Is a guaranteed by the standard to be a determinate value when we access it to initialize c? Cppreference says:
void* memcpy( void* dest, const void* src, std::size_t count );
Copies count bytes from the object pointed to by src to the object pointed to by dest. Both objects are reinterpreted as arrays of unsigned char.
But I don't see anywhere in cppreference that says if an indeterminate value is "copied to" like this, it becomes determinate.
From the standard, it seems it's analogous to this:
unsigned int a; // a is indeterminate
unsigned long long b = 1; // b is initialized to 1
auto* a_ptr = reinterpret_cast<unsigned char*>(&a);
auto* b_ptr = reinterpret_cast<unsigned char*>(&b);
a_ptr[0] = b_ptr[0];
a_ptr[1] = b_ptr[1];
a_ptr[2] = b_ptr[2];
a_ptr[3] = b_ptr[3];
unsigned int c = a; // Is this undefined behavior? (Implementation defined behavior?)
It seems like the standard leaves room for this to be allowed, because the type aliasing rules allow for the object a to be accessed as an unsigned char this way. But I can't find something that says this makes a no longer indeterminate.
Is this not undefined behavior
It's UB, because you're copying into the wrong type. [basic.types]/2 and /3 permit byte copying, but only between objects of the same type. You copied from a long long into an int. That has nothing to do with the value being indeterminate. Even though you're only copying sizeof(int) bytes, the fact that you're not copying from an actual int means that you don't get the protection of those rules.
If you were copying into an object of the same type, then [basic.types]/3 says that it's equivalent to simply assigning them. That is, a "shall subsequently hold the same value as" b.
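A sketch of the case that [basic.types]/3 does bless, where source and destination are the same type:
#include <cassert>
#include <cstring>

int main() {
    unsigned int src = 42u;
    unsigned int dst;                    // indeterminate, but about to be overwritten
    std::memcpy(&dst, &src, sizeof dst); // byte copy between two objects of the same type
    assert(dst == src);                  // per [basic.types]/3, dst now holds the same value as src
}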
TL;DR: It's implementation-defined whether there will be undefined behavior or not. Proof-style, taking the lines of code one at a time:
unsigned int a;
The variable a is assumed to have automatic storage duration. Its lifetime begins (6.6.3/1). Since it is not a class, its lifetime begins with default initialization, in which no other initialization is performed (9.3/7.3).
unsigned long long b = 1ull;
The variable b is assumed to have automatic storage duration. Its lifetime begins (6.6.3/1). Since it is not a class, its lifetime begins with copy-initialization (9.3/15).
std::memcpy(&a, &b, sizeof(unsigned int));
Per 16.2/2, std::memcpy should have the same semantics and preconditions as the C standard library's memcpy. In the C standard 7.21.2.1, assuming sizeof(unsigned int) == 4, 4 characters are copied from the object pointed to by &b into the object pointed to by &a. (These two points are what is missing from other answers.)
At this point, the sizes of unsigned int, unsigned long long, their representations (e.g. endianness), and the size of a character are all implementation defined (to my understanding, see 6.7.1/4 and its note saying that ISO C 5.2.4.2.1 applies). I will assume that the implementation is little-endian, unsigned int is 32 bits, unsigned long long is 64 bits, and a character is 8 bits.
Now that I have said what the implementation is, I know that a has a value-representation for an unsigned int of 1u. Nothing, so far, has been undefined behavior.
unsigned int c = a;
Now we access a. Then, 6.7/4 says that
For trivially copyable types, the value representation is a set of bits in the object representation that determines a value, which is one discrete element of an implementation-defined set of values.
I know now that the value of a is determined by the implementation-defined value bits in a, which I know hold the value-representation for 1u. The value of a is then 1u.
Then, just as for b, the variable c is copy-initialized to 1u.
We made use of implementation-defined values to find what happens. It is possible that the implementation-defined value of 1ull is not one of the implementation-defined set of values for unsigned int. In that case, accessing a will be undefined behavior, because the standard doesn't say what happens when you access a variable with a value-representation that is invalid.
AFAIK, we can take advantage of the fact that most implementations define an unsigned int where any possible bit pattern is a valid value-representation. Therefore, there will be no undefined behavior.
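A sketch of how one might pin down the implementation-defined pieces before relying on the result; the CHAR_BIT/sizeof checks and the runtime endianness probe verify assumptions of this sketch, they are not guarantees from the standard:
#include <climits>
#include <cstring>
#include <iostream>

int main() {
    static_assert(CHAR_BIT == 8 && sizeof(unsigned int) == 4,
                  "sketch assumes 8-bit bytes and a 32-bit unsigned int");

    // Runtime little-endian probe (there is no portable compile-time check before C++20).
    unsigned int probe = 1u;
    unsigned char first_byte;
    std::memcpy(&first_byte, &probe, 1);
    const bool little_endian = (first_byte == 1);

    unsigned int a;            // indeterminate
    unsigned long long b = 1;  // initialized to 1
    std::memcpy(&a, &b, sizeof(unsigned int));

    if (little_endian)
        std::cout << a << '\n';  // 1 on the implementation assumed above
}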
Note: I updated this answer since exploring the issue further in some of the comments has revealed cases where it would be implementation-defined, or even undefined in a case I did not consider originally (specifically in C++17 as well).
I believe that this is either implementation defined behavior in some cases and undefined in others (as another answer came to conclude for similar reasons). In a sense it's implementation defined if it's undefined behavior or implementation defined, so I am not sure if it being undefined in general takes precedence in such a classification.
std::memcpy works entirely on the object representation of the types in question, by aliasing the given pointers as unsigned char as specified by 6.10/8.8 [basic.lval]. If the bits within the bytes in question of the unsigned long long are guaranteed to be something specific, then you can manipulate them however you wish, or write them into the object representation of any other type. The destination type will then use those bits to form its value based on its value representation (whatever that may be), as defined in 6.9/4 [basic.types]:
The object representation of an object of type T is the sequence of N
unsigned char objects taken up by the object of type T, where N equals
sizeof(T). The value representation of an object is the set of bits
that hold the value of type T. For trivially copyable types, the value
representation is a set of bits in the object representation that
determines a value, which is one discrete element of an
implementation-defined set of values.
And that:
The intent is that the memory model of C++ is compatible with that of
ISO/IEC 9899 Programming Language C.
Knowing this, all that matters now is what the object representations of the integer types in question are. According to 6.9.1/7 [basic.fundamental]:
Types bool, char, char16_t, char32_t, wchar_t, and the signed and
unsigned integer types are collectively called integral types. A
synonym for integral type is integer type. The representations of
integral types shall define values by use of a pure binary numeration
system. [Example: This International Standard permits two’s
complement, ones’ complement and signed magnitude representations for
integral types. — end example ]
A footnote does clarify the definition of "binary numeration system" however:
A positional representation for integers that uses the binary digits 0
and 1, in which the values represented by successive bits are
additive, begin with 1, and are multiplied by successive integral
power of 2, except perhaps for the bit with the highest position.
(Adapted from the American National Dictionary for Information
Processing Systems.)
We also know that unsigned integers have the same value representation as signed integers, just under a modulus, according to 6.9.1/4 [basic.fundamental]:
Unsigned integers shall obey the laws of arithmetic modulo 2^n where n
is the number of bits in the value representation of that particular
size of integer.
While this does not say exactly what the value representation must be, based on the given definition of a binary numeration system, successive bits are additive powers of two as expected (rather than being allowed to appear in an arbitrary order), with the exception of a possibly present sign bit. Additionally, since signed and unsigned types share their value representation, this means an unsigned integer will be stored as an increasing binary sequence up until 2^(n-1) (beyond that, things depend on how signed numbers are handled and are implementation defined).
There are still some other considerations, however, such as endianness and how many padding bits may be present, since sizeof(T) measures only the size of the object representation rather than the value representation (as stated before). Since C++17 has no standard way (I think) to check for endianness, this is the main factor that leaves the outcome implementation defined. As for padding bits, they may be present (but, from what I can tell, where they will be is not specified beyond the implication that they will not interrupt the contiguous sequence of bits forming the value representation of an integer), and writing to them can prove potentially problematic. Since the intent is for the C++ memory model to be based on the C99 standard's memory model in a "comparable" way, a footnote from its 6.2.6.2 (which the C++20 standard references in a note as a reminder of that) can be taken, which says as follows:
Some combinations of padding bits might generate trap representations,
for example, if one padding bit is a parity bit. Regardless, no
arithmetic operation on valid values can generate a trap
representation other than as part of an exceptional condition such as
an overflow, and this cannot occur with unsigned types. All other
combinations of padding bits are alternative object representations of
the value specified by the value bits.
This implies that writing directly to padding bits incorrectly could potentially generate a trap representation from what I can tell.
This shows that in some cases depending on if padding bits are present and endianness, the result can be influenced in an implementation-defined manner. If some combination of padding bits is also a trap representation, this may become undefined behavior.
While not possible in C++17, in C++20 one can use std::endian in conjunction with std::has_unique_object_representations<T> (which was present in C++17), or some math with CHAR_BIT, UINT_MAX/ULLONG_MAX and the sizeof of those types, to ensure the expected endianness is correct as well as the absence of padding bits, allowing this to actually produce the expected result in a defined manner, given what was previously established about how integers are said to be stored. Of course C++20 also further refines this and specifies that integers are to be stored as two's complement alone, eliminating further implementation-specific issues.
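A C++20 sketch of those checks; the little-endian requirement and the function name low_word are my own choices for illustration:
#include <bit>
#include <cstring>
#include <type_traits>

static_assert(std::endian::native == std::endian::little,
              "this sketch assumes a little-endian target");
static_assert(std::has_unique_object_representations_v<unsigned int> &&
              std::has_unique_object_representations_v<unsigned long long>,
              "assumes neither type has padding bits");

// Returns the low-order sizeof(unsigned int) bytes of b, interpreted as an unsigned int.
unsigned int low_word(unsigned long long b) {
    unsigned int a;
    std::memcpy(&a, &b, sizeof a);
    return a;
}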
I was reading about undefined behavior, and I'm not sure if it's a compile-time only feature, or if it can occur at execution time.
I understand this example well (this is extracted from the Undefined Behavior page of Wikipedia):
An example for the C language:
int foo(unsigned x)
{
int value = 5;
value += x;
if (value < 5)
bar();
return value;
}
The value of x cannot be negative and, given that signed integer overflow is undefined behavior in C, the compiler can assume that at the line of the if check value >= 5. Thus the if and the call to the function bar can be ignored by the compiler since the if has no side effects and its condition will never be satisfied. The code above is therefore semantically equivalent to:
int foo(unsigned x)
{
int value = 5;
value += x;
return value;
}
But this occurs at compilation-time.
What if I write, for example:
void foo(int x) {
if (x + 150 < 5)
bar();
}
int main() {
int x;
std::cin >> x;
foo(x);
}
and then the user types in INT_MAX - 100 ("2147483547" with 32-bit int).
There will be an integer overflow, but AFAIK, it is the arithmetic logic unit of the CPU that overflows, so the compiler is not involved here.
Is it still undefined behavior?
If yes, how does the compiler detect the overflow?
The best I could imagine is using the overflow flag of the CPU. If this is the case, does it mean that the compiler can do anything it wants if the overflow flag of the CPU is set at any time during execution?
Yes, but not necessarily in the way I think you might have meant it. If the machine code contains an addition, and at runtime that addition wraps (or otherwise overflows, but on most architectures it would wrap), that is not UB by itself. The UB is solely in the domain of C (or C++). That addition may have been adding unsigned integers, or it may be part of some optimization the compiler can make because it knows the semantics of the target platform and can safely use optimizations that rely on wrapping (but you cannot, unless of course you do it with unsigned types).
Of course that does not at all mean that it is safe to use constructs that "wrap only at runtime", because those code paths are poisoned at compile time as well. For example in your example,
extern void bar(void);
void foo(int x) {
if (x + 150 < 5)
bar();
}
is compiled by GCC 6.3 targeting x64 to
foo:
cmp edi, -145
jl .L4
ret
.L4:
jmp bar
which is the equivalent of
void foo(int x) {
if (x < -145)
bar(); // with tail call optimization
}
... which is the same if you assume that signed integer overflow is impossible (in the sense that it puts an implicit precondition on the inputs to be such that overflow will not happen).
Your analysis of the first example is incorrect. value += x; is equivalent to:
value = value + x;
In this case value is int and x is unsigned, so the usual arithmetic conversion means that value is first converted to unsigned, so we have an unsigned addition which by definition cannot overflow (it has well-defined semantics in accordance with modular arithmetic).
When the unsigned result is assigned back to value, if it is larger than INT_MAX then this is an out-of-range assignment which has implementation-defined behaviour. This is NOT overflow because it is assignment, not an arithmetic operation.
Which optimizations are possible therefore depends on how the implementation defines the behaviour of out-of-range assignment for integers. Modern systems all take the value which has the same 2's complement representation, but historically other systems have done some different things.
So the original example does not have undefined behaviour in any circumstance and the suggested optimization is, for most systems, not possible.
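A small sketch of that conversion chain (the values are chosen so that the final assignment stays in range and everything is well defined):
#include <iostream>
#include <limits>

int main() {
    int value = 5;
    unsigned x = std::numeric_limits<unsigned>::max();  // UINT_MAX

    // In value + x, value is converted to unsigned, so the addition is modular:
    // 5u + UINT_MAX wraps to 4u. No overflow, no undefined behaviour.
    unsigned sum = value + x;
    std::cout << sum << '\n';  // 4

    value += x;                // assigning 4u back to int: 4 is in range, so value == 4
    std::cout << value << '\n';
}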
Your second example has nothing to do with your first example since it does not involve any unsigned arithmetic. If x > INT_MAX - 150 then the expression x + 150 causes undefined behaviour due to signed integer overflow. The language definition does not mention ALUs or CPUs so we can be certain that those things are not related to whether or not the behaviour is undefined.
If yes, how does the compiler detect the overflow?
It doesn't have to. Precisely because the behaviour is undefined, it means the compiler is not constrained by having to worry about what happens when there is overflow. It only has to emit an executable that exemplifies the behaviour for the cases which are defined.
In this program those are the inputs in the range [INT_MIN, INT_MAX-150] and so the compiler can transform the comparison to x < -145 because that has the same behaviour for all inputs in the well-defined range, and it doesn't matter about the undefined cases.
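As a hedged sketch, the same predicate can be written so that no signed overflow is possible for any input, which removes the undefined cases rather than relying on how the compiler treats them:
extern void bar(void);

// x + 150 < 5 is algebraically the same as x < 5 - 150, but the rewritten
// form performs no addition on x and is therefore defined for every int x.
void foo(int x) {
    if (x < 5 - 150)  // i.e. x < -145
        bar();
}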
#include <cstdint>
#include <iostream>
int main() {
uint32_t i = -64;
int32_t j = i;
std::cout << j;
return 0;
}
Most compilers I've tried will create programs that output -64, but is this defined behaviour?
Is the assignment of a signed integer to an unsigned integer uint32_t i = -64; defined behaviour?
Is the signed integer assignment int32_t j = i;, when i equals 4294967232, defined behaviour?
For unsigned integer out-of-range conversion, the result is defined; for signed integers, it's implementation-defined.
C++11 (ISO/IEC 14882:2011) §4.7 Integral conversions [conv.integral]/2
If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type). [ Note: In a two’s complement representation, this conversion is conceptual and there is no change in the bit pattern (if there is no truncation). —end note ]
If the destination type is signed, the value is unchanged if it can be represented in the destination type (and bit-field width); otherwise, the value is implementation-defined.
This text remains the same for C++14.
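Applied to the code in the question (the -64 result of the second conversion is implementation-defined under C++11/14; the comment reflects the common two's-complement outcome, not a guarantee):
#include <cstdint>
#include <iostream>

int main() {
    std::uint32_t i = -64;   // defined: -64 modulo 2^32 == 4294967232
    std::cout << i << '\n';  // 4294967232

    std::int32_t j = i;      // 4294967232 is not representable in int32_t:
    std::cout << j << '\n';  // implementation-defined; commonly -64 (same bit pattern)
}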
The Standard requires that implementations document, somehow, how they will determine what value to use when an integer is converted to a signed type which is too small to accommodate it. It does not specify the form such documentation will take. A conforming implementation's documentation could specify in readable print that values will be truncated and two's-complement sign extended, and then in impossibly small print specify "...except when a program is compiled on the fifth Tuesday of a month, in which case out-of-range conversions will yield the value 24601". Such documentation would, of course, be less than helpful, but the Standard does not concern itself with "quality of implementation" issues.
In practice, implementations that define the behavior in any fashion other than 100% consistent truncation and two's-complement sign extension are extremely rare; I would not be particularly surprised if in fact 100% of conforming C99 and C11 implementations that are intended for production code default to working in that fashion. Unfortunately, neither <limits.h> nor any other standard header defines any means via which implementations can indicate that they follow the essentially-universal convention.
To be sure, it's unlikely that code which expects the common behavior will be tripped up by the behavior of any conforming compiler. It's plausible, however, that compilers might offer a non-conforming mode, since that could make certain kinds of code more efficient. For example, given:
int32_t x,i;
int16_t *p;
...
x = ++p[i];
If int is larger than 16 bits, behavior would be defined in case p[i] was 32767 before the code executed. The increment would compute 32768, that value would be converted to int16_t in Implementation-Defined fashion (which is guaranteed to yield -32768 unless an implementation documents something else), and the converted value would then be stored to both x and p[i].
On processors like the ARM which always do arithmetic using 32 bits, truncating the value stored to p[i] would cost nothing, but truncating the value written to x would require an instruction (or, for some older ARM models, two instructions). Allowing x to receive +32768 in that case would improve efficiency on such processors. Such an option would not affect the behavior of most programs, but it would be helpful if the Standard defined a means via which code that relied upon truncating behavior could say, e.g.
#ifdef __STDC_UNUSUAL_INT_TRUNCATION
#error This code relies upon truncating integer type conversions
#endif
so that those programs that would be affected could guard against accidental compilation in such modes. As yet the Standard doesn't define any such test macro.
I wonder how C++ behaves in this case:
char variable = 127;
variable++;
In this case, variable now equals -128. However, did the increment operator wrap the value to its lower bound, or did an overflow occur?
An overflow occurred, and it results in undefined behavior.
Section 5.5:
If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable values
for its type, the behavior is undefined [...]
The standard goes on to note that integer overflows are, in most implementations, ignored. But this doesn't represent a guarantee.
Plain char can be either signed or unsigned. If the maximum value is 127, then it must be signed in your implementation.
For unsigned types, "overflow" is well defined, and causes wraparound. For signed types, the behavior on arithmetic overflow is undefined (wraparound is typical, but not required). But that actually doesn't apply in this particular case; instead, the value stored in variable is implementation-defined.
For types narrower than int, things are a little more complicated. This:
variable ++;
is equivalent to this:
variable = variable + 1;
The operands of the + operator have the "usual arithmetic conversions" applied to them, which in this case means that both operands are promoted to int. Since int is more than wide enough to hold the result, there's no overflow; the result is 128, and is of type int. When that result is stored back into variable, it's converted from int to char.
The rules for overflow are different for conversions than they are for arithmetic operations like "+". For a signed-to-signed or unsigned-to-signed conversion, if the value can't be represented in the target type, the behavior is not undefined; it merely yields an implementation-defined result.
For a typical implementation that uses a 2's-complement representation for signed integer types, the value stored will probably be -128 -- but other behaviors are possible. (For example, an implementation could use saturating arithmetic.)
Another (rather obscure) possibility is that char and int could be the same size (which can happen only if char is at least 16 bits). That can have some interesting effects, but I won't go into that (yet).
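A sketch that makes the promotion and the final conversion visible (it assumes, as the question does, a signed 8-bit char; the -128 in the comment is the common implementation-defined outcome, not a guarantee):
#include <iostream>
#include <type_traits>

int main() {
    char variable = 127;

    // The operands of + are promoted to int, so the addition itself cannot overflow here.
    static_assert(std::is_same<decltype(variable + 1), int>::value,
                  "the result of the addition has type int");

    int promoted = variable + 1;  // 128, held in an int
    variable = promoted;          // out-of-range conversion back to char: implementation-defined
    std::cout << promoted << ' ' << static_cast<int>(variable) << '\n';  // typically "128 -128"
}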