Let's say I want to implement std::vector without invoking any undefined behavior (UB). Is the code below invokes UB:
struct X{int i;};
int main(){
auto p = static_cast<X*>(::operator new(sizeof(X)*2));
new(p) X{};
new(p+1) X{};// p+1 UB?
}
Folowing a selection of quote from the standard that may help:
[basic.stc.dynamic.allocation]
The pointer returned (by an allocation function) shall be suitably aligned so that it can be converted to a pointer to any
suitable complete object type (21.6.2.1) and then used to access the object or array in the storage allocated
(until the storage is explicitly deallocated by a call to a corresponding deallocation function).
[expr.add]
When an expression that has integral type is added to or subtracted from a pointer, the result has the type
of the pointer operand. If the expression P points to element x[i] of an array object x with n elements,
the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) element
x[i + j] if 0<=i + j<=n; otherwise, the behavior is undefined. Likewise, the expression P - J points to the
(possibly-hypothetical) element x[i − j] if 0<=i − j <=n; otherwise, the behavior is undefined.
My interpretation is that allocation provides an possibly-hypothetical array of X (in C++ arrays are objects) so pointer arithmetic on allocated storage as in the exemple may not invoke undefined behavior. Or my interpretation of hypothetical is wrong? How could I do if the previous code snipest is UB?
Yes, technically it has undefined behaviour, though we tend to ignore that. P0593 should fix it properly.
The phrase "possibly-hypothetical" refers to one-past-the-end "elements" (ref), and does not permit this case.
Related
I know that in c++ access out of buffer bounds is undefined behaviour.
Here is example from cppreference:
int table[4] = {};
bool exists_in_table(int v)
{
// return true in one of the first 4 iterations or UB due to out-of-bounds access
for (int i = 0; i <= 4; i++) {
if (table[i] == v) return true;
}
return false;
}
But, I can't find according paragraph in c++ standard.
Can anyone point me out on concrete paragraph in standard where such case is explained?
It's undefined behavior. We can juxtapose a couple of passages to be convinced of it. First, and I won't explicitly prove it, table[4] is *(table + 4). We need only ask ourselves the properties of the pointer value table + 4 and how it relates to the requirements of the indirection operator.
On the pointer, we have this passage:
[basic.compound]
3 Every value of pointer type is one of the following:
a pointer to an object or function (the pointer is said to point to the object or function), or
a pointer past the end of an object ([expr.add]), or
the null pointer value for that type, or
an invalid pointer value.
Our pointer is of the second bullet's type, not the first. As for the indirection operator:
[expr.unary.op]
1 The unary * operator performs indirection: the expression to which it is applied shall be a pointer to an object type, or a pointer to a function type and the result is an lvalue referring to the object or function to which the expression points. If the type of the expression is “pointer to T”, the type of the result is “T”.
I hope it's obvious from reading this paragraph that the operation is defined for a pointer of the category described by the first bullet in the preceding paragraph.
So we apply an operation to a pointer value for which its behavior is not defined. The result is undefined behavior.
Subscript operator is defined through addition operator. The array decays to a pointer to first element in this identical expression, so rules of pointer arithmetic apply. Indirection operator is used on the hypothetical result of the addition.
[expr.sub]
A postfix expression followed by an expression in square brackets is a postfix expression.
One of the expressions shall be a glvalue of type “array of T” or a prvalue of type “pointer to T” and the other shall be a prvalue of unscoped enumeration or integral type.
The result is of type “T”.
The type “T” shall be a completely-defined object type.
The expression E1[E2] is identical (by definition) to *((E1)+(E2)), ...
In case where the array index is more than one past the last element i.e. E2 > std::size(E1) (which isn't the case in the example program), the hypothetical pointer arithmetic itself is undefined.
[expr.add]
When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.
If P evaluates to a null pointer value ... (does not apply)
Otherwise, if P points to an array element i of an array object x with n elements ([dcl.array]), the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) array element i+j of x if 0≤i+j≤n and the expression P - J points to the (possibly-hypothetical) array element i−j of x if 0≤i−j≤n. (does not apply when i-j > n)
Otherwise, the behavior is undefined.
In case of E2 == std::size(E1) (which is the case in last iteration of the example), the hypothetical result of the addition is a pointer to one past the array and points to outside the storage of the array. The hypothetical pointer arithmetic is well defined.
[basic.compound]
A value of a pointer type that is a pointer ... past the end of an object represents ... the first byte in memory after the end of the storage occupied by the object
Access is defined in terms of objects. But there is no object there, nor is there even storage, and thus there isn't definition for the behaviour.
OK, there might in some cases be an unrelated object in the pointed memory address. Following note says that pointer past the end is not a pointer to such unrelated object sharing the address. I couldn't find which normative rule causes this.
[Note 2: A pointer past the end of an object ([expr.add]) is not considered to point to an unrelated object of the object's type, even if the unrelated object is located at that address. ...
Alternatively, we can look at the definition of indirection operator:
[expr.unary.op]
The unary * operator performs indirection: the expression to which it is applied shall be a pointer to an object type ... and the result is an lvalue referring to the object ... to which the expression points. ...
There is a contradiction because there is no object that could be referred to.
So, in conclusion:
int table[N] = {};
table[N] == 0; // UB, accessing non-existing object
table[N + 1]; // UB, [expr.add]
table + N; // OK, one past last element
table[N]; // ¯\_(ツ)_/¯ See CWG 232
This is a follow up to the following question. I was under the assumption, that the pointer arithmetic I originally used would cause undefined behavior. However I was told by a colleague, that the usage is actually well defined. The following is a simplified example:
typedef struct StructA {
int a;
} StructA ;
typedef struct StructB {
StructA a;
StructA* b;
} StructB;
int main() {
StructB* original = (StructB*)malloc(sizeof(StructB));
original->a.a = 5;
original->b = &original->a;
StructB* copy = (StructB*)malloc(sizeof(StructB));
memcpy(copy, original, sizeof(StructB));
free(original);
ptrdiff_t offset = (char*)copy - (char*)original;
StructA* a = (StructA*)((char*)(copy->b) + offset);
printf("%i\n", a->a);
free(copy)
}
According to §5.7 ¶5 of the C++11 spec:
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
I assumed, that the following part of the code:
ptrdiff_t offset = (char*)copy - (char*)original;
StructA* a = (StructA*)((char*)(copy->b) + offset);
causes undefined behavior, since it:
subtracts two pointers, which point to different arrays
the resulting pointer of the offset calculation does not point into the same array anymore.
Does this cause undefined behavior, or do I misinterpret the C++ specification? Does the same apply in C as well?
Edit:
Following the comments I assume the following modification would still be undefined behavior because of the object usage after the lifetime has ended:
ptrdiff_t offset = (char*)(copy->b) - (char*)original;
StructA* a = (StructA*)((char*)copy + offset);
Would it be defined when working with indexes instead:
typedef struct StructB {
StructA a;
ptrdiff_t b_offset;
} StructB;
int main() {
StructB* original = (StructB*)malloc(sizeof(StructB));
original->a.a = 5;
original->b_offset = (char*)&(original->a) - (char*)original
StructB* copy = (StructB*)malloc(sizeof(StructB));
memcpy(copy, original, sizeof(StructB));
free(original);
StructA* a = (StructA*)((char*)copy + copy->b_offset);
printf("%i\n", a->a);
free(copy);
}
It is undefined behavior because there are severe restrictions on what can be done with pointer arithmetic. The edits that you have made and that were suggested do nothing to fix this.
Undefined Behavior in Addition
StructA* a = (StructA*)((char*)copy + offset);
First of all, this is undefined behavior due to the addition onto copy:
When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.
(4.1) If P evaluates to a null pointer value and J evaluates to 0, the result is a null pointer value.
(4.2) Otherwise, if P points to an array element i of an array object x with n elements ([dcl.array]), the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) array element i+j of x if 0 ≤ i + j ≤ n and the expression P - J points to the (possibly-hypothetical) array element i − j of x if 0 ≤ i − j ≤ n.
(4.3) Otherwise, the behavior is undefined.
See https://eel.is/c++draft/expr.add#4
In short, performing pointer arithmetic on non-arrays and non-null-pointers is always undefined behavior. Even if copy or its members were arrays, adding onto a pointer so that it becomes:
two or more past the end of the array
at least one before the first element
is also undefined behavior.
Undefined Behavior in Subtraction
ptrdiff_t offset = (char*)original - (char*)(copy->b);
The subtraction of your two pointers is also undefined behavior:
When two pointer expressions P and Q are subtracted, the type of the result is an implementation-defined signed integral type; [...]
(5.1) If P and Q both evaluate to null pointer values, the result is 0.
(5.2) Otherwise, if P and Q point to, respectively, array elements i and j of the same array object x, the expression P - Q has the value i − j.
(5.3) Otherwise, the behavior is undefined.
See https://eel.is/c++draft/expr.add#5
So subtracting pointers from one another, when they are not both null or pointers to elements of the same array is undefined behavior.
Undefined Behavior in C
The C standard has similar restrictions:
(8) [...] If the pointer operand points to an element of
an array object, and the array is large enough, the result points to an element offset from
the original element such that the difference of the subscripts of the resulting and original
array elements equals the integer expression.
(The standard does not mention what happens for non-array pointer addition)
(9) When two pointers are subtracted, both shall point to elements of the same array object,
or one past the last element of the array object; [...]
See §6.5.6 Additive Operators in the C11 standard (n1570).
Using Data Member Pointers Instead
A clean and type-safe solution in C++ would be to use data member pointers.
typedef struct StructB {
StructA a;
StructA StructB::*b_offset;
} StructB;
int main() {
StructB* original = (StructB*) malloc(sizeof(StructB));
original->a.a = 5;
original->b_offset = &StructB::a;
StructB* copy = (StructB*) malloc(sizeof(StructB));
memcpy(copy, original, sizeof(StructB));
free(original);
printf("%i\n", (copy->*(copy->b_offset)).a);
free(copy);
}
Notes
The standard citations are from a C++ draft. The C++11 which you have cited does not appear to have any looser restrictions on pointer arithmetic, it is just formatted differently. See C++11 standard (n3337).
The Standard explicitly provides that in situations it characterizes as Undefined Behavior, implementations may behave "in a documented fashion characteristic of the environment". According to the Rationale, the intention of such characterization was, among other things, to identify avenues of "conforming language extension"; the question of when implementations support such "popular extensions" was a Quality of Implementation issue best left to the marketplace.
Many implementations intended and/or configured for low-level programming on commonplace platforms extend the language by specifying that the following equivalences hold, for any pointers p and q of type T* and integer expression i:
The bit patterns of p, (uintptr_t)p, and (intptr_t)p are identical.
p+i is equivalent to (T*)((uintptr_t)p + (uintptr_t)i * sizeof (T))
p-i is equivalent to (T*)((uintptr_t)p - (uintptr_t)i * sizeof (T))
p-q is equivalent to ((uintptr_t)p - (uintptr_t)q) / sizeof (T) in all cases where the division would have no remainder.
p>q is equivalent to (uintptr_t)p > (uintptr_t)q and likewise for all other relational and comparison operators.
The Standard does not recognize any category of implementations that always uphold those equivalences, as distinct from those that do not, in part because they did not wish to portray as "inferior" implementations for unusual platforms where such upholding equivalence would be impractical. Instead, it expected that such implementations would be upheld on implementations where that would make sense, and programmers would know when they were targeting such implementations. Someone writing memory-management code for the 68000, or for small-model 8086 (where such equivalences would naturally hold) could write memory management code that would run interchangeably on other systems where those equivalences would hold, but someone writing memory-management code for large-model 8086 would need to design it explicitly for that platform because those equivalences do not hold (pointers are 32 bits, but individual objects are limited to 65520 bytes and most pointer operations only act upon the bottom 16 bits of a pointer).
Unfortunately, even on platforms where such equivalences would normally hold, some kinds of optimizations may yield corner-case behaviors that differ from those otherwise implied by those equivalences. Commercial compilers generally uphold the Spirit of C principle "don't prevent the programmer from doing what needs to be done", and can be configured to uphold the equivalences even when most optimizations are enabled. The gcc and clang C compilers, however, don't allow such control over semantics. When all optimizations are disabled, they will uphold those equivalences on commonplace platforms, but there is no other optimization setting that will prevent them from making inferences that would be inconsistent with them.
Consider the following code:
int* p1 = new int[100];
int* p2 = new int[100];
const ptrdiff_t ptrDiff = p1 - p2;
int* p1_42 = &(p1[42]);
int* p2_42 = p1_42 + ptrDiff;
Now, does the Standard guarantee that p2_42 points to p2[42]? If not, is it always true on Windows, Linux or webassembly heap?
To add the standard quote:
expr.add#5
When two pointer expressions P and Q are subtracted, the type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header ([support.types]).
(5.1)
If P and Q both evaluate to null pointer values, the result is 0.
(5.2)
Otherwise, if P and Q point to, respectively, elements x[i] and x[j] of the same array object x, the expression P - Q has the value i−j.
(5.3)
Otherwise, the behavior is undefined.
[ Note: If the value i−j is not in the range of representable values of type std::ptrdiff_t, the behavior is undefined.
— end note
]
(5.1) does not apply as the pointers are not nullptrs. (5.2) does not apply because the pointers are not into the same array. So, we are left with (5.3) - UB.
const ptrdiff_t ptrDiff = p1 - p2;
This is undefined behavior. Subtraction between two pointers is well defined only if they point to elements in the same array. ([expr.add] ¶5.3).
When two pointer expressions P and Q are subtracted, the type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header ([support.types]).
If P and Q both evaluate to null pointer values, the result is 0.
Otherwise, if P and Q point to, respectively, elements x[i] and x[j] of the same array object x, the expression P - Q has the value i−j.
Otherwise, the behavior is undefined
And even if there was some hypothetical way to obtain this value in a legal way, even that summation is illegal, as even a pointer+integer summation is restricted to stay inside the boundaries of the array ([expr.add] ¶4.2)
When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.
If P evaluates to a null pointer value and J evaluates to 0, the result is a null pointer value.
Otherwise, if P points to element x[i] of an array object x with n elements,81 the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) element x[i+j] if 0≤i+j≤n and the expression P - J points to the (possibly-hypothetical) element x[i−j] if 0≤i−j≤n.
Otherwise, the behavior is undefined.
The third line is Undefined Behavior, so the Standard allows anything after that.
It's only legal to subtract two pointers pointing to (or after) the same array.
Windows or Linux aren't really relevant; compilers and especially their optimizers are what breaks your program. For instance, an optimizer might recognize that p1 and p2 both point to the begin of an int[100] so p1-p2 has to be 0.
The Standard allows for implementations on platforms where memory is divided into discrete regions which cannot be reached from each other using pointer arithmetic. As a simple example, some platforms use 24-bit addresses that consist of an 8-bit bank number and a 16-bit address within a bank. Adding one to an address that identifies the last byte of a bank will yield a pointer to the first byte of that same bank, rather than the first byte of the next bank. This approach allows address arithmetic and offsets to be computed using 16-bit math rather than 24-bit math, but requires that no object span a bank boundary. Such a design would impose some extra complexity on malloc, and would likely result in more memory fragmentation than would otherwise occur, but user code wouldn't generally need to care about the partitioning of memory into banks.
Many platforms do not have such architectural restrictions, and some compilers which are designed for low-level programming on such platforms will allow address arithmetic to be performed between arbitrary pointers. The Standard notes that a common way of treating Undefined Behavior is "behaving during translation or program execution in a documented manner characteristic of the environment", and support for generalized pointer arithmetic in environments that support it would fit nicely under that category. Unfortunately, the Standard fails to provide any means of distinguishing implementations that behave in such useful fashion and those which don't.
Consider the following code:
int data[2][2];
int* p(&data[0][0]);
p[3] = 0;
Or equivalently:
int data[2][2];
int (&row0)[2] = data[0];
int* p = &row0[0];
p[3] = 0;
It's not clear to me whether this is undefined behaviour or not.
p is a pointer to the first element of an array row0 with 2 elements, therefore p[3] accesses past the end of the array, which is UB according to 7.6.6 [expr.add]:
When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.
If P evaluates to a null pointer value and J evaluates to 0, the result is a null pointer value.
Otherwise, if P points to element x[i] of an array object x with n elements, the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) element x[i+j] if 0 ≤ i + j ≤ n and the expression P - J points to the (possibly-hypothetical) element x[i−j] if 0 ≤ i − j ≤ n.
Otherwise, the behavior is undefined.
I don't see anything in the standard that gives special treatment to multidimensional arrays, so I can only conclude that the above is, in fact, UB.
Am I correct?
What about the case of data being declared as std::array<std::array<int, 2>, 2>? This case seems even more likely to be UB, as structs may have padding.
Yes, you are correct, and there is not much to add to it. There are no mutidimensional arrays in C++ type system, there are only arrays (of arrays of arrays of arrays ad libitum).
Accessing an element beyond array size is undefined behavior.
The C++17 standard seems to say that an integer can only be added to a pointer if the pointer is to an array element, or, as a special exception, the pointer is the result of unary operator &:
8.5.6 [expr.add] describing addition to a pointer:
When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the expression P points to element x[i] of an array object x with n elements, the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) element x[i + j] if 0 ≤ i + j ≤ n; otherwise, the behavior is undefined.
This quote includes a non-normative footnote:
An object that is not an array element is considered to belong to a single-element array for this purpose; see 8.5.2.1
which references 8.5.2.1 [expr.unary.op] discussing the unary & operator:
The result of the unary & operator is a pointer to its operand... For purposes of pointer arithmetic (8.5.6) and comparison (8.5.9, 8.5.10), an object that is not an array element whose address is taken in this way is considered to belong to an array with one element of type T.
The non-normative footnote seems to be slightly misleading, as the section it references describes behavior specific to the result of unary operator &. Nothing appears to permit other pointers (e.g. from non-array new) to be considered single-element arrays.
This seems to suggest:
void f(int a) {
int* z = (new int) + 1; // undefined behavior
int* w = &a + 1; // ok
}
Is this an oversight in the changes made for C++17? Am I missing something? Is there a reason that the "single-element array rule" is only provided specifically for unary operator &?
Note: As specified in the title, this question is specific to C++17. The C standard and prior versions of the C++ standard contained clear normative language that is no longer present. Older, vague questions like this are not relevant.
Yes, this appears to be a bug in the c++17 standard.
int* z = (new int)+1; // undefined behavior.
int* a = new int;
int* b = a+1; // undefined behavior, same reason as `z`
&*a; // seeming noop, but magically makes `*a` into an array of one element!
int* c = a+1; // defined behavior!
this is pretty ridiculous.
8.5.2.1 [expr.unary.op]
[...] an object that is not an array element whose address is taken in this way is considered to belong to an array with one element of type T
once "blessed" by 8.5.2.1, the object is an array of one element. If you don't bless it by invoking & at least once, it has never been blessed by 8.5.2.1 and is not an array of one element.
It was fixed as a defect in c++20.