Consider the following code:
int* p1 = new int[100];
int* p2 = new int[100];
const ptrdiff_t ptrDiff = p1 - p2;
int* p1_42 = &(p1[42]);
int* p2_42 = p1_42 + ptrDiff;
Now, does the Standard guarantee that p2_42 points to p2[42]? If not, is it always true on the Windows, Linux, or WebAssembly heap?
To add the standard quote:
expr.add#5
When two pointer expressions P and Q are subtracted, the type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header ([support.types]).
(5.1) If P and Q both evaluate to null pointer values, the result is 0.
(5.2) Otherwise, if P and Q point to, respectively, elements x[i] and x[j] of the same array object x, the expression P - Q has the value i−j.
(5.3) Otherwise, the behavior is undefined.
[ Note: If the value i−j is not in the range of representable values of type std::ptrdiff_t, the behavior is undefined. — end note ]
(5.1) does not apply as the pointers are not nullptrs. (5.2) does not apply because the pointers are not into the same array. So, we are left with (5.3) - UB.
const ptrdiff_t ptrDiff = p1 - p2;
This is undefined behavior. Subtraction between two pointers is well defined only if they point to elements in the same array. ([expr.add] ¶5.3).
When two pointer expressions P and Q are subtracted, the type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header ([support.types]).
If P and Q both evaluate to null pointer values, the result is 0.
Otherwise, if P and Q point to, respectively, elements x[i] and x[j] of the same array object x, the expression P - Q has the value i−j.
Otherwise, the behavior is undefined
And even if there were some hypothetical way to obtain this value legally, the addition would still be undefined, because pointer + integer arithmetic is also restricted to stay within the bounds of the array ([expr.add] ¶4.2):
When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.
If P evaluates to a null pointer value and J evaluates to 0, the result is a null pointer value.
Otherwise, if P points to element x[i] of an array object x with n elements, the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) element x[i+j] if 0≤i+j≤n and the expression P - J points to the (possibly-hypothetical) element x[i−j] if 0≤i−j≤n.
Otherwise, the behavior is undefined.
The third line is Undefined Behavior, so the Standard allows anything after that.
It's only legal to subtract two pointers pointing to (or after) the same array.
Windows or Linux aren't really relevant; compilers, and especially their optimizers, are what break your program. For instance, an optimizer might recognize that p1 and p2 both point to the beginning of an int[100] and conclude that p1-p2 has to be 0.
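A well-defined way to reach the element the question is after is to carry an index instead of a cross-array pointer difference. The following is a minimal sketch, not code from the question: the helper name corresponding_element and its parameters are hypothetical, and it assumes both arrays hold at least count elements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: maps a pointer into the array `from` onto the element
// with the same index in the array `to`. Each arithmetic step involves only
// one array, so the rules of [expr.add] are respected.
int* corresponding_element(int* from, int* to, std::size_t count, int* p_in_from)
{
    // Both operands point into `from`, so the subtraction is well defined.
    const std::ptrdiff_t index = p_in_from - from;
    assert(index >= 0 && static_cast<std::size_t>(index) < count);
    // `to + index` stays inside `to`, so the addition is well defined too.
    return to + index;
}
```

Called as corresponding_element(p1, p2, 100, &p1[42]), this yields a pointer to p2[42] without ever subtracting p1 from p2.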
The Standard allows for implementations on platforms where memory is divided into discrete regions which cannot be reached from each other using pointer arithmetic. As a simple example, some platforms use 24-bit addresses that consist of an 8-bit bank number and a 16-bit address within a bank. Adding one to an address that identifies the last byte of a bank will yield a pointer to the first byte of that same bank, rather than the first byte of the next bank. This approach allows address arithmetic and offsets to be computed using 16-bit math rather than 24-bit math, but requires that no object span a bank boundary. Such a design would impose some extra complexity on malloc, and would likely result in more memory fragmentation than would otherwise occur, but user code wouldn't generally need to care about the partitioning of memory into banks.
Many platforms do not have such architectural restrictions, and some compilers which are designed for low-level programming on such platforms will allow address arithmetic to be performed between arbitrary pointers. The Standard notes that a common way of treating Undefined Behavior is "behaving during translation or program execution in a documented manner characteristic of the environment", and support for generalized pointer arithmetic in environments that support it would fit nicely under that category. Unfortunately, the Standard fails to provide any means of distinguishing implementations that behave in such useful fashion and those which don't.
Related
Say I have global variables defined in a TU such as:
extern const std::string s0{"s0"};
extern const std::string s1{"s11"};
extern const std::string s2{"s222"};
// etc...
And a function get_1 to get them depending on an index:
size_t get_1(size_t i)
{
switch (i)
{
case 0: return s0.size();
case 1: return s1.size();
case 2: return s2.size();
// etc...
}
}
And someone proposes replacing get_1 with get_2 with:
size_t get_2(size_t i)
{
return (&s0 + i)->size();
}
Are global variables defined next to each other in a translation unit like this guaranteed to be stored contiguously, and in the order defined?
Ie will &s1 == &s0 + 1 and &s2 == &s1 + 1 always be true?
Or can a compiler (does the standard allow a compiler to) place the variables s0 higher than s1 in memory ie. swap them?
Is it well defined behaviour to perform pointer arithmetic, like in get_2, over such variables? (that crucially aren't in the same sub-object or in an array etc., they're just globals like this)
Do rules about using relational operators on pointers from https://stackoverflow.com/a/9086675/8594193 apply to pointer arithmetic too? (Is the last comment on this answer about std::less and friends yielding a total order over any void*s where the normal relational operators don't relevant here too?)
Edit: this is not necessarily a duplicate of/asking about variables on the stack and their layout in memory, I'm aware of that already, I was specifically asking about global variables. Although the answer turns out to be the same, the question is not.
Pointer arithmetic on disparate objects yields undefined behavior as per [expr.add]:
4 When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.
(4.1) — If P evaluates to a null pointer value and J evaluates to 0, the result is a null pointer value.
(4.2) — Otherwise, if P points to an array element i of an array object x with n elements (9.3.4.5), the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) array element i + j of x if 0 ≤ i + j ≤ n and the expression P - J points to the (possibly-hypothetical) array element i − j of x if 0 ≤ i − j ≤ n.
(4.3) — Otherwise, the behavior is undefined.
Since s0 through s2 are not elements of an array, get_2 yields explicitly documented undefined behavior.
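If indexed access is the goal, the strings can be made elements of a single array object, which turns the lookup into ordinary in-bounds indexing. This is a sketch under that assumption; the names all_strings and get_3 are hypothetical and do not appear in the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: the globals are gathered into one array object, so the
// elements really are contiguous and in declaration order.
const std::array<std::string, 3> all_strings{"s0", "s11", "s222"};

// Hypothetical replacement for get_2: plain array indexing, no pointer
// arithmetic across unrelated objects.
std::size_t get_3(std::size_t i)
{
    assert(i < all_strings.size());
    return all_strings[i].size();
}
```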
As far as I can tell, the standard puts no limits on the order in memory of these variables, so the compiler could order them any way it wanted, with any amount of padding or other variables between them. This is not explicitly mentioned as such, but as was pointed out to me in the comments, [expr.rel] and [expr.eq] determine that the results of relational operators in these cases are undefined/unspecified. In particular, [expr.eq] states about operators == and != that
(3.1) — If one pointer represents the address of a complete object, and another pointer represents the address one past the last element of a different complete object, the result of the comparison is unspecified.
and [expr.rel] about <, >, <=, >= that
4 The result of comparing unequal pointers to objects is defined in terms of a partial order consistent with the following rules:
(4.1) — If two pointers point to different elements of the same array, or to subobjects thereof, the pointer to the element with the higher subscript is required to compare greater.
(4.2) — If two pointers point to different non-static data members of the same object, or to subobjects of such members, recursively, the pointer to the later declared member is required to compare greater provided the two members have the same access control (11.9), neither member is a subobject of zero size, and their class is not a union.
(4.3) — Otherwise, neither pointer is required to compare greater than the other.
Again, since s0, s1, s2 are not part of the same array and not members of the same object, 4.3 is relevant, and the results of comparing pointers to them is unspecified. In practical terms, this means that the compiler can order them in memory in an arbitrary fashion.
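On the std::less part of the question: the pointer specializations of std::less do provide an implementation-defined strict total order, but that only affects comparison. It says nothing about layout and does not make arithmetic across objects defined. A small sketch, with the hypothetical helper name ordered_before:

```cpp
#include <cassert>
#include <functional>
#include <string>

// Returns true if `a` comes before `b` in the implementation-defined strict
// total order that std::less guarantees for pointers. Which of two unrelated
// objects comes first is not specified; only consistency is.
bool ordered_before(const std::string* a, const std::string* b)
{
    return std::less<const std::string*>{}(a, b);
}
```

For two distinct objects exactly one of ordered_before(&x, &y) and ordered_before(&y, &x) is true, but which one is up to the implementation.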
This is a follow up to the following question. I was under the assumption, that the pointer arithmetic I originally used would cause undefined behavior. However I was told by a colleague, that the usage is actually well defined. The following is a simplified example:
typedef struct StructA {
int a;
} StructA ;
typedef struct StructB {
StructA a;
StructA* b;
} StructB;
int main() {
StructB* original = (StructB*)malloc(sizeof(StructB));
original->a.a = 5;
original->b = &original->a;
StructB* copy = (StructB*)malloc(sizeof(StructB));
memcpy(copy, original, sizeof(StructB));
free(original);
ptrdiff_t offset = (char*)copy - (char*)original;
StructA* a = (StructA*)((char*)(copy->b) + offset);
printf("%i\n", a->a);
free(copy);
}
According to §5.7 ¶5 of the C++11 spec:
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
I assumed, that the following part of the code:
ptrdiff_t offset = (char*)copy - (char*)original;
StructA* a = (StructA*)((char*)(copy->b) + offset);
causes undefined behavior, since it:
subtracts two pointers, which point to different arrays
the resulting pointer of the offset calculation does not point into the same array anymore.
Does this cause undefined behavior, or do I misinterpret the C++ specification? Does the same apply in C as well?
Edit:
Following the comments I assume the following modification would still be undefined behavior because of the object usage after the lifetime has ended:
ptrdiff_t offset = (char*)(copy->b) - (char*)original;
StructA* a = (StructA*)((char*)copy + offset);
Would it be defined when working with indexes instead:
typedef struct StructB {
StructA a;
ptrdiff_t b_offset;
} StructB;
int main() {
StructB* original = (StructB*)malloc(sizeof(StructB));
original->a.a = 5;
original->b_offset = (char*)&(original->a) - (char*)original;
StructB* copy = (StructB*)malloc(sizeof(StructB));
memcpy(copy, original, sizeof(StructB));
free(original);
StructA* a = (StructA*)((char*)copy + copy->b_offset);
printf("%i\n", a->a);
free(copy);
}
It is undefined behavior because there are severe restrictions on what can be done with pointer arithmetic. The edits that you have made and that were suggested do nothing to fix this.
Undefined Behavior in Addition
StructA* a = (StructA*)((char*)copy + offset);
First of all, this is undefined behavior due to the addition onto copy:
When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.
(4.1) If P evaluates to a null pointer value and J evaluates to 0, the result is a null pointer value.
(4.2) Otherwise, if P points to an array element i of an array object x with n elements ([dcl.array]), the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) array element i+j of x if 0 ≤ i + j ≤ n and the expression P - J points to the (possibly-hypothetical) array element i − j of x if 0 ≤ i − j ≤ n.
(4.3) Otherwise, the behavior is undefined.
See https://eel.is/c++draft/expr.add#4
In short, pointer arithmetic is only defined within an array (an object that is not an array element counts as an array of one element for this purpose) or for a null pointer plus zero. Even if copy or its members were arrays, adding onto a pointer so that it becomes:
two or more past the end of the array
at least one before the first element
is also undefined behavior.
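The boundaries described above can be illustrated with a plain array. This is a sketch only; the function name span_length is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Returns the number of elements between the first element of `arr` and its
// one-past-the-end position.
template <std::size_t N>
std::ptrdiff_t span_length(int (&arr)[N])
{
    int* first = arr;         // points to arr[0]
    int* past_end = arr + N;  // one past the last element: may be formed, must not be dereferenced
    // Forming arr + N + 1 or arr - 1 would be undefined behavior,
    // even if the result were never dereferenced.
    return past_end - first;  // both pointers belong to the same array
}
```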
Undefined Behavior in Subtraction
ptrdiff_t offset = (char*)original - (char*)(copy->b);
The subtraction of your two pointers is also undefined behavior:
When two pointer expressions P and Q are subtracted, the type of the result is an implementation-defined signed integral type; [...]
(5.1) If P and Q both evaluate to null pointer values, the result is 0.
(5.2) Otherwise, if P and Q point to, respectively, array elements i and j of the same array object x, the expression P - Q has the value i − j.
(5.3) Otherwise, the behavior is undefined.
See https://eel.is/c++draft/expr.add#5
So subtracting pointers from one another, when they are not both null or pointers to elements of the same array is undefined behavior.
Undefined Behavior in C
The C standard has similar restrictions:
(8) [...] If the pointer operand points to an element of
an array object, and the array is large enough, the result points to an element offset from
the original element such that the difference of the subscripts of the resulting and original
array elements equals the integer expression.
(Paragraph 7 of the same section states that a pointer to an object that is not an array element behaves like a pointer to the first element of an array of length one.)
(9) When two pointers are subtracted, both shall point to elements of the same array object,
or one past the last element of the array object; [...]
See §6.5.6 Additive Operators in the C11 standard (n1570).
Using Data Member Pointers Instead
A clean and type-safe solution in C++ would be to use data member pointers.
typedef struct StructB {
StructA a;
StructA StructB::*b_offset;
} StructB;
int main() {
StructB* original = (StructB*) malloc(sizeof(StructB));
original->a.a = 5;
original->b_offset = &StructB::a;
StructB* copy = (StructB*) malloc(sizeof(StructB));
memcpy(copy, original, sizeof(StructB));
free(original);
printf("%i\n", (copy->*(copy->b_offset)).a);
free(copy);
}
Notes
The standard citations are from a C++ draft. The C++11 standard, which you cited, does not appear to have any looser restrictions on pointer arithmetic; it is just formatted differently. See C++11 standard (n3337).
The Standard explicitly provides that in situations it characterizes as Undefined Behavior, implementations may behave "in a documented fashion characteristic of the environment". According to the Rationale, the intention of such characterization was, among other things, to identify avenues of "conforming language extension"; the question of when implementations support such "popular extensions" was a Quality of Implementation issue best left to the marketplace.
Many implementations intended and/or configured for low-level programming on commonplace platforms extend the language by specifying that the following equivalences hold, for any pointers p and q of type T* and integer expression i:
The bit patterns of p, (uintptr_t)p, and (intptr_t)p are identical.
p+i is equivalent to (T*)((uintptr_t)p + (uintptr_t)i * sizeof (T))
p-i is equivalent to (T*)((uintptr_t)p - (uintptr_t)i * sizeof (T))
p-q is equivalent to ((uintptr_t)p - (uintptr_t)q) / sizeof (T) in all cases where the division would have no remainder.
p>q is equivalent to (uintptr_t)p > (uintptr_t)q and likewise for all other relational and comparison operators.
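The addition equivalence can be probed on a given implementation with a check like the one below. It is a sketch, not a guarantee: the function name is hypothetical, the pointer arithmetic is kept inside one array so it is well defined, and the mapping from pointers to std::uintptr_t is implementation-defined, so the check is only expected to return true on platforms with a flat address space.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Checks whether in-bounds pointer addition on `base` matches plain address
// arithmetic on std::uintptr_t. The caller must ensure that base + i does
// not go beyond one past the end of the array `base` points into.
bool addition_matches_address_math(const int* base, std::size_t i)
{
    const auto base_addr = reinterpret_cast<std::uintptr_t>(base);
    const auto elem_addr = reinterpret_cast<std::uintptr_t>(base + i);
    return elem_addr == base_addr + static_cast<std::uintptr_t>(i) * sizeof(int);
}
```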
The Standard does not recognize any category of implementations that always uphold those equivalences, as distinct from those that do not, in part because its authors did not wish to portray as "inferior" implementations for unusual platforms where upholding such equivalences would be impractical. Instead, it expected that such equivalences would be upheld on implementations where that would make sense, and programmers would know when they were targeting such implementations. Someone writing memory-management code for the 68000, or for small-model 8086 (where such equivalences would naturally hold) could write memory management code that would run interchangeably on other systems where those equivalences would hold, but someone writing memory-management code for large-model 8086 would need to design it explicitly for that platform because those equivalences do not hold (pointers are 32 bits, but individual objects are limited to 65520 bytes and most pointer operations only act upon the bottom 16 bits of a pointer).
Unfortunately, even on platforms where such equivalences would normally hold, some kinds of optimizations may yield corner-case behaviors that differ from those otherwise implied by those equivalences. Commercial compilers generally uphold the Spirit of C principle "don't prevent the programmer from doing what needs to be done", and can be configured to uphold the equivalences even when most optimizations are enabled. The gcc and clang C compilers, however, don't allow such control over semantics. When all optimizations are disabled, they will uphold those equivalences on commonplace platforms, but there is no other optimization setting that will prevent them from making inferences that would be inconsistent with them.
For subtraction of pointers i and j to elements of the same array object the note in [expr.add#5] reads:
[ Note: If the value i−j is not in the range of representable values of type std::ptrdiff_t, the behavior is undefined. — end note ]
But given [support.types.layout#2], which states that (emphasis mine):
The type ptrdiff_t is an implementation-defined signed integer type that can hold the difference of two subscripts in an array object, as described in [expr.add].
Is it even possible for the result of i-j not to be in the range of representable values of ptrdiff_t?
PS: I apologize if my question is caused by my poor understanding of the English language.
EDIT: Related: Why is the maximum size of an array "too large"?
Is it even possible for the result of i-j not to be in the range of representable values of ptrdiff_t?
Yes, but it's unlikely.
In fact, [support.types.layout]/2 says little beyond stating that the actual rules for pointer subtraction and ptrdiff_t are defined in [expr.add]. So let us look at that section.
[expr.add]/5
When two pointers to elements of the same array object are subtracted, the type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header.
First of all, note that the case where i and j are subscript indexes of different arrays is not considered. This allows i-j to be treated as P-Q, where P is a pointer to the element of an array at subscript i and Q is a pointer to the element of the same array at subscript j. Indeed, subtracting two pointers to elements of different arrays is undefined behavior:
[expr.add]/5
If the expressions P and Q point to, respectively, elements x[i] and x[j] of the same array object x, the expression P - Q has the value i−j; otherwise, the behavior is undefined.
As a conclusion, with the notation defined previously, i-j and P-Q are defined to have the same value, with the latter being of type std::ptrdiff_t. But nothing is said about the possibility for this type to hold such a value. This question can, however, be answered with the help of std::numeric_limits; especially, one can detect if an array some_array is too big for std::ptrdiff_t to hold all index differences:
static_assert(std::numeric_limits<std::ptrdiff_t>::max() > sizeof(some_array)/sizeof(some_array[0]),
"some_array is too big, subtracting its first and one-past-the-end element indexes "
"or pointers would lead to undefined behavior as per [expr.add]/5."
);
Now, on usual targets this would not normally happen, as sizeof(std::ptrdiff_t) == sizeof(void*), which means an array would need to be stupidly big for ptrdiff_t to overflow. But there is no guarantee of it.
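The same check can be written as a reusable constexpr predicate that compares in a wide unsigned type, which avoids a signed/unsigned comparison. This is a sketch; index_differences_fit and small_array are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: true when every index difference in an array of
// `element_count` elements (including the one-past-the-end position) is
// representable in std::ptrdiff_t.
constexpr bool index_differences_fit(std::size_t element_count)
{
    return static_cast<std::uintmax_t>(element_count) <=
           static_cast<std::uintmax_t>(std::numeric_limits<std::ptrdiff_t>::max());
}

int small_array[100];
static_assert(index_differences_fit(sizeof(small_array) / sizeof(small_array[0])),
              "small_array is too big for std::ptrdiff_t");
```

On implementations where std::size_t and std::ptrdiff_t have the same width, the predicate is false for the largest std::size_t value, which is exactly the case the note in [expr.add] is about.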
I think it is a defect in the wording.
The rule in [expr.add] is inherited from the same rule for pointer subtraction in the C standard. In the C standard, ptrdiff_t is not required to hold any difference of two subscripts in an array object.
The rule in [support.types.layout] comes from Core Language Issue 1122. It added direct definitions for std::size_t and std::ptrdiff_t, which is supposed to solve the problem of circular definition. I don't see there is any reason (at least not mentioned in any official document) to make std::ptrdiff_t hold any difference of two subscripts in an array object. I guess it just uses an improper definition to solve the circular definition issue.
As further evidence, [diff.library] does not mention any difference between std::ptrdiff_t in C++ and ptrdiff_t in C. Since ptrdiff_t has no such constraint in C, std::ptrdiff_t should not have such a constraint in C++ either.
I used to think that adding an integral type to a pointer (provided that the pointer points to an array of a certain size etc. etc.) is always well defined, regardless of the integral type. The C++11 standard says ([expr.add]):
When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression. In other words, if the expression P points to the i -th element of an array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N (where N has the value n ) point to, respectively, the i + n -th and i − n -th elements of the array object, provided they exist. Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
On the other hand, it was brought to my attention recently that the built-in add operators for pointers are defined in terms of ptrdiff_t, which is a signed type (see 13.6/13). This seems to hint that if one does a malloc() with a very large (unsigned) size and then tries to reach the end of the allocated space via a pointer addition with a std::size_t value, this might result in undefined behaviour because the unsigned std::size_t will be converted to ptrdiff_t which is potentially UB.
I imagine similar issues would arise, e.g., in the operator[]() of std::vector, which is implemented in terms of an unsigned size_type. In general, it seems to me like this would make it practically impossible to fully use the memory storage available on a platform.
It's worth noting that neither GCC nor Clang complains about signed-unsigned integral conversions, even with all the relevant diagnostics turned on, when adding unsigned values to pointers.
Am I missing something?
EDIT: I'd like to clarify that I am talking about additions involving a pointer and an integral type (not two pointers).
EDIT2: an equivalent way of formulating the question might be this. Does this code result in UB in the second line, if ptrdiff_t has a smaller positive range than size_t?
char *ptr = static_cast<char * >(std::malloc(std::numeric_limits<std::size_t>::max()));
auto end = ptr + std::numeric_limits<std::size_t>::max();
Your question is based on a false premise.
Subtraction of pointers produces a ptrdiff_t §[expr.add]/6:
When two pointers to elements of the same array object are subtracted, the result is the difference of the subscripts of the two array elements. The type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header (18.2).
That does not, however, mean that addition is defined in terms of ptrdiff_t. Rather the contrary, for addition only one conversion is specified (§[expr.add]/1):
The usual arithmetic conversions are performed for operands of arithmetic or enumeration type.
The "usual arithmetic conversions" are defined in §[expr]/10. This includes only one conversion from unsigned type to signed type:
Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, the operand with unsigned integer type shall be converted to the type of the operand with signed integer type.
So, while there may be some room for question about exactly what type the size_t will be converted to (and whether it's converted at all), there's no question on one point: the only way it can be converted to a ptrdiff_t is if all its values can be represented without change as a ptrdiff_t.
So, given:
size_t N;
T *p;
...the expression p + N will never fail because of some (imagined) conversion of N to a ptrdiff_t before the addition takes place.
Since §13.6 is being mentioned, perhaps it's best to back up and look carefully at what §13.6 really is:
The candidate operator functions that represent the built-in operators defined in Clause 5 are specified in this subclause. These candidate functions participate in the operator overload resolution process as described in 13.3.1.2 and are used for no other purpose.
[emphasis added]
In other words, the fact that §13.6 defines an operator that adds a ptrdiff_t to a pointer does not mean that when any other integer type is added to a pointer, it's first converted to a ptrdiff_t, or anything like that. More generally, the operators defined in §13.6 are never used to carry out any arithmetic operations.
With that, and the rest of the text you quoted from §[expr.add], we can quickly conclude that adding a size_t to a pointer can overflow if and only if there aren't that many elements in the array after the pointer.
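A small sketch of that conclusion follows. The helper name advance_by is hypothetical, and the caller is assumed to guarantee that the array has at least count elements after first. The static_assert only confirms that the expression has the pointer's type; it is not meant as proof of how any compiler evaluates the addition internally.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Adds an unsigned element count to a pointer. The caller must guarantee
// that `first` points into an array with at least `count` elements after it.
char* advance_by(char* first, std::size_t count)
{
    // The result has the type of the pointer operand.
    static_assert(std::is_same<decltype(first + count), char*>::value,
                  "pointer + size_t yields the pointer type");
    return first + count;
}
```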
Given the above, one more question probably occurs to you. If I have code like this:
char *p = huge_array;
size_t N = sizeof(huge_array);
char *p2 = p + N;
ptrdiff_t diff = p2 - p;
...is it possible that the final subtraction will overflow? The short and simple answer to that is: Yes, it can.
I've always wondered: isn't ptrdiff_t supposed to be able to hold the difference of any two pointers by definition? How come it fails when the two pointers are too far? (I'm not pointing at any particular language... I'm referring to all languages which have this type.)
(e.g. subtract the pointer with address 1 from the byte pointer with address 0xFFFFFFFF when you have 32-bit pointers, and it overflows the sign bit...)
No, it is not.
§5.7 [expr.add] (from n3225 - C++0x FCD)
When two pointers to elements of the same array object are subtracted, the result is the difference of the subscripts of the two array elements. The type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header (18.2). As with any other arithmetic overflow, if the result does not fit in the space provided, the behavior is undefined.
In other words, if the expressions P and Q point to, respectively, the i-th and j-th elements of an array object, the expression (P)-(Q) has the value i − j provided the value fits in an object of type std::ptrdiff_t. Moreover, if the expression P points either to an element of an array object or one past the last element of an array object, and the expression Q points to the last element of the same array object, the expression ((Q)+1)-(P) has the same value as ((Q)-(P))+1 and as -((P)-((Q)+1)), and has the value zero if the expression P points one past the last element of the array object, even though the expression (Q)+1 does not point to an element of the array object. Unless both pointers point to elements of the same array object, or one past the last element of the array object, the behavior is undefined.
Note the number of times undefined appears in the paragraph. Also note that you can only subtract pointers if they point within the same object.
No, because there is no such thing as the difference between "any two pointers". You can only subtract pointers to elements of the same array (or the pointer to the location just past the end of an array).
To add a more explicit standard quote, ISO 9899:1999 §J.2/1 states:
The behavior is undefined in the following circumstances:
[...]
-- The result of subtracting two pointers is not representable in an object of type
ptrdiff_t (6.5.6).
It is entirely acceptable for ptrdiff_t to be the same size as pointer types, provided the overflow semantics are defined by the compiler so that any difference is still representable. There is no guarantee that a negative ptrdiff_t means that the second pointer lives at a lower address in memory than the first. (ptrdiff_t itself is required to be a signed integer type in both C and C++.)
Over/underflow is mathematically well-defined for fixed-size integer arithmetic:
(1 - 0xFFFFFFFF) % (1<<32) =
(1 + -0xFFFFFFFF) % (1<<32) =
1 + (-0xFFFFFFFF % (1<<32)) = 2
This is the correct result!
Specifically, the result after over/underflow is an alias of the correct integer. In fact, every non-representable integer is aliased (indistinguishable) with one representable integer: count to infinity in fixed-size integers, and you will repeat yourself, round and round like the dial of an analog clock.
An N-bit integer represents any real integer modulo 2^N. In C-like notation, modulo 2^N is written as %(1<<N); the example above uses N = 32.
I believe C guarantees mathematical correctness of over/underflow, but only for unsigned integers. Signed under/overflow is assumed to never happen (for the sake of optimization).
In practice, signed integers are two's complement, which makes no difference in addition or subtraction, so correct under/overflow behavior is guaranteed for signed integers too (although not by C).
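The unsigned case of the worked example above can be reproduced in C++ with a fixed-width unsigned type, for which wrap-around is defined. A minimal sketch, with the hypothetical function name wrapping_difference:

```cpp
#include <cassert>
#include <cstdint>

// Computes a - b modulo 2^32. Converting the result back to std::uint32_t
// reduces it modulo 2^32 even on a platform where the operands are first
// promoted to a wider int.
std::uint32_t wrapping_difference(std::uint32_t a, std::uint32_t b)
{
    return static_cast<std::uint32_t>(a - b);
}
```

With a = 1 and b = 0xFFFFFFFF this returns 2, matching the calculation above.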