Signed overflow is undefined. Unsigned overflow is defined as modulo arithmetic.
So my question is, is the following defined or undefined:
#include <cassert>
#include <cstdint>
#include <cstdio>
struct X { int x; /* ... anything ... */ };
X array[3] = { 2, 3, 4 /* or other init values that are compatible with X */ };
X* element = array + 1;
std::uintptr_t plus1 = 1;
std::uintptr_t minus1 = 0 - plus1;
int main()
{
    printf("%p\n%p\n", (void*)(element + plus1), (void*)(array + 2));
    printf("\n");
    printf("%p\n%p\n", (void*)(element + minus1), (void*)array);
    assert(element + plus1 == array + 2);
    assert(element + minus1 == array);
}
Though I state plus1/minus1, I really mean any +/- value. If I understand it correctly, this should work. Am I correct?
Pointer arithmetic is defined in the abstract machine.
In the abstract machine, ptr + x is only valid if ptr points into (or one past the end of) some object and the result still lies within that same object (or one past its end).
This abstract machine does not care about the specific size of pointers or signed or unsigned integers. In this abstract machine, signed and unsigned integers have values that are real integers, or unspecified values.
With a 32-bit uintptr_t, minus1 is equal to 0xffffffff, a large positive integer.
Does element point within an object that is large enough that 0xffffffff*sizeof(X) later it is still within the object? No it does not.
So element+minus1 is an undefined operation. Anything can happen.
On your hardware, a naive interpretation of pointer arithmetic into machine code may result in it wrapping around. But relying on this isn't safe.
For one thing, optimizers sometimes like proving things. If they prove element is greater than array's address, then no unsigned addition to element can possibly make it equal to array. So the compiler could optimize element+unsigned value == array to false.
Such optimizations can occur when you change optimization settings or upgrade your compiler, or due to completely innocuous things like changing where the code is inlined, what other code gets inlined, heuristics at link-time optimization, or the phase of the moon changing.
Relying on it working is dangerous, even when it does, as you now become responsible for auditing not the source code, but the machine code it generates.
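To make the hazard concrete, here is a hypothetical sketch (not code from the question) of the comparison such an optimizer is entitled to fold. Evaluating element + minus1 is itself UB, which is exactly why the observable result can change with optimization settings:

#include <cstdint>
#include <cstdio>

int main()
{
    int array[3] = {2, 3, 4};
    int* element = array + 1;
    std::uintptr_t minus1 = std::uintptr_t(0) - 1;

    // UB: element + minus1 leaves the array in the abstract machine,
    // even if the hardware address computation would wrap around.
    // A compiler that proves element > array may fold this to false.
    if (element + minus1 == array)
        std::puts("wrapped back to array[0]");
    else
        std::puts("comparison folded, or the wrap did not occur");
}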
Pointer arithmetic is only well defined as long as the pointers involved remain within a single array object or just past the end of the array. So technically, the expression element + minus1 is undefined behavior -- because minus1 is a very large value, it runs past the end of the array.
Now in practice, this is likely to work, but it is still technically undefined.
std::uintptr_t is an unsigned integer type.
std::intptr_t is a signed integer type.
So overflow of std::uintptr_t is defined (it wraps around), whereas overflow of std::intptr_t leads to UB.
In addition, pointer arithmetic is only valid within an array (up to one past the end of the array).
minus1 is the greatest number that std::uintptr_t can hold.
From http://en.cppreference.com/w/cpp/language/operator_arithmetic
If the pointer P points to the ith element of an array, then the expressions P+n, n+P, and P-n are pointers of the same type that point to the i+nth, i+nth, and i-nth element of the same array, respectively. The result of pointer addition may also be a one-past-the-end pointer (that is, pointer P such that the expression P-1 points to the last element of the array). Any other situations (that is, attempts to generate a pointer that isn't pointing at an element of the same array or one past the end) invoke undefined behavior.
element + minus1 is not the same as element - 1.
element + minus1 is outside the valid range of array, and so leads to UB.
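For completeness, the well-defined way to step backwards is a signed offset (or plain subtraction), which keeps the result inside the array. A minimal sketch:

#include <cassert>
#include <cstddef>

int main()
{
    int array[3] = {2, 3, 4};
    int* element = array + 1;

    std::ptrdiff_t minus1 = -1;        // signed offset
    assert(element + minus1 == array); // stays within the array: well defined
    assert(element - 1 == array);      // equivalent, also well defined
}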
This is a follow-up to another question. I was under the assumption that the pointer arithmetic I originally used would cause undefined behavior. However, I was told by a colleague that the usage is actually well defined. The following is a simplified example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

typedef struct StructA {
    int a;
} StructA;

typedef struct StructB {
    StructA a;
    StructA* b;
} StructB;
int main() {
    StructB* original = (StructB*)malloc(sizeof(StructB));
    original->a.a = 5;
    original->b = &original->a;

    StructB* copy = (StructB*)malloc(sizeof(StructB));
    memcpy(copy, original, sizeof(StructB));
    free(original);

    ptrdiff_t offset = (char*)copy - (char*)original;
    StructA* a = (StructA*)((char*)(copy->b) + offset);
    printf("%i\n", a->a);
    free(copy);
}
According to §5.7 ¶5 of the C++11 spec:
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
I assumed that the following part of the code:
ptrdiff_t offset = (char*)copy - (char*)original;
StructA* a = (StructA*)((char*)(copy->b) + offset);
causes undefined behavior, since it:
subtracts two pointers that point into different arrays, and
produces a pointer that no longer points into the original array.
Does this cause undefined behavior, or do I misinterpret the C++ specification? Does the same apply in C as well?
Edit:
Following the comments, I assume the following modification would still be undefined behavior, because it uses an object after its lifetime has ended:
ptrdiff_t offset = (char*)(copy->b) - (char*)original;
StructA* a = (StructA*)((char*)copy + offset);
Would it be defined when working with indexes instead:
typedef struct StructB {
    StructA a;
    ptrdiff_t b_offset;
} StructB;

int main() {
    StructB* original = (StructB*)malloc(sizeof(StructB));
    original->a.a = 5;
    original->b_offset = (char*)&(original->a) - (char*)original;

    StructB* copy = (StructB*)malloc(sizeof(StructB));
    memcpy(copy, original, sizeof(StructB));
    free(original);

    StructA* a = (StructA*)((char*)copy + copy->b_offset);
    printf("%i\n", a->a);
    free(copy);
}
It is undefined behavior because there are severe restrictions on what can be done with pointer arithmetic. The edits that you have made and that were suggested do nothing to fix this.
Undefined Behavior in Addition
StructA* a = (StructA*)((char*)copy + offset);
First of all, this is undefined behavior due to the addition onto copy:
When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.
(4.1) If P evaluates to a null pointer value and J evaluates to 0, the result is a null pointer value.
(4.2) Otherwise, if P points to an array element i of an array object x with n elements ([dcl.array]), the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) array element i+j of x if 0 ≤ i + j ≤ n and the expression P - J points to the (possibly-hypothetical) array element i − j of x if 0 ≤ i − j ≤ n.
(4.3) Otherwise, the behavior is undefined.
See https://eel.is/c++draft/expr.add#4
In short, performing pointer arithmetic on non-arrays and non-null-pointers is always undefined behavior. Even if copy or its members were arrays, adding onto a pointer so that it ends up either
two or more past the end of the array, or
at least one before the first element
is also undefined behavior.
Undefined Behavior in Subtraction
ptrdiff_t offset = (char*)original - (char*)(copy->b);
The subtraction of your two pointers is also undefined behavior:
When two pointer expressions P and Q are subtracted, the type of the result is an implementation-defined signed integral type; [...]
(5.1) If P and Q both evaluate to null pointer values, the result is 0.
(5.2) Otherwise, if P and Q point to, respectively, array elements i and j of the same array object x, the expression P - Q has the value i − j.
(5.3) Otherwise, the behavior is undefined.
See https://eel.is/c++draft/expr.add#5
So subtracting pointers from one another, when they are not both null or pointers to elements of the same array is undefined behavior.
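For contrast, here is a minimal sketch of the defined case, where both pointers refer to the same array object (one past the end included):

#include <cstddef>
#include <cstdio>

int main()
{
    int arr[8] = {};
    int* p = arr + 6;
    int* q = arr + 2;

    std::ptrdiff_t d = p - q;           // well defined: same array, value 4
    std::ptrdiff_t e = (arr + 8) - arr; // one past the end is also allowed, value 8
    std::printf("%td %td\n", d, e);
}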
Undefined Behavior in C
The C standard has similar restrictions:
(8) [...] If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression.
(The same paragraph goes on to require that both the pointer operand and the result stay within, or one past the end of, the same array object; otherwise, the behavior is undefined.)
(9) When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object; [...]
See §6.5.6 Additive Operators in the C11 standard (n1570).
Using Data Member Pointers Instead
A clean and type-safe solution in C++ would be to use data member pointers.
#include <cstdio>
#include <cstdlib>
#include <cstring>

typedef struct StructA { int a; } StructA; /* as in the question */

typedef struct StructB {
    StructA a;
    StructA StructB::*b_offset;
} StructB;

int main() {
    StructB* original = (StructB*) malloc(sizeof(StructB));
    original->a.a = 5;
    original->b_offset = &StructB::a;

    StructB* copy = (StructB*) malloc(sizeof(StructB));
    memcpy(copy, original, sizeof(StructB));
    free(original);

    printf("%i\n", (copy->*(copy->b_offset)).a);
    free(copy);
}
Notes
The standard citations are from a C++ draft. The C++11 standard which you have cited does not appear to have any looser restrictions on pointer arithmetic; it is just formatted differently. See the C++11 standard (n3337).
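As a C-compatible alternative to member pointers, one can store a plain integer offset obtained with offsetof, which involves no pointer subtraction and never touches the freed original. A hedged sketch (illustrative only; error handling omitted as in the question):

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

typedef struct StructA { int a; } StructA;

typedef struct StructB {
    StructA a;
    size_t b_offset; /* an offset into the struct, not a pointer */
} StructB;

int main() {
    StructB* original = (StructB*)malloc(sizeof(StructB));
    original->a.a = 5;
    original->b_offset = offsetof(StructB, a); /* a constant, no pointer math */

    StructB* copy = (StructB*)malloc(sizeof(StructB));
    memcpy(copy, original, sizeof(StructB));
    free(original);

    /* Rebuilt relative to copy only; original is never used again. */
    StructA* a = (StructA*)((char*)copy + copy->b_offset);
    printf("%i\n", a->a);
    free(copy);
}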
The Standard explicitly provides that in situations it characterizes as Undefined Behavior, implementations may behave "in a documented fashion characteristic of the environment". According to the Rationale, the intention of such characterization was, among other things, to identify avenues of "conforming language extension"; the question of when implementations support such "popular extensions" was a Quality of Implementation issue best left to the marketplace.
Many implementations intended and/or configured for low-level programming on commonplace platforms extend the language by specifying that the following equivalences hold, for any pointers p and q of type T* and integer expression i:
The bit patterns of p, (uintptr_t)p, and (intptr_t)p are identical.
p+i is equivalent to (T*)((uintptr_t)p + (uintptr_t)i * sizeof (T))
p-i is equivalent to (T*)((uintptr_t)p - (uintptr_t)i * sizeof (T))
p-q is equivalent to ((uintptr_t)p - (uintptr_t)q) / sizeof (T) in all cases where the division would have no remainder.
p>q is equivalent to (uintptr_t)p > (uintptr_t)q and likewise for all other relational and comparison operators.
The Standard does not recognize any category of implementations that always uphold those equivalences, as distinct from those that do not, in part because the authors did not wish to portray as "inferior" implementations for unusual platforms where upholding such equivalences would be impractical. Instead, the expectation was that such equivalences would be upheld on implementations where that made sense, and that programmers would know when they were targeting such implementations. Someone writing memory-management code for the 68000, or for small-model 8086 (where such equivalences would naturally hold), could write code that would run interchangeably on other systems where those equivalences hold, but someone writing memory-management code for large-model 8086 would need to design it explicitly for that platform, because those equivalences do not hold there (pointers are 32 bits, but individual objects are limited to 65520 bytes and most pointer operations only act upon the bottom 16 bits of a pointer).
Unfortunately, even on platforms where such equivalences would normally hold, some kinds of optimizations may yield corner-case behaviors that differ from those otherwise implied by those equivalences. Commercial compilers generally uphold the Spirit of C principle "don't prevent the programmer from doing what needs to be done", and can be configured to uphold the equivalences even when most optimizations are enabled. The gcc and clang C compilers, however, don't allow such control over semantics. When all optimizations are disabled, they will uphold those equivalences on commonplace platforms, but there is no other optimization setting that will prevent them from making inferences that would be inconsistent with them.
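For illustration only, this is what the equivalences above look like in code. It is a sketch that is meaningful only on implementations documenting flat-memory pointer/integer round-tripping; it is not portable C++:

#include <cstdint>
#include <cstdio>

int main()
{
    int buf[4] = {10, 20, 30, 40};
    int* p = buf;

    // p + 2 expressed through integer arithmetic, as the equivalences suggest.
    // Implementation-specific: relies on documented uintptr_t behavior.
    int* q = (int*)((std::uintptr_t)p + 2u * sizeof(int));

    std::printf("%d\n", *q); // prints 30 on such implementations
}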
Consider the following code:
int* p1 = new int[100];
int* p2 = new int[100];
const ptrdiff_t ptrDiff = p1 - p2;
int* p1_42 = &(p1[42]);
int* p2_42 = p1_42 + ptrDiff;
Now, does the Standard guarantee that p2_42 points to p2[42]? If not, is it always true on Windows, Linux or webassembly heap?
To add the standard quote:
expr.add#5
When two pointer expressions P and Q are subtracted, the type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header ([support.types]).
(5.1)
If P and Q both evaluate to null pointer values, the result is 0.
(5.2)
Otherwise, if P and Q point to, respectively, elements x[i] and x[j] of the same array object x, the expression P - Q has the value i−j.
(5.3)
Otherwise, the behavior is undefined.
[ Note: If the value i−j is not in the range of representable values of type std::ptrdiff_t, the behavior is undefined. — end note ]
(5.1) does not apply as the pointers are not nullptrs. (5.2) does not apply because the pointers are not into the same array. So, we are left with (5.3) - UB.
const ptrdiff_t ptrDiff = p1 - p2;
This is undefined behavior. Subtraction between two pointers is well defined only if they point to elements in the same array. ([expr.add] ¶5.3).
When two pointer expressions P and Q are subtracted, the type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header ([support.types]).
If P and Q both evaluate to null pointer values, the result is 0.
Otherwise, if P and Q point to, respectively, elements x[i] and x[j] of the same array object x, the expression P - Q has the value i−j.
Otherwise, the behavior is undefined
And even if there were some hypothetical way to obtain this value legally, even that summation would be illegal, as even a pointer+integer summation is restricted to stay inside the boundaries of the array ([expr.add] ¶4.2):
When an expression J that has integral type is added to or subtracted from an expression P of pointer type, the result has the type of P.
If P evaluates to a null pointer value and J evaluates to 0, the result is a null pointer value.
Otherwise, if P points to element x[i] of an array object x with n elements,81 the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) element x[i+j] if 0≤i+j≤n and the expression P - J points to the (possibly-hypothetical) element x[i−j] if 0≤i−j≤n.
Otherwise, the behavior is undefined.
The third line (the subtraction that initializes ptrDiff) is Undefined Behavior, so the Standard allows anything after that.
It's only legal to subtract two pointers pointing to (or after) the same array.
Windows or Linux aren't really relevant; compilers, and especially their optimizers, are what break your program. For instance, an optimizer might recognize that p1 and p2 each point to the beginning of an int[100], so it may conclude that p1 - p2 "has to be" 0.
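If the underlying goal is "the element at the same position in the other array", the portable formulation carries an index (a plain integer) rather than a pointer difference. A minimal sketch:

#include <cassert>

int main()
{
    int* p1 = new int[100];
    int* p2 = new int[100];

    int idx = 42;           // an integer index involves no cross-array arithmetic
    int* p1_42 = p1 + idx;  // well defined: stays within p1's array
    int* p2_42 = p2 + idx;  // well defined: stays within p2's array

    assert(p1_42 == &p1[42] && p2_42 == &p2[42]);
    delete[] p1;
    delete[] p2;
}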
The Standard allows for implementations on platforms where memory is divided into discrete regions which cannot be reached from each other using pointer arithmetic. As a simple example, some platforms use 24-bit addresses that consist of an 8-bit bank number and a 16-bit address within a bank. Adding one to an address that identifies the last byte of a bank will yield a pointer to the first byte of that same bank, rather than the first byte of the next bank. This approach allows address arithmetic and offsets to be computed using 16-bit math rather than 24-bit math, but requires that no object span a bank boundary. Such a design would impose some extra complexity on malloc, and would likely result in more memory fragmentation than would otherwise occur, but user code wouldn't generally need to care about the partitioning of memory into banks.
Many platforms do not have such architectural restrictions, and some compilers which are designed for low-level programming on such platforms will allow address arithmetic to be performed between arbitrary pointers. The Standard notes that a common way of treating Undefined Behavior is "behaving during translation or program execution in a documented manner characteristic of the environment", and support for generalized pointer arithmetic in environments that support it would fit nicely under that category. Unfortunately, the Standard fails to provide any means of distinguishing implementations that behave in such useful fashion and those which don't.
As an example, consider the following structure:
struct S {
    int a[4];
    int b[4];
} s;
Would it be legal to write s.a[6] and expect it to be equal to s.b[2]?
Personally, I feel that it must be UB in C++, whereas I'm not sure about C.
However, I failed to find anything relevant in the standards of C and C++ languages.
Update
There are several answers suggesting ways to make sure there is no padding between fields in order to make the code work reliably. I'd like to emphasize that if such code is UB, then absence of padding is not enough. If it is UB, then the compiler is free to assume that accesses to s.a[i] and s.b[j] do not overlap, and the compiler is free to reorder such memory accesses. For example,
int x = s.b[2];
s.a[6] = 2;
return x;
can be transformed to
s.a[6] = 2;
int x = s.b[2];
return x;
which always returns 2.
Would it be legal to write s.a[6] and expect it to be equal to s.b[2]?
No, because accessing an array out of bounds invokes undefined behaviour in C and C++.
C11 J.2 Undefined behavior
Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that points just beyond the array object and is used as the operand of a unary * operator that is evaluated (6.5.6).
An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5]) (6.5.6).
C++ standard draft section 5.7 Additive operators paragraph 5 says:
When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression. [...] If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
Apart from the answer of @rsp (Undefined behavior for an array subscript that is out of range), I can add that it is not legal to access b via a because the C language does not specify how much padding space there may be between the end of the area allocated for a and the start of b. So even if you can run it on a particular implementation, it is not portable.
instance of struct:
+-----------+----------------+-----------+---------------+
| array a | maybe padding | array b | maybe padding |
+-----------+----------------+-----------+---------------+
The second padding may well be absent, since the alignment of the struct object is the alignment of a, which is the same as the alignment of b; but the C language does not forbid that padding from being there either.
a and b are two different arrays, and a is defined as containing 4 elements. Hence, a[6] accesses the array out of bounds and is therefore undefined behaviour. Note that array subscript a[6] is defined as *(a+6), so the proof of UB is actually given by the section "Additive operators" in conjunction with the rules for pointer arithmetic. See the following section of the C11 standard (e.g. this online draft version) describing this aspect:
6.5.6 Additive operators
When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression. In other words, if the expression P points to the i-th element of an array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N (where N has the value n) point to, respectively, the i+n-th and i-n-th elements of the array object, provided they exist. Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.
The same argument applies to C++ (though not quoted here).
Further, though it is clearly undefined behaviour due to exceeding the array bounds of a, note that the compiler might introduce padding between members a and b, such that, even if such pointer arithmetic were allowed, a+6 would not necessarily yield the same address as b+2.
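If code nonetheless depends on a and b being adjacent, the layout assumption can at least be made to fail loudly at compile time. A hedged sketch; note this only verifies the absence of padding, it does not make the out-of-bounds access legal:

#include <cstddef>

struct S {
    int a[4];
    int b[4];
};

// Compilation fails if the compiler inserts padding between a and b.
static_assert(offsetof(S, b) == offsetof(S, a) + sizeof(S::a),
              "padding between S::a and S::b");

int main() {}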
Is it legal? No. As others mentioned, it invokes Undefined Behavior.
Will it work? That depends on your compiler. That's the thing about undefined behavior: it's undefined.
On many C and C++ compilers, the struct will be laid out such that b will immediately follow a in memory and there will be no bounds checking. So accessing a[6] will effectively be the same as b[2] and will not cause any sort of exception.
Given
struct S {
    int a[4];
    int b[4];
} s;
and assuming no extra padding, the structure is really just a way of looking at a block of memory containing 8 integers. You could cast its address to (int*), and ((int*)&s)[6] would point to the same memory as s.b[2].
Should you rely on this sort of behavior? Absolutely not. Undefined means that the compiler doesn't have to support this. The compiler is free to pad the structure which could render the assumption that &(s.b[2]) == &(s.a[6]) incorrect. The compiler could also add bounds checking on the array access (although enabling compiler optimizations would probably disable such a check).
I have experienced the effects of this in the past. It's quite common to have a struct like this:
struct Bob {
    char name[16];
    char whatever[64];
} bob;
strcpy(bob.name, "some name longer than 16 characters");
Now bob.whatever will be " than 16 characters". (which is why you should always use strncpy, BTW)
As @MartinJames mentioned in a comment, if you need to guarantee that a and b are in contiguous memory, or at least able to be treated as such (unless your architecture/compiler uses an unusual memory block size/offset and forced alignment that would require padding to be added), you need to use a union.
union overlap {
    char all[8]; /* all the bytes in sequence */
    struct {     /* (anonymous struct so its members can be accessed directly) */
        char a[4]; /* padding may be added after this if the alignment is not a sub-factor of 4 */
        char b[4];
    };
};
You can't directly access b from a (e.g. a[6], like you asked), but you can access the elements of both a and b by using all (e.g. all[6] refers to the same memory location as b[2]).
(Edit: You could replace 8 and 4 in the code above with 2*sizeof(int) and sizeof(int), respectively, to be more likely to match the architecture's alignment, especially if the code needs to be more portable, but then you have to be careful to avoid making any assumptions about how many bytes are in a, b, or all. However, this will work on what are probably the most common (1-, 2-, and 4-byte) memory alignments.)
Here is a simple example:
#include <stdio.h>

union overlap {
    char all[2*sizeof(int)]; /* all the bytes in sequence */
    struct {                 /* anonymous struct so its members can be accessed directly */
        char a[sizeof(int)]; /* low word */
        char b[sizeof(int)]; /* high word */
    };
};

int main()
{
    union overlap testing;
    testing.a[0] = 'a';
    testing.a[1] = 'b';
    testing.a[2] = 'c';
    testing.a[3] = '\0'; /* null terminator */
    testing.b[0] = 'e';
    testing.b[1] = 'f';
    testing.b[2] = 'g';
    testing.b[3] = '\0'; /* null terminator */

    printf("a=%s\n", testing.a);     /* output: a=abc */
    printf("b=%s\n", testing.b);     /* output: b=efg */
    printf("all=%s\n", testing.all); /* output: all=abc */

    testing.a[3] = 'd'; /* makes printf keep reading past the end of a */
    printf("a=%s\n", testing.a);     /* output: a=abcdefg */
    printf("b=%s\n", testing.b);     /* output: b=efg */
    printf("all=%s\n", testing.all); /* output: all=abcdefg */
    return 0;
}
No, since accessing an array out of bounds invokes Undefined Behavior, both in C and C++.
Short Answer: No. You're in the land of undefined behavior.
Long Answer: No. But that doesn't mean that you can't access the data in other sketchier ways... if you're using GCC you can do something like the following (elaboration of dwillis's answer):
struct __attribute__((packed,aligned(4))) Bad_Access {
    int arr1[3];
    int arr2[3];
};
and then you could access via (Godbolt source+asm):
int x = ((int*)ba_pointer)[4];
But that cast violates strict aliasing so is only safe with g++ -fno-strict-aliasing. You can cast a struct pointer to a pointer to the first member, but then you're back in the UB boat because you're accessing outside the first member.
Alternatively, just don't do that. Save a future programmer (probably yourself) the heartache of that mess.
Also, while we're at it, why not use std::vector? It's not fool-proof, but on the back-end it has guards to prevent such bad behavior.
Addendum:
If you're really concerned about performance:
Let's say you have two same-typed pointers that you're accessing. The compiler will more than likely assume that the two pointers may alias, and will emit additional logic to protect you from doing something dumb.
If you solemnly swear to the compiler that you're not trying to alias, the compiler will reward you handsomely:
Does the restrict keyword provide significant benefits in gcc / g++
Conclusion: Don't be evil; your future self, and the compiler will thank you.
Jed Schaff’s answer is on the right track, but not quite correct. If the compiler inserts padding between a and b, his solution will still fail. If, however, you declare:
typedef struct {
    int a[4];
    int b[4];
} s_t;

typedef union {
    char bytes[sizeof(s_t)];
    s_t s;
} u_t;
You may now access (int*)(bytes + offsetof(s_t, b)) to get the address of s.b, no matter how the compiler lays out the structure. The offsetof() macro is declared in <stddef.h>.
The expression sizeof(s_t) is a constant expression, legal in an array declaration in both C and C++. It will not give a variable-length array. (Apologies for misreading the C standard before. I thought that sounded wrong.)
In the real world, though, two consecutive arrays of int in a structure are going to be laid out the way you expect. (You might be able to engineer a very contrived counterexample by setting the bound of a to 3 or 5 instead of 4 and then getting the compiler to align both a and b on a 16-byte boundary.) Rather than resorting to convoluted methods so that the program makes no assumptions whatsoever beyond the strict wording of the standard, you want some kind of defensive coding, such as static_assert(&both_arrays[4] == &s.b[0], "");. These add no run-time overhead and will fail if your compiler is doing something that would break your program, so long as you don't trigger UB in the assertion itself.
If you want a portable way to guarantee that both sub-arrays are packed into a contiguous memory range, or split a block of memory the other way, you can copy them with memcpy().
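For instance, a minimal sketch of packing the two sub-arrays into one contiguous buffer with memcpy(), after which indexing across the combined range is well defined:

#include <cstring>

struct S {
    int a[4];
    int b[4];
};

int main()
{
    S s = {{1, 2, 3, 4}, {5, 6, 7, 8}};

    int flat[8]; // a single array object, so flat[0..7] is all in bounds
    std::memcpy(flat, s.a, sizeof s.a);
    std::memcpy(flat + 4, s.b, sizeof s.b);

    return flat[6] == s.b[2] ? 0 : 1; // flat[6] mirrors s.b[2], well defined
}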
The Standard does not impose any restrictions upon what implementations must do when a program tries to use an out-of-bounds array subscript in one structure field to access a member of another. Out-of-bounds accesses are thus "illegal" in strictly conforming programs, and programs which make use of such accesses cannot simultaneously be 100% portable and free of errors. On the other hand, many implementations do define the behavior of such code, and programs which are targeted solely at such implementations may exploit such behavior.
There are three issues with such code:
While many implementations lay out structures in predictable fashion, the Standard allows implementations to add arbitrary padding before any structure member other than the first. Code could use sizeof or offsetof to ensure that structure members are placed as expected, but the other two issues would remain.
Given something like:
if (structPtr->array1[x])
    structPtr->array2[y]++;
return structPtr->array1[x];
it would normally be useful for a compiler to assume that the use of structPtr->array1[x] will yield the same value as the preceding use in the "if" condition, even though it would change the behavior of code that relies upon aliasing between the two arrays.
If array1[] has e.g. 4 elements, a compiler given something like:
if (x < 4) foo(x);
structPtr->array1[x]=1;
might conclude that since there would be no defined cases where x isn't less than 4, it could call foo(x) unconditionally.
Unfortunately, while programs can use sizeof or offsetof to ensure that there aren't any surprises with struct layout, there's no way by which they can test whether compilers promise to refrain from the optimizations of types #2 or #3. Further, the Standard is a little vague about what would be meant in a case like:
struct foo { char array1[4], array2[4]; };

int test(struct foo *p, int i, int x, int y, int z)
{
    if (p->array2[x])
    {
        ((char*)p)[x]++;
        ((char*)(p->array1))[y]++;
        p->array1[z]++;
    }
    return p->array2[x];
}
The Standard is pretty clear that behavior would only be defined if z is in the range 0..3, but since the type of p->array1 in that expression is char* (due to decay) it's not clear the cast in the access using y would have any effect. On the other hand, since converting a pointer to the first element of a struct to char* should yield the same result as converting a struct pointer to char*, and the converted struct pointer should be usable to access all bytes therein, it would seem the access using x should be defined for (at minimum) x=0..7 [if the offset of array2 is greater than 4, it would affect the value of x needed to hit members of array2, but some value of x could do so with defined behavior].
IMHO, a good remedy would be to define the subscript operator on array types in a fashion that does not involve pointer decay. In that case, the expressions p->array1[x] and &(p->array1[x]) could invite a compiler to assume that x is 0..3, but p->array1+x and *(p->array1+x) would require a compiler to allow for the possibility of other values. I don't know if any compilers do that, but the Standard doesn't require it.
I think I understand the semantics of pointer arithmetic fairly well, but I only ever see examples when dealing with arrays. Does it have any other uses that can't be achieved by less opaque means? I'm sure you could find a way with clever casting to use it to access members of a struct, but I'm not sure why you'd bother. I'm mostly interested in C, but I'll tag with C++ because the answer probably applies there too.
Edit, based on answers received so far: I know pointers can be used in many non-array contexts. I'm specifically wondering about arithmetic on pointers, e.g. incrementing, taking a difference, etc.
Pointer arithmetic by definition in C happens only on arrays. However, as every object has a representation consisting of an overlaid unsigned char [sizeof object] array, it's also valid to perform pointer arithmetic on this representation. For example:
struct foo {
    int a, b, c;
} bar;

/* Equivalent to: bar.c = 1; (offsetof comes from <stddef.h>) */
*(int *)((unsigned char *)&bar + offsetof(struct foo, c)) = 1;
Actually char * would work just as well.
If you follow the language standard to the letter, then pointer arithmetic is only defined when pointing to an array, and not in any other case.
A pointer may point to any element of an array, or one step past the end of the array.
From the top of my head I know it's used in XOR linked-lists (very nifty) and I've seen it used in very hacky recursions.
On the other hand, it's very hard to find uses, since according to the standard pointer arithmetic is only defined within the bounds of an array.
a[n] is "just" syntactic sugar for *(a + n). For lulz, try the following
int a[2];
0[a] = 10;
1[a] = 20;
So one could argue that indexing and pointer arithmetic are merely interchangeable syntax.
Pointer arithmetic is only defined on arrays. Adding an integer to a pointer that does not point to an array element produces undefined behavior.
In embedded systems, pointers are used to represent addresses or locations. There may not be an array defined. (Although one could say that all of memory is one huge array.)
For example, a stack (holding variables and addresses) is manipulated by adding or subtracting values from the stack pointer. (In this case, the stack could be said to be an array based stack.)
Here's a case for pointer arithmetic outside of (strictly defined) arrays:
double d = 0.5;
unsigned char *bytes = (void *)&d;
for(size_t i = 0; i < sizeof d; i++)
printf("Byte %zu of d is %hhu\n", i, bytes[i]);
Why would you do this? I don't know. But if you want to look at the bitwise representation of an object (useful for things like memcpy and memcmp), you'll need to cast their addresses to unsigned char *s (or signed char *s if you like) and work with them byte-by-byte. (If your task isn't too difficult you can even write the code to work word-by-word, which most memcpy implementations will do. It's the same principle, though, just replace char with int32_t.)
Note that, in the standard, the exact values (or the number of values) that are printed are implementation-defined, but that this will always work as a way to access an object's internal bytewise representation. (It is not required to work for larger integer types, but almost always will - no processor I know of has had trap representations for integers in quite some time).
I've always wondered: isn't ptrdiff_t supposed to be able to hold the difference of any two pointers by definition? How come it fails when the two pointers are too far? (I'm not pointing at any particular language... I'm referring to all languages which have this type.)
(e.g. subtract the pointer with address 1 from the byte pointer with address 0xFFFFFFFF when you have 32-bit pointers, and it overflows the sign bit...)
No, it is not.
$5.7 [expr.add] (from n3225 - C++0x FCD)
When two pointers to elements of the same array object are subtracted, the result is the difference of the subscripts of the two array elements. The type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header (18.2). As with any other arithmetic overflow, if the result does not fit in the space provided, the behavior is undefined.
In other words, if the expressions P and Q point to, respectively, the i-th and j-th elements of an array object, the expression (P)-(Q) has the value i − j provided the value fits in an object of type std::ptrdiff_t. Moreover, if the expression P points either to an element of an array object or one past the last element of an array object, and the expression Q points to the last element of the same array object, the expression ((Q)+1)-(P) has the same value as ((Q)-(P))+1 and as -((P)-((Q)+1)), and has the value zero if the expression P points one past the last element of the array object, even though the expression (Q)+1 does not point to an element of the array object. Unless both pointers point to elements of the same array object, or one past the last element of the array object, the behavior is undefined.
Note the number of times undefined appears in the paragraph. Also note that you can only subtract pointers if they point within the same object.
No, because there is no such thing as the difference between "any two pointers". You can only subtract pointers to elements of the same array (or the pointer to the location just past the end of an array).
To add a more explicit standard quote, ISO 9899:1999 §J.2/1 states:
The behavior is undefined in the following circumstances:
[...]
-- The result of subtracting two pointers is not representable in an object of type ptrdiff_t (6.5.6).
It is entirely acceptable for ptrdiff_t to be the same size as pointer types, provided the overflow semantics are defined by the compiler so that any difference is still representable. There is no guarantee that a negative ptrdiff_t means that the second pointer lives at a lower address in memory than the first. (Note that the standard does require ptrdiff_t to be a signed integer type.)
Over/underflow is mathematically well-defined for fixed-size integer arithmetic:
(1 - 0xFFFFFFFF) % (1<<32) =
(1 + -0xFFFFFFFF) % (1<<32) =
1 + (-0xFFFFFFFF % (1<<32)) = 2
This is the correct result!
Specifically, the result after over/underflow is an alias of the correct integer. In fact, every non-representable integer is aliased with (indistinguishable from) one representable integer: count to infinity in fixed-size integers, and you will repeat yourself, round and round like the dial of an analog clock.
An N-bit integer represents any real integer modulo 2^N. (The % (1<<32) above is mathematical shorthand for reduction modulo 2^32; in actual C code, 1<<32 would itself overflow a 32-bit int.)
I believe C guarantees mathematical correctness of over/underflow, but only for unsigned integers. Signed under/overflow is assumed never to happen (for the sake of optimization).
In practice, signed integers are two's complement, which makes no difference for addition or subtraction, so correct under/overflow behavior holds for signed integers as well (although the C standard does not guarantee it).
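A quick compile-time check of the arithmetic above, using a fixed-width unsigned type, for which the language does guarantee modular wraparound:

#include <cstdint>

// 1 - 0xFFFFFFFF wraps to 2 modulo 2^32. The outer cast guards against
// integer promotion on platforms where int is wider than 32 bits.
static_assert(std::uint32_t(std::uint32_t(1) - std::uint32_t(0xFFFFFFFFu)) == 2u,
              "unsigned over/underflow is modular");

int main() {}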