C++ Undefined Behavior when subtracting pointers

The C++ standard says that subtracting pointers that do not point into the same array is UB:
int a, b;
&a - &b; // UB, since pointers don't point to the same array.
But if both pointers are cast to uintptr_t, then both expressions are no longer pointer expressions, and subtracting them seems to be legal from the Standard's perspective:
int a, b;
reinterpret_cast<uintptr_t>(&a) - reinterpret_cast<uintptr_t>(&b);
Is this correct or am I missing something?

The subtraction of the integers is legal in the sense that behaviour is not undefined.
But the standard technically gives no guarantees about the values of the converted integers, and consequently you have no guarantees about the value of the result of the subtraction either (except in relation to those unspecified integer values). You could get a large value or a small value (or, if the pointers had non-interconvertible types, subtracting the converted integers of two different objects could even yield zero), depending on the system.
Furthermore, if you did some other pointer arithmetic that should result in a pointer value, i.e. converted a pointer to an integer and added an offset, then there is technically no guarantee that converting the result of the addition back to a pointer type would yield the pointer value at that offset from the original. It would probably work (assuming there actually exists an object of the correct type at that address), except perhaps on systems using segmented memory or something more exotic.
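For illustration, a minimal sketch of the conversion described in the question; the subtraction itself is well-defined to evaluate, but the numeric result carries no portable meaning:
#include <cstdint>   // std::uintptr_t (optional type; assumed to be available here)
#include <iostream>

int main() {
    int a = 0, b = 0;

    // &a - &b would be undefined behaviour: the pointers are not into the same array.
    // Converting to integers first is not UB, but the integer values themselves are
    // implementation-specific, so the difference carries no portable meaning.
    std::uintptr_t ua = reinterpret_cast<std::uintptr_t>(&a);
    std::uintptr_t ub = reinterpret_cast<std::uintptr_t>(&b);

    std::cout << ua - ub << '\n';  // some unsigned value; could be "large" or "small"
}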

You are correct.
The behaviour on subtracting pointers that do not point to elements of the same array is undefined. For this purpose a pointer is allowed to be one beyond the final element of an array, and an object counts as an array of length one.
But once you cast a pointer to a suitable type, e.g. to std::uintptr_t (assuming your compiler supports it; it doesn't have to), you can apply any arithmetic you want to it, subject to the constraints imposed on you by that type.
Although such rules may seem obtuse, they are related to a similar rule where you are not allowed to read a pointer that doesn't point to valid memory. All of this helps achieve greater portability of the language.

The UB allows an implementation to do anything. It does not prevent an implementation from simply computing the difference between the address values and dividing by the element size. But it also allows an implementation that checks whether both pointers point to elements of the same array (or one past its end) to raise an exception or crash.
Requiring both pointers to point into the same array allows the implementation to assume that any (correctly aligned) value between the two is also valid and points inside the same array.

Related

Is alignof(T*) the same for all possible types? What about sizeof(T*)?

Is alignof(T*) the same value for all possible types T? What about sizeof(T*)?
Please answer based on what is allowed/specified by the standard and not what is the current situation in different compilers.
The standard doesn't say much about sizes and alignments of pointers, and thus they are not strictly restricted by the language.
Conversion from one valid pointer-to-function type to another and back is guaranteed to produce the original value. As such, every pointer-to-function type must be able to represent at least as many distinct values as there are valid addresses for any pointer-to-function type, which gives a lower bound on the size of all pointer-to-function types.
Conversions between object pointer types carry a similar guarantee, which however only applies when the original pointed-to type has a stricter or equal alignment requirement. As a consequence, pointers to highly aligned types need fewer representable values; if the alignment is high enough, such a pointer type could in theory be smaller.
On systems where conversion between pointer to void and pointer to function is allowed (which is conditionally supported), the minimum number of representable values of pointers to functions and pointers to void must be the same.
But even so, a pointer type could have more bits than the number of values it needs to represent requires; the extra bits would simply be unused. This would not be very practical.
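As a quick illustration (purely what a given platform happens to do, not something the standard guarantees), one can simply print the sizes and alignments of a few pointer types:
#include <iostream>

struct S;  // an incomplete class type; pointers to it still have a fixed size

int main() {
    // On mainstream platforms these usually all print the same numbers,
    // but the standard does not require that.
    std::cout << sizeof(void*)     << ' ' << alignof(void*)     << '\n';
    std::cout << sizeof(char*)     << ' ' << alignof(char*)     << '\n';
    std::cout << sizeof(double*)   << ' ' << alignof(double*)   << '\n';
    std::cout << sizeof(S*)        << ' ' << alignof(S*)        << '\n';
    std::cout << sizeof(void(*)()) << ' ' << alignof(void(*)()) << '\n';  // pointer to function
}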

How does compiler know how to increment different pointers?

I understand that in general, pointers to any data type will have the same size: on a 16-bit system, normally 2 bytes, and on a 32-bit system, 4 bytes.
Depending on what the pointer points to, incrementing it will advance it by a different number of bytes, depending on whether it's a char pointer, a long pointer, etc.
My query is: how does the compiler know by how many bytes to increment a given pointer? Isn't it just a variable stored in memory like any other? Are pointers stored in some symbol table with information about how much they should be incremented by? Thanks
That is why there are data types. Each pointer variable has an associated data type, and that data type has a defined size (see the footnote about complete/incomplete types). Pointer arithmetic takes place based on that data type.
To add to that, for pointer arithmetic to happen, the pointer(s) should be (quoting the C11 standard)
pointer to a complete object type
So, the size of the "object" the pointer points to is known and defined.
Footnote: FWIW, that is why, pointer arithmetic on void pointers (incomplete type) is not allowed / defined in the standard. (Though GCC supports void pointer arithmetic via an extension.)
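A minimal sketch showing that scaling in action; the exact addresses are of course platform-specific:
#include <iostream>

int main() {
    int    i[2] = {};
    double d[2] = {};

    int*    pi = i;
    double* pd = d;

    // The compiler scales "+ 1" by the size of the pointed-to type,
    // which it knows from the pointer's static type.
    std::cout << static_cast<const void*>(pi) << " -> "
              << static_cast<const void*>(pi + 1) << '\n';  // addresses differ by sizeof(int)
    std::cout << static_cast<const void*>(pd) << " -> "
              << static_cast<const void*>(pd + 1) << '\n';  // addresses differ by sizeof(double)
}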
Re
” I understand that in general, pointers to any data type will have the same size
No. Different pointer sizes are unusual for ¹simple pointers to objects, but can occur on word-addressed machines. Then char* is the largest pointer, and void* is the same size.
C++14 §3.9.2/4
” An object of type cv void* shall have the same
representation and alignment requirements as cv char*.
All pointers to class type objects are, however, the same size. For example, you wouldn't be able to use an array of pointers to a base type if this weren't the case.
Re
” how does the compiler know by how many bytes to increment this pointer
It knows the size of the type of object pointed to.
If it does not know the size of the object pointed to, i.e. that type is incomplete, then you can't increment the pointer.
For example, if p is a void*, then you can't do ++p.
Notes:
¹ In addition to ordinary pointers to objects, there are function pointers, and pointers to members. The latter kind are more like offsets, that must be combined with some specification of a relevant object, to yield a reference.
The data type of the pointer variable determines by how many bytes it is incremented.
For example:
1) Incrementing a character pointer advances it by 1 byte.
2) Likewise, incrementing an int pointer advances it by sizeof(int) bytes, which is commonly 4 on both 32-bit and 64-bit systems.

Why does C++ have the array type?

I am learning C++. I found that a pointer can do the same job as an array: given a[4], a can be either a pointer or an array. But as C++ defines it, arrays of different lengths are different types. In fact, when we pass an array to a function, it is converted into a pointer automatically; I think that is further evidence that arrays could be replaced by pointers. So, my question is:
Why doesn't C++ replace all arrays with pointers?
In early C it was decided to represent the size of an array as part of its type, available via the sizeof operator. C++ has to be backward compatible with that. There's much wrong with C++ arrays, but having size as part of the type is not one of the wrong things.
Regarding
” a pointer can do the same job as an array: given a[4], a can be either a pointer or an array
no, this is just an implicit conversion, from array expression to pointer to first item of that array.
As weird as it sounds, C++ does not provide indexing of built-in arrays. There is indexing for pointers, and p[i] just means *(p+i) by definition, so you can also write that as *(i+p) and hence as i[p]. And thus also i[a], because it's really the pointer (that the array decays to) that is indexed. Weird indeed.
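A tiny, purely illustrative demonstration of that equivalence:
#include <iostream>

int main() {
    int a[4] = {10, 20, 30, 40};

    // a[2], *(a + 2) and 2[a] are all the same thing: a decays to a pointer,
    // and p[i] is defined as *(p + i), which is symmetric in p and i.
    std::cout << a[2] << ' ' << *(a + 2) << ' ' << 2[a] << '\n';  // prints "30 30 30"
}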
The implicit conversion, called a “decay”, loses information, and is one of the things that are wrong about C++ arrays.
The indexing of pointers is a second thing that's wrong (even if it makes a lot of sense at the assembly language and machine code level).
But it needs to continue to be that way for backward compatibility.
Why array decay is Bad™: it causes an array of T to often be represented by simply a pointer to T.
You can't see from such a pointer (e.g. as a formal argument) whether it points to a single T object or to the first item of an array of T.
But much worse, if T has a derived class TD, where sizeof(TD) > sizeof(T), and you form an array of TD, then you can pass that array to a formal argument that's pointer to T – because that array of TD decays to pointer to TD which converts implicitly to pointer to T. Now using that pointer to T as an array yields incorrect address computations, due to incorrect size assumption for the array items. And bang crash (if you're lucky), or perhaps just incorrect results (if you're not so lucky).
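A sketch of that failure mode, with hypothetical types T and TD invented just for illustration:
#include <iostream>

struct T  { int x; };
struct TD : T { int extra; };   // typically sizeof(TD) > sizeof(T)

// Looks as if it takes "an array of T", but really receives a pointer to one T.
void printFirstTwo(const T* p) {
    // p[1] assumes the next element starts sizeof(T) bytes further on,
    // which is wrong when p actually points into an array of TD.
    std::cout << p[0].x << ' ' << p[1].x << '\n';
}

int main() {
    TD arr[2];
    arr[0].x = 1;  arr[0].extra = 100;
    arr[1].x = 2;  arr[1].extra = 200;

    // Compiles without complaint: TD[2] decays to TD*, which converts
    // implicitly to const T*. The address computation inside printFirstTwo
    // is now wrong - this is the "incorrect results or worse" case above.
    printFirstTwo(arr);
}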
In C and C++, everything of a single type has the same size. An int[4] array is twice as big as an int[2] array, so they can't be of the same type.
But then you might ask, "Why should type imply size?" Well:
A local variable needs to take up a certain amount of memory. When you declare an array, it takes up memory that scales up with its length. When you declare a pointer, it is always the size of pointers on your machine.
Pointer arithmetic is determined by the size of the type it's pointing to: the distance between the address pointed to by p and that pointed to by p+1 is exactly the size of its type. If types didn't have fixed sizes, then p would need to carry around extra information, or C would have to give up arithmetic.
A function needs to know how big its arguments are, because functions are compiled to expect their variables to be in particular places, and having a parameter with an unknown size screws that up.
And you say, "Well, if I pass an array to a function, it just turns into a pointer anyway." True, but you can make new types that have arrays as members, and then you can pass THOSE types around. And in C++, you can in fact pass an array as an array.
int sum10(int (&arr)[10]) { // only takes int arrays of size 10
    int result = 0;
    for (int i = 0; i < 10; i++)
        result += arr[i];
    return result;
}
You can't use pointers in place of array declarations without having to use malloc/free or new/delete to create and destroy memory on the heap. You can declare an array as a variable: it gets created on the stack, and you do not have to worry about its destruction.
Well, an array is an easier way of dealing with data and manipulating it. However, in order to use a pointer you need a valid memory address to point to. Also, the two concepts are not so different when it comes to passing them to a function: an array argument decays to a pointer, so in both cases the function receives an address. Hope that helps
I'm not sure if I get your question, but assuming you're new to coding:
When you declare an array int a[4], you let the compiler know you need memory for 4 ints, and the compiler makes a refer to the start of that block of memory. When you later use a[x], it computes the address a + sizeof(int)*x AND dereferences it to get the int.
In other words, it's effectively a pointer being passed around instead of an 'array', which is just an abstraction that makes it easier for you to code.

Can an XOR linked list be implemented in C++ without causing undefined behavior?

An XOR linked list is a modified version of a normal doubly-linked list in which each node stores just one "pointer" instead of two. That "pointer" is composed of the XOR of the next and previous pointers. To traverse the list, two pointers are needed - one to the current node and one to the next or previous node. To traverse forward, the previous node's address is XORed with the "pointer" stored in the current node, revealing the true "next" pointer.
The C++ standard causes a bunch of operations on pointers and integers to result in undefined behavior - for example, you cannot guarantee that setting a particular bit in a number will not cause the hardware to trigger an interrupt, so in some cases the results of bit twiddling can be undefined.
My question is the following: is there a C++ implementation of an XOR linked list that does not result in undefined behavior?
My question is the following: is there a C++ implementation of an XOR linked list that does not result in undefined behavior?
If by "is there an implementation" you mean "has it already been written" then I don't know. If you mean "is it possible to write one" then yes it is but there might be some caveats about portability.
You can bit-twiddle to your heart's content if you convert both pointers to uintptr_t before you start, and store that type in the node instead of a pointer. Bitwise operations on unsigned types never result in undefined behavior.
However, uintptr_t is an optional type and so it's not entirely portable. There is no requirement that a C++ implementation actually has an integer type capable of representing an address. If the implementation doesn't have uintptr_t then the code is permitted to compile with a diagnostic, in which case its behavior is outside the scope of the standard. Not sure whether you consider that an infringement of "without UB" or not. I mean, seriously, a compiler that allows code which uses undefined types? ;-)
To avoid uintptr_t I think that you could do your bit-twiddling on an array of sizeof(node*) unsigned chars instead. Pointers are POD types and so can be copied, spindled and mutilated provided that the object representation is restored to its original condition before being used as a pointer.
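For concreteness, a minimal sketch of such a node and a traversal step, assuming uintptr_t exists on the implementation; the names Node, to_int and advance are invented for this sketch:
#include <cstdint>   // std::uintptr_t - an optional type, assumed to be available
#include <iostream>

struct Node {
    int data;
    std::uintptr_t link;  // XOR of the addresses of the previous and next nodes
};

std::uintptr_t to_int(Node* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

// cur->link was stored as to_int(prev) ^ to_int(next), so XORing prev back
// out recovers next; the integer-to-pointer round trip is the part whose
// value the standard only guarantees for pointers converted earlier.
Node* advance(Node* prev, Node* cur) {
    return reinterpret_cast<Node*>(to_int(prev) ^ cur->link);
}

int main() {
    Node a{1, 0}, b{2, 0}, c{3, 0};
    a.link = to_int(nullptr) ^ to_int(&b);
    b.link = to_int(&a)      ^ to_int(&c);
    c.link = to_int(&b)      ^ to_int(nullptr);

    // Forward traversal prints 1 2 3.
    for (Node *prev = nullptr, *cur = &a; cur != nullptr; ) {
        std::cout << cur->data << ' ';
        Node* next = advance(prev, cur);
        prev = cur;
        cur  = next;
    }
    std::cout << '\n';
}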
Note also that if your C++ implementation has a garbage collector, then convert-to-integer / xor / xor-back-again / convert-to-pointer doesn't necessarily stop the object from being collected (because it results in an "unsafely derived pointer"). So for portability you must also ensure that the resulting pointers are valid. Two ways to do this:
call declare_reachable on them.
Use an implementation with relaxed pointer safety (it is implementation-defined whether this is the case, and you can test for it with get_pointer_safety() bearing in mind that a relaxed implementation is allowed to falsely claim that it's strict).
You might think that there's a third way (albeit one that defeats the purpose of the XOR linked list unless you happen to have it anyway):
keep a separate container of all the pointer values
This is not guaranteed to work. An unsafely derived pointer is invalid even if it happens to be equal to a safely-derived pointer (3.7.4.3/4). I was surprised too.

Is storing an invalid pointer automatically undefined behavior?

Obviously, dereferencing an invalid pointer causes undefined behavior. But what about simply storing an invalid memory address in a pointer variable?
Consider the following code:
const char* str = "abcdef";
const char* begin = str;
if (begin - 1 < str) { /* ... do something ... */ }
The expression begin - 1 evaluates to an invalid memory address. Note that we don't actually dereference this address - we simply use it in pointer arithmetic to test if it is valid. Nonetheless, we still have to load an invalid memory address into a register.
So, is this undefined behavior? I never thought it was, since a lot of pointer arithmetic seems to rely on this sort of thing, and a pointer is really nothing but an integer anyway. But recently I heard that even the act of loading an invalid pointer into a register is undefined behavior, since certain architectures will automatically throw a bus error or something if you do that. Can anyone point me to the relevant part of the C or C++ standard which settles this either way?
I have the C Draft Standard here, and it makes it undefined by omission. It defines the case of ptr + I at 6.5.6/8:
If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression.
Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object.
Your case does not fit either of these. Your array is not large enough to have -1 adjust the pointer to point to a different array element, nor does the result or the original pointer point one past the end.
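Spelled out against the string from the question (purely illustrative; the commented-out lines are exactly the ones the quoted paragraph rules out):
void examples() {
    const char* str = "abcdef";         // array of 7 chars, counting the terminating '\0'
    const char* begin = str;

    const char* end = str + 7;          // OK: one past the last element
    const char* mid = str + 3;          // OK: still points to an element of the array
    // const char* before = begin - 1;  // UB: the subtraction itself, not any later use of it
    // const char* past2  = str + 8;    // UB: more than one past the end

    (void)begin; (void)end; (void)mid;  // silence unused-variable warnings
}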
Your code is undefined behavior for a different reason:
the expression begin - 1 does not yield an invalid pointer. It is undefined behavior. You are not allowed to perform pointer arithmetic beyond the bounds of the array you're working on. So it is the subtraction itself that is invalid, not the act of storing the resulting pointer.
Some architectures have dedicated registers for holding pointers. Putting the value of an unmapped address into such a register is allowed to crash. Integer overflow/underflow is allowed to crash. Because C aims to work on a broad variety of platforms, pointers provide a mechanism for safely programming unsafe circuits.
If you know you won't be running on exotic hardware with such finicky characteristics, you don't need to worry about what is undefined by the language. It is well-defined by the platform.
Of course, the example is poor style and there isn't a good reason to do it.
Any use of an invalid pointer yields undefined behaviour. I don't have the C Standard here at work, but see 'invalid pointers' in the Rationale: http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf
$5.7/6 - "Unless both pointers point to elements of the same array object, or one past the last element of the array object, the behavior is undefined."
In summary, it is undefined even if you do not dereference the pointer.
The correct answers have been given years ago, but I find it interesting that the C99 rationale [sec. 6.5.6, last 3 paragraphs] explains why the standard endorses adding 1 to a pointer that points to the last element of an array (p+1):
An important endorsement of widespread practice is the requirement that a pointer can always be incremented to just past the end of an array, with no fear of overflow or wraparound
and why p-1 is not endorsed:
In the case of p-1, on the other hand, an entire object would have to be allocated prior to the array of objects that p traverses, so decrement loops that run off the bottom of an array can fail. This restriction allows segmented architectures, for instance, to place objects at the start of a range of addressable memory.
So if the pointer p points to an object at the start of a range of addressable memory, which is endorsed by this comment, then p-1 would generate an underflow.
Note that integer overflow is the standard's example for undefined behavior [sec. 3.4.3], as it depends on the translation environment and the operating environment. I believe it is easy to see that this dependence on the environment extends to pointer underflow.
This is why the standard explicitly makes it undefined behavior [in 6.5.6/8], as noted by other answers here. To cite that sentence:
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
See also [sec. 6.3.2.3, last 4 paragraphs] of the C99 rationale, which gives a more detailed description of how invalid pointers can be generated, and what effects that may have.
Yes, it's undefined behavior. See the accepted answer to this closely related question. Assigning an invalid pointer to a variable, comparing an invalid pointer, casting an invalid pointer triggers undefined behavior.