Is storing an invalid pointer automatically undefined behavior? - c++

Obviously, dereferencing an invalid pointer causes undefined behavior. But what about simply storing an invalid memory address in a pointer variable?
Consider the following code:
const char* str = "abcdef";
const char* begin = str;
if (begin - 1 < str) { /* ... do something ... */ }
The expression begin - 1 evaluates to an invalid memory address. Note that we don't actually dereference this address - we simply use it in pointer arithmetic to test if it is valid. Nonetheless, we still have to load an invalid memory address into a register.
So, is this undefined behavior? I never thought it was, since a lot of pointer arithmetic seems to rely on this sort of thing, and a pointer is really nothing but an integer anyway. But recently I heard that even the act of loading an invalid pointer into a register is undefined behavior, since certain architectures will automatically throw a bus error or something if you do that. Can anyone point me to the relevant part of the C or C++ standard which settles this either way?

I have the C draft standard here, and it makes this undefined by omission. 6.5.6/8 defines the case of ptr + I only as follows:
If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression.
Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object.
Your case does not fit either of these. Your array is not large enough for -1 to adjust the pointer to a different array element, and neither the result nor the original pointer points one past the end.
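A minimal sketch of the distinction (variable names are mine; an array is used instead of a pointer to a string literal purely for clarity):
const char str[] = "abcdef";        // array of 7 chars, including the terminating '\0'
const char* inside = str + 3;       // points at an element of the same array: covered by 6.5.6/8
const char* onePastEnd = str + 7;   // one past the last element: also covered
// const char* before = str - 1;    // neither an element nor one past the end:
//                                  // the subtraction itself is undefined behavior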

Your code is undefined behavior for a different reason:
the expression begin - 1 does not yield an invalid pointer; evaluating it is itself undefined behavior. You are not allowed to perform pointer arithmetic beyond the bounds of the array you are working on. So it is the subtraction itself that is invalid, not the act of storing the resulting pointer.
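To illustrate (a sketch of my own, not from the question): the intent behind begin - 1 < str can be expressed without ever forming an out-of-bounds pointer, because as long as begin stays within the string, "stepping back would leave the array" is equivalent to begin == str.
const char* str = "abcdef";
const char* begin = str;
if (begin == str) { /* ... do something ... */ }   // well-defined
// if (begin - 1 < str) { ... }                    // the subtraction itself is UB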

Some architectures have dedicated registers for holding pointers. Putting the value of an unmapped address into such a register is allowed to crash. Integer overflow/underflow is allowed to crash. Because C aims to work on a broad variety of platforms, pointers provide a mechanism for safely programming unsafe circuits.
If you know you won't be running on exotic hardware with such finicky characteristics, you don't need to worry about what is undefined by the language. It is well-defined by the platform.
Of course, the example is poor style and there isn't a good reason to do it.

Any use of an invalid pointer yields undefined behaviour. I don't have the C Standard here at work, but see 'invalid pointers' in the Rationale: http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf

$5.7/6 - "Unless both pointers point
to elements of the same array object,
or one past the last element of the
array object, the behavior is
undefined.75)"
Summary, it is undefined even if you do not dereference the pointer.

The correct answers have been given years ago, but I find it interesting that the C99 rationale [sec. 6.5.6, last 3 paragraphs] explains why the standard endorses adding 1 to a pointer that points to the last element of an array (p+1):
An important endorsement of widespread practice is the requirement that a pointer can always be incremented to just past the end of an array, with no fear of overflow or wraparound
and why p-1 is not endorsed:
In the case of p-1, on the other hand, an entire object would have to be allocated prior to the array of objects that p traverses, so decrement loops that run off the bottom of an array can fail. This restriction allows segmented architectures, for instance, to place objects at the start of a range of addressable memory.
So if the pointer p points to an object at the start of a range of addressable memory, a placement this passage explicitly allows, then p-1 would generate an underflow.
Note that integer overflow is the standard's example for undefined behavior [sec. 3.4.3], as it depends on the translation environment and the operating environment. I believe it is easy to see that this dependence on the environment extends to pointer underflow.
This is why the standard explicitly makes it undefined behavior [in 6.5.6/8], as noted by other answers here. To cite that sentence:
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
See also [sec. 6.3.2.3, last 4 paragraphs] of the C99 rationale, which gives a more detailed description of how invalid pointers can be generated, and what effects that may have.
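A small illustrative sketch (my own) of the workaround this distinction suggests: iterate backwards starting from the endorsed one-past-the-end pointer and decrement before use, so the pointer never goes below the first element.
int arr[4] = {0, 1, 2, 3};
for (int* p = arr + 4; p != arr; ) {   // arr + 4 is one past the end: endorsed
    --p;                               // decrement first, so p never points before arr
    /* ... use *p ... */
}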

Yes, it's undefined behavior. See the accepted answer to this closely related question. Assigning an invalid pointer to a variable, comparing an invalid pointer, or casting an invalid pointer each triggers undefined behavior.

Related

hex subtraction result is not shown as expected [duplicate]

C++ Undefined Behavior when subtracting pointers

The C++ standard says that subtracting pointers that do not point to elements of the same array is UB:
int a, b;
&a - &b; // UB, since pointers don't point to the same array.
But if both pointers are cast to uintptr_t, then both expressions are no longer pointer expressions, and subtracting them seems to be legal from the Standard's perspective:
int a, b;
reinterpret_cast<uintptr_t>(&a) - reinterpret_cast<uintptr_t>(&b);
Is this correct or am I missing something?
The subtraction of the integers is legal in the sense that behaviour is not undefined.
But the standard doesn't technically guarantee anything about the values of the converted integers, and consequently you have no guarantees about the value of the result of the subtraction either (except in relation to those unspecified integer values). You could get a large value or a small value depending on the system, and if the pointers had non-interconvertible types, subtracting the converted integers of different objects could even yield zero.
Furthermore, if you did some other pointer arithmetic that would result in a pointer value, i.e. converted a pointer to an integer and added an offset, then there is technically no guarantee that converting the result of the addition back to a pointer type would yield the pointer value at that offset from the original. In practice it will probably work (assuming an object of the correct type actually exists at that address), except perhaps on systems using segmented memory or something more exotic.
You are correct.
The behaviour on subtracting pointers that do not point to elements of the same array is undefined. For this purpose a pointer is allowed to be one beyond the final element of an array, and an object counts as an array of length one.
But once you cast a pointer to a suitable type, e.g. to std::uintptr_t (assuming your compiler supports it; it doesn't have to), you can apply any arithmetic you want to it, subject to the constraints imposed by that type.
Although such rules may seem obtuse, they are related to a similar rule where you are not allowed to read a pointer that doesn't point to valid memory. All of this helps achieve greater portability of the language.
The UB allows an implementation to do anything. It does not prevent an implementation from simply computing the difference between the address values and dividing by the size. But it also allows an implementation that checks whether both pointers point into the same array (or one past its end) to raise an exception or crash.
Requiring both pointers to point into the same array allows the implementation to assume that any correctly aligned value between them is also valid and points inside the same array.
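A minimal sketch of the cast-then-subtract approach discussed above (assuming the implementation provides std::uintptr_t; the printed value is whatever the implementation's pointer-to-integer mapping happens to give):
#include <cstdint>
#include <iostream>

int main() {
    int a = 0, b = 0;
    // &a - &b would be undefined behavior (unrelated objects), but subtracting
    // the converted integers is well-defined - only the value is unspecified.
    std::uintptr_t ua = reinterpret_cast<std::uintptr_t>(&a);
    std::uintptr_t ub = reinterpret_cast<std::uintptr_t>(&b);
    std::cout << ua - ub << '\n';   // implementation-dependent; unsigned wraparound if ub > ua
}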

Why two variables declared one after another are not next to each other in memory?

I am using a code sample to check the distance between two integers like in the answer of this question.
int i = 0, j = 0;
std::cout << &i - &j;
From my understanding of the memory representation, the memory addresses of these two variables should be next to each other, and the difference should be exactly 1.
To my surprise, running this code with MS compiler in VS2017 prints 3 and running the same code with GCC prints 1.
Why does this happen? Is something wrong with VS?
The C++ standard does not impose any requirements on how C++ compilers allocate variables with automatic storage duration, including whether they are contiguous in memory. In fact, the compiler may choose not to allocate any memory for a variable at all, optimizing it out completely.
That is why subtracting pointers makes sense only when they both point to memory inside the same array, or one element past the end of it. In all other situations, including yours, you get undefined behavior.
The pointer arithmetic you tried has undefined behavior:
If the pointer P points to the ith element of an array, and the pointer Q points at the jth element of the same array, the expression P-Q has the value i-j, if the value fits in std::ptrdiff_t. Both operands must point to the elements of the same array (or one past the end), otherwise the behavior is undefined. If the result does not fit in std::ptrdiff_t, the behavior is undefined.
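For contrast, a sketch (my own example) of the case the quoted rule does define - both pointers into the same array:
#include <cstddef>
#include <iostream>

int main() {
    int arr[8] = {};
    int* p = arr + 5;
    int* q = arr + 2;
    std::ptrdiff_t d = p - q;   // well-defined: same array, d == 3
    std::cout << d << '\n';
    // &i - &j for two unrelated locals carries no such guarantee.
}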

C++ - operator -= on a pointer [duplicate]

Say I have a pointer to an allocated array:
char* buf = new char[256];
What will happen to the array pointer's value/size if I do
buf -= 100;
+, +=, - and -= move a pointer around.
The pointer char* ptr = new char[256] points at the start of a block of 256 bytes.
So ptr+10 is now a pointer pointing 10 elements in, and ptr+=10 is the same as ptr = ptr+10.
This is called pointer arithmetic.
In C++, pointer arithmetic is no longer valid if the result takes you outside the object the pointer points into, the one exception being the one-past-the-end position. So ptr+0 to ptr+256 are the only pointer values you are allowed to generate from ptr.
ptr-=100 is undefined behaviour.
In C++, most implementations currently in use represent pointers at runtime as unsigned integer indexes into a flat address space. This still doesn't mean you can rely on that fact while doing pointer arithmetic. Each pointer has a range of validity, and once you go outside of it the C++ standard no longer defines what anything in your program does (UB).
Undefined Behaviour doesn't just mean "could segfault"; the compiler is free to do anything, and there are instances of compilers optimizing entire branches of code out because the only way to reach them required UB, or because the compiler proved that reaching them would cause UB. UB makes the correctness of your program basically impossible to reason about.
That being said, 99/100+ times what will happen is that ptr-=100 now points to a different part of the heap than it did when initialized, and reading/writing to what it points at will result in getting junk, corrupting memory, and/or segfaulting. And doing a +=100 will bring ptr back to the valid range.
The block of memory won't be bothered by moving ptr, just ptr won't be pointing within it.
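A short sketch (my own) of the difference between moving a separate cursor within the block and moving buf itself:
char* buf = new char[256];
char* p = buf;
p += 100;        // fine: still points within the 256-byte block
p -= 100;        // fine: back to the start of the block
// buf -= 100;   // undefined behavior: the result would point before the block
delete[] buf;    // delete[] needs the original pointer value anyway - another
                 // practical reason not to move buf itself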
The standard says that even just trying to calculate a pointer that goes outside the actual boundaries of an array (and the "one past last" value, which is explicitly allowed, although not for dereferencing) is undefined behavior, i.e. anything can happen.
On some bizarre machines, even just this calculation may make the program crash (they had registers specifically for pointers, that trapped in case they pointed to non mapped pages/invalid addresses).
In practice, on non-pathological machines calculating that pointer won't do much - you'll probably just obtain a pointer to an invalid memory location (which may crash your program when you try to dereference it) or, worse, to memory you don't own (so you may overwrite unrelated data).
The only case where that code may be justified is if you have "insider knowledge" about the memory allocator (or you have actually replaced it with your own, e.g. by providing your own override of the global new operator), and you know that the actual array starts before the returned address - possibly storing extra data there.

Why is out-of-bounds pointer arithmetic undefined behaviour?

The following example is from Wikipedia.
int arr[4] = {0, 1, 2, 3};
int* p = arr + 5; // undefined behavior
If I never dereference p, then why is arr + 5 alone undefined behaviour? I expect pointers to behave as integers - with the exception that when dereferenced the value of a pointer is considered as a memory address.
That's because pointers don't behave like integers. It's undefined behavior because the standard says so.
On most platforms however (if not all), you won't get a crash or run into dubious behavior if you don't dereference the pointer. But then, if you don't dereference it, what's the point of doing the addition?
That said, note that an expression going one over the end of an array is technically 100% "correct" and guaranteed not to crash per §5.7 ¶5 of the C++11 spec. However, the result of that expression is unspecified (just guaranteed not to be an overflow); while any other expression going more than one past the array bounds is explicitly undefined behavior.
Note: That does not mean it is safe to read and write from an over-by-one offset. You likely will be editing data that does not belong to that array, and will cause state/memory corruption. You just won't cause an overflow exception.
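A sketch (my own example) of that distinction: the one-past-the-end pointer may be formed and compared, but not dereferenced, and anything further out may not even be formed.
int arr[4] = {0, 1, 2, 3};
int* end = arr + 4;                   // one past the end: a valid pointer value
for (int* p = arr; p != end; ++p) {   // comparing against it is the usual loop idiom
    /* ... use *p ... */
}
// *end;                 // dereferencing it would be undefined behavior
// int* q = arr + 5;     // even forming this pointer is undefined behavior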
My guess is that it's like that because it's not only dereferencing that's wrong. Also pointer arithmetics, comparing pointers, etc. So it's just easier to say don't do this instead of enumerating the situations where it can be dangerous.
The original x86 can have issues with such statements. On 16 bits code, pointers are 16+16 bits. If you add an offset to the lower 16 bits, you might need to deal with overflow and change the upper 16 bits. That was a slow operation and best avoided.
On those systems, array_base+offset was guaranteed not to overflow, if offset was in range (<=array size). But array+5 would overflow if array contained only 3 elements.
The consequence of that overflow is that you got a pointer which doesn't point behind the array, but before. And that might not even be RAM, but memory-mapped hardware. The C++ standard doesn't try to limit what happens if you construct pointers to random hardware components, i.e. it's Undefined Behavior on real systems.
If arr happens to be right at the end of the machine's memory space then arr+5 might be outside that memory space, so the pointer type might not be able to represent the value i.e. it might overflow, and overflow is undefined.
"Undefined behavior" doesn't mean it has to crash on that line of code, but it does mean that you can't make any guaranteed about the result. For example:
int arr[4] = {0, 1, 2, 3};
int* p = arr + 5; // I guess this is allowed to crash, but that would be a rather
// unusual implementation choice on most machines.
*p; //may cause a crash, or it may read data out of some other data structure
assert(arr < p); // this statement may not be true
// (arr may be so close to the end of the address space that
// adding 5 overflowed the address space and wrapped around)
assert(p - arr == 5); //this statement may not be true
//the compiler may have assigned p some other value
I'm sure there are many other examples you can throw in here.
Some systems - very rare systems, and I can't name one - will cause traps when you increment past boundaries like that. Further, it allows implementations that provide boundary protection to exist... again, though, I can't think of one.
Essentially, you shouldn't be doing it, and therefore there's no reason to specify what happens when you do. Specifying what happens would put an unwarranted burden on the implementation provider.
The result you are seeing is because of the x86's segment-based memory protection. I find this protection justified: when you increment the pointer address and store it, it means that at some future point in your code you will dereference the pointer and use the value. The compiler wants to avoid situations where you end up changing somebody else's memory location or freeing memory owned by some other part of your code. To avoid such scenarios, the restriction is in place.
In addition to hardware issues, another factor was the emergence of implementations which attempted to trap on various kinds of programming errors. Although many such implementations could be most useful if configured to trap on constructs which a program is known not to use, even though they are defined by the C Standard, the authors of the Standard did not want to define the behavior of constructs which would--in many programming fields--be symptomatic of errors.
In many cases, it will be much easier to trap on actions which use pointer arithmetic to compute address of unintended objects than to somehow record the fact that the pointers cannot be used to access the storage they identify, but could be modified so that they could access other storage. Except in the case of arrays within larger (two-dimensional) arrays, an implementation would be allowed to reserve space that's "just past" the end of every object. Given something like doSomethingWithItem(someArray+i);, an implementation could trap any attempt to pass any address which doesn't point to either an element of the array or the space just past the last element. If the allocation of someArray reserved space for an extra unused element, and doSomethingWithItem() only accesses the item to which it receives a pointer, the implementation could relatively inexpensively ensure that any non-trapped execution of the above code could--at worst--access otherwise-unused storage.
The ability to compute "just-past" addresses makes bounds checking more difficult than it otherwise would be (the most common erroneous situation would be passing doSomethingWithItem() a pointer just past the end of the array, but behavior would be defined unless doSomethingWithItem tried to dereference that pointer--something the caller may be unable to prove). Because the Standard would allow compilers to reserve space just past the array in most cases, however, such an allowance would let implementations limit the damage caused by untrapped errors--something that would likely not be practical if more generalized pointer arithmetic were allowed.