This question already has answers here:
C/C++: Pointer Arithmetic
(7 answers)
Closed 5 years ago.
Say I have pointer array:
char* buf = new char[256];
what will happen to array pointer's value/size if i do
buf -= 100;
+, +=, - and -= move a pointer around.
The pointer char* ptr = new char[256] points at the start of a block of 256 bytes.
So ptr+10 is now a pointer poimting 10 in, and ptr+=10 is the same as ptr = ptr+10.
This is called pointer arithmetic.
In C++, poimter arithmetic is no longer valid if the result takes you out of the object the pointer is pointing within, or one-place-past-the-end. So ptr-0 to ptr+256 are the only valid places you are allowed to generate from ptr.
ptr-=100 is undefined behaviour.
In C++, most implementations currently active implement pointers as unsigned integer indexes into a flat address space at runtime. This still doesn't mean you can rely on this fact while doing pointer arithmetic. Each pointer has a range of validity, and going outside of it the C++ standard no longer defines what anything in your program does (UB).
Undefined Behaviour doesn't just mean "could segfault"; the compiler is free to do anything, and there are instances of compilers optimizing entire branches of code out because the only way to reach them required UB, or because it proved that if you reached them UB would occur. UB makes the correctness of your program bassically impossible to reason about.
That being said, 99/100+ times what will happen is that ptr-=100 now points to a different part of the heap than it did when initialized, and reading/writing to what it points at will result in getting junk, corrupting memory, and/or segfaulting. And doing a +=100 will bring ptr back to the valid range.
The block of memory won't be bothered by moving ptr, just ptr won't be pointing within it.
The standard says that even just trying to calculate a pointer that goes outside the actual boundaries of an array (and the "one past last" value, which is explicitly allowed, although not for dereferencing) is undefined behavior, i.e. anything can happen.
On some bizarre machines, even just this calculation may make the program crash (they had registers specifically for pointers, that trapped in case they pointed to non mapped pages/invalid addresses).
In practice, on non-patological machines calculating that pointer won't do much - you'll probably just obtain a pointer to an invalid memory location (which may crash your program when trying to dereference it) or, worse, to memory you don't own (so you may overwrite unrelated data).
The only case where that cose may be justified is if you have "insider knowledge" about the memory allocator (or you have actually replaced it with your own, e.g. providing your own override of the global new operator), and you know that the actual array starts before the returned address - possibly storing there extra data.
Related
char data;
char *ptr = &data;
*ptr = 3;
*(ptr+1) = 5;
So I am studying pointer, and one thing I quite don't understand.
When you use pointer as an array, how do you know address ptr+1 or ptr+2 and so on is not being occupied by some other variable?
So for example the address ptr + 1 is already being used, and if I try to put 5 in it, is there a chance the program just crashes?
Or more extreme example something like ptr + 1000?
Or does the compiler make sure that it never happens?
The definition char data; reserves space for one char.
char *ptr = &data; sets ptr to point to that char.
The space at ptr+1 is not reserved for your use, and neither the C nor the C++ standards define what happens if you try to use it.
The definition char data[3]; would reserve space for three char. Then char *ptr = data; would set ptr to point to the first of these char. That is, ptr would have the address &data[0], and *ptr would be data[0].
Then ptr+1 would point to the next char; it would have the address &data[1], and *(ptr+1) would be data[1].
Generally, you should use pointer arithmetic only to access space that you know is reserved for your use. (There may be exceptions or clarifications to that in special-purpose code, such as code in the operating system kernel to deal with memory mapping or code in special-purpose hardware. You do not need to consider such possibilities in normal user programs.)
The compiler generally does not prevent you from accessing invalid addresses. It might, in some circumstances, where it detects a reference out of bounds. Generally, this only happens in current compilers with simple expressions where the full definitions are visible to the compiler and the references essentially use constant indices.
The operating system may prevent you from accessing some invalid addresses. However, it will only prevent you from accessing invalid addresses either because they are not mapped to your program at all or they are mapped to be read-only but you tried writing to them (or certain other combinations, such as attempting to read execute-only memory). The operating system will not prevent you from accessing addresses improperly that are mapped and accessible to your program. For example, calculating an improper pointer value and using it in an assignment to change memory can result in changing data your program needs for other functions.
You don't know if they are occupied by other variables or not.
This is undefined behavior, which means that the C standard does not specify what your program does. It could appear to "work", it could crash, it could give wrong results, or it could do anything else.
So don't do this.
This question already has answers here:
Accessing an array out of bounds gives no error, why?
(18 answers)
Closed 3 years ago.
This has been bugging me for quite some time. I have a pointer. I declare an array of type int.
int* data;
data = new int[5];
I believe this creates an array of int with size 5. So I'll be able to store values from data[0] to data[4].
Now I create an array the same way, but without size.
int* data;
data = new int;
I am still able to store values in data[2] or data[3]. But I created an array of size 1. How is this possible?
I understand that data is a pointer pointing to the first element of the array. Though I haven't allocated memory for the next elements, I still able to access them. How?
Thanks.
Normally, there is no need to allocate an array "manually" with new. It is just much more convenient and also much safer to use std::vector<int> instead. And leave the correct implementation of dynamic memory management to the authors of the standard library.
std::vector<int> optionally provides element access with bounds checking, via the at() method.
Example:
#include <vector>
int main() {
// create resizable array of integers and resize as desired
std::vector<int> data;
data.resize(5);
// element access without bounds checking
data[3] = 10;
// optionally: element access with bounds checking
// attempts to access out-of-range elements trigger runtime exception
data.at(10) = 0;
}
The default mode in C++ is usually to allow to shoot yourself in the foot with undefined behavior as you have seen in your case.
For reference:
https://en.cppreference.com/w/cpp/container/vector
https://en.cppreference.com/w/cpp/container/vector/at
https://en.cppreference.com/w/cpp/language/ub
Undefined, unspecified and implementation-defined behavior
What are all the common undefined behaviours that a C++ programmer should know about?
Also, in the second case you don't allocate an array at all, but a single object. Note that you must use the matching delete operator too.
int main() {
// allocate and deallocate an array
int *arr = new int[5];
delete[] arr;
// allocate and deallocate a single object
int *p = new int;
delete p;
}
For reference:
https://en.cppreference.com/w/cpp/language/new
https://en.cppreference.com/w/cpp/language/delete
How does delete[] know it's an array?
When you used new int then accessing data[i] where i!=0 has undefined behaviour.
But that doesn't mean the operation will fail immediately (or every time or even ever).
On most architectures its very likely that the memory addresses just beyond the end of the block you asked for are mapped to your process and you can access them.
If you're not writing to them it's no surprise you can access them (though you shouldn't).
Even if you write to them most memory allocators have a minimum allocation and behind the scenes you may well have been allocated space for more (4 is realistic) integers even though the code only requests 1.
You may also be overwriting some area of memory but never get tripped up. A common consequence of writing beyond the end of an array is to corrupt the free-memory store itself. The consequence may be catastrophe but may only exhibit itself in a later allocation possibly of a similar sized object.
It's a dreadful idea to rely on such behaviour but it's not very surprising that it appears to work.
C++ doesn't (typically or by default) perform strict range checking and accessing invalid array elements may work or at least appear to work initially.
This is why C and C++ can be plagued with bizarre and intermittent errors. Not all code that provokes undefined behaviour fails catastrophically in every execution.
Going outside the bounds of an array in C++ is undefined behavior, so anything can happen, including things that appear to work "correctly".
In practical implementation terms on common systems, you can think of "virtual" memory as a large "flat" space from 0 up to the size of a pointer, and pointers are into this space.
The "virtual" memory for a process is mapped to physical memory, page file, etc. Now, if you access an address that is not mapped, or try to write a read-only part, you will get an error, such as an access violation or segfault.
But this mapping is done for fairly large chunks for efficiency, such as for 4KiB "pages". The allocators in a process, such as new and delete (or the stack) will further split up these pages as required. So accessing other parts of a valid page are unlikely to raise an error.
This has the unfortunate result that it can be hard to detect such out of bounds access, use after free, etc. In many cases writes will succeed, only to corrupt some other seemingly unrelated object, which may cause a crash later, or incorrect program output, so best to be very careful about C and C++ memory management.
data = new int; // will be some virtual address
data[1000] = 5; // possibly the start of a 4K page potentially allowing a great deal beyond it
other_int = new int[5];
other_int[10] = 10;
data[10000] = 42; // with further pages beyond, so you can really make a mess of your programs memory
other_int[10] == 42; // perfectly possible to overwrite other things in unexpected ways
C++ provides many tools to help, such as std::string, std::vector and std::unique_ptr, and it is generally best to try and avoid manual new and delete entirely.
new int allocates 1 integer only. If you access offsets larger than 0, e.g. data[1] you override the memory.
int * is a pointer to something that's probably an int. When you allocate using new int , you're allocating one int and storing the address to the pointer. In reality, int * is just a pointer to some memory.
We can treat an int * as a pointer to a scalar element (i.e. new int) or an array of elements -- the language has no way of telling you what your pointer is really pointing to; a very good argument to stop using pointers and only using scalar values and std::vector.
When you say a[2], you well access the memory sizeof(int) after the value pointed to by a. If a is pointing to a scalar value, anything could be after a and reading it causes undefined behaviour (your program might actually crash -- this is an actual risk). Writing to that adress will most likley cause problems; it is not merely a risk, but something you should actively guard against -- i.e. use std::vector if you need an array and int or int& if you don't.
The expression a[b], where one of the operands is a pointer, is another way to write *(a+b). Let's for the sake of sanity assume that a is the pointer here (but since addition is commutative it can be the other way around! try it!); then the address in a is incremented by b times sizeof(*a), resulting in the address of the bth object after *a.
The resulting pointer is dereferenced, resulting in a "name" for the object whose address is a+b.
Note that a does not have to be an array; if it is one, it "decays" to a pointer before the operator [] is applied. The operation is taking place on a typed pointer. If that pointer is invalid, or if the memory at a+b does not in fact hold an object of the type of *a, or even if that object is unrelated to *a (e.g., because it is not in the same array or structure), the behavior is undefined.
In the real world, "normal" programs do not do any bounds checking but simply add the offset to the pointer and access that memory location. (Accessing out-of-bounds memory is, of course, one of the more common bugs in C and C++, and one of the reasons these languages are not without restrictions recommended for high-security applications.)
If the index b is small, the memory is probably accessible by your program. For plain old data like int the most likely result is then that you simply read or write the memory in that location. This is what happened to you.
Since you overwrite unrelated data (which may in fact be used by other variables in your program) the results are often surprising in more complex programs. Such errors can be hard to find, and there are tools out there to detect such out-of-bounds access.
For larger indices you'll at some point end up in memory which is not assigned to your program, leading to an immediate crash on modern systems like Windows NT and up, and unpredictable results on architectures without memory management.
I am still able to store values in data[2] or data[3]. But I created an array of size 1. How is this possible?
The behaviour of the program is undefined.
Also, you didn't create an array of size 1, but a single non-array object instead. The difference is subtle.
I know NULL (0x00000000) is a pointer to nothing because the OS doesn't allow the process to allocate any memory at this location. But if I use 0x00000001 (Magic number or code-pointer), is it safe to assume as well that the OS wont allow memory to be allocated here?
If so then until where is it safe to assume that?
Standard (first)
The Standard only guarantees that 0 is a sentinel value as far as pointers go. The underlying memory representation is no way guaranteed; it's implementation defined.
Using a pointer set to that sentinel value for anything else than reading the pointer state or writing a new state (which includes dereferencing or pointer arithmetic) is undefined behavior.
Virtual Memory
In the days of virtual memory (ie, each process gets its own memory space, independent from the others), a null pointer is most often indeed represented as 0 in the process memory space. I don't know of any other architectures actually, though I imagine that in mainframes it may not be so.
Unix
In the Unix world, it is typical to reserve all the address space below 0x8000 for null values. The memory is not allocated, really, it is just protected (ie, placed in a special mode), so that the OS will trigger a segmentation fault should you ever try to read it or write to it.
The idea of using such a range is that a null pointer is not necessarily used as is. For example if you use a std::pair<int, int>* p = 0; which is null, and call p->second, then the compiler will perform the arithmetic necessary to point to second (ie +4 generally) and attempt to access the memory at 0x4 directly. The problem is obviously compounded by arrays.
In practice, this 0x8000 limit should be practical enough to detect most issues (and avoid memory corruption or others). In this case, this means that you avoid the undefined behavior and get a "proper" crash. However, should you be using a large array you could overshoot it, so it's not a silver bullet.
The particular limit of your implementation or compiler/runtime stack can be determined either through documentation or by successive trials. There might even be a way to tweak it.
You should not assume anything about the actual values of pointers. Especially, the null pointer is not required to be represented by a zero address, even though the literal 0 does look like a zero.
The only valid range is supposed to be range allocated to you by the OS.ANYTHING else should be denied by the OS.
An exception to that rule is the shared memory.
The C++ standard doesn't "reserve" any pointer addresses other than zero (null). So it is not safe to use 1 or any other value as a "magic" pointer value. Of course, in practice, some implementations of c++ probably do not every use certain values. But you don't get any guarantees from the language definition.
I will try to give a broad view about this:
you probably will never ever access the real memory addresses because of the multiple sandboxing mechanism that every modern OS has and puts in place.
What is a NULL pointer from the software viewpoint ? a NULL pointer is a pointer variable that stores a value that the programmer pick as a meaningfull value and this value is used as a label with the following meaning "this pointer goes nowhere". a NULL pointer does not point to 0x000000 by definition, the definition of a NULL pointer it's not about where that pointer will point to but the value of this macro called NULL and this value will be the value of this NULL pointer.
in C you can assume that NULL == 0, only in C NULL is a macro that defines NULL as an int that is equal to 0, in C++ you do not have this liberty
there are types, labels and values ( in better terms, representations of values not real values ) for every variables, at least for primitives values, the same is for the pointers, if you are speaking about void pointers you are speaking about pointers that contains a memory address ( just like any pointer ) and the only special thing about this pointers is that they need a cast in C++ to be decoded, safely and effectively; it's a big mistake if you think about void* as pointers that points to nowhere or to 0 or to NULL or to 0x0000000
by the way, i still don't get your problem ...
A modern OS is likely to reserve at least one page for NULL pointer. So 0x1 (or 0x4 if you want 32-bit alignment) is likely to work.
But remember this is not guaranteed by C/C++ language. You would have to rely on your OS and compiler for such behavior.
Further more, there's no guarantee about the actual value of the NULL pointer. It may or may not be all zeros. If it's not, your trick won't work at all.
The following example is from Wikipedia.
int arr[4] = {0, 1, 2, 3};
int* p = arr + 5; // undefined behavior
If I never dereference p, then why is arr + 5 alone undefined behaviour? I expect pointers to behave as integers - with the exception that when dereferenced the value of a pointer is considered as a memory address.
That's because pointers don't behave like integers. It's undefined behavior because the standard says so.
On most platforms however (if not all), you won't get a crash or run into dubious behavior if you don't dereference the array. But then, if you don't dereference it, what's the point of doing the addition?
That said, note that an expression going one over the end of an array is technically 100% "correct" and guaranteed not to crash per §5.7 ¶5 of the C++11 spec. However, the result of that expression is unspecified (just guaranteed not to be an overflow); while any other expression going more than one past the array bounds is explicitly undefined behavior.
Note: That does not mean it is safe to read and write from an over-by-one offset. You likely will be editing data that does not belong to that array, and will cause state/memory corruption. You just won't cause an overflow exception.
My guess is that it's like that because it's not only dereferencing that's wrong. Also pointer arithmetics, comparing pointers, etc. So it's just easier to say don't do this instead of enumerating the situations where it can be dangerous.
The original x86 can have issues with such statements. On 16 bits code, pointers are 16+16 bits. If you add an offset to the lower 16 bits, you might need to deal with overflow and change the upper 16 bits. That was a slow operation and best avoided.
On those systems, array_base+offset was guaranteed not to overflow, if offset was in range (<=array size). But array+5 would overflow if array contained only 3 elements.
The consequence of that overflow is that you got a pointer which doesn't point behind the array, but before. And that might not even be RAM, but memory-mapped hardware. The C++ standard doesn't try to limit what happens if you construct pointers to random hardware components, i.e. it's Undefined Behavior on real systems.
If arr happens to be right at the end of the machine's memory space then arr+5 might be outside that memory space, so the pointer type might not be able to represent the value i.e. it might overflow, and overflow is undefined.
"Undefined behavior" doesn't mean it has to crash on that line of code, but it does mean that you can't make any guaranteed about the result. For example:
int arr[4] = {0, 1, 2, 3};
int* p = arr + 5; // I guess this is allowed to crash, but that would be a rather
// unusual implementation choice on most machines.
*p; //may cause a crash, or it may read data out of some other data structure
assert(arr < p); // this statement may not be true
// (arr may be so close to the end of the address space that
// adding 5 overflowed the address space and wrapped around)
assert(p - arr == 5); //this statement may not be true
//the compiler may have assigned p some other value
I'm sure there are many other examples you can throw in here.
Some systems, very rare systems and I can't name one, will cause traps when you increment past boundaries like that. Further, it allows an implementation that provides boundary protection to exist...again though I can't think of one.
Essentially, you shouldn't be doing it and therefor there's no reason to specify what happens when you do. Specifying what happens puts unwarranted burden on the implementation provider.
This result you are seeing is because of the x86's segment-based memory protection. I find this protection to be justified as when you are incrementing the pointer address and storing, It means at future point of time in your code you will be dereferencing the pointer and using the value. So compiler wants to avoid such kind of situations where you will end up changing some other's memory location or deleting the memory which is being owned by some other guy in your code. To avoid such scenario's compiler has put the restriction.
In addition to hardware issues, another factor was the emergence of implementations which attempted to trap on various kinds of programming errors. Although many such implementations could be most useful if configured to trap on constructs which a program is known not to use, even though they are defined by the C Standard, the authors of the Standard did not want to define the behavior of constructs which would--in many programming fields--be symptomatic of errors.
In many cases, it will be much easier to trap on actions which use pointer arithmetic to compute address of unintended objects than to somehow record the fact that the pointers cannot be used to access the storage they identify, but could be modified so that they could access other storage. Except in the case of arrays within larger (two-dimensional) arrays, an implementation would be allowed to reserve space that's "just past" the end of every object. Given something like doSomethingWithItem(someArray+i);, an implementation could trap any attempt to pass any address which doesn't point to either an element of the array or the space just past the last element. If the allocation of someArray reserved space for an extra unused element, and doSomethingWithItem() only accesses the item to which it receives a pointer, the implementation could relatively inexpensively ensure that any non-trapped execution of the above code could--at worst--access otherwise-unused storage.
The ability to compute "just-past" addresses makes bounds checking more difficult than it otherwise would be (the most common erroneous situation about would be passing doSomethingWithItem() a pointer just past the end of the array, but behavior would be defined unless doSomethingWithItem would try to dereference that pointer--something the caller may be unable to prove). Because the Standard would allow compilers to reserve space just past the array in most cases, however, such allowance would allow implementations to limit the damage caused by untrapped errors--something that would likely not be practical if more generalized pointer arithmetic were allowed.
Obviously, dereferencing an invalid pointer causes undefined behavior. But what about simply storing an invalid memory address in a pointer variable?
Consider the following code:
const char* str = "abcdef";
const char* begin = str;
if (begin - 1 < str) { /* ... do something ... */ }
The expression begin - 1 evaluates to an invalid memory address. Note that we don't actually dereference this address - we simply use it in pointer arithmetic to test if it is valid. Nonetheless, we still have to load an invalid memory address into a register.
So, is this undefined behavior? I never thought it was, since a lot of pointer arithmetic seems to rely on this sort of thing, and a pointer is really nothing but an integer anyway. But recently I heard that even the act of loading an invalid pointer into a register is undefined behavior, since certain architectures will automatically throw a bus error or something if you do that. Can anyone point me to the relevant part of the C or C++ standard which settles this either way?
I have the C Draft Standard here, and it makes it undefined by omission. It defines the case of ptr + I at 6.5.6/8 for
If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression.
Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object.
Your case does not fit any of these. Neither is your array large enough to have -1 adjust the pointer to point to a different array element, nor does any of the result or original pointer point one-past-end.
Your code is undefined behavior for a different reason:
the expression begin - 1 does not yield an invalid pointer. It is undefined behavior. You are not allowed to perform pointer arithmetics beyond the bounds of the array you're working on. So it is the subtraction itself that is invalid, and not the act of storing the resulting pointer.
Some architectures have dedicated registers for holding pointers. Putting the value of an unmapped address into such a register is allowed to crash. Integer overflow/underflow is allowed to crash. Because C aims to work on a broad variety of platforms, pointers provide a mechanism for safely programming unsafe circuits.
If you know you won't be running on exotic hardware with such finicky characteristics, you don't need to worry about what is undefined by the language. It is well-defined by the platform.
Of course, the example is poor style and there isn't a good reason to do it.
Any use of an invalid pointer yields undefined behaviour. I don't have the C Standard here at work, but see 'invalid pointers' in the Rationale: http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf
$5.7/6 - "Unless both pointers point
to elements of the same array object,
or one past the last element of the
array object, the behavior is
undefined.75)"
Summary, it is undefined even if you do not dereference the pointer.
The correct answers have been given years ago, but I find it interesting that the C99 rationale [sec. 6.5.6, last 3 paragraphs] explains why the standard endorses adding 1 to a pointer that points to the last element of an array (p+1):
An important endorsement of widespread practice is the requirement that a pointer can always be incremented to just past the end of an array, with no fear of overflow or wraparound
and why p-1 is not endorsed:
In the case of p-1, on the other hand, an entire object would have to be allocated prior to the array of objects that p traverses, so decrement loops that run off the bottom of an array can fail. This restriction allows segmented architectures, for instance, to place objects at the start of a range of addressable memory.
So if the pointer p points to an object at the start of a range of addressable memory, which is endorsed by this comment, then p-1 would generate an underflow.
Note that integer overflow is the standard's example for undefined behavior [sec. 3.4.3], as it depends on the translation environment and the operating environment. I believe it is easy to see that this dependence on the environment extends to pointer underflow.
This is why the standard explicitly makes it undefined behavior [in 6.5.6/8], as noted by other answers here. To cite that sentence:
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
See also [sec. 6.3.2.3, last 4 paragraphs] of the C99 rationale, which gives a more detailed description of how invalid pointers can be generated, and what effects that may have.
Yes, it's undefined behavior. See the accepted answer to this closely related question. Assigning an invalid pointer to a variable, comparing an invalid pointer, casting an invalid pointer triggers undefined behavior.