Why is overflow in C++ arrays operating system/compiler dependent? [duplicate] - c++

This question already has answers here:
Undefined, unspecified and implementation-defined behavior
(9 answers)
Closed 2 years ago.
I have a question regarding C++ arrays. I am learning C++ and in the "Arrays" section I came across something that I could not understand.
Suppose we have an integer array with 5 elements. The instructor says that in the next line, if we try to read input into index 5 of the array (one past the last valid index, 4), we cannot know what will happen (e.g. a crash or whatnot) because it is operating system and compiler dependent.
int testArray[5] {0};
std::cin >> testArray[5];
Can you enlighten me why this situation is OS/compiler dependent?

C++ is a systems language. It doesn't mandate any checking on operations: a correctly implemented program would only be slowed down by unnecessary checks, e.g., to verify that an array access is within bounds. The checks would consist of a conditional potentially needed on every access (unless the optimizer can see that some checks can be elided), and potentially also extra memory, as the system would need to store the size of the array for the checks to consult.
As there are no checks, you may end up manipulating bytes outside the memory allocated for the array in question. That may overwrite values used for something else, or it may access memory beyond a page boundary. In the latter case the system may be set up to raise a segmentation fault. In the former case it may touch otherwise unused memory and have no ill effect, or it may modify the stack in a way that prevents the program from correctly returning from a function. However, how the stack is laid out depends on many aspects, like the CPU used, the calling conventions on the given system, and various compiler preferences. To avoid impeding correct programs there is no mandated behavior when illegal operations are performed. Instead the behavior of the program becomes undefined and anything is allowed to happen.
Having undefined behavior is, of course, bad if you cause it. The simple solution is not to cause undefined behavior (yes, I know, that is easier said than done). However, leaving the behavior undefined is often good for performance, and it also allows different implementations to actually define the behavior to do something helpful. For example, it makes it legal for implementations to check for undefined behavior and report these cases. Detecting undefined behavior is, e.g., what -fsanitize=undefined, provided by some compilers, does (it won't catch every kind of undefined behavior, but it certainly detects some kinds).
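As a concrete illustration, here is a minimal sketch of the question's snippet rewritten with std::array and .at(); the checked access turns the out-of-bounds write into a defined, catchable exception, and the compiler flag mentioned in the comment is one way GCC and Clang can be asked to detect the original undefined access at run time.
// Compiling the original snippet with e.g. g++ -fsanitize=address,undefined
// makes the out-of-bounds write get reported at run time instead of silently
// corrupting memory. Alternatively, a checked container can be used:
#include <array>
#include <iostream>
#include <stdexcept>

int main() {
    std::array<int, 5> testArray{};    // five elements, value-initialized to 0

    try {
        std::cin >> testArray.at(5);   // .at() checks the index and throws instead of invoking UB
    } catch (const std::out_of_range& e) {
        std::cerr << "index out of range: " << e.what() << '\n';
    }
}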

Arrays are not meant to be indexed outside their bounds. C++ made the decision not to actually check these indexes, which means that you can read outside the bounds of an array. The language simply tells you that you shouldn't do that, and delegates the out-of-bounds checks to the programmer for added speed.
Since compilers, operating systems and many other factors can determine how memory is laid out, reading out of the bounds of an array can do basically anything, including giving garbage values, segfaulting, or summoning nasal demons.

Related

Why does this line of code cause a computer to crash?

Why does this line of code cause the computer to crash? What happens at the memory level?
for(int *p=0; ;*(p++)=0)
;
I have found the "answer" on Everything2, but I want a specific technical answer.
This code formally sets an integer pointer to null, then writes a 0 to the integer it points to and increments the pointer, looping forever.
The null pointer is not pointing to anything, so writing a 0 through it is undefined behavior (i.e. the standard doesn't say what should happen). Also, you're not allowed to use pointer arithmetic outside arrays, so even just the increment is undefined behavior.
Undefined behavior means that the compiler and library authors don't need to care at all about these cases and the system is still a valid C/C++ implementation. If a programmer does anything classified as undefined behavior then whatever happens happens, and he or she cannot blame the compiler and library authors. A programmer entering the undefined behavior realm cannot expect an error message or a crash, but cannot complain about getting one either (even one million executed instructions later).
On systems where the null pointer is represented as all zero bits and there is no support for memory protection, the effect of such a loop could be to start wiping all addressable memory, until some vital part of memory such as an interrupt table is corrupted, or until the code writes zeros over itself, self-destructing. On other systems with memory protection (most common desktop systems today) execution may instead simply stop at the very first write operation.
Undoubtedly, the cause of the problem is that p has not been assigned a reasonable address.
If you don't properly initialize a pointer before writing to where it points, you are probably going to get Bad Things™.
It could merely segfault, or it could overwrite something important, like the function's return address, in which case the segfault might not occur until the function attempts to return.
In the 1980s, a theoretician I worked with wrote a program for the 8086 to, once a second, write one word of random data at a randomly computed address. The computer was a process controller with watchdog protection and various types of output. The question was: How long would the system run before it ceased usefully functioning? The answer was hours and hours! This was a vivid demonstration that most of memory is rarely accessed.
It may cause an OS to crash, or it may do any number of other things. You are invoking undefined behavior. You don't own the memory at address 0 and you don't own the memory past it. You're just trouncing on memory that doesn't belong to you.
It works by overwriting all the memory at all the addresses, starting from 0 and going upwards. Eventually it will overwrite something important.
On any modern system, this will only crash your program, not the entire computer. CPUs designed since, oh, 1985 or so, have this feature called virtual memory which allows the OS to redirect your program's memory addresses. Most addresses aren't directed anywhere at all, which means trying to access them will just crash your program - and the ones that are directed somewhere will be directed to memory that is allocated to your program, so you can only crash your own program by messing with them.
On much older systems (older than 1985, remember!), there was no such protection and this loop could access memory addresses allocated to other programs and the OS.
The loop is not necessary to explain what's wrong. We can simply look at only the first iteration.
int *p = 0; // Declare a null pointer
*p = 0; // Write to null pointer, causing UB
The second line is causing undefined behavior, since that's what happens when you write to a null pointer.
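For contrast, a tiny sketch of the well-defined version: once the pointer refers to an actual object, the same write is perfectly legal.
int main() {
    int x = 5;
    int *p = &x;   // p now holds the address of a real object
    *p = 0;        // well-defined: writes 0 into x
    return x;      // returns 0
}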

Why is out-of-bounds pointer arithmetic undefined behaviour?

The following example is from Wikipedia.
int arr[4] = {0, 1, 2, 3};
int* p = arr + 5; // undefined behavior
If I never dereference p, then why is arr + 5 alone undefined behaviour? I expect pointers to behave as integers - with the exception that when dereferenced the value of a pointer is considered as a memory address.
That's because pointers don't behave like integers. It's undefined behavior because the standard says so.
On most platforms, however (if not all), you won't get a crash or run into dubious behavior if you don't dereference the pointer. But then, if you don't dereference it, what's the point of doing the addition?
That said, note that an expression going one past the end of an array is technically 100% "correct" and guaranteed not to crash per §5.7 ¶5 of the C++11 spec. However, the resulting pointer may only be compared with or used in further arithmetic, not dereferenced; any expression going more than one past the array bounds is explicitly undefined behavior.
Note: that does not mean it is safe to read or write at that one-past-the-end offset. Doing so is still undefined: you will likely be touching data that does not belong to that array and causing state/memory corruption. It only means that computing the pointer itself is not an overflow.
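To make the distinction concrete, here is a small sketch (reusing the four-element array from the question) in which the one-past-the-end pointer is used the way the standard intends: as a boundary that is compared against but never dereferenced.
#include <algorithm>

int main() {
    int arr[4] = {0, 1, 2, 3};
    int* end = arr + 4;                          // one past the end: a valid pointer value
    bool found = std::find(arr, end, 2) != end;  // fine: 'end' is only compared, never dereferenced
    // int* bad = arr + 5;                       // two past the end: undefined even without a dereference
    return found ? 0 : 1;
}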
My guess is that it's like that because it's not only dereferencing that can go wrong: pointer arithmetic, comparing pointers, and so on can too. So it's just easier to say "don't do this" instead of enumerating the situations where it can be dangerous.
The original x86 could have issues with such statements. In 16-bit code, pointers are 16+16 bits. If you add an offset to the lower 16 bits, you might need to deal with overflow and change the upper 16 bits. That was a slow operation and best avoided.
On those systems, array_base+offset was guaranteed not to overflow, if offset was in range (<=array size). But array+5 would overflow if array contained only 3 elements.
The consequence of that overflow is that you got a pointer which doesn't point behind the array, but before. And that might not even be RAM, but memory-mapped hardware. The C++ standard doesn't try to limit what happens if you construct pointers to random hardware components, i.e. it's Undefined Behavior on real systems.
If arr happens to be right at the end of the machine's memory space then arr+5 might be outside that memory space, so the pointer type might not be able to represent the value i.e. it might overflow, and overflow is undefined.
"Undefined behavior" doesn't mean it has to crash on that line of code, but it does mean that you can't make any guaranteed about the result. For example:
int arr[4] = {0, 1, 2, 3};
int* p = arr + 5; // I guess this is allowed to crash, but that would be a rather
// unusual implementation choice on most machines.
*p; //may cause a crash, or it may read data out of some other data structure
assert(arr < p); // this statement may not be true
// (arr may be so close to the end of the address space that
// adding 5 overflowed the address space and wrapped around)
assert(p - arr == 5); //this statement may not be true
//the compiler may have assigned p some other value
I'm sure there are many other examples you can throw in here.
Some systems, very rare systems that I can't name, will trap when you increment past boundaries like that. Further, leaving it undefined allows implementations that provide boundary protection to exist... though again, I can't think of one.
Essentially, you shouldn't be doing it, and therefore there's no reason to specify what happens when you do. Specifying what happens would put an unwarranted burden on the implementation provider.
The result you are seeing is because of the x86's segment-based memory protection. I find this protection justified: if you are incrementing the pointer address and storing it, at some future point in your code you will presumably dereference the pointer and use the value. So the compiler wants to avoid situations where you end up changing, or freeing, memory that is owned by someone else in your code. To avoid such scenarios, the restriction is in place.
In addition to hardware issues, another factor was the emergence of implementations which attempted to trap on various kinds of programming errors. Although many such implementations could be most useful if configured to trap on constructs which a program is known not to use, even though they are defined by the C Standard, the authors of the Standard did not want to define the behavior of constructs which would--in many programming fields--be symptomatic of errors.
In many cases it will be much easier to trap on actions which use pointer arithmetic to compute the address of unintended objects than to somehow record the fact that the pointers cannot be used to access the storage they identify but could be modified so that they could access other storage. Except in the case of arrays within larger (two-dimensional) arrays, an implementation would be allowed to reserve space that's "just past" the end of every object. Given something like doSomethingWithItem(someArray+i);, an implementation could trap any attempt to pass an address which doesn't point to either an element of the array or the space just past the last element. If the allocation of someArray reserved space for an extra unused element, and doSomethingWithItem() only accesses the item to which it receives a pointer, the implementation could relatively inexpensively ensure that any non-trapped execution of the above code could--at worst--access otherwise-unused storage.
The ability to compute "just-past" addresses makes bounds checking more difficult than it otherwise would be (the most common erroneous situation would be passing doSomethingWithItem() a pointer just past the end of the array, but the behavior would be defined unless doSomethingWithItem tried to dereference that pointer--something the caller may be unable to prove). Because the Standard allows compilers to reserve space just past the array in most cases, however, that allowance lets implementations limit the damage caused by untrapped errors--something that would likely not be practical if more generalized pointer arithmetic were allowed.

Why does strcpy "work" when writing to malloc'ed memory that is not large enough? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Why does this intentionally incorrect use of strcpy not fail horribly?
See the code below:
char* stuff = (char*)malloc(2);
strcpy(stuff,"abc");
cout<<"The size of stuff is : "<<strlen(stuff);
Even though I allocated only 2 bytes for stuff, why does strcpy still work, and why is the output of strlen 3? Shouldn't this throw something like an index-out-of-bounds error?
C and C++ don't do automatic bounds checking like Java and C# do. This code will overwrite stuff in memory past the end of the string, corrupting whatever was there. That can lead to strange behavior or crashes later, so it's good to be cautious about such things.
Accessing past the end of an array is deemed "undefined behavior" by the C and C++ standards. That means the standard doesn't specify what must happen when a program does that, so a program that triggers UB is in never-never-land where anything might happen. It might continue to work with no apparent problems. It might crash immediately. It might crash later when doing something else that shouldn't have been a problem. It might misbehave but not crash. Or velociraptors might come and eat you. Anything can happen.
Writing past the end of an array is called a buffer overflow, by the way, and it's a common cause of security flaws. If that "abc" string were actually user input, a skilled attacker could put bytes into it that end up overwriting something like the function's return pointer, which can be used to make the program run different code than it should, and do different things than it should.
You just overwrite heap memory; usually there is no crash, but bad things can happen later. C does not prevent you from shooting yourself in the foot; there is no such thing as an array-out-of-bounds check.
No, your char pointer now points to a string of length 3. Generally this would not cause any problems, but you might overwrite some critical memory region and cause the program to crash (you can expect to see a segmentation fault then), especially when you are performing such operations over a large amount of memory.
Here is a typical implementation of strcpy:
#include <cassert>

char *strcpy(char *strDestination, const char *strSource)
{
    assert(strDestination && strSource);
    char *strD = strDestination;
    while ((*strDestination++ = *strSource++) != '\0')
        ;   // empty body: the copy happens in the loop condition
    return strD;
}
You should ensure the destination has enough space. However, that is just how it is.
strcpy does not check for sufficient space in strDestination before copying strSource.
It also does not perform any bounds checking, and thus risks overrunning both the source and the destination; it is a potential cause of buffer overruns.
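For comparison, a hedged sketch of the correct pattern: allocate room for the string plus its terminating '\0' before copying (the variable names here just mirror the question's snippet).
#include <cstdlib>
#include <cstring>
#include <iostream>

int main() {
    const char* src = "abc";
    char* stuff = static_cast<char*>(std::malloc(std::strlen(src) + 1));  // +1 for the terminating '\0'
    if (stuff != nullptr) {
        std::strcpy(stuff, src);   // now the destination really is large enough
        std::cout << "The size of stuff is : " << std::strlen(stuff) << '\n';
        std::free(stuff);
    }
    return 0;
}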

Why does the compiler not complain about accessing elements beyond the bounds of a dynamic array? [duplicate]

This question already has answers here:
Accessing an array out of bounds gives no error, why?
(18 answers)
Closed 7 years ago.
I am defining an array of size 9, but when I access array index 10 it does not give any error.
int main() {
    bool* isSeedPos = new bool[9];
    isSeedPos[10] = true;
}
I expected to get a compiler error, because there is no array element isSeedPos[10] in my array.
Why don't I get an error?
It's not a problem.
There is no bounds checking on C++ arrays. You are able to access elements beyond the array's limit, but doing so is an error (it gives undefined behavior).
If you want to use a plain array, you have to do the bounds checking yourself, for example by keeping the size in a separate variable.
Of course, a better solution would be to use the standard library containers such as std::vector.
With std::vector you can either
use the myVector.at(i) method to get the i-th element (which will throw an exception if you are out of bounds), or
use myVector[i] with the same syntax as C-style arrays, but then you have to do the bounds checking yourself (e.g. check if (i < myVector.size()) before accessing it).
Also note that in your case, std::vector<bool> is a specialized version implemented so that each bool takes only one bit of memory (therefore it uses less memory than an array of bool, which may or may not be what you want). A short sketch follows below.
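Here is a minimal sketch of both access styles applied to the question's example, assuming a std::vector<bool> of nine elements; only .at() turns the bad index into something observable (an exception) rather than undefined behavior.
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<bool> isSeedPos(9, false);

    isSeedPos[5] = true;              // unchecked, like a raw array: the index must be valid
    try {
        isSeedPos.at(10) = true;      // checked: 10 is out of range, so this throws
    } catch (const std::out_of_range& e) {
        std::cout << "out of range: " << e.what() << '\n';
    }
}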
Use std::vector instead. Some implementations will do bounds checking in debug mode.
No, the compiler is not required to emit a diagnostic for this case. The compiler does not perform bounds checking for you.
It is your responsibility to make sure that you don't write broken code like this, because the compiler will not error on it.
Unlike in other languages such as Java and Python, array access is not bounds-checked in C or C++. That makes accessing arrays faster. It is your responsibility to make sure that you stay within bounds.
However, in such a simple case such as this, some compilers can detect the error at compile time.
Also, some tools such as valgrind can help you detect such errors at run time.
What compiler/debugger are you using?
MSVC++ would complain about it and tell you that you are writing out of the bounds of an array.
But the standard does not require it to do so.
It can crash at any time; it causes undefined behaviour.
Primitive arrays do not do bounds checking. If you want bounds checking, you should use std::vector instead. You are accessing invalid memory after the end of the array, and it is purely by luck that it works.
There is no runtime check on the index you are giving; accessing element 10 is incorrect but possible. Two things can happen:
if you are "unlucky", this will not crash and will return some data located after your array.
if you are "lucky", the data after the array is not allocated by your program, so access to the requested address is forbidden. This will be detected by the operating system and will produce a "segmentation fault".
There is no rule stating that memory accesses are checked in C, plain and simple. When you ask for an array of bools, it might be faster for the operating system to give you a 16-bit or 32-bit array instead of a 9-bit one. This means that you might not even be writing or reading into someone else's space.
C++ is fast, and one of the reasons it is fast is because there are very few checks on what you are doing. If you ask for some memory, the programming language will assume that you know what you are doing, and if the operating system does not complain, then everything will run.
There is no problem! You are just accessing memory that you shouldn't access. You get access to memory after the array.
isSeedPos doesn't know how big the array is. It is just a pointer to a position in memory. When you access isSeedPos[10] the behaviour is undefined. Chances are that sooner or later this will cause a segfault, but there is no requirement for a crash, and there is certainly no standard error checking.
Writing to that position is dangerous.
But the compiler will let you do it. Effectively you're writing past the end of the memory assigned to that array, which is not a good thing.
C++ isn't a lot like many other languages: it assumes that you know what you are doing!
Both C and C++ let you write to arbitrary areas of memory. This is because they originally derived from (and are still used for) low-level programming where you may legitimately want to write to a memory mapped peripheral, or similar, and because it's more efficient to omit bounds checking when the programmer already knows the value will be within (eg. for a loop 0 to N over an array, he/she knows 0 and N are within the bounds, so checking each intermediate value is superfluous).
However, in truth, nowadays you rarely want to do that. If you use the arr[i] syntax, you essentially always want to write to the array declared in arr, and never do anything else. But you still can if you want to.
If you do write to arbitrary memory (as you do in this case), either that memory will be part of your program, and the write will change some other critical data without you knowing it (either now, or later when you make a change to the code and have forgotten what you were doing); or it will be memory not allocated to your program, and the OS will shut the program down to prevent worse problems.
Nowadays:
Many compilers will spot it if you make an obvious mistake like this one
There are tools which will test if your program writes to unallocated memory
You can and should use std::vector instead, which is there for the 99% of the time you want bounds checking. (Check whether you're using at() or [] to access it)
This is not Java. In C or C++ there is no bounds checking; it's pure luck that you can write to that index.

Array index out of bound behavior

Why does C/C++ differentiate between these cases of out-of-bounds array indexing?
#include <stdio.h>

int main()
{
    int a[10];
    a[3] = 4;
    a[11] = 3;    // does not give segmentation fault
    a[25] = 4;    // does not give segmentation fault
    a[20000] = 3; // gives segmentation fault
    return 0;
}
I understand that it's trying to access memory allocated to process or thread in case of a[11] or a[25] and it's going out of stack bounds in case of a[20000].
Why doesn't the compiler or linker give an error? Aren't they aware of the array size? If not, then how does sizeof(a) work correctly?
The problem is that C/C++ doesn't actually do any boundary checking with regards to arrays. It depends on the OS to ensure that you are accessing valid memory.
In this particular case, you are declaring a stack-based array. Depending upon the particular implementation, accessing outside the bounds of the array will simply access another part of the already allocated stack space (most OSes and threads reserve a certain portion of memory for the stack). As long as you just happen to be playing around in the pre-allocated stack space, nothing will crash (note I did not say work).
What's happening on the last line is that you have now accessed beyond the part of memory that is allocated for the stack. As a result you are indexing into a part of memory that is not allocated to your process or is allocated in a read only fashion. The OS sees this and sends a seg fault to the process.
This is one of the reasons that C/C++ is so dangerous when it comes to boundary checking.
The segfault is not an intended action of your C program that would tell you that an index is out of bounds. Rather, it is an unintended consequence of undefined behavior.
In C and C++, if you declare an array like
type name[size];
You are only allowed to access elements with indexes from 0 up to size-1. Anything outside of that range causes undefined behavior. If the index is near the range, most probably you will read your own program's memory. If the index is far out of range, most probably your program will be killed by the operating system. But you can't know; anything can happen.
Why does C allow that? Well, the basic gist of C and C++ is not to provide features if they cost performance. C and C++ have been used for ages for highly performance-critical systems. C has been used as an implementation language for kernels and programs where access out of array bounds can be useful to get fast access to objects that lie adjacent in memory. Having the compiler forbid this would be for naught.
Why doesn't it warn about that? Well, you can raise the warning levels and hope for the compiler's mercy. This is called quality of implementation (QoI). If a compiler uses open behavior (like undefined behavior) to do something good, it has a good quality of implementation in that regard.
[js#HOST2 cpp]$ gcc -Wall -O2 main.c
main.c: In function 'main':
main.c:3: warning: array subscript is above array bounds
[js#HOST2 cpp]$
If it instead formatted your hard disk upon seeing the array accessed out of bounds (which would be legal for it), the quality of implementation would be rather bad. I enjoyed reading about that stuff in the ANSI C Rationale document.
You generally only get a segmentation fault if you try to access memory your process doesn't own.
What you're seeing in the case of a[11] (and a[10], by the way) is memory that your process does own but that doesn't belong to the a[] array. a[20000] is so far from a[] that it's probably outside your memory altogether.
Changing a[11] is far more insidious as it silently affects a different variable (or the stack frame which may cause a different segmentation fault when your function returns).
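A hedged sketch of that insidious case; whether b actually ends up adjacent to a is entirely up to the compiler and the stack layout, which is exactly why the effect is silent and unpredictable.
int main(void)
{
    int a[10];
    int b = 7;    /* b may or may not be placed right after a on the stack */
    a[11] = 42;   /* undefined behavior: depending on the layout this may silently
                     clobber b, hit padding, or corrupt the stack frame */
    return b;     /* might return 7, might return 42, might misbehave later */
}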
C isn't doing this. The OS's virtual memory subsystem is.
In the case where you are only slightly out of bounds, you are addressing memory that is allocated for your program (on the call stack in this case). In the case where you are far out of bounds, you are addressing memory not given over to your program, and the OS throws a segmentation fault.
On some systems there is also an OS-enforced concept of "writeable" memory, and you might be trying to write to memory that you own but that is marked unwriteable.
Just to add to what other people are saying, you cannot rely on the program simply crashing in these cases; there is no guarantee of what will happen if you attempt to access a memory location beyond the "bounds of the array". It's just the same as if you did something like:
int *p;
p = (int *)135; // an arbitrary made-up address; the cast is needed for this to even compile
*p = 14;        // writing through it is undefined behavior
That is just random; this might work. It might not. Don't do it. Code to prevent these sorts of problems.
As litb mentioned, some compilers can detect some out-of-bounds array accesses at compile time. But bounds checking at compile time won't catch everything:
int a[10];
int i = some_complicated_function();
printf("%d\n", a[i]);
To detect this, runtime checks would have to be used, and they're avoided in C because of their performance impact. Even with knowledge of a's array size at compile time, i.e. sizeof(a), it can't protect against that without inserting a runtime check.
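A minimal sketch of what such a hand-written runtime check looks like (some_complicated_function here is just a stand-in returning an arbitrary index); this conditional is exactly the cost C declines to impose on every access automatically.
#include <stdio.h>

static int some_complicated_function(void) { return 25; }  /* stand-in: any index is possible */

int main(void)
{
    int a[10] = {0};
    int i = some_complicated_function();
    int n = (int)(sizeof(a) / sizeof(a[0]));   /* 10 */

    if (i >= 0 && i < n)                       /* the runtime check the compiler will not insert */
        printf("%d\n", a[i]);
    else
        printf("index %d is out of bounds\n", i);

    return 0;
}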
As I understand the question and comments, you understand why bad things can happen when you access memory out of bounds, but you're wondering why your particular compiler didn't warn you.
Compilers are allowed to warn you, and many do at the highest warning levels. However the standard is written to allow people to run compilers for all sorts of devices, and compilers with all sorts of features so the standard requires the least it can while guaranteeing people can do useful work.
There are a few times the standard requires that a certain coding style will generate a diagnostic. There are several other times where the standard does not require a diagnostic. Even when a diagnostic is required I'm not aware of any place where the standard says what the exact wording should be.
But you're not completely out in the cold here. If your compiler doesn't warn you, Lint may. Additionally, there are a number of tools to detect such problems (at run time) for arrays on the heap, one of the more famous being Electric Fence (or DUMA). But even Electric Fence doesn't guarantee it will catch all overrun errors.
That's not a C issue, it's an operating system issue. Your program has been granted a certain memory space and anything you do inside of that is fine. The segmentation fault only happens when you access memory outside of your process space.
Not all operating systems have separate address spaces for each process, in which case you can corrupt the state of another process or of the operating system with no warning.
The C philosophy is to always trust the programmer. Also, not checking bounds allows the program to run faster.
As JaredPar said, C/C++ doesn't always perform range checking. If your program accesses a memory location outside your allocated array, your program may crash, or it may not because it is accessing some other variable on the stack.
To answer your question about sizeof operator in C:
You can reliably use sizeof(array)/sizeof(array[0]) to determine the array size, but using it doesn't mean the compiler will perform any range checking.
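A tiny sketch of that idiom; the caveat in the comment (about arrays decaying to pointers) is worth keeping in mind, since the trick only works where the true array type is visible.
#include <cstdio>

int main() {
    int a[10];
    std::size_t n = sizeof(a) / sizeof(a[0]);   // n == 10: 'a' is a real array here, so sizeof sees the whole object
    std::printf("%zu\n", n);
    // Caution: in a function that receives the array as 'int *p', sizeof(p) / sizeof(p[0])
    // would not give 10; the size information is lost once the array decays to a pointer.
    return 0;
}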
My research showed that C/C++ developers believe that you shouldn't pay for something you don't use, and they trust the programmers to know what they are doing. (see accepted answer to this: Accessing an array out of bounds gives no error, why?)
If you can use C++ instead of C, maybe use vector? You can use vector[] when you need the performance (but with no range checking) or, more preferably, use vector.at() (which has range checking at the cost of performance). Note that neither of these grows the vector; to append elements safely, use push_back(), which automatically increases the capacity when necessary.
More information on vector: http://www.cplusplus.com/reference/vector/vector/