Memory allocation mental map clarification - c++

I want to clarify my mental map on memory allocation.
Let's suppose I have the following array:
int arr[] = {1,2,3};
Let's suppose each integer occupies 4 bytes in memory,
so that the memory addresses of the integers could be:
HHH01 HHH05 HHH09
Will the memory chunk of arr be a superset of the memory chunks of each integer?

Strictly speaking, IIRC, the standard pins down less than you might expect, and that's important because dipping into undefined behavior leads to some of the hardest and most obscure bugs to track down. The C++ standard does guarantee that the elements of an array are stored contiguously, so the storage of arr is exactly the storage of its three elements. What it does not specify is which addresses you get or how large an int is. As long as the implementation can perform the arithmetic needed to find and dereference the right elements, anything beyond that should be considered safely abstracted away.
With that said, I think the answer for most (if not all?) practical purposes is that your understanding is correct. If you were to cout << &(arr[0]); cout << &(arr[2]); you'd get the addresses you expect in any compiler that I've used, and the amount of memory allocated will be the amount you'd expect. Doing cout << &(arr[3]) will even give you a valid address (one past the end of the array), but actually reading arr[3] is undefined behavior; in practice you'd typically see garbage. The only thing to be aware of is that different compilers and operating systems can use different sizes and alignments for those elements. An int might be 2 bytes on one platform and 4 on another, so the spacing between the addresses you see will always be sizeof(int), whatever that happens to be for your compiler.
In the end, while this can be interesting from an academic point of view... making actual use of it should be avoided at essentially all costs. If you start trying to manually access memory locations, it's likely to come back and bite you or whoever has to maintain your code down the line somewhere.
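For what it's worth, here is a small sketch you could compile to check the layout on your own compiler. It measures the byte offset of each element from the start of arr and compares it with multiples of sizeof(int). The names byte_offset and check_layout are made up for this example; nothing here goes beyond the standard headers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte distance from the first element to element i.
// i may be at most the array length (one past the end is only computed,
// never dereferenced).
std::size_t byte_offset(const int* base, std::size_t i) {
    const unsigned char* first = reinterpret_cast<const unsigned char*>(base);
    const unsigned char* elem = reinterpret_cast<const unsigned char*>(base + i);
    return static_cast<std::size_t>(elem - first);
}

// Checks that arr occupies exactly the storage of its three elements.
void check_layout() {
    int arr[] = {1, 2, 3};
    static_assert(sizeof(arr) == 3 * sizeof(int), "array is exactly three ints");
    assert(byte_offset(arr, 1) == 1 * sizeof(int));
    assert(byte_offset(arr, 2) == 2 * sizeof(int));
    assert(byte_offset(arr, 3) == sizeof(arr));  // one past the end
}
```

Calling check_layout() from main should pass silently whether int is 2, 4 or 8 bytes, since everything is expressed in terms of sizeof(int).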

Related

What is the purpose of allocating a specific amount of memory for arrays in C++?

I'm a student taking a class on Data Structures in C++ this semester and I came across something that I don't quite understand tonight. Say I were to create a pointer to an array on the heap:
int* arrayPtr = new int [4];
I can access this array using pointer syntax
int value = *(arrayPtr + index);
But if I were to add another value to the memory position immediately after the end of the space allocated for the array, I would then be able to access it
*(arrayPtr + 4) = 0;
int nextPos = *(arrayPtr + 4);
//the value of nextPos will be 0, or whatever value I previously filled that space with
The position in memory of *(arrayPtr + 4) is past the end of the space allocated for the array. But as far as I understand, the above still would not cause any problems. So aside from it being a requirement of C++, why even give arrays a specific size when declaring them?
When you go past the end of allocated memory, you are actually accessing memory of some other object (or memory that is free right now, but that could change later). So, it will cause you problems. Especially if you'll try to write something to it.
I can access this array using pointer syntax
int value = *(arrayPtr + index);
Yeah, but don't. Use arrayPtr[index]
The position in memory of *(arrayPtr + 4) is past the end of the space allocated for the array. But as far as I understand, the above still would not cause any problems.
You understand wrong. Oh so very wrong. You're invoking undefined behavior and undefined behavior is undefined. It may work for a week, then break one day next week and you'll be left wondering why. If you don't know the collection size in advance use something dynamic like a vector instead of an array.
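A rough sketch of the vector suggestion, in case it helps. The helper names can_read and make_values are made up for illustration. The point is that std::vector::at() reports an out-of-range index by throwing std::out_of_range, and push_back is the legitimate way to get a fifth slot.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns true if values[index] is in bounds.
// at() throws std::out_of_range instead of invoking undefined behavior.
bool can_read(const std::vector<int>& values, std::size_t index) {
    try {
        (void)values.at(index);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}

// Hypothetical helper: starts with four elements, like new int[4],
// then grows by one element the legal way.
std::vector<int> make_values() {
    std::vector<int> values(4, 0);
    values.push_back(0);  // the vector reallocates if it needs to
    assert(values.size() == 5);
    return values;
}
```

With a four-element vector, can_read(v, 4) is false, while after push_back index 4 is a perfectly valid element.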
Yes, in C/C++ you can access memory outside of the space you claim to have allocated. Sometimes. This is what is referred to as undefined behavior.
Basically, you have told the compiler and the memory management system that you want space to store four integers, and the memory management system allocated space for you to store four integers. It gave you a pointer to that space. In the memory manager's internal accounting, those bytes of ram are now occupied, until you call delete[] arrayPtr;.
However, the memory manager has not allocated that next byte for you. You don't have any way of knowing, in general, what that next byte is, or who it belongs to.
In a simple example program like your example, which just allocates a few bytes, and doesn't allocate anything else, chances are, that next byte belongs to your program, and isn't occupied. If that array is the only dynamically allocated memory in your program, then it's probably, maybe safe to run over the end.
But in a more complex program, with multiple dynamic memory allocations and deallocations, especially near the edges of memory pages, you really have no good way of knowing what any bytes outside of the memory you asked for contain. So when you write to bytes outside of the memory you asked for in new you could be writing to basically anything.
This is where undefined behavior comes in. Because you don't know what's in that space you wrote to, you don't know what will happen as a result. Here's some examples of things that could happen:
The memory was not allocated when you wrote to it. In that case, the data is fine, and nothing bad seems to happen. However, if a later memory allocation uses that space, anything you tried to put there will be lost.
The memory was allocated when you wrote to it. In that case, congratulations, you just overwrote some random bytes from some other data structure somewhere else in your program. Imagine replacing a variable somewhere in one of your objects with random data, and consider what that would mean for your program. Maybe a list somewhere else now has the wrong count. Maybe a string now has some random values for the first few characters, or is now empty because you replaced those characters with zeroes.
The array was allocated at the edge of a page, so the next bytes don't belong to your program. The address is outside your program's allocation. In this case, the OS detects you accessing random memory that isn't yours, and terminates your program immediately with SIGSEGV.
Basically, undefined behavior means that you are doing something illegal, but because C/C++ is designed to be fast, the language designers don't include an explicit check to make sure you don't break the rules, like other languages (e.g. Java, C#). They just list the behavior of breaking the rules as undefined, and then the people who make the compilers can have the output be simpler, faster code, since no array bounds checks are made, and if you break the rules, it's your own problem.
So yes, this sometimes works, but don't ever rely on it.
It would not cause any problems in a purely abstract setting, where you only worry about whether the logic of the algorithm is sound. In that case there's no reason to declare the size of an array at all. However, your computer exists in the physical world, and only has a limited amount of memory. When you're allocating memory, you're asking the operating system to let you use some of the computer's finite memory. If you go beyond that, the operating system should stop you, usually by killing your process/program.
Yes, you should write it as arrayPtr[index]. The position in memory of *(arrayPtr + 4) is past the end of the space you allocated for the array. It's a limitation of C++ that an array's size can't be extended once it has been allocated.

dangers of heap overflows?

I have a question about heap overflows.
I understand that if a stack variable overruns its buffer, it could overwrite the EIP and ESP values and, for example, make the program jump to a place where the coder did not expect it to jump.
This seems, as I understand it, to happen because of the backward little-endian storage (where, e.g., the characters in an array are stored "backwards", from last to first).
If you on the other hand put that array into the heap, which grows contra the stack, and you would overflow it, would it just write random garbage into empty memory space then? (Unless you were on Solaris, which as far as I know is a big-endian system; side note.)
Would this basically be a danger, since it would just write into "empty space"?
So no targeted jumping to addresses and areas the code was not designed for?
Am I getting this wrong?
To specify my question:
I am writing a program where the user is meant to pass a string argument and a flag when executing it via command line, and I want to know if the user could perform a hack with this string argument when it is put on the heap with the malloc function.
If you on the other hand put that array into the heap, which grows contra the stack, and you would overflow it, would it just write random garbage into empty memory space then?
You are making a couple of assumptions:
You are assuming that the heap is at the end of the main memory segment. That ain't necessarily so.
You are assuming that the object in the heap is at the end of the heap. That ain't necessarily so. (In fact, it typically isn't so ...)
Here's an example that is likely to cause problems no matter how the heap is implemented:
char *a = malloc(100);
char *b = malloc(100);
char *c = malloc(100);
for (int i = 0; i < 200; i++) {
b[i] = 'Z';
}
Writing beyond the end of b is likely to trample either a or c ... or some other object in the heap, or the free list.
Depending on what objects you trample, you may overwrite function pointers, or you may do other damage that results in segmentation faults, unpredictable behaviour and so on. These things could be used for code injection, to cause the code to malfunction in other ways that are harmful from a security standpoint ... or just to implement a denial of service attack by crashing the target application / service.
There are various ways heap overflow could lead to code execution:
Most obvious - you overflow into another object that contains function pointers and get to overwrite one of them.
Slightly less obvious - the object you overflow into doesn't itself contain function pointers, but it contains pointers that will be used for writing, and you get to overwrite one of them to point to a function pointer so that a subsequent write overwrites a function pointer.
Exploiting heap bookkeeping structures - by overwriting the data that the heap allocator itself uses to track size and status of allocated/free blocks, you trick it into overwriting something valuable elsewhere in memory.
Etc.
For some advanced techniques, see:
http://packetstormsecurity.com/files/view/40638/MallocMaleficarum.txt
Even if you can't overwrite a return address, how do you feel about an attacker modifying the rest of your data? This shouldn't thrill you.
To answer your question generally: it is a very bad idea to let the user copy data anywhere without checking its size. You should absolutely never do that, especially on purpose.
If the user means no harm, they may crash your program, either by overwriting useful data, or by causing a page fault. If your user is malicious, you're potentially letting them hijack your system. Both are highly undesirable.
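To make the size check concrete for the command-line case in the question, here is a minimal sketch. The function name copy_argument and the limit kMaxArgLength are assumptions invented for this example; the idea is simply that the buffer size is derived from the input, and over-long input is rejected before anything is copied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Assumed limit for this sketch; pick whatever your program actually needs.
const std::size_t kMaxArgLength = 255;

// Hypothetical helper: copies a command-line argument into a malloc'd buffer.
// Returns nullptr if the argument is missing, too long, or allocation fails.
// The caller releases the result with std::free().
char* copy_argument(const char* arg) {
    if (arg == nullptr) return nullptr;
    const std::size_t length = std::strlen(arg);
    if (length > kMaxArgLength) return nullptr;
    char* buffer = static_cast<char*>(std::malloc(length + 1));
    if (buffer == nullptr) return nullptr;
    std::memcpy(buffer, arg, length + 1);  // copies the terminating '\0' too
    assert(buffer[length] == '\0');
    return buffer;
}
```

Because the allocation is sized from strlen(arg) and the copy is bounded by the same length, there is no fixed-size heap buffer for the user to overflow.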
Endianness does not matter to buffer overflows. Big endian machines are just as vulnerable as little-endian machines. The only difference will be the byte order of the malicious data.
You may be thinking instead of the direction the stack grows in, which is independent of endianness. In the case where it grows up, you won't be able to hijack the return address of the function that declares the buffer. However, if you pass that buffer address to any other function, and this function overflows instead, an attacker may change this function's return address. This would be the case, for instance, if you called memcpy or scanf or any other function to modify your buffer (assuming that the compiler didn't inline them).
The stack usually grows downwards. In this case, an attacker can use an overflow to hijack the return address of the function that declares it.
In other words, neither the stack configuration nor endianness offer meaningful protection against stack buffer overflows.
As for the heap:
If you on the other hand put that array into the heap, which grows contra the stack, and you would overflow it, would it just write random garbage into empty memory space then?
The answer, as almost always, is it depends, but probably not. The 32-bit implementation of malloc in glibc keeps bookkeeping structure at the end of the buffer (or at least, used to). By overflowing onto the bookkeeping structures with the correct incantations, when the allocation was freed, you could cause free to write four arbitrary bytes at an arbitrary location. This is a lot of power. This kind of exploit comes up regularly in capture-the-flag competitions and is very exploitable.

Why does my dynamically allocated array get initialized to 0?

I have some code that creates a dynamically allocated array with
int *Array = new int[size];
From what I understand, Array should be a pointer to the first item of Array in memory. When using gdb, I can call x Array to examine the value at the first memory location, x Array+1 to examine the second, etc. I expect to have junk values left over from whatever application was using those spots in memory prior to mine. However, using x Array returns 0x00000000 for all those spots. What am I doing wrong? Is my code initializing all of the values of the Array to zero?
EDIT: For the record, I ask because my program is an attempt to implement this: http://eli.thegreenplace.net/2008/08/23/initializing-an-array-in-constant-time/. I want to make sure that my algorithm isn't incrementing through the array to initialize every element to 0.
In most modern OSes, the OS gives zeroed pages to applications, as opposed to letting information seep between unrelated processes. That's important for security reasons, for example. Back in the old DOS days, things were a bit more casual. Today, with memory protected OSes, the OS generally gives you zeros to start with.
So, if this new happens early in your program, you're likely to get zeros. You'd be crazy to rely on that though; it's undefined behavior if you do.
If you keep allocating, filling, and freeing memory, eventually new will return memory that isn't zeroed. Rather, it'll contain remnants of your process' own earlier scribblings.
And there's no guarantee that any particular call to new, even at the beginning of your program, will return memory filled with zeros. You're just likely to see that for calls to new early in your program. Don't let that mislead you.
I expect to have junk values left over from whatever application was using those spots
It's certainly possible but by no means guaranteed. Particularly in debug builds, you're just as likely to have the runtime zero out that memory (or fill it with some recognisable bit pattern) instead, to help you debug things if you use the memory incorrectly.
And, really, "those spots" is a rather loose term, given virtual addressing.
The important thing is that, no, your code is not setting all those values to zero.
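If you want to see what it looks like when the code does zero the array, compare the two forms in this sketch. The helper names are made up for illustration. new int[n] with no initializer leaves the elements indeterminate, whereas new int[n]() value-initializes them, which means every element is set to zero.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper: the trailing "()" value-initializes the array,
// so every element is zeroed (an O(n) pass over the memory).
std::unique_ptr<int[]> make_zeroed(std::size_t n) {
    return std::unique_ptr<int[]>(new int[n]());
}

// Hypothetical helper: no initializer, so the language does not touch the
// elements. Their values are indeterminate and must not be read before
// being written, even if the OS happens to hand back zeroed pages.
std::unique_ptr<int[]> make_uninitialized(std::size_t n) {
    return std::unique_ptr<int[]>(new int[n]);
}

// Returns true if the first n ints at data are all zero.
bool all_zero(const int* data, std::size_t n) {
    assert(data != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] != 0) return false;
    }
    return true;
}
```

Only the first form is guaranteed to give zeros; the form in the question is the second one.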

What's the advantage of malloc?

What is the advantage of allocating memory for some data? We could use an array instead.
Like
int *lis;
lis = (int*) malloc ( sizeof( int ) * n );
/* Initialize LIS values for all indexes */
for ( i = 0; i < n; i++ )
lis[i] = 1;
we could have used an ordinary array.
Well I don't understand exactly how malloc works, what is actually does. So explaining them would be more beneficial for me.
And suppose we replace sizeof(int) * n with just n in the above code and then try to store integer values, what problems might i be facing? And is there a way to print the values stored in the variable directly from the memory allocated space, for example here it is lis?
Your question seems to rather compare dynamically allocated C-style arrays with variable-length arrays, which means that this might be what you are looking for: Why aren't variable-length arrays part of the C++ standard?
However the c++ tag yields the ultimate answer: use std::vector object instead.
As long as it is possible, avoid dynamic allocation and responsibility for ugly memory management ~> try to take advantage of objects with automatic storage duration instead. Another interesting reading might be: Understanding the meaning of the term and the concept - RAII (Resource Acquisition is Initialization)
"And suppose we replace sizeof(int) * n with just n in the above code and then try to store integer values, what problems might i be facing?"
- If you still consider n to be the amount of integers that it is possible to store in this array, you will most likely experience undefined behavior.
The more fundamental point, I think, apart from the stack vs heap and variable vs constant issues (and apart from the fact that you shouldn't be using malloc() in C++ to begin with), is that a local array ceases to exist when the function exits. If you return a pointer to it, that pointer is going to be useless as soon as the caller receives it, whereas memory dynamically allocated with malloc() or new will still be valid. You couldn't implement a function like strdup() using a local array, for instance, or sensibly implement a linked representation of a list or tree.
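As a sketch of the strdup() point: the copy has to live on the free store to outlive the function that makes it. The name duplicate is made up for this example, and std::unique_ptr is used here (my choice, not something from the question) so the memory is released automatically.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical strdup-like helper. The copy is dynamically allocated, so it
// stays valid after this function returns; a local char array would not.
std::unique_ptr<char[]> duplicate(const char* text) {
    assert(text != nullptr);
    const std::size_t size = std::strlen(text) + 1;  // +1 for the terminator
    std::unique_ptr<char[]> copy(new char[size]);
    std::memcpy(copy.get(), text, size);
    return copy;
}
```

The caller can keep using the returned buffer for as long as it holds the unique_ptr.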
The answer is simple. Local[1] arrays are allocated on your stack, which is a small pre-allocated region of memory for your program. Beyond a couple of thousand elements, you can't really do much on the stack. For larger amounts of data, you need to allocate memory outside the stack.
This is what malloc does.
malloc allocates a piece of memory as big as you ask it. It returns a pointer to the start of that memory, which could be treated similar to an array. If you write beyond the size of that memory, the result is undefined behavior. This means everything could work alright, or your computer may explode. Most likely though you'd get a segmentation fault error.
Reading values from the memory (for example for printing) is the same as reading from an array. For example printf("%d", lis[5]);.
Before C99 (I know the question is tagged C++, but probably you're learning C-compiled-in-C++), there was another reason too. There was no way you could have an array of variable length on the stack. (Even now, variable length arrays on the stack are not so useful, since the stack is small). That's why for variable amount of memory, you needed the malloc function to allocate memory as large as you need, the size of which is determined at runtime.
Another important difference between local arrays, or any local variable for that matter, is the life duration of the object. Local variables are inaccessible as soon as their scope finishes. malloced objects live until they are freed. This is essential in practically all data structures that are not arrays, such as linked-lists, binary search trees (and variants), (most) heaps etc.
An example of malloced objects are FILEs. Once you call fopen, the structure that holds the data related to the opened file is dynamically allocated using malloc and returned as a pointer (FILE *).
1 Note: Non-local arrays (global or static) are allocated before execution, so they can't really have a length determined at runtime.
I assume you are asking what the purpose of C's malloc() is.
Say you want to take an input from the user and then allocate an array of that size:
int n;
scanf("%d",&n);
int arr[n];
This is not allowed in standard C++ (or in C before C99) because n is not known at compile time. This is where malloc() comes in;
you may write:
int n;
scanf("%d",&n);
int* arr = malloc(sizeof(int)*n);
malloc() allocates the memory dynamically, in the heap area.
Some older programming environments did not provide malloc or any equivalent functionality at all. If you needed dynamic memory allocation you had to code it yourself on top of gigantic static arrays. This had several drawbacks:
The static array size put a hard upper limit on how much data the program could process at any one time, without being recompiled. If you've ever tried to do something complicated in TeX and got a "capacity exceeded, sorry" message, this is why.
The operating system (such as it was) had to reserve space for the static array all at once, whether or not it would all be used. This phenomenon led to "overcommit", in which the OS pretends to have allocated all the memory you could possibly want, but then kills your process if you actually try to use more than is available. Why would anyone want that? And yet it was hyped as a feature in mid-90s commercial Unix, because it meant that giant FORTRAN simulations that potentially needed far more memory than your dinky little Sun workstation had, could be tested on small instance sizes with no trouble. (Presumably you would run the big instance on a Cray somewhere that actually had enough memory to cope.)
Dynamic memory allocators are hard to implement well. Have a look at the jemalloc paper to get a taste of just how hairy it can be. (If you want automatic garbage collection it gets even more complicated.) This is exactly the sort of thing you want a guru to code once for everyone's benefit.
So nowadays even quite barebones embedded environments give you some sort of dynamic allocator.
However, it is good mental discipline to try to do without. Over-use of dynamic memory leads to inefficiency, of the kind that is often very hard to eliminate after the fact, since it's baked into the architecture. If it seems like the task at hand doesn't need dynamic allocation, perhaps it doesn't.
However however, not using dynamic memory allocation when you really should have can cause its own problems, such as imposing hard upper limits on how long strings can be, or baking nonreentrancy into your API (compare gethostbyname to getaddrinfo).
So you have to think about it carefully.
we could have used an ordinary array
In C++, arrays have a static size; so creating one from a run-time value:
int lis[n];
is not allowed. Some compilers allow this as a non-standard extension, but it is not part of standard C++; so if we want a dynamically sized array we have to allocate it dynamically.
In C, that would mean messing around with malloc; but you're asking about C++, so you want
std::vector<int> lis(n, 1);
to allocate an array of size n containing int values initialised to 1.
(If you like, you could allocate the array with new int[n], and remember to free it with delete [] lis when you're finished, and take extra care not to leak if an exception is thrown; but life's too short for that nonsense.)
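Spelled out as a minimal sketch, with n only known at run time. The helper names make_lis and sum are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: builds the "lis" table from the question with a
// run-time size, every entry initialised to 1.
std::vector<int> make_lis(std::size_t n) {
    std::vector<int> lis(n, 1);
    assert(lis.size() == n);
    return lis;  // the vector frees its own memory when it goes out of scope
}

// Adds up the entries, just to show the elements are ordinary ints.
int sum(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0);
}
```

There is no malloc, no free and no explicit loop to set the initial values.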
Well I don't understand exactly how malloc works, what is actually does. So explaining them would be more beneficial for me.
malloc in C and new in C++ allocate persistent memory from the "free store". Unlike memory for local variables, which is released automatically when the variable goes out of scope, this persists until you explicitly release it (free in C, delete in C++). This is necessary if you need the array to outlive the current function call. It's also a good idea if the array is very large: local variables are (typically) stored on a stack, with a limited size. If that overflows, the program will crash or otherwise go wrong. (And, in current standard C++, it's necessary if the size isn't a compile-time constant).
And suppose we replace sizeof(int) * n with just n in the above code and then try to store integer values, what problems might i be facing?
You haven't allocated enough space for n integers; so code that assumes you have will try to access memory beyond the end of the allocated space. This will cause undefined behaviour; a crash if you're lucky, and data corruption if you're unlucky.
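To put numbers on that, here is a small sketch of the arithmetic. The function names are made up for illustration; the result depends on sizeof(int) for your compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Bytes that malloc must be asked for in order to hold n ints.
std::size_t bytes_needed(std::size_t n) {
    // Guard against the multiplication wrapping around for huge n.
    assert(n <= std::numeric_limits<std::size_t>::max() / sizeof(int));
    return n * sizeof(int);
}

// Number of whole ints that actually fit in a buffer of the given size.
std::size_t ints_that_fit(std::size_t bytes) {
    return bytes / sizeof(int);
}
```

Assuming a 4-byte int, malloc(n) with n = 10 has room for only ints_that_fit(10) == 2 whole ints, so lis[2] through lis[9] would already be out of bounds.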
And is there a way to print the values stored in the variable directly from the memory allocated space, for example here it is lis?
You mean something like this?
for (i = 0; i < len; ++i) std::cout << lis[i] << '\n';

Why is out-of-bounds pointer arithmetic undefined behaviour?

The following example is from Wikipedia.
int arr[4] = {0, 1, 2, 3};
int* p = arr + 5; // undefined behavior
If I never dereference p, then why is arr + 5 alone undefined behaviour? I expect pointers to behave as integers - with the exception that when dereferenced the value of a pointer is considered as a memory address.
That's because pointers don't behave like integers. It's undefined behavior because the standard says so.
On most platforms however (if not all), you won't get a crash or run into dubious behavior if you don't dereference the pointer. But then, if you don't dereference it, what's the point of doing the addition?
That said, note that an expression going one over the end of an array is technically 100% "correct" and guaranteed not to crash per §5.7 ¶5 of the C++11 spec. However, the result of that expression is unspecified (just guaranteed not to be an overflow); while any other expression going more than one past the array bounds is explicitly undefined behavior.
Note: That does not mean it is safe to read and write from an over-by-one offset. You likely will be editing data that does not belong to that array, and will cause state/memory corruption. You just won't cause an overflow exception.
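The usual legitimate use of the one-past-the-end pointer is as an end marker for a loop, roughly like this sketch. The function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Sums the half-open range [first, last). last may be the one-past-the-end
// pointer: it is only compared against, never dereferenced.
int sum_range(const int* first, const int* last) {
    assert(first <= last);
    int total = 0;
    for (const int* p = first; p != last; ++p) {
        total += *p;
    }
    return total;
}

// Uses the array from the question.
int sum_example() {
    int arr[4] = {0, 1, 2, 3};
    return sum_range(arr, arr + 4);  // arr + 4 is fine to form; arr + 5 is not
}
```

This is the same convention the standard library uses for end iterators.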
My guess is that it's like that because it's not only dereferencing that's wrong. Also pointer arithmetics, comparing pointers, etc. So it's just easier to say don't do this instead of enumerating the situations where it can be dangerous.
The original x86 can have issues with such statements. In 16-bit code, pointers are 16+16 bits. If you add an offset to the lower 16 bits, you might need to deal with overflow and change the upper 16 bits. That was a slow operation and best avoided.
On those systems, array_base+offset was guaranteed not to overflow, if offset was in range (<=array size). But array+5 would overflow if array contained only 3 elements.
The consequence of that overflow is that you got a pointer which doesn't point behind the array, but before. And that might not even be RAM, but memory-mapped hardware. The C++ standard doesn't try to limit what happens if you construct pointers to random hardware components, i.e. it's Undefined Behavior on real systems.
If arr happens to be right at the end of the machine's memory space then arr+5 might be outside that memory space, so the pointer type might not be able to represent the value i.e. it might overflow, and overflow is undefined.
"Undefined behavior" doesn't mean it has to crash on that line of code, but it does mean that you can't make any guarantees about the result. For example:
int arr[4] = {0, 1, 2, 3};
int* p = arr + 5; // I guess this is allowed to crash, but that would be a rather
// unusual implementation choice on most machines.
*p; //may cause a crash, or it may read data out of some other data structure
assert(arr < p); // this statement may not be true
// (arr may be so close to the end of the address space that
// adding 5 overflowed the address space and wrapped around)
assert(p - arr == 5); //this statement may not be true
//the compiler may have assigned p some other value
I'm sure there are many other examples you can throw in here.
Some systems, very rare systems and I can't name one, will cause traps when you increment past boundaries like that. Further, it allows an implementation that provides boundary protection to exist...again though I can't think of one.
Essentially, you shouldn't be doing it and therefore there's no reason to specify what happens when you do. Specifying what happens puts an unwarranted burden on the implementation provider.
The behaviour you are seeing is because of the x86's segment-based memory protection. I find this protection justified: if you increment a pointer and store the result, then at some later point in your code you will presumably dereference that pointer and use the value. The compiler wants to avoid situations where you end up changing someone else's memory location, or deleting memory that is owned by some other part of your code. To avoid such scenarios, the compiler imposes the restriction.
In addition to hardware issues, another factor was the emergence of implementations which attempted to trap on various kinds of programming errors. Although many such implementations could be most useful if configured to trap on constructs which a program is known not to use, even though they are defined by the C Standard, the authors of the Standard did not want to define the behavior of constructs which would--in many programming fields--be symptomatic of errors.
In many cases, it will be much easier to trap on actions which use pointer arithmetic to compute address of unintended objects than to somehow record the fact that the pointers cannot be used to access the storage they identify, but could be modified so that they could access other storage. Except in the case of arrays within larger (two-dimensional) arrays, an implementation would be allowed to reserve space that's "just past" the end of every object. Given something like doSomethingWithItem(someArray+i);, an implementation could trap any attempt to pass any address which doesn't point to either an element of the array or the space just past the last element. If the allocation of someArray reserved space for an extra unused element, and doSomethingWithItem() only accesses the item to which it receives a pointer, the implementation could relatively inexpensively ensure that any non-trapped execution of the above code could--at worst--access otherwise-unused storage.
The ability to compute "just-past" addresses makes bounds checking more difficult than it otherwise would be (the most common erroneous situation would be passing doSomethingWithItem() a pointer just past the end of the array, but behavior would be defined unless doSomethingWithItem tried to dereference that pointer, something the caller may be unable to prove). Because the Standard would allow compilers to reserve space just past the array in most cases, however, such allowance would allow implementations to limit the damage caused by untrapped errors, something that would likely not be practical if more generalized pointer arithmetic were allowed.