Why in C++ recursion the addresses of stack variables grow backwards? [duplicate] - c++

I am preparing some training materials in C and I want my examples to fit the typical stack model.
What direction does a C stack grow in Linux, Windows, Mac OSX (PPC and x86), Solaris, and most recent Unixes?

Stack growth doesn't usually depend on the operating system itself, but on the processor it's running on. Solaris, for example, runs on x86 and SPARC. Mac OSX (as you mentioned) runs on PPC and x86. Linux runs on everything from my big honkin' System z at work to a puny little wristwatch.
If the CPU provides any kind of choice, the ABI / calling convention used by the OS specifies which choice you need to make if you want your code to call everyone else's code.
The processors and their direction are:
x86: down.
SPARC: selectable. The standard ABI uses down.
PPC: down, I think.
System z: in a linked list, I kid you not (but still down, at least for zLinux).
ARM: selectable, but Thumb2 has compact encodings only for down (LDMIA = increment after, STMDB = decrement before).
6502: down (but only 256 bytes).
RCA 1802A: any way you want, subject to SCRT implementation.
PDP11: down.
8051: up.
Showing my age on those last few, the 1802 was the chip used to control the early shuttles (sensing if the doors were open, I suspect, based on the processing power it had :-) and my second computer, the COMX-35 (following my ZX80).
PDP11 details gleaned from here, 8051 details from here.
The SPARC architecture uses a sliding window register model. The architecturally visible details also include a circular buffer of register-windows that are valid and cached internally, with traps when that over/underflows. See here for details. As the SPARCv8 manual explains, SAVE and RESTORE instructions are like ADD instructions plus register-window rotation. Using a positive constant instead of the usual negative would give an upward-growing stack.
The aforementioned SCRT technique is another - the 1802 used some of its sixteen 16-bit registers for SCRT (standard call and return technique). One was the program counter; you could use any register as the PC with the SEP Rn instruction. One was the stack pointer, and two were always set to point to the SCRT code addresses, one for call, one for return. No register was treated in a special way. Keep in mind these details are from memory; they may not be totally correct.
For example, if R3 was the PC, R4 was the SCRT call address, R5 was the SCRT return address and R2 was the "stack" (quotes as it's implemented in software), SEP R4 would set R4 to be the PC and start running the SCRT call code.
It would then store R3 on the R2 "stack" (I think R6 was used for temp storage), adjusting it up or down, grab the two bytes following R3, load them into R3, then do SEP R3 and be running at the new address.
To return, it would SEP R5 which would pull the old address off the R2 stack, add two to it (to skip the address bytes of the call), load it into R3 and SEP R3 to start running the previous code.
Very hard to wrap your head around initially after all the 6502/6809/z80 stack-based code, but still elegant in a bang-your-head-against-the-wall sort of way. Also, one of the big selling features of the chip was a full suite of 16 16-bit registers, despite the fact that you immediately lost 7 of those (5 for SCRT, 2 for DMA and interrupts, as I recall). Ahh, the triumph of marketing over reality :-)
System z is actually quite similar, using its R14 and R15 registers for call/return.

In C++ (adaptable to C), stack.cc:
/* note: an inlining or tail-call-optimizing compiler can defeat this check,
   so compile it without optimization */
static int
find_stack_direction ()
{
    static char *addr = 0;               /* address of the dummy from the first call */
    char dummy;                          /* a fresh automatic variable in this frame */
    if (addr == 0)
    {
        addr = &dummy;
        return find_stack_direction ();  /* recurse so the next dummy sits one frame deeper */
    }
    else
    {
        return (&dummy > addr) ? 1 : -1; /* 1 means the stack grows up, -1 means down */
    }
}

The advantage of growing down is that in older systems the stack was typically placed at the top of memory. Programs typically filled memory starting from the bottom, so this sort of memory management minimized the need to measure and place the bottom of the stack somewhere sensible.

Just a small addition to the other answers, which as far as I can see have not touched this point:
Having the stack grow downwards makes all addresses within the stack have a positive offset relative to the stack pointer. There's no need for negative offsets, as they would only point to unused stack space. This simplifies accessing stack locations when the processor supports stack-pointer-relative addressing.
Many processors have instructions that allow accesses with a positive-only offset relative to some register. Those include many modern architectures, as well as some old ones. For example, the ARM Thumb ABI provides for stack-pointer-relative accesses with a positive offset encoded within a single 16-bit instruction word.
If the stack grew upwards, all useful offsets relative to the stack pointer would be negative, which is less intuitive and less convenient. It is also at odds with other applications of register-relative addressing, for example accessing the fields of a struct.

Stack grows down on x86 (defined by the architecture, pop increments stack pointer, push decrements.)

In MIPS and many modern RISC architectures (like PowerPC, RISC-V, SPARC...) there are no push and pop instructions. Those operations are done explicitly by manually adjusting the stack pointer and then loading/storing values relative to the adjusted pointer. All registers (except the zero register) are general purpose, so in theory any register can be a stack pointer, and the stack can grow in any direction the programmer wants.
That said, the stack typically grows down on most architectures, probably to avoid the case where the stack and the program data or heap data grow towards each other and clash. There are also the addressing reasons mentioned in sh-'s answer. Some examples: the MIPS ABIs grow the stack downwards and use $29 (a.k.a. $sp) as the stack pointer; the RISC-V ABI also grows it downwards and uses x2 as the stack pointer.
In the Intel 8051 the stack grows up, probably because the memory space is so tiny (128 bytes in the original version) that there is no heap, so there's no need to put the stack at the top to keep it away from a heap growing up from the bottom.
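To make the "no dedicated push/pop, just adjust a pointer" idea concrete, here is a minimal C sketch of a software-managed, downward-growing stack (entirely my own illustration; no overflow checks, for brevity):
#include <stdio.h>

#define STACK_WORDS 64

static int  storage[STACK_WORDS];
static int *sp = storage + STACK_WORDS;   /* start at the high end of the buffer */

/* downward-growing: decrement first, then store (like a pre-decrement PUSH) */
static void push(int v) { *--sp = v; }

/* read, then increment (like a post-increment POP) */
static int pop(void) { return *sp++; }

int main(void) {
    push(1); push(2); push(3);
    printf("%d\n", pop());   /* 3 */
    printf("%d\n", pop());   /* 2 */
    printf("%d\n", pop());   /* 1 */
    return 0;
}
Flipping the two adjustments (increment in push, decrement in pop) gives an upward-growing stack with no other changes, which is the whole point: on these architectures the direction is a convention, not a hardware requirement.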
You can find more information about the stack usage in various architectures in https://en.wikipedia.org/wiki/Calling_convention
See also
Why does the stack grow downward?
What are the advantages to having the stack grow downward?
Why do stacks typically grow downwards?
Does stack grow upward or downward?

On most systems the stack grows down, and my article at https://gist.github.com/cpq/8598782 explains WHY it grows down. It is simple: how do you lay out two growing memory blocks (heap and stack) in a fixed chunk of memory? The best solution is to put them at opposite ends and let them grow towards each other.

It grows down because the memory allocated to the program has the "permanent data" i.e. code for the program itself at the bottom, then the heap in the middle. You need another fixed point from which to reference the stack, so that leaves you the top. This means the stack grows down, until it is potentially adjacent to objects on the heap.

This macro should detect it at runtime without UB:
#include <stdint.h>

/* the macro passes the address of a compound literal living in the caller's frame */
#define stk_grows_up_eh() stk_grows_up__(&(char){0})

_Bool stk_grows_up__(char *ParentsLocal);     /* prototype, e.g. for a header */

__attribute__((noinline))                     /* keep a real call so the two addresses
                                                 sit in different stack frames */
_Bool stk_grows_up__(char *ParentsLocal) {
    return (uintptr_t)ParentsLocal < (uintptr_t)&ParentsLocal;
}
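And a small driver (my addition) showing how the macro might be called:
#include <stdio.h>

int main(void) {
    printf("stack grows %s\n", stk_grows_up_eh() ? "up" : "down");
    return 0;
}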

Related

Why stack get backwards? [duplicate]

I know that in the architectures I'm personally familiar with (x86, 6502, etc), the stack typically grows downwards (i.e. every item pushed onto the stack results in a decremented SP, not an incremented one).
I'm wondering about the historical rationale for this. I know that in a unified address space, it's convenient to start the stack on the opposite end of the data segment (say) so there's only a problem if the two sides collide in the middle. But why does the stack traditionally get the top part? Especially given how this is the opposite of the "conceptual" model?
(And note that in the 6502 architecture, the stack also grows downwards, even though it is bounded to a single 256-byte page, and this direction choice seems arbitrary.)
As to the historic rationale, I can't say for certain (because I didn't design them). My thoughts on the matter are that early CPUs got their original program counter set to 0 and it was a natural desire to start the stack at the other end and grow downwards, since their code naturally grows upward.
As an aside, note that this setting of the program counter to 0 on reset is not the case for all early CPUs. For example, the Motorola 6809 would fetch the program counter from addresses 0xfffe/f so you could start running at an arbitrary location, depending on what was supplied at that address (usually, but by no means limited to, ROM).
One of the first things some historical systems would do was to scan memory from the top until it found a location that would read back the same value written, so that it would know the actual RAM installed (e.g., a Z80 with a 64K address space didn't necessarily have 64K of RAM; in fact 64K would have been massive in my early days). Once it found the top actual address, it would set the stack pointer appropriately and could then start calling subroutines. This scanning would generally be done by the CPU running code in ROM as part of start-up.
With regard to the stacks growth, not all of them grow downwards, see this answer for details.
One good explanation I heard was that some machines in the past could only have unsigned offsets, so you'd want the stack to grow downward so you could hit your locals without having to lose an extra instruction faking a negative offset.
Stanley Mazor (4004 and 8080 architect) explains how stack growth direction was chosen for 8080 (and eventually for 8086) in "Intel Microprocessors: 8008 to 8086":
The stack pointer was chosen to run "downhill" (with the stack advancing toward lower memory) to simplify indexing into the stack from the user's program (positive indexing) and to simplify displaying the contents of the stack from a front panel.
One possible reason might be that it simplifies alignment. If you place a local variable on the stack which must be placed on a 4-byte boundary, you can simply subtract the size of the object from the stack pointer, and then zero out the two lower bits to get a properly aligned address. If the stack grows upwards, ensuring alignment becomes a bit trickier.
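A small sketch of that arithmetic (the concrete numbers are mine, purely for illustration):
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uintptr_t sp   = 0x1000;     /* pretend stack pointer */
    size_t    need = 7;          /* object that must end up 4-byte aligned */
    sp -= need;                  /* 0x0FF9: space reserved by moving down */
    sp &= ~(uintptr_t)3;         /* 0x0FF8: clear the low two bits; rounding down
                                    just takes a little more of the free space below */
    printf("object placed at %#lx\n", (unsigned long)sp);
    return 0;
}
On an upward-growing stack the same rounding would have to go up past the reserved space, so the adjustment cannot be folded into a single mask.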
IIRC the stack grows downwards because the heap grows upwards. It could have been the other way around.
I believe it's purely a design decision. Not all of them grow downward -- see this SO thread for some good discussion on the direction of stack growth on different architectures.
I'm not certain, but I did some programming for the VAX/VMS back in the day. I seem to remember one part of memory (the heap??) going up and the stack going down. When the two met, you were out of memory.
I believe the convention began with the IBM 704 and its infamous "decrement register". Modern speech would call it an offset field of the instruction, but the point is they went down, not up.
Just 2c more:
Beyond all the historic rationale mentioned, I'm quite certain there's no reason which is valid in modern processors. All processors can take signed offsets, and maximizing the heap/stack distance is rather moot ever since we started dealing with multiple threads.
I personally consider this a security design flaw. If, say, the designers of the x64 architecture had reversed the stack growth direction, most stack buffer overflows would have been eliminated, since strings grow upward - which is kind of a big deal.
Because then a POP uses the same addressing mode that is commonly used to scan through strings and arrays
An instruction that pops a value off of a stack needs to do two things: read the value out of memory, and adjust the stack pointer. There are four possible design choices for this operation:
Preincrement the stack pointer first, then read the value. This implies that the stack will grow "downwards" (towards lower memory addresses).
Predecrement the stack pointer first, then read the value. This implies that the stack will grow "upwards" (towards higher memory addresses).
Read the value first, then postincrement the stack pointer. This implies that the stack will grow downwards.
Read the value first, then postdecrement the stack pointer. This implies that the stack will grow upwards.
In many computer languages (particularly C), strings and arrays are passed to functions as pointers to their first element. A very common operation is to read the elements of the string or array in order, starting with the first element. Such an operation needs only the postincrement addressing mode described above.
Furthermore, reading the elements of a string or array is more common than writing the elements. Indeed, there are many standard library functions that perform no writing at all (e.g. strlen(), strchr(), strcmp())!
Therefore, if you have a limited number of addressing modes in your instruction set design, the most useful addressing mode would be a read that postincrements. This results in not only the most useful string and array operations, but also a POP instruction that grows the stack downward.
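In C, that read-then-advance pattern is exactly the idiom that maps onto a postincrement load; a tiny illustrative example (mine, not from the answer):
#include <stddef.h>

/* each iteration reads a byte and then advances the pointer: the
 * postincrement-load pattern in C clothing */
size_t my_strlen(const char *p) {
    const char *start = p;
    while (*p++ != '\0')
        ;
    return (size_t)(p - start) - 1;   /* p ended one past the terminating NUL */
}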
The second-most-useful addressing mode would then be a post-decrement write, which can be used for the matching PUSH instruction.
Indeed, the PDP-11 had postincrement and predecrement addressing modes, which produced a downward-growing stack. Even the VAX did not have preincrement or postdecrement.
One advantage of descending stack growth in a minimal embedded system is that a single chunk of RAM can be redundantly mapped into both page 0 and page 1, allowing zero-page variables to be assigned starting at 0x000 and the stack growing downwards from 0x1FF, maximizing the amount it would have to grow before overwriting variables.
One of the original design goals of the 6502 was that it could be combined with, for example, a 6530, resulting in a two-chip microcontroller system with 1 KB of program ROM, timer, I/O, and 64 bytes of RAM shared between stack and page zero variables. By comparison, the minimal embedded system of that time based on an 8080 or 6800 would be four or five chips.

How to safely implement "Using Uninitialized Memory For Fun And Profit"?

I would like to build a dense integer set in C++ using the trick described at https://research.swtch.com/sparse . This approach achieves good performance by allowing itself to read uninitialized memory.
How can I implement this data structure without triggering undefined behavior, and without running afoul of tools like Valgrind or ASan?
Edit: It seems like responders are focusing on the word "uninitialized" and interpreting it in the context of the language standard. That was probably a poor word choice on my part - here "uninitialized" means only that its value is not important to the correct functioning of the algorithm. It's obviously possible to implement this data structure safely (LLVM does it in SparseMultiSet). My question is what is the best and most performant way to do so?
I can see four basic approaches you can take. These are applicable not only to C++ but also most other low-level languages like C that make uninitialized access possible but not allowed, and the last is applicable even to higher-level "safe" languages.
Ignore the standard, implement it in the usual way
This is the one crazy trick language lawyers hate! Don't freak out yet though - the solutions following this one won't break the rules, so just skip this part if you are of the rules-stickler variety.
The standard makes most uses of uninitialized values undefined, and the few loopholes it does allow (e.g., copying one undefined value to another) don't really give you enough rope to actually implement what you want - even in C, which is slightly less restrictive (see for example this answer covering C11, which explains that while accessing an indeterminate value may not directly trigger UB, anything that results from it is also indeterminate, and indeed the value may appear to change from access to access).
So you just implement it anyway, with the knowledge that most or all current compilers will just compile it to the expected code, and know that your code is not standards compliant.
At least in my tests, none of gcc, clang and icc took advantage of the illegal access to do anything crazy. Of course, the test is not comprehensive, and even if you could construct one, the behavior could change in a new version of the compiler.
You would be safest if the implementation of the methods that access uninitialized memory was compiled, once, in a separate compilation unit - this makes it easy to check that it does the right thing (just check the assembly once) and makes it nearly impossible (outside of LTCG) for the compiler to do anything tricky, since it can't prove whether uninitialized values are being accessed.
Still, this approach is theoretically unsafe and you should check the compiled output very carefully and have additional safeguards in place if you take it.
If you take this approach, tools like valgrind are fairly likely to report an uninitialized read error.
Now these tools work at the assembly level, and some uninitialized reads may be fine (see, for example, the next item on fast standard library implementations), so they don't actually report an uninitialized read immediately, but rather have a variety of heuristics to determine if invalid values are actually used. For example, they may avoid reporting an error until they determine the uninitialized value is used to determine the direction of a conditional jump, or some other action that is not trackable/recoverable according to the heuristic. You may be able to get the compiler to emit code that reads uninitialized memory but is safe according to this heuristic.
More likely, you won't be able to do that (since the logic here is fairly subtle as it relies on the relationship between the values in two arrays), so you can use the suppression options in your tools of choice to hide the errors. For example, valgrind can suppress based on the stack trace - and in fact there are already many such suppression entries used by default to hide false-positives in various standard libraries.
Since it works based on stack traces, you'll probably have difficulties if the reads occur in inlined code, since the top-of-stack will then be different for every call-site. You could avoid this by making sure the function is not inlined.
Use assembly
What is ill-defined in the standard, is usually well-defined at the assembly level. It is why the compiler and standard library can often implement things in a faster way than you could achieve with C or C++: a libc routine written in assembly is already targeting a specific architecture and doesn't have to worry about the various caveats in the language specification that are there to make things run fast on a variety of hardware.
Normally, implementing any serious amount of code in assembly is a costly undertaking, but here it is only a handful of methods, so it may be feasible depending on how many platforms you are targeting. You don't even really need to write the methods yourself - just compile the C++ version (or use godbolt) and copy the assembly. The is_member function, for example1, looks like:
sparse_array::is_member(unsigned long):
        mov     rax, QWORD PTR [rdi+16]
        mov     rdx, QWORD PTR [rax+rsi*8]
        xor     eax, eax
        cmp     rdx, QWORD PTR [rdi]
        jnb     .L1
        mov     rax, QWORD PTR [rdi+8]
        cmp     QWORD PTR [rax+rdx*8], rsi
        sete    al
.L1:
        ret
Rely on calloc magic
If you use calloc2, you explicitly request zeroed memory from the underlying allocator. Now a correct version of calloc may simply call malloc and then zero out the returned memory, but actual implementations rely on the fact that the OS-level memory allocation routines (sbrk and mmap, pretty much) will generally return you zeroed memory on any OS with protected memory (i.e., all the big ones), to avoid zeroing out the memory again.
As a practical matter, for large allocations this is typically satisfied by implementing a call like anonymous mmap by mapping a special page of all zeros. Only when (if ever) the memory is written does copy-on-write actually allocate a new page. So the allocation of large, zeroed memory regions can be essentially free, since the OS already has to hand out zeroed pages.
In that case, implementing your sparse set on top of calloc could be just as fast as the nominally uninitialized version, while being safe and standards compliant.
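As a concrete illustration, a minimal sketch of a calloc-backed version (the struct layout and the ss_ names are my own; error handling, growth and a destructor are omitted):
#include <stdlib.h>

typedef struct {
    size_t  n;        /* number of members currently in the set */
    size_t *dense;    /* dense[0..n) lists the members */
    size_t *sparse;   /* sparse[v] = index of v in dense[], if v is a member */
} ss_set;

static int ss_init(ss_set *s, size_t max_elem) {
    s->n      = 0;
    s->dense  = malloc(max_elem * sizeof *s->dense);
    /* calloc here: sparse[] may be read before it is ever written, so ask the
     * allocator for zeroed (and therefore defined) memory */
    s->sparse = calloc(max_elem, sizeof *s->sparse);
    return (s->dense && s->sparse) ? 0 : -1;
}

static int ss_is_member(const ss_set *s, size_t i) {
    return s->sparse[i] < s->n && s->dense[s->sparse[i]] == i;
}

static void ss_add(ss_set *s, size_t i) {
    if (!ss_is_member(s, i)) {
        s->dense[s->n] = i;
        s->sparse[i]   = s->n;
        s->n++;
    }
}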
Calloc Caveats
You should of course test to ensure that calloc is behaving as expected. The optimized behavior is usually only going to occur when your program initializes a lot of long-lived zeroed memory approximately "up-front". That is, the typical logic for an optimized calloc is something like this:
calloc(N):
    if (a request for N bytes can be satisfied from already-allocated-then-freed memory)
        memset those bytes to zero and return them
    else
        ask the OS for fresh memory and return it directly, because it is already zeroed
Basically, the malloc infrastructure (which also underlies new and friends) has a (possibly empty) pool of memory that it has already requested from the OS and generally tries to allocate from there first. This pool is composed of memory from the last block requested from the OS but not yet handed out (e.g., because the user requested 32 bytes but the allocator asks for chunks from the OS in 1 MB blocks, so there is a lot left over), and also of memory that was handed out to the process but subsequently returned via free or delete or whatever. The memory in that pool has arbitrary values, and if a calloc can be satisfied from that pool, you don't get your magic, since the zero-init has to occur.
On the other hand, if the memory has to be allocated from the OS, you get the magic. So it depends on your use case: if you are frequently creating and destroying sparse_set objects, you will generally just be drawing from the internal malloc pools and will pay the zeroing costs. If you have long-lived sparse_set objects which take up a lot of memory, they were likely allocated by asking the OS and you got the zeroing nearly for free.
The good news is that if you don't want to rely on the calloc behavior above (indeed, on your OS or with your allocator it may not even be optimized in that way), you could usually replicate the behavior by mapping in /dev/zero manually for your allocations. On OSes that offer it, this guarantees that you get the "cheap" behavior.
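A sketch of what that might look like on a POSIX system (my own illustration; the function name is invented):
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* map `bytes` of guaranteed-zero memory by mapping /dev/zero; on modern
 * systems MAP_ANONYMOUS (no file at all) does the same thing */
static void *map_zeroed(size_t bytes) {
    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                    /* the mapping survives closing the descriptor */
    return p == MAP_FAILED ? NULL : p;
}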
Use Lazy Initialization
For a solution that is totally platform agnostic you could simply use yet another array which tracks the initialization state of the array.
First you choose some granule at which you will track initialization, and use a bitmap where each bit tracks the initialization state of one granule of the sparse array.
For example, let's say you choose your granule to be 4 elements, and the size of the elements in your array is 4 bytes (e.g., int32_t values): you need 1 bit to track every 4 elements * 4 bytes/element * 8 bits/byte = 128 bits, which is an overhead of less than 1%3 in allocated memory.
Now you simply check the corresponding bit in this array before accessing sparse. This adds some small cost to accessing the sparse array, but doesn't change the overall complexity, and the check is still quite fast.
For example, your is_member function now looks like:
bool sparse_set::is_member(size_t i) {
    bool init = is_init[i >> INIT_SHIFT] & (1UL << (i & INIT_MASK));
    return init && sparse[i] < n && dense[sparse[i]] == i;
}
The generated assembly on x86 (gcc) now starts with:
        mov     rax, QWORD PTR [rdi+24]
        mov     rdx, rsi
        shr     rdx, 8
        mov     rdx, QWORD PTR [rax+rdx*8]
        xor     eax, eax
        bt      rdx, rsi
        jnc     .L2
        ...
.L2:
        ret
That's all associated with the bitmap check. It's all going to be pretty quick (and often off the critical path since it isn't part of the data flow).
In general, the cost of this approach depends on the density of your set and what functions you are calling - is_member is about the worst case for this approach, since some functions (e.g., clear) aren't affected at all, and others (e.g., iterate) can batch up the is_init check and only do it once every INIT_COVERAGE elements (meaning the overhead would again be ~1% for the example values).
Sometimes this approach will be faster than the approach suggested in the OP's link, especially when handling elements not in the set - in this case, the is_init check will fail and often shortcut the remaining code, and in this case you have a working set that is much smaller (256 times, using the example granule size) than the size of the sparse array, so you may greatly reduce misses to DRAM or outer cache levels.
The granule size itself is an important tunable for this approach. Intuitively, a larger granule size pays a larger initialization cost when an element covered by the granule is accessed for the first time, but saves on memory and up-front is_init initialization cost. You can come up with a formula that finds the optimum size in the simple case - but the behavior also depends on the "clustering" of values and other factors. Finally, it is totally reasonable to use a dynamic granule size to cover your bases under varying workloads - but it comes at the cost of variable shifts.
Really Lazy Solution
It is worth noting that there is a similarity between the calloc and lazy-init solutions: both lazily initialize blocks of memory as they are needed, but the calloc solution tracks this implicitly in hardware through MMU magic with page tables and TLB entries, while the lazy-init solution does it in software, with a bitmap explicitly tracking which granules have been initialized.
The hardware approach has the advantage of being nearly free (for the "hit" case, anyway), since it uses the always-present virtual memory support in the CPU to detect misses, but the software case has the advantage of being portable and allowing precise control over the granule size, etc.
You can actually combine these approaches, to make a lazy approach that doesn't use a bitmap, and doesn't even need the dense array at all: just allocate your sparse array with mmap as PROT_NONE, so you fault whenever you read from an unallocated page in the sparse array. You catch the fault and allocate the page in the sparse array, filling it with a sentinel value indicating "not present" for every element.
This is the fastest of all for the "hot" case: you don't need any of the ... && dense[sparse[i]] == i checks any more.
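Here is a rough Linux-flavoured sketch of that fault-handling plumbing (my own illustration, not code from the linked article; the names are invented, and a production version would need more care about signal-safety and error checking):
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NOT_PRESENT ((size_t)-1)      /* sentinel meaning "element absent" */

static void  *set_base;               /* start of the PROT_NONE sparse array */
static size_t set_bytes;              /* its size in bytes */
static long   page_sz;

static void on_fault(int sig, siginfo_t *si, void *uctx) {
    (void)sig; (void)uctx;
    uintptr_t a = (uintptr_t)si->si_addr, b = (uintptr_t)set_base;
    if (a < b || a >= b + set_bytes)
        _exit(128 + SIGSEGV);         /* a genuine crash, not one of our lazy pages */
    void *page = (void *)(a & ~((uintptr_t)page_sz - 1));
    /* make the page real, then fill it with the sentinel; note that mprotect is
     * not on the official async-signal-safe list, but this is the usual shape */
    mprotect(page, (size_t)page_sz, PROT_READ | PROT_WRITE);
    size_t *p = (size_t *)page;
    for (size_t k = 0; k < (size_t)page_sz / sizeof *p; ++k)
        p[k] = NOT_PRESENT;
}

static void *lazy_sparse_alloc(size_t bytes) {
    page_sz   = sysconf(_SC_PAGESIZE);
    set_bytes = bytes;
    set_base  = mmap(NULL, bytes, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_flags     = SA_SIGINFO;
    sa.sa_sigaction = on_fault;
    sigaction(SIGSEGV, &sa, NULL);    /* the faulting read re-runs after the handler returns */
    return set_base;
}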
The downsides are:
Your code is really not portable since you need to implement the fault-handling logic, which is usually platform specific.
You can't control the granule size: it must be at page granularity (or some multiple thereof). If your set is very sparse (say less than 1 out of 4096 elements occupied) and uniformly distributed, you end up paying a high initialization cost since you need to handle a fault and initialize a full page of values for every element.
Misses (i.e., non-insert accesses to set elements that don't exist) either need to allocate a page even if no elements will exist in that range, or will be very slow (incurring a fault) every time.
1 This implementation has no "range checking" - i.e., it doesn't check if i is greater than MAX_ELEM - depending on your use case you may want to check this. My implementation used a template parameter for MAX_ELEM, which may result in slightly faster code, but also more bloat, and you'd do fine to just make the max size a class member as well.
2 Really, the only requirement is that you use something that calls calloc under the covers or performs the equivalent zero-fill optimization, but based on my tests more idiomatic C++ approaches like new int[size]() just do the allocation followed by a memset. gcc does optimize malloc followed by memset into calloc, but that's not useful if you are trying to avoid the use of C routines anyway!
3 Precisely, you need 1 extra bit to track every 128 bits of the sparse array.
If we reword your question:
What code reads from uninitialized memory without tripping tools designed to catch reads from uninitialized memory?
Then the answer becomes clear -- it is not possible. Any way of doing this that you could find represents a bug in Valgrind that would be fixed.
Maybe it's possible to get the same performance without UB, but the restrictions you put on your question "I would like to... use the trick... allowing itself to read uninitialized memory" guarantee UB. Any competing method avoiding UB will not be using the trick that you so love.
Valgrind does not complain if you just read uninitialised memory.
Valgrind will complain if you use this data in a way that influences the visible behaviour of the program, e.g. using this data as input in a syscall, or doing a jump based on this data, or using this data to compute another address. See http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.uninitvals for more info.
So, it might very well be that you will have no problem with Valgrind.
If Valgrind still complains but your algorithm is correct even when using this uninit data, then you can use client requests to declare this memory as initialised.

How to offload memory offset calculation from runtime in C/C++?

I am implementing a simple VM, and currently I am using runtime arithmetic to calculate individual program object addresses as offsets from base pointers.
I asked a couple of questions on the subject today, but I seem to be going slowly nowhere.
I learned a couple of things though, from question one -
Object and struct member access and address offset calculation -
I learned that modern processors have virtual addressing capabilities, allowing memory offsets to be calculated without any additional cycles devoted to arithmetic.
And from question two - Are address offsets resolved during compile time in C/C++? - I learned that there is no guarantee for this happening when doing the offsets manually.
By now it should be clear that what I want to achieve is to take advantage of the virtual memory addressing features of the hardware and offload those from the runtime.
I am using GCC, as for platform - I am developing on x86 in windows, but since it is a VM I'd like to have it efficiently running on all platforms supported by GCC.
So ANY information on the subject is welcome and will be very appreciated.
Thanks in advance!
EDIT: Some overview of my program's code generation - during the design stage the program is built as a tree hierarchy, which is then recursively serialized into one continuous memory block, along with indexing objects and calculating their offsets from the beginning of the program memory block.
EDIT 2: Here is some pseudo code of the VM:
switch (*instruction) {
    case 1: call_fn1(*(instruction+1));
            instruction += 1 + sizeof(parameter1);
            break;
    case 2: call_fn2(*(instruction+1), *(instruction+1+sizeof(parameter1)));
            instruction += 1 + sizeof(parameter1) + sizeof(parameter2);
            break;
    case 3: instruction += *(instruction+1);
            break;
}
Case 1 is a function that takes one parameter, which is found immediately after the instruction, so it is passed as an offset of 1 byte from the instruction. The instruction pointer is incremented by 1 + the size of the first parameter to find the next instruction.
Case 2 is a function that takes two parameters, same as before, first parameter passed as 1 byte offset, second parameter passed as offset of 1 byte plus the size of the first parameter. The instruction pointer is then incremented by the size of the instruction plus sizes of both parameters.
Case 3 is a goto statement, the instruction pointer is incremented by an offset which immediately follows the goto instruction.
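For reference, here is a compilable sketch of that dispatch loop (entirely my own illustration: the opcode values match the pseudocode above, but the fixed-width parameter types, the memcpy reads, and the halt-on-unknown-opcode behaviour are assumptions):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* the fixed-width parameter types are assumptions; the real VM defines its own */
typedef int32_t param1_t;
typedef int32_t param2_t;

static void call_fn1(param1_t a)             { printf("fn1(%d)\n", (int)a); }
static void call_fn2(param1_t a, param2_t b) { printf("fn2(%d, %d)\n", (int)a, (int)b); }

static void run(const uint8_t *instruction) {
    for (;;) {
        switch (*instruction) {
        case 1: {                                   /* call with one parameter */
            param1_t a;
            memcpy(&a, instruction + 1, sizeof a);  /* memcpy avoids unaligned reads */
            call_fn1(a);
            instruction += 1 + sizeof a;
            break;
        }
        case 2: {                                   /* call with two parameters */
            param1_t a; param2_t b;
            memcpy(&a, instruction + 1, sizeof a);
            memcpy(&b, instruction + 1 + sizeof a, sizeof b);
            call_fn2(a, b);
            instruction += 1 + sizeof a + sizeof b;
            break;
        }
        case 3:                                     /* goto: relative jump, with the
                                                       offset stored as one signed byte */
            instruction += (int8_t)instruction[1];
            break;
        default:                                    /* treat anything else as halt */
            return;
        }
    }
}

int main(void) {
    /* bytecode: fn1(7), then halt; assumes a little-endian host for the literal */
    uint8_t code[] = { 1, 7, 0, 0, 0, 0 };
    run(code);
    return 0;
}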
EDIT 3: To my understanding, the OS will provide each process with its own dedicated virtual memory addressing space. If so, does this mean the first address is always ... well zero, so the offset from the first byte of the memory block is actually the very address of this element? If the address space is dedicated to every process, and I know the offset of my program memory block AND the offset of every program object from the first byte of the memory block, then are the object addresses resolved at compile time?
Problem is those offsets are not available during the compilation of the C code, they become known during the "compilation" phase and translation to bytecode. Does this mean there is no way to do object memory address calculation for "free"?
How is this done in Java, for example, where only the virtual machine is compiled to machine code? Does this mean the calculation of object addresses takes a performance penalty because of runtime arithmetic?
Here's an attempt to shed some light on how the linked questions and answers apply to this situation.
The answer to the first question mixes two different things, the first is the addressing modes in X86 instruction and the second is virtual-to-physical address mapping. The first is something that is done by compilers and the second is something that is (typically) set up by the operating system. In your case you should only be worrying about the first.
Instructions in X86 assembly have great flexibility in how they access a memory address. Instructions that read or write memory have the address calculated according to the following formula:
segment + base + index * size + offset
The segment portion of the address is almost always the default DS segment and can usually be ignored. The base portion is given by one of the general purpose registers or the stack pointer. The index part is given by one of the general purpose registers and the size is either 1, 2, 4, or 8. Finally the offset is a constant value embedded in the instruction. Each of these components is optional, but obviously at least one must be given.
This addressing capability is what is generally meant when talking about computing addresses without explicit arithmetic instructions. There is a special instruction that one of the commenters mentioned: LEA which does the address calculation but instead of reading or writing memory, stores the computed address in a register.
For the code you included in the question, it is quite plausible that the compiler would use these addressing modes to avoid explicit arithmetic instructions.
As an example, the current value of the instruction variable could be held in the ESI register. Additionally, each of sizeof(parameter1) and sizeof(parameter2) is a compile-time constant. In the standard X86 calling conventions function arguments are pushed in reverse order (so the first argument is at the top of the stack), so the assembly code might look something like
case1:
        PUSH    [ESI+1]
        CALL    fn1
        ADD     ESP, 4          ; drop arguments from stack
        ADD     ESI, 5
        JMP     end_switch
case2:
        PUSH    [ESI+5]
        PUSH    [ESI+1]
        CALL    fn2
        ADD     ESP, 8          ; drop arguments from stack
        ADD     ESI, 9
        JMP     end_switch
case3:
        ADD     ESI, [ESI+1]    ; instruction += *(instruction+1)
        JMP     end_switch
end_switch:
This is assuming that the size of both parameters is 4 bytes. Of course the actual code is up to the compiler, and it is reasonable to expect that the compiler will output fairly efficient code as long as you ask for some level of optimization.
You have a data item X in the VM, at relative address A, and an instruction that says (for instance) push X, is that right? And you want to be able to execute this instruction without having to add A to the base address of the VM's data area.
I have written a VM that solves this problem by mapping the VM's data area to a fixed Virtual Address. The compiler knows this Virtual Address, and so can adjust A at compile time. Would this solution work for you? Can you change the compiler yourself?
My VM runs on a smart card, and I have complete control over the OS, so it's a very different environment from yours. But Windows does have some facilities for allocating memory at a fixed address -- the VirtualAlloc function, for instance. You might like to try this out. If you do try it out, you might find that Windows is allocating regions that clash with your fixed-address data area, so you will probably have to load by hand any DLLs that you use, after you have allocated the VM's data area.
But there will probably be unforeseen problems to overcome, and it might not be worth the trouble.
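If you want to experiment with that, here is a rough sketch of the VirtualAlloc call (the address 0x40000000 and the function name are just placeholders I picked for illustration):
#include <stddef.h>
#include <windows.h>

/* Try to reserve and commit the VM's data area at a fixed virtual address so the
 * bytecode "compiler" can bake absolute addresses in. 0x40000000 is an arbitrary
 * example, not a recommendation. */
static void *alloc_vm_data_area(size_t bytes) {
    return VirtualAlloc((LPVOID)0x40000000, bytes,
                        MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    /* returns NULL if something (a DLL, the heap, ...) already occupies that range */
}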
Playing with virtual address translation, page tables or TLBs is something that can only be done at the OS kernel level, and is unportable between platforms and processor families. Furthermore hardware address translation on most CPU ISAs is usually supported only at the level of certain page sizes.
To answer my own question, based on the many responses I got.
Turns out what I want to achieve is not really possible in my situation: getting memory address calculations for free is attainable only when specific requirements are met, and it requires compilation to machine-specific instructions.
I am developing a visual, Lego-style drag-and-drop programming environment for educational purposes, which relies on a simple VM to execute the program code. I was hoping to maximize performance, but it is just not possible in my scenario. It is not that big of a deal though, because program elements can also generate their C code equivalent, which can then be compiled conventionally to maximize performance.
Thanks to everyone who responded and clarified a matter that wasn't really clear to me!

what's on the stack when a function is called?

I can only imagine
1) parameters;
2) local variables;
what else?
1) function return address?
2) function name?
It really depends on platform and architecture, but typically:
Function return address
Saved values of caller's CPU registers - most importantly, caller's stack frame pointer value
Variables allocated with alloca().
Sometimes - extra stuff for exception handling, this is VERY platform-dependent.
Sometimes - guard values to detect stack clobbering
Function name is never in the stack, to the best of my knowledge, unless your code places it there.
I think that a picture really is a thousand words.
It depends on the calling convention; for Unix, you typically look up this information in the SYSV ABI (Application Binary Interface).
You may find:
Return address (if the machine is a popular Intel architecture). On more modern architectures, the return address is passed in a register.
Callee-saves registers—these are registers that "belong" to the caller which the callee has chosen to borrow and must therefore save and restore.
Any incoming parameters that could not be passed in registers. In IA-32, no parameters are passed in registers; they all go on the stack. In x86-64, up to six integer and six floating-point parameters can be passed in registers, so it is seldom necessary to use the stack for that purpose.
You may or may not find a saved stack pointer or frame pointer. Most modern calling conventions go without a frame pointer in order to save an extra register. In this design, the size of each frame is known at compile time, so restoring the old stack pointer is just a matter of adding a constant. But it makes it harder to implement alloca().
The older Intel calling conventions use both stack pointer and frame pointer, which burns an extra register, but it simplifies alloca() and also stack unwinding.
Local variables with storage class auto are allocated on the stack.
The stack may contain compiler temporaries that hold values which are "spilled" if the hardware does not provide enough registers to hold all the intermediate results of computations. (This happens if at any point the number of live intermediate results—the ones that will be needed later in the program—exceeds the number of registers available to the compiler for storing intermediate results.)
You may find variables allocated with alloca().
You may find metadata that says which PC ranges are in scope for which exception handlers, or other very platform-dependent exception stuff.
C and C++ do not support garbage collection, but in a language that does, you will often find metadata that identifies where in the stack frame you will find pointers.
Finally, the stack may contain "padding" used to ensure that the stack pointer is aligned on an 8-byte or 16-byte boundary.
Calling conventions are complex beasts, and stack-frame layout is not for the faint of heart!
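If you want to poke at your own platform's frame layout, a small probe like this (my example, using a GCC/Clang builtin) prints the addresses of a parameter, a local, and the saved return address; build it with -O0 and compare the addresses:
#include <stdio.h>

/* prints where a parameter and a local live, plus the saved return address
 * (__builtin_return_address is a GCC/Clang extension); build with -O0 so
 * nothing gets optimized into registers */
static void probe(int arg) {
    int local = 0;
    printf("argument       at %p\n", (void *)&arg);
    printf("local variable at %p\n", (void *)&local);
    printf("return address    %p\n", __builtin_return_address(0));
}

int main(void) {
    probe(42);
    return 0;
}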

What happens when a computer program runs?

I know the general theory but I can't fit in the details.
I know that a program resides in the secondary memory of a computer. Once the program begins execution it is entirely copied to the RAM. Then the processor retrieves a few instructions at a time (it depends on the size of the bus), puts them in registers, and executes them.
I also know that a computer program uses two kinds of memory: stack and heap, which are also part of the primary memory of the computer. The stack is used for non-dynamic memory, and the heap for dynamic memory (for example, everything related to the new operator in C++)
What I can't understand is how those two things connect. At what point is the stack used for the execution of the instructions? Instructions go from the RAM, to the stack, to the registers?
It really depends on the system, but modern OSes with virtual memory tend to load their process images and allocate memory something like this:
+---------+
| stack | function-local variables, return addresses, return values, etc.
| | often grows downward, commonly accessed via "push" and "pop" (but can be
| | accessed randomly, as well; disassemble a program to see)
+---------+
| shared | mapped shared libraries (C libraries, math libs, etc.)
| libs |
+---------+
| hole | unused memory allocated between the heap and stack "chunks", spans the
| | difference between your max and min memory, minus the other totals
+---------+
| heap | dynamic, random-access storage, allocated with 'malloc' and the like.
+---------+
| bss | Uninitialized global variables; must be in read-write memory area
+---------+
| data | data segment, for globals and static variables that are initialized
| | (can further be split up into read-only and read-write areas, with
| | read-only areas being stored elsewhere in ROM on some systems)
+---------+
| text | program code, this is the actual executable code that is running.
+---------+
This is the general process address space on many common virtual-memory systems. The "hole" is the size of your total memory, minus the space taken up by all the other areas; this gives a large amount of space for the heap to grow into. This is also "virtual", meaning it maps to your actual memory through a translation table, and may be actually stored at any location in actual memory. It is done this way to protect one process from accessing another process's memory, and to make each process think it's running on a complete system.
Note that the positions of, e.g., the stack and heap may be in a different order on some systems (see Billy O'Neal's answer below for more details on Win32).
Other systems can be very different. DOS, for instance, ran in real mode, and its memory allocation when running programs looked quite different:
+-----------+ top of memory
| extended | above the high memory area, and up to your total memory; needed drivers to
| | be able to access it.
+-----------+ 0x110000
| high | just over 1MB->1MB+64KB, used by 286s and above.
+-----------+ 0x100000
| upper | upper memory area, from 640kb->1MB, had mapped memory for video devices, the
| | DOS "transient" area, etc. some was often free, and could be used for drivers
+-----------+ 0xA0000
| USER PROC | user process address space, from the end of DOS up to 640KB
+-----------+
|command.com| DOS command interpreter
+-----------+
| DOS | DOS permanent area, kept as small as possible, provided routines for display,
| kernel | *basic* hardware access, etc.
+-----------+ 0x600
| BIOS data | BIOS data area, contained simple hardware descriptions, etc.
+-----------+ 0x400
| interrupt | the interrupt vector table, starting from 0 and going to 1k, contained
| vector | the addresses of routines called when interrupts occurred. e.g.
| table | interrupt 0x21 checked the address at 0x21*4 and far-jumped to that
| | location to service the interrupt.
+-----------+ 0x0
You can see that DOS allowed direct access to the operating system memory, with no protection, which meant that user-space programs could generally directly access or overwrite anything they liked.
In the process address space, however, the programs tended to look similar, only they were described as code segment, data segment, heap, stack segment, etc., and it was mapped a little differently. But most of the general areas were still there.
Upon loading the program and necessary shared libs into memory, and distributing the parts of the program into the right areas, the OS begins executing your process wherever its main method is at, and your program takes over from there, making system calls as necessary when it needs them.
Different systems (embedded, whatever) may have very different architectures, such as stackless systems, Harvard architecture systems (with code and data being kept in separate physical memory), systems which actually keep the BSS in read-only memory (initially set by the programmer), etc. But this is the general gist.
You said:
I also know that a computer program uses two kinds of memory: stack and heap, which are also part of the primary memory of the computer.
"Stack" and "heap" are just abstract concepts, rather than (necessarily) physically distinct "kinds" of memory.
A stack is merely a last-in, first-out data structure. In the x86 architecture, it can actually be addressed randomly by using an offset from the end, but the most common functions are PUSH and POP to add and remove items from it, respectively. It is commonly used for function-local variables (so-called "automatic storage"), function arguments, return addresses, etc. (more below)
A "heap" is just a nickname for a chunk of memory that can be allocated on demand, and is addressed randomly (meaning, you can access any location in it directly). It is commonly used for data structures that you allocate at runtime (in C++, using new and delete, and malloc and friends in C, etc).
The stack and heap, on the x86 architecture, both physically reside in your system memory (RAM), and are mapped through virtual memory allocation into the process address space as described above.
The registers (still on x86), physically reside inside the processor (as opposed to RAM), and are loaded by the processor, from the TEXT area (and can also be loaded from elsewhere in memory or other places depending on the CPU instructions that are actually executed). They are essentially just very small, very fast on-chip memory locations that are used for a number of different purposes.
Register layout is highly dependent on the architecture (in fact, registers, the instruction set, and memory layout/design, are exactly what is meant by "architecture"), and so I won't expand upon it, but recommend you take an assembly language course to understand them better.
Your question:
At what point is the stack used for the execution of the instructions? Instructions go from the RAM, to the stack, to the registers?
The stack (in systems/languages that have and use them) is most often used like this:
int mul( int x, int y ) {
    return x * y;       // this stores the result of MULtiplying the two variables
                        // from the stack into the return value address previously
                        // allocated, then issues a RET, which resets the stack frame
                        // based on the arg list, and returns to the address set by
                        // the CALLer.
}

int main() {
    int x = 2, y = 3;   // these variables are stored on the stack
    mul( x, y );        // this pushes y onto the stack, then x, then a return address,
                        // allocates space on the stack for a return value,
                        // then issues an assembly CALL instruction.
}
Write a simple program like this, and then compile it to assembly (gcc -S foo.c if you have access to GCC), and take a look. The assembly is pretty easy to follow. You can see that the stack is used for function local variables, and for calling functions, storing their arguments and return values. This is also why when you do something like:
f( g( h( i ) ) );
All of these get called in turn. It's literally building up a stack of function calls and their arguments, executing them, and then popping them off as it winds back down (or up ;). However, as mentioned above, the stack (on x86) actually resides in your process memory space (in virtual memory), and so it can be manipulated directly; it's not a separate step during execution (or at least is orthogonal to the process).
FYI, the above is the C calling convention, also used by C++. Other languages/systems may push arguments onto the stack in a different order, and some languages/platforms don't even use stacks, and go about it in different ways.
Also note, these aren't actual lines of C code executing. The compiler has converted them into machine language instructions in your executable. They are then (generally) copied from the TEXT area into the CPU pipeline, then into the CPU registers, and executed from there. [This was incorrect. See Ben Voigt's correction below.]
Sdaz has gotten a remarkable number of upvotes in a very short time, but sadly is perpetuating a misconception about how instructions move through the CPU.
The question asked:
Instructions go from the RAM, to the stack, to the registers?
Sdaz said:
Also note, these aren't actual lines of C code executing. The compiler has converted them into machine language instructions in your executable. They are then (generally) copied from the TEXT area into the CPU pipeline, then into the CPU registers, and executed from there.
But this is wrong. Except for the special case of self-modifying code, instructions never enter the datapath. And they are not, cannot be, executed from the datapath.
The x86 CPU registers are:
General registers:    EAX EBX ECX EDX
Segment registers:    CS DS ES FS GS SS
Index and pointers:   ESI EDI EBP EIP ESP
Indicator:            EFLAGS
There are also some floating-point and SIMD registers, but for the purposes of this discussion we'll classify those as part of the coprocessor and not the CPU. The memory-management unit inside the CPU also has some registers of its own, we'll again treat that as a separate processing unit.
None of these registers are used for executable code. EIP contains the address of the executing instruction, not the instruction itself.
Instructions go through a completely different path in the CPU from data (Harvard architecture). All current machines are Harvard architecture inside the CPU. Most are also Harvard architecture in the cache these days. x86 (your common desktop machine) is Von Neumann architecture in main memory, meaning data and code are intermingled in RAM. That's beside the point, since we're talking about what happens inside the CPU.
The classic sequence taught in computer architecture is fetch-decode-execute. The memory controller looks up the instruction stored at the address held in EIP. The bits of the instruction go through some combinational logic to create all the control signals for the different multiplexers in the processor. And after some cycles, the arithmetic logic unit arrives at a result, which is clocked into the destination. Then the next instruction is fetched.
On a modern processor, things work a little differently. Each incoming instruction is translated into a whole series of microcode instructions. This enables pipelining: the resources used by the first microinstruction aren't needed later, so the processor can begin working on the first microinstruction of the next instruction before the previous one has finished.
To top it off, terminology is slightly confused because register is an electrical engineering term for a collection of D-flipflops. And instructions (or especially microinstructions) may very well be stored temporarily in such a collection of D-flipflops. But this is not what is meant when a computer scientist or software engineer or run-of-the-mill developer uses the term register. They mean the datapath registers as listed above, and these are not used for transporting code.
The names and number of datapath registers vary for other CPU architectures, such as ARM, MIPS, Alpha, PowerPC, but all of them execute instructions without passing them through the ALU.
The exact layout of the memory while a process is executing is completely dependent on the platform which you're using. Consider the following test program:
#include <stdlib.h>
#include <stdio.h>

int main()
{
    int stackValue = 0;
    int *addressOnStack = &stackValue;
    int *addressOnHeap = malloc(sizeof(int));
    if (addressOnStack > addressOnHeap)
    {
        puts("The stack is above the heap.");
    }
    else
    {
        puts("The heap is above the stack.");
    }
}
On Windows NT (and its children), this program is generally going to produce:
The heap is above the stack.
On POSIX boxes, it's going to say:
The stack is above the heap.
The UNIX memory model is quite well explained here by #Sdaz MacSkibbons, so I won't reiterate that here. But that is not the only memory model. The reason POSIX requires this model is the sbrk system call. Basically, on a POSIX box, to get more memory, a process merely tells the Kernel to move the divider between the "hole" and the "heap" further into the "hole" region. There is no way to return memory to the operating system, and the operating system itself does not manage your heap. Your C runtime library has to provide that (via malloc).
This also has implications for the kind of code actually used in POSIX binaries. POSIX boxes (almost universally) use the ELF file format. In this format, the operating system is responsible for communications between libraries in different ELF files. Therefore, all the libraries use position-independent code (That is, the code itself can be loaded into different memory addresses and still operate), and all calls between libraries are passed through a lookup table to find out where control needs to jump for cross library function calls. This adds some overhead and can be exploited if one of the libraries changes the lookup table.
Windows' memory model is different because the kind of code it uses is different. Windows uses the PE file format, which leaves the code in position-dependent format. That is, the code depends on where exactly in virtual memory the code is loaded. There is a flag in the PE spec which tells the OS where exactly in memory the library or executable would like to be mapped when your program runs. If a program or library cannot be loaded at its preferred address, the Windows loader must rebase the library/executable -- basically, it moves the position-dependent code to point at the new positions -- which doesn't require lookup tables and cannot be exploited because there's no lookup table to overwrite. Unfortunately, this requires a very complicated implementation in the Windows loader, and does have considerable startup time overhead if an image needs to be rebased. Large commercial software packages often modify their libraries to start purposely at different addresses to avoid rebasing; Windows itself does this with its own libraries (e.g. ntdll.dll, kernel32.dll, psapi.dll, etc. -- all have different start addresses by default).
On Windows, virtual memory is obtained from the system via a call to VirtualAlloc, and it is returned to the system via VirtualFree (okay, technically VirtualAlloc farms out to NtAllocateVirtualMemory, but that's an implementation detail). (Contrast this to POSIX, where memory cannot be reclaimed.) This process is slow (and IIRC, requires that you allocate in physical page-sized chunks, typically 4 KB or more). Windows also provides its own heap functions (HeapAlloc, HeapFree, etc.) as part of a library known as RtlHeap, which is included as a part of Windows itself, upon which the C runtime (that is, malloc and friends) is typically implemented.
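For illustration, a tiny example (my own, not from the answer) of going through RtlHeap directly rather than through malloc:
#include <stdio.h>
#include <windows.h>

int main(void) {
    HANDLE heap = GetProcessHeap();                  /* the default process heap */
    int *p = HeapAlloc(heap, HEAP_ZERO_MEMORY, 100 * sizeof *p);
    if (p) {
        printf("first element: %d\n", p[0]);         /* zeroed by HEAP_ZERO_MEMORY */
        HeapFree(heap, 0, p);
    }
    return 0;
}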
Windows also has quite a few legacy memory allocation APIs from the days when it had to deal with old 80386s, and these functions are now built on top of RtlHeap. For more information about the various APIs that control memory management in Windows, see this MSDN article: http://msdn.microsoft.com/en-us/library/ms810627 .
Note also that this means that on Windows a single process can (and usually does) have more than one heap. (Typically, each shared library creates its own heap.)
(Most of this information comes from "Secure Coding in C and C++" by Robert Seacord)
The stack
In the X86 architecture the CPU executes operations with registers. The stack is only used for convenience reasons. You can save the contents of your registers to the stack before calling a subroutine or a system function and then load them back to continue your operation where you left off. (You could do it manually without the stack, but it is a frequently used function so it has CPU support.) But you can do pretty much anything without the stack on a PC.
For example an integer multiplication:
MUL BX
Multiplies AX register with BX register. (The result will be in DX and AX, DX containing the higher bits).
Stack based machines (like JAVA VM) use the stack for their basic operations. The above multiplication:
DMUL
This pops two values from the top of the stack, multiplies them, then pushes the result back onto the stack. The stack is essential for this kind of machine.
Some higher-level programming languages (like C and Pascal) use this latter method for passing parameters to functions: the parameters are pushed onto the stack (right to left in C's usual cdecl convention, left to right in Pascal's), popped by the function body, and the return values are passed back. (This is a choice that the compiler manufacturers make and kind of abuses the way the X86 uses the stack.)
The heap
The heap is another concept that exists only in the realm of the compilers. It takes the pain of handling the memory behind your variables away, but it is not a function of the CPU or the OS; it is just a choice of housekeeping the memory block which is given out by the OS. You could do this manually if you wanted to.
Accessing system resources
The operating system has a public interface through which you can access its functions. In DOS, parameters are passed in CPU registers. Windows uses the stack for passing parameters to OS functions (the Windows API).