What happens when a computer program runs?


I know the general theory but I can't fit in the details.
I know that a program resides in the secondary memory of a computer. Once the program begins execution it is entirely copied to the RAM. Then the processor retrieves a few instructions (it depends on the size of the bus) at a time, puts them in registers and executes them.
I also know that a computer program uses two kinds of memory: stack and heap, which are also part of the primary memory of the computer. The stack is used for non-dynamic memory, and the heap for dynamic memory (for example, everything related to the new operator in C++)
What I can't understand is how those two things connect. At what point is the stack used for the execution of the instructions? Instructions go from the RAM, to the stack, to the registers?

It really depends on the system, but modern OSes with virtual memory tend to load their process images and allocate memory something like this:
+---------+
|  stack  |  function-local variables, return addresses, return values, etc.
|         |  often grows downward, commonly accessed via "push" and "pop" (but can be
|         |  accessed randomly, as well; disassemble a program to see)
+---------+
|  shared |  mapped shared libraries (C libraries, math libs, etc.)
|  libs   |
+---------+
|  hole   |  unused memory allocated between the heap and stack "chunks", spans the
|         |  difference between your max and min memory, minus the other totals
+---------+
|  heap   |  dynamic, random-access storage, allocated with 'malloc' and the like.
+---------+
|  bss    |  Uninitialized global variables; must be in read-write memory area
+---------+
|  data   |  data segment, for globals and static variables that are initialized
|         |  (can further be split up into read-only and read-write areas, with
|         |  read-only areas being stored elsewhere in ROM on some systems)
+---------+
|  text   |  program code, this is the actual executable code that is running.
+---------+
This is the general process address space on many common virtual-memory systems. The "hole" is the size of your total memory, minus the space taken up by all the other areas; this gives a large amount of space for the heap to grow into. This is also "virtual", meaning it maps to your actual memory through a translation table, and may be actually stored at any location in actual memory. It is done this way to protect one process from accessing another process's memory, and to make each process think it's running on a complete system.
Note that the positions of, e.g., the stack and heap may be in a different order on some systems (see Billy O'Neal's answer below for more details on Win32).
Other systems can be very different. DOS, for instance, ran in real mode, and its memory allocation when running programs looked quite different:
+-----------+ top of memory
| extended  | above the high memory area, and up to your total memory; needed drivers to
|           | be able to access it.
+-----------+ 0x110000
| high      | just over 1MB->1MB+64KB, used by 286s and above.
+-----------+ 0x100000
| upper     | upper memory area, from 640kb->1MB, had mapped memory for video devices, the
|           | DOS "transient" area, etc. some was often free, and could be used for drivers
+-----------+ 0xA0000
| USER PROC | user process address space, from the end of DOS up to 640KB
+-----------+
|command.com| DOS command interpreter
+-----------+
| DOS       | DOS permanent area, kept as small as possible, provided routines for display,
| kernel    | *basic* hardware access, etc.
+-----------+ 0x600
| BIOS data | BIOS data area, contained simple hardware descriptions, etc.
+-----------+ 0x400
| interrupt | the interrupt vector table, starting from 0 and going to 1k, contained
| vector    | the addresses of routines called when interrupts occurred. e.g.
| table     | interrupt 0x21 checked the address at 0x21*4 and far-jumped to that
|           | location to service the interrupt.
+-----------+ 0x0
You can see that DOS allowed direct access to the operating system memory, with no protection, which meant that user-space programs could generally directly access or overwrite anything they liked.
In the process address space, however, the programs tended to look similar, only they were described as code segment, data segment, heap, stack segment, etc., and it was mapped a little differently. But most of the general areas were still there.
Upon loading the program and necessary shared libs into memory, and distributing the parts of the program into the right areas, the OS begins executing your process at its entry point (which, for a C or C++ program, eventually calls main), and your program takes over from there, making system calls as necessary when it needs them.
Different systems (embedded, whatever) may have very different architectures, such as stackless systems, Harvard architecture systems (with code and data being kept in separate physical memory), systems which actually keep the BSS in read-only memory (initially set by the programmer), etc. But this is the general gist.
You said:
I also know that a computer program uses two kinds of memory: stack and heap, which are also part of the primary memory of the computer.
"Stack" and "heap" are just abstract concepts, rather than (necessarily) physically distinct "kinds" of memory.
A stack is merely a last-in, first-out data structure. In the x86 architecture, it can actually be addressed randomly by using an offset from the end, but the most common functions are PUSH and POP to add and remove items from it, respectively. It is commonly used for function-local variables (so-called "automatic storage"), function arguments, return addresses, etc. (more below)
A "heap" is just a nickname for a chunk of memory that can be allocated on demand, and is addressed randomly (meaning, you can access any location in it directly). It is commonly used for data structures that you allocate at runtime (in C++, using new and delete, and malloc and friends in C, etc).
The stack and heap, on the x86 architecture, both physically reside in your system memory (RAM), and are mapped through virtual memory allocation into the process address space as described above.
The registers (still on x86), physically reside inside the processor (as opposed to RAM), and are loaded by the processor, from the TEXT area (and can also be loaded from elsewhere in memory or other places depending on the CPU instructions that are actually executed). They are essentially just very small, very fast on-chip memory locations that are used for a number of different purposes.
Register layout is highly dependent on the architecture (in fact, registers, the instruction set, and memory layout/design, are exactly what is meant by "architecture"), and so I won't expand upon it, but recommend you take an assembly language course to understand them better.
Your question:
At what point is the stack used for the execution of the instructions? Instructions go from the RAM, to the stack, to the registers?
The stack (in systems/languages that have and use them) is most often used like this:
int mul( int x, int y ) {
    return x * y;       // this stores the result of MULtiplying the two variables
                        // from the stack into the return value address previously
                        // allocated, then issues a RET, which resets the stack frame
                        // based on the arg list, and returns to the address set by
                        // the CALLer.
}
int main() {
    int x = 2, y = 3;   // these variables are stored on the stack
    mul( x, y );        // this pushes y onto the stack, then x, then a return address,
                        // allocates space on the stack for a return value,
                        // then issues an assembly CALL instruction.
}
Write a simple program like this, and then compile it to assembly (gcc -S foo.c if you have access to GCC), and take a look. The assembly is pretty easy to follow. You can see that the stack is used for function local variables, and for calling functions, storing their arguments and return values. This is also why when you do something like:
f( g( h( i ) ) );
All of these get called in turn. It's literally building up a stack of function calls and their arguments, executing them, and then popping them off as it winds back down (or up ;). However, as mentioned above, the stack (on x86) actually resides in your process memory space (in virtual memory), and so it can be manipulated directly; it's not a separate step during execution (or at least is orthogonal to the process).
FYI, the above is the C calling convention, also used by C++. Other languages/systems may push arguments onto the stack in a different order, and some languages/platforms don't even use stacks, and go about it in different ways.
Also note, these aren't actual lines of C code executing. The compiler has converted them into machine language instructions in your executable. They are then (generally) copied from the TEXT area into the CPU pipeline, then into the CPU registers, and executed from there. [This was incorrect. See Ben Voigt's correction below.]

Sdaz has gotten a remarkable number of upvotes in a very short time, but sadly is perpetuating a misconception about how instructions move through the CPU.
The question asked:
Instructions go from the RAM, to the stack, to the registers?
Sdaz said:
Also note, these aren't actual lines of C code executing. The compiler has converted them into machine language instructions in your executable. They are then (generally) copied from the TEXT area into the CPU pipeline, then into the CPU registers, and executed from there.
But this is wrong. Except for the special case of self-modifying code, instructions never enter the datapath. And they are not, cannot be, executed from the datapath.
The x86 CPU registers are:
General registers
EAX EBX ECX EDX
Segment registers
CS DS ES FS GS SS
Index and pointers
ESI EDI EBP EIP ESP
Indicator
EFLAGS
There are also some floating-point and SIMD registers, but for the purposes of this discussion we'll classify those as part of the coprocessor and not the CPU. The memory-management unit inside the CPU also has some registers of its own; we'll again treat that as a separate processing unit.
None of these registers are used for executable code. EIP contains the address of the executing instruction, not the instruction itself.
Instructions go through a completely different path in the CPU from data (Harvard architecture). All current machines are Harvard architecture inside the CPU. Most these days are also Harvard architecture in the cache. x86 machines (your common desktop machine) are Von Neumann architecture in the main memory, meaning data and code are intermingled in RAM. That's beside the point, since we're talking about what happens inside the CPU.
The classic sequence taught in computer architecture is fetch-decode-execute. The memory controller looks up the instruction stored at the address EIP. The bits of the instruction go through some combinational logic to create all the control signals for the different multiplexers in the processor. And after some cycles, the arithmetic logic unit arrives at a result, which is clocked into the destination. Then the next instruction is fetched.
On a modern processor, things work a little differently. Each incoming instruction is translated into a whole series of microcode instructions. This enables pipelining, because the resources used by the first microinstruction aren't needed later, so they can begin working on the first microinstruction from the next instruction.
To top it off, terminology is slightly confused because register is an electrical engineering term for a collection of D-flipflops. And instructions (or especially microinstructions) may very well be stored temporarily in such a collection of D-flipflops. But this is not what is meant when a computer scientist or software engineer or run-of-the-mill developer uses the term register. They mean the datapath registers as listed above, and these are not used for transporting code.
The names and number of datapath registers vary for other CPU architectures, such as ARM, MIPS, Alpha, PowerPC, but all of them execute instructions without passing them through the ALU.

The exact layout of the memory while a process is executing is completely dependent on the platform which you're using. Consider the following test program:
#include <stdlib.h>
#include <stdio.h>
int main()
{
    int stackValue = 0;
    int *addressOnStack = &stackValue;
    int *addressOnHeap = malloc(sizeof(int));
    if (addressOnStack > addressOnHeap)
    {
        puts("The stack is above the heap.");
    }
    else
    {
        puts("The heap is above the stack.");
    }
}
On Windows NT (and its children), this program is generally going to produce:
The heap is above the stack.
On POSIX boxes, it's going to say:
The stack is above the heap.
The UNIX memory model is quite well explained here by @Sdaz MacSkibbons, so I won't reiterate that here. But that is not the only memory model. The reason POSIX requires this model is the sbrk system call. Basically, on a POSIX box, to get more memory, a process merely tells the kernel to move the divider between the "hole" and the "heap" further into the "hole" region. There is no way to return memory to the operating system, and the operating system itself does not manage your heap. Your C runtime library has to provide that (via malloc).
This also has implications for the kind of code actually used in POSIX binaries. POSIX boxes (almost universally) use the ELF file format. In this format, the operating system is responsible for communications between libraries in different ELF files. Therefore, all the libraries use position-independent code (That is, the code itself can be loaded into different memory addresses and still operate), and all calls between libraries are passed through a lookup table to find out where control needs to jump for cross library function calls. This adds some overhead and can be exploited if one of the libraries changes the lookup table.
Windows' memory model is different because the kind of code it uses is different. Windows uses the PE file format, which leaves the code in position-dependent format. That is, the code depends on where exactly in virtual memory it is loaded. There is a flag in the PE spec which tells the OS where exactly in memory the library or executable would like to be mapped when your program runs. If a program or library cannot be loaded at its preferred address, the Windows loader must rebase the library/executable -- basically, it patches the position-dependent code to point at the new positions -- which doesn't require lookup tables and cannot be exploited because there's no lookup table to overwrite. Unfortunately, this requires a very complicated implementation in the Windows loader, and has considerable startup time overhead if an image needs to be rebased. Large commercial software packages often modify their libraries to start purposely at different addresses to avoid rebasing; Windows itself does this with its own libraries (e.g. ntdll.dll, kernel32.dll, psapi.dll, etc. -- all have different start addresses by default).
On Windows, virtual memory is obtained from the system via a call to VirtualAlloc, and it is returned to the system via VirtualFree (okay, technically VirtualAlloc farms out to NtAllocateVirtualMemory, but that's an implementation detail). Contrast this with POSIX, where memory cannot be reclaimed. This process is slow (and IIRC, requires that you allocate in physical-page-sized chunks; typically 4 KB or more). Windows also provides its own heap functions (HeapAlloc, HeapFree, etc.) as part of a library known as RtlHeap, which is included as a part of Windows itself, and upon which the C runtime (that is, malloc and friends) is typically implemented.
Windows also has quite a few legacy memory allocation APIs from the days when it had to deal with old 80386s, and these functions are now built on top of RtlHeap. For more information about the various APIs that control memory management in Windows, see this MSDN article: http://msdn.microsoft.com/en-us/library/ms810627 .
Note also that this means on Windows a single process can (and usually does) have more than one heap. (Typically, each shared library creates its own heap.)
(Most of this information comes from "Secure Coding in C and C++" by Robert Seacord)

The stack
In the x86 architecture the CPU executes operations on registers. The stack is only used for convenience. You can save the contents of your registers to the stack before calling a subroutine or a system function and then load them back to continue where you left off. (You could do it manually without the stack, but it is such a frequently used function that it has CPU support.) But you can do pretty much anything without the stack on a PC.
For example an integer multiplication:
MUL BX
Multiplies AX register with BX register. (The result will be in DX and AX, DX containing the higher bits).
Stack-based machines (like the Java VM) use the stack for their basic operations. The above multiplication:
IMUL
This pops two values from the top of the stack and multiplies them, then pushes the result back onto the stack. The stack is essential for this kind of machine.
Some higher level programming languages (like C and Pascal) use this latter method for passing parameters to functions: the parameters are pushed to the stack in a fixed order (left to right in the Pascal convention, right to left in the usual C convention on x86) and popped by the function body and the return values are pushed back. (This is a choice that the compiler manufacturers make and kind of abuses the way the x86 uses the stack.)
The heap
The heap is another concept that exists only in the realm of compilers. It takes away the pain of handling the memory behind your variables, but it is not a function of the CPU or the OS; it is just a way of housekeeping the memory block which is given out by the OS. You could do this manually if you wanted.
Accessing system resources
The operating system has a public interface through which you can access its functions. In DOS, parameters are passed in registers of the CPU. Windows uses the stack for passing parameters to OS functions (the Windows API).

Related

Why in C++ recursion the addresses of stack variables grow backwards? [duplicate]

I am preparing some training materials in C and I want my examples to fit the typical stack model.
What direction does a C stack grow in Linux, Windows, Mac OSX (PPC and x86), Solaris, and most recent Unixes?

Stack growth doesn't usually depend on the operating system itself, but on the processor it's running on. Solaris, for example, runs on x86 and SPARC. Mac OSX (as you mentioned) runs on PPC and x86. Linux runs on everything from my big honkin' System z at work to a puny little wristwatch.
If the CPU provides any kind of choice, the ABI / calling convention used by the OS specifies which choice you need to make if you want your code to call everyone else's code.
The processors and their direction are:
x86: down.
SPARC: selectable. The standard ABI uses down.
PPC: down, I think.
System z: in a linked list, I kid you not (but still down, at least for zLinux).
ARM: selectable, but Thumb2 has compact encodings only for down (LDMIA = increment after, STMDB = decrement before).
6502: down (but only 256 bytes).
RCA 1802A: any way you want, subject to SCRT implementation.
PDP11: down.
8051: up.
Showing my age on those last few, the 1802 was the chip used to control the early shuttles (sensing if the doors were open, I suspect, based on the processing power it had :-) and my second computer, the COMX-35 (following my ZX80).
PDP11 details gleaned from here, 8051 details from here.
The SPARC architecture uses a sliding window register model. The architecturally visible details also include a circular buffer of register-windows that are valid and cached internally, with traps when that over/underflows. See here for details. As the SPARCv8 manual explains, SAVE and RESTORE instructions are like ADD instructions plus register-window rotation. Using a positive constant instead of the usual negative would give an upward-growing stack.
The afore-mentioned SCRT technique is another - the 1802 used some of its sixteen 16-bit registers for SCRT (standard call and return technique). One was the program counter; you could use any register as the PC with the SEP Rn instruction. One was the stack pointer, and two were always set to point to the SCRT code address, one for call, one for return. No register was treated in a special way. Keep in mind these details are from memory; they may not be totally correct.
For example, if R3 was the PC, R4 was the SCRT call address, R5 was the SCRT return address and R2 was the "stack" (quotes as it's implemented in software), SEP R4 would set R4 to be the PC and start running the SCRT call code.
It would then store R3 on the R2 "stack" (I think R6 was used for temp storage), adjusting it up or down, grab the two bytes following R3, load them into R3, then do SEP R3 and be running at the new address.
To return, it would SEP R5 which would pull the old address off the R2 stack, add two to it (to skip the address bytes of the call), load it into R3 and SEP R3 to start running the previous code.
Very hard to wrap your head around initially after all the 6502/6809/z80 stack-based code but still elegant in a bang-your-head-against-the-wall sort of way. Also one of the big selling features of the chip was a full suite of 16 16-bit registers, despite the fact you immediately lost 7 of those (5 for SCRT, two for DMA and interrupts from memory). Ahh, the triumph of marketing over reality :-)
System z is actually quite similar, using its R14 and R15 registers for call/return.

In C++ (adaptable to C) stack.cc:
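static int
find_stack_direction ()
{
    static char *addr = 0;   /* address of the first call's local */
    char dummy;              /* automatic by default; 'auto' as a storage class is invalid since C++11 */
    if (addr == 0)
    {
        addr = &dummy;
        return find_stack_direction ();
    }
    else
    {
        /* 1 if the nested call's local is higher (stack grows up), -1 otherwise */
        return ((&dummy > addr) ? 1 : -1);
    }
}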

The advantage of growing down is that in older systems the stack was typically at the top of memory. Programs typically filled memory starting from the bottom, so this sort of memory management minimized the need to measure and place the bottom of the stack somewhere sensible.

Just a small addition to the other answers, which as far as I can see have not touched this point:
Having the stack grow downwards makes all addresses within the stack have a positive offset relative to the stack pointer. There's no need for negative offsets, as they would only point to unused stack space. This simplifies accessing stack locations when the processor supports stackpointer-relative addressing.
Many processors have instructions that allow accesses with a positive-only offset relative to some register. Those include many modern architectures, as well as some old ones. For example, the ARM Thumb ABI provides for stackpointer-relative accesses with a positive offset encoded within a single 16-bit instruction word.
If the stack grew upwards, all useful offsets relative to the stackpointer would be negative, which is less intuitive and less convenient. It also is at odds with other applications of register-relative addressing, for example for accessing fields of a struct.

Stack grows down on x86 (defined by the architecture, pop increments stack pointer, push decrements.)

In MIPS and many modern RISC architectures (like PowerPC, RISC-V, SPARC...) there are no push and pop instructions. Those operations are done explicitly by manually adjusting the stack pointer and then loading/storing the value relative to the adjusted pointer. All registers (except the zero register) are general purpose, so in theory any register can be a stack pointer, and the stack can grow in any direction the programmer wants.
That said, the stack typically grows down on most architectures, probably to avoid the case where the stack and the program data or heap data both grow up and clash with each other. There are also the addressing reasons mentioned in sh-'s answer. Some examples: the MIPS ABIs grow downwards and use $29 (a.k.a. $sp) as the stack pointer; the RISC-V ABI also grows downwards and uses x2 as the stack pointer.
In the Intel 8051 the stack grows up, probably because the memory space is so tiny (128 bytes in the original version) that there's no heap, and you don't need to put the stack on top so that it'll be separated from a heap growing from the bottom.
You can find more information about the stack usage in various architectures in https://en.wikipedia.org/wiki/Calling_convention
See also
Why does the stack grow downward?
What are the advantages to having the stack grow downward?
Why do stacks typically grow downwards?
Does stack grow upward or downward?

On most systems, the stack grows down, and my article at https://gist.github.com/cpq/8598782 explains WHY it grows down. It is simple: how do you lay out two growing memory blocks (heap and stack) in a fixed chunk of memory? The best solution is to put them at opposite ends and let them grow towards each other.

It grows down because the memory allocated to the program has the "permanent data" i.e. code for the program itself at the bottom, then the heap in the middle. You need another fixed point from which to reference the stack, so that leaves you the top. This means the stack grows down, until it is potentially adjacent to objects on the heap.

This macro should detect it at runtime without UB:
#include <stdint.h>  /* uintptr_t */
#define stk_grows_up_eh() stk_grows_up__(&(char){0})
_Bool stk_grows_up__(char *ParentsLocal);
__attribute((__noinline__))
_Bool stk_grows_up__(char *ParentsLocal) {
    return (uintptr_t)ParentsLocal < (uintptr_t)&ParentsLocal;
}

Where is an ordinary variable defined inside a device function placed?

In CUDA, I understand that a variable is placed in shared memory if it is defined as __shared__ and in constant memory if it is defined as __constant__. Also, variables allocated with cudaMalloc() are put in GPU global memory. But where are variables without qualifiers like __shared__, __constant__ and register placed? For example, the variable i as follows:
__device__ void func(){
    int i=0;
    return;
}

Automatic variables, i.e. variables without memory space specification within the scope of functions, are placed in one of the following locations:
When optimized away:
1.1 Nowhere - if the variable isn't actually necessary. This actually happens a lot, since CUDA functions are often inlined, with some variables becoming copies of a variable in the calling function. Example (note the x from foo() in the compilation of bar() - completely gone).
1.2 Immediate values in the program's compiled code - if the variable's value is constant, and doesn't get updated, its value may simply be "plugged" into the code. Here's an example with two variables taking constants, which are replaced with the constant which is their sum.
When not optimized away:
2.1 Registers - If your variable can't be optimized away, the better alternative is to keep it in a hardware register on (one of the streaming multiprocessor cores of) the GPU. This is the best and most performant option. Example (the variables x and y are placed in registers %r1 and %r2).
2.2 'Local' memory - The 'local memory' of a CUDA thread is an area in global device memory which is (in principle) accessible only by that thread.
Now, obviously, local memory is much slower to use. When will the compiler choose it, then?
The CUDA Programming Guide gives us the answer:
When the automatic variable is too large to fit in the register file for the current thread (each thread typically gets between 63 and 255 4-byte registers).
When the automatic variable is an array, or has an array member, which is indexed with a non-constant offset. Unfortunately, NVIDIA GPU multiprocessors don't support register indexing.
When the kernel's available quota of registers is already full with other variables or other uses by the compiled code, even if the variable itself is very small. This is referred to as register spilling.

Local variables are either placed in hardware registers or local memory (which is effectively global memory).
In your example, however, variable i will be removed by the compiler because it is unused.

GPUs have a large amount of space dedicated to registers, stored directly in the GPU computing units (i.e. Streaming Multiprocessors). Registers are not stored in memory unless register spilling happens (typically when you use too many registers in a given kernel). Unlike bytes in memory, registers have no address. The same holds for CPUs, except that the number of CPU registers is usually much smaller than on a GPU. For example, an Intel Skylake core has 180 integer registers and 168 vector registers, while the instruction set architecture is limited to something like 16 integer/vector registers. Note that in case of register spilling, the values of registers are temporarily stored in local memory (typically in the L1 cache if possible). Here is the overall memory hierarchy of a basic (Nvidia Fermi) GPU:
For more information, consider reading: Local Memory and Register Spilling.

What is stored in Code memory and Data memory

Can someone please explain the difference between code and data memory? I know code is stored in Flash and data is stored in RAM, but I am confused.
#include <iostream>
using namespace std;
int main()
{
    int a =10, b=20;
    int c = a+b;
    return 0;
}
Here a, b and c are stored in data memory (RAM), but what gets stored in code memory? Is this entire code stored in code memory? If yes, does this mean we are storing a, b and c in both data and code memory?

In your example, there are many possible scenarios, depending on the optimization level of your compiler.
Constants placed in "code memory"
In the code below:
int a =10, b=20;
int c = a+b;
return 0;
The variables a and b are effectively constants; they don't change. A compiler could optimize them away, reducing the code to:
int c = 10 + 20;
So the values 10 and 20 can be placed into code memory, eliminating the variables a and b.
Registers not Memory
The compiler is allowed to assign the variables a and b to registers. Registers are within the processor, so don't take up any RAM or memory space. Registers are not part of the code space either.
(This can happen because there are no statements that require the addresses of a or b).
All code dropped
On higher optimization settings, the compiler can delete all your code and replace it with a return 0.
The variables a and b are not changed.
The variable c is changed but not used by any other statements.
Your program has no effect (nothing is printed, there are no external actions like writing to hardware).
Thus your program can be reduced down to return 0;.
Code Memory vs. Data Memory
In general, processor instructions are placed in a segment you will call "code memory". This may actually reside in RAM and not in Flash or ROM. For example, on a PC, your code could be loaded from the hard drive into RAM and executed in RAM. Similarly with Flash, your code could be loaded from Flash into RAM and executed in RAM.
Constants, like numbers, can be placed into a Read-Only segment or in the Code Segment. Many processors can load constants from the Code Segment (see ARM and Intel assembly instructions). The Read-Only segment can live on a Read Only device, (ROM or Flash) or may live in RAM (or on a device like hard drive). All you can guarantee, is that the code will not write to the Read-Only segment.
Data Memory is different. The C++ language has at least 3 areas of "data" memory (where variables live): 1) Local (a.k.a. stack), where short lifetime variables reside; 2) Dynamic memory (a.k.a. heap), allocated by using new or malloc and 3) Static/Global variables. These memory areas can be placed anywhere, as long as the memory has read and write capabilities. They don't need to be fast, only readable and writable (for example, the hard drive can be used as data memory).
Memory organization is more complicated than having Code, Stack and Heap. In the embedded systems world, memory can be placed in non-standard locations and there may be a need to have more detailed memory segments so they can be placed in different areas. For example, an embedded system may want to place the constants into Flash so that they can be changed easily (even though they may be more efficiently accessed in the Code Segment). Some code may want to be placed into the Boot Area of the processor (which is programmed by the processor manufacturer). Some embedded systems may have non-volatile memory (e.g. battery backed memory), which can behave like Read-Only memory.
Trust Your Compiler
Trust your compiler to place code, data and variables in the most efficient areas possible. Your compiler knows your platform and will usually make good decisions for you. If you need to change your compiler's settings, you can, but you should really know what you are doing and why you need to change them. Most PC platforms load code from a hard drive (or SSD) into RAM and execute the code from RAM. Embedded systems are different and depend on the hardware devices. Code may be run from flash because the platform has minimal RAM. Some may store the code compressed in a serial-access read-only device and have to decompress it into RAM before executing. In these situations, the compilers are configured for these specializations. So, trust your compiler and let it place the code and data into the correct segments and locations.

A quick oversimplification:
Code memory stores the sequence of machine language instructions compiled from your C++ code (ROM or flash on many embedded systems, RAM on a PC).
The data that the program creates and manipulates is instead stored in RAM, which can be understood as made up of stack and heap: dynamically allocated data lives in the larger heap, while the pointers holding its addresses typically live on the stack. The stack is also in RAM, but values in active use are often held in the much faster CPU registers.
Those pointers are used to reach the data in the heap when the currently executing instruction calls for it, or more generally whenever needed.

Optimal Way of Using mlockall() for Real-time Application (nanosecond sensitive)

I am reading mlockall()'s manpage: http://man7.org/linux/man-pages/man2/mlock.2.html
It mentions
Real-time processes that are using mlockall() to prevent delays on page faults should reserve enough locked stack pages before entering the time-critical section, so that no page fault can be caused by function calls. This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and writes to the memory occupied by this array in order to touch these stack pages. This way, enough pages will be mapped for the stack and can be locked into RAM. The dummy writes ensure that not even copy-on-write page faults can occur in the critical section.
I am a bit confused by this statement:
This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and writes to the memory occupied by this array in order to touch these stack pages.
All the automatic variables (variables on stack) are created "on the fly" on the stack when the function is called. So how can I achieve what the last statement says?
For example, let's say I have this function:
void foo() {
    char a;
    uint16_t b;
    std::deque<int64_t> c;
    // do something with those variables
}
Or does it mean before I call any function, I should call a function like this in main():
void reserveStackPages() {
    int64_t stackPage[4096/8 * 1024 * 1024];
    memset(stackPage, 0, sizeof(stackPage));
}
If yes, does it make a difference if I first allocate the stackPage variable on the heap, write to it and then free it? Probably yes, because heap and stack are 2 different regions in the RAM?
The std::deque above is just there to bring up another related question: what if I want to reserve memory for things using both stack pages and heap pages? Will calling a "heap" version of reserveStackPages() help?
The goal is to minimize all the jitters in the application (yes, I know there are many other things to look at such as TLB miss, etc; just trying to deal with one kind of jitter at once, and slowly progressing into all).
Thanks in advance.
P.S. This is for a low latency trading application if it matters.

You generally don't need to use mlockall, unless you code (more or less hard) real-time applications (I have actually never used it).
If you do need it, you'd better code the most real-time parts of your code in C (not in genuine C++), because you surely want to understand the details of memory allocation. Notice that unless you dive into the std::deque implementation, you don't know exactly where it is sitting (probably most of the data is heap allocated, even if your c is an automatic variable).
You should first understand in detail the virtual address space of your process. For that, proc(5) is useful: from inside your process, you'll read /proc/self/maps; from outside (e.g. some terminal) you'll do cat /proc/1234/maps for a process of pid 1234. Or use pmap(1).
because heap and stack are 2 different regions in the RAM?
In fact, your process's address space contains many segments (listed in /proc/1234/maps), many more than two. Typically every dynamically linked shared library (such as libc.so) brings a few segments.
Try cat /proc/self/maps and cat /proc/$$/maps in your terminal to get a better intuition about virtual address spaces. On my machine, the first gives 19 segments of the cat process -each displayed as a line- and the second 97 segments of the zsh (my shell) process.
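The same information can be read from inside a program. The following is a minimal, Linux-only sketch; the function name count_mappings is made up for this example, and it simply counts the lines of /proc/self/maps.

```cpp
#include <fstream>
#include <string>

// Returns the number of mappings (lines in /proc/self/maps) of the calling
// process, or -1 if the file cannot be opened (for example, not on Linux).
int count_mappings() {
    std::ifstream maps("/proc/self/maps");
    if (!maps) return -1;
    int count = 0;
    std::string line;
    while (std::getline(maps, line)) {
        if (!line.empty()) ++count;   // one mapping per non-empty line
    }
    return count;
}
```

The exact count depends on the machine and on which libraries are linked, just like the 19 and 97 segments mentioned above.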
To ensure that your stack has enough space, you indeed could call a function allocating a large enough automatic variable, like your reserveStackPages. Beware that call stacks are practically of limited size (a few megabytes usually, see also setrlimit(2)).
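A hedged sketch of that idea follows. The names prefault_stack and lock_and_prefault are invented for this example, the 256 KiB figure is an arbitrary assumption to be replaced by the application's real worst-case stack use, and mlockall may fail without sufficient privileges or a large enough RLIMIT_MEMLOCK.

```cpp
#include <cstddef>
#include <sys/mman.h>   // mlockall, MCL_CURRENT, MCL_FUTURE
#include <unistd.h>     // sysconf, _SC_PAGESIZE

// Writes one byte per page into a stack array of `Bytes` bytes so that those
// stack pages are mapped. Returns the number of writes performed.
template <std::size_t Bytes>
std::size_t prefault_stack() {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) page = 4096;                 // fallback assumption
    volatile unsigned char buffer[Bytes];       // automatic variable on the stack
    std::size_t writes = 0;
    for (std::size_t i = 0; i < Bytes; i += static_cast<std::size_t>(page)) {
        buffer[i] = 0;                          // volatile write, not optimized away
        ++writes;
    }
    return writes;
}

// Locks current and future pages, then touches the stack before the
// time-critical section starts. Returns false if mlockall failed.
bool lock_and_prefault() {
    bool locked = (mlockall(MCL_CURRENT | MCL_FUTURE) == 0);
    prefault_stack<256 * 1024>();               // keep well below the stack limit
    return locked;
}
```

Keeping the array far smaller than the stack limit matters: an array larger than the limit reported by ulimit -s overflows the stack instead of reserving it.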
If you really need mlockall (which is unlikely) you might consider statically linking your program (to have fewer segments in your virtual address space).
Look also into madvise(2) (and perhaps mincore(2)). It is generally much more useful than mlockall. BTW, in practice, most of your virtual memory is in RAM (unless your system is thrashing, and then you'll see it immediately).
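To give a feel for mincore(2), here is a small Linux-oriented sketch that maps one anonymous page, touches it, and asks the kernel whether it is resident. The function name touched_page_is_resident is made up for this example, and the result only reflects the moment of the call.

```cpp
#include <sys/mman.h>   // mmap, munmap, mincore
#include <unistd.h>     // sysconf, _SC_PAGESIZE

// Returns 1 if a freshly touched anonymous page is resident in RAM,
// 0 if it is not, and -1 if one of the system calls failed.
int touched_page_is_resident() {
    const long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;
    void* mem = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    static_cast<volatile char*>(mem)[0] = 1;    // touch the page so it gets mapped
    unsigned char vec[1] = {0};
    int result = -1;
    if (mincore(mem, page, vec) == 0) {
        result = vec[0] & 1;                    // lowest bit set means resident
    }
    munmap(mem, page);
    return result;
}
```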
Read also Operating Systems: Three Easy Pieces to understand the role of paging.
PS. Nanosecond-sensitive applications do not make much sense (because of cache misses that the software does not control).

Where are the machine instructions of a program stored during runtime?

As far as I know, whenever we run any program, the machine instructions of the program are loaded in RAM. Again, there are two regions of memory: stack and heap.
My question is: in which region of memory are the machine instructions stored? Stack or heap?
I learnt that the following program gives a runtime error even though no variable is declared inside the function. The reason behind this is a stack overflow. Should I then assume that the machine instructions of the function are stored on the stack?
int func()
{
    return func();
}

Neither, as it is not dynamically allocated the way stack and heap are.
The executable loader loads the executable (.text) and any static data it contains, like the initial values of global variables (.data / .rodata), into an unused RAM area. It then sets up any zero-initialized memory the executable asked for (.bss).
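As a small illustration of what the loader prepares before main() runs, consider the sketch below. The variable and function names are made up for this example, and the section names in the comments are the typical ELF ones, not something the C++ standard guarantees.

```cpp
int initialized_global = 42;        // initial value stored in the executable; typically .data
int zeroed_global;                  // no stored value, zero-filled at load time; typically .bss
const char greeting[] = "hello";    // read-only constant; typically .rodata

// The machine code of this function is part of the executable; typically .text
int answer() {
    return initialized_global + zeroed_global;
}
```

None of these objects is created by a call to malloc or by entering a function; they already exist, with the values shown, when the program starts.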
Only then is the stack set up for main(). Memory is allocated on the stack when you enter another function, holding the return address, function arguments, and any locally declared variables as well as any memory allocated via alloca().[1] That memory is released when you return from the function.
Heap memory is allocated by malloc(), calloc(), or realloc(). It gets released when you free() or realloc() it.
The RAM used for the executable and its static data does not get released until the process terminates.
Thus, stack and heap are, basically, under control of the application. The memory of the executable itself is under control of the executable loader / the operating system. In appropriately equipped operating systems, you don't even have write access to that memory.
Regarding your edited question, no. (Bad style, editing a question to give it a completely new angle.)
The executable code remains where it was loaded. Calling a function does not place machine instructions on the stack. The only thing your func() (a function taking no arguments) places on the stack is the return address, a pointer that indicates where execution should continue once the current function returns.
Since none of the calls ever returns, your program keeps adding return addresses to the stack until it cannot hold any more. This has nothing to do with machine code instructions at all.
[1]: Note that none of this is actually part and parcel of the C language standard, but implementation-defined, and may differ -- I presented a simplified version of affairs. For example, function parameters might be passed in CPU registers instead of on the stack.
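One way to observe this is to record where a local variable lives at different recursion depths. The sketch below uses a made-up function name, address_at_depth, and bounds the recursion so that it terminates; the actual addresses and the distance between them depend on the compiler and platform.

```cpp
#include <cstdint>

// Recurses `depth` more times and returns the address of the local variable
// belonging to the deepest call.
std::uintptr_t address_at_depth(int depth) {
    volatile int marker = depth;    // one automatic variable per active call
    std::uintptr_t here = reinterpret_cast<std::uintptr_t>(&marker);
    if (depth == 0) return here;
    std::uintptr_t deeper = address_at_depth(depth - 1);
    // marker is read after the call, so this call's stack space stays in use
    return marker == depth ? deeper : here;
}
```

Comparing address_at_depth(0) with address_at_depth(100) shows two different stack addresses, while the address of the function's code is the same for every call.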

Neither one nor the other.
The image of your program contains the code and the static data (e.g. all your string constants, static arrays and structures, and so on). They will be loaded into different segments of the RAM.
Stack and heap are dynamic structures for storing data; they are created at the start of your program. The stack is a hardware-supported solution, while the heap is a standard-library-supported solution.
So, your code will be located in the code segment, your static data and heap will be located in the data segment, and the stack will be located in the stack segment.

the machine instructions of the program are loaded in RAM
Correct for hosted, "PC-like" systems. On embedded systems, code is most often executed directly from flash memory.
Again, there are two regions of memory: stack and heap.
No, that's an oversimplification that far too many programming teachers teach.
There are lots of other regions too: .data and .bss where all variables with static storage go, .rodata where constants go, etc etc.
The segment where the program code is stored is usually called .text.

Again, there are two regions of memory: stack and heap.
It's not that simple. Normally on mainstream operating systems there's more stuff going on:
there is one stack per running thread;
there are as many heaps as you decide to allocate (actually, as far as the memory manager is concerned, you ask for some memory pages to be "enabled" in your virtual address space; the "heap" thing derives from the fact that normally one uses some kind of heap manager code to efficiently distribute these memory portions between allocations);
there can be memory mapped files and shared memory;
most importantly, the executable file (and the dynamic libraries) are mapped in the memory of the process, normally with the code zone (the so called "text" segment) mapped in read-only mode, and other zones (typically pertaining to initialized global and static variables and stuff fixed by the loader) in copy-on-write.
So, the code is stored in the relevant section of the executable, which is mapped in memory.

They are usually in a section called .text.
On Linux you can list the sections of an ELF object or executable with the size command from GNU binutils, for example on a tst ELF executable:
$ size -Ax tst | grep "^\.text"
.text 0x1e8 0x4003b0
$

There are multiple memory segments in addition to the stack and heap. Here's an example of how a program may be laid out in memory:
              +------------------------+
high address  | Command line arguments |
              | and environment vars   |
              +------------------------+
              |         stack          |
              | - - - - - - - - - - -  |
              |           |            |
              |           V            |
              |                        |
              |           ^            |
              |           |            |
              | - - - - - - - - - - -  |
              |          heap          |
              +------------------------+
              |    global and read-    |
              |       only data        |
              +------------------------+
              |      program text      |
 low address  |     (machine code)     |
              +------------------------+
Details will vary from platform to platform, but this layout is pretty common for x86-based systems. The machine code occupies its own memory segment (labeled .text in ELF format), global read-only data will be stored in another segment (.rdata or .rodata), uninitialized globals in yet another segment (.bss), etc. Some segments are read-only, some are writable.
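To see such a layout on your own machine, you can collect one address from each region and compare them. This is only a sketch: the struct Addresses and the functions sample_addresses and print_addresses are made up for this example, and the relative order of the regions is platform dependent (and often randomized).

```cpp
#include <cstdint>
#include <iostream>
#include <memory>

int global_value = 7;               // initialized global (data segment)

struct Addresses {
    std::uintptr_t code, data, heap, stack;
};

// Collects one sample address from each region.
Addresses sample_addresses() {
    int local = 0;                                  // lives on the stack
    std::unique_ptr<int> dynamic(new int(0));       // lives on the heap
    Addresses a;
    a.code  = reinterpret_cast<std::uintptr_t>(&sample_addresses);
    a.data  = reinterpret_cast<std::uintptr_t>(&global_value);
    a.heap  = reinterpret_cast<std::uintptr_t>(dynamic.get());
    a.stack = reinterpret_cast<std::uintptr_t>(&local);
    return a;
}

void print_addresses(const Addresses& a) {
    std::cout << std::hex
              << "code:  0x" << a.code  << '\n'
              << "data:  0x" << a.data  << '\n'
              << "heap:  0x" << a.heap  << '\n'
              << "stack: 0x" << a.stack << '\n';
}
```

Calling print_addresses(sample_addresses()) from main() prints four clearly separated addresses; the function's address falls in the program text, not in the stack or the heap.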

Neither heap nor stack.
Usually, executable instructions reside in the code segment.
Quoting the Wikipedia article:
In computing, a code segment, also known as a text segment or simply as text, is a portion of an object file or the corresponding section of the program's virtual address space that contains executable instructions.
and
when the loader places a program into memory so that it may be executed, various memory regions are allocated (in particular, as pages)
At runtime, the code segment of an object file is loaded into a corresponding code segment in memory. In particular, it has nothing to do with stack or heap.
EDIT:
In your code snippet above, what you're experiencing is called infinite recursion.
Even if your function does not need any stack space for local variables, each call still has to push a return address onto the stack before entering the next call. Because no call ever returns, those addresses are never popped, so the program eventually runs out of stack space, causing a stack overflow.
