Up to how many subroutines can x86 processors call? [duplicate]

This question already has answers here:
C/C++ maximum stack size of program on mainstream OSes
(7 answers)
Closed 1 year ago.
I'm writing a small program that prints a polygon with printf("\219"), just to check that what I'm doing works in my kernel. It needs to call many functions, and I don't know whether x86 processors can handle that many nested subroutine calls; I can't find an answer on Google. So my question is: will the processor accept this many function calls, and what is the maximum? I mean something like this:
function a() {b();}
function b() {c();}
function c() {d();}
...
I've used 5 such levels (you know what I mean, right?)

Your function call depth is not limited by the processor, but by the size of your stack. Behind the scenes, calls to C++ functions usually translate to call instructions in x86, which push four (or eight, for x64 programs) bytes onto your program's stack for the return address. Sometimes calls are optimized and don't touch the stack at all. Functions might also push additional bytes (e.g. local function state) onto the stack.
To get the exact number of functions you can call, you need to disassemble your code to calculate the number of bytes each function pushes to the stack (minimum four/eight because of the return address, but likely many more), then find the maximum stack size and divide it by the function frame size.
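The division described above can be sketched in C++. This is only a rough upper bound under the assumption that every call uses the same number of bytes; the name max_call_depth and the sample figures (an 8 MiB stack with 16-byte or 4-byte frames) are illustrative assumptions, not values measured from any particular program.

```cpp
#include <cstddef>

// Upper bound on nesting depth: stack size divided by per-call frame size.
// Assumes every call uses the same number of bytes (return address plus
// any saved registers, locals, and padding), which real programs rarely do.
constexpr std::size_t max_call_depth(std::size_t stack_bytes,
                                     std::size_t frame_bytes) {
    return frame_bytes == 0 ? 0 : stack_bytes / frame_bytes;
}

constexpr std::size_t MiB = 1024 * 1024;

// 8 MiB stack, 16-byte frames: 524288 calls deep (512Ki).
static_assert(max_call_depth(8 * MiB, 16) == 524288, "8 MiB / 16 B");
// 8 MiB stack, bare 4-byte return addresses: 2097152 calls deep (2Mi).
static_assert(max_call_depth(8 * MiB, 4) == 2097152, "8 MiB / 4 B");
```

In practice the usable depth is lower, because part of the stack is already occupied before your first function runs and real frames vary in size.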

call and ret instructions aren't special; the CPU doesn't know how deeply nested it is. (And doesn't know the difference between calling another function or being recursive.) As described in my answer here, functions are a high-level concept that asm gives you the tools to implement.
All the CPU knows is whether pushing a return address causes a page fault, which is what happens if you run out of stack space (a "stack overflow"). That is usually the result of recursion going too deep, a huge array in automatic storage (a C++ local var), or alloca.
(As mentioned in comments, not every C function call results in a call instruction; inlining can fully optimize away the fact that it's a separate function. You want small functions to inline, and the design of the template classes in the C++ standard library depends on this for efficiency. Also, a tailcall can be just a jmp, having the next function take over this stack space instead of getting new space.)
Linux typically uses 8 MiB stacks for user-space processes (ulimit -s). Windows is typically 1 MiB. Inside a kernel, you often use smaller stacks, e.g. Linux kernel thread stacks are currently 16 KiB for x86-64. They were previously as small as 4 KiB (one page) for 32-bit x86, but some code has more local vars that take up stack space.
Related: How does the stack work in assembly language?. When ret executes, it just pops into the program counter. It's up to the programmer (or compiler) to create asm that runs ret when the stack pointer is pointing at somewhere you want to jump. (Typically a return address.)
In modern Linux systems, the smallest stack frame is 16 bytes, because the ABI specifies maintaining 16-byte stack alignment before a call. So best case you can have a call depth of 512k before you overflow the stack. (Unless you're in a thread that was started with an extra large thread-stack).
If you're in 32-bit mode with an old version of the i386 System V ABI that only required 4-byte stack alignment (like gcc -mpreferred-stack-boundary=2 instead of the default 4), with functions that just called without using any other stack space, 8 MiB of stack would give you 8 MiB / 4B = 2 Mi call depth.
In real life, some of that 8MiB space on the main thread's stack is used up by env vars and argv[] already on the stack at the process entry point (copied there by the kernel), and then a bit more as _start calls a function that calls main.
To make this able to actually return at the end, instead of just eventually faulting, you'd need either a huge chain, or some recursion with a termination condition, like
void recurse(int n) {
    if (n == 1)
        return;
    recurse(n - 1);
}
and compile with some optimization but not enough to get the compiler to turn it into a do{}while(--n); loop or optimize away.
If you wanted all different functions, that's ok, the code-size limit is at least 2GiB, and call rel32 / ret takes a total of 6 bytes. (Un-optimized code would default to push ebp or push rbp as well, so you'd have to avoid that for 32-bit code if you wanted to meet that 4-byte stack-frame goal of having all your stack space full with just return addresses).
For example with GCC (see How to remove "noise" from GCC/clang assembly output? for the __attribute__((noipa)) option)
__attribute__((noipa)) // don't let other functions even notice that this is empty, let alone inline it
void foo(void){}
__attribute__((noipa))
void bar(){
    foo();
}
void baz(){
    bar();
}
compiles with GCC (Godbolt compiler explorer) to this 32-bit asm which just calls without using any stack space for anything else:
## GCC11.2 -O1 -m32 -mpreferred-stack-boundary=2
foo():
        ret
bar():
        call    foo()
        ret
baz():
        call    bar()
        ret
g++ -O2 would optimize these call/ret functions into a tailcall like jmp foo, which doesn't push a return address. That's why I only used -O1 or -Og. Of course in real life you do want things to inline and optimize away; defeating that is just for this silly computer trick to achieve the longest finite call depth that actually would crash if you made it longer.
You can repeat this pattern indefinitely; GCC allows long symbol names, so it's not a problem to have many different unique function names. You can split across multiple files, with just a prototype for one of the functions in the other file.
If you reduce the -falign-functions tuning setting, you can probably get it down to either 6 bytes per function (no padding) or 8 (align by 8 thus 2 bytes of padding), down from the default of aligning each function label by 16, wasting 10 bytes per function.
Recursive:
I got GCC to make asm that recurses with no gaps between return addresses:
void recurse(int n){
    if (!--n)
        return;
    recurse(n);
}
## G++11.2 -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
        sub     eax, 1          # --n
        jne     .L6             # skip over the ret if the result is non-zero
.L4:
        ret
.L6:
        call    recurse(int)    # push a 4-byte ret addr and jump to top
        jmp     .L4             # silly compiler, should put another ret here. But we told it not to optimize too much
Note the -mregparm=3 option, so the first up-to-3 args are passed in registers, instead of on the stack in the inefficient i386 System V calling convention. (It was designed a long time ago; the x86-64 SysV calling convention is much better.)
With -O2, this function optimizes away to just a ret. (It turns the call into a tailcall which becomes a loop, and then it can see it's a non-infinite loop with no side effects so it just removes it.)
Of course in real life you want optimizations like this. Recursion in asm sucks compared to loops. If you're worried about robust code and call depth, don't write recursive functions in the first place, unless you know recursion depth will be shallow. If a debug build doesn't convert your recursion to iteration, you don't want your kernel to crash.
Just for fun, I got GCC to tighten up the asm even at -O1 by convincing it to do the conditional branch over the call, to only have one ret instruction in the function so tail-duplication wouldn't be relevant anyway. And it means the fast path (recursion) involves a not-taken macro-fused conditional branch plus a call.
void recurse(int n){
    if (__builtin_expect(!!--n, 1))
        recurse(n);
}
Godbolt
## GCC 11.2 -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
        sub     eax, 1
        je      .L1
        call    recurse(int)
.L1:
        ret

Related

Understanding the assembly language for if-else in following code [duplicate]

What happens if i say 'call ' instead of jump? Since there is no return statement written, does control just pass over to the next line below, or is it still returned to the line after the call?
start:
    mov $0, %eax
    jmp two
one:
    mov $1, %eax
two:
    cmp %eax, $1
    call one
    mov $10, %eax

The CPU always executes the next instruction in memory, unless a branch instruction sends execution somewhere else.
Labels don't have a width, or any effect on execution. They just allow you to make reference to this address from other places. Execution simply falls through labels, even off the end of your code if you don't avoid that.
If you're familiar with C or other languages that have goto, the labels you use to mark goto targets work exactly the same as asm labels, and jmp / jcc work exactly like goto or if(EFLAGS_condition) goto. But asm doesn't have special syntax for functions; you have to implement that high-level concept yourself.
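To make the goto analogy concrete, here is a small C++ sketch that mirrors the shape of the asm in the question. The function name run and its parameter are made up for illustration; the point is only that labels generate no code and execution falls straight through them.

```cpp
// Returns the final value of `eax` so the path taken is observable.
int run(bool jump_to_two) {
    int eax = 0;
    if (jump_to_two)
        goto two;   // like `jmp two`: the assignment under `one` is skipped
    goto one;       // redundant: falling through would reach `one` anyway
one:                // a label is just a name for this position
    eax = 1;
two:                // reached by a goto or by falling through from `one`
    eax += 10;
    return eax;
}
```

Calling run(true) skips the assignment under one, while run(false) executes both labelled statements in order, exactly as the CPU would fall through the asm labels.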
If you leave out the ret at the end of a block of code, execution keeps going and decodes whatever comes next as instructions. (Maybe padding, see "What would happen if a system executes a part of the file that is zero-padded?", if that was the last function in an asm source file, or maybe execution falls into some CRT startup function that eventually returns.)
(In which case you could say that the block you're talking about isn't a function, just part of one, unless it's a bug and a ret or jmp was intended.)
You can (and maybe should) try this yourself in a debugger. Single-step through that code and watch RSP and RIP change. The nice thing about asm is that the total state of the CPU (excluding memory contents) is not very big, so it's possible to watch the entire architectural state in a debugger window. (Well, at least the interesting part that's relevant for user-space integer code, so excluding model-specific registers that only the OS can tweak, and excluding the FPU and vector registers.)
call and ret aren't "special" (i.e. the CPU doesn't "remember" that it's inside a "function").
They just do exactly what the manual says they do, and it's up to you to use them correctly to implement function calls and returns. (e.g. make sure the stack pointer is pointing at a return address when ret runs.) It's also up to you to get the calling convention correct, and all that stuff. (See the x86 tag wiki.)
There's also nothing special about a label that you jmp to vs. a label that you call. An assembler just assembles bytes into the output file, and remembers where you put label markers. It doesn't truly "know" about functions the way a C compiler does. You can put labels wherever you want, and it doesn't affect the machine code bytes.
Using the .globl one directive would tell the assembler to put an entry in the symbol table so the linker could see it. That would let you define a label that's usable from other files, or even callable from C. But that's just meta-data in the object file and still doesn't put anything between instructions.
Labels are just part of the machinery that you can use in asm to implement the high-level concept of a "function", aka procedure or subroutine: A label for callers to call to, and code that will eventually jump back to a return address the caller passed, one way or another. But not every label is the start of a function. Some are just the tops of loops, or other targets of conditional branches within a function.
Your code would run exactly the same way if you emulated call with an equivalent push of the return address and then a jmp.
one:
    mov $1, %eax
    # missing ret so we fall through
two:
    cmp %eax, $1
    # call one # emulate it instead with push+jmp
    pushl $.Lreturn_address
    jmp one
.Lreturn_address:
    mov $10, %eax
    # fall off into whatever comes next, if it ever reaches here.
Note that this sequence only works in non-PIC code, because the absolute return address is encoded into the push imm32 instruction. In 64-bit code with a spare register available, you can use a RIP-relative lea to get the return address into a register and push that before jumping.
Also note that while architecturally the CPU doesn't "remember" past CALL instructions, real implementations run faster by assuming that call/ret pairs will be matched, and use a return-address predictor to avoid mispredicts on the ret.
Why is RET hard to predict? Because it's an indirect jump to an address stored in memory! It's equivalent to pop %internal_tmp / jmp *%internal_tmp, so you can emulate it that way if you have a spare register to clobber (e.g. rcx is not call-preserved in most calling conventions, and not used for return values). Or if you have a red-zone so values below the stack-pointer are still safe from being asynchronously clobbered (by signal handlers or whatever), you could add $8, %rsp / jmp *-8(%rsp).
Obviously for real use you should just use ret, because it's the most efficient way to do that. I just wanted to point out what it does using multiple simpler instructions. Nothing more, nothing less.
Note that functions can end with a tail-call instead of a ret:
(see this on Godbolt)
int ext_func(int a); // something that the optimizer can't inline
int foo(int a) {
    return ext_func(a+a);
}
# asm output from clang:
foo:
        add     edi, edi
        jmp     ext_func        # TAILCALL
The ret at the end of ext_func will return to foo's caller. foo can use this optimization because it doesn't need to make any modifications to the return value or do any other cleanup.
In the SystemV x86-64 calling convention, the first integer arg is in edi. So this function replaces that with a+a, then jumps to the start of ext_func. On entry to ext_func, everything is in the correct state just like it would be if something had run call ext_func. The stack pointer is pointing to the return address, and the args are where they're supposed to be.
Tail-call optimizations can be done more often in a register-args calling convention than in a 32-bit calling convention that passes args on the stack. You often run into situations where you have a problem because the function you want to tail-call takes more args than the current function, so there isn't room to rewrite our own args into args for the function. (And compilers don't tend to create code that modifies its own args, even though the ABI is very clear that functions own the stack space holding their args and can clobber it if they want.)
In a calling convention where the callee cleans the stack (with ret 8 or something to pop another 8 bytes after the return address), you can only tail-call a function that takes exactly the same number of arg bytes.

Your intuition is correct: the control just passes to the next line below after the function returns.
In your case, after call one, your function will jump to mov $1, %eax and then continue down to cmp %eax, $1 and end up in an infinite loop as you will call one again.
Beyond just looping forever, your code will eventually run out of stack, since each call instruction pushes a return address (the address of the instruction after the call) onto the stack and nothing ever pops it. Eventually, you'll overflow the stack.

Embed a function's assembly code in a struct

I have a rather special question: is it possible in C/C++ (I ask about both because I'm sure the answer is the same in both languages) to specify a function's location? Why? I have a very large list of function pointers, and I want to eliminate them.
Currently, this looks like the following (repeated about a million times, stored in the user's RAM):
struct {
    int i;
    void(* funptr)();
} test;
Because I know that in most assembly languages, functions are just "goto" directives, I had the following idea. Is it possible to optimize the above construct so that it looks like that?
struct {
    int i;
    // embed the assembler of the function here
    // so that all the functions
    // instructions are located here
    // like this: mov rax, rbx
    // jmp _start ; just demo code
} test2;
In the end, the thing should look like this in memory: An int holding any value, followed by the function's assembly code, referenced by test2. I should be able to call these functions like that: ((void(*)()) (&pointerToTheStruct + sizeof(int)))();
You might think that I'm insane to optimize the app that way, and I cannot disclose any more details on its function, but if anyone has some pointers on how to solve this problem, I would appreciate it.
I do not think that there is a standard way to do this, so any hacky way to do this via inline assembler/other crazy things is also appreciated!

The only thing you really have to do is make the compiler aware of the (constant) value of the function pointer you want in the struct. The compiler will then (presumably/hopefully) inline that function call wherever it sees it called through that function pointer:
template<void(*FPtr)()>
struct function_struct {
    int i;
    static constexpr auto funptr = FPtr;
};
void testFunc()
{
    volatile int x = 0;
}
using test = function_struct<testFunc>;
int main()
{
    test::funptr();
}
Demo - no call or jmp after optimization.
It remains unclear what the point of the int i is. Note that the code is not technically "directly after the i" here, but it is even more unclear how you'd expect instances of the struct to look like (is the code in them or is it "static" in a way? I feel there is some misunderstanding here on your part what compilers actually produce...). But consider the ways that compiler inlining can help you and you might find the solution you need. If you're worried about executable size after inlining, tell the compiler and it will compromise between speed and size.

This sounds like a terrible idea for a lot of reasons: it probably won't save memory, and it will hurt performance by diluting L1I-cache with data and L1D-cache with code. It gets worse if you ever modify or copy objects: self-modifying code stalls.
But yes, this would be possible in C99/C11 with a flexible array member at the end of the struct, which you cast to a function pointer.
struct int_with_code {
    int i;
    char code[]; // C99 flexible array member. GNU extension in C++
    // Store machine code here
    // you can't get the compiler to do this for you. Good Luck!
};
void foo(struct int_with_code *p) {
    // explicit C-style cast compiles as both C and C++
    void (*funcp)(void) = ( void (*)(void) ) p->code;
    funcp();
}
Compiler output from clang7.0, on the Godbolt compiler explorer is the same when compiled as either C or C++. This is targeting the x86-64 System V ABI, where the first function arg is passed in RDI.
# this is the code that *uses* such an object, not the code that goes in its code[]
# This proves that it compiles,
# without showing any way to get compiler-generated code into code[]
foo: # #foo
        add     rdi, 4          # move the pointer 4 bytes forward, to point at code[]
        jmp     rdi             # TAILCALL
(If you leave out the (void) arg-type declaration in C, the compiler will zero AL first in the x86-64 SysV calling convention, in case it's actually a variadic function, because it's passing no FP args in registers.)
You'd have to allocate your objects in memory that is executable (normally not done unless they're const with static storage), e.g. compile with gcc -zexecstack. Or use a custom allocator built on mmap/mprotect (POSIX) or VirtualAlloc/VirtualProtect (Windows).
Or if your objects are all statically allocated, it might be possible to massage compiler output to turn functions in the .text section into objects by adding an int member right before each one. Maybe with some .section and linker tricks, and maybe a linker script, you could even somehow automate it.
But unless they're all the same length (e.g. with padding like char code[60]), that won't form an array you can index, so you'll need some way of referencing all these variable-length objects.
There are potentially huge performance downsides if you ever modify an object before calling its function: on x86 you'll get self-modifying-code pipeline nuke for executing code near a just-written memory location.
Or if you copied an object before calling its function: x86 pipeline flush, or on other ISAs you need to manually flush caches to get the I-cache in sync with D-cache (so the newly-written bytes can be executed). But you can't copy such objects because their size isn't stored anywhere. You can't search the machine code for a ret instruction, because a 0xc3 byte might appear somewhere that's not the start of an x86 instruction. Or on any ISA, the function might have multiple ret instructions (tail duplication optimization). Or end with a jmp instead of a ret (tailcall).
Storing a size would start to defeat the purpose of saving size, eating up at least an extra byte in each object.
Writing code to an object at runtime, then casting to a function pointer, is undefined behaviour in ISO C and C++. On GNU C/C++, make sure you call __builtin___clear_cache on it to sync caches or whatever else is necessary. Yes, this is needed even on x86 to disable dead-store elimination optimizations: see this test case. On x86 it's just a compile-time thing, no extra asm. It doesn't actually clear any caches.
If you do copy at runtime startup, maybe allocate a big chunk of memory and carve out variable-length chunks of it, while copying. If you malloc each separately, you're wasting memory-management overhead on it.
This idea will not save you memory unless you have about as many functions as you have objects
Normally you have a fairly limited number of actual functions, with many objects having copies of the same function pointer. (You've kind of hand-rolled C++ virtual functions, but with only one function you just have a function pointer directly instead of a vtable pointer to a table of pointers for that class type. One fewer levels of indirection, and apparently you're not passing the object's own address to the function.)
One of the several benefits of this level of indirection is that one pointer is usually significantly smaller than the entire code for a function. For that to not be the case, your functions would have to be tiny.
Example: with 10 different functions of 32 bytes each, and 1000 objects with function pointers, you have a total of 320 bytes of code (which will stay hot in I-cache), and 8000 bytes of function pointers. (And in your objects, another 4 bytes per object wasted on padding to align the pointer, making the total size 16 instead of 12 bytes per object.) Anyway, that's 16320 bytes total for entire structs + code. If you allocated each object separately, there's per-object bookkeeping.
With inlining machine code into each object, and no padding, that's 1000 * (4+32) = 36000 bytes, over twice the total size.
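The arithmetic in this example can be checked with a short C++ sketch. The function names are made up for illustration, and the figures (16-byte objects, 4-byte int, 32-byte functions) are the ones assumed in the text above, not measurements of any real program.

```cpp
#include <cstddef>

// Total bytes when each object holds an int plus a pointer to shared code.
constexpr std::size_t shared_code_bytes(std::size_t n_objects,
                                        std::size_t object_bytes,
                                        std::size_t n_functions,
                                        std::size_t function_bytes) {
    return n_objects * object_bytes + n_functions * function_bytes;
}

// Total bytes when each object embeds its own copy of the machine code.
constexpr std::size_t embedded_code_bytes(std::size_t n_objects,
                                          std::size_t int_bytes,
                                          std::size_t function_bytes) {
    return n_objects * (int_bytes + function_bytes);
}

// 1000 objects of 16 bytes plus 10 shared functions of 32 bytes each.
static_assert(shared_code_bytes(1000, 16, 10, 32) == 16320, "pointer layout");
// 1000 objects, each a 4-byte int followed by 32 bytes of code.
static_assert(embedded_code_bytes(1000, 4, 32) == 36000, "embedded layout");
```

Plugging in your own object count and function sizes shows quickly whether embedding could ever win; it only does when functions are tiny or nearly every object has a unique function.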
x86-64 is probably a best-case scenario, where a pointer is 8 bytes and x86-64 machine code uses a (famously complex) variable-length instruction encoding which allows for high code density in some cases, especially when optimizing for code-size. (e.g. code-golfing. https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code). But unless your functions are mostly something trivial like lea eax, [rdi + rdi*2] (3 bytes=opcode + ModRM + SIB) / ret (1 byte), they're still going to take more than 8 bytes. (That's return x*3; for a function that takes a 32-bit integer x arg, in the x86-64 System V ABI.)
If they're wrappers for larger functions, a normal call rel32 instruction is 5 bytes. A load of static data is at least 6 bytes (opcode + modrm + rel32 for a RIP-relative addressing mode, or loading EAX specifically can use the special no-modrm encoding for an absolute address. But in x86-64 that's a 64-bit absolute unless you use an address-size prefix too, potentially causing an LCP stall in the decoders on Intel. mov eax, [32 bit absolute address] = addr32 (0x67) + opcode + abs32 = 6 bytes again, so this is worse for no benefit).
Your function-pointer type doesn't have any args (assuming this is C++ where foo() means foo(void) in a declaration, not like old C where an empty arg list is somewhat similar to (...)). Thus we can assume you're not passing args, so to do anything useful the functions are probably accessing some static data or making another call.
Ideas that make more sense:
Use an ILP32 ABI like Linux x32, where the CPU runs in 64-bit mode but your code uses 32-bit pointers. This would make each of your objects only 8 bytes instead of 16. Avoiding pointer-bloat is a classic use-case for x32 or ILP32 ABIs in general.
Or (yuck) compile your code as 32-bit. But then you have obsolete 32-bit calling conventions that pass args on the stack instead of registers, and less than half the registers, and much higher overhead for position-independent code. (No EIP/RIP-relative addressing.)
Store an unsigned int table index to a table of function pointers. If you have 100 functions but 10k objects, the table is only 100 pointers long. In asm you could index an array of code directly (computed goto style) if all the functions were padded to the same length, but in C++ you can't do that. An extra level of indirection with a table of function pointers is probably your best bet.
e.g.
void (*const fptrs[])(void) = {
    func1, func2, func3, ...
};
struct int_with_func {
    int i;
    unsigned f;
};
void bar(struct int_with_func *p) {
    fptrs[p->f] ();
}
clang/gcc -O3 output:
bar(int_with_func*):
        mov     eax, dword ptr [rdi + 4]        # load p->f
        jmp     qword ptr [8*rax + fptrs]       # TAILCALL # index the global table with it for a memory-indirect jmp
If you were compiling a shared library, PIE executable, or not targeting Linux, the compiler couldn't use a 32-bit absolute address to index a static array with one instruction. So there'd be a RIP-relative LEA in there and something like jmp [rcx+rax*8].
This is an extra level of indirection vs. storing a function pointer in each object, but it lets you shrink each object to 8 bytes, down from 16, like using 32-bit pointers. Or to 5 or 6 bytes, if you use an unsigned short or uint8_t and pack the structs with __attribute__((packed)) in GNU C.
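Here is a self-contained sketch of the table-index idea with a 16-bit index. The three functions, the last_called variable, and call_through_table are hypothetical stand-ins added only so the example can be compiled and checked; the static_assert assumes a 4-byte int, as on mainstream x86-64 ABIs.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the real functions; each records that it ran.
int last_called = 0;
void func1() { last_called = 1; }
void func2() { last_called = 2; }
void func3() { last_called = 3; }

// One shared, read-only table instead of a pointer in every object.
void (*const fptrs[])() = { func1, func2, func3 };

struct int_with_func {
    int i;
    std::uint16_t f;   // index into fptrs
};

// Without packing, tail padding rounds the struct up to 8 bytes
// (assumes a 4-byte int with 4-byte alignment).
static_assert(sizeof(int_with_func) == 8, "assumes 4-byte int");

// Calls the function selected by obj.f and reports which one ran.
int call_through_table(const int_with_func &obj) {
    fptrs[obj.f]();
    return last_called;
}
```

Getting below 8 bytes per object would need a packed layout such as the __attribute__((packed)) mentioned above, at the cost of possibly unaligned int accesses.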

No, not really.
The way to specify a function's location is to use a function pointer, which you're already doing.
You could make different types which have their own different member functions, but then you're back to the original problem.
I have in the past experimented with auto-generating (as a pre-build step, using Python) a function with a long switch statement that does the work of mapping int i to a normal function call. This gets rid of the function pointers, at the expense of branching. I don't remember whether it ended up being worthwhile in my case and, even if I did, that wouldn't tell us whether it's worthwhile in your case.
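A hand-written sketch of what such a generated switch might look like is below. The handler names, their return values, and the -1 fallback are all invented for illustration; a real generator would emit one case per function in the program.

```cpp
// Hypothetical handlers standing in for the program's real functions.
inline int handler_a() { return 10; }
inline int handler_b() { return 20; }
inline int handler_c() { return 30; }

// Maps an integer index to a direct call, so no function pointer is stored.
// The compiler is free to lower this to a jump table or a branch tree.
inline int dispatch(int i) {
    switch (i) {
    case 0:  return handler_a();
    case 1:  return handler_b();
    case 2:  return handler_c();
    default: return -1;   // unknown index
    }
}
```

Because every call target is visible at compile time, the handlers can be inlined into the switch, which is something a call through a stored pointer usually prevents.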
Because I know that in most assembly languages, functions are just "goto" directives
Well, it's perhaps a little more complicated than that…
You might think that I'm insane to optimize the app that way
Perhaps. Trying to eliminate indirection is not, in itself, a bad thing, so I don't think you're wrong to try to improve this. I just don't think that you necessarily can.
but if anyone has some pointers
lol

I don't understand the goal of this "optimization". Is it about saving memory?
I might be misunderstanding the question, but if you just replace your function pointer with a regular function, then you'll have your struct only containing the int as data and the function-pointer being inserted by the compiler when you take the address of it, instead of stored in memory.
So just do
struct {
    int i;
    void func();
} test;
Then sizeof(test)==sizeof(int) should hold true if you set alignment/packing to be tight.
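A compilable variant of that idea, with a named struct and an inline member function so that it can actually be called, might look like this. The names with_member_function and twice are made up for the example.

```cpp
// A non-virtual member function is ordinary code elsewhere in the binary;
// it adds nothing to the size of each object.
struct with_member_function {
    int i;
    int twice() const { return 2 * i; }
};

// Only the int is stored per object.
static_assert(sizeof(with_member_function) == sizeof(int),
              "member functions take no per-object storage");
```

This only works when every object uses the same function; once different objects need different functions, you are back to storing a pointer, an index, or a vtable pointer.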

How is it known that variables are in registers, or on stack?

I am reading this question about inline in the isocpp FAQ. The code is given as
void f()
{
    int x = /*...*/;
    int y = /*...*/;
    int z = /*...*/;
    // ...code that uses x, y and z...
    g(x, y, z);
    // ...more code that uses x, y and z...
}
then it says that
Assuming a typical C++ implementation that has registers and a stack,
the registers and parameters get written to the stack just before the
call to g(), then the parameters get read from the stack inside
g() and read again to restore the registers while g() returns to
f(). But that’s a lot of unnecessary reading and writing, especially
in cases when the compiler is able to use registers for variables x,
y and z: each variable could get written twice (as a register and
also as a parameter) and read twice (when used within g() and to
restore the registers during the return to f()).
I have great difficulty understanding the paragraph above. My questions are listed below:
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data? (I know this question is not particularly related to C++, but understanding this will be helpful to understand how C++ works.)
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
PS
It's very hard to choose an answer to accept when the answers are all very good (e.g., the ones provided by @MatsPeterson, @TheodorosChatzigiannakis, and @superultranova). I personally like the one by @Potatoswatter a little bit more since it offers some guidelines.

Don't take that paragraph too seriously. It seems to be making excessive assumptions and then going into excessive detail, which can't really be generalized.
But, your questions are very good.
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data? (I know this question is not particularly related to C++, but understanding this will be helpful to understand how C++ works.)
More-or-less, everything needs to be loaded into registers. Most computers are organized around a datapath, a bus connecting the registers, the arithmetic circuits, and the top level of the memory hierarchy. Usually, anything that is broadcast on the datapath is identified with a register.
You may recall the great RISC vs CISC debate. One of the key points was that a computer design can be much simpler if the memory is not allowed to connect directly to the arithmetic circuits.
In modern computers, there are architectural registers, which are a programming construct like a variable, and physical registers, which are actual circuits. The compiler generates a program in terms of architectural registers, and the CPU does a lot of heavy lifting at run time (register renaming) to map them onto physical registers. For a CISC instruction set like x86, the generated code may include instructions that send operands in memory directly to arithmetic operations. But behind the scenes, it's registers all the way down.
Bottom line: Just let the compiler do its thing.
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
Each platform defines a way for C functions to call each other. Passing parameters in registers is more efficient. But, there are trade-offs and the total number of registers is limited. Older ABIs more often sacrificed efficiency for simplicity, and put them all on the stack.
Bottom line: The example is arbitrarily assuming a naive ABI.
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
The compiler tends to prefer to use registers for more frequently accessed values. Nothing in the example requires the use of the stack. However, less frequently accessed values will be placed on the stack to make more registers available.
Only when you take the address of a variable, such as by &x or passing by reference, and that address escapes the inliner, is the compiler required to use memory and not registers.
Bottom line: Avoid taking addresses and passing/storing them willy-nilly.
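The difference can be illustrated with two functions that compute the same result. This is only a sketch: the names and the volatile global g_escaped are invented here to force an address to escape, and whether a given compiler really keeps x in a register in the first function depends on the compiler and its options.

```cpp
// Hypothetical global that makes the address of a local visible outside
// the function; volatile stops the compiler from optimizing the store away.
int *volatile g_escaped = nullptr;

// No address is taken, so the compiler is free to keep x and y in registers.
int product_no_escape(int a, int b) {
    int x = a + 1;
    int y = b + 2;
    return x * y;
}

// The address of x is published, so x has to exist in memory at that point.
int product_with_escape(int a, int b) {
    int x = a + 1;
    g_escaped = &x;                 // address escapes the function
    int y = b + 2;
    int result = *g_escaped * y;    // reads x back through the pointer
    g_escaped = nullptr;            // don't leave a dangling pointer behind
    return result;
}
```

Both functions return the same value; comparing their optimized assembly (for example on a compiler explorer) is the way to see how the escaping address changes where x lives.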

It is entirely up to the compiler (in conjunction with the processor type) whether a variable is stored in memory or a register [or in some cases more than one register] (and what options you give the compiler, assuming it's got options to decide such things - most "good" compilers do). For example, the LLVM/Clang compiler uses a specific optimisation pass called "mem2reg" that moves variables from memory to registers. The decision to do so is based on how the variable(s) are used - for example, if you take the address of a variable at some point, it needs to be in memory.
Other compilers have similar, but not necessarily identical, functionality.
Also, at least in compilers that have some semblance of portability, there will ALSO be a phase of generating machine code for the actual target, which contains target-specific optimisations, which again can move a variable from memory to a register.
It is not possible [without understanding how the particular compiler works] to determine if the variables in your code are in registers or in memory. One can guess, but such a guess is just like guessing other "kind of predictable things", like looking out the window to guess if it's going to rain in a few hours - depending on where you live, this may be a complete random guess, or quite predictable - some tropical countries, you can set your watch based on when the rain arrives each afternoon, in other countries, it rarely rains, and in some countries, like here in England, you can't know for certain beyond "right now it is [not] raining right here".
To answer the actual questions:
This depends on the processor. Proper RISC processors such as ARM, MIPS, 29K, etc. have no instructions that use memory operands except the load and store type instructions. So if you need to add two values, you need to load the values into registers, and use the add operation on those registers. Some, such as x86 and 68K, allow one of the two operands to be a memory operand, and others, for example PDP-11 and VAX, have "full freedom": whether your operands are in memory or in registers, you can use the same instruction, just with different addressing modes for the different operands.
Your original premise here is wrong - it's not guaranteed that arguments to g are on the stack. That is just one of many options. Many ABIs (application binary interfaces, aka "calling conventions") use registers for the first few arguments to a function. So, again, it depends on which compiler (to some degree) and what processor (much more than which compiler) the compiler targets whether the arguments are in memory or in registers.
Again, this is a decision that the compiler makes - it depends on how many registers the processor has, which are available, and what the cost is of "freeing" some register for x, y and z - which ranges from "no cost at all" to "quite a bit" - again, depending on the processor model and the ABI.
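As a small illustration of the first point (memory operands), consider this sketch. The function name is hypothetical, and the instruction sequences in the comments are what one would typically expect, not a promise about any particular compiler.

```cpp
// Hypothetical example: adds amount to the integer that total points at.
//
// On a load/store architecture (ARM, MIPS, ...) this typically becomes
// three steps: load *total into a register, add, store the result back.
//
// On x86 the compiler may instead emit a single add instruction with a
// memory destination operand, because the instruction set allows one
// operand of an arithmetic instruction to live in memory.
void add_in_place(int* total, int amount) {
    *total += amount;
}
```

Either way the C++ source is identical; only the target's instruction set decides whether an explicit load and store appear.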

For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data?
Not even this statement is always true. It is probably true for all the platforms you'll ever work with, but there surely can be another architecture that doesn't make use of processor registers at all.
Your x86_64 computer does however.
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
These two questions cannot be uniquely answered for any compiler and system your code will be compiled on. They cannot even be taken for granted since g's parameters might not be on the stack, it all depends on several concepts I'll explain below.
First you should be aware of the so-called calling conventions which define, among other things, how function parameters are passed (e.g. pushed on the stack, placed in registers, or a mix of both). This isn't enforced by the C++ standard; calling conventions are a part of the ABI, a broader topic regarding low-level machine code program issues.
Secondly, register allocation (i.e. which variables are actually loaded in a register at any given time) is a complex task and an NP-complete problem. Compilers try to do their best with the information they have. In general, less frequently accessed variables are put on the stack while more frequently accessed variables are kept in registers. Thus the part "Where the data inside g() is stored, register or stack?" cannot be answered once and for all, since it depends on many factors including register pressure.
Not to mention compiler optimizations which might even eliminate the need for some variables to be around.
Finally the question you linked already states
Naturally your mileage may vary, and there are a zillion variables that are outside the scope of this particular FAQ, but the above serves as an example of the sorts of things that can happen with procedural integration.
i.e. the paragraph you posted makes some assumptions to set things up for an example. Those are just assumptions and you should treat them as such.
As a small addition: regarding the benefits of inline on a function I recommend taking a look at this answer: https://stackoverflow.com/a/145952/1938163

You can't know, without looking at the assembly language, whether a variable is in a register, stack, heap, global memory or elsewhere. A variable is an abstract concept. The compiler is allowed to use registers or other memory as it chooses, as long as the execution isn't changed.
There's also another rule that affects this topic. If you take the address of a variable and store into a pointer, the variable may not be placed into a register because registers don't have addresses.
The variable storage may also depend on the optimization settings for the compiler. Variables can disappear due to simplification. Variables that don't change value may be placed into the executable as a constant.
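A minimal sketch of a variable "disappearing", using a hypothetical function: with optimization enabled, a compiler will normally fold the three locals below into a single constant, so none of them occupies a register or a stack slot at run time.

```cpp
// Hypothetical example: the three locals are compile-time constants, so an
// optimizing compiler can replace the whole body with "return 86400".
int seconds_per_day() {
    const int hours = 24;
    const int minutes = 60;
    const int seconds = 60;
    return hours * minutes * seconds;
}
```

Whether the fold actually happens is only visible in the generated assembly; the source-level behaviour is the same either way.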

Regarding your #1 question, yes, non load/store instructions operate on registers.
Regarding your #2 question, if we are assuming that parameters are passed on the stack, then we have to write the registers to the stack, otherwise g() won't be able to access the data, since the code in g() doesn't "know" which registers the parameters are in.
Regarding your #3 question, it is not known that x, y and z will for sure be stored in registers in f(). One could use the register keyword, but that's more of a suggestion. Based on the calling convention, and assuming the compiler doesn't do any optimization involving parameter passing, you may be able to predict whether the parameters are on the stack or in registers.
You should familiarize yourself with calling conventions. Calling conventions deal with the way that parameters are passed to functions and typically involve passing parameters on the stack in a specified order, putting parameters into registers or a combination of both.
stdcall, cdecl, and fastcall are some examples of calling conventions. In terms of parameter passing, stdcall and cdecl are the same, in that the parameters are pushed onto the stack in right-to-left order. In this case, if g() was cdecl or stdcall, the caller would push z, y, x in that order:
mov eax, z
push eax
mov eax, y
push eax
mov eax, x
push eax
call g
In 64-bit fastcall, registers are used: Microsoft uses RCX, RDX, R8, R9 (plus the stack for functions requiring more than 4 params), while Linux uses RDI, RSI, RDX, RCX, R8, R9. To call g() using MS 64-bit fastcall one would do the following (we assume x, y, and z are not already in registers):
mov rcx, x
mov rdx, y
mov r8, z
call g
This is how assembly is written by humans, and sometimes by compilers. Compilers will use some tricks to avoid passing parameters, as it typically reduces the number of instructions and can reduce the number of times memory is accessed. Take the following code for example (I'm intentionally ignoring non-volatile register rules):
f:
xor rcx, rcx
mov rsi, x
mov r8, z
mov rdx, y
call g
mov rcx, rax
ret
g:
mov rax, rsi
add rax, rcx
add rax, rdx
ret
For illustrative purposes, rcx is already in use, and x has been loaded into rsi. The compiler can compile g such that it uses rsi instead of rcx, so values don't have to be swapped between the two registers when it comes time to call g. The compiler could also inline g, now that f and g share the same set of registers for x, y, and z. In that case, the call g instruction would be replaced with the contents of g, excluding the ret instruction.
f:
xor rcx, rcx
mov rsi, x
mov r8, z
mov rdx, y
mov rax, rsi
add rax, rcx
add rax, rdx
mov rcx, rax
ret
This will be even faster, because we don't have to deal with the call instruction, since g has been inlined into f.
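Back at the C++ level, the 32-bit conventions named above are selected with compiler-specific keywords or attributes. The sketch below is one way to write that portably; the CC_FASTCALL and CC_STDCALL macros and the add3 functions are hypothetical names, and on targets other than 32-bit x86 the macros deliberately expand to nothing, so the platform's default convention is used.

```cpp
// Hypothetical portability macros: only 32-bit x86 builds get an explicit
// calling convention; everywhere else the annotation is simply dropped.
#if defined(_MSC_VER) && defined(_M_IX86)
  #define CC_FASTCALL __fastcall
  #define CC_STDCALL  __stdcall
#elif defined(__GNUC__) && defined(__i386__)
  #define CC_FASTCALL __attribute__((fastcall))
  #define CC_STDCALL  __attribute__((stdcall))
#else
  #define CC_FASTCALL
  #define CC_STDCALL
#endif

// With 32-bit fastcall, the first two integer arguments travel in ECX and
// EDX and the third goes on the stack.
int CC_FASTCALL add3_fast(int x, int y, int z) {
    return x + y + z;
}

// With stdcall, all three arguments are pushed on the stack right to left.
int CC_STDCALL add3_std(int x, int y, int z) {
    return x + y + z;
}
```

The two functions behave identically from the caller's point of view; the annotation only changes where the arguments are placed and who cleans up the stack.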

Short answer: You can't. It completely depends on your compiler and the optimizing features enabled.
The compiler's job is to translate your program into assembly, but how that is done is tightly coupled to how your compiler works.
Some compilers allow you to hint which variables should map to registers.
Check for example this: https://gcc.gnu.org/onlinedocs/gcc/Global-Reg-Vars.html
Your compiler will apply transformations to your code in order to gain something, maybe performance, maybe smaller code size, and it applies cost functions to estimate these gains, so normally you can only see the result by disassembling the compiled unit.

Variables are usually thought of as living in main memory. In many cases, though, compiler optimizations mean the value of a declared variable never reaches main memory at all; these tend to be intermediate variables used inside a function that are no longer needed by the time another function is called (i.e. before any stack operation occurs).
This is by design, to improve performance: it is easier (and much faster) for the processor to address and manipulate data in registers. The number of architectural registers is limited, so not everything can be kept in registers. Even if you 'hint' the compiler to put a variable in a register, the compiler may still spill it to main memory (the stack) when no register is free.
Most probably, a variable will be in main memory when it stays relevant for a longer stretch of execution. A variable is in an architectural register when it is needed by upcoming machine instructions and will be used almost immediately, but may not be relevant for long.

For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data?
This depends on the architecture and the instruction set it offers. But in practice, yes - it is the typical case.
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
Assuming the compiler doesn't eliminate the local variables, it will prefer to put them in registers, because registers are faster than the stack (which resides in the main memory, or a cache).
But this is far from a universal truth: it depends on the (complicated) inner workings of the compiler (whose details are handwaved in that paragraph).
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
Even if we assume that the variables are, in fact, stored in the registers, when you call a function, the calling convention kicks in. That's a convention that describes how a function is called, where the arguments are passed, who cleans up the stack, what registers are preserved.
All calling conventions have some kind of overhead. One source of this overhead is the argument passing. Many calling conventions attempt to reduce that, by preferring to pass arguments through registers, but since the number of CPU registers is limited (compared to the space of the stack), they eventually fall back to pushing through the stack after a number of arguments.
The paragraph in your question assumes a calling convention that passes everything through the stack and based on that assumption, what it's trying to tell you is that it would be beneficial (for execution speed) if we could "copy" (at compile time) the body of the called function inside the caller (instead of emitting a call to the function). This would yield the same results logically, but it would eliminate the runtime cost of the function call.
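What "copying the body into the caller" amounts to can be sketched as follows. The bodies of f and g are hypothetical (the quoted paragraph does not give them); g is assumed to simply sum its three arguments.

```cpp
// Hypothetical callee: sums its three arguments.
inline int g(int x, int y, int z) {
    return x + y + z;
}

// Version with a real call: under a stack-based convention, x, y and z
// would be written to the stack here and read back inside g.
int f_with_call() {
    int x = 1, y = 2, z = 3;
    return g(x, y, z);
}

// What the compiler effectively produces after inlining g into f: the
// body of g operates directly on f's own variables, with no call,
// no argument passing and no return.
int f_inlined_by_hand() {
    int x = 1, y = 2, z = 3;
    return x + y + z;
}
```

Logically the two versions of f are identical; the second just shows by hand the transformation the compiler performs when it decides to inline.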

Why is fastcall slower than stdcall?

I found following question: Is fastcall really faster?
No clear answers for x86 were given so I decided to create benchmark.
Here is the code:
#include <stdio.h>
#include <tchar.h>
#include <time.h>

int __fastcall func(int i)
{
    return i + 5;
}

int __stdcall func2(int i)
{
    return i + 5;
}

int _tmain(int argc, _TCHAR* argv[])
{
    int iter = 100;
    int x = 0;

    // Time the __fastcall version.
    clock_t t = clock();
    for (int j = 0; j <= iter; j++)
        for (int i = 0; i <= 1000000; i++)
            x = func(x & 0xFF);
    printf("%ld\n", (long)(clock() - t));  // clock_t is not necessarily int

    // Time the __stdcall version.
    t = clock();
    for (int j = 0; j <= iter; j++)
        for (int i = 0; i <= 1000000; i++)
            x = func2(x & 0xFF);
    printf("%ld\n", (long)(clock() - t));

    printf("%d", x);  // use x so the loops are not trivially dead
    return 0;
}
With no optimization, the result in MSVC 10 is:
4671
4414
With max optimization fastcall is sometimes faster, but I guess it is multitasking noise. Here is average result (with iter = 5000)
6638
6487
stdcall looks faster!
Here are results for GCC: http://ideone.com/hHcfP
Again, fastcall lost the race.
Here is part of disassembly in case of fastcall:
011917EF pop ecx
011917F0 mov dword ptr [ebp-8],ecx
return i + 5;
011917F3 mov eax,dword ptr [i]
011917F6 add eax,5
this is for stdcall:
return i + 5;
0119184E mov eax,dword ptr [i]
01191851 add eax,5
i is passed via ECX instead of the stack, but it is then saved to the stack in the body! So the whole effect is negated! This simple function could be calculated using only registers, yet there is no real difference between the two versions.
Can anyone explain what the reason for fastcall is? Why doesn't it give a speedup?
Edit: With optimization it turned out that both functions are inlined. When I turned inlining off they both are compiled to:
00B71000 add eax,5
00B71003 ret
This looks like a great optimization, indeed, but it doesn't respect calling conventions at all, so the test is not fair.

__fastcall was introduced a long time ago. At the time, Watcom C++ was beating Microsoft for optimization, and a number of reviewers picked out its register-based calling convention as one (possible) reason why.
Microsoft responded by adding __fastcall, and they've retained it ever since -- but I don't think they ever did much more than enough to be able to say "we have a register-based calling convention too..." Their preference (especially since the 32-bit migration) seems to be for __stdcall. They've put quite a bit of work into improving their code generation with it, but (apparently) not nearly so much with __fastcall. With on-chip caching, the gain from passing things in registers isn't nearly as great as it was then anyway.

Your micro-benchmark produces irrelevant results. __fastcall has specific uses with SSE instructions (see XNAMath) , clock() is not even remotely a suitable timer for benchmarking, and __fastcall exists for multiple platforms like Itanium and some others too, not just for x86, and in addition, your whole program can be effectively optimized to nothing except the printf statements, making the relative performance of __fastcall or __stdcall very, very irrelevant.
Finally, you've forgotten to realize the main reason that a lot of things are done the way they are- legacy. __fastcall may well have been significant before compiler inlining became as aggressive and effective as it is today, and no compiler will remove __fastcall as there will be programs that depend on it. That makes __fastcall a fact of life.

Several reasons
At least in most decent x86 implementations, register renaming is in effect -- the effort that looks like it's being saved by using a register instead of memory might not be doing anything on the hardware level.
Sure, you save some stack movement effort with __fastcall, but you reduce the number of registers available for use in the function without modifying the stack.
Most of the time where __fastcall would be faster the function is simple enough to be inlined in any case, which means that it really doesn't matter in real software. (Which is one of the main reasons why __fastcall is not often used)
Side note: What was wrong with Anon's answer?

Fastcall is really only meaningful if you use full optimization (otherwise its effects will be buried by other artifacts), but as you note, with full optimization, the functions will be inlined and you won't see the effect of calling conventions at all.
So to actually test this, you need to make the functions extern declarations with the actual definitions in a separate source file that you compile separately and link with your main routine. When you do that, you'll see that __fastcall is consistently ~25% faster with small functions like this.
The upshot is that __fastcall is really only useful if you have a lot of calls to tiny functions that can't be inlined because they need to be separately compiled.
Edit
So with separate compilation and gcc -O3 -fomit-frame-pointer -m32 I see quite different code for the two functions:
func:
leal 5(%ecx), %eax
ret
func2:
movl 4(%esp), %eax
addl $5, %eax
ret
Running that with iter=5000 consistently gives me results close to
9990000
14160000
indicating that the fastcall version is a shade over 40% faster.
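For measurements like these, a monotonic clock and a guard against inlining make the numbers easier to trust. The following is only a sketch under those assumptions: NOINLINE, add_five, Timing and time_calls are hypothetical names, and separate compilation (as described above) remains the more robust way to keep the call from being optimized away.

```cpp
#include <chrono>

// Hypothetical macro: ask the compiler not to inline the function under test.
#if defined(_MSC_VER)
  #define NOINLINE __declspec(noinline)
#else
  #define NOINLINE __attribute__((noinline))
#endif

// Function under test; a calling-convention keyword could be added here.
NOINLINE int add_five(int i) {
    return i + 5;
}

struct Timing {
    long long nanoseconds;  // elapsed wall time for the whole loop
    int result;             // final value, returned so the loop is not dead code
};

// Calls fn repeatedly, feeding each result back in (as the benchmark in
// the question does), and measures the loop with std::chrono::steady_clock.
Timing time_calls(int (*fn)(int), long iterations) {
    int x = 0;
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        x = fn(x & 0xFF);
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return Timing{static_cast<long long>(elapsed.count()), x};
}
```

A caller would run something like time_calls(add_five, 100000000) once per calling convention and compare the nanoseconds fields, ideally over several runs.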

I compiled the two function with i686-w64-mingw32-gcc -O2 -fno-inline fastcall.c. This is the assembly generated for func and func2:
@func@4:
leal 5(%ecx), %eax
ret
_func2@4:
movl 4(%esp), %eax
addl $5, %eax
ret $4
__fastcall really looks faster to me. func2 needs to load the input parameter from the stack. func can simply perform %eax := %ecx + 5 and then return to the caller.
Furthermore, the output of your program is typically like this on my system:
2560
3250
154
So __fastcall does not only look faster, it is faster.
Also note that on x86_64 (or x64 as Microsoft calls it), __fastcall is the default and the old non-fastcall convention does not exist anymore.
http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions
By making __fastcall the default, x86_64 catches up with other architectures (such as ARM), where passing arguments in registers is also default.

Fastcall itself as a register based calling convention isn't great on x86 because there aren't that many named registers available and by using key registers for passing the values, all you're doing is potentially forcing the calling code to push other values onto the stack and forcing the called function if it is of sufficient complexity to do the same. Essentially from an assembly language perspective, you're increasing the pressure on those named registers and explicitly using stack operations to compensate. So even if the CPU has far more registers available for renaming, it isn't going to refactor the explicit stack operations that have to be inserted.
On the other hand, on more "register rich" architectures like x86-64, register based calling conventions (not exactly the same as fastcall of old, but same concept) are the norm and are used across the board. In other words, once we got out of a few named registers architecture like x86, to something with more register space, fastcall was back in a big way and became the default and really only way used today.

Note: even though the OP edited it in May 2017, this question and its answers are likely to be well out of date and no longer relevant by 2019 (if not a few years earlier).
A) As of at least MSVC 2017 (and the recently released 2019), most of the code is going to be inlined in optimized release builds anyhow. Probably the only function body you will see in the entire example now is "_tmain()".
That is unless you specifically do some tricks like declaring the functions as "volatile" and/or wrapping the test functions in pragmas that turn off some optimizations.
B) The latest generation of desktop CPUs (the assumption here) is much improved over the circa-2010 generation. They are much better at caching the stack, memory alignment matters less, etc.
But don't take my word for it. Load up your executable in a disassembler (IDA Pro, the MSVC debugger, etc.) and look for yourself (a good way to learn).
Now it would be interesting to see what the performance would be over a large 32-bit application. For example, take the last open-sourced DOOM game release, make builds with stdcall and __fastcall, and look for framerate differences. Also gather metrics from any built-in performance reporting features it has.

It does not appear that __fastcall actually indicates that it will be faster. It seems like all you're doing is moving the first few variables into registers before making the call to the function. This most likely makes your function call slower since it must move the variables into those registers first. Wikipedia had a pretty good write-up about what exactly fastcall is and how it is implemented.

C++ CPU Register Usage

In C++, local variables are always allocated on the stack. The stack is a part of the allowed memory that your application can occupy. That memory is kept in your RAM (if not swapped out to disk). Now, does a C++ compiler always create assembler code that stores local variables on the stack?
Take, for example, the following simple code:
int foo( int n ) {
return ++n;
}
In MIPS assembler code, this could look like this:
foo:
addi $v0, $a0, 1
jr $ra
As you can see, I didn't need to use the stack at all for n. Would the C++ compiler recognize that, and directly use the CPU's registers?
Edit: Wow, thanks a lot for your almost immediate and extensive answers! The function body of foo should of course be return ++n;, not return n++;. :)

Yes. There is no rule that "variables are always allocated on the stack". The C++ standard says nothing about a stack. It doesn't assume that a stack exists, or that registers exist. It just says how the code should behave, not how it should be implemented.
The compiler only stores variables on the stack when it has to - when they have to live past a function call for example, or if you try to take the address of them.
The compiler isn't stupid. ;)

Disclaimer: I don't know MIPS, but I do know some x86, and I think the principle should be the same..
In the usual function call convention, the compiler will push the value of n onto the stack to pass it to the function foo. However, there is the fastcall convention that you can use to tell gcc to pass the value through the registers instead. (MSVC also has this option, but I'm not sure what its syntax is.)
test.cpp:
int foo1 (int n) { return ++n; }
int foo2 (int n) __attribute__((fastcall));
int foo2 (int n) {
return ++n;
}
Compiling the above with g++ -O3 -fomit-frame-pointer -c test.cpp, I get for foo1:
mov eax,DWORD PTR [esp+0x4]
add eax,0x1
ret
As you can see, it reads in the value from the stack.
And here's foo2:
lea eax,[ecx+0x1]
ret
Now it takes the value directly from the register.
Of course, if you inline the function the compiler will do a simple addition in the body of your larger function, regardless of the calling convention you specify. But when you can't get it inlined, this is going to happen.
Disclaimer 2: I am not saying that you should continually second-guess the compiler. It probably isn't practical or necessary in most cases. But don't assume it produces perfect code.
Edit 1: If you are talking about plain local variables (not function arguments), then yes, the compiler will allocate them in the registers or on the stack as it sees fit.
Edit 2: It appears that calling conventions are architecture-specific, and MIPS will pass the first four arguments in registers, as Richard Pennington has stated in his answer. So in your case you don't have to specify the extra attribute (which is in fact an x86-specific attribute).

Yes, a good, optimizing C/C++ compiler will optimize that. And even MUCH more. See here: Felix von Leitner's Compiler Survey.
A normal C/C++ compiler will not put every variable on the stack anyway. The problem with your foo() function could be that the variable could get passed via the stack to the function (the ABI of your system (hardware/OS) defines that).
With C's register keyword you can give the compiler a hint that it would probably be good to store a variable in a register (note that C++11 deprecated this use of the keyword and C++17 removed it). Sample:
register int x = 10;
But remember: The compiler is free not to store x in a register if it wants to!

The answer is yes, maybe. It depends on the compiler, the optimization level, and the target processor.
In the case of the mips, the first four parameters, if small, are passed in registers and the return value is returned in a register. So your example has no requirement to allocate anything on the stack.
Actually, truth is stranger than fiction. In your case the parameter is returned unchanged: the value returned is that of n before the ++ operator:
foo:
.frame $sp,0,$ra
.mask 0x00000000,0
.fmask 0x00000000,0
addu $2, $zero, $4
jr $ra
nop

Since your example foo function is an identity function (it just returns its argument), my C++ compiler (VS 2008) completely removes this function call. If I change it to:
int foo( int n ) {
return ++n;
}
the compiler inlines this with
lea edx, [eax+1]
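The difference between the two spellings of foo discussed in this thread comes down to which value the expression yields. A minimal sketch, with hypothetical function names:

```cpp
// Post-increment: the expression n++ yields the OLD value of n, so this
// function returns its argument unchanged (the "identity function" case).
int foo_post(int n) {
    return n++;
}

// Pre-increment: the expression ++n yields the NEW value, so this
// function returns its argument plus one.
int foo_pre(int n) {
    return ++n;
}
```

In foo_post the increment of the local copy is dead code, which is why a compiler can reduce the whole function to returning its argument.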

Yes, registers are used in C++. The MDR (memory data register) contains the data being fetched or stored. For example, to retrieve the contents of cell 123, we would load the value 123 (in binary) into the MAR (memory address register) and perform a fetch operation. When the operation is done, a copy of the contents of cell 123 would be in the MDR. To store the value 98 into cell 4, we load a 4 into the MAR and a 98 into the MDR and perform a store. When the operation is completed, the contents of cell 4 will have been set to 98, discarding whatever was there previously. The data and address registers work together to achieve this. In C++ too, when we initialize a variable with a value or read its value, the same thing happens.
And one more thing: modern compilers also perform register allocation, since accessing a register is faster than accessing memory.
