C++ CPU Register Usage

C++ CPU Register Usage - c++

In C++, local variables are always allocated on the stack. The stack is a part of the allowed memory that your application can occupy. That memory is kept in your RAM (if not swapped out to disk). Now, does a C++ compiler always create assembler code that stores local variables on the stack?
Take, for example, the following simple code:
int foo( int n ) {
return ++n;
}
In MIPS assembler code, this could look like this:
foo:
addi $v0, $a0, 1
jr $ra
As you can see, I didn't need to use the stack at all for n. Would the C++ compiler recognize that, and directly use the CPU's registers?
Edit: Wow, thanks a lot for your almost immediate and extensive answers! The function body of foo should of course be return ++n;, not return n++;. :)

Yes. There is no rule that "variables are always allocated on the stack". The C++ standard says nothing about a stack.It doesn't assume that a stack exists, or that registers exist. It just says how the code should behave, not how it should be implemented.
The compiler only stores variables on the stack when it has to - when they have to live past a function call for example, or if you try to take the address of them.
The compiler isn't stupid. ;)

Disclaimer: I don't know MIPS, but I do know some x86, and I think the principle should be the same..
In the usual function call convention, the compiler will push the value of n onto the stack to pass it to the function foo. However, there is the fastcall convention that you can use to tell gcc to pass the value through the registers instead. (MSVC also has this option, but I'm not sure what its syntax is.)
test.cpp:
int foo1 (int n) { return ++n; }
int foo2 (int n) __attribute__((fastcall));
int foo2 (int n) {
return ++n;
}
Compiling the above with g++ -O3 -fomit-frame-pointer -c test.cpp, I get for foo1:
mov eax,DWORD PTR [esp+0x4]
add eax,0x1
ret
As you can see, it reads in the value from the stack.
And here's foo2:
lea eax,[ecx+0x1]
ret
Now it takes the value directly from the register.
Of course, if you inline the function the compiler will do a simple addition in the body of your larger function, regardless of the calling convention you specify. But when you can't get it inlined, this is going to happen.
Disclaimer 2: I am not saying that you should continually second-guess the compiler. It probably isn't practical and necessary in most cases. But don't assume it produces perfect code.
Edit 1: If you are talking about plain local variables (not function arguments), then yes, the compiler will allocate them in the registers or on the stack as it sees fit.
Edit 2: It appears that calling convention is architecture-specific, and MIPS will pass the first four arguments on the stack, as Richard Pennington has stated in his answer. So in your case you don't have to specify the extra attribute (which is in fact an x86-specific attribute.)

Yes, a good, optimizing C/C++ will optimize that. And even MUCH more: See here: Felix von Leitners Compiler Survey.
A normal C/C++ compiler will not put every variable on the stack anyway. The problem with your foo() function could be that the variable could get passed via the stack to the function (the ABI of your system (hardware/OS) defines that).
With C's register keyword you can give the compiler a hint that it would probably be good to store a variable in a register. Sample:
register int x = 10;
But remember: The compiler is free not to store x in a register if it wants to!

The answer is yes, maybe. It depends on the compiler, the optimization level, and the target processor.
In the case of the mips, the first four parameters, if small, are passed in registers and the return value is returned in a register. So your example has no requirement to allocate anything on the stack.
Actually, truth is stranger than fiction. In your case the parameter is returned unchanged: the value returned is that of n before the ++ operator:
foo:
.frame $sp,0,$ra
.mask 0x00000000,0
.fmask 0x00000000,0
addu $2, $zero, $4
jr $ra
nop

Since your example foo function is an identity function (it just returns it's argument), my C++ compiler (VS 2008) completely removes this function call. If I change it to:
int foo( int n ) {
return ++n;
}
the compiler inlines this with
lea edx, [eax+1]

Yes, The registers are used in C++. The MDR (memory data registers) contains the data being fetched and stored. For example, to retrieve the contents of cell 123, we would load the value 123 (in binary) into the MAR and perform a fetch operation. When the operation is done, a copy of the contents of cell 123 would be in the MDR. To store the value 98 into cell 4, we load a 4 into the MAR and a 98 into the MDR and perform a store. When the operation is completed the contents of cell 4 will have been set to 98, by discarding whatever was there previously. The data & address registers work with them to achieve this. In C++ too, when we initialize a var with a value or ask its value, the same phenomena Happens.
And, One More Thing, Modern Compilers also perform Register Allocation, which is kinda faster than memory allocation.

Related

embed a functions assembly code in a struct

I've a rather special question: is it possible in C/++ (both because I am sure the question is the same in both languages) to specify a functions's location? Why? I have a very large list of function pointers, and I want to eliminate them.
(Currently) This looks like that(repeated over lika a million times, stored in the user's RAM):
struct {
int i;
void(* funptr)();
} test;
Because I know that in most assembly languages, functions are just "goto" directives, I had the following idea. Is it possible to optimize the above construct so that it looks like that?
struct {
int i;
// embed the assembler of the function here
// so that all the functions
// instructions are located here
// like this: mov rax, rbx
// jmp _start ; just demo code
} test2;
In the end, the thing should look like this in memory: An int holding any value, followed by the function's assembly code, referenced by test2. I should be able to call these functions like that: ((void(*)()) (&pointerToTheStruct + sizeof(int)))();
You might think that I'm insane to optimize the app that way, and I cannot disclose any more details on it's function, but if anyone has some pointers on how solve this problem, I would appreciate it.
I do not think that there is a standard way to this, so any hacky way to do this via inline assembler/other crazy things is also appreciated!

The only thing you really have to do is make the compiler aware of the (constant) value of the function pointer you want in the struct. The compiler will then (presumably/hopefully) inline that function call wherever it sees it called through that function pointer:
template<void(*FPtr)()>
struct function_struct {
int i;
static constexpr auto funptr = FPtr;
};
void testFunc()
{
volatile int x = 0;
}
using test = function_struct<testFunc>;
int main()
{
test::funptr();
}
Demo - no call or jmp after optimization.
It remains unclear what the point of the int i is. Note that the code is not technically "directly after the i" here, but it is even more unclear how you'd expect instances of the struct to look like (is the code in them or is it "static" in a way? I feel there is some misunderstanding here on your part what compilers actually produce...). But consider the ways that compiler inlining can help you and you might find the solution you need. If you're worried about executable size after inlining, tell the compiler and it will compromise between speed and size.

This sounds like a terrible idea for a lot of reasons that probably won't save memory, and will hurt performance by diluting L1I-cache with data and L1D-cache with code. And worse if you ever modify or copy objects: self-modifying code stalls.
But yes, this would be possible in C99/C11 with a flexible array member at the end of the struct, which you cast to a function pointer.
struct int_with_code {
int i;
char code[]; // C99 flexible array member. GNU extension in C++
// Store machine code here
// you can't get the compiler to do this for you. Good Luck!
};
void foo(struct int_with_code *p) {
// explicit C-style cast compiles as both C and C++
void (*funcp)(void) = ( void (*)(void) ) p->code;
funcp();
}
Compiler output from clang7.0, on the Godbolt compiler explorer is the same when compiled as either C or C++. This is targeting the x86-64 System V ABI, where the first function arg is passed in RDI.
# this is the code that *uses* such an object, not the code that goes in its code[]
# This proves that it compiles,
# without showing any way to get compiler-generated code into code[]
foo: # #foo
add rdi, 4 # move the pointer 4 bytes forward, to point at code[]
jmp rdi # TAILCALL
(If you leave out the (void) arg-type declaration in C, the compiler will zero AL first in the x86-64 SysV calling convention, in case its actually a variadic function, because it's passing no FP args in registers.)
You'd have to allocate your objects in memory that was executable (normally not done unless they're const with static storage), e.g. compile with gcc -zexecstack. Or use a custom mmap/mprotect or VirtualAlloc/VirtualProtect on POSIX or Windows.
Or if your objects are all statically allocated, it might be possible to massage compiler output to turn functions in the .text section into objects by adding an int member right before each one. Maybe with some .section and linker tricks, and maybe a linker script, you could even somehow automate it.
But unless they're all the same length (e.g. with padding like char code[60]), that won't form an array you can index, so you'll need some way of referencing all these variable-length object.
There are potentially huge performance downsides if you ever modify an object before calling its function: on x86 you'll get self-modifying-code pipeline nuke for executing code near a just-written memory location.
Or if you copied an object before calling its function: x86 pipeline flush, or on other ISAs you need to manually flush caches to get the I-cache in sync with D-cache (so the newly-written bytes can be executed). But you can't copy such objects because their size isn't stored anywhere. You can't search the machine code for a ret instruction, because a 0xc3 byte might appear somewhere that's not the start of an x86 instruction. Or on any ISA, the function might have multiple ret instructions (tail duplication optimization). Or end with a jmp instead of a ret (tailcall).
Storing a size would start to defeat the purpose of saving size, eating up at least an extra byte in each object.
Writing code to an object at runtime, then casting to a function pointer, is undefined behaviour in ISO C and C++. On GNU C/C++, make sure you call __builtin___clear_cache on it to sync caches or whatever else is necessary. Yes, this is needed even on x86 to disable dead-store elimination optimizations: see this test case. On x86 it's just a compile-time thing, no extra asm. It doesn't actually clear any caches.
If you do copy at runtime startup, maybe allocate a big chunk of memory and carve out variable-length chunks of it, while copying. If you malloc each separately, you're wasting memory-management overhead on it.
This idea will not save you memory unless you have about as many functions as you have objects
Normally you have a fairly limited number of actual functions, with many objects having copies of the same function pointer. (You've kind of hand-rolled C++ virtual functions, but with only one function you just have a function pointer directly instead of a vtable pointer to a table of pointers for that class type. One fewer levels of indirection, and apparently you're not passing the object's own address to the function.)
One of the several benefits of this level of indirection is that one pointer is usually significantly smaller than the entire code for a function. For that to not be the case, your functions would have to be tiny.
Example: with 10 different functions of 32 bytes each, and 1000 objects with function pointers, you have a total of 320 bytes of code (which will stay hot in I-cache), and 8000 bytes of function pointers. (And in your objects, another 4 bytes per object wasted on padding to align the pointer, making the total size 16 instead of 12 bytes per object.) Anyway, that's 16320 bytes total for entire structs + code. If you allocated each object separately, there's per-object bookkeeping.
With inlining machine code into each object, and no padding, that's 1000 * (4+32) = 36000 bytes, over twice the total size.
x86-64 is probably a best-case scenario, where a pointer is 8 bytes and x86-64 machine code uses a (famously complex) variable-length instruction encoding which allows for high code density in some cases, especially when optimizing for code-size. (e.g. code-golfing. https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code). But unless your functions are mostly something trivial like lea eax, [rdi + rdi*2] (3 bytes=opcode + ModRM + SIB) / ret (1 byte), they're still going to take more than 8 bytes. (That's return x*3; for a function that takes a 32-bit integer x arg, in the x86-64 System V ABI.)
If they're wrappers for larger functions, a normal call rel32 instruction is 5 bytes. A load of static data is at least 6 bytes (opcode + modrm + rel32 for a RIP-relative addressing mode, or loading EAX specifically can use the special no-modrm encoding for an absolute address. But in x86-64 that's a 64-bit absolute unless you use an address-size prefix too, potentially causing an LCP stall in the decoders on Intel. mov eax, [32 bit absolute address] = addr32 (0x67) + opcode + abs32 = 6 bytes again, so this is worse for no benefit).
Your function-pointer type doesn't have any args (assuming this is C++ where foo() means foo(void) in a declaration, not like old C where an empty arg list is somewhat similar to (...)). Thus we can assume you're not passing args, so to do anything useful the functions are probably accessing some static data or making another call.
Ideas that make more sense:
Use an ILP32 ABI like Linux x32, where the CPU runs in 64-bit mode but your code uses 32-bit pointers. This would make each of your objects only 8 bytes instead of 16. Avoiding pointer-bloat is a classic use-case for x32 or ILP32 ABIs in general.
Or (yuck) compile your code as 32-bit. But then you have obsolete 32-bit calling conventions that pass args on the stack instead of registers, and less than half the registers, and much higher overhead for position-independent code. (No EIP/RIP-relative addressing.)
Store an unsigned int table index to a table of function pointers. If you have 100 functions but 10k objects, the table is only 100 pointers long. In asm you could index an array of code directly (computed goto style) if all the functions were padded to the same length, but in C++ you can't do that. An extra level of indirection with a table of function pointers is probably your best bet.
e.g.
void (*const fptrs[])(void) = {
func1, func2, func3, ...
};
struct int_with_func {
int i;
unsigned f;
};
void bar(struct int_with_func *p) {
fptrs[p->f] ();
}
clang/gcc -O3 output:
bar(int_with_func*):
mov eax, dword ptr [rdi + 4] # load p->f
jmp qword ptr [8*rax + fptrs] # TAILCALL # index the global table with it for a memory-indirect jmp
If you were compiling a shared library, PIE executable, or not targeting Linux, the compiler couldn't use a 32-bit absolute address to index a static array with one instruction. So there'd be a RIP-relative LEA in there and something like jmp [rcx+rax*8].
This is an extra level of indirection vs. storing a function pointer in each object, but it lets you shrink each object to 8 bytes, down from 16, like using 32-bit pointers. Or to 5 or 6 bytes, if you use an unsigned short or uint8_t and pack the structs with __attribute__((packed)) in GNU C.

No, not really.
The way to specify a function's location is to use a function pointer, which you're already doing.
You could make different types which have their own different member functions, but then you're back to the original problem.
I have in the past experimented with auto-generating (as a pre-build step, using Python) a function with a long switch statement that does the work of mapping int i to a normal function call. This gets rid of the function pointers, at the expense of branching. I don't remember whether it ended up being worthwhile in my case and, even if I did, that wouldn't tell us whether it's worthwhile in your case.
Because I know that in most assembly languages, functions are just "goto" directives
Well, it's perhaps a little more complicated than that…
You might think that I'm insane to optimize the app that way
Perhaps. Trying to eliminate indirection is not, in itself, a bad thing, so I don't think you're wrong to try to improve this. I just don't think that you necessarily can.
but if anyone has some pointers
lol

I don't understand the goal of this "optimization" is it about saving the memory?
I might be misunderstanding the question, but if you just replace your function pointer with a regular function, then you'll have your struct only containing the int as data and the function-pointer being inserted by the compiler when you take the address of it, instead of stored in memory.
So just do
struct {
int i;
void func();
} test;
Then sizeof(test)==sizeof(int) should hold true if you set alignment/packing to be tight.

Where are expressions and constants stored if not in memory?

From C Programming Language by Brian W. Kernighan
& operator only applies to objects in memory: variables and array
elements. It cannot be applied to expressions, constants or register
variables.
Where are expressions and constants stored if not in memory?
What does that quote mean?
E.g:
&(2 + 3)
Why can't we take its address? Where is it stored?
Will the answer be same for C++ also since C has been its parent?
This linked question explains that such expressions are rvalue objects and all rvalue objects do not have addresses.
My question is where are these expressions stored such that their addresses can't be retrieved?

Consider the following function:
unsigned sum_evens (unsigned number) {
number &= ~1; // ~1 = 0xfffffffe (32-bit CPU)
unsigned result = 0;
while (number) {
result += number;
number -= 2;
}
return result;
}
Now, let's play the compiler game and try to compile this by hand. I'm going to assume you're using x86 because that's what most desktop computers use. (x86 is the instruction set for Intel compatible CPUs.)
Let's go through a simple (unoptimized) version of how this routine could look like when compiled:
sum_evens:
and edi, 0xfffffffe ;edi is where the first argument goes
xor eax, eax ;set register eax to 0
cmp edi, 0 ;compare number to 0
jz .done ;if edi = 0, jump to .done
.loop:
add eax, edi ;eax = eax + edi
sub edi, 2 ;edi = edi - 2
jnz .loop ;if edi != 0, go back to .loop
.done:
ret ;return (value in eax is returned to caller)
Now, as you can see, the constants in the code (0, 2, 1) actually show up as part of the CPU instructions! In fact, 1 doesn't show up at all; the compiler (in this case, just me) already calculates ~1 and uses the result in the code.
While you can take the address of a CPU instruction, it often makes no sense to take the address of a part of it (in x86 you sometimes can, but in many other CPUs you simply cannot do this at all), and code addresses are fundamentally different from data addresses (which is why you cannot treat a function pointer (a code address) as a regular pointer (a data address)). In some CPU architectures, code addresses and data addresses are completely incompatible (although this is not the case of x86 in the way most modern OSes use it).
Do notice that while (number) is equivalent to while (number != 0). That 0 doesn't show up in the compiled code at all! It's implied by the jnz instruction (jump if not zero). This is another reason why you cannot take the address of that 0 — it doesn't have one, it's literally nowhere.
I hope this makes it clearer for you.

where are these expressions stored such that there addresses can't be retrieved?
Your question is not well-formed.
Conceptually
It's like asking why people can discuss ownership of nouns but not verbs. Nouns refer to things that may (potentially) be owned, and verbs refer to actions that are performed. You can't own an action or perform a thing.
In terms of language specification
Expressions are not stored in the first place, they are evaluated.
They may be evaluated by the compiler, at compile time, or they may be evaluated by the processor, at run time.
In terms of language implementation
Consider the statement
int a = 0;
This does two things: first, it declares an integer variable a. This is defined to be something whose address you can take. It's up to the compiler to do whatever makes sense on a given platform, to allow you to take the address of a.
Secondly, it sets that variable's value to zero. This does not mean an integer with value zero exists somewhere in your compiled program. It might commonly be implemented as
xor eax,eax
which is to say, XOR (exclusive-or) the eax register with itself. This always results in zero, whatever was there before. However, there is no fixed object of value 0 in the compiled code to match the integer literal 0 you wrote in the source.
As an aside, when I say that a above is something whose address you can take - it's worth pointing out that it may not really have an address unless you take it. For example, the eax register used in that example doesn't have an address. If the compiler can prove the program is still correct, a can live its whole life in that register and never exist in main memory. Conversely, if you use the expression &a somewhere, the compiler will take care to create some addressable space to store a's value in.
Note for comparison that I can easily choose a different language where I can take the address of an expression.
It'll probably be interpreted, because compilation usually discards these structures once the machine-executable output replaces them. For example Python has runtime introspection and code objects.
Or I can start from LISP and extend it to provide some kind of addressof operation on S-expressions.
The key thing they both have in common is that they are not C, which as a matter of design and definition does not provide those mechanisms.

Such expressions end up part of the machine code. An expression 2 + 3 likely gets translated to the machine code instruction "load 5 into register A". CPU registers don't have addresses.

It does not really make sense to take the address to an expression. The closest thing you can do is a function pointer. Expressions are not stored in the same sense as variables and objects.
Expressions are stored in the actual machine code. Of course you could find the address where the expression is evaluated, but it just don't make sense to do it.
Read a bit about assembly. Expressions are stored in the text segment, while variables are stored in other segments, such as data or stack.
https://en.wikipedia.org/wiki/Data_segment
Another way to explain it is that expressions are cpu instructions, while variables are pure data.
One more thing to consider: The compiler often optimizes away things. Consider this code:
int x=0;
while(x<10)
x+=1;
This code will probobly be optimized to:
int x=10;
So what would the address to (x+=1) mean in this case? It is not even present in the machine code, so it has - by definition - no address at all.

Where are expressions and constants stored if not in memory
In some (actually many) cases, a constant expression is not stored at all. In particular, think about optimizing compilers, and see CppCon 2017: Matt Godbolt's talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”
In your particular case of some C code having 2 + 3, most optimizing compilers would have constant folded that into 5, and that 5 constant might be just inside some machine code instruction (as some bitfield) of your code segment and not even have a well defined memory location. If that constant 5 was a loop limit, some compilers could have done loop unrolling, and that constant won't appear anymore in the binary code.
See also this answer, etc...
Be aware that C11 is a specification written in English. Read its n1570 standard. Read also the much bigger specification of C++11 (or later).
Taking the address of a constant is forbidden by the semantics of C (and of C++).

How is it known that variables are in registers, or on stack?

I am reading this question about inline on isocpp FAQ, the code is given as
void f()
{
int x = /*...*/;
int y = /*...*/;
int z = /*...*/;
// ...code that uses x, y and z...
g(x, y, z);
// ...more code that uses x, y and z...
}
then it says that
Assuming a typical C++ implementation that has registers and a stack,
the registers and parameters get written to the stack just before the
call to g(), then the parameters get read from the stack inside
g() and read again to restore the registers while g() returns to
f(). But that’s a lot of unnecessary reading and writing, especially
in cases when the compiler is able to use registers for variables x,
y and z: each variable could get written twice (as a register and
also as a parameter) and read twice (when used within g() and to
restore the registers during the return to f()).
I have a big difficulty understanding the paragraph above. I try to list my questions as below:
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data? (I know this question is not particularly related to C++, but understanding this will be helpful to understand how C++ works.)
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
PS
It's very hard to choose an acceptable answer when the answers are all very good(E.g., the ones provided by #MatsPeterson, #TheodorosChatzigiannakis, and #superultranova) I think. I personally like the one by #Potatoswatter a little bit more since the answer offers some guidelines.

Don't take that paragraph too seriously. It seems to be making excessive assumptions and then going into excessive detail, which can't really be generalized.
But, your questions are very good.
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data? (I know this question is not particularly related to C++, but understanding this will be helpful to understand how C++ works.)
More-or-less, everything needs to be loaded into registers. Most computers are organized around a datapath, a bus connecting the registers, the arithmetic circuits, and the top level of the memory hierarchy. Usually, anything that is broadcast on the datapath is identified with a register.
You may recall the great RISC vs CISC debate. One of the key points was that a computer design can be much simpler if the memory is not allowed to connect directly to the arithmetic circuits.
In modern computers, there are architectural registers, which are a programming construct like a variable, and physical registers, which are actual circuits. The compiler does a lot of heavy lifting to keep track of physical registers while generating a program in terms of architectural registers. For a CISC instruction set like x86, this may involve generating instructions that send operands in memory directly to arithmetic operations. But behind the scenes, it's registers all the way down.
Bottom line: Just let the compiler do its thing.
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
Each platform defines a way for C functions to call each other. Passing parameters in registers is more efficient. But, there are trade-offs and the total number of registers is limited. Older ABIs more often sacrificed efficiency for simplicity, and put them all on the stack.
Bottom line: The example is arbitrarily assuming a naive ABI.
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
The compiler tends to prefer to use registers for more frequently accessed values. Nothing in the example requires the use of the stack. However, less frequently accessed values will be placed on the stack to make more registers available.
Only when you take the address of a variable, such as by &x or passing by reference, and that address escapes the inliner, is the compiler required use memory and not registers.
Bottom line: Avoid taking addresses and passing/storing them willy-nilly.

It is entirely up to the compiler (in conjunction with the processor type) whether a variable is stored in memory or a register [or in some cases more than one register] (and what options you give the compiler, assuming it's got options to decide such things - most "good" compilers do). For example, the LLVM/Clang compiler uses a specific optimisation pass called "mem2reg" that moves variables from memory to registers. The decision to do so is based on how the variable(s) are used - for example, if you take the address of a variable at some point, it needs to be in memory.
Other compilers have similar, but not necessarily identical, functionality.
Also, at least in compilers that have some semblance of portability, there will ALSO be a phase of generatinc machine code for the actual target, which contains target-specific optimisations, which again can move a variable from memory to a register.
It is not possible [without understanding how the particular compiler works] to determine if the variables in your code are in registers or in memory. One can guess, but such a guess is just like guessing other "kind of predictable things", like looking out the window to guess if it's going to rain in a few hours - depending on where you live, this may be a complete random guess, or quite predictable - some tropical countries, you can set your watch based on when the rain arrives each afternoon, in other countries, it rarely rains, and in some countries, like here in England, you can't know for certain beyond "right now it is [not] raining right here".
To answer the actual questions:
This depends on the processor. Proper RISC processors such as ARM, MIPS, 29K, etc have no instructions that use memory operands except the load and store type instructions. So if you need to add two values, you need to load the values into registers, and use the add operation on those registers. Some, such as x86 and 68K allows one of the two operands to be a memory operand, and for example PDP-11 and VAX have "full freedom", whether your operands are in memory or register, you can use the same instruction, just different addressing modes for the different operands.
Your original premise here is wrong - it's not guaranteed that arguments to g are on the stack. That is just one of many options. Many ABIs (application binary interface, aka "calling conventions) use registers for the first few arguments to a function. So, again, it depends on which compiler (to some degree) and what processor (much more than which compiler) the compiler targets whether the arguments are in memory or in registers.
Again, this is a decision that the compiler makes - it depends on how many registers the processor has, which are available, what the cost is if "freeing" some register for x, y and z - which ranges from "no cost at all" to "quite a bit" - again, depending on the processor model and the ABI.

For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data?
Not even this statement is always true. It is probably true for all the platforms you'll ever work with, but there surely can be another architecture that doesn't make use of processor registers at all.
Your x86_64 computer does however.
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
These two questions cannot be uniquely answered for any compiler and system your code will be compiled on. They cannot even be taken for granted since g's parameters might not be on the stack, it all depends on several concepts I'll explain below.
First you should be aware of the so-called calling conventions which define, among the other things, how function parameters are passed (e.g. pushed on the stack, placed in registers, or a mix of both). This isn't enforced by the C++ standard and calling conventions are a part of the ABI, a broader topic regarding low-level machine code program issues.
Secondly register allocation (i.e. which variables are actually loaded in a register at any given time) is a complex task and a NP-complete problem. Compilers try to do their best with the information they have. In general less frequently accessed variables are put on the stack while more frequently accessed variables are kept on registers. Thus the part Where the data inside g() is stored, register or stack? cannot be answered once-and-for-all since it depends on many factors including register pressure.
Not to mention compiler optimizations which might even eliminate the need for some variables to be around.
Finally the question you linked already states
Naturally your mileage may vary, and there are a zillion variables that are outside the scope of this particular FAQ, but the above serves as an example of the sorts of things that can happen with procedural integration.
i.e. the paragraph you posted makes some assumptions to set things up for an example. Those are just assumptions and you should treat them as such.
As a small addition: regarding the benefits of inline on a function I recommend taking a look at this answer: https://stackoverflow.com/a/145952/1938163

You can't know, without looking at the assembly language, whether a variable is in a register, stack, heap, global memory or elsewhere. A variable is an abstract concept. The compiler is allowed to use registers or other memory as it chooses, as long as the execution isn't changed.
There's also another rule that affects this topic. If you take the address of a variable and store into a pointer, the variable may not be placed into a register because registers don't have addresses.
The variable storage may also depend on the optimization settings for the compiler. Variables can disappear due to simplification. Variables that don't change value may be placed into the executable as a constant.

Regarding your #1 question, yes, non load/store instructions operate on registers.
Regarding your #2 question, if we are assuming that parameters are passed on the stack, then we have to write the registers to the stack, otherwise g() won't be able to access the data, since the code in g() doesn't "know" which registers the parameters are in.
Regarding your #3 question, it is not known that x, y and z will for sure be stored in registers in f(). One could use the register keyword, but that's more of a suggestion. Based on the calling convention, and assuming the compiler doesn't do any optimization involving parameter passing, you may be able to predict whether the parameters are on the stack or in registers.
You should familiarize yourself with calling conventions. Calling conventions deal with the way that parameters are passed to functions and typically involve passing parameters on the stack in a specified order, putting parameters into registers or a combination of both.
stdcall, cdecl, and fastcall are some examples of calling conventions. In terms of parameter passing, stdcall and cdecl are the same, in the parameters are pushed in right to left order onto the stack. In this case, if g() was cdecl or stdcall the caller would push z,y,x in that order:
mov eax, z
push eax
mov eax, x
push eax
mov eax, y
push eax
call g
In 64bit fastcall, registers are used, microsoft uses RCX, RDX, R8, R9 (plus the stack for functions requiring more than 4 params), linux uses RDI, RSI, RDX, RCX, R8, R9. To call g() using MS 64bit fastcall one would do the following (we assume z, x, and y are not in registers)
mov rcx, x
mov rdx, y
mov r8, z
call g
This is how assembly is written by humans, and sometimes compilers. Compilers will use some tricks to avoid passing parameters, as it typically reduces the number of instructions and can reduce the number of time memory is accessed. Take the following code for example (I'm intentionally ignoring non-volatile register rules):
f:
xor rcx, rcx
mov rsi, x
mov r8, z
mov rdx y
call g
mov rcx, rax
ret
g:
mov rax, rsi
add rax, rcx
add rax, rdx
ret
For illustrative purposes, rcx is already in use, and x has been loaded into rsi. The compiler can compile g such that it uses rsi instead of rcx, so values don't have to be swapped between the two registers when it comes time to call g. The compiler could also inline g, now that f and g share the same set of registers for x, y, and z. In that case, the call g instruction would be replaced with the contents of g, excluding the ret instruction.
f:
xor rcx, rcx
mov rsi, x
mov r8, z
mov rdx y
mov rax, rsi
add rax, rcx
add rax, rdx
mov rcx, rax
ret
This will be even faster, because we don't have to deal with the call instruction, since g has been inlined into f.

Short answer: You can't. It completely depends on your compiler and the optimizing features enabled.
The compiler concern is to translate into assembly your program, but how it is done is tighly coupled to how your compiler works.
Some compilers allows you hint what variable map to register.
Check for example this: https://gcc.gnu.org/onlinedocs/gcc/Global-Reg-Vars.html
Your compiler will apply transformations to your code in order to gain something, may be performance, may be lower code size, and it apply cost functions to estimate this gains, so you normally only can see the result disassembling the compilated unit.

Variables are almost always stored in main memory. Many times, due to compiler optimizations, value of your declared variable will never move to main memory but those are intermediate variable that you use in your method which doesn't hold relevance before any other method is called (i.e. occurrence of stack operation).
This is by design - to improve performance as it is easier (and much faster) for processor to address and manipulate data in registers. Architectural registers are limited in size so everything cannot be put in registers. Even if you 'hint' your compiler to put it in register, eventually, OS may manage it outside register, in main memory, if available registers are full.
Most probably, a variable will be in main memory because it hold relevance further in the near execution and may hold reliance for longer period of CPU time. A variable is in architectural register because it holds relevance in upcoming machine instructions and execution will be almost immediate but may not be relevant for long.

For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data?
This depends on the architecture and the instruction set it offers. But in practice, yes - it is the typical case.
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
Assuming the compiler doesn't eliminate the local variables, it will prefer to put them in registers, because registers are faster than the stack (which resides in the main memory, or a cache).
But this is far from a universal truth: it depends on the (complicated) inner workings of the compiler (whose details are handwaved in that paragraph).
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
Even if we assume that the variables are, in fact, stored in the registers, when you call a function, the calling convention kicks in. That's a convention that describes how a function is called, where the arguments are passed, who cleans up the stack, what registers are preserved.
All calling conventions have some kind of overhead. One source of this overhead is the argument passing. Many calling conventions attempt to reduce that, by preferring to pass arguments through registers, but since the number of CPU registers is limited (compared to the space of the stack), they eventually fall back to pushing through the stack after a number of arguments.
The paragraph in your question assumes a calling convention that passes everything through the stack and based on that assumption, what it's trying to tell you is that it would be beneficial (for execution speed) if we could "copy" (at compile time) the body of the called function inside the caller (instead of emitting a call to the function). This would yield the same results logically, but it would eliminate the runtime cost of the function call.

C++ what happens to a value that is returned but is not stored?

Say I have a function that returns an int. I don't store the value from the function call. I presume that it is not stored in memory and fades into the aether, but I don't know.
Thank you.

An int return value will normally be stored in a register (e.g., EAX or RAX on 32-bit or 64-bit Intel, respectively).
It won't fade. It'll simply be overwritten when the compiler needs that register for some other purpose. If the function in question is expanded inline, the compiler may detect that the value isn't used, and elide the code to write or compute the value at all.

may a whole array reside in some cpu register?

As I'm not too much familiar with cpu registers, in general and in any architecture specially x86 and if compiler-relevant using VC++ I'm curious that is it possible for all elements of an array with a tiny number of elements like an array of 1-byte characters with 4 elements to reside in some cpu register as I know this could be true for single primitives like double, integer, etc ?
when we have a parameter like below:
void someFunc(char charArray[4]){
//whatever
}
Will this parameter passing be definitely done through passing a pointer to the function or that array would be residing in some cpu register eliminating the need to pass a pointer to main memory?

This is not compiler dependent, nor is it possible. Arrays cannot be passed by value in the same way as other types, i.e. they cannot be copied when passed into a function. The C++ standard is clear in that when processing a function signature in a declaration the following are exact equivalencies:
void foo( char *a );
void foo( char a[] );
void foo( char a[4] );
void foo( char a[ 100000 ] );
A compliant compiler will convert the array in the function signature into a pointer. Now, at the place of call, a similar operation takes place: if the argument is an array, the compiler has to decay it into a pointer to the first element. Again, the size of the array is lost in the decay.
Specific registers can be used to hold more than one value and perform operations on them (google for vectorized operations, MME and variants). But while that means that the compiler can actually insert the contents of a small array into a single register, that cannot be used to change the function call that you refer to.

Within a single function, an array could be held in one or more registers, just so long as the compiler is able to produce CPU instructions to manipulate it as the code dictates. The standard doesn't really define what it means for something to "be" in a register. It's a private matter between the compiler and the debugger, and there may be a fine line between something being in a register, and being "optimized away" entirely.
In your example, the parameter is a pointer, not an array (see dribeas' answer). So it would be unusual that the array it points to could possibly be held a register. The "main" architectures that you probably deal with don't allow a pointer to a register, so even if the array was held in a register in the calling code, it would have to be written into memory in order to take a pointer to it, to pass to the callee.
If the function call was inlined, then better optimizations might be possible, just as if there were no call at all.
If you wrap your array in a struct, then you turn it into something that can be passed by value:
struct Foo {
char a[4];
};
void FooFunc(Foo f) {
// whatever
}
Now, the function is taking the actual array data as its parameter, so there's one less barrier to holding it in a register. Whether the implementation's calling convention actually does pass small structs in registers is another question, though. I don't know what calling conventions do this, if any.

Out of the 5 or so compilers I'm fairly familiar with, (Borland/Turbo C/C++ from 1.0, Watcom C/C++ from v8.0, MSC from 5.0, IBM Visual Age C/C++, gcc of various versions on DOS, Linux and Windows) I've not seen this optimization happen naturally.
There was a string library, whose name I cannot remember, that did optimizations similar to this in x86 ASM. It may have been part of the "Spontaneous Assembly" library, but no guarantees.

A function that accepts an array is probably going to index into that array. I know of no architecture that supports efficient indexing into a register, so it's probably pointless to pass arrays in registers.
(On an x86 architecture, you could access a[0] and a[1] by accessing al and ah of the eax register, but that is a special case that only works if the indexes are known at compile time.)

You asked if its possible with VC++ on an x86.
I doubt it's possible in that configuration. True, you could produce assembler code where that array is kept in a register, but due to the nature of arrays it would be by no means a natural optimization for a compiler, so I doubt they put it in.
You can try it out though and produce some code where the compiler would have an "incentive" to put it in a register, but it would look pretty weird like
char x[4];
*((int*)x) = 36587467;
Compile that with optimizations and the /FA switch and look at the assembler code produced (and then tell us the results :-))
If you use it in a more "natural" way, like accessing single characters or initializing it with a string there is no reason at all for the compiler to put that array into a register.
Even when passing it to a function - the compiler might put the address of the array into the register, but not the array itself

Only variables can be stored in a register. You can try to force register storage by using the register keyword: register int i;
Arrays are by default pointers.
You can get the value located at the 4 position like this (using pointer syntax):
char c = *(charArray + 4);

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js