embed a function's assembly code in a struct - c++

I have a rather special question: is it possible in C/C++ (I am sure the question is the same in both languages) to specify a function's location? Why? I have a very large list of function pointers, and I want to eliminate them.
Currently this looks like the following (repeated over a million times, stored in the user's RAM):
struct {
    int i;
    void (*funptr)();
} test;
Because I know that in most assembly languages, functions are just "goto" directives, I had the following idea. Is it possible to optimize the above construct so that it looks like this?
struct {
    int i;
    // embed the assembler of the function here
    // so that all the function's
    // instructions are located here
    // like this: mov rax, rbx
    // jmp _start ; just demo code
} test2;
In the end, the thing should look like this in memory: an int holding any value, followed by the function's assembly code, referenced by test2. I should be able to call these functions like this: ((void(*)())((char*)pointerToTheStruct + sizeof(int)))();
You might think that I'm insane to optimize the app that way, and I cannot disclose any more details on its function, but if anyone has some pointers on how to solve this problem, I would appreciate it.
I do not think that there is a standard way to do this, so any hacky way to do it via inline assembler/other crazy things is also appreciated!

The only thing you really have to do is make the compiler aware of the (constant) value of the function pointer you want in the struct. The compiler will then (presumably/hopefully) inline that function call wherever it sees it called through that function pointer:
template<void (*FPtr)()>
struct function_struct {
    int i;
    static constexpr auto funptr = FPtr;
};
void testFunc()
{
    volatile int x = 0;
}
using test = function_struct<testFunc>;
int main()
{
    test::funptr();
}
Demo - no call or jmp after optimization.
It remains unclear what the point of the int i is. Note that the code is not technically "directly after the i" here, and it is even more unclear what you'd expect instances of the struct to look like (is the code in them, or is it "static" in a way? I feel there is some misunderstanding on your part about what compilers actually produce...). But consider the ways that compiler inlining can help you, and you might find the solution you need. If you're worried about executable size after inlining, tell the compiler and it will compromise between speed and size.

This sounds like a terrible idea for a lot of reasons that probably won't save memory, and will hurt performance by diluting L1I-cache with data and L1D-cache with code. And worse if you ever modify or copy objects: self-modifying code stalls.
But yes, this would be possible in C99/C11 with a flexible array member at the end of the struct, which you cast to a function pointer.
struct int_with_code {
    int i;
    char code[];  // C99 flexible array member; a GNU extension in C++
    // Store machine code here.
    // You can't get the compiler to do this for you. Good luck!
};
void foo(struct int_with_code *p) {
    // explicit C-style cast compiles as both C and C++
    void (*funcp)(void) = (void (*)(void)) p->code;
    funcp();
}
Compiler output from clang 7.0 on the Godbolt compiler explorer is the same when compiled as either C or C++. This is targeting the x86-64 System V ABI, where the first function arg is passed in RDI.
# this is the code that *uses* such an object, not the code that goes in its code[]
# This proves that it compiles,
# without showing any way to get compiler-generated code into code[]
foo:                 # @foo
    add rdi, 4       # move the pointer 4 bytes forward, to point at code[]
    jmp rdi          # TAILCALL
(If you leave out the (void) arg-type declaration in C, the compiler will zero AL first in the x86-64 SysV calling convention, in case it's actually a variadic function, because it's passing no FP args in registers.)
You'd have to allocate your objects in memory that's executable (normally not done unless they're const with static storage), e.g. by compiling with gcc -z execstack. Or use a custom mmap/mprotect (POSIX) or VirtualAlloc/VirtualProtect (Windows) allocation.
Or if your objects are all statically allocated, it might be possible to massage compiler output to turn functions in the .text section into objects by adding an int member right before each one. Maybe with some .section and linker tricks, and maybe a linker script, you could even somehow automate it.
But unless they're all the same length (e.g. padded with something like char code[60]), that won't form an array you can index, so you'll need some way of referencing all these variable-length objects.
There are potentially huge performance downsides if you ever modify an object before calling its function: on x86 you'll get a self-modifying-code pipeline nuke for executing code near a just-written memory location.
Or if you copied an object before calling its function: x86 pipeline flush, or on other ISAs you need to manually flush caches to get the I-cache in sync with D-cache (so the newly-written bytes can be executed). But you can't copy such objects because their size isn't stored anywhere. You can't search the machine code for a ret instruction, because a 0xc3 byte might appear somewhere that's not the start of an x86 instruction. Or on any ISA, the function might have multiple ret instructions (tail duplication optimization). Or end with a jmp instead of a ret (tailcall).
Storing a size would start to defeat the purpose of saving size, eating up at least an extra byte in each object.
Writing code to an object at runtime, then casting to a function pointer, is undefined behaviour in ISO C and C++. On GNU C/C++, make sure you call __builtin___clear_cache on it to sync caches or whatever else is necessary. Yes, this is needed even on x86 to disable dead-store elimination optimizations: see this test case. On x86 it's just a compile-time thing, no extra asm. It doesn't actually clear any caches.
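As a concrete illustration (my sketch, not from the question; assumes Linux/x86-64 with GCC or Clang): allocate writable+executable memory, copy raw machine-code bytes in, sync caches, then call it. The byte string encodes mov eax, 42 ; ret. Hardened W^X systems may refuse the PROT_WRITE|PROT_EXEC mapping, and calling the result is still UB in ISO C/C++.
#include <cstring>
#include <sys/mman.h>

int call_copied_code() {
    static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 }; // mov eax, 42 ; ret
    void *mem = mmap(nullptr, sizeof(code), PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    std::memcpy(mem, code, sizeof(code));                              // write the code bytes
    __builtin___clear_cache((char *)mem, (char *)mem + sizeof(code));  // GNU extension: sync I-cache with D-cache
    int result = ((int (*)())mem)();                                   // cast the buffer to a function pointer and call it
    munmap(mem, sizeof(code));
    return result;
}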
If you do copy at runtime startup, maybe allocate a big chunk of memory and carve out variable-length chunks of it, while copying. If you malloc each separately, you're wasting memory-management overhead on it.
This idea will not save you memory unless you have about as many functions as you have objects.
Normally you have a fairly limited number of actual functions, with many objects holding copies of the same function pointer. (You've kind of hand-rolled C++ virtual functions, but with only one function you just have a function pointer directly instead of a vtable pointer to a table of pointers for that class type. One fewer level of indirection, and apparently you're not passing the object's own address to the function.)
One of the several benefits of this level of indirection is that one pointer is usually significantly smaller than the entire code for a function. For that to not be the case, your functions would have to be tiny.
Example: with 10 different functions of 32 bytes each, and 1000 objects with function pointers, you have a total of 320 bytes of code (which will stay hot in I-cache), and 8000 bytes of function pointers. (And in your objects, another 4 bytes per object wasted on padding to align the pointer, making the total size 16 instead of 12 bytes per object.) Anyway, that's 16320 bytes total for entire structs + code. If you allocated each object separately, there's per-object bookkeeping.
With inlining machine code into each object, and no padding, that's 1000 * (4+32) = 36000 bytes, over twice the total size.
x86-64 is probably a best-case scenario, where a pointer is 8 bytes and x86-64 machine code uses a (famously complex) variable-length instruction encoding which allows for high code density in some cases, especially when optimizing for code-size. (e.g. code-golfing. https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code). But unless your functions are mostly something trivial like lea eax, [rdi + rdi*2] (3 bytes=opcode + ModRM + SIB) / ret (1 byte), they're still going to take more than 8 bytes. (That's return x*3; for a function that takes a 32-bit integer x arg, in the x86-64 System V ABI.)
If they're wrappers for larger functions, a normal call rel32 instruction is 5 bytes. A load of static data is at least 6 bytes (opcode + modrm + rel32 for a RIP-relative addressing mode, or loading EAX specifically can use the special no-modrm encoding for an absolute address. But in x86-64 that's a 64-bit absolute unless you use an address-size prefix too, potentially causing an LCP stall in the decoders on Intel. mov eax, [32 bit absolute address] = addr32 (0x67) + opcode + abs32 = 6 bytes again, so this is worse for no benefit).
Your function-pointer type doesn't take any args (assuming this is C++, where foo() means foo(void) in a declaration, unlike old C where an empty arg list is somewhat similar to (...)). Thus we can assume you're not passing args, so to do anything useful the functions are probably accessing some static data or making another call.
Ideas that make more sense:
Use an ILP32 ABI like Linux x32, where the CPU runs in 64-bit mode but your code uses 32-bit pointers. This would make each of your objects only 8 bytes instead of 16. Avoiding pointer-bloat is a classic use-case for x32 or ILP32 ABIs in general.
Or (yuck) compile your code as 32-bit. But then you have obsolete 32-bit calling conventions that pass args on the stack instead of registers, and less than half the registers, and much higher overhead for position-independent code. (No EIP/RIP-relative addressing.)
Store an unsigned int table index to a table of function pointers. If you have 100 functions but 10k objects, the table is only 100 pointers long. In asm you could index an array of code directly (computed goto style) if all the functions were padded to the same length, but in C++ you can't do that. An extra level of indirection with a table of function pointers is probably your best bet.
e.g.
void (*const fptrs[])(void) = {
    func1, func2, func3, ...
};
struct int_with_func {
    int i;
    unsigned f;
};
void bar(struct int_with_func *p) {
    fptrs[p->f]();
}
clang/gcc -O3 output:
bar(int_with_func*):
    mov eax, dword ptr [rdi + 4]    # load p->f
    jmp qword ptr [8*rax + fptrs]   # TAILCALL: index the global table with it for a memory-indirect jmp
If you were compiling a shared library, PIE executable, or not targeting Linux, the compiler couldn't use a 32-bit absolute address to index a static array with one instruction. So there'd be a RIP-relative LEA in there and something like jmp [rcx+rax*8].
This is an extra level of indirection vs. storing a function pointer in each object, but it lets you shrink each object to 8 bytes, down from 16, like using 32-bit pointers. Or to 5 or 6 bytes, if you use an unsigned short or uint8_t and pack the structs with __attribute__((packed)) in GNU C.
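For instance, a minimal sketch of that packed 5-byte layout (GNU C/C++; the struct and field names here are mine):
struct __attribute__((packed)) int_with_small_func {
    int i;            // 4 bytes
    unsigned char f;  // 1-byte index into fptrs: 5 bytes per object, no padding
};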

No, not really.
The way to specify a function's location is to use a function pointer, which you're already doing.
You could make different types which have their own different member functions, but then you're back to the original problem.
I have in the past experimented with auto-generating (as a pre-build step, using Python) a function with a long switch statement that does the work of mapping int i to a normal function call. This gets rid of the function pointers, at the expense of branching. I don't remember whether it ended up being worthwhile in my case and, even if I did, that wouldn't tell us whether it's worthwhile in your case.
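A hypothetical sketch of such a generated dispatcher (func0/func1/func2 stand in for whatever real functions the generator emits); with enough dense cases, compilers typically turn the switch into a jump table:
void func0();
void func1();
void func2();

void dispatch(int i) {
    switch (i) {
        case 0: func0(); break;
        case 1: func1(); break;
        case 2: func2(); break;
        default: break;  // unknown tag: do nothing
    }
}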
Because I know that in most assembly languages, functions are just "goto" directives
Well, it's perhaps a little more complicated than that…
You might think that I'm insane to optimize the app that way
Perhaps. Trying to eliminate indirection is not, in itself, a bad thing, so I don't think you're wrong to try to improve this. I just don't think that you necessarily can.
but if anyone has some pointers
lol

I don't understand the goal of this "optimization". Is it about saving memory?
I might be misunderstanding the question, but if you just replace your function pointer with a regular member function, then your struct contains only the int as data, and the function's address is supplied by the compiler wherever you call it, instead of being stored in memory.
So just do
struct {
    int i;
    void func();
} test;
Then sizeof(test)==sizeof(int) should hold true if you set alignment/packing to be tight.
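A quick way to check that claim (my sketch, using a named struct; it assumes a typical ABI where a lone int needs no padding and non-virtual member functions occupy no object storage):
struct test_t {
    int i;
    void func();   // declared, but takes no space in the object
};
static_assert(sizeof(test_t) == sizeof(int), "only the int is stored per object");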

Related

Is copying in a loop less efficient than memcpy()?

I started studying IT and I am discussing with a friend right now whether this code is inefficient or not.
// const char *pName
// char *m_pName = nullptr;
for (int i = 0; i < strlen(pName); i++)
    m_pName[i] = pName[i];
He is claiming that, for example, memcpy would do the same as the for loop above. I wonder if that's true; I don't believe it.
If there are more efficient ways or if this is inefficient, please tell me why!
Thanks in advance!
I took a look at actual g++ -O3 output for your code, to see just how bad it was.
char* can alias anything, so even the __restrict__ GNU C++ extension can't help the compiler hoist the strlen out of the loop.
I was thinking it would be hoisted, and expecting that the major inefficiency here was just the byte-at-a-time copy loop. But no, it's really as bad as the other answers suggest. m_pName even has to be re-loaded every time, because the aliasing rules allow m_pName[i] to alias this->m_pName. The compiler can't assume that storing to m_pName[i] won't change class member variables, or the src string, or anything else.
#include <string.h>
class foo {
    char *__restrict__ m_pName = nullptr;
    void set_name(const char *__restrict__ pName);
    void alloc_name(size_t sz) { m_pName = new char[sz]; }
};
// g++ will only emit a non-inline copy of the function if there's a non-inline definition.
void foo::set_name(const char * __restrict__ pName)
{
    // char* can alias anything, including &m_pName, so the loop has to reload the pointer every time
    //char *__restrict__ dst = m_pName;  // a local avoids the reload of m_pName, but still can't hoist strlen
    #define dst m_pName
    for (unsigned int i = 0; i < strlen(pName); i++)
        dst[i] = pName[i];
}
Compiles to this asm (g++ -O3 for x86-64, SysV ABI):
...
.L7:
    movzx edx, BYTE PTR [rbp+0+rbx]   ; byte load from src. clang uses mov al, byte ..., instead of movzx. The difference is debatable.
    mov rax, QWORD PTR [r12]          ; reload this->m_pName
    mov BYTE PTR [rax+rbx], dl        ; byte store
    add rbx, 1
.L3:                                  ; first-iteration entry point
    mov rdi, rbp                      ; function arg for strlen
    call strlen
    cmp rbx, rax
    jb .L7                            ; compare-and-branch (unsigned)
Using an unsigned int loop counter introduces an extra mov ebx, ebp copy of the loop counter, which you don't get with either int i or size_t i, in both clang and gcc. Presumably they have a harder time accounting for the fact that unsigned i could produce an infinite loop.
So obviously this is horrible:
a strlen call for every byte copied
copying one byte at a time
reloading m_pName every time through the loop (can be avoided by loading it into a local).
Using strcpy avoids all these problems, because strcpy is allowed to assume that its src and dst don't overlap. Don't use strlen + memcpy unless you want to know the length yourself. If the most efficient implementation of strcpy is strlen + memcpy, the library function will internally do that. Otherwise, it will do something even more efficient, like glibc's hand-written SSE2 strcpy for x86-64. (There is an SSSE3 version, but it's actually slower on Intel SnB, and glibc is smart enough not to use it.) Even the SSE2 version may be unrolled more than it should be (great on microbenchmarks, but it pollutes the instruction cache, uop cache, and branch-predictor caches when used as a small part of real code). The bulk of the copying is done in 16B chunks, with 64-bit, 32-bit, and smaller chunks in the startup/cleanup sections.
Using strcpy of course also avoids bugs like forgetting to store a trailing '\0' character in the destination. If your input strings are potentially gigantic, using int for the loop counter (instead of size_t) is also a bug. Using strncpy is generally better, since you often know the size of the dest buffer, but not the size of the src.
memcpy can be more efficient than strcpy, since rep movs is highly optimized on Intel CPUs, esp. IvB and later. However, scanning the string to find the right length first will always cost more than the difference. Use memcpy when you already know the length of your data.
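To make that recommendation concrete, a minimal sketch of the two idioms (function names are mine):
#include <cstring>
#include <cstddef>

void copy_name(char *dst, const char *src) {
    std::strcpy(dst, src);            // one pass over src; copies the trailing '\0'
}

void copy_name_n(char *dst, const char *src, std::size_t len) {
    std::memcpy(dst, src, len + 1);   // len = strlen(src), already known; +1 includes the '\0'
}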
At best it's somewhat inefficient. At worst, it's quite inefficient.
In the good case, the compiler recognizes that it can hoist the call to strlen out of the loop. In this case, you end up traversing the input string once to compute the length, and then again to copy to the destination.
In the bad case, the compiler calls strlen every iteration of the loop, in which case the complexity becomes quadratic instead of linear.
As far as how to do it efficiently, I'd tend to do something like this:
char *dest = m_pName;
for (char const *in = pName; *in; ++in)
    *dest++ = *in;
*dest++ = '\0';
This traverses the input only once, so it's potentially about twice as fast as the first, even in the better case (and in the quadratic case, it can be many times faster, depending on the length of the string).
Of course, this is doing pretty much the same thing as strcpy would. That may or may not be more efficient still--I've certainly seen cases where it was. Since you'd normally assume strcpy is going to be used quite a lot, it can be worthwhile to spend more time optimizing it than some random guy on the internet typing in an answer in a couple minutes.
Yes, your code is inefficient. Your code takes what is called "O(n^2)" time. Why? You have the strlen() call in your loop, so your code is recalculating the length of the string every single loop. You can make it faster by doing this:
unsigned int len = strlen(pName);
for (unsigned int i = 0; i < len; i++)
    m_pName[i] = pName[i];
Now, you calculate the string length only once, so this code takes "O(n)" time, which is much faster than O(n^2). This is now about as efficient as you can get with a manual loop. However, a memcpy call would still be 4-8 times faster, because this code copies 1 byte at a time, whereas memcpy will use your system's word length.
Depends on the interpretation of efficiency. I'd claim using memcpy() or strcpy() is more efficient, because you don't write such loops every time you need a copy.
He is claiming that, for example, memcpy would do the same as the for loop above.
Well, not exactly the same. Probably not, because memcpy() takes the size once, while strlen(pName) may be called on every loop iteration. Thus, from a performance standpoint, memcpy() would be better.
BTW from your commented code:
// char *m_pName = nullptr;
Initializing it like that and then writing through it leads to undefined behavior; you need to allocate memory for m_pName first:
char *m_pName = new char[strlen(pName) + 1];
Why the +1? Because you have to account for the '\0' indicating the end of the C-style string.
Yes, it's inefficient, not because you're using a loop instead of memcpy but because you're calling strlen on each iteration. strlen loops over the entire array until it finds the terminating zero byte.
Also, it's very unlikely that the strlen will be optimized out of the loop condition, see In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing).
So memcpy(m_pName, pName, strlen(pName)) would indeed be faster.
Even faster would be strcpy, because it avoids the strlen loop:
strcpy(m_pName, pName);
strcpy does the same as the loop in @JerryCoffin's answer.
For simple operations like that you should almost always say what you mean and nothing more.
In this instance if you had meant strcpy() then you should have said that, because strcpy() will copy the terminating NUL character, whereas that loop will not.
Neither one of you can win the debate. A modern compiler has seen a thousand different memcpy() implementations and there's a good chance it's just going to recognise yours and replace your code either with a call to memcpy() or with its own inlined implementation of the same.
It knows which one is best for your situation. Or at least it probably knows better than you do. When you second-guess that you run the risk of the compiler failing to recognise it and your version being worse than the collected clever tricks the compiler and/or library knows.
Here are a few considerations that you have to get right if you want to run your own code instead of the library code (a minimal sketch of just the first point follows the list):
What's the largest read/write chunk size that is efficient (it's rarely bytes).
For what range of loop lengths is it worth the trouble of pre-aligning reads and writes so that larger chunks can be copied?
Is it better to align reads, align writes, do nothing, or to align both and perform permutations in arithmetic to compensate?
What about using SIMD registers? Are they faster?
How many reads should be performed before the first write? How much register file needs to be used for the most efficient burst accesses?
Should a prefetch instruction be included?
How far ahead?
How often?
Does the loop need extra complexity to avoid preloading over the end?
How many of these decisions can be resolved at run-time without causing too much overhead? Will the tests cause branch prediction failures?
Would inlining help, or is that just wasting icache?
Does the loop code benefit from cache line alignment? Does it need to be packed tightly into a single cache line? Are there constraints on other instructions within the same cache line?
Does the target CPU have dedicated instructions like rep movsb which perform better? Does it have them but they perform worse?
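For the first point only, here is a hedged sketch of a chunked copy; it deliberately ignores the alignment, prefetch, SIMD, and rep movsb questions above:
#include <cstddef>
#include <cstring>

void copy_chunked(void *dst, const void *src, std::size_t n) {
    unsigned char *d = static_cast<unsigned char *>(dst);
    const unsigned char *s = static_cast<const unsigned char *>(src);
    std::size_t i = 0;
    for (; i + sizeof(std::size_t) <= n; i += sizeof(std::size_t))
        std::memcpy(d + i, s + i, sizeof(std::size_t));  // optimizes to one word-sized load/store
    for (; i < n; ++i)                                   // copy the remaining tail bytes
        d[i] = s[i];
}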
Going further; because memcpy() is such a fundamental operation it's possible that even the hardware will recognise what the compiler's trying to do and implement its own shortcuts that even the compiler doesn't know about.
Don't worry about the superfluous calls to strlen(). Compiler probably knows about that, too. (Compiler should know in some instances, but it doesn't seem to care) Compiler sees all. Compiler knows all. Compiler watches over you while you sleep. Trust the compiler.
Oh, except the compiler might not catch that null pointer reference. Stupid compiler!
This code is confused in various ways.
Just do m_pName = pName; because you're not actually copying the string.
You're just pointing to the one you've already got.
If you want to copy the string m_pName = strdup(pName); would do it.
If you already have storage, strcpy or memcpy would do it.
In any case, get strlen out of the loop.
This is the wrong time to worry about performance.
First get it right.
If you insist on worrying about performance, it's hard to beat strcpy.
What's more, you don't have to worry about it being right.
As a matter of fact, why do you need to copy at all (either with the loop or memcpy)?
If you want to duplicate a memory block, that's a different question, but since it's a pointer, all you need is &pName[0] (the address of the first element of the array) and its size. You can reference any object in the array by incrementing the address of the first byte, and you know the limit from the size value. Why have all these pointers? (Let me know if there is more to this than theoretical debate.)

How is it known that variables are in registers, or on stack?

I am reading this question about inline on the isocpp FAQ; the code is given as
void f()
{
int x = /*...*/;
int y = /*...*/;
int z = /*...*/;
// ...code that uses x, y and z...
g(x, y, z);
// ...more code that uses x, y and z...
}
then it says that
Assuming a typical C++ implementation that has registers and a stack,
the registers and parameters get written to the stack just before the
call to g(), then the parameters get read from the stack inside
g() and read again to restore the registers while g() returns to
f(). But that’s a lot of unnecessary reading and writing, especially
in cases when the compiler is able to use registers for variables x,
y and z: each variable could get written twice (as a register and
also as a parameter) and read twice (when used within g() and to
restore the registers during the return to f()).
I have a big difficulty understanding the paragraph above. I try to list my questions as below:
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data? (I know this question is not particularly related to C++, but understanding this will be helpful to understand how C++ works.)
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
PS
It's very hard to choose an acceptable answer when the answers are all very good (e.g., the ones provided by @MatsPeterson, @TheodorosChatzigiannakis, and @superultranova), I think. I personally like the one by @Potatoswatter a little more, since it offers some guidelines.
Don't take that paragraph too seriously. It seems to be making excessive assumptions and then going into excessive detail, which can't really be generalized.
But, your questions are very good.
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data? (I know this question is not particularly related to C++, but understanding this will be helpful to understand how C++ works.)
More-or-less, everything needs to be loaded into registers. Most computers are organized around a datapath, a bus connecting the registers, the arithmetic circuits, and the top level of the memory hierarchy. Usually, anything that is broadcast on the datapath is identified with a register.
You may recall the great RISC vs CISC debate. One of the key points was that a computer design can be much simpler if the memory is not allowed to connect directly to the arithmetic circuits.
In modern computers, there are architectural registers, which are a programming construct like a variable, and physical registers, which are actual circuits. The compiler does a lot of heavy lifting to keep track of physical registers while generating a program in terms of architectural registers. For a CISC instruction set like x86, this may involve generating instructions that send operands in memory directly to arithmetic operations. But behind the scenes, it's registers all the way down.
Bottom line: Just let the compiler do its thing.
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
Each platform defines a way for C functions to call each other. Passing parameters in registers is more efficient. But, there are trade-offs and the total number of registers is limited. Older ABIs more often sacrificed efficiency for simplicity, and put them all on the stack.
Bottom line: The example is arbitrarily assuming a naive ABI.
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
The compiler tends to prefer to use registers for more frequently accessed values. Nothing in the example requires the use of the stack. However, less frequently accessed values will be placed on the stack to make more registers available.
Only when you take the address of a variable, such as by &x or passing by reference, and that address escapes the inliner, is the compiler required to use memory and not registers.
Bottom line: Avoid taking addresses and passing/storing them willy-nilly.
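A small sketch of that point (function names are mine; sink is a hypothetical external function the optimizer can't see into):
void sink(int *);                 // defined in another translation unit

int stays_in_register(int a) {
    int x = a * 3;
    return x + 1;                 // x never needs a memory address: register only
}

int forced_to_memory(int a) {
    int x = a * 3;
    sink(&x);                     // &x escapes, so x must live in an addressable stack slot
    return x;
}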
It is entirely up to the compiler (in conjunction with the processor type) whether a variable is stored in memory or a register [or in some cases more than one register] (and what options you give the compiler, assuming it's got options to decide such things - most "good" compilers do). For example, the LLVM/Clang compiler uses a specific optimisation pass called "mem2reg" that moves variables from memory to registers. The decision to do so is based on how the variable(s) are used - for example, if you take the address of a variable at some point, it needs to be in memory.
Other compilers have similar, but not necessarily identical, functionality.
Also, at least in compilers that have some semblance of portability, there will be a phase of generating machine code for the actual target, which contains target-specific optimisations, and which again can move a variable from memory to a register.
It is not possible [without understanding how the particular compiler works] to determine whether the variables in your code are in registers or in memory. One can guess, but such a guess is like guessing other "kind of predictable" things, like looking out the window to guess whether it's going to rain in a few hours: depending on where you live, this may be a completely random guess or quite predictable. In some tropical countries you can set your watch by when the rain arrives each afternoon; in other countries it rarely rains; and in some, like here in England, you can't know for certain beyond "right now it is [not] raining right here".
To answer the actual questions:
This depends on the processor. Proper RISC processors such as ARM, MIPS, 29K, etc have no instructions that use memory operands except the load and store type instructions. So if you need to add two values, you need to load the values into registers, and use the add operation on those registers. Some, such as x86 and 68K allows one of the two operands to be a memory operand, and for example PDP-11 and VAX have "full freedom", whether your operands are in memory or register, you can use the same instruction, just different addressing modes for the different operands.
Your original premise here is wrong: it's not guaranteed that arguments to g are on the stack. That is just one of many options. Many ABIs (application binary interfaces, aka "calling conventions") use registers for the first few arguments to a function. So, again, it depends on which compiler (to some degree) and, much more, on which processor the compiler targets, whether the arguments are in memory or in registers.
Again, this is a decision that the compiler makes: it depends on how many registers the processor has, which ones are available, and what the cost is of "freeing" some registers for x, y and z, which ranges from "no cost at all" to "quite a bit", again depending on the processor model and the ABI.
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data?
Not even this statement is always true. It is probably true for all the platforms you'll ever work with, but there surely can be another architecture that doesn't make use of processor registers at all.
Your x86_64 computer does however.
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
These two questions cannot be uniquely answered for every compiler and system your code will be compiled on. They cannot even be taken for granted, since g's parameters might not be on the stack; it all depends on several concepts I'll explain below.
First you should be aware of the so-called calling conventions which define, among the other things, how function parameters are passed (e.g. pushed on the stack, placed in registers, or a mix of both). This isn't enforced by the C++ standard and calling conventions are a part of the ABI, a broader topic regarding low-level machine code program issues.
Secondly, register allocation (i.e. which variables are actually kept in a register at any given time) is a complex task and an NP-complete problem. Compilers try to do their best with the information they have. In general, less frequently accessed variables are put on the stack, while more frequently accessed variables are kept in registers. Thus the part "Where is the data inside g() stored, register or stack?" cannot be answered once and for all, since it depends on many factors, including register pressure.
Not to mention compiler optimizations which might even eliminate the need for some variables to be around.
Finally the question you linked already states
Naturally your mileage may vary, and there are a zillion variables that are outside the scope of this particular FAQ, but the above serves as an example of the sorts of things that can happen with procedural integration.
i.e. the paragraph you posted makes some assumptions to set things up for an example. Those are just assumptions and you should treat them as such.
As a small addition: regarding the benefits of inline on a function I recommend taking a look at this answer: https://stackoverflow.com/a/145952/1938163
You can't know, without looking at the assembly language, whether a variable is in a register, stack, heap, global memory or elsewhere. A variable is an abstract concept. The compiler is allowed to use registers or other memory as it chooses, as long as the execution isn't changed.
There's also another rule that affects this topic. If you take the address of a variable and store into a pointer, the variable may not be placed into a register because registers don't have addresses.
The variable storage may also depend on the optimization settings for the compiler. Variables can disappear due to simplification. Variables that don't change value may be placed into the executable as a constant.
Regarding your #1 question, yes, non load/store instructions operate on registers.
Regarding your #2 question, if we are assuming that parameters are passed on the stack, then we have to write the registers to the stack, otherwise g() won't be able to access the data, since the code in g() doesn't "know" which registers the parameters are in.
Regarding your #3 question, it is not known that x, y and z will for sure be stored in registers in f(). One could use the register keyword, but that's more of a suggestion. Based on the calling convention, and assuming the compiler doesn't do any optimization involving parameter passing, you may be able to predict whether the parameters are on the stack or in registers.
You should familiarize yourself with calling conventions. Calling conventions deal with the way that parameters are passed to functions and typically involve passing parameters on the stack in a specified order, putting parameters into registers or a combination of both.
stdcall, cdecl, and fastcall are some examples of calling conventions. In terms of parameter passing, stdcall and cdecl are the same, in that the parameters are pushed onto the stack in right-to-left order. In this case, if g() were cdecl or stdcall, the caller would push z, y, x in that order:
mov eax, z
push eax
mov eax, y
push eax
mov eax, x
push eax
call g
In 64-bit fastcall, registers are used: Microsoft uses RCX, RDX, R8, R9 (plus the stack for functions requiring more than 4 params), while Linux uses RDI, RSI, RDX, RCX, R8, R9. To call g() using MS 64-bit fastcall, one would do the following (we assume x, y, and z are not in registers):
mov rcx, x
mov rdx, y
mov r8, z
call g
This is how assembly is written by humans, and sometimes by compilers. Compilers will use some tricks to avoid shuffling parameters, as that typically reduces the number of instructions and can reduce the number of times memory is accessed. Take the following code for example (I'm intentionally ignoring non-volatile register rules):
f:
    xor rcx, rcx
    mov rsi, x
    mov r8, z
    mov rdx, y
    call g
    mov rcx, rax
    ret
g:
    mov rax, rsi
    add rax, rcx
    add rax, rdx
    ret
For illustrative purposes, rcx is already in use, and x has been loaded into rsi. The compiler can compile g such that it uses rsi instead of rcx, so values don't have to be swapped between the two registers when it comes time to call g. The compiler could also inline g, now that f and g share the same set of registers for x, y, and z. In that case, the call g instruction would be replaced with the contents of g, excluding the ret instruction.
f:
    xor rcx, rcx
    mov rsi, x
    mov r8, z
    mov rdx, y
    mov rax, rsi
    add rax, rcx
    add rax, rdx
    mov rcx, rax
    ret
This will be even faster, because we don't have to deal with the call instruction, since g has been inlined into f.
Short answer: You can't. It completely depends on your compiler and the optimizing features enabled.
The compiler's concern is to translate your program into assembly, but how that is done is tightly coupled to how your compiler works.
Some compilers let you hint which variables should map to registers.
Check for example this: https://gcc.gnu.org/onlinedocs/gcc/Global-Reg-Vars.html
Your compiler will apply transformations to your code in order to gain something, maybe performance, maybe lower code size, and it applies cost functions to estimate these gains, so you can normally only see the result by disassembling the compiled unit.
Variables are almost always stored in main memory, but due to compiler optimizations the value of a declared variable may never be written to main memory at all; such intermediate variables, used only within your method, hold no relevance once another method is called (i.e. once a stack operation occurs).
This is by design, to improve performance, since it is easier (and much faster) for the processor to address and manipulate data in registers. Architectural registers are limited in number, so not everything can be kept in registers. Even if you 'hint' your compiler to put something in a register, it may eventually be managed outside the registers, in main memory, if the available registers are full.
Most probably, a variable will be in main memory if it holds relevance further into the execution and may stay relevant for a longer period of CPU time. A variable is in an architectural register when it is relevant to upcoming machine instructions; its use will be almost immediate, but it may not stay relevant for long.
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data?
This depends on the architecture and the instruction set it offers. But in practice, yes - it is the typical case.
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
Assuming the compiler doesn't eliminate the local variables, it will prefer to put them in registers, because registers are faster than the stack (which resides in the main memory, or a cache).
But this is far from a universal truth: it depends on the (complicated) inner workings of the compiler (whose details are handwaved in that paragraph).
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
Even if we assume that the variables are, in fact, stored in the registers, when you call a function, the calling convention kicks in. That's a convention that describes how a function is called, where the arguments are passed, who cleans up the stack, what registers are preserved.
All calling conventions have some kind of overhead. One source of this overhead is the argument passing. Many calling conventions attempt to reduce that, by preferring to pass arguments through registers, but since the number of CPU registers is limited (compared to the space of the stack), they eventually fall back to pushing through the stack after a number of arguments.
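A hedged illustration of that fallback on one common convention (x86-64 System V, where the first six integer args ride in rdi, rsi, rdx, rcx, r8, r9 and the seventh spills to the stack):
long sum7(long a, long b, long c, long d, long e, long f, long g) {
    return a + b + c + d + e + f + g;   // compilers load g from the caller's stack frame
}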
The paragraph in your question assumes a calling convention that passes everything through the stack and based on that assumption, what it's trying to tell you is that it would be beneficial (for execution speed) if we could "copy" (at compile time) the body of the called function inside the caller (instead of emitting a call to the function). This would yield the same results logically, but it would eliminate the runtime cost of the function call.

C++: Why does this speed my code up?

I have the following function
double single_channel_add(int patch_top_left_row, int patch_top_left_col,
                          int image_hash_key,
                          Mat* preloaded_images,
                          int* random_values){
    int first_pixel_row = patch_top_left_row + random_values[0];
    int first_pixel_col = patch_top_left_col + random_values[1];
    int second_pixel_row = patch_top_left_row + random_values[2];
    int second_pixel_col = patch_top_left_col + random_values[3];
    int channel = random_values[4];
    Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
    Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
    return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}
Which is called about one and a half million times with different values for patch_top_left_row and patch_top_left_col. This takes about 2 seconds to run; when I change the calculation of first_pixel_row etc. to use hard-coded numbers instead of the arguments (shown below), the thing runs in under a second, and I don't know why. Is the compiler doing something smart here (I am using a gcc cross-compiler)?
double single_channel_add(int patch_top_left_row, int patch_top_left_col,
                          int image_hash_key,
                          Mat* preloaded_images,
                          int* random_values){
    int first_pixel_row = 5 + random_values[0];
    int first_pixel_col = 6 + random_values[1];
    int second_pixel_row = 8 + random_values[2];
    int second_pixel_col = 10 + random_values[3];
    int channel = random_values[4];
    Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
    Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
    return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}
EDIT:
I have pasted the assembly from the two versions of the function
using arguments: http://pastebin.com/tpCi8c0F
using constants: http://pastebin.com/bV0d7QH7
EDIT:
After compiling with -O3 I get the following clock ticks and speeds:
using arguments: 1990000 ticks and 1.99seconds
using constants: 330000 ticks and 0.33seconds
EDIT:
using arguments with -O3 compilation: http://pastebin.com/fW2HCnHc
using constants with -O3 compilation: http://pastebin.com/FHs68Agi
On the x86 platform there are instructions that very quickly add small integers to a register. These instructions are the lea (aka 'load effective address') instructions and they are meant for computing address offsets for structures and the like. The small integer being added is actually part of the instruction. Smart compilers know that these instructions are very quick and use them for addition even when addresses are not involved.
I bet if you changed the constants to some random value that was at least 24 bits long that you would see much of the speedup disappear.
Secondly, those constants are known values. The compiler can do a lot to arrange for those values to end up in a register in the most efficient way possible. With an argument, unless the argument is passed in a register (and I think your function has too many arguments for that calling convention to be used), the compiler has no choice but to fetch the number from memory using a stack-offset load instruction. That isn't a particularly slow instruction or anything, but with constants the compiler is free to do something much faster, which may involve simply fetching the number from the instruction itself. The lea instructions are simply the most extreme example of this.
Edit: Now that you've pasted the assembly things are much clearer
In the non-constant code, here is how the add is done:
addl -68(%rbp), %eax
This fetches a value from the stack at offset -68(%rbp) and adds it to the %eax register.
In the constant code, here is how the add is done:
addl $5, %eax
and if you look at the actual numbers, you see this:
0138 83C005
It's pretty clear that the constant being added is encoded directly into the instruction as a small value. This is going to be much faster to fetch than fetching a value from a stack offset for a number of reasons. First it's smaller. Secondly, it's part of an instruction stream with no branches. So it will be pre-fetched and pipelined with no possibility for cache stalls of any kind.
So while my surmise about the lea instruction wasn't correct, I was still on the right track. The constant version uses a small instruction specifically oriented towards adding a small integer to a register. The non-constant version has to fetch an integer that may be of indeterminate size (so it has to fetch ALL the bits, not just the low ones) from a stack offset (which adds in an additional add to compute the actual address from the offset and stack base address).
Edit 2: Now that you've posted the -O3 results
Well, it's much more confusing now. It's apparently inlined the function in question and it jumps around a whole ton between the code for the inlined function and the code for the calling function. I'm going to need to see the original code for the whole file to make a proper analysis.
But what I strongly suspect is happening now is that the unpredictability of the values retrieved from get_random_number_in_range is severely limiting the optimization options available to the compiler. In fact, it looks like in the constant version it doesn't even bother to call get_random_number_in_range because the value is tossed out and never used.
I'm assuming that the values of patch_top_left_row and patch_top_left_col are generated in a loop somewhere. I would push this loop into this function. If the compiler knows the values are generated as part of a loop, there are a very large number of optimization options open to it. In the extreme case it could use some of the SIMD instructions that are part of the various SSE or 3dnow! instruction suites to make things a whole ton faster than even the version you have that uses constants.
The other option would be to make this function inline, which would hint to the compiler that it should try inserting it into the loop in which it's called. If the compiler takes the hint (this function is a bit largish, so the compiler might not) it will have much the same effect as if you'd stuffed the loop into the function.
Well, binary arithmetic operations of immediate constant vs. memory format are expected to produce faster code than the ones of memory vs. memory format, but the timing effect you observe appears to be too extreme, especially considering that there are other operations inside that function.
Could it be that the compiler decided to inline your function? Inlining would allow the compiler to easily eliminate everything related to the unused patch_top_left_row and patch_top_left_col parameters in the second version, including any steps that prepare/calculate these parameters in the calling code.
Technically, this can be done even if the function is not inlined, but it is generally more complicated.

may a whole array reside in some cpu register?

I'm not too familiar with CPU registers, in general or in any particular architecture, especially x86 (if compiler-relevant, assume VC++). Is it possible for all elements of an array with a tiny number of elements, like an array of four 1-byte characters, to reside in some CPU register, just as I know it can for single primitives like double, integer, etc.?
when we have a parameter like below:
void someFunc(char charArray[4]){
    //whatever
}
Will this parameter passing definitely be done by passing a pointer to the function, or could the array reside in some CPU register, eliminating the need to pass a pointer to main memory?
This is not compiler dependent, nor is it possible. Arrays cannot be passed by value in the same way as other types, i.e. they cannot be copied when passed into a function. The C++ standard is clear in that when processing a function signature in a declaration the following are exact equivalencies:
void foo( char *a );
void foo( char a[] );
void foo( char a[4] );
void foo( char a[ 100000 ] );
A compliant compiler will convert the array in the function signature into a pointer. Now, at the place of call, a similar operation takes place: if the argument is an array, the compiler has to decay it into a pointer to the first element. Again, the size of the array is lost in the decay.
Specific registers can hold more than one value and perform operations on them (google vectorized operations, MMX and variants). But while that means the compiler can actually place the contents of a small array into a single register, that cannot be used to change the function call that you refer to.
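For what it's worth, a hedged x86 sketch of "array data in one register" (SSE2 intrinsics; the function name is mine): four bytes of array data can sit in one xmm register, but the array parameter itself is still a pointer.
#include <cstring>
#include <emmintrin.h>

__m128i load4(const char *p) {
    int v;
    std::memcpy(&v, p, sizeof(v));   // read 4 bytes, aliasing-safe
    return _mm_cvtsi32_si128(v);     // place them in the low 32 bits of an xmm register
}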
Within a single function, an array could be held in one or more registers, just so long as the compiler is able to produce CPU instructions to manipulate it as the code dictates. The standard doesn't really define what it means for something to "be" in a register. It's a private matter between the compiler and the debugger, and there may be a fine line between something being in a register, and being "optimized away" entirely.
In your example, the parameter is a pointer, not an array (see dribeas' answer). So it would be unusual for the array it points to to be held in a register. The "main" architectures that you probably deal with don't allow a pointer to a register, so even if the array were held in a register in the calling code, it would have to be written into memory in order to take a pointer to it to pass to the callee.
If the function call was inlined, then better optimizations might be possible, just as if there were no call at all.
If you wrap your array in a struct, then you turn it into something that can be passed by value:
struct Foo {
    char a[4];
};
void FooFunc(Foo f) {
    // whatever
}
Now, the function is taking the actual array data as its parameter, so there's one less barrier to holding it in a register. Whether the implementation's calling convention actually does pass small structs in registers is another question, though. I don't know what calling conventions do this, if any.
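As a hedged data point (not from the answer above): the x86-64 System V ABI does classify a small struct like Foo as INTEGER and passes it in (part of) a register, so a sketch like this typically never touches memory:
struct Foo { char a[4]; };   // same Foo as above, repeated so the sketch stands alone

int first_byte(Foo f) {
    return f.a[0];           // f arrives in (part of) a register, e.g. edi; the byte is extracted directly
}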
Out of the 5 or so compilers I'm fairly familiar with, (Borland/Turbo C/C++ from 1.0, Watcom C/C++ from v8.0, MSC from 5.0, IBM Visual Age C/C++, gcc of various versions on DOS, Linux and Windows) I've not seen this optimization happen naturally.
There was a string library, whose name I cannot remember, that did optimizations similar to this in x86 ASM. It may have been part of the "Spontaneous Assembly" library, but no guarantees.
A function that accepts an array is probably going to index into that array. I know of no architecture that supports efficient indexing into a register, so it's probably pointless to pass arrays in registers.
(On an x86 architecture, you could access a[0] and a[1] by accessing al and ah of the eax register, but that is a special case that only works if the indexes are known at compile time.)
You asked if it's possible with VC++ on x86.
I doubt it's possible in that configuration. True, you could produce assembler code where that array is kept in a register, but due to the nature of arrays it would be by no means a natural optimization for a compiler, so I doubt they implemented it.
You can try it out, though, and produce some code where the compiler would have an "incentive" to put it in a register, but it would look pretty weird, like
char x[4];
*((int*)x) = 36587467;
Compile that with optimizations and the /FA switch and look at the assembler code produced (and then tell us the results :-))
If you use it in a more "natural" way, like accessing single characters or initializing it with a string there is no reason at all for the compiler to put that array into a register.
Even when passing it to a function, the compiler might put the address of the array into a register, but not the array itself.
Only variables can be stored in a register. You can try to request register storage by using the register keyword: register int i; (the compiler is free to ignore it).
Arrays decay to pointers when passed to functions.
You can get the value located at index 4 like this (using pointer syntax):
char c = *(charArray + 4);

C++ CPU Register Usage

In C++, local variables are always allocated on the stack. The stack is a part of the allowed memory that your application can occupy. That memory is kept in your RAM (if not swapped out to disk). Now, does a C++ compiler always create assembler code that stores local variables on the stack?
Take, for example, the following simple code:
int foo( int n ) {
    return ++n;
}
In MIPS assembler code, this could look like this:
foo:
    addi $v0, $a0, 1
    jr $ra
As you can see, I didn't need to use the stack at all for n. Would the C++ compiler recognize that, and directly use the CPU's registers?
Edit: Wow, thanks a lot for your almost immediate and extensive answers! The function body of foo should of course be return ++n;, not return n++;. :)
Yes. There is no rule that "variables are always allocated on the stack". The C++ standard says nothing about a stack. It doesn't assume that a stack exists, or that registers exist. It just says how the code should behave, not how it should be implemented.
The compiler only stores variables on the stack when it has to - when they have to live past a function call for example, or if you try to take the address of them.
The compiler isn't stupid. ;)
Disclaimer: I don't know MIPS, but I do know some x86, and I think the principle should be the same..
In the usual function call convention, the compiler will push the value of n onto the stack to pass it to the function foo. However, there is the fastcall convention that you can use to tell gcc to pass the value through the registers instead. (MSVC also has this option, but I'm not sure what its syntax is.)
test.cpp:
int foo1 (int n) { return ++n; }
int foo2 (int n) __attribute__((fastcall));
int foo2 (int n) {
    return ++n;
}
Compiling the above with g++ -O3 -fomit-frame-pointer -c test.cpp, I get for foo1:
mov eax,DWORD PTR [esp+0x4]
add eax,0x1
ret
As you can see, it reads in the value from the stack.
And here's foo2:
lea eax,[ecx+0x1]
ret
Now it takes the value directly from the register.
Of course, if you inline the function the compiler will do a simple addition in the body of your larger function, regardless of the calling convention you specify. But when you can't get it inlined, this is going to happen.
Disclaimer 2: I am not saying that you should continually second-guess the compiler. It probably isn't practical and necessary in most cases. But don't assume it produces perfect code.
Edit 1: If you are talking about plain local variables (not function arguments), then yes, the compiler will allocate them in the registers or on the stack as it sees fit.
Edit 2: It appears that the calling convention is architecture-specific, and MIPS will pass the first four arguments in registers, as Richard Pennington has stated in his answer. So in your case you don't have to specify the extra attribute (which is in fact an x86-specific attribute).
Yes, a good optimizing C/C++ compiler will optimize that. And even MUCH more: see here: Felix von Leitner's Compiler Survey.
A normal C/C++ compiler will not put every variable on the stack anyway. The problem with your foo() function could be that the parameter gets passed via the stack to the function (the ABI of your system (hardware/OS) defines that).
With C's register keyword you can give the compiler a hint that it would probably be good to store a variable in a register. Sample:
register int x = 10;
But remember: The compiler is free not to store x in a register if it wants to!
The answer is yes, maybe. It depends on the compiler, the optimization level, and the target processor.
In the case of the mips, the first four parameters, if small, are passed in registers and the return value is returned in a register. So your example has no requirement to allocate anything on the stack.
Actually, truth is stranger than fiction. In your case the parameter is returned unchanged: the value returned is that of n before the ++ operator (i.e. this compiled the original return n++; version):
foo:
    .frame $sp,0,$ra
    .mask 0x00000000,0
    .fmask 0x00000000,0
    addu $2, $zero, $4
    jr $ra
    nop
Since your example foo function was an identity function (it just returns its argument), my C++ compiler (VS 2008) completely removes the function call. If I change it to:
int foo( int n ) {
    return ++n;
}
the compiler inlines this with
lea edx, [eax+1]
Yes, registers are used in C++. For example, the MDR (memory data register) holds the data being fetched or stored. To retrieve the contents of cell 123, we load the value 123 (in binary) into the MAR (memory address register) and perform a fetch operation. When the operation is done, a copy of the contents of cell 123 is in the MDR. To store the value 98 into cell 4, we load 4 into the MAR and 98 into the MDR and perform a store. When the operation completes, the contents of cell 4 have been set to 98, discarding whatever was there previously. The data and address registers work together to achieve this. In C++ too, when we initialize a variable with a value or ask for its value, the same thing happens.
And one more thing: modern compilers also perform register allocation, which keeps values in registers and is considerably faster than going through memory.