C++ 64 bit int: pass by reference or pass by value

This is an efficiency question about 64 bit ints. Assuming I don't need to modify the value of an "int" parameter, should I pass it by value or by reference?
Assuming 32 bit machine:
1) 32 bit int: I guess the answer is "pass by value" as "pass by reference" will have overhead of extra memory lookup.
2) 64 bit int: If I pass by reference, I only pass 32 bit address on the stack, but need an extra memory lookup. So which one of them is better (reference or value)?
What if the machine is 64 bit?
regards,
JP

Pass by value - definitely. On a 64-bit system, a 64-bit word is the native register size, so copying it is extremely fast.

Even on a 64 bit machine pass by value is better (with some very few exceptions), because it can be passed as a register value.

Pass them as a boost::call_traits<int64_t>::param_type. This template captures the best practices for passing any type on the supported platforms. Hence, it will be different on 32 and 64 bits platforms, but you can use the same code everywhere. It even works inside other templates where you don't know the precise type yet.
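To illustrate the idea without depending on Boost, here is a minimal sketch of a hand-rolled trait. The names param and param_t and the size threshold are assumptions made up for this example; this is not how boost::call_traits is implemented, only the same kind of compile-time choice.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical trait (not Boost's implementation): pass small, trivially
// copyable types by value and everything else by const reference.
// The threshold of two pointers is an arbitrary assumption.
template <typename T>
struct param {
    using type = typename std::conditional<
        std::is_trivially_copyable<T>::value && sizeof(T) <= 2 * sizeof(void*),
        T,
        const T&>::type;
};

template <typename T>
using param_t = typename param<T>::type;

// The same spelling works for any T, including inside other templates.
inline std::int64_t twice(param_t<std::int64_t> v) { return 2 * v; }
inline std::size_t length_of(param_t<std::string> s) { return s.size(); }

static_assert(std::is_same<param_t<std::int64_t>, std::int64_t>::value,
              "int64_t is passed by value");
static_assert(std::is_same<param_t<std::string>, const std::string&>::value,
              "std::string is passed by const reference");
```

With this, twice(21) takes its argument by value and length_of(some_string) takes it by const reference, while both signatures are written the same way.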

Use a little common sense,
if the object requires a complex copy constructor, it's probably worth passing by reference (saying that - quite a lot of boost's objects are designed to be passed-by-value rather than reference simply because internal implementation is quite trivial) There is one odd one which I haven't really worked out, std::string, I always pass this by reference...
If you intend to modify the value that is passed in, use a reference
Else, PASS-BY-VALUE!
Do you have a particular performance bottleneck with arguments to functions? Else, don't spend too much time worrying about which is the best way to pass...
Optimizing by worrying about how an int is passed in is like pi**ing in the sea...

For argument's sake, let's ignore the trivial case of optimisers removing differences. Let's also say you're using Microsoft's 64-bit calling convention on x86-64 (which does differ from the Linux ABI): then you've got 4 64-bit registers for passing such values before you have to resort to pushing them on the stack. Passing by value is clearly better there.
For a 32-bit app, arguments passed by value go straight onto the stack. By-reference may instead put a pointer in a register (again, a few such register uses are allowed before resorting to the stack). We can see this in some output from g++ -O3 -S, calling f1(99) by value and f2(101) by const reference:
void f1(int64_t);
void f2(const int64_t&);
int main()
{
f1(99);
f2(101);
}
...
pushl $0
pushl $99
call _Z2f1x // by value - pushed two halves to stack
leal -8(%ebp), %eax
movl %eax, (%esp)
movl $101, -8(%ebp)
movl $0, -4(%ebp)
call _Z2f2RKx // by const& - ugly isn't it!?!
The called function then has to retrieve the value before its first use (if any). It is free to cache the value it read in registers, so that is only needed once. With the stack approach, the value can be reread at will, so a register need not stay reserved for it. With the pointer approach, either the pointer or the 64-bit value may need to be saved somewhere more predictable (e.g. pushed, or moved to another less useful register) if that register has to be freed up momentarily for other work while the 64-bit int parameter is still needed later. All in all, it's hard to guess which is faster: it may depend on the CPU, register usage, optimiser and so on, and it's not worth trying.
A nod to pst's advice...
"efficiency" :( KISS. pass it how you pass every other bloody integer. - pst
...though, sometimes you apply KISS to template parameters and make them all const T& even though some may fit in registers....

Related

How are rvalues assigned to lvalues in assembly?

First question here. I will in a few weeks/months need to create procedural code in which there will be functions assigning big (I mean really big) sets of data directly to pointers. Here is some example of code I will be doing :
void MyFuntion(string* str)
{
*str = "some data in a string";
}
As it surely is important : I am on windows 10, in visual-studio 2019, compiling with the default c++ compiler on release x86.
Imagine something like this but with strings that can contain several millions of characters, or with int/float arrays also with several millions of elements.
So, this is a single operation assigning a rvalue to a pointer, which is therefore on the heap. Of course, if I create a local variable containing the data, it will be more than 1MB and therefore will cause a stack overflow, right ?
As I understand, since the data only exists as a rvalue here, it doesn't have a memory existence, but I would like to know : how is the rvalue assigned to the pointer ? Like, how is it done in assembly ? I must say I have never done any assembly, I have a few (very few) notions but I'd like to get into it when I have time.
Is it temporary created in the stack or heap before being put in the final memory address ? My guess is that the memory address (the pointer in which I am assigning the data) is directly filled with the data, like, bit by bit, so no existence of the rvalue in memory.
If I'm correct, the only things that exist in the stack here are : the function call, the pointer copy, then the instruction, which should be something like "assign rvalue X to lvalue Y" and the size of the instruction doesn't depend on the size of the rvalue and lvalue, so there should not be any problem regarding the stack here.
So, if I'm correct, this code should not cause any problem, no matter how big the rvalue is, but I would still like to know how it is done exactly, assembly-wise. Note that I am not only looking for an answer, but more like some references, books or docs, that could explain in detail. I guess what I am looking for won't be in a c++ book, but more like a assembly book, this might be a good starting point to get myself into it !

Although a specific OS and compiler were mentioned, the example assembly in this answer will probably differ from what the querent's compiler would output, because I don't have a Windows 10 machine available at the time of writing and used a different environment having forgotten about Godbolt. However, this topic is general enough in my opinion that it shouldn't really matter in this specific case.
What even is a value on the right side of an assignment operator? What does assignment look like at the assembly level? Here's a simple example.
void assign_thing(int *p) {
*p = 42;
}
movl $42, (%rdi)
retq
"Move the 32-bit integer 42 into the memory location to which rdi is pointing." %rdi here represents p, and (%rdi) means *p. For something dead simple like an integer, it's pretty much that simple. How about a simple structure?
struct stuff {
int id;
float value;
char text[8];
};
void assign_thing(stuff *p) {
*p = {42, 1.5, "Hello!"};
}
movabsq $4593671619917905962, %rax
movq %rax, (%rdi)
movabsq $36762444129608, %rax
movq %rax, 8(%rdi)
retq
A little harder to read at first glance, but pretty much the same idea. The compiler was smart and packed the integer and float values 42 and 1.5 into a single 64-bit value and stuffs that directly into (%rdi). Likewise with the string "Hello!", which is short enough to fit into a single 64-bit value and gets stuffed into 8(%rdi) (8 bytes past p is the offset of text).
So far, none of the rvalues actually exist in memory when they get assigned. They're just part of the instructions. What if it's something a lot bigger, like a string?
// Overflow checking omitted for brevity.
void assign_thing(char *p) {
// Assignment with = doesn't actually do what you'd want here,
// so this'll have to do.
strcpy(p, "What if it's something a lot bigger, like a string?");
}
vmovups -5484(%rip), %ymm0
vmovups %ymm0, 20(%rdi) ; offset 20 is decimal: the string is 52 bytes with its terminator, so the two 32-byte copies overlap (52 - 32 = 20)
vmovups -5517(%rip), %ymm0
vmovups %ymm0, (%rdi)
vzeroupper
retq
Now, the rvalue does reside in memory when it gets assigned. Do note that this is not because strcpy was used instead of =, but because the compiler decided that it would be better to store that "rvalue" string somewhere in a read-only area like .rodata and just copy it over. If I had used a much shorter string, any reasonably modern compiler would probably optimize it into a few mov or movabsq instructions like in the second example. Unless p points to a buffer on the stack and your strcpy ends up overflowing it, you won't get a stack overflow here.
Now what about your example? I'm guessing that your string type is really std::string, and that's not a trivial type. So what happens there? In C++, the assignment operator = is overloadable, and std::string indeed has its own overloads, so instead of directly stuffing or copying values into the object, a special member function operator= is called. That is to say, your *str = "some data in a string" is really a str->operator=("some data in a string"). How your rvalue string gets copied is up to the implementation of std::string::operator=, but it'll most likely be optimized into something like my last example. The actual string data of an std::string resides on the heap, so stack overflow still isn't a problem here.
tl;dr (this answer + the comments, compressed into a few sentences)
If your string is small enough, it probably won't exist in memory during assignment. If it's big enough, it'll sit in a read-only area somewhere and get copied over when needed. The stack is often not even involved, so don't worry about overflow.
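As a small illustration of the point about std::string, the sketch below fills a string through a pointer with several million characters. The helper name fill_string is hypothetical, and it uses assign(count, ch) rather than a huge literal so the example stays short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: writes `count` characters into the caller's string.
// Only the pointer, the count and a few locals live in this function's stack
// frame; for a string this long the characters go into a buffer that
// std::string allocates dynamically.
inline void fill_string(std::string* str, std::size_t count) {
    str->assign(count, 'x');
}
```

Calling fill_string(&s, 5000000) on a default-constructed std::string s should therefore not come anywhere near a typical 1 MB stack limit, since the stack usage does not grow with count.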

embed a functions assembly code in a struct

I've a rather special question: is it possible in C/C++ (both because I am sure the question is the same in both languages) to specify a function's location? Why? I have a very large list of function pointers, and I want to eliminate them.
(Currently) This looks like that (repeated over like a million times, stored in the user's RAM):
struct {
int i;
void(* funptr)();
} test;
Because I know that in most assembly languages, functions are just "goto" directives, I had the following idea. Is it possible to optimize the above construct so that it looks like that?
struct {
int i;
// embed the assembler of the function here
// so that all the functions
// instructions are located here
// like this: mov rax, rbx
// jmp _start ; just demo code
} test2;
In the end, the thing should look like this in memory: An int holding any value, followed by the function's assembly code, referenced by test2. I should be able to call these functions like that: ((void(*)()) (&pointerToTheStruct + sizeof(int)))();
You might think that I'm insane to optimize the app that way, and I cannot disclose any more details on its function, but if anyone has some pointers on how to solve this problem, I would appreciate it.
I do not think that there is a standard way to do this, so any hacky way to do this via inline assembler/other crazy things is also appreciated!

The only thing you really have to do is make the compiler aware of the (constant) value of the function pointer you want in the struct. The compiler will then (presumably/hopefully) inline that function call wherever it sees it called through that function pointer:
template<void(*FPtr)()>
struct function_struct {
int i;
static constexpr auto funptr = FPtr;
};
void testFunc()
{
volatile int x = 0;
}
using test = function_struct<testFunc>;
int main()
{
test::funptr();
}
Demo - no call or jmp after optimization.
It remains unclear what the point of the int i is. Note that the code is not technically "directly after the i" here, but it is even more unclear how you'd expect instances of the struct to look like (is the code in them or is it "static" in a way? I feel there is some misunderstanding here on your part what compilers actually produce...). But consider the ways that compiler inlining can help you and you might find the solution you need. If you're worried about executable size after inlining, tell the compiler and it will compromise between speed and size.

This sounds like a terrible idea for a lot of reasons that probably won't save memory, and will hurt performance by diluting L1I-cache with data and L1D-cache with code. And worse if you ever modify or copy objects: self-modifying code stalls.
But yes, this would be possible in C99/C11 with a flexible array member at the end of the struct, which you cast to a function pointer.
struct int_with_code {
int i;
char code[]; // C99 flexible array member. GNU extension in C++
// Store machine code here
// you can't get the compiler to do this for you. Good Luck!
};
void foo(struct int_with_code *p) {
// explicit C-style cast compiles as both C and C++
void (*funcp)(void) = ( void (*)(void) ) p->code;
funcp();
}
Compiler output from clang7.0, on the Godbolt compiler explorer is the same when compiled as either C or C++. This is targeting the x86-64 System V ABI, where the first function arg is passed in RDI.
# this is the code that *uses* such an object, not the code that goes in its code[]
# This proves that it compiles,
# without showing any way to get compiler-generated code into code[]
foo: # #foo
add rdi, 4 # move the pointer 4 bytes forward, to point at code[]
jmp rdi # TAILCALL
(If you leave out the (void) arg-type declaration in C, the compiler will zero AL first in the x86-64 SysV calling convention, in case it's actually a variadic function, because it's passing no FP args in registers.)
You'd have to allocate your objects in memory that is executable (normally not done unless they're const with static storage), e.g. compile with gcc -zexecstack. Or use a custom mmap/mprotect on POSIX, or VirtualAlloc/VirtualProtect on Windows.
Or if your objects are all statically allocated, it might be possible to massage compiler output to turn functions in the .text section into objects by adding an int member right before each one. Maybe with some .section and linker tricks, and maybe a linker script, you could even somehow automate it.
But unless they're all the same length (e.g. with padding like char code[60]), that won't form an array you can index, so you'll need some way of referencing all these variable-length objects.
There are potentially huge performance downsides if you ever modify an object before calling its function: on x86 you'll get self-modifying-code pipeline nuke for executing code near a just-written memory location.
Or if you copied an object before calling its function: x86 pipeline flush, or on other ISAs you need to manually flush caches to get the I-cache in sync with D-cache (so the newly-written bytes can be executed). But you can't copy such objects because their size isn't stored anywhere. You can't search the machine code for a ret instruction, because a 0xc3 byte might appear somewhere that's not the start of an x86 instruction. Or on any ISA, the function might have multiple ret instructions (tail duplication optimization). Or end with a jmp instead of a ret (tailcall).
Storing a size would start to defeat the purpose of saving size, eating up at least an extra byte in each object.
Writing code to an object at runtime, then casting to a function pointer, is undefined behaviour in ISO C and C++. On GNU C/C++, make sure you call __builtin___clear_cache on it to sync caches or whatever else is necessary. Yes, this is needed even on x86 to disable dead-store elimination optimizations: see this test case. On x86 it's just a compile-time thing, no extra asm. It doesn't actually clear any caches.
If you do copy at runtime startup, maybe allocate a big chunk of memory and carve out variable-length chunks of it, while copying. If you malloc each separately, you're wasting memory-management overhead on it.
This idea will not save you memory unless you have about as many functions as you have objects
Normally you have a fairly limited number of actual functions, with many objects having copies of the same function pointer. (You've kind of hand-rolled C++ virtual functions, but with only one function you just have a function pointer directly instead of a vtable pointer to a table of pointers for that class type. One fewer levels of indirection, and apparently you're not passing the object's own address to the function.)
One of the several benefits of this level of indirection is that one pointer is usually significantly smaller than the entire code for a function. For that to not be the case, your functions would have to be tiny.
Example: with 10 different functions of 32 bytes each, and 1000 objects with function pointers, you have a total of 320 bytes of code (which will stay hot in I-cache), and 8000 bytes of function pointers. (And in your objects, another 4 bytes per object wasted on padding to align the pointer, making the total size 16 instead of 12 bytes per object.) Anyway, that's 16320 bytes total for entire structs + code. If you allocated each object separately, there's per-object bookkeeping.
With inlining machine code into each object, and no padding, that's 1000 * (4+32) = 36000 bytes, over twice the total size.
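The arithmetic of this example can be written out as a small sketch. The function names are made up for illustration; the numbers are the ones from the text (1000 objects, 10 functions of 32 bytes, 16-byte padded structs, 4-byte int).

```cpp
#include <cassert>
#include <cstddef>

// Bytes used when each object stores an int plus a function pointer and the
// code is shared. `object_size` is the padded struct size (16 in the example).
constexpr std::size_t shared_code_bytes(std::size_t objects, std::size_t object_size,
                                        std::size_t functions, std::size_t code_size) {
    return objects * object_size + functions * code_size;
}

// Bytes used when every object carries its own copy of the machine code,
// with no padding and no stored length.
constexpr std::size_t embedded_code_bytes(std::size_t objects, std::size_t int_size,
                                          std::size_t code_size) {
    return objects * (int_size + code_size);
}

static_assert(shared_code_bytes(1000, 16, 10, 32) == 16320, "structs + shared code");
static_assert(embedded_code_bytes(1000, 4, 32) == 36000, "code embedded per object");
```

Plugging in other counts shows where the break-even point lies: embedding only wins when the functions are tiny or nearly every object has its own function.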
x86-64 is probably a best-case scenario, where a pointer is 8 bytes and x86-64 machine code uses a (famously complex) variable-length instruction encoding which allows for high code density in some cases, especially when optimizing for code-size. (e.g. code-golfing. https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code). But unless your functions are mostly something trivial like lea eax, [rdi + rdi*2] (3 bytes=opcode + ModRM + SIB) / ret (1 byte), they're still going to take more than 8 bytes. (That's return x*3; for a function that takes a 32-bit integer x arg, in the x86-64 System V ABI.)
If they're wrappers for larger functions, a normal call rel32 instruction is 5 bytes. A load of static data is at least 6 bytes (opcode + modrm + rel32 for a RIP-relative addressing mode, or loading EAX specifically can use the special no-modrm encoding for an absolute address. But in x86-64 that's a 64-bit absolute unless you use an address-size prefix too, potentially causing an LCP stall in the decoders on Intel. mov eax, [32 bit absolute address] = addr32 (0x67) + opcode + abs32 = 6 bytes again, so this is worse for no benefit).
Your function-pointer type doesn't have any args (assuming this is C++ where foo() means foo(void) in a declaration, not like old C where an empty arg list is somewhat similar to (...)). Thus we can assume you're not passing args, so to do anything useful the functions are probably accessing some static data or making another call.
Ideas that make more sense:
Use an ILP32 ABI like Linux x32, where the CPU runs in 64-bit mode but your code uses 32-bit pointers. This would make each of your objects only 8 bytes instead of 16. Avoiding pointer-bloat is a classic use-case for x32 or ILP32 ABIs in general.
Or (yuck) compile your code as 32-bit. But then you have obsolete 32-bit calling conventions that pass args on the stack instead of registers, and less than half the registers, and much higher overhead for position-independent code. (No EIP/RIP-relative addressing.)
Store an unsigned int table index to a table of function pointers. If you have 100 functions but 10k objects, the table is only 100 pointers long. In asm you could index an array of code directly (computed goto style) if all the functions were padded to the same length, but in C++ you can't do that. An extra level of indirection with a table of function pointers is probably your best bet.
e.g.
void (*const fptrs[])(void) = {
func1, func2, func3, ...
};
struct int_with_func {
int i;
unsigned f;
};
void bar(struct int_with_func *p) {
fptrs[p->f] ();
}
clang/gcc -O3 output:
bar(int_with_func*):
mov eax, dword ptr [rdi + 4] # load p->f
jmp qword ptr [8*rax + fptrs] # TAILCALL # index the global table with it for a memory-indirect jmp
If you were compiling a shared library, PIE executable, or not targeting Linux, the compiler couldn't use a 32-bit absolute address to index a static array with one instruction. So there'd be a RIP-relative LEA in there and something like jmp [rcx+rax*8].
This is an extra level of indirection vs. storing a function pointer in each object, but it lets you shrink each object to 8 bytes, down from 16, like using 32-bit pointers. Or to 5 or 6 bytes, if you use an unsigned short or uint8_t and pack the structs with __attribute__((packed)) in GNU C.
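A minimal sketch of the packed variant follows, assuming a GNU-compatible compiler (for __attribute__((packed))) and a 4-byte int. The functions func_a, func_b and func_c are hypothetical stand-ins, and they return int here only so that the result of a call can be observed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the real functions.
inline int func_a() { return 1; }
inline int func_b() { return 2; }
inline int func_c() { return 3; }

// One shared table of function pointers for all objects.
using func_ptr = int (*)();
static const func_ptr fptrs[] = { func_a, func_b, func_c };

// GNU C/C++ extension: packed removes the padding after the 1-byte index.
struct __attribute__((packed)) int_with_index {
    int i;
    std::uint8_t f;  // index into fptrs, so at most 256 distinct functions
};

// Assumes a 4-byte int, as on common x86 and x86-64 ABIs.
static_assert(sizeof(int_with_index) == 5, "4-byte int + 1-byte index");

// Calls the function selected by the object's table index.
inline int call(const int_with_index& obj) {
    return fptrs[obj.f]();
}
```

The trade-off is that arrays of such packed objects place the int members at unaligned addresses, which is allowed on x86 but may cost something on other architectures.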

No, not really.
The way to specify a function's location is to use a function pointer, which you're already doing.
You could make different types which have their own different member functions, but then you're back to the original problem.
I have in the past experimented with auto-generating (as a pre-build step, using Python) a function with a long switch statement that does the work of mapping int i to a normal function call. This gets rid of the function pointers, at the expense of branching. I don't remember whether it ended up being worthwhile in my case and, even if I did, that wouldn't tell us whether it's worthwhile in your case.
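A sketch of what such a generated dispatcher might look like is shown here. The handler names and the ids are invented for illustration; in the setup described above the switch body would be emitted by the pre-build script.

```cpp
#include <cassert>

// Hypothetical handlers; they return int only so the effect is observable.
inline int handler_zero() { return 10; }
inline int handler_one() { return 11; }
inline int handler_two() { return 12; }

// Maps an integer id to a direct call. No function pointer is stored
// anywhere; the cost is a branch (or a compiler-generated jump table).
inline int dispatch(int i) {
    switch (i) {
        case 0: return handler_zero();
        case 1: return handler_one();
        case 2: return handler_two();
        default: return -1;  // unknown id
    }
}
```

Because the calls are direct, the compiler may inline the handlers into the switch, which is something it usually cannot do through a stored function pointer.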
Because I know that in most assembly languages, functions are just "goto" directives
Well, it's perhaps a little more complicated than that…
You might think that I'm insane to optimize the app that way
Perhaps. Trying to eliminate indirection is not, in itself, a bad thing, so I don't think you're wrong to try to improve this. I just don't think that you necessarily can.
but if anyone has some pointers
lol

I don't understand the goal of this "optimization". Is it about saving memory?
I might be misunderstanding the question, but if you just replace your function pointer with a regular function, then you'll have your struct only containing the int as data and the function-pointer being inserted by the compiler when you take the address of it, instead of stored in memory.
So just do
struct {
int i;
void func();
} test;
Then sizeof(test)==sizeof(int) should hold true if you set alignment/packing to be tight.
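The size claim can be checked at compile time. This is a minimal sketch with a hypothetical struct name and a trivial function body:

```cpp
#include <cassert>

// A non-virtual member function adds nothing to the object's size:
// its address is known to the compiler and is not stored per object.
struct WithMember {
    int i;
    void func() { ++i; }
};

static_assert(sizeof(WithMember) == sizeof(int), "only the int is stored");
```

Note that this only holds for non-virtual functions, and that every object of the type shares the same function, unlike the per-object function pointer in the question.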

Passing value as a function argument vs calculating it twice?

I recall from Agner Fog's excellent guide that 64-bit Linux can pass 6 integer function parameters via registers:
http://www.agner.org/optimize/optimizing_cpp.pdf
(page 8)
I have the following function:
void x(signed int a, uint b, char c, uint d, uint e, signed short f);
and I need to pass an additional unsigned short parameter, which would make 7 in total. However, I can actually derive the value of the 7th from one of the existing 6.
So my question is which of the following is a better practice for performance:
Passing the already-calculated value as a 7th argument on 64-bit Linux
Not passing the already-calculated value, but calculating it again for a second time using one of the existing 6 arguments.
The operation in question is a simple bitwise AND:
unsigned short g = c & 1;
Not fully understanding x86 assembler I am not too sure how precious registers are and whether it is better to recalculate a value as a local variable, than pass it through function calls as an argument?
My belief is that it would be better to calculate the value twice because it is such a simple 1 CPU cycle task.
EDIT I know I can just profile this- but I'd like to also understand what is happening under the hood with both approaches. Having a 7th argument does this mean cache/memory is involved, rather than registers?

The set of machine conventions for passing arguments is part of the application binary interface (or ABI); for Linux x86-64 it is described in the x86-64 ABI spec. See also the x86 calling conventions wiki page.
In your case, it is probably not worthwhile to pass c & 1 as an additional parameter (since that 7th parameter is passed on stack).
Don't forget that current processor cores (on desktop or laptop computers) are often doing out-of-order execution and are superscalar, so the c & 1 operation could be done in parallel with other operations and might cost "nothing".
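The two options from the question can be put side by side as a sketch. The function names and bodies are hypothetical (the question does not show what x does); only the parameter lists matter here.

```cpp
#include <cassert>

// Variant A (hypothetical): the caller passes the precomputed flag as a
// seventh integer argument, which the x86-64 System V ABI puts on the stack.
int x_with_flag(int a, unsigned b, char c, unsigned d, unsigned e,
                short f, unsigned short g) {
    return g ? a + static_cast<int>(b + d + e) + f : c;
}

// Variant B (hypothetical): six arguments, all in registers; the flag is
// recomputed locally with a single AND.
int x_recompute(int a, unsigned b, char c, unsigned d, unsigned e, short f) {
    const unsigned short g = c & 1;
    return g ? a + static_cast<int>(b + d + e) + f : c;
}
```

Both variants compute the same result; if the compiler inlines the call, the difference disappears entirely, so measuring a real build is the only reliable comparison.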
But leave such micro-optimizations to the compiler. If you care a lot about performance, use a recent GCC 4.8 compiler with gcc-4.8 -O3 -flto both for compiling and for linking (i.e. enable link-time optimization).
BTW, cache performance is much more relevant than such micro-optimizations. A single cache miss may take the same time (e.g. 250 nanoseconds) as hundreds of CPU machine instructions. Current CPUs are rumored to spend most of their time waiting for the caches. You might want to add a few explicit (and judicious) calls to __builtin_prefetch (see this question and this answer). But adding too many of these prefetches would slow down your code.
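For illustration, here is a minimal sketch of a judicious prefetch in an indirect-access loop, assuming GCC or Clang (which provide __builtin_prefetch). The function name and the prefetch distance of 8 are assumptions; whether it helps at all has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums values[indices[k]] for every k. The access pattern is unpredictable
// for the hardware prefetcher, so the element needed a few iterations ahead
// is requested early. The prefetch is only a hint and never changes results.
long sum_indirect(const std::vector<int>& values,
                  const std::vector<std::size_t>& indices) {
    const std::size_t ahead = 8;  // prefetch distance: a guess, tune by measuring
    long total = 0;
    for (std::size_t k = 0; k < indices.size(); ++k) {
        if (k + ahead < indices.size()) {
            __builtin_prefetch(&values[indices[k + ahead]]);
        }
        total += values[indices[k]];
    }
    return total;
}
```

On small inputs that already fit in cache, the extra instructions are pure overhead, which is the slowdown the paragraph above warns about.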
Finally, readability and maintainability of your code should matter much more than raw performance!

Basile's answer is good, I'll just point out another thing to keep in mind:
a) The stack is very likely to be in L1 cache, so passing arguments on the stack should not take more than ~3 cycles extra.
b) The ABI (x86-64 System V, in this case) requires clobbered registers to be restored. Some are saved by the caller, others by the callee. Obviously, the registers used to pass arguments must be saved by the caller if the original contents were needed again. But when your function uses more registers than the caller saved, any additional temporary results the function needs to calculate must go into a callee-saved register. So the function ends up spilling a register on the stack, reusing the register for your temporary variable, and then pops the original value back.
The only way you can avoid accessing memory is by using a smaller, simpler function that needs fewer temporary variables.

C++: Pass Vector struct by reference or value?

This is my Vector struct: struct Vector{ float x, y; };
Should I pass it to functions by value or as const Vector&?

If you pass by value, the function will get a copy that it can modify locally without affecting the caller.
If you pass by const-reference, the function will only get a read-only reference. No copying involved, but the called function cannot modify it locally.
Given the size of the struct, the copy overhead will be very small. So choose what is best/easiest for the called function.

For such a small structure, either one may be the most efficient, depending on your platform and compiler.
(The name may be a bit confusing to other programmers, since vector in C++ usually means "dynamic array". How about Vector2D?)

It really depends on your platform and compiler, and whether the function is inline or not.
When passed by reference, the structure is not copied; only its address is passed (in a register or on the stack), not the content. When passed by value, the content is copied. On a 64-bit platform the size of the struct is the same as a pointer to the struct (assuming 64-bit pointers, which seems to be the more common situation). So the gains of passing by reference are not really clear here.
However, there is another thing to consider. Your structure contains float values. On Intel architectures, they can be held in FPU or SIMD registers before the call to the function. In that situation, if the function takes the parameter by reference, they have to be spilled to memory and the address of that memory passed to the function. This can be really slow. If they had been passed by value, no copy to memory would be needed (faster). And on some platforms (PS3), the compiler will not be smart enough to remove that spilling, even for inline functions.
In fact, like every question of micro-optimisation, there is no "good answer"; it all depends on how you use the function and what your compiler and platform want. The best approach is to measure (or use a tool to analyse the assembly) to check what is best for your platform / compiler combination.
I'm going to finish by quoting Jaymin Kessler from Q-Games that is much more versed in those topics than I can ever be:
2) If a type fits in a register, pass it by value. DO NOT PASS VECTOR TYPES BY REFERENCE, ESPECIALLY CONST REFERENCE. If the function ends up getting inlined, GCC occasionally will go to memory when it hits the reference. I’ll say it again: If the type you are using fits in registers (float, int, or vector) do not pass it to the function by anything but value. In the case of non-sane compilers like Visual Studio for x86, it can’t maintain the alignment of objects on the stack, and therefore objects that have align directives must be passed to functions by reference. This may be fixed or the Xbox 360. If you are multiplatform, the best thing you can do is make a parameter passing typedef to avoid having to cater to the lowest common denominator.
Considering the following code:
struct Vector { float x, y; };
extern Vector DoSomething1(Vector v);
extern Vector DoSomething2(const Vector& v);
void Test1()
{
Vector v0 = { 1., 2. };
Vector v1 = DoSomething1(v0);
}
void Test2()
{
Vector v0 = { 1., 2. };
Vector v1 = DoSomething2(v0);
}
From the point of view of the code, the only difference between Test1 and Test2 is the calling convention used by DoSomething1 and DoSomething2 to receive the Vector struct. When compiled with g++ (version 4.2, architecture x86_64), the generated code is:
.globl __Z5Test1v
__Z5Test1v:
LFB2:
movabsq $4611686019492741120, %rax
movd %rax, %xmm0
jmp __Z12DoSomething16Vector
LFE2:
.globl __Z5Test2v
__Z5Test2v:
LFB3:
subq $24, %rsp
LCFI0:
movl $0x3f800000, (%rsp)
movl $0x40000000, 4(%rsp)
movq %rsp, %rdi
call __Z12DoSomething2RK6Vector
addq $24, %rsp
ret
LFE3:
We can see that in the case of Test1, the values are passed via the %xmm0 SIMD register once they have been loaded into it (here from an immediate constant, via %rax; if they were the result of a previous computation, they would already be in the register and there would be no need to load them). On the other hand, in the case of Test2, the values are written to the stack (movl $0x3f800000, (%rsp) stores 1.0f on the stack) and their address is passed in %rdi. If they were the result of a previous computation, that would require copying them from the %xmm0 SIMD register. And that can be really slow (it may well stall the pipeline until the value is available, and if the stack is not properly aligned, the copy will also be slow).
So if your function is not inline, prefer to pass by copy instead of const-reference. If the function is indeed inline, look at the generated code before making up your mind.
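The "parameter passing typedef" mentioned in the quote could look like the following sketch. The macro VECTOR_PASS_BY_REFERENCE, the typedef VectorParam and the function Dot are all hypothetical names chosen for this example.

```cpp
#include <cassert>

struct Vector { float x, y; };

// Hypothetical build switch: define VECTOR_PASS_BY_REFERENCE on platforms
// where measurements show const reference to be the better choice.
#if defined(VECTOR_PASS_BY_REFERENCE)
typedef const Vector& VectorParam;
#else
typedef Vector VectorParam;  // default: pass by value
#endif

// Callers and the function body look the same under either setting.
float Dot(VectorParam a, VectorParam b) {
    return a.x * b.x + a.y * b.y;
}
```

The passing convention is then decided in one place per platform instead of in every function signature.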

As a reference. That is more efficient than making a copy of the structure and passing that (i.e. passing by value).
The only exception is if your platform can fit the entire structure into a register.

may a whole array reside in some cpu register?

I'm not very familiar with CPU registers, in general or on any particular architecture (x86 especially, with VC++ if the answer is compiler-specific). Is it possible for all elements of a tiny array, such as a 4-element array of 1-byte characters, to reside in a CPU register, as I know can be the case for single primitives like double, integer, etc.?
when we have a parameter like below:
void someFunc(char charArray[4]){
//whatever
}
Will this parameter passing be definitely done through passing a pointer to the function or that array would be residing in some cpu register eliminating the need to pass a pointer to main memory?

This is not compiler dependent, nor is it possible. Arrays cannot be passed by value in the same way as other types, i.e. they cannot be copied when passed into a function. The C++ standard is clear in that when processing a function signature in a declaration the following are exact equivalencies:
void foo( char *a );
void foo( char a[] );
void foo( char a[4] );
void foo( char a[ 100000 ] );
A compliant compiler will convert the array in the function signature into a pointer. Now, at the place of call, a similar operation takes place: if the argument is an array, the compiler has to decay it into a pointer to the first element. Again, the size of the array is lost in the decay.
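The equivalence of those declarations can be checked at compile time. A minimal sketch, with a hypothetical function name:

```cpp
#include <cassert>
#include <type_traits>

// The array bound in a parameter declaration is discarded: all of these
// spell the same function type.
static_assert(std::is_same<void(char[4]), void(char*)>::value,
              "char[4] parameter is adjusted to char*");
static_assert(std::is_same<void(char[100000]), void(char[])>::value,
              "the declared size plays no role");

// Inside the function, the parameter really is a pointer.
bool param_is_pointer(char a[4]) {
    return std::is_same<decltype(a), char*>::value && a != nullptr;
}
```

Calling param_is_pointer with a char buf[4] argument makes buf decay to a pointer to its first element, so nothing about the length 4 reaches the callee.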
Specific registers can be used to hold more than one value and perform operations on them (google for vectorized operations, MMX and variants). But while that means that the compiler can actually insert the contents of a small array into a single register, that cannot be used to change the function call that you refer to.

Within a single function, an array could be held in one or more registers, just so long as the compiler is able to produce CPU instructions to manipulate it as the code dictates. The standard doesn't really define what it means for something to "be" in a register. It's a private matter between the compiler and the debugger, and there may be a fine line between something being in a register, and being "optimized away" entirely.
In your example, the parameter is a pointer, not an array (see dribeas' answer). So it would be unusual for the array it points to to be held in a register. The "main" architectures that you probably deal with don't allow a pointer to a register, so even if the array was held in a register in the calling code, it would have to be written into memory in order to take a pointer to it, to pass to the callee.
If the function call was inlined, then better optimizations might be possible, just as if there were no call at all.
If you wrap your array in a struct, then you turn it into something that can be passed by value:
struct Foo {
char a[4];
};
void FooFunc(Foo f) {
// whatever
}
Now, the function is taking the actual array data as its parameter, so there's one less barrier to holding it in a register. Whether the implementation's calling convention actually does pass small structs in registers is another question, though. I don't know what calling conventions do this, if any.
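A short sketch of the by-value semantics this gives you (the function name FirstAfterBump is made up for the example):

```cpp
#include <cassert>

struct Foo {
    char a[4];
};

static_assert(sizeof(Foo) == 4, "the struct is exactly the four array bytes");

// f is a copy of the caller's four bytes; changes made here are not
// visible to the caller, unlike with a char* parameter.
char FirstAfterBump(Foo f) {
    f.a[0] += 1;
    return f.a[0];
}
```

Whether those four bytes travel in a register is still up to the calling convention and the optimiser; the wrapper only removes the language-level obstacle.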

Out of the 5 or so compilers I'm fairly familiar with, (Borland/Turbo C/C++ from 1.0, Watcom C/C++ from v8.0, MSC from 5.0, IBM Visual Age C/C++, gcc of various versions on DOS, Linux and Windows) I've not seen this optimization happen naturally.
There was a string library, whose name I cannot remember, that did optimizations similar to this in x86 ASM. It may have been part of the "Spontaneous Assembly" library, but no guarantees.

A function that accepts an array is probably going to index into that array. I know of no architecture that supports efficient indexing into a register, so it's probably pointless to pass arrays in registers.
(On an x86 architecture, you could access a[0] and a[1] by accessing al and ah of the eax register, but that is a special case that only works if the indexes are known at compile time.)

You asked if it's possible with VC++ on an x86.
I doubt it's possible in that configuration. True, you could produce assembler code where that array is kept in a register, but due to the nature of arrays it would be by no means a natural optimization for a compiler, so I doubt they put it in.
You can try it out though and produce some code where the compiler would have an "incentive" to put it in a register, but it would look pretty weird like
char x[4];
*((int*)x) = 36587467;
Compile that with optimizations and the /FA switch and look at the assembler code produced (and then tell us the results :-))
If you use it in a more "natural" way, like accessing single characters or initializing it with a string there is no reason at all for the compiler to put that array into a register.
Even when passing it to a function - the compiler might put the address of the array into the register, but not the array itself

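Only variables can be stored in a register. You can ask for register storage with the register keyword, as in register int i; but it is only a hint that compilers are free to ignore, and the keyword was deprecated in C++11 and removed in C++17.
Arrays are not pointers, but an array decays to a pointer to its first element when it is passed to a function.
You can get the value at the fourth position like this (using pointer syntax):
char c = *(charArray + 3); // same as charArray[3]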
