Returning Vs. Pointer - c++

How much would performance differ between these two situations?
int func(int a, int b) { return a + b; }
And
void func(int a, int b, int * c) { *c = a + b; }
Now, what if it's a struct?
typedef struct { int a; int b; char c; } my;
my func(int a, int b, char c) { my x; x.a = a; x.b = b; x.c = c; return x; }
And
void func(int a, int b, int c, my * x) { x->a = a; x->b = b; x->c = c; }
One thing I can think of is that a register cannot be used for this purpose, correct? Other than that, I am unaware of how this function would turn out after going through a compiler.
Which would be more efficient and speedy?

If the function can inline, there is often no difference between the first two.
Otherwise (no inlining, e.g. because there is no link-time optimization), returning an int by value is more efficient because it's just a value in a register that can be used right away. Also, the caller doesn't have to pass as many args, or find/make space to point at. With the output-pointer version, if the caller does want to use the output value, it will have to reload it, introducing latency in the total dependency chain from inputs ready to output ready. (Store-forwarding latency is ~5 cycles on modern x86 CPUs, vs. 1 cycle latency for the lea eax, [rdi + rsi] that would implement that function for x86-64 System V.)
The exception is maybe for rare cases where the caller isn't going to use the value, just wants it in memory at some address. Passing that address to the callee (in a register) so it can be used there means the caller doesn't have to keep that address anywhere that will survive across the function call.
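As a minimal sketch of the two calling styles compared above (the names add_value, add_out, and sum_of_sums are made up for illustration): the by-value result can feed the next operation directly, while the out-parameter version needs a caller-owned object to point at. Whether a real store and reload survives depends on inlining and on the compiler.

```cpp
#include <cassert>

// Returns the sum by value; when not inlined on x86-64 System V,
// the result comes back in EAX.
int add_value(int a, int b) { return a + b; }

// Writes the sum through a pointer; the caller must supply storage.
void add_out(int a, int b, int* c) { *c = a + b; }

// Hypothetical caller that uses both styles.
int sum_of_sums(int a, int b) {
    int via_value = add_value(a, b);  // usable right away
    int via_out = 0;                  // needs an addressable object
    add_out(a, b, &via_out);          // if not inlined: store, then reload
    assert(via_value == via_out);     // both styles compute the same sum
    return via_value + via_out;
}
```

Compiling both functions with and without inlining on a compiler explorer is the easiest way to see whether the store/reload is actually emitted for your target.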
For the struct version:
a register cannot be used for this purpose, correct?
No, for some calling conventions, small structs can be returned in registers.
x86-64 System V will return your my struct by value in the RDX:RAX register pair because it's less than 16 bytes and all integer. (And trivially copyable.) Try it on https://godbolt.org/z/x73cEh -
# clang11.0 -O3 for x86-64 SysV
func_val:
shl rsi, 32
mov eax, edi
or rax, rsi # (uint64_t)b<<32 | a; the low 64 bits of the struct
# c was already in EDX, the low half of RDX; clang leaves it there.
ret
func_out:
mov dword ptr [rcx], edi
mov dword ptr [rcx + 4], esi # just store the struct members
mov byte ptr [rcx + 8], dl # to memory pointed-to by 4th arg
ret
GCC doesn't assume that char c is correctly sign-extended to EDX the way clang does (unofficial ABI feature). GCC does a really dumb byte store / dword reload that creates a store-forwarding stall, to get uninitialized garbage from memory instead of from high bytes of EDX. Purely a missed optimization, but see it in https://godbolt.org/z/WGcqKc. It also insanely uses SSE2 to merge the two integers into a 64-bit value before doing a movq rax, xmm0, or to memory for the output-arg.
You definitely want the struct version to inline if the caller uses the values, so this packing into return-value registers can be optimized away.
How does function ACTUALLY return struct variable in C? has an ARM example for a larger struct: return by value passes a hidden pointer to the caller's return-value object. From there, it may need to be copied by the caller if assigning to something that escape analysis can't prove is private. (e.g. through some pointer). What prevents the usage of a function argument as hidden pointer?
Also related: Why is tailcall optimization not performed for types of class MEMORY?
How do C compilers implement functions that return large structures? points out that code-gen may differ between C and C++.
I don't know how to explain any general rule of thumb that one could apply without understanding asm and the calling convention you care about. Usually pass/return large structs by reference, but for small structs it's very much "it depends".
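The register-return condition mentioned above for x86-64 System V (at most 16 bytes, trivially copyable) can be written down as a compile-time check. This is only a rough heuristic sketch: maybe_returned_in_regs, big, make_my, and demo_make_my are hypothetical names, the real ABI classification has more rules (for example for floating-point members), and other calling conventions differ.

```cpp
#include <cassert>
#include <type_traits>

struct my { int a; int b; char c; };
struct big { char bytes[64]; };   // hypothetical "large" struct

// Heuristic only: mirrors the "at most 16 bytes and trivially copyable"
// condition described for x86-64 System V. Not a full ABI classifier.
template <class T>
constexpr bool maybe_returned_in_regs =
    std::is_trivially_copyable<T>::value && sizeof(T) <= 16;

static_assert(maybe_returned_in_regs<my>, "small struct: register candidate");
static_assert(!maybe_returned_in_regs<big>, "large struct: hidden pointer");

// By-value return of the small struct, as in the question.
my make_my(int a, int b, char c) {
    my x;
    x.a = a;
    x.b = b;
    x.c = c;
    return x;
}

// Usage sketch with arbitrary values.
void demo_make_my() {
    const my m = make_my(1, 2, 'c');
    assert(m.a == 1 && m.b == 2 && m.c == 'c');
}
```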

Related

How does GCC (not clang) make this optimization deciding that a store to one struct member couldn't affect a member of another?

This is the code in question:
struct Cell
{
Cell* U;
Cell* D;
void Detach();
};
void Cell::Detach()
{
U->D = D;
D->U = U;
}
clang-14 -O3 produces:
mov rax, qword ptr [rdi] <-- rax = U
mov rcx, qword ptr [rdi + 8] <-- rcx = D
mov qword ptr [rax + 8], rcx <-- U->D = D
mov rcx, qword ptr [rdi + 8] <-- this queries the D field again
mov qword ptr [rcx], rax <-- D->U = U
gcc 11.2 -O3 produces almost the same, but leaves out one mov:
mov rdx, QWORD PTR [rdi]
mov rax, QWORD PTR [rdi+8]
mov QWORD PTR [rdx+8], rax
mov QWORD PTR [rax], rdx
Clang reads the D field twice, while GCC reads it only once and re-uses it. Apparently GCC is not afraid of the first assignment changing anything that has an impact on the second assignment. I'm trying to understand if/when this is allowed.
Checking correctness gets a bit complicated when U or D point at themselves, each other and/or the same target.
My understanding is that the shorter code of GCC is correct if it is guaranteed that the pointers point at the beginning of a Cell (never inside it), regardless of which Cell it is.
Following this line of thought further, this is the case when a) Cells are always aligned to their size, and b) no custom manipulation of such a pointer occurs (referencing and arithmetic are fine).
I suspect case a) is guaranteed by the compiler, and case b) would require invoking undefined behavior of some sort, and as such can be ignored.
This would explain why GCC allows itself this optimization.
Is my reasoning correct? If so, why does clang not make the same optimization?
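The aliasing cases raised above (U or D pointing at the cell itself, at each other, or at the same target) can be checked empirically. The following is only a sketch: detach_reload, detach_cached, run, and variants_agree are hypothetical names, and the three-cell ring is an arbitrary test setup. It compares a version that re-reads the fields after the first store with one that caches both fields up front, for every combination of properly aligned targets.

```cpp
#include <cassert>

struct Cell {
    Cell* U;
    Cell* D;
};

// Re-reads c->D (and c->U) after the first store, like the clang output.
void detach_reload(Cell* c) {
    c->U->D = c->D;
    c->D->U = c->U;
}

// Loads both fields once and reuses them, like the GCC output.
void detach_cached(Cell* c) {
    Cell* u = c->U;
    Cell* d = c->D;
    u->D = d;
    d->U = u;
}

// Builds a ring of three cells, points cell 0's U and D at the chosen
// cells, detaches cell 0, and encodes all resulting links as one number.
template <class F>
int run(F detach, int u, int d) {
    Cell cells[3];
    for (int i = 0; i < 3; ++i) {
        cells[i].U = &cells[(i + 2) % 3];
        cells[i].D = &cells[(i + 1) % 3];
    }
    cells[0].U = &cells[u];
    cells[0].D = &cells[d];
    detach(&cells[0]);
    int code = 0;
    for (int i = 0; i < 3; ++i) {
        code = code * 9 + int(cells[i].U - cells) * 3 + int(cells[i].D - cells);
    }
    return code;
}

// True if both variants leave identical links for all nine combinations.
bool variants_agree() {
    for (int u = 0; u < 3; ++u) {
        for (int d = 0; d < 3; ++d) {
            if (run(detach_reload, u, d) != run(detach_cached, u, d)) {
                return false;
            }
        }
    }
    return true;
}
```

Note that when U points at the cell itself, the first store writes D back into the D field it was read from, so the reloaded value is unchanged. The sketch says nothing about misaligned or otherwise forged pointers.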
There are many potential optimizations in C and C++ that are usually safe, but aren't quite sound. If one regards the -> operator as being usable to build a standard-layout object without having to use placement new on it first (an abstraction model that is relied upon by a lot of code, whether or not the Standard mandates support), removing the if (mode) in the following C and C++ functions would be such an optimization.
C version:
struct s { int x,y; }; /* Assume int is 4 bytes, and struct is 8 */
void test(struct s *p1, struct s *p2, int mode)
{
p1->y = 1;
p2->x = 2;
if (mode)
p1->y = 1;
}
C++ version:
#include <new>
struct s { int x,y; };
void test(void *vp1, void *vp2, int mode)
{
if (1)
{
struct s* p1 = new (vp1) struct s;
p1->y = 1;
}
if (1)
{
struct s* p2 = new (vp2) struct s;
p2->x = 2;
}
if (mode)
{
struct s* p3 = new (vp1) struct s;
p3->y = 1;
}
}
The optimization would be correct unless the address in p2 is four bytes higher than p1. Under the "traditional" abstraction model used in C or C++, if the address of p1 happens to be 0x1000 and that of p2 happens to be 0x1004, the first assignment would cause addresses 0x1000-0x1007 to hold a struct s, if it didn't already, whose second member (at address 0x1004) would equal 1. The second assignment, by overwriting that object, would end its lifetime and cause addresses 0x1004 to 0x100B to hold a struct s whose first member would equal 2. The third assignment, if executed, would end the lifetime of that second object and re-create the first.
If the third assignment is executed, there would be an object at address 0x1000 whose second field (at address 0x1004) would hold the readable value 1. If the assignment is skipped, there would be an object at address 0x1004 whose first field would hold the value 2. Behavior would be defined in both cases, and a compiler that didn't know which case would apply would have to accommodate both of them by making the value at 0x1004 depend upon mode.
As it happens, the authors of clang do not seem to have provided for that corner case, and thus omit the conditional check. While I think the Standard should use an abstraction model that would allow such optimization, while also supporting the common structure-creation pattern in situations that don't involve weird aliasing corner cases, I don't see any way of interpreting the Standard that would allow for such optimization without allowing compilers to arbitrarily break a large amount of existing code.
I don't think there's any general way of knowing when a decision by gcc or clang not to impose a particular optimization represents a recognition of potential corner cases where the optimization would be incorrect, and an inability to prove that none of them apply, and when it simply represents an oversight which may be "corrected" so as to replace correct behavior with an unsound optimization.

Access through reference overhead vs copy overhead

Let's say that I want to pass a POD object to a function as a const argument. I know that for simple types like int and double passing by value is better than by const reference because of the reference overhead. But at what size is it worth passing by reference?
struct arg
{
...
};
void foo(const arg input)
{
// read from input
}
or
void foo(const arg& input)
{
// read from input
}
i.e., at what size of struct arg should I start using the latter approach?
I should also mention that I'm not talking about copy elision here. Let's suppose that it doesn't happen.
TL;DR: This depends highly on the target architecture, the compiler and the context in which the functions are invoked. When unsure, profile and manually inspect generated code.
If the functions are inlined, a good optimizing compiler will probably emit the exact same code in both cases.
If the functions are not inlined, however, the ABI on most C++ implementations dictates that a const& argument is passed as a pointer. That means the structure has to be stored in memory just so one can get its address. This can have a significant impact on performance for small objects.
Let's take x86_64 Linux G++ 8.2 as an example...
A struct with 2 members:
struct arg
{
int a;
long b;
};
int foo1(const arg input)
{
return input.a + input.b;
}
int foo2(const arg& input)
{
return input.a + input.b;
}
Generated assembly:
foo1(arg):
lea eax, [rdi+rsi]
ret
foo2(arg const&):
mov eax, DWORD PTR [rdi]
add eax, DWORD PTR [rdi+8]
ret
The first version passes the structure entirely in registers; the second passes a pointer, so the members have to be loaded from memory.
Now let's try 3 members:
struct arg
{
int a;
long b;
int c;
};
int foo1(const arg input)
{
return input.a + input.b + input.c;
}
int foo2(const arg& input)
{
return input.a + input.b + input.c;
}
Generated assembly:
foo1(arg):
mov eax, DWORD PTR [rsp+8]
add eax, DWORD PTR [rsp+16]
add eax, DWORD PTR [rsp+24]
ret
foo2(arg const&):
mov eax, DWORD PTR [rdi]
add eax, DWORD PTR [rdi+8]
add eax, DWORD PTR [rdi+16]
ret
Not a whole lot of difference anymore, although using the second version will still be a bit slower because it requires the address to be put in rdi.
Does it really matter that much?
Usually not. If you care about performance of a particular function, it's probably called frequently and is therefore small. As such, it will most likely be inlined.
Let's try invoking the two functions above:
int test(int x)
{
arg a {x, x};
return foo1(a) + foo2(a);
}
Generated assembly:
test(int):
lea eax, [0+rdi*4]
ret
Voilà. It's all moot now. The compiler inlined and merged both functions into a single instruction!
A reasonable rule of thumb: If the size of the class is the same as or less than the size of a pointer, then copying may be a bit faster.
If the size of the class is slightly higher, then it may be hard to predict. The difference is often insignificant.
If the size of the class is humongous, then copying is likely slower. That said, the point is moot since humongous objects can't in practice have automatic storage, because it is limited.
If the function is expanded inline, then there is probably no difference whatsoever.
To find out whether one program is faster than the other on a particular system, and whether the difference is significant in the first place, you can use a profiler.
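The rule of thumb above can be encoded as a type alias that picks the parameter type from the object size. This is a hedged sketch, not a recommendation: param_t, Small, Large, read_small, read_large, and demo_param_t are hypothetical names, and the pointer-size threshold is just the one suggested above. Profiling remains the real test.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: T for small trivially copyable types,
// const T& otherwise. Limit defaults to the size of a pointer.
template <class T, std::size_t Limit = sizeof(void*)>
using param_t = typename std::conditional<
    std::is_trivially_copyable<T>::value && (sizeof(T) <= Limit),
    T, const T&>::type;

struct Small { int id; };        // typically 4 bytes
struct Large { double v[16]; };  // typically 128 bytes

static_assert(std::is_same<param_t<Small>, Small>::value,
              "small struct is passed by value");
static_assert(std::is_same<param_t<Large>, const Large&>::value,
              "large struct is passed by const reference");

int read_small(param_t<Small> s) { return s.id; }
double read_large(param_t<Large> l) { return l.v[0]; }

// Usage sketch with arbitrary values.
void demo_param_t() {
    Large big{};
    big.v[0] = 1.5;
    assert(read_small(Small{7}) == 7);
    assert(read_large(big) == 1.5);
}
```

One design caveat: an alias like this sits in a non-deduced context, so it works for functions on known types but not for deducing T in a function template.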
In addition to the other responses, there are also optimization concerns.
Since it's a reference, the compiler cannot know whether the reference points to a mutable global variable or not. When calling any function whose definition is not visible in the current TU, the compiler must assume the variable may have been mutated.
For example, if you have an if depending on a data member of Foo, call a function, then use the same data member, the compiler will be forced to emit two separate loads, whereas if the variable is local, it knows it cannot be mutated elsewhere. Here's an example:
struct Foo {
int data;
};
extern void use_data(int);
void bar(Foo const& foo) {
int const& data = foo.data;
// may mutate foo.data through a global Foo
use_data(data);
// must load foo.data again through the reference
use_data(data);
}
If the variable is local, the compiler will simply reuse the value already inside the registers.
Here's a compiler explorer example that shows the optimization being applied only if the variable is local.
This is why the "general advice" will give you good performance, but won't give you optimal performance. You must measure and profile your code if you truly care about its performance.
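To make the reload requirement concrete, here is a small sketch in which the external function really does mutate a global that the reference aliases. The names g_foo, g_seen, bar_ref, bar_val, and observe are hypothetical, and use_data is given a body here purely so the effect can be observed; in the example above it is an opaque extern.

```cpp
#include <cassert>
#include <vector>

struct Foo {
    int data;
};

Foo g_foo{0};             // hypothetical mutable global
std::vector<int> g_seen;  // records every value use_data() receives

// Stand-in for the opaque external function: it mutates the global.
void use_data(int v) {
    g_seen.push_back(v);
    ++g_foo.data;
}

void bar_ref(Foo const& foo) {
    use_data(foo.data);
    use_data(foo.data);  // must observe the update if foo aliases g_foo
}

void bar_val(Foo foo) {
    use_data(foo.data);
    use_data(foo.data);  // private copy: same value as the first call
}

// Resets the global state, runs one variant on g_foo, returns the log.
std::vector<int> observe(bool by_ref) {
    g_foo.data = 0;
    g_seen.clear();
    if (by_ref) {
        bar_ref(g_foo);
    } else {
        bar_val(g_foo);
    }
    assert(g_seen.size() == 2);
    return g_seen;
}
```

The by-reference variant logs 0 then 1, while the by-value variant logs 0 twice, which is why the compiler can keep the value in a register only in the second case.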

C++ XOR swap on float values

Is it possible to use XOR swapping algorithm with float values in c++?
However wikipedia says that:
XOR bitwise operation to swap values of distinct variables having the same data type
but I am a bit confused. With this code:
#include <iostream>
void xorSwap (int* x, int* y) {
*x ^= *y;
*y ^= *x;
*x ^= *y;
}
int main() {
float a = 255.33333f;
float b = 0.123023f;
xorSwap(reinterpret_cast<int*>(&a), reinterpret_cast<int*>(&b));
std::cout << a << ", " << b << "\n";
return 0;
}
it seems to work (at least under gcc), but I am concerned if such a practice is allowed if needed?
Technically, what you ask is possible but, as clearly commented by IInspectable, it causes UB (Undefined Behaviour).
Anyway, I suggest you use std::swap instead; it's a template often specialized for specific data types and designed to do a good job.
If int is the same size as float, it is going to work in practice on any reasonable architecture. The memory in the float is a set of bits; you interpret those bits as ints and swap them completely using xor operations. You can then use those bits as the proper floats again. The quote you refer to only says that the two values you swap need to be of the same type, and both are ints.
However, on some architectures, this can result in a movement between different sorts of registers, or explicitly flushing registers to memory. What you will see on almost any sane architecture with a sane optimizing compiler these days, is that an explicit swap, using std::swap or an expression with a temporary variable, is in fact faster.
I.e. you should write:
float a = 255.33333f;
float b = 0.123023f;
float tmp = a;
a = b;
b = tmp;
or preferably:
float a = 255.33333f;
float b = 0.123023f;
std::swap(a,b);
If the standard library author for your architecture has determined xor swapping to indeed be beneficial, then you should hope that the last form will use it. xor swapping is a typical bad idiom in terms of hiding intent in an unnecessarily arcane implementation. It was only ever efficient in seriously register-starved cases with bad optimizers.
Your code there invokes undefined behavior. It's not legal in C or C++ to cast a float* to an int* and use it as such. reinterpret_cast should be used to convert between unrelated structs with compatible layouts, or to temporarily convert between a typed pointer and void*.
Oh, and in this particular case the UB is not of merely academic concern. A compiler may notice that xorSwap() doesn't touch any floats, perform optimizations allowed to it by the aliasing rules of the language, and print out the original values of a and b instead of the swapped values. And that's not even getting into architectures where int and float are of different sizes or alignments.
If you wanted to do this safely, you'd have to memcpy() from your floats into unsigned char arrays, do the XOR in a loop, then memcpy() back. Which would of course make the operation slower than a normal swap. Of course, xor-based swapping is ALREADY slower than normal swapping.
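A minimal sketch of the memcpy-based approach just described, assuming nothing beyond the standard library. The names xor_swap_float and demo_xor_swap are made up for illustration. It is shown only to make the well-defined variant concrete, not because it is faster than std::swap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Swaps two floats by XOR-ing their object representations.
// The bytes are copied out with memcpy, so no aliasing rule is violated.
void xor_swap_float(float& a, float& b) {
    unsigned char x[sizeof(float)];
    unsigned char y[sizeof(float)];
    std::memcpy(x, &a, sizeof x);
    std::memcpy(y, &b, sizeof y);
    for (std::size_t i = 0; i < sizeof(float); ++i) {
        x[i] ^= y[i];
        y[i] ^= x[i];
        x[i] ^= y[i];
    }
    std::memcpy(&a, x, sizeof x);
    std::memcpy(&b, y, sizeof y);
}

// Usage sketch with the values from the question.
void demo_xor_swap() {
    float a = 255.33333f;
    float b = 0.123023f;
    xor_swap_float(a, b);
    assert(a == 0.123023f);
    assert(b == 255.33333f);
}
```

Because the bytes are XOR-ed in separate local arrays, calling it with the same object for both arguments leaves the value intact, unlike the classic in-place XOR swap.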
It is:
a) possible when the compiler allows it.
b) an operation for which the standard does not define behaviour (i.e. undefined behaviour)
c) on gcc, actually less efficient than stating exactly what you want:
given:
void xorSwap (unsigned int* x, unsigned int* y) {
*x ^= *y;
*y ^= *x;
*x ^= *y;
}
void swapit3(float& a, float&b)
{
xorSwap(reinterpret_cast<unsigned int*>(&a), reinterpret_cast<unsigned int*>(&b));
}
results in this:
swapit3(float&, float&): # #swapit3(float&, float&)
mov eax, dword ptr [rdi]
xor eax, dword ptr [rsi]
mov dword ptr [rdi], eax
xor eax, dword ptr [rsi]
mov dword ptr [rsi], eax
xor dword ptr [rdi], eax
ret
whereas this:
void swapit2(float& a, float&b)
{
std::swap(a,b);
}
results in this:
swapit2(float&, float&): # #swapit2(float&, float&)
mov eax, dword ptr [rdi]
mov ecx, dword ptr [rsi]
mov dword ptr [rdi], ecx
mov dword ptr [rsi], eax
ret
link: https://godbolt.org/g/K4cazx

compiler memory optimization - reusing existing blocks

Say I were to allocate 2 memory blocks.
I use the first memory block to store something and use this stored data.
Then I use the second memory block to do something similar.
{
int a[10];
int b[10];
setup_0(a);
use_0(a);
setup_1(b);
use_1(b);
}
|| compiler optimizes this to this?
\/
{
int a[10];
setup_0(a);
use_0(a);
setup_1(a);
use_1(a);
}
// the setup functions overwrites all 10 words
The question is now: Do compilers optimize this, so that they reuse the existing memory blocks, instead of allocating a second one, if the compiler knows that the first block will not be referenced again?
If this is true:
Does this also work with dynamic memory allocation?
Is this also possible if the memory persists outside the scope, but is used in the same way as given in the example?
I assume this only works if setup and foo are implemented in the same c file (exist in the same object as the calling code)?
Do compilers optimize this
This question can only be answered if you ask about a particular compiler. And the answer can be found by inspecting the generated code.
so that they reuse the existing memory blocks, instead of allocating a second one, if the compiler knows that the first block will not be referenced again?
Such optimization would not change the behaviour of the program, so it would be allowed. Another matter is: Is it possible to prove that the memory will not be referenced? If it is possible, then is it easy enough to prove in reasonable time? I feel very safe in saying that it is not possible to prove in general, but it is provable in some cases.
I assume this only works if setup and foo are implemented in the same c file (exist in the same object as the calling code)?
That would usually be required to prove the untouchability of the memory. Link time optimization might lift this requirement, in theory.
Does this also work with dynamic memory allocation?
In theory, since it doesn't change the behaviour of the program. However, the dynamic memory allocation is typically performed by a library and thus the compiler may not be able to prove the lack of side-effects and therefore wouldn't be able to prove that removing an allocation wouldn't change behaviour.
Is this also possible if the memory persists outside the scope, but is used in the same way as given in the example?
If the compiler is able to prove that the memory is leaked, then perhaps.
Even though the optimization may be possible, it is not very significant. Saving a bit of stack space probably has very little effect on run time. It could be useful to prevent stack overflows if the arrays are large.
https://godbolt.org/g/5nDqoC
#include <cstdlib>
extern int a;
extern int b;
int main()
{
{
int tab[1];
tab[0] = 42;
a = tab[0];
}
{
int tab[1];
tab[0] = 42;
b = tab[0];
}
return 0;
}
Compiled with gcc 7 with -O3 compilation flag:
main:
mov DWORD PTR a[rip], 42
mov DWORD PTR b[rip], 42
xor eax, eax
ret
If you follow the link you should see the code being compiled on gcc and clang with the -O3 optimisation level. The resulting asm code is pretty straightforward. As the value stored in the array is known at compilation time, the compiler can easily skip everything and directly set the variables a and b. Your buffer is not needed.
Following a code similar to the one provided in your example:
https://godbolt.org/g/bZHSE4
#include <cstdlib>
int func1(const int (&tab)[10]);
int func2(const int (&tab)[10]);
int main()
{
int a[10];
int b[10];
func1(a);
func2(b);
return 0;
}
Compiled with gcc 7 with -O3 compilation flag:
main:
sub rsp, 104
mov rdi, rsp ; first address is rsp
call func1(int const (&) [10])
lea rdi, [rsp+48] ; second address is [rsp+48]
call func2(int const (&) [10])
xor eax, eax
add rsp, 104
ret
You can see that the pointers passed to func1 and func2 are different: the first is rsp in the call to func1, and the second is [rsp+48] in the call to func2.
So either the compiler eliminates your buffers entirely, in the case where their contents are predictable, or, in the other case, at least for gcc 7 and clang 3.9.1, the two arrays are not merged.
https://godbolt.org/g/TnV62V
#include <cstdlib>
extern int * a;
extern int * b;
inline int do_stuff(int ** to)
{
*to = (int *) malloc(sizeof(int));
(**to) = 42;
return **to;
}
int main()
{
do_stuff(&a);
free(a);
do_stuff(&b);
free(b);
return 0;
}
Compiled with gcc 7 with -O3 compilation flag:
main:
sub rsp, 8
mov edi, 4
call malloc
mov rdi, rax
mov QWORD PTR a[rip], rax
call free
mov edi, 4
call malloc
mov rdi, rax
mov QWORD PTR b[rip], rax
call free
xor eax, eax
add rsp, 8
ret
Even without being fluent at reading assembly, it is pretty easy to tell that in the example above the malloc and free calls are not optimized away by either gcc or clang (if you want to try more compilers, go ahead, but don't forget to set the optimization flag).
You can clearly see a call to "malloc" followed by a call to "free", twice.
Optimizing stack space is quite unlikely to really have an effect on the speed of your program, unless you manipulate large amounts of data.
Optimizing dynamically allocated memory is more relevant. AFAIK you will have to use a third-party library or run your own system if you plan to do that and this is not a trivial task.
EDIT: Forgot to mention the obvious, this is very compiler dependent.
As the compiler sees that a is used as a parameter for a function, it will not optimize b away. It can't, because it doesn't know what happens in the function that uses a and b. Same for a: the compiler doesn't know that a isn't used anymore.
As far as the compiler is concerned, the address of a could e.g. have been stored by setup_0 in a global variable and be used by setup_1 when it is called with b.
The short answer is: No! The compiler cannot optimize this code to what you suggested, because it is not semantically equivalent.
Long explanation: The lifetime of a and b is, with some simplification, the complete block.
So now let's assume that one of setup_0 or use_0 stores a pointer to a in some global variable. Now setup_1 and use_1 are allowed to use a via this global variable in combination with b (they can, for example, add the array elements of a and b). If the transformation you suggested were applied to the code, this would result in undefined behaviour. If you really want to make a statement about the lifetime, you have to write the code in the following way:
{
{ // Lifetime block for a
char a[100];
setup_0(a);
use_0(a);
} // Lifetime of a ends here, so none of the functions called below
// is allowed to access it. If one does access it by
// accident, it is undefined behaviour
char b[100];
setup_1(b); // Not allowed to access a
use_1(b); // Not allowed to access a
}
Please also note that gcc 12.x and clang 15 both do the optimization. If you comment out the curly brackets, the optimization is (correctly!) not done.
Yes, theoretically, a compiler could optimize the code as you describe, assuming that it could prove that these functions did not modify the arrays passed in as parameters.
But in practice, no, that does not happen. You can write a simple test case to verify this. I've avoided defining the helper functions so the compiler can't inline them, but passed the arrays by const-reference to ensure that the compiler knows the functions don't modify them:
void setup_0(const int (&p)[10]);
void use_0 (const int (&p)[10]);
void setup_1(const int (&p)[10]);
void use_1 (const int (&p)[10]);
void TestFxn()
{
int a[10];
int b[10];
setup_0(a);
use_0(a);
setup_1(b);
use_1(b);
}
As you can see here on Godbolt's Compiler Explorer, none of the compilers tested (GCC, Clang, ICC, MSVC) will optimize this to use a single stack-allocated array of 10 elements. Of course, each compiler varies in how much space it allocates on the stack. Some of that is due to different calling conventions, which may or may not require a red zone. Otherwise, it's due to the optimizer's alignment preferences.
Taking GCC's output as an example, you can immediately tell that it is not reusing the array a. The following is the disassembly, with my annotations:
; Allocate 104 bytes on the stack
; by subtracting from the stack pointer, RSP.
; (The stack always grows downward on x86.)
sub rsp, 104
; Place the address of the top of the stack in RDI,
; which is how the array is passed to setup_0().
mov rdi, rsp
call setup_0(int const (&) [10])
; Since setup_0() may have clobbered the value in RDI,
; "refresh" it with the address at the top of the stack,
; and call use_0().
mov rdi, rsp
call use_0(int const (&) [10])
; We are now finished with array 'a', so add 48 bytes
; to the top of the stack (RSP), and place the result
; in the RDI register.
lea rdi, [rsp+48]
; Now, RDI contains what is effectively the address of
; array 'b', so call setup_1().
; The parameter is passed in RDI, just like before.
call setup_1(int const (&) [10])
; Second verse, same as the first: "refresh" the address
; of array 'b' in RDI, since it might have been clobbered,
; and pass it to use_1().
lea rdi, [rsp+48]
call use_1(int const (&) [10])
; Clean up the stack by adding 104 bytes to compensate for the
; same 104 bytes that we subtracted at the top of the function.
add rsp, 104
ret
So, what gives? Are compilers just massively missing the boat here when it comes to an important optimization? No. Allocating space on the stack is extremely fast and cheap. There would be very little benefit in allocating ~50 bytes, as opposed to ~100 bytes. Might as well just play it safe and allocate enough space for both arrays separately.
There might be more of a benefit in reusing the stack space for the second array if both arrays were extremely large, but empirically, compilers don't do this, either.
Does this work with dynamic memory allocation? No. Emphatically no. I've never seen a compiler that optimizes around dynamic memory allocation like this, and I don't expect to see one. It just doesn't make sense. If you wanted to re-use the block of memory, you would have written the code to re-use it instead of allocating a separate block.
I suppose you are thinking that if you had something like the following C code:
void TestFxn()
{
int* a = malloc(sizeof(int) * 10);
setup_0(a);
use_0(a);
free(a);
int* b = malloc(sizeof(int) * 10);
setup_1(b);
use_1(b);
free(b);
}
that the optimizer could see that you were freeing a, and then immediately re-allocating a block of the same size as b? Well, the optimizer won't recognize this and elide the back-to-back calls to free and malloc, but the run-time library (and/or operating system) very likely will. free is a very cheap operation, and since a block of the appropriate size was just released, allocation will also be very cheap. (Most run-time libraries maintain a private heap for the application and won't even return the memory to the operating system, so depending on the memory-allocation strategy, it's even possible that you get the exact same block back.)
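If re-using the block is what you want, writing the reuse explicitly is straightforward. The following is a sketch under stated assumptions: the helpers are given made-up bodies (the originals are only declared), sum_all and reuse_one_block are hypothetical names, and std::vector stands in for the malloc'd block.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Made-up bodies: each setup overwrites every element of the buffer.
void setup_0(std::vector<int>& buf) { std::iota(buf.begin(), buf.end(), 0); }
void setup_1(std::vector<int>& buf) { std::iota(buf.begin(), buf.end(), 10); }

// Hypothetical "use" step: sums the buffer contents.
int sum_all(const std::vector<int>& buf) {
    return std::accumulate(buf.begin(), buf.end(), 0);
}

// One allocation serves both phases instead of a malloc/free pair each.
int reuse_one_block() {
    std::vector<int> buf(10);       // single allocation of 10 ints
    const int* block = buf.data();  // remember where the block lives
    setup_0(buf);
    const int first = sum_all(buf);
    setup_1(buf);                   // overwrites all 10 elements in place
    const int second = sum_all(buf);
    assert(buf.data() == block);    // still the same block of memory
    return first + second;
}
```

This makes the intent visible to both the reader and the compiler, rather than relying on the allocator happening to hand back the same block.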

How to optimize function return values in C and C++ on x86-64?

The x86-64 ABI specifies two return registers: rax and rdx, both 64 bits (8 bytes) in size.
Assuming that x86-64 is the only targeted platform, which of these two functions:
uint64_t f(uint64_t * const secondReturnValue) {
/* Calculate a and b. */
*secondReturnValue = b;
return a;
}
std::pair<uint64_t, uint64_t> g() {
/* Calculate a and b, same as in f() above. */
return { a, b };
}
would yield better performance, given the current state of C/C++ compilers targeting x86-64? Are there any pitfalls performance-wise using one or the other version? Are compilers (GCC, Clang) always able to optimize the std::pair to be returned in rax and rdx?
UPDATE: Generally, returning a pair is faster if the compiler optimizes out the std::pair methods (examples of binary output with GCC 5.3.0 and Clang 3.8.0). If f() is not inlined, the compiler must generate code to write a value to memory, e.g:
movq b, (%rdi)
movq a, %rax
retq
But in case of g() it suffices for the compiler to do:
movq a, %rax
movq b, %rdx
retq
Because instructions for writing values to memory are generally slower than instructions for writing values to registers, the second version should be faster.
Since the ABI specifies that in some particular cases two registers have to be used for the 2-word result, any conforming compiler has to obey that rule.
However, for such tiny functions I guess that most of the performance will come from inlining.
You may want to compile and link with g++ -flto -O2 using link-time optimizations.
I guess that the second function (returning a pair through 2 registers) might be slightly faster, and that perhaps in some situations the GCC compiler could inline and optimize the first into the second.
But you really should benchmark if you care that much.
Note that the ABI specifies packing any small struct into registers for passing/returning (if it contains only integer types). This means that returning a std::pair<uint32_t, uint32_t> means the values have to be shift+ORed into rax.
This is probably still better than a round trip through memory, because setting up space for a pointer, and passing that pointer as an extra arg, has some overhead. (Other than that, though, a round-trip through L1 cache is pretty cheap, like ~5c latency. The store/load are almost certainly going to hit in L1 cache, because stack memory is used all the time. Even if it misses, store-forwarding can still happen, so execution doesn't stall until the ROB fills because the store can't retire. See Agner Fog's microarch guide and other stuff at the x86 tag wiki.)
Anyway, here's the kind of code you get from gcc 5.3 -O2, using functions that take args instead of returning compile-time constant values (which would lead to movabs rax, 0x...):
#include <cstdint>
#include <utility>
#define type_t uint32_t
type_t f(type_t * const secondReturnValue, type_t x) {
*secondReturnValue = x+4;
return x+2;
}
lea eax, [rsi+4] # LEA is an add-and-shift instruction that uses memory-operand syntax and encoding
mov DWORD PTR [rdi], eax
lea eax, [rsi+2]
ret
std::pair<type_t, type_t> g(type_t x) { return {x+2, x+4}; }
lea eax, [rdi+4]
lea edx, [rdi+2]
sal rax, 32
or rax, rdx
ret
type_t use_pair(std::pair<type_t, type_t> pair) {
return pair.second + pair.first;
}
mov rax, rdi
shr rax, 32
add eax, edi
ret
So it's really not bad at all. Two or three insns in the caller and callee to pack and unpack a pair of uint32_t values. Nowhere near as good as returning a pair of uint64_t values, though.
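The packing that the asm above performs for a pair of uint32_t values can be written out by hand. This is only an illustration of the layout (first member in the low 32 bits, second in the high 32 bits); pack, unpack, and demo_pack are hypothetical names, and the compiler does this for you when returning the pair by value.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Mirrors "sal rax, 32; or rax, rdx": second goes in the high half,
// first in the low half of one 64-bit register.
std::uint64_t pack(std::pair<std::uint32_t, std::uint32_t> p) {
    return (static_cast<std::uint64_t>(p.second) << 32) | p.first;
}

// Mirrors "mov rax, rdi; shr rax, 32": splits the register again.
std::pair<std::uint32_t, std::uint32_t> unpack(std::uint64_t r) {
    return {static_cast<std::uint32_t>(r),
            static_cast<std::uint32_t>(r >> 32)};
}

// Usage sketch: round-trips the values g(4) would return, {6, 8}.
void demo_pack() {
    const std::uint64_t r = pack({6u, 8u});
    assert(r == 0x0000000800000006ULL);
    const std::pair<std::uint32_t, std::uint32_t> p = unpack(r);
    assert(p.first == 6u && p.second == 8u);
}
```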
If you're specifically optimizing for x86-64, and care what happens for non-inlined functions with multiple return values, then prefer returning std::pair<uint64_t, uint64_t> (or int64_t, obviously), even if you assign those pairs to narrower integers in the caller. Note that in the x32 ABI (-mx32), pointers are only 32 bits. Don't assume pointers are 64-bit when optimizing for x86-64, if you care about that ABI.
If either member of the pair is 64-bit, they use separate registers. It doesn't do anything stupid like splitting one value between the high half of one reg and the low half of another.
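For comparison, here are the two interfaces from the question side by side, with a made-up computation filling in "Calculate a and b". The names divmod_out, divmod_pair, and sum_parts are hypothetical; the point is only the shape of the call sites, not a measured result.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Out-parameter style: the second result goes through memory
// unless the call is inlined.
std::uint64_t divmod_out(std::uint64_t n, std::uint64_t d,
                         std::uint64_t* const rem) {
    *rem = n % d;
    return n / d;
}

// Pair style: on x86-64 System V both 64-bit results can come back
// in rax and rdx.
std::pair<std::uint64_t, std::uint64_t> divmod_pair(std::uint64_t n,
                                                    std::uint64_t d) {
    return {n / d, n % d};
}

// Hypothetical caller that checks both styles agree.
std::uint64_t sum_parts(std::uint64_t n, std::uint64_t d) {
    const std::pair<std::uint64_t, std::uint64_t> qr = divmod_pair(n, d);
    std::uint64_t rem = 0;
    const std::uint64_t quot = divmod_out(n, d, &rem);
    assert(quot == qr.first && rem == qr.second);
    return qr.first + qr.second;
}
```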