Consider this example:
#include <utility>
// runtime dominated by argument passing
template <class T>
void foo(T t) {}
int main() {
int i(0);
foo<int>(i); // fast -- int is scalar type
foo<int&>(i); // slow -- lvalue reference overhead
foo<int&&>(std::move(i)); // ???
}
Is foo<int&&>(std::move(i)) as fast as foo<int>(i), or does it involve pointer overhead like foo<int&>(i)?
EDIT: As suggested, running g++ -S gave me the same 51-line assembly file for foo<int>(i) and foo<int&>(i), but foo<int&&>(std::move(i)) resulted in 71 lines of assembly code (it looks like the difference came from std::move).
EDIT: Thanks to those who recommended g++ -S with different optimization levels -- using -O3 (and making foo noinline) I was able to get output which looks like xaxxon's solution.
In your specific situation, it's likely they are all the same. The resulting code from godbolt with gcc -O3 is https://godbolt.org/g/XQJ3Z4 for:
#include <utility>
// runtime dominated by argument passing
template <class T>
int foo(T t) { return t;}
int main() {
int i{0};
volatile int j;
j = foo<int>(i); // fast -- int is scalar type
j = foo<int&>(i); // slow -- lvalue reference overhead
j = foo<int&&>(std::move(i)); // ???
}
is:
mov dword ptr [rsp - 4], 0 // foo<int>(i);
mov dword ptr [rsp - 4], 0 // foo<int&>(i);
mov dword ptr [rsp - 4], 0 // foo<int&&>(std::move(i));
xor eax, eax
ret
The volatile int j is so that the compiler cannot optimize away all the code because it would otherwise know that the results of the calls are discarded and the whole program would optimize to nothing.
HOWEVER, if you force the function not to be inlined, e.g. by declaring it as int __attribute__ ((noinline)) foo(T t) { return t;}, then things change a bit:
int foo<int>(int): # @int foo<int>(int)
mov eax, edi
ret
int foo<int&>(int&): # @int foo<int&>(int&)
mov eax, dword ptr [rdi]
ret
int foo<int&&>(int&&): # @int foo<int&&>(int&&)
mov eax, dword ptr [rdi]
ret
above: https://godbolt.org/g/pbZ1BT
For questions like these, learn to love https://godbolt.org and https://quick-bench.com/ (Quick Bench requires you to learn how to properly use Google Benchmark)
Efficiency of parameter passing depends on the ABI.
For example, on Linux the Itanium C++ ABI specifies that references are passed as pointers to the referred object:
3.1.2 Reference Parameters
Reference parameters are handled by passing a pointer to the actual parameter.
This is independent of the reference category (rvalue/lvalue reference).
For a broader view, I have found this quote in a document from the Technical University of Denmark, calling convention, which analyzes most of the compilers:
References are treated as identical to pointers in all respects.
So rvalue and lvalue references both involve pointer overhead on all of these ABIs.
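As a rough illustration of why the two reference kinds behave alike, the sketch below checks at the language level that both an int& and an int&& parameter name the caller's own object, whereas a by-value parameter is a separate copy. The helper names are made up for this example, and the code does not inspect the ABI itself; it only shows the semantics that an ABI has to implement, typically by passing an address.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helpers: each one reports whether its parameter is the
// very object the caller owns (identified by the pointer `caller`).
bool aliases_by_value(int v, const int* caller) { return &v == caller; }
bool aliases_by_lref(int& r, const int* caller) { return &r == caller; }
bool aliases_by_rref(int&& r, const int* caller) { return &r == caller; }

// Runs all three checks against a single local variable.
void check_aliasing() {
    int i = 0;
    assert(!aliases_by_value(i, &i));           // the copy lives elsewhere
    assert(aliases_by_lref(i, &i));             // refers to i itself
    assert(aliases_by_rref(std::move(i), &i));  // std::move copies nothing
}
```

Since std::move is only a cast, the int&& version still has to reach i through its address once the call is not inlined, just like the int& version.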
Related
If I compile this code with GCC or Clang and enable -O2 optimizations, I still get some global object initialization. Is it even possible for any code to reach these variables?
#include <string>
static const std::string s = "";
int main() { return 0; }
Compiler output:
main:
xor eax, eax
ret
_GLOBAL__sub_I_main:
mov edx, OFFSET FLAT:__dso_handle
mov esi, OFFSET FLAT:s
mov edi, OFFSET FLAT:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEED1Ev
mov QWORD PTR s[rip], OFFSET FLAT:s+16
mov QWORD PTR s[rip+8], 0
mov BYTE PTR s[rip+16], 0
jmp __cxa_atexit
Specifically, I was not expecting the _GLOBAL__sub_I_main: section.
Godbolt link
Edit:
Even with a simple custom defined type, the compiler still generates some code.
class Aloha
{
public:
Aloha () : i(1) {}
~Aloha() = default;
private:
int i;
};
static const Aloha a;
int main() { return 0; }
Compiler output:
main:
xor eax, eax
ret
_GLOBAL__sub_I_main:
ret
Compiling that code with the short string optimization (SSO) may be equivalent to taking the address of one of std::string's member variables. The constructor has to analyze the string length at compile time and choose whether the string fits into the internal storage of the std::string object or whether memory has to be allocated dynamically, only to find later that the string is never read, so the allocation code can be optimized out.
Lack of optimization in this case might be an optimization flaw limited to such simple outlying examples like this one:
const int i = 3;
int main()
{
return (long long)(&i); // to make sure that address was used
}
GCC generates code:
i:
.long 3 ; this is a variable
main:
push rbp
mov rbp, rsp
mov eax, OFFSET FLAT:i
pop rbp
ret
GCC would not optimize this code either:
const int i = 3;
const int *p = &i;
int main() { return 0; }
Static variables declared at file scope, especially const-qualified ones, can be optimized out under the as-if rule unless their address is used; GCC does that only for const-qualified ones, regardless of use case. Taking the address of a variable is observable behaviour, because the address can be passed somewhere. Logic that would trace that would be too complex to implement and of little practical value.
Of course, the code that doesn't use address
const int i = 3;
int main() { return i; }
results in optimizing out reserved storage:
main:
mov eax, 3
ret
What about C++20 constexpr construction of std::string? Under the older rules it could not be a compile-time expression if the result depended on arguments. It is possible that std::string would allocate memory dynamically if the string is too long, which isn't a compile-time action. It appears that the only mainstream compiler that supports the C++20 features required for that at the moment is MSVC, under certain conditions.
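For a user-defined type, compile-time construction is easier to arrange. The sketch below is an assumption-laden variation of the Aloha class from the question: the constructor is marked constexpr and a value() accessor (hypothetical, added only so the result can be checked) is introduced. Declaring the object constexpr then requires constant initialization, so the object cannot depend on start-up code.

```cpp
#include <cassert>

class Aloha {
public:
    constexpr Aloha() : i(1) {}                // now usable in constant expressions
    ~Aloha() = default;
    constexpr int value() const { return i; }  // hypothetical accessor for checking
private:
    int i;
};

// A constexpr object must be constant-initialized; a constructor that
// could not run at compile time would be rejected here.
static constexpr Aloha a{};
static_assert(a.value() == 1, "a is initialized during compilation");

// Reads the member at run time to show the object is usable as usual.
int read_aloha() {
    assert(a.value() == 1);
    return a.value();
}
```

Whether a particular compiler still emits an empty _GLOBAL__sub_I_ stub is a separate question that only the generated assembly can answer.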
This is the code in question:
struct Cell
{
Cell* U;
Cell* D;
void Detach();
};
void Cell::Detach()
{
U->D = D;
D->U = U;
}
clang-14 -O3 produces:
mov rax, qword ptr [rdi] <-- rax = U
mov rcx, qword ptr [rdi + 8] <-- rcx = D
mov qword ptr [rax + 8], rcx <-- U->D = D
mov rcx, qword ptr [rdi + 8] <-- this queries the D field again
mov qword ptr [rcx], rax <-- D->U = U
gcc 11.2 -O3 produces almost the same, but leaves out one mov:
mov rdx, QWORD PTR [rdi]
mov rax, QWORD PTR [rdi+8]
mov QWORD PTR [rdx+8], rax
mov QWORD PTR [rax], rdx
Clang reads the D field twice, while GCC reads it only once and re-uses it. Apparently GCC is not afraid of the first assignment changing anything that has an impact on the second assignment. I'm trying to understand if/when this is allowed.
Checking correctness gets a bit complicated when U or D point at themselves, each other and/or the same target.
My understanding is that the shorter code of GCC is correct if it is guaranteed that the pointers point at the beginning of a Cell (never inside it), regardless of which Cell it is.
Following this line of thought further, this is the case when a) Cells are always aligned to their size, and b) no custom manipulation of such a pointer occurs (referencing and arithmetic are fine).
I suspect case a) is guaranteed by the compiler, and case b) would require invoking undefined behavior of some sort, and as such can be ignored.
This would explain why GCC allows itself this optimization.
Is my reasoning correct? If so, why does clang not make the same optimization?
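One way to probe this reasoning is to compare the two strategies at the source level. The sketch below is only an experiment under stated assumptions: detach_reload mirrors the original Detach() (D is read again after the first store), detach_cached is a hand-written version of what GCC emits, and same_result is a made-up helper that wires cell 0 of a small array to every combination of targets, including itself. It covers only properly formed Cell objects, not pointers into the middle of a Cell.

```cpp
#include <cassert>

struct Cell {
    Cell* U;
    Cell* D;
};

// Source-level equivalent of Detach(): D is read again after the store.
void detach_reload(Cell* c) {
    c->U->D = c->D;
    c->D->U = c->U;
}

// Hand-written version of GCC's output: both fields are read once.
void detach_cached(Cell* c) {
    Cell* u = c->U;
    Cell* d = c->D;
    u->D = d;
    d->U = u;
}

// Wires cell 0 so that U points at cell `u` and D at cell `d`, runs both
// versions on separate copies, and compares the resulting link structure.
bool same_result(int u, int d) {
    Cell a[3], b[3];
    for (int i = 0; i < 3; ++i) {
        a[i] = {&a[i], &a[i]};
        b[i] = {&b[i], &b[i]};
    }
    a[0].U = &a[u]; a[0].D = &a[d];
    b[0].U = &b[u]; b[0].D = &b[d];
    detach_reload(&a[0]);
    detach_cached(&b[0]);
    for (int i = 0; i < 3; ++i) {
        if (a[i].U - a != b[i].U - b) return false;
        if (a[i].D - a != b[i].D - b) return false;
    }
    return true;
}

// Checks every combination, including U == this, D == this and U == D.
void check_all_link_patterns() {
    for (int u = 0; u < 3; ++u)
        for (int d = 0; d < 3; ++d)
            assert(same_result(u, d));
}
```

The only way the first store can touch this->D with well-formed cells is U == this, and in that case it stores the value that was already there, which is why both versions agree.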
There are many potential optimizations in C and C++ that are usually safe, but aren't quite sound. If one regards the -> operator as being usable to build a standard-layout object without having to use placement new on it first (an abstraction model that is relied upon by a lot of code, whether or not the Standard mandates support), removing the if (mode) in the following C and C++ functions would be such an optimization.
C version:
struct s { int x,y; }; /* Assume int is 4 bytes, and struct is 8 */
void test(struct s *p1, struct s *p2, int mode)
{
p1->y = 1;
p2->x = 2;
if (mode)
p1->y = 1;
}
C++ version:
#include <new>
struct s { int x,y; };
void test(void *vp1, void *vp2, int mode)
{
if (1)
{
struct s* p1 = new (vp1) struct s;
p1->y = 1;
}
if (1)
{
struct s* p2 = new (vp2) struct s;
p2->x = 2;
}
if (mode)
{
struct s* p3 = new (vp1) struct s;
p3->y = 1;
}
}
The optimization would be correct unless the address in p2 is four bytes higher than p1. Under the "traditional" abstraction model used in C or C++, if the address of p1 happens to be 0x1000 and that of p2 happens to be 0x1004, the first assignment would cause addresses 0x1000-0x1007 to hold a struct s, if it didn't already, whose second member (at address 0x1004) would equal 1. The second assignment, by overwriting that object, would end its lifetime and cause addresses 0x1004 to 0x100B to hold a struct s whose first member would equal 2. The third assignment, if executed, would end the lifetime of that second object and re-create the first.
If the third assignment is executed, there would be an object at address 0x1000 whose second field (at address 0x1004) would hold the readable value 1. If the assignment is skipped, there would be an object at address 0x1004 whose first field would hold the value 2. Behavior would be defined in both cases, and a compiler that didn't know which case would apply would have to accommodate both of them by making the value at 0x1004 depend upon mode.
As it happens, the authors of clang do not seem to have provided for that corner case, and thus omit the conditional check. While I think the Standard should use an abstraction model that would allow such optimization, while also supporting the common structure-creation pattern in situations that don't involve weird aliasing corner cases, I don't see any way of interpreting the Standard that would allow for such optimization without allowing compilers to arbitrarily break a large amount of existing code.
I don't think there's any general way of knowing when a decision by gcc or clang not to impose a particular optimization represents a recognition of potential corner cases where the optimization would be incorrect, and an inability to prove that none of them apply, and when it simply represents an oversight which may be "corrected" so as to replace correct behavior with an unsound optimization.
Let's say that I want to pass a POD object to a function as a const argument. I know that for simple types like int and double, passing by value is better than passing by const reference because of the reference overhead. But at what size is it worth passing by reference instead?
struct arg
{
...
};
void foo(const arg input)
{
// read from input
}
or
void foo(const arg& input)
{
// read from input
}
i.e., at what size of struct arg should I start using the latter approach?
I should also mention that I'm not talking about copy elision here. Let's suppose that it doesn't happen.
TL;DR: This depends highly on the target architecture, the compiler and the context in which the functions are invoked. When unsure, profile and manually inspect generated code.
If the functions are inlined, a good optimizing compiler will probably emit the exact same code in both cases.
If the functions are not inlined, however, the ABI on most C++ implementations dictates that a const& argument is passed as a pointer. That means the structure has to be stored in memory just so that its address can be taken. This can have a significant impact on performance for small objects.
Let's take x86_64 Linux G++ 8.2 as an example...
A struct with 2 members:
struct arg
{
int a;
long b;
};
int foo1(const arg input)
{
return input.a + input.b;
}
int foo2(const arg& input)
{
return input.a + input.b;
}
Generated assembly:
foo1(arg):
lea eax, [rdi+rsi]
ret
foo2(arg const&):
mov eax, DWORD PTR [rdi]
add eax, DWORD PTR [rdi+8]
ret
The first version receives the structure entirely in registers; the second receives a pointer to it, so the structure has to live in memory (typically on the caller's stack).
Now let's try 3 members:
struct arg
{
int a;
long b;
int c;
};
int foo1(const arg input)
{
return input.a + input.b + input.c;
}
int foo2(const arg& input)
{
return input.a + input.b + input.c;
}
Generated assembly:
foo1(arg):
mov eax, DWORD PTR [rsp+8]
add eax, DWORD PTR [rsp+16]
add eax, DWORD PTR [rsp+24]
ret
foo2(arg const&):
mov eax, DWORD PTR [rdi]
add eax, DWORD PTR [rdi+8]
add eax, DWORD PTR [rdi+16]
ret
Not a whole lot of difference anymore, although using the second version will still be a bit slower because it requires the address to be put in rdi.
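The switch between the two listings can be approximated with a compile-time check. The helper below is a heuristic sketch, not the ABI's actual classification algorithm: it assumes the x86-64 System V rule of thumb that a trivially copyable aggregate of at most 16 bytes can be passed in registers, while anything larger goes through memory. The name likely_in_registers and the fixed-width stand-ins for int and long are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Heuristic only: assumes x86-64 System V, where trivially copyable
// aggregates of at most 16 bytes can be passed in registers.
template <class T>
constexpr bool likely_in_registers() {
    return std::is_trivially_copyable<T>::value && sizeof(T) <= 16;
}

// Fixed-width stand-ins for the int/long members used above (LP64 sizes).
struct Two   { std::int32_t a; std::int64_t b; };
struct Three { std::int32_t a; std::int64_t b; std::int32_t c; };

// Mirrors the two listings: 16 bytes fit, 24 bytes do not.
void check_examples() {
    assert(likely_in_registers<Two>());
    assert(!likely_in_registers<Three>());
}
```

On other ABIs the threshold and the rules differ, so the generated code remains the final authority.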
Does it really matter that much?
Usually not. If you care about performance of a particular function, it's probably called frequently and is therefore small. As such, it will most likely be inlined.
Let's try invoking the two functions above:
int test(int x)
{
arg a {x, x};
return foo1(a) + foo2(a);
}
Generated assembly:
test(int):
lea eax, [0+rdi*4]
ret
Voilà. It's all moot now. The compiler inlined and merged both functions into a single instruction!
A reasonable rule of thumb: if the size of the class is the same as or less than the size of a pointer, then copying may be a bit faster.
If the size of the class is slightly larger, then it may be hard to predict. The difference is often insignificant.
If the size of the class is humongous, then copying is likely slower. That said, the point is moot, since humongous objects can't in practice have automatic storage, which is limited.
If the function is expanded inline, then there is probably no difference whatsoever.
To find out whether one program is faster than the other on a particular system, and whether the difference is significant in the first place, you can use a profiler.
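The rule of thumb can be encoded in a type alias, roughly in the spirit of call-traits helpers. This is a minimal sketch assuming C++14; param_t, Big, twice and sum are hypothetical names, and the threshold of one pointer is simply the rule stated above, not a measured optimum.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical alias: small trivially copyable types are taken by value,
// everything else by const reference.
template <class T>
using param_t = std::conditional_t<
    std::is_trivially_copyable<T>::value && sizeof(T) <= sizeof(void*),
    T, const T&>;

struct Big { double values[16]; };

int twice(param_t<int> x) { return 2 * x; }  // receives a copy

double sum(param_t<Big> b) {                 // receives a const reference
    double total = 0.0;
    for (double v : b.values) total += v;
    return total;
}

// Confirms which form the alias selects for each type.
void check_param_choice() {
    static_assert(std::is_same<param_t<int>, int>::value, "by value");
    static_assert(std::is_same<param_t<Big>, const Big&>::value, "by reference");
    Big b{};
    b.values[0] = 1.5;
    assert(twice(21) == 42);
    assert(sum(b) == 1.5);
}
```

One design caveat: using such an alias for a function template parameter prevents template argument deduction, so it fits best where T is already known.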
In addition to the other responses, there are also optimization concerns.
Since it's a reference, the compiler cannot know whether the reference points to a mutable global variable or not. When calling any function whose definition is not visible in the current TU, the compiler must assume the variable may have been mutated.
For example, if you have an if depending on a data member of Foo, call a function, then use the same data member, the compiler will be forced to emit two separate loads, whereas if the variable is local, it knows it cannot be mutated elsewhere. Here's an example:
struct Foo {
int data;
};
extern void use_data(int);
void bar(Foo const& foo) {
int const& data = foo.data;
// may mutate foo.data through a global Foo
use_data(data);
// must load foo.data again through the reference
use_data(data);
}
If the variable is local, the compiler will simply reuse the value already inside the registers.
Here's a compiler explorer example that shows the optimization being applied only if the variable is local.
This is why the "general advice" will give you good performance, but won't give you optimal performance. You must measure and profile your code if you truly care about its performance.
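The reload is not just a missed optimization; it is required for correctness. The sketch below makes the mutation observable, under the assumption that use_data records its argument and then modifies a global Foo. The names global_foo, seen, by_ref, by_value and the observe helpers are made up for this example.

```cpp
#include <cassert>
#include <vector>

struct Foo {
    int data;
};

Foo global_foo{1};
std::vector<int> seen;  // values received by use_data, in call order

// Records the argument, then mutates the global behind the caller's back.
void use_data(int value) {
    seen.push_back(value);
    ++global_foo.data;
}

// The reference may alias global_foo, so foo.data is read twice.
void by_ref(Foo const& foo) {
    use_data(foo.data);
    use_data(foo.data);
}

// The parameter is a private copy; nothing else can change it.
void by_value(Foo foo) {
    use_data(foo.data);
    use_data(foo.data);
}

std::vector<int> observe_by_ref() {
    seen.clear();
    global_foo.data = 1;
    by_ref(global_foo);
    return seen;
}

std::vector<int> observe_by_value() {
    seen.clear();
    global_foo.data = 1;
    by_value(global_foo);
    return seen;
}

void check_observations() {
    assert(observe_by_ref() == (std::vector<int>{1, 2}));    // second read sees the change
    assert(observe_by_value() == (std::vector<int>{1, 1}));  // the copy is unaffected
}
```

Because the by-reference version can legitimately observe 1 and then 2, the compiler has to load foo.data again after each opaque call.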
Why is a constexpr function not evaluated at compile time, but at run time, in the return statement of the main function?
I tried
template<int x>
constexpr int fac() {
return fac<x - 1>() * x;
}
template<>
constexpr int fac<1>() {
return 1;
}
int main() {
const int x = fac<3>();
return x;
}
and the result is
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 6
mov eax, 6
pop rbp
ret
with gcc 8.2. But when I call the function in the return statement
template<int x>
constexpr int fac() {
return fac<x - 1>() * x;
}
template<>
constexpr int fac<1>() {
return 1;
}
int main() {
return fac<3>();
}
I get
int fac<1>():
push rbp
mov rbp, rsp
mov eax, 1
pop rbp
ret
main:
push rbp
mov rbp, rsp
call int fac<3>()
nop
pop rbp
ret
int fac<2>():
push rbp
mov rbp, rsp
call int fac<1>()
add eax, eax
pop rbp
ret
int fac<3>():
push rbp
mov rbp, rsp
call int fac<2>()
mov edx, eax
mov eax, edx
add eax, eax
add eax, edx
pop rbp
ret
Why is the first code evaluated at compile time and the second at runtime?
Also I tried both snippets with clang 7.0.0 and they are evaluated at runtime. Why is this not valid constexpr for clang?
All evaluation was done in godbolt compiler explorer.
A common misconception with regard to constexpr is that it means "this will be evaluated at compile time" [1].
It does not. constexpr was introduced to let us write natural code that may produce constant expressions in contexts that need them. It means "this must be evaluatable at compile time", which is what the compiler will check.
So if you wrote a constexpr function returning an int, you can use it to calculate a template argument, an initializer for a constexpr variable (also const if it's an integral type) or an array size. You can use the function to obtain natural, declarative, readable code instead of the old meta-programming tricks one needed to resort to in the past.
But a constexpr function is still a regular function. The constexpr specifier doesn't mean a compiler has [2] to optimize it to heck and do constant folding at compile time. It's best not to confuse it for such a hint.
[1] Thanks user463035818 for the phrasing.
[2] C++20 and consteval are a different story, however :)
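To make the distinction concrete, here is a small sketch that reuses the fac template from the question. The wrapper names fac3_forced and fac3_maybe_runtime are made up for this example. Initializing a constexpr variable is a context that needs a constant expression, so the first call has to be evaluated by the compiler; the second is an ordinary call that the compiler may or may not fold.

```cpp
#include <cassert>

template <int x>
constexpr int fac() {
    return fac<x - 1>() * x;
}

template <>
constexpr int fac<1>() {
    return 1;
}

// A constexpr variable needs a constant expression as its initializer,
// so fac<3>() is evaluated during compilation regardless of -O level.
int fac3_forced() {
    constexpr int x = fac<3>();
    static_assert(x == 6, "computed by the compiler");
    return x;
}

// A plain call: folding it is an optimization, not a requirement.
int fac3_maybe_runtime() {
    return fac<3>();
}

void check_factorials() {
    assert(fac3_forced() == 6);
    assert(fac3_maybe_runtime() == 6);
}
```

Both functions return the same value; they differ only in what the language obliges the compiler to do.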
StoryTeller's answer is good, but I think there's a slightly different take possible.
With constexpr, there are three situations to distinguish:
1. The result is needed in a compile-time context, such as array sizes. In this case, the arguments too must be known at compile time. Evaluation is probably at compile time, and at least all diagnosable errors will be found at compile time.
2. The arguments are only known at run time, and the result is not needed at compile time. In this case, evaluation necessarily has to happen at run time.
3. The arguments may be available at compile time, but the result is needed only at run time.
The fourth combination (arguments available only at runtime, result needed at compile time) is an error; the compiler will reject such code.
Now in cases 1 and 3 the calculation could happen at compile time, as all inputs are available. But to facilitate case 2, the compiler must be able to create a run-time version, and it may decide to use this variant in the other cases as well - if it can.
E.g. some compilers internally support variable-sized arrays, so even while the language requires compile-time array bounds, the implementation may decide not to.
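The three situations can be sketched with a single constexpr function. This is only an illustration; square and the case-named wrappers are hypothetical and not taken from the question.

```cpp
#include <cassert>

constexpr int square(int n) { return n * n; }

// Case 1: the result is needed at compile time (an array bound).
int case1_array_length() {
    int cells[square(3)] = {};
    return static_cast<int>(sizeof(cells) / sizeof(cells[0]));
}

// Case 2: the argument is only known at run time, so the call
// has to be evaluated at run time (unless it is inlined and folded).
int case2_runtime(int n) { return square(n); }

// Case 3: the argument is a constant, but the result is only needed at
// run time; the compiler may evaluate it at either time.
int case3_either() { return square(4); }

void check_cases() {
    assert(case1_array_length() == 9);
    assert(case2_runtime(5) == 25);
    assert(case3_either() == 16);
}
```

The fourth combination would be something like int cells[square(n)] with a run-time n, which standard C++ rejects.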
I'm in the process of updating a codebase that currently uses a custom equivalent of std::variant to C++17.
In certain parts of the code, the variant is being reset from a known alternative, so the class provides a method that asserts that index() has a given value, but still directly invokes the proper destructor unconditionally.
This is used in some tight inner loops, and has (measured) non-trivial performance impact. That's because it allows the compiler to eliminate the entire destruction when the alternative in question is a trivially destructible type.
At face value, it seems to me that I can't achieve this with the current std::variant<> implementation in the STL, but I'm hoping that I'm wrong.
Is there a way to accomplish this that I'm not seeing, or am I out of luck?
Edit: as requested, here's a usage example (using @T.C.'s example as basis):
struct S {
~S();
};
using var = MyVariant<S, int, double>;
void change_int_to_double(var& v){
v.reset_from<1>(0.0);
}
change_int_to_double compiles to effectively:
#change_int_to_double(MyVariant<S, int, double>&)
mov qword ptr [rdi], 0 // Sets the storage to double(0.0)
mov dword ptr [rdi + 8], 2 // Sets the index to 2
Edit #2
Thanks to various insights from @T.C., I've landed on this monstrosity. It "works" even though it does violate the standard by skipping a few destructors. However, every skipped destructor is checked at compile-time to be trivial so...:
see on godbolt: https://godbolt.org/g/2LK2fa
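#include <cassert>
#include <cstddef>
#include <exception>
#include <new>
#include <type_traits>
#include <utility>
#include <variant>
// Let's make sure our std::variant implementation does nothing funky internally.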
static_assert(std::is_trivially_destructible<std::variant<char, int>>::value,
"change_from_I won't be valid");
template<size_t I, typename arg_t, typename... VAR_ARGS>
void change_from_I(std::variant<VAR_ARGS...>& v, arg_t&& new_val) {
assert(I == v.index());
// Optimize away the std::get<> runtime check if possible.
#if defined(__GNUC__)
if(v.index() != I) __builtin_unreachable();
#else
if(v.index() != I) std::terminate();
#endif
// Smart compilers handle this fine without this check, but MSVC can
// use the help.
using current_t = std::variant_alternative_t<I, std::variant<VAR_ARGS...>>;
if(!std::is_trivially_destructible<current_t>::value) {
std::get<I>(v).~current_t();
}
new (&v) std::variant<VAR_ARGS...>(std::forward<arg_t>(new_val));
}
#include <variant>
struct S {
~S();
};
using var = std::variant<S, int, double>;
void change_int_to_double(var& v){
if(v.index() != 1) __builtin_unreachable();
v = 0.0;
}
GCC compiles the function down to:
change_int_to_double(std::variant<S, int, double>&):
mov QWORD PTR [rdi], 0x000000000
mov BYTE PTR [rdi+8], 2
ret
which is optimal. Clang's codegen, OTOH, leaves much to be desired, although it isn't too bad if you use std::terminate() (the equivalent of an assertion) rather than __builtin_unreachable():
change_int_to_double(std::__1::variant<S, int, double>&): # @change_int_to_double(std::__1::variant<S, int, double>&)
cmp dword ptr [rdi + 8], 1
jne .LBB0_2
mov qword ptr [rdi], 0
mov dword ptr [rdi + 8], 2
ret
.LBB0_2:
push rax
call std::terminate()
MSVC...let's not talk about MSVC.
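For completeness, here is a portable sketch of the same function that can be compiled and run anywhere C++17 is available. It makes a few assumptions that are not in the answer above: assert replaces __builtin_unreachable(), emplace is used instead of assignment, and S gets an empty destructor body so the example links. It says nothing about the code any particular compiler generates.

```cpp
#include <cassert>
#include <variant>

struct S {
    ~S() {}  // non-trivial destructor, defined here so the example links
};

using var = std::variant<S, int, double>;

// Precondition: v currently holds the int alternative (index 1).
void change_int_to_double(var& v) {
    assert(v.index() == 1);
    v.emplace<double>(0.0);  // destroys the old alternative, then builds the double
}

// Exercises the function on a variant that starts out holding an int.
void check_change() {
    var v = 42;
    assert(v.index() == 1);
    change_int_to_double(v);
    assert(v.index() == 2);
    assert(std::get<double>(v) == 0.0);
}
```

Whether the optimizer removes the destructor dispatch here still depends on it being able to prove the index, which is what the __builtin_unreachable() hint above is for.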