Suppose I have a std::tuple:
std::tuple<int,int,int,int> t = {1,2,3,4};
and I want to use std::tie just for readability purpose like that:
int a, b, c, d; // in real context these names would be meaningful
std::tie(a, b, c, d) = t;
vs. just using t.get<int>(0), etc.
Would a GCC optimize the memory use of this tuple or would it allocate additional space for a, b, c, d variables?
In this case I don't see any reason why not, under the as-if rule the compiler only has to emulate the observable behavior of the program. A quick experiment using godbolt:
#include <tuple>
#include <cstdio>
void func( int x1, int x2,int x3, int x4)
{
std::tuple<int,int,int,int> t{x1,x2,x3,x4};
int a, b, c, d; // in real context these names would be meaningful
std::tie(a, b, c, d) = t;
printf( "%d %d %d %d\n", a, b, c, d ) ;
}
shows that gcc does indeed optimize it away:
func(int, int, int, int):
movl %ecx, %r8d
xorl %eax, %eax
movl %edx, %ecx
movl %esi, %edx
movl %edi, %esi
movl $.LC0, %edi
jmp printf
On the other hand if you used a address of t and printed it out, we now have observable behavior which relies on t existing (see it live):
printf( "%p\n", static_cast<void*>(&t) );
and we can see that gcc no longer optimizes away the t:
movl %esi, 12(%rsp)
leaq 16(%rsp), %rsi
movd 12(%rsp), %xmm1
movl %edi, 12(%rsp)
movl $.LC0, %edi
movd 12(%rsp), %xmm2
movl %ecx, 12(%rsp)
movd 12(%rsp), %xmm0
movl %edx, 12(%rsp)
movd 12(%rsp), %xmm3
punpckldq %xmm2, %xmm1
punpckldq %xmm3, %xmm0
punpcklqdq %xmm1, %xmm0
At the end of the day you need to look at what the compiler generates and profile your code, in more complicated cases it may surprise you. Just because the compiler is allowed to do certain optimizations does not mean it will. I have looked at more complicated cases where the compiler does not do what I would expect with std::tuple. godbolt is a very helpful tool here, I can not count how many optimizations assumptions I used to have that were upended by plugging in simple examples into godbolt.
Note, I typically use printf in these examples because iostreams generates a lot of code that gets in the way of the example.
Related
Could someone please tell me whether or not such a construction is valid (i.e not an UB) in C++. I have some segfaults because of that and spent couple of days trying to figure out what is going on there.
// Synthetic example
int main(int argc, char** argv)
{
int array[2] = {99, 99};
/*
The point is here. Is it legal? Does it have defined behaviour?
Will it increment first and than access element or vise versa?
*/
std::cout << array[argc += 7]; // Use argc just to avoid some optimisations
}
So, of course I did some analysis, both GCC(5/7) and clang(3.8) generate same code. First add than access.
Clang(3.8): clang++ -O3 -S test.cpp
leal 7(%rdi), %ebx
movl .L_ZZ4mainE5array+28(,%rax,4), %esi
movl $_ZSt4cout, %edi
callq _ZNSolsEi
movl $.L.str, %esi
movl $1, %edx
movq %rax, %rdi
callq _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
GCC(5/7) g++-7 -O3 -S test.cpp
leal 7(%rdi), %ebx
movl $_ZSt4cout, %edi
subq $16, %rsp
.cfi_def_cfa_offset 32
movq %fs:40, %rax
movq %rax, 8(%rsp)
xorl %eax, %eax
movabsq $425201762403, %rax
movq %rax, (%rsp)
movslq %ebx, %rax
movl (%rsp,%rax,4), %esi
call _ZNSolsEi
movl $.LC0, %esi
movq %rax, %rdi
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl %ebx, %esi
So, can I assume such a baheviour is a standard one?
By itself array[argc += 7] is OK, the result of argc + 7 will be used as an index to array.
However, in your example array has just 2 elements, and argc is never negative, so your code will always result in UB due to an out-of-bounds array access.
In case of a[i+=N] the expression i += N will always be evaluated first before accessing the index. But the example that you provided invokes UB as your example array contains only two elements and thus you are accessing out of bounds of the array.
Your case is clearly undefined behaviour, since you will exceed array bounds for the following reasons:
First, expression array[argc += 7] is equal to *((array)+(argc+=7)), and the values of the operands will be evaluated before + is evaluated (cf. here); Operator += is an assignment (and not a side effect), and the value of an assignment is the result of argc (in this case) after the assignment (cf. here). Hence, the +=7 gets effective for subscripting;
Second, argc is defined in C++ to be never negative (cf. here); So argc += 7 will always be >=7 (or a signed integer overflow in very unrealistic scenarious, but still UB then).
Hence, UB.
It's normal behavior. Name of array actualy is a pointer to first element of array. And array[n] is the same as *(array+n)
This sounds too much of a simple question to not be answered already somewhere, but I tried to look around and I couldn't find any simple answer. Take the following example:
class vec
{
double x;
double y;
};
inline void sum_x(vec & result, vec & a, vec & b)
{
result.x = a.x + b.x;
}
inline void sum(vec & result, vec & a, vec & b)
{
sum_x(result, a, b);
result.y = a.y + b.y;
}
What happens when I call sum and compile? Will both sum and sum_x be inlined, so that it will just translate to an inline assembly code to sum the two components?
This looks like a trivial example, but I am working with a vector class that has the dimensionality defined in a template, so iterating over operations on vectors looks a bit like this.
inline is just a hint to the compiler. Whether the compiler actually inlines the function or not is a different question. For gcc there is an always inline attribute to force this.
__attribute__((always_inline));
With always inlining you should achieve what you described (code generate as if it where written in one function).
However, with all the optimizations and transformations applied by compilers you can only be sure if you check the generated code (assembly)
Yes, inlining may be applied recursively.
The entire set of operations that you're performing here can be inlined at the call site.
Note that this has very little to do with your use of the inline keyword, which (other than its effect on the ODR — which can be very noticeable) is just a hint and nowadays mostly ignored for purposes of actually inlining. The functions will be inlined because your clever compiler can see that they are good candidates for it.
The only way you can actually tell whether it's doing this is to inspect the resulting assembly yourself.
It depends. inline is just a hint to the compiler that it might want to think about inlining that function. It's entirely possible for a compiler to inline both calls, but that's up to the implementation.
As an example, here's some prettified assembly output from GCC with and without those inlines of this simple program:
int main()
{
vec a;
vec b;
std::cin >> a.x;
std::cin >> a.y;
sum(b,a,a);
std::cout << b.x << b.y;
return 0;
}
With inlining:
main:
subq $40, %rsp
leaq 16(%rsp), %rsi
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >& std::basic_istream<char, std::char_traits<char> >::_M_extract<double>(double&)
leaq 24(%rsp), %rsi
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >& std::basic_istream<char, std::char_traits<char> >::_M_extract<double>(double&)
movsd 24(%rsp), %xmm0
movapd %xmm0, %xmm1
addsd %xmm0, %xmm1
movsd %xmm1, 8(%rsp)
movsd 16(%rsp), %xmm0
addsd %xmm0, %xmm0
movl std::cout, %edi
call std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
movsd 8(%rsp), %xmm0
movq %rax, %rdi
call std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
movl $0, %eax
addq $40, %rsp
ret
subq $8, %rsp
movl std::__ioinit, %edi
call std::ios_base::Init::Init()
movl $__dso_handle, %edx
movl std::__ioinit, %esi
movl std::ios_base::Init::~Init(), %edi
call __cxa_atexit
addq $8, %rsp
ret
Without:
sum_x(vec&, vec&, vec&):
movsd (%rsi), %xmm0
addsd (%rdx), %xmm0
movsd %xmm0, (%rdi)
ret
sum(vec&, vec&, vec&):
movsd (%rsi), %xmm0
addsd (%rdx), %xmm0
movsd %xmm0, (%rdi)
movsd 8(%rsi), %xmm0
addsd 8(%rdx), %xmm0
movsd %xmm0, 8(%rdi)
ret
main:
pushq %rbx
subq $48, %rsp
leaq 32(%rsp), %rsi
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >& std::basic_istream<char, std::char_traits<char> >::_M_extract<double>(double&)
leaq 40(%rsp), %rsi
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >& std::basic_istream<char, std::char_traits<char> >::_M_extract<double>(double&)
leaq 32(%rsp), %rdx
movq %rdx, %rsi
leaq 16(%rsp), %rdi
call sum(vec&, vec&, vec&)
movq 24(%rsp), %rbx
movsd 16(%rsp), %xmm0
movl std::cout, %edi
call std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
movq %rbx, 8(%rsp)
movsd 8(%rsp), %xmm0
movq %rax, %rdi
call std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
movl $0, %eax
addq $48, %rsp
popq %rbx
ret
subq $8, %rsp
movl std::__ioinit, %edi
call std::ios_base::Init::Init()
movl $__dso_handle, %edx
movl std::__ioinit, %esi
movl std::ios_base::Init::~Init(), %edi
call __cxa_atexit
addq $8, %rsp
ret
As you can see, GCC inlined both functions when asked to.
If your assembly is a bit rusty, simply note that sum is present and called in the second version, but not in the first.
As mentioned, the inline keyword is just a hint. However, compilers do an amazing job here (even without your hints), and they do inline recursively.
If you're really interested in this stuff, I recommend learning a bit about compiler design. I've been studying it recently and it blew my mind what complex beasts our production-quality compilers are today.
About inlining, this is one of the things that compilers tend to do an extremely good job at. It was so by necessity, since if you look at how we write code in C++, we often write accessor functions (methods) just to do nothing more than return the value of a single variable. C++'s popularity hinged in large part on the idea that we can write this kind of code utilizing concepts like information hiding without being forced into creating software that is slower than its C-like equivalent, so you often found optimizers as early as the 90s doing a really good job at inlining (and recursively).
For this next part, it's somewhat speculative as I'm somewhat assuming that what I've been reading and studying about compiler design is applicable towards the production-quality compilers we're using today. Who knows exactly what kind of advanced tricks they're all applying?
... but I believe compilers typically inline code before you get to the kind of machine code level. This is because one of the keys to an optimizer is efficient instruction selection and register allocation. To do that, it needs to know all the memory (variables) the code is going to be working with inside a procedure. It wants that in a form that is somewhat abstract where specific registers haven't been chosen yet but are ready to be assigned. So inlining is usually done at this intermediate representation stage, before you get to the kind of assembly realm of specific machine instructions and registers, so that the compiler can gather up all that information before it does its magical optimizations. It might even apply some heuristics here to kind of 'try' inlining or unrolling away branches of code prior to actually doing it.
A lot of linkers can even inline code, and I'm not sure how that works. I think when they can do that, the object code is actually still in an intermediate representation form, still somewhat abstracted away from specific machine-level instructions and registers. Then the linker can still move that code between object files and inline it, deferring that code generation/optimization process until after.
The Problem: C++11 has made some changes to complex numbers so that real() and imag() can no longer be used and abused like member variables.
I have some code that I am converting over that passes real() and imag() to sincosf() by reference. It looks a little like this:
sincosf(/*...*/, &cplx.real(), &cplx.imag());
This now gives a error: lvalue required as unary '&' operand
which error was not received prior to c++11.
My Question: Is there an easy inline fix? or do I have to create temporary variables to get the result and then pass those to the complex number via setters?
Thanks
As T.C. mentions in the comments, the standard allows you to reinterpret_cast std::complex to your heart's content.
From N3337, §26.4/4 [complex.numbers]
If z is an lvalue expression of type cv std::complex<T> then:
— the expression reinterpret_cast<cv T(&)[2]>(z) shall be well-formed,
— reinterpret_cast<cv T(&)[2]>(z)[0] shall designate the real part of z, and
— reinterpret_cast<cv T(&)[2]>(z)[1] shall designate the imaginary part of z.
Moreover, if a is an expression of type cv std::complex<T>* and the expression a[i] is well-defined for an integer expression i, then:
— reinterpret_cast<cv T*>(a)[2*i] shall designate the real part of a[i], and
— reinterpret_cast<cv T*>(a)[2*i + 1] shall designate the imaginary part of a[i].
So make the following replacement in your code
sincosf(/*...*/,
&reinterpret_cast<T*>(&cplx)[0],
&reinterpret_cast<T*>(&cplx)[1]);
Just do
cplx = std::polar(1.0f, /*...*/);
Do not that declaring temporaries still is a good option and should be just as efficient as what you are previously doing, given a good optimizing compiler (say, gcc -O3).
See the resulting assembly from gcc -O3 here: https://goo.gl/uCPAa9
Using this code:
#include<complex>
std::complex<float> scf1(float x) {
float r = 0., i = 0.;
sincosf(x, &r, &i);
return std::complex<float>(r, i);
}
void scf2(std::complex<float>& cmp, float x) {
float r = 0., i = 0.;
sincosf(x, &r, &i);
cmp.real(r);
cmp.imag(i);
}
void scf3(std::complex<float>& cmp, float x) {
float r = 0., i = 0.;
sincosf(x, &cmp.real(), &cmp.imag());
}
Where scf2 is equivalent to your statement, you can see very similar assembly in the three cases.
scf1(float):
subq $24, %rsp
leaq 8(%rsp), %rsi
leaq 12(%rsp), %rdi
call sincosf
movss 12(%rsp), %xmm0
movss %xmm0, (%rsp)
movss 8(%rsp), %xmm0
movss %xmm0, 4(%rsp)
movq (%rsp), %xmm0
addq $24, %rsp
ret
scf2(std::complex<float>&, float):
pushq %rbx
movq %rdi, %rbx
subq $16, %rsp
leaq 8(%rsp), %rsi
leaq 12(%rsp), %rdi
call sincosf
movss 12(%rsp), %xmm0
movss %xmm0, (%rbx)
movss 8(%rsp), %xmm0
movss %xmm0, 4(%rbx)
addq $16, %rsp
popq %rbx
ret
scf3(std::complex<float>&, float):
pushq %rbx
movq %rdi, %rbx
subq $16, %rsp
leaq 8(%rsp), %rsi
leaq 12(%rsp), %rdi
call sincosf
movss 12(%rsp), %xmm0
movss %xmm0, (%rbx)
movss 8(%rsp), %xmm0
movss %xmm0, 4(%rbx)
addq $16, %rsp
popq %rbx
ret
I have a large Swig Python module. The C++ wrapper ends up being about 320,000 LoC (including headers I guess). I currently compile this with -O1, and g++ produces a binary that is 44MiB in size, and takes about 3 minutes to compile it.
If I turn off optimisation (-O0), the binary comes out at 40MiB, and it takes 44s to compile.
Is compiling the wrapper with -O0 going to hurt the performance of the python module significantly? Before I go and profile the performance of the module at different optimisation levels, has anyone done this sort of analysis before or have any insight into whether it matters?
-O0 deactivates all the optimizations performed by gcc. And optimizations matter.
So, without a lot of knowledge in your application, I could suggest that this will hurt the performance of your application.
A usually safe optimization level to use is -O2.
You can check the kind of optimizations performed by GCC in:
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.
But at the end, if you want to know exactly you should compile at different levels and profile.
This is bad irrespective of SWIG modules or not. There are many optimisations that happen even with gcc -O1 which you will miss if you prevent them happening.
You can check the difference by inspecting the asm generated by your compiler of choice. Of these the ones I trivially know will be detrimental to SWIG's generated wrapper:
Dead code elimination:
void foo() {
int a = 1;
a = 0;
}
With -O1 this entirely pointless code gets totally removed:
foo:
pushl %ebp
movl %esp, %ebp
popl %ebp
ret
whereas with -O0 it becomes:
foo:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
movl $1, -4(%ebp)
movl $0, -4(%ebp)
leave
ret
Register allocation will be detrimentally impacted in functions with lots of local variables - most SWIG wrapper functions will see a hit from this. It's hard to show a concise example of this though.
Another example, the output from gcc compiling the SWIG wrapper for the prototype:
int foo(unsigned int a, unsigned int b, unsigned int c, unsigned int d);
Generates with -O0:
Java_testJNI_foo:
pushl %ebp
movl %esp, %ebp
subl $88, %esp
movl 16(%ebp), %eax
movl %eax, -48(%ebp)
movl 20(%ebp), %eax
movl %eax, -44(%ebp)
movl 24(%ebp), %eax
movl %eax, -56(%ebp)
movl 28(%ebp), %eax
movl %eax, -52(%ebp)
movl 32(%ebp), %eax
movl %eax, -64(%ebp)
movl 36(%ebp), %eax
movl %eax, -60(%ebp)
movl 40(%ebp), %eax
movl %eax, -72(%ebp)
movl 44(%ebp), %eax
movl %eax, -68(%ebp)
movl $0, -32(%ebp)
movl -48(%ebp), %eax
movl %eax, -28(%ebp)
movl -56(%ebp), %eax
movl %eax, -24(%ebp)
movl -64(%ebp), %eax
movl %eax, -20(%ebp)
movl -72(%ebp), %eax
movl %eax, -16(%ebp)
movl -16(%ebp), %eax
movl %eax, 12(%esp)
movl -20(%ebp), %eax
movl %eax, 8(%esp)
movl -24(%ebp), %eax
movl %eax, 4(%esp)
movl -28(%ebp), %eax
movl %eax, (%esp)
call foo
movl %eax, -12(%ebp)
movl -12(%ebp), %eax
movl %eax, -32(%ebp)
movl -32(%ebp), %eax
leave
ret
Compared to -O1 which generates just:
Java_testJNI_foo:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl 40(%ebp), %eax
movl %eax, 12(%esp)
movl 32(%ebp), %eax
movl %eax, 8(%esp)
movl 24(%ebp), %eax
movl %eax, 4(%esp)
movl 16(%ebp), %eax
movl %eax, (%esp)
call foo
leave
ret
With -O1 g++ can generate far smarter code for:
%module test
%{
int value() { return 100; }
%}
%feature("compactdefaultargs") foo;
%inline %{
int foo(int a=value(), int b=value(), int c=value()) {
return 0;
}
%}
The short answer is with optimisations disabled completely GCC generates extremely naive code - this is true of SWIG wrappers as much as any other program, if not more given the style of the automatically generated code.
This question already has answers here:
Do temp variables slow down my program?
(5 answers)
Closed 5 years ago.
Let's say we have two functions:
int f();
int g();
I want to get the sum of f() and g().
First way:
int fRes = f();
int gRes = g();
int sum = fRes + gRes;
Second way:
int sum = f() + g();
Will be there any difference in performance in this two cases?
Same question for complex types instead of ints
EDIT
Do I understand right i should not worry about performance in such case (in each situation including frequently performed tasks) and use temporary variables to increase readability and to simplify the code ?
You can answer questions like this for yourself by compiling to assembly language (with optimization on, of course) and inspecting the output. If I flesh your example out to a complete, compilable program...
extern int f();
extern int g();
int direct()
{
return f() + g();
}
int indirect()
{
int F = f();
int G = g();
return F + G;
}
and compile it (g++ -S -O2 -fomit-frame-pointer -fno-exceptions test.cc; the last two switches eliminate a bunch of distractions from the output), I get this (further distractions deleted):
__Z8indirectv:
pushq %rbx
call __Z1fv
movl %eax, %ebx
call __Z1gv
addl %ebx, %eax
popq %rbx
ret
__Z6directv:
pushq %rbx
call __Z1fv
movl %eax, %ebx
call __Z1gv
addl %ebx, %eax
popq %rbx
ret
As you can see, the code generated for both functions is identical, so the answer to your question is no, there will be no performance difference. Now let's look at complex numbers -- same code, but s/int/std::complex<double>/g throughout and #include <complex> at the top; same compilation switches --
__Z8indirectv:
subq $72, %rsp
call __Z1fv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 48(%rsp)
movq 8(%rsp), %rax
movq %rax, 56(%rsp)
call __Z1gv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 32(%rsp)
movq 8(%rsp), %rax
movq %rax, 40(%rsp)
movsd 48(%rsp), %xmm0
addsd 32(%rsp), %xmm0
movsd 56(%rsp), %xmm1
addsd 40(%rsp), %xmm1
addq $72, %rsp
ret
__Z6directv:
subq $72, %rsp
call __Z1gv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 32(%rsp)
movq 8(%rsp), %rax
movq %rax, 40(%rsp)
call __Z1fv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 48(%rsp)
movq 8(%rsp), %rax
movq %rax, 56(%rsp)
movsd 48(%rsp), %xmm0
addsd 32(%rsp), %xmm0
movsd 56(%rsp), %xmm1
addsd 40(%rsp), %xmm1
addq $72, %rsp
ret
That's a lot more instructions and the compiler isn't doing a perfect optimization job, it looks like, but nonetheless the code generated for both functions is identical.
I think in the second way it is assigned to a temporary variable when the function returns a value anyway. However, it becomes somewhat significant when you need to use the values from f() and g() more than once case in which storing them to a variable instead of recalculating them each time can help.
If you have optimization turned off, there likely will be. If you have it turned on, they will likely result in identical code. This is especially true of you label the fRes and gRes as const.
Because it's legal for the compiler to elide the call to the copy constructor if fRes and gRes are complex types they will not differ in performance for complex types either.
Someone mentioned using fRes and gRes more than once. And of course, this is obviously potentially less optimal as you would have to call f() or g() more than once.
As you wrote it, there's only a subtle difference (which another answer addresses, that there's a sequence point in the one vs the other).
They would be different if you had done this instead:
int fRes;
int gRes;
fRes = f();
fRes = g();
int sum = fRes + gRes;
(Imagining that int as actually some other type with a non-trivial default constructor.)
In the case here, you invoke default constructors and then assignment operators, which is potentially more work.
It depends entirely on what optimizations the compiler performs. The two could compile to slightly different or exactly the same bytecode. Even if slightly different, you couldn't measure a statistically significant difference in time and space costs for those particular samples.
On my platform with full optimization turned on, a function returning the sum from both different cases compiled to exactly the same machine code.
The only minor difference between the two examples is that the first guarantees the order in which f() and g() are called, so in theory the second allows the compiler slightly more flexibility. Whether this ever makes a difference would depend on what f() and g() actually do and, perhaps, whether they can be inlined.
There is a slight difference between the two examples. In expression f() + g() there is no sequence point, whereas when the calls are made in different statements there are sequence points at the end of each statement.
The absence of a sequence point means the order these two functions are called is unspecified, they can be called in any order, which might help the compiler optimize it.