Why does the compiler emit so many instructions before a function call (see the link below)? As I understand it, it should only pass the function parameters before the call.
struct A {
    int c = 5;
    void test(unsigned int a) {
        a++;
        c++;
    }
};
struct C {
    int k = 2;
    A a;
};
struct D {
    int k = 2;
    C c;
};
struct B {
    int k = 2;
    D d;
};
void test(unsigned int a) {
    a++;
}
B *b = new B();
A *ae = new A();
int main()
{
    int a = 1;
    A ai;
    B bi;
    C ci;
    // 2 operations (why not push/pop?)
    // movl -36(%rbp), %eax
    // movl %eax, %edi
    // call test(unsigned int)
    test(a);
    // 4 operations (why 4? do we pass something else?)
    // movl -36(%rbp), %edx
    // leaq -48(%rbp), %rax
    // movl %edx, %esi
    // movq %rax, %rdi
    // call A::test(unsigned int)
    ai.test(a);
    ae->test(a);
    // 5 operations before the call (what the hell is going on here? why that "addq"?)
    // movl -36(%rbp), %eax
    // leaq -32(%rbp), %rdx
    // addq $4, %rdx
    // movl %eax, %esi
    // movq %rdx, %rdi
    // call A::test(unsigned int)
    ci.a.test(a);
    bi.d.c.a.test(a);
    b->d.c.a.test(a);
    // no matter how long this chain is - it will always take 5 operations
}
http://goo.gl/smFSA6
Why does calling a class member function take 4 instructions to prepare the call? Do we load the object address into a register as well?
And the last case, with 5 ops, is just beyond me...
P.S. In the days of my youth we usually put function parameters on the stack (push) and then read them back (pop). Now what, we pass parameters through registers?
It's normal. In assembly, an instruction usually does only one thing. For example, in the last case:
movl -36(%rbp), %eax ; move a to %eax
leaq -32(%rbp), %rdx ; load &ci into %rdx
addq $4, %rdx ; set %rdx to &ci.a = &ci + offset of a within C
movl %eax, %esi ; move a from %eax to %esi (second parameter)
movq %rdx, %rdi ; move &ci.a from %rdx to %rdi (first parameter, the this pointer)
call A::test(unsigned int) ; call A::test
On 64-bit Linux systems (the System V AMD64 ABI), function parameters are no longer passed on the stack: the first 6 integer parameters go in the %rdi, %rsi, %rdx, %rcx, %r8 and %r9 registers, floating-point values use the %xmm0 - %xmm7 registers, and the rest are passed on the stack.
The local variables, of course, still live on the stack and are accessed through %rbp.
I've seen people define a member function like this:
void getValue(int& v)
{
v = m_value;
}
and also like this:
int getValue()
{
return m_value;
}
I guess the first saves memory? Is that the only time you would use the first type of get-function? The second seems a lot more convenient.
I thought I would godbolt it for you
source
#include <iostream>
struct Foof{
int m_val;
Foof(int v){
m_val = v;
}
void woodle()
{
if(m_val > 42)
m_val++;
else
m_val--;
}
void Get1(int &v)
{
v = m_val;
}
int Get2()
{
return m_val;
}
};
int main(int c, char**v){
int q;
std::cin >> q;
Foof f1(q);
std::cin >> q;
Foof f2(q);
f1.woodle();
f2.woodle();
int k;
f1.Get1(k);
int j = f2.Get2();
std::cout << k << j;
}
The woodle function and the cin initialization are there to make the compiler think a bit.
I have 2 Foofs because otherwise the compiler goes "well, I know the answer to this question" when I call Get2 after Get1.
Compiled with -O3 - ie optimize hard. The code comes out as (gcc):
pushq %rbx
movl $_ZSt3cin, %edi
subq $16, %rsp
leaq 12(%rsp), %rsi
call std::basic_istream<char, std::char_traits<char> >::operator>>(int&)
movl 12(%rsp), %ebx
leaq 12(%rsp), %rsi
movl $_ZSt3cin, %edi
call std::basic_istream<char, std::char_traits<char> >::operator>>(int&)
movl 12(%rsp), %eax
movl $_ZSt4cout, %edi
leal 1(%rbx), %edx
cmpl $43, %ebx
leal -1(%rbx), %esi
cmovge %edx, %esi
leal -1(%rax), %ebx
leal 1(%rax), %edx
cmpl $43, %eax
cmovge %edx, %ebx
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
movq %rax, %rdi
movl %ebx, %esi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
addq $16, %rsp
xorl %eax, %eax
popq %rbx
ret
I separated out the actual calls to Get1 and Get2. You can see that:
the generated code is identical
the compiler is very aggressive at optimizing - there are no function calls left, etc.
Lesson: write your code to be human-readable and let the compiler do the heavy lifting.
I have the following linear algebra function call (vector-vector addition) in C++.
int m = 4;
blasfeo_dvec one, two, three;
blasfeo_allocate_dvec(m, &one);
blasfeo_allocate_dvec(m, &two);
blasfeo_allocate_dvec(m, &three);
// initialize vectors ... (omitted)
blasfeo_daxpy(m, 1.0, &one, 0, &two, 0, &three, 0);
Using expression templates (ETs), we can wrap it as follows:
three = one + two;
where the vector struct looks like
struct blasfeo_dvec {
int m; // length
int pm; // packed length
double *pa; // pointer to a pm array of doubles, the first is aligned to cache line size
int memsize; // size of needed memory
void operator=(const vec_expression_sum<blasfeo_dvec, blasfeo_dvec> expr) {
blasfeo_daxpy(m, 1.0, (blasfeo_dvec *) &expr.vec_a, 0, (blasfeo_dvec *) &expr.vec_b, 0, this, 0);
}
};
The cast to non-const is necessary because blasfeo_daxpy takes non-const pointers. The ET code is simply
template<typename Ta, typename Tb>
struct vec_expression_sum {
const Ta vec_a;
const Tb vec_b;
vec_expression_sum(const Ta va, const Tb vb) : vec_a {va}, vec_b {vb} {}
};
template<typename Ta, typename Tb>
auto operator+(const Ta a, const Tb b) {
return vec_expression_sum<Ta, Tb>(a, b);
}
The 'native' call, i.e. blasfeo_daxpy(...) generates the following assembly:
; allocation and initialization omitted ...
movl $0, (%rsp)
movl $4, %edi
xorl %edx, %edx
xorl %r8d, %r8d
movsd LCPI0_0(%rip), %xmm0 ## xmm0 = mem[0],zero
movq %r14, %rsi
movq %rbx, %rcx
movq %r15, %r9
callq _blasfeo_daxpy
...
which is exactly what you would expect. The ET code is quite a bit longer:
; allocation :
leaq -120(%rbp), %rbx
movl $4, %edi
movq %rbx, %rsi
callq _blasfeo_allocate_dvec
leaq -96(%rbp), %r15
movl $4, %edi
movq %r15, %rsi
callq _blasfeo_allocate_dvec
leaq -192(%rbp), %r14
movl $4, %edi
movq %r14, %rsi
callq _blasfeo_allocate_dvec
; initialization code omitted
; operator+ :
movq -104(%rbp), %rax
movq %rax, -56(%rbp)
movq -120(%rbp), %rax
movq -112(%rbp), %rcx
movq %rcx, -64(%rbp)
movq %rax, -72(%rbp)
; vec_expression_sum :
movq -80(%rbp), %rax
movq %rax, -32(%rbp)
movq -96(%rbp), %rax
movq -88(%rbp), %rcx
movq %rcx, -40(%rbp)
movq %rax, -48(%rbp)
movq -32(%rbp), %rax
movq %rax, -128(%rbp)
movq -40(%rbp), %rax
movq %rax, -136(%rbp)
movq -48(%rbp), %rax
movq %rax, -144(%rbp)
movq -56(%rbp), %rax
movq %rax, -152(%rbp)
movq -72(%rbp), %rax
movq -64(%rbp), %rcx
movq %rcx, -160(%rbp)
movq %rax, -168(%rbp)
leaq -144(%rbp), %rcx
; blasfeo_daxpy :
movl -192(%rbp), %edi
movl $0, (%rsp)
leaq -168(%rbp), %rsi
xorl %edx, %edx
xorl %r8d, %r8d
movsd LCPI0_0(%rip), %xmm0 ## xmm0 = mem[0],zero
movq %r14, %r9
callq _blasfeo_daxpy
...
It involves quite a bit of copying, namely the fields of blasfeo_dvec. I (naively, maybe) hoped that the ET code would generate the exact same code as the native call, given that everything is fixed at compile time and const, but it doesn't.
The question is: why the extra loads? And is there a way of getting fully 'optimized' code? (edit: I use Apple LLVM version 8.1.0 (clang-802.0.42) with -std=c++14 -O3)
Note: I read and understood this and this post on a similar topic, but they unfortunately do not contain an answer to my question.
Let's say I have this code:
int v;
setV(&v);
for (int i = 0; i < v - 5; i++) {
// Do stuff here, but don't use v.
}
Will the operation v - 5 be run every time or will a modern compiler be smart enough to store it once and never run it again?
What if I did this:
int v;
setV(&v);
const int cv = v;
for (int i = 0; i < cv - 5; i++) {
// Do stuff here. Changing cv is actually impossible.
}
Would the second style make a difference?
Edit:
This was an interesting question for an unexpected reason. It's more a question of the compiler avoiding the obtuse case of unintended aliasing of v. If the compiler can prove that this can't happen (version 2), then we get better code.
The lesson here is to be more concerned with eliminating aliasing than with trying to do the optimiser's job for it.
Making the copy cv actually presented the biggest optimisation (elision of redundant memory fetches), even though at a first glance it would appear to be (slightly) less efficient.
original answer and demo:
Let's see:
given:
extern void setV(int*);
extern void do_something(int i);
void test1()
{
int v;
setV(&v);
for (int i = 0; i < v - 5; i++) {
// Do stuff here, but don't use v.
do_something(i);
}
}
void test2()
{
int v;
setV(&v);
const int cv = v;
for (int i = 0; i < cv - 5; i++) {
// Do stuff here. Changing cv is actually impossible.
do_something(i);
}
}
compile on gcc5.3 with -x c++ -std=c++14 -O2 -Wall
gives:
test1():
pushq %rbx
subq $16, %rsp
leaq 12(%rsp), %rdi
call setV(int*)
cmpl $5, 12(%rsp)
jle .L1
xorl %ebx, %ebx
.L5:
movl %ebx, %edi
addl $1, %ebx
call do_something(int)
movl 12(%rsp), %eax
subl $5, %eax
cmpl %ebx, %eax
jg .L5
.L1:
addq $16, %rsp
popq %rbx
ret
test2():
pushq %rbp
pushq %rbx
subq $24, %rsp
leaq 12(%rsp), %rdi
call setV(int*)
movl 12(%rsp), %eax
cmpl $5, %eax
jle .L8
leal -5(%rax), %ebp
xorl %ebx, %ebx
.L12:
movl %ebx, %edi
addl $1, %ebx
call do_something(int)
cmpl %ebp, %ebx
jne .L12
.L8:
addq $24, %rsp
popq %rbx
popq %rbp
ret
The second form is better on this compiler.
What is the cleanest way to write a function (not a procedure)?
Is the 2nd solution said to have a "side effect"?
struct myArea
{
int t[10][10]; // it could by 100x100...
};
Solution 1 : pass by value
double mySum1(myArea a)
{
// compute and return the sum of elements
}
Solution 2 : pass by const reference
double mySum2(const myArea & a)
{
// compute and return the sum of elements
}
My preferred one is the first (a clean function), although it is less efficient. But when there is a lot of data to copy, it can be time-consuming.
Thank you for feedback.
I have a number of quibbles with your terminology:
There's no such thing as a "procedure" in C or C++. At best, there are functions that return no value: "void".
Your example has no "side effect".
I'm not sure what you mean by "clean function" ... but I HOPE you don't mean "less source == cleaner code". Nothing could be further from the truth :(
TO ANSWER YOUR ORIGINAL QUESTION:
In your example, double mySum1(myArea a) incurs the space and CPU overhead of a COMPLETELY UNNECESSARY COPY. Don't do it :)
To my mind, double mySum1(myArea & a) or double mySum1(myArea * a) are equivalent. Personally, I'd prefer double mySum1(myArea * a) ... but most C++ developers would (rightly!) prefer double mySum1(myArea & a).
double mySum1(const myArea & a) is best of all: it has the runtime efficiency of the reference version, and it signals your intent that it WON'T modify the array.
PS:
I generated assembly output from the following test:
struct myArea {
int t[10][10];
};
double mySum1(myArea a) {
double sum = 0.0;
for (int i=0; i < 10; i++)
for (int j=0; j<10; j++)
sum += a.t[i][j];
return sum;
}
double mySum2(myArea & a) {
double sum = 0.0;
for (int i=0; i < 10; i++)
for (int j=0; j<10; j++)
sum += a.t[i][j];
return sum;
}
double mySum3(myArea * a) {
double sum = 0.0;
for (int i=0; i < 10; i++)
for (int j=0; j<10; j++)
sum += a->t[i][j];
return sum;
}
double mySum4(const myArea & a) {
double sum = 0.0;
for (int i=0; i < 10; i++)
for (int j=0; j<10; j++)
sum += a.t[i][j];
return sum;
}
mySum1, as you'd expect, had extra code to do the extra copy.
The output for mySum2, mySum3 and mySum4, however, was IDENTICAL:
_Z6mySum2R6myArea:
.LFB1:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
movq %rdi, -32(%rbp)
movl $0, %eax
movq %rax, -24(%rbp)
movl $0, -16(%rbp)
jmp .L8
.cfi_offset 3, -24
.L11:
movl $0, -12(%rbp)
jmp .L9
.L10:
movl -16(%rbp), %eax
movl -12(%rbp), %edx
movq -32(%rbp), %rcx
movslq %edx, %rbx
movslq %eax, %rdx
movq %rdx, %rax
salq $2, %rax
addq %rdx, %rax
addq %rax, %rax
addq %rbx, %rax
movl (%rcx,%rax,4), %eax
cvtsi2sd %eax, %xmm0
movsd -24(%rbp), %xmm1
addsd %xmm1, %xmm0
movsd %xmm0, -24(%rbp)
addl $1, -12(%rbp)
.L9:
cmpl $9, -12(%rbp)
setle %al
testb %al, %al
jne .L10
addl $1, -16(%rbp)
.L8:
cmpl $9, -16(%rbp)
setle %al
testb %al, %al
jne .L11
movq -24(%rbp), %rax
movq %rax, -40(%rbp)
movsd -40(%rbp), %xmm0
popq %rbx
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
mySum3 and mySum4 had different labels ... but identical instructions!
It's also worth noting that one of the benefits of "const" is that it can help the compiler perform several different kinds of optimizations, whenever possible. For example:
What kind of optimization does const offer in C/C++? (if any)
The C++ 'const' Declaration: Why & How
Please note there is no such thing as "procedure" in C++. Functions that don't return anything are still functions.
Now to the question: If your parameter is an output parameter or an input/output parameter, that is, you want the caller to see changes made inside the function to the object passed to it, then pass by reference. Otherwise if the type is small/trivially cheap to copy, pass by value. Otherwise, pass by reference to const. In your case I'd pass by reference to const.
Parameter-passing is not a side effect in itself.
If a function does anything observable that isn't just returning a value, that would be a side effect.
(For example modifying a reference parameter, printing something, modifying any global state...)
That is, even if you pass by non-const reference, the presence of side effects depends on whether you modify the referenced object.
Should one use dynamic memory allocation when one knows that a variable will not be needed before it goes out of scope?
For example in the following function:
void func(){
int i =56;
//do something with i, i is not needed past this point
for(int t = 0; t < 1000000; t++){
//code
}
}
say one only needs i for a small section of the function; is it worthwhile deleting i early, given that it is not needed during the very long for loop?
As Borgleader said:
A) This is micro (and most probably premature) optimization, meaning
don't worry about it. B) In this particular case, dynamically
allocating i might even hurt performance. tl;dr; profile first,
optimize later
As an example, I compiled the following two programs into assembly (using g++ -S flag with no optimisation enabled).
Creating i on the stack:
int main(void)
{
int i = 56;
i += 5;
for(int t = 0; t<1000; t++) {}
return 0;
}
Dynamically:
int main(void)
{
int* i = new int(56);
*i += 5;
delete i;
for(int t = 0; t<1000; t++) {}
return 0;
}
The first program compiled to:
movl $56, -8(%rbp) # Store 56 on stack (int i = 56)
addl $5, -8(%rbp) # Add 5 to i (i += 5)
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
And the second:
subq $16, %rsp # Allocate memory (new)
movl $4, %edi
call _Znwm
movl $56, (%rax) # Store 56 in *i
movq %rax, -16(%rbp)
movq -16(%rbp), %rax # Add 5
movl (%rax), %eax
leal 5(%rax), %edx
movq -16(%rbp), %rax
movl %edx, (%rax)
movq -16(%rbp), %rax # Free memory (delete)
movq %rax, %rdi
call _ZdlPv
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
In the above assembly output, you can see straight away that there is a significant difference between the number of instructions being executed. If I compile the same programs with optimisation turned on, the first program produces:
xorl %eax, %eax # Equivalent to return 0;
The second produced:
movl $4, %edi
call _Znwm
movl $61, (%rax) # A smart compiler knows 56+5 = 61
movq %rax, %rdi
call _ZdlPv
xorl %eax, %eax
addq $8, %rsp
With optimisation on, the compiler becomes a pretty powerful tool for improving your code; in certain cases it can even detect that a program only returns 0 and get rid of all the unnecessary code. When you use dynamic memory in the code above, however, the program still has to request and then free the dynamic memory - that can't be optimised out.