Should one use dynamic memory allocation when one knows that a variable will not be needed before it goes out of scope?
For example in the following function:
void func(){
int i =56;
//do something with i, i is not needed past this point
for(int t; t<1000000; t++){
//code
}
}
say one only needed i for a small section of the function, is it worthwhile deleting i as it is not needed in the very long for loop?
As Borgleader said:
A) This is micro (and most probably premature) optimization, meaning
don't worry about it. B) In this particular case, dynamically
allocation i might even hurt performance. tl;dr; profile first,
optimize later
As an example, I compiled the following two programs into assembly (using g++ -S flag with no optimisation enabled).
Creating i on the stack:
int main(void)
{
int i = 56;
i += 5;
for(int t = 0; t<1000; t++) {}
return 0;
}
Dynamically:
int main(void)
{
int* i = new int(56);
*i += 5;
delete i;
for(int t = 0; t<1000; t++) {}
return 0;
}
The first program compiled to:
movl $56, -8(%rbp) # Store 56 on stack (int i = 56)
addl $5, -8(%rbp) # Add 5 to i (i += 5)
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
And the second:
subq $16, %rsp # Allocate memory (new)
movl $4, %edi
call _Znwm
movl $56, (%rax) # Store 56 in *i
movq %rax, -16(%rbp)
movq -16(%rbp), %rax # Add 5
movl (%rax), %eax
leal 5(%rax), %edx
movq -16(%rbp), %rax
movl %edx, (%rax)
movq -16(%rbp), %rax # Free memory (delete)
movq %rax, %rdi
call _ZdlPv
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
In the above assembly output, you can see strait away that there is a significant difference between the number of commands being executed. If I compile the same programs with optimisation turned on. The first program produced the result:
xorl %eax, %eax # Equivalent to return 0;
The second produced:
movl $4, %edi
call _Znwm
movl $61, (%rax) # A smart compiler knows 56+5 = 61
movq %rax, %rdi
call _ZdlPv
xorl %eax, %eax
addq $8, %rsp
With optimisation on, the compiler becomes a pretty powerful tool for improving your code, in certain cases it can even detect that a program only returns 0 and get rid of all the unnecessary code. When you use dynamic memory in the code above, the program still has to request and then free the dynamic memory, it can't optimise it out.
Related
I read the cppreference guide over the sperimental feature of transactional memory and i try it.
I write some simple code with sincronized that as say cpp reference is not a transaction but only guarantees that the operation in the block are executed in a total order, the i write the same code with atomic_noexcept and atomic_commit, not with atomic_cancel that seems to be not yet implemented.
The doubt that i have is about the difference between atomic_noexcept, atomic_commit and synchronized, apparently they work in the same way, except for the compilation error if a no transaction safe function is called in an atomic block.
So I analyze the assembly code for the 3 variants, and result the same, as reported below:
cpp atomic_noexcept:
int a;
void thread_func() {
atomic_noexcept
{
++a;
}
}
assembly atomic_noexcept:
thread_func():
subq $8, %rsp
movl $43, %edi
xorl %eax, %eax
call _ITM_beginTransaction
testb $2, %al
jne .L2
movl $a, %edi
call _ITM_RfWU4
movl $a, %edi
leal 1(%rax), %esi
call _ITM_WaWU4
call _ITM_commitTransaction
addq $8, %rsp
ret
.L2:
addl $1, a(%rip)
addq $8, %rsp
jmp _ITM_commitTransaction
a:
.zero 4
cpp atomic_commit:
int a;
void thread_func() {
atomic_commit
{
++a;
}
}
assembly atomic_commit:
thread_func():
subq $8, %rsp
movl $43, %edi
xorl %eax, %eax
call _ITM_beginTransaction
testb $2, %al
jne .L2
movl $a, %edi
call _ITM_RfWU4
movl $a, %edi
leal 1(%rax), %esi
call _ITM_WaWU4
call _ITM_commitTransaction
addq $8, %rsp
ret
.L2:
addl $1, a(%rip)
addq $8, %rsp
jmp _ITM_commitTransaction
a:
.zero 4
cpp synchronized:
int a;
void thread_func() {
synchronized
{
++a;
}
}
assembly synchronized:
thread_func():
subq $8, %rsp
movl $43, %edi
xorl %eax, %eax
call _ITM_beginTransaction
testb $2, %al
jne .L2
movl $a, %edi
call _ITM_RfWU4
movl $a, %edi
leal 1(%rax), %esi
call _ITM_WaWU4
call _ITM_commitTransaction
addq $8, %rsp
ret
.L2:
addl $1, a(%rip)
addq $8, %rsp
jmp _ITM_commitTransaction
a:
.zero 4
How can they work differently? For example i report the explanation of different atomic block of cppreference:
atomic_noexcept : If an exception is thrown, std::abort is called
atomic_cancel : If an exception is thrown, std::abort is called,
unless the exception is one of the exceptions uses for transaction
cancellation (see below) in which case the transaction is cancelled:
the values of all memory locations in the program that were modified
by side effects of the operations of the atomic block are restored to
the values they had at the time the start of the atomic block was
executed, and the exception continues stack unwinding as usual.
atomic_commit : If an exception is thrown, the transaction is
committed normally.
How can atomic_noexcept work differently from atomic_commit if has the same assembly code?
How can syncronized block work differently from atomic block if has the same assembly code?
EDIT:
All these test and assembly code are extracted from last version of GCC (V. 10.2)
EDIT2:
After some test and research i haven't found yet a logical explanation for the said different behaviour.
Let's say I have this code:
int v;
setV(&v);
for (int i = 0; i < v - 5; i++) {
// Do stuff here, but don't use v.
}
Will the operation v - 5 be run every time or will a modern compiler be smart enough to store it once and never run it again?
What if I did this:
int v;
setV(&v);
const int cv = v;
for (int i = 0; i < cv - 5; i++) {
// Do stuff here. Changing cv is actually impossible.
}
Would the second style make a difference?
Edit:
This was an interesting question for an unexpected reason. It's more a question of the compiler avoiding the obtuse case of an unintended aliasing of v. If the compiler can prove that this won't happen (version 2) then we get better code.
The lesson here is to be more concerned with eliminating aliasing than trying to do the optimiser's job for it.
Making the copy cv actually presented the biggest optimisation (elision of redundant memory fetches), even though at a first glance it would appear to be (slightly) less efficient.
original answer and demo:
Let's see:
given:
extern void setV(int*);
extern void do_something(int i);
void test1()
{
int v;
setV(&v);
for (int i = 0; i < v - 5; i++) {
// Do stuff here, but don't use v.
do_something(i);
}
}
void test2()
{
int v;
setV(&v);
const int cv = v;
for (int i = 0; i < cv - 5; i++) {
// Do stuff here. Changing cv is actually impossible.
do_something(i);
}
}
compile on gcc5.3 with -x c++ -std=c++14 -O2 -Wall
gives:
test1():
pushq %rbx
subq $16, %rsp
leaq 12(%rsp), %rdi
call setV(int*)
cmpl $5, 12(%rsp)
jle .L1
xorl %ebx, %ebx
.L5:
movl %ebx, %edi
addl $1, %ebx
call do_something(int)
movl 12(%rsp), %eax
subl $5, %eax
cmpl %ebx, %eax
jg .L5
.L1:
addq $16, %rsp
popq %rbx
ret
test2():
pushq %rbp
pushq %rbx
subq $24, %rsp
leaq 12(%rsp), %rdi
call setV(int*)
movl 12(%rsp), %eax
cmpl $5, %eax
jle .L8
leal -5(%rax), %ebp
xorl %ebx, %ebx
.L12:
movl %ebx, %edi
addl $1, %ebx
call do_something(int)
cmpl %ebp, %ebx
jne .L12
.L8:
addq $24, %rsp
popq %rbx
popq %rbp
ret
The second form is better on this compiler.
I've been told many times that recursion is slow due to function calls, but in this code, it seems much faster than the iterative solution. At best, I typically expect a compiler to optimize recursion into iteration (which looking at the assembly, did seem to happen).
#include <iostream>
bool isDivisable(int x, int y)
{
for (int i = y; i != 1; --i)
if (x % i != 0)
return false;
return true;
}
bool isDivisableRec(int x, int y)
{
if (y == 1)
return true;
return x % y == 0 && isDivisableRec(x, y-1);
}
int findSmallest()
{
int x = 20;
for (; !isDivisable(x,20); ++x);
return x;
}
int main()
{
std::cout << findSmallest() << std::endl;
}
Assembly here: https://gist.github.com/PatrickAupperle/2b56e16e9e5a6a9b251e
I'd love to know what is going on here. I'm sure it is some tricky compiler optimization that I can be amazed to learn about.
Edit: I just realized I forgot to mention that if I use the recursive version, it runs in about .25 seconds, the iterative, about .6.
Edit 2: I am compiling with -O3 using
$ g++ --version
g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4
Though, I'm not really sure what that matters.
Edit 3:
Better benchmarking:
Source: http://gist.github.com/PatrickAupperle/ee8241ac51417437d012
Output: http://gist.github.com/PatrickAupperle/5870136a5552b83fd0f1
Running with 100 iterations shows very similar results
Edit 4:
At Roman's suggestion, I added -fno-inline-functions -fno-inline-small-functions to the compilation flags. The effect is extremely bizarre to me. The code runs about 15x faster, but the ratio between the recursive version and the iterative version remains similar.
https://gist.github.com/PatrickAupperle/3a87eb53a9f11c1f0bec
Using this code I also see large timing difference (in favor of the recursive version) with GCC 4.9.3 in Cygwin. I get
13.411 seconds for iterative
4.29101 seconds for recursive
Looking at the assembly code it generated with -O3, I see two things
The compiler replaced tail recursion in isDivisableRec with a cycle and then unrolled the cycle: each iteration of the cycle in the machine code covers two levels of the original recursion.
_Z14isDivisableRecii:
.LFB1467:
.seh_endprologue
movl %edx, %r8d
.L15:
cmpl $1, %r8d
je .L18
movl %ecx, %eax ; First unrolled divisibility check
cltd
idivl %r8d
testl %edx, %edx
je .L20
.L19:
xorl %eax, %eax
ret
.p2align 4,,10
.L20:
leal -1(%r8), %r9d
cmpl $1, %r9d
jne .L21
.p2align 4,,10
.L18:
movl $1, %eax
ret
.p2align 4,,10
.L21:
movl %ecx, %eax ; Second unrolled divisibility check
cltd
idivl %r9d
testl %edx, %edx
jne .L19
subl $2, %r8d
jmp .L15
.seh_endproc
The compiler inlined several iterations of isDivisableRec by lifting them into findSmallestRec. Since the value of y parameter of isDivisableRec is hardcoded as 20 the compiler managed to replace the iterations for 20, 19...15 with some "magical" code inlined directly into findSmallestRec. The actual call to isDivisableRec happens only for y parameter value of 14 (if it happens at all).
Here's the inlined code in findSmallestRec
movl $20, %ebx
movl $1717986919, %esi ; Magic constants
movl $1808407283, %edi ; for divisibility tests
movl $954437177, %ebp ;
movl $2021161081, %r12d ;
movl $-2004318071, %r13d ;
jmp .L28
.p2align 4,,10
.L29: ; The main cycle
addl $1, %ebx
.L28:
movl %ebx, %eax ; Divisibility by 20 test
movl %ebx, %ecx
imull %esi
sarl $31, %ecx
sarl $3, %edx
subl %ecx, %edx
leal (%rdx,%rdx,4), %eax
sall $2, %eax
cmpl %eax, %ebx
jne .L29
movl %ebx, %eax ; Divisibility by 19 test
imull %edi
sarl $3, %edx
subl %ecx, %edx
leal (%rdx,%rdx,8), %eax
leal (%rdx,%rax,2), %eax
cmpl %eax, %ebx
jne .L29
movl %ebx, %eax ; Divisibility by 18 test
imull %ebp
sarl $2, %edx
subl %ecx, %edx
leal (%rdx,%rdx,8), %eax
addl %eax, %eax
cmpl %eax, %ebx
jne .L29
movl %ebx, %eax ; Divisibility by 17 test
imull %r12d
sarl $3, %edx
subl %ecx, %edx
movl %edx, %eax
sall $4, %eax
addl %eax, %edx
cmpl %edx, %ebx
jne .L29
testb $15, %bl ; Divisibility by 16 test
jne .L29
movl %ebx, %eax ; Divisibility by 15 test
imull %r13d
leal (%rdx,%rbx), %eax
sarl $3, %eax
subl %ecx, %eax
movl %eax, %edx
sall $4, %edx
subl %eax, %edx
cmpl %edx, %ebx
jne .L29
movl $14, %edx
movl %ebx, %ecx
call _Z14isDivisableRecii ; call isDivisableRecii(x, 14)
...
The above blocks of machine instructions before each jne .L29 jump are divisibility tests for 20, 19...15 lifted directly into findSmallestRec. Apparently, they are more efficient than the tests used inside isDivisableRec for a run-time value of y. As you can see, the divisibility by 16 test is implemented simply as testb $15, %bl. Because of this, non-divisibility of x by high values of y is caught early by the above highly optimized code.
None of this happens for isDivisable and findSmallest - they are basically translated literally. Even the cycle is not unrolled.
I believe it is the second optimization that makes for the most of the difference. The compiler used highly optimized methods of checking divisibility for higher y values, which happen to be known at compile time.
If you replace the second argument of isDivisableRec with an "unpredictable" run-time value of 20 (instead of hard-coded compile-time constant 20), it should disable this optimization and bring the timings in line. I just tried this and ended up with
12.9 seconds for iterative
13.26 seconds for recursive
I'm starting to try to mess around with inlining ASM in C++, so I wrote up this little snippet:
#include <iostream>
int foo(int, int, int);
int main(void)
{
return foo(1,2,3);
}
int foo(int a, int b, int c)
{
asm volatile("add %1, %0\n\t"
"add %2, %0\n\t"
"add $0x01, %0":"+r"(a):"r"(b), "r"(c):"cc");
}
Which outputs the following assembly code:
main:
.LFB969:
subq $40, %rsp
.seh_stackalloc 40
.seh_endprologue
call __main
movl $3, %r8d
movl $2, %edx
movl $1, %ecx
call _Z3fooiii
... stuff not shown...
_Z3fooiii:
.LFB970:
.seh_endprologue
movl %ecx, 8(%rsp)
movl %edx, 16(%rsp)
movl %r8d, 24(%rsp)
movl 16(%rsp), %edx
movl 24(%rsp), %ecx
movl 8(%rsp), %eax
/APP
# 15 "K:\inline_asm_practice_1.cpp" 1
add %edx, %eax
add %ecx, %eax
add $0x01, %eax
# 0 "" 2
/NO_APP
movl %eax, 8(%rsp)
ret
So I can see where it inputs my code, but what's with the stack manipulations above it? Is there any way I can get rid of them; they seem unnecessary. I should just be able to have
(in main)
movl $3, %r8d
movl $2, %edx
movl $1, %ecx
call _Z3fooiii
(in foo)
add %edx, %ecx
add %r8d, %eax
add $0x01, %eax
ret
How do I make gcc understand that it doesn't need to shove things on the stack and bring them back in a different order? I've fried fastcall and regparam already, and I can't find anything aboout this.
You probably need to enable optimizations via something like -O2 in order to get the compiler to try and write better/faster code, instead simpler/easier to debug/understand code.
Inside a large loop, I currently have a statement similar to
if (ptr == NULL || ptr->calculate() > 5)
{do something}
where ptr is an object pointer set before the loop and never changed.
I would like to avoid comparing ptr to NULL in every iteration of the loop. (The current final program does that, right?) A simple solution would be to write the loop code once for (ptr == NULL) and once for (ptr != NULL). But this would increase the amount of code making it more difficult to maintain, plus it looks silly if the same large loop appears twice with only one or two lines changed.
What can I do? Use dynamically-valued constants maybe and hope the compiler is smart? How?
Many thanks!
EDIT by Luther Blissett. The OP wants to know if there is a better way to remove the pointer check here:
loop {
A;
if (ptr==0 || ptr->calculate()>5) B;
C;
}
than duplicating the loop as shown here:
if (ptr==0)
loop {
A;
B;
C;
}
else loop {
A;
if (ptr->calculate()>5) B;
C;
}
I just wanted to inform you, that apparently GCC can do this requested hoisting in the optimizer. Here's a model loop (in C):
struct C
{
int (*calculate)();
};
void sideeffect1();
void sideeffect2();
void sideeffect3();
void foo(struct C *ptr)
{
int i;
for (i=0;i<1000;i++)
{
sideeffect1();
if (ptr == 0 || ptr->calculate()>5) sideeffect2();
sideeffect3();
}
}
Compiling this with gcc 4.5 and -O3 gives:
.globl foo
.type foo, #function
foo:
.LFB0:
pushq %rbp
.LCFI0:
movq %rdi, %rbp
pushq %rbx
.LCFI1:
subq $8, %rsp
.LCFI2:
testq %rdi, %rdi # ptr==0? -> .L2, see below
je .L2
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L4:
xorl %eax, %eax
call sideeffect1 # sideeffect1
xorl %eax, %eax
call *0(%rbp) # call p->calculate, no check for ptr==0
cmpl $5, %eax
jle .L3
xorl %eax, %eax
call sideeffect2 # ok, call sideeffect2
.L3:
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L4
addq $8, %rsp
.LCFI3:
xorl %eax, %eax
popq %rbx
.LCFI4:
popq %rbp
.LCFI5:
ret
.L2: # here's the loop with ptr==0
.LCFI6:
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L6:
xorl %eax, %eax
call sideeffect1 # does not try to call ptr->calculate() anymore
xorl %eax, %eax
call sideeffect2
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L6
addq $8, %rsp
.LCFI7:
xorl %eax, %eax
popq %rbx
.LCFI8:
popq %rbp
.LCFI9:
ret
And so does clang 2.7 (-O3):
foo:
.Leh_func_begin1:
pushq %rbp
.Llabel1:
movq %rsp, %rbp
.Llabel2:
pushq %r14
pushq %rbx
.Llabel3:
testq %rdi, %rdi # ptr==NULL -> .LBB1_5
je .LBB1_5
movq %rdi, %rbx
movl $1000, %r14d
.align 16, 0x90
.LBB1_2:
xorb %al, %al # here's the loop with the ptr->calculate check()
callq sideeffect1
xorb %al, %al
callq *(%rbx)
cmpl $6, %eax
jl .LBB1_4
xorb %al, %al
callq sideeffect2
.LBB1_4:
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_2
jmp .LBB1_7
.LBB1_5:
movl $1000, %r14d
.align 16, 0x90
.LBB1_6:
xorb %al, %al # and here's the loop for the ptr==NULL case
callq sideeffect1
xorb %al, %al
callq sideeffect2
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_6
.LBB1_7:
popq %rbx
popq %r14
popq %rbp
ret
In C++, although completely overkill you can put the loop in a function and use a template. This will generate twice the body of the function, but eliminate the extra check which will be optimized out. While I certainly don't recommend it, here is the code:
template<bool ptr_is_null>
void loop() {
for(int i = x; i != y; ++i) {
/**/
if(ptr_is_null || ptr->calculate() > 5) {
/**/
}
/**/
}
}
You call it with:
if (ptr==NULL) loop<true>(); else loop<false>();
You are better off without this "optimization", the compiler will probably do the RightThing(TM) for you.
Why do you want to avoid comparing to NULL?
Creating a variant for each of the NULL and non-NULL cases just gives you almost twice as much code to write, test and more importantly maintain.
A 'large loop' smells like an opportunity to refactor the loop into separate functions, in order to make the code easier to maintain. Then you can easily have two variants of the loop, one for ptr == null and one for ptr != null, calling different functions, with just a rough similarity in the overall structure of the loop.
Since
ptr is an object pointer set before the loop and never changed
can't you just check if it is null before the loop and not check again... since you don't change it.
If it is not valid for your pointer to be NULL, you could use a reference instead.
If it is valid for your pointer to be NULL, but if so then you skip all processing, then you could either wrap your code with one check at the beginning, or return early from your function:
if (ptr != NULL)
{
// your function
}
or
if (ptr == NULL) { return; }
If it is valid for your pointer to be NULL, but only some processing is skipped, then keep it like it is.
if (ptr == NULL || ptr->calculate() > 5)
{do something}
I would simply think in terms of what is done if the condition is true.
If "do something" is really the exact same stuff for (ptr == NULL) or (ptr->calculate() > 5), then I hardly see a reason to split up anything.
If "do something" contains particular cases for either condition, then I would consider to refactor into separate loops to get rid of extra special case checking. Depends on the special cases involved.
Eliminating code duplication is good up to a point. You should not care too much about optimizing until your program does what it should do and until performance becomes a problem.
[...] Premature optimization is the root of all evil
http://en.wikipedia.org/wiki/Program_optimization