Related
I've seen people define a member function like this:
void getValue(int& v)
{
v = m_value;
}
and also like this:
int getValue()
{
return m_value;
}
I guess the first saves memory? Is that the only time you would use the first type of get-function? The second seems a lot more convenient.
I thought I would godbolt it for you
source
#include <iostream>
struct Foof{
int m_val;
Foof(int v){
m_val = v;
}
void woodle()
{
if(m_val > 42)
m_val++;
else
m_val--;
}
void Get1(int &v)
{
v = m_val;
}
int Get2()
{
return m_val;
}
};
int main(int c, char**v){
int q;
std::cin >> q;
Foof f1(q);
std::cin >> q;
Foof f2(q);
f1.woodle();
f2.woodle();
int k;
f1.Get1(k);
int j = f2.Get2();
std::cout << k << j;
}
the woodle function and the cin to initialize is to make the compiler think a bit
I have 2 foofs otherwise the compiler goes "well I know the answer to this question" when I call Get2 after Get1
compiled with -03 - ie optimize hard. The code comes out as (gcc)
pushq %rbx
movl $_ZSt3cin, %edi
subq $16, %rsp
leaq 12(%rsp), %rsi
call std::basic_istream<char, std::char_traits<char> >::operator>>(int&)
movl 12(%rsp), %ebx
leaq 12(%rsp), %rsi
movl $_ZSt3cin, %edi
call std::basic_istream<char, std::char_traits<char> >::operator>>(int&)
movl 12(%rsp), %eax
movl $_ZSt4cout, %edi
leal 1(%rbx), %edx
cmpl $43, %ebx
leal -1(%rbx), %esi
cmovge %edx, %esi
leal -1(%rax), %ebx
leal 1(%rax), %edx
cmpl $43, %eax
cmovge %edx, %ebx
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
movq %rax, %rdi
movl %ebx, %esi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
addq $16, %rsp
xorl %eax, %eax
popq %rbx
ret
I separated out the actual calls to Get1 or Get2 you can see that
the generated code is identical
the compiler is very aggressive at optimizing, there are no function calls etc
Lesson, write your code to be human readable and let the compiler do the heavy lifting
I'm working on the classic "Reverse a String" problem.
Is a good idea to use the position of the null terminator for swap space? The idea is to save the declaration of one variable.
Specifically, starting with Kernighan and Ritchie's algorithm:
void reverse(char s[])
{
int length = strlen(s);
int c, i, j;
for (i = 0, j = length - 1; i < j; i++, j--)
{
c = s[i];
s[i] = s[j];
s[j] = c;
}
}
...can we instead do the following?
void reverseUsingNullPosition(char s[]) {
int length = strlen(s);
int i, j;
for (i = 0, j = length - 1; i < j; i++, j--) {
s[length] = s[i]; // Use last position instead of a new var
s[i] = s[j];
s[j] = s[length];
}
s[length] = 0; // Replace null character
}
Notice how the "c" variable is no longer needed. We simply use the last position in the array--where the null termination resides--as our swap space. When we're done, we simply replace the 0.
Here's the main routine (Xcode):
#include <stdio.h>
#include <string>
int main(int argc, const char * argv[]) {
char cheese[] = { 'c' , 'h' , 'e' , 'd' , 'd' , 'a' , 'r' , 0 };
printf("Cheese is: %s\n", cheese); //-> Cheese is: cheddar
reverse(cheese);
printf("Cheese is: %s\n", cheese); //-> Cheese is: raddehc
reverseUsingNullPosition(cheese);
printf("Cheese is: %s\n", cheese); //-> Cheese is: cheddar
}
Yes, this can be done. No, this is not a good idea, because it makes your program much harder to optimize.
When you declare char c in the local scope, the optimizer can figure out that the value is not used beyond the s[j] = c; assignment, and could place the temporary in a register. In addition to effectively eliminating the variable for you, the optimizer could even figure out that you are performing a swap, and emit a hardware-specific instruction. All this would save you a memory access per character.
When you use s[length] for your temporary, the optimizer does not have as much freedom. It is forced to emit the write into memory. This could be just as fast due to caching, but on embedded platforms this could have a significant effect.
First of all such microoptimizations are totally irrelevant until proven relevant. We're talking about C++, you have std::string, std::reverse, you shouldn't even think about such facts.
In any case if you compile both code with -Os on Xcode you obtain for reverse:
.cfi_startproc
Lfunc_begin0:
pushq %rbp
Ltmp3:
.cfi_def_cfa_offset 16
Ltmp4:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp5:
.cfi_def_cfa_register %rbp
pushq %r14
pushq %rbx
Ltmp6:
.cfi_offset %rbx, -32
Ltmp7:
.cfi_offset %r14, -24
movq %rdi, %r14
Ltmp8:
callq _strlen
Ltmp9:
leal -1(%rax), %ecx
testl %ecx, %ecx
jle LBB0_3
Ltmp10:
movslq %ecx, %rcx
addl $-2, %eax
Ltmp11:
xorl %edx, %edx
LBB0_2:
Ltmp12:
movb (%r14,%rdx), %sil
movb (%r14,%rcx), %bl
movb %bl, (%r14,%rdx)
movb %sil, (%r14,%rcx)
Ltmp13:
incq %rdx
decq %rcx
cmpl %eax, %edx
leal -1(%rax), %eax
jl LBB0_2
Ltmp14:
LBB0_3:
popq %rbx
popq %r14
popq %rbp
ret
Ltmp15:
Lfunc_end0:
.cfi_endproc
and for reverseUsingNullPosition:
.cfi_startproc
Lfunc_begin1:
pushq %rbp
Ltmp19:
.cfi_def_cfa_offset 16
Ltmp20:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp21:
.cfi_def_cfa_register %rbp
pushq %rbx
pushq %rax
Ltmp22:
.cfi_offset %rbx, -24
movq %rdi, %rbx
Ltmp23:
callq _strlen
Ltmp24:
leal -1(%rax), %edx
testl %edx, %edx
Ltmp25:
movslq %eax, %rdi
jle LBB1_3
Ltmp26:
movslq %edx, %rdx
addl $-2, %eax
Ltmp27:
xorl %esi, %esi
LBB1_2:
Ltmp28:
movb (%rbx,%rsi), %cl
movb %cl, (%rbx,%rdi)
movb (%rbx,%rdx), %cl
movb %cl, (%rbx,%rsi)
movb (%rbx,%rdi), %cl
movb %cl, (%rbx,%rdx)
Ltmp29:
incq %rsi
decq %rdx
cmpl %eax, %esi
leal -1(%rax), %eax
jl LBB1_2
Ltmp30:
LBB1_3: ## %._crit_edge
movb $0, (%rbx,%rdi)
addq $8, %rsp
popq %rbx
Ltmp31:
popq %rbp
ret
Ltmp32:
Lfunc_end1:
.cfi_endproc
If you check the inner loop you have
movb (%r14,%rdx), %sil
movb (%r14,%rcx), %bl
movb %bl, (%r14,%rdx)
movb %sil, (%r14,%rcx)
vs
movb (%rbx,%rsi), %cl
movb %cl, (%rbx,%rdi)
movb (%rbx,%rdx), %cl
movb %cl, (%rbx,%rsi)
movb (%rbx,%rdi), %cl
movb %cl, (%rbx,%rdx)
So I wouldn't say you are saving so much overhead as you think (since you are accessing the array more times), maybe yes, maybe no. Which teaches you another thing: thinking that some code is more performant than other code is irrelevant, the only thing that matters is a well-done benchmark and profile of the code.
Legal: Yes
Good idea: No
The cost of an "extra" variable is zero so there is absolutely no reason to avoid it. The stack pointer needs to be changed anyway so it doesn't matter if it needs to cope with an extra int.
Further:
With compiler optimization turned on, the variable c in the original code will most likely not even exists. It will just be a register in the cpu.
With your code: Optimization will be more difficult so it is not easy to say how well the compiler will do. Maybe you'll get the same - maybe you'll get something worse. But you won't get anything better.
So just forget the idea.
We can use printf and the STL and also manually unroll things and use pointers.
#include <stdio.h>
#include <string>
#include <cstring>
void reverse(char s[])
{
char * b=s;
char * e=s+::strlen(s)-4;
while (e - b > 4)
{
std::swap(b[0], e[3]);
std::swap(b[1], e[2]);
std::swap(b[2], e[1]);
std::swap(b[3], e[0]);
b+=4;
e-=4;
}
e+=3;
while (b < e)
{
std::swap(*(b++), *(e--));
}
}
int main(int argc, const char * argv[]) {
char cheese[] = { 'c' , 'h' , 'e' , 'd' , 'd' , 'a' , 'r' , 0 };
printf("Cheese is: %s\n", cheese); //-> Cheese is: cheddar
reverse(cheese);
printf("Cheese is: %s\n", cheese); //-> Cheese is: raddehc
}
Hard to tell if its faster with just the test case of "cheddar"
Should one use dynamic memory allocation when one knows that a variable will not be needed before it goes out of scope?
For example in the following function:
void func(){
int i =56;
//do something with i, i is not needed past this point
for(int t; t<1000000; t++){
//code
}
}
say one only needed i for a small section of the function, is it worthwhile deleting i as it is not needed in the very long for loop?
As Borgleader said:
A) This is micro (and most probably premature) optimization, meaning
don't worry about it. B) In this particular case, dynamically
allocation i might even hurt performance. tl;dr; profile first,
optimize later
As an example, I compiled the following two programs into assembly (using g++ -S flag with no optimisation enabled).
Creating i on the stack:
int main(void)
{
int i = 56;
i += 5;
for(int t = 0; t<1000; t++) {}
return 0;
}
Dynamically:
int main(void)
{
int* i = new int(56);
*i += 5;
delete i;
for(int t = 0; t<1000; t++) {}
return 0;
}
The first program compiled to:
movl $56, -8(%rbp) # Store 56 on stack (int i = 56)
addl $5, -8(%rbp) # Add 5 to i (i += 5)
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
And the second:
subq $16, %rsp # Allocate memory (new)
movl $4, %edi
call _Znwm
movl $56, (%rax) # Store 56 in *i
movq %rax, -16(%rbp)
movq -16(%rbp), %rax # Add 5
movl (%rax), %eax
leal 5(%rax), %edx
movq -16(%rbp), %rax
movl %edx, (%rax)
movq -16(%rbp), %rax # Free memory (delete)
movq %rax, %rdi
call _ZdlPv
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
In the above assembly output, you can see strait away that there is a significant difference between the number of commands being executed. If I compile the same programs with optimisation turned on. The first program produced the result:
xorl %eax, %eax # Equivalent to return 0;
The second produced:
movl $4, %edi
call _Znwm
movl $61, (%rax) # A smart compiler knows 56+5 = 61
movq %rax, %rdi
call _ZdlPv
xorl %eax, %eax
addq $8, %rsp
With optimisation on, the compiler becomes a pretty powerful tool for improving your code, in certain cases it can even detect that a program only returns 0 and get rid of all the unnecessary code. When you use dynamic memory in the code above, the program still has to request and then free the dynamic memory, it can't optimise it out.
This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 9 years ago.
#include <stdint.h>
#include <iostream>
using namespace std;
uint32_t k[] = {0, 1, 17};
template <typename T>
bool f(T *data, int i) {
return data[0] < (T)(1 << k[i]);
}
int main() {
uint8_t v = 0;
cout << f(&v, 2) << endl;
cout << (0 < (uint8_t)(1 << 17)) << endl;
return 0;
}
g++ a.cpp && ./a.out
1
0
Why am I getting these results?
It looks like gcc reverses the shift and applies it to the other side, and I guess this is a bug.
In C (instead of C++) the same thing happens, and C translated to asm is easier to read, so I'm using C here; also I reduced the test cases (dropping templates and the k array).
foo() is the original buggy f() function, foo1() is what foo() behaves like with gcc but shouldn't, and bar() shows what foo() should look like apart from the pointer read.
I'm on 64-bit, but 32-bit is the same apart from the parameter handling and finding k.
#include <stdint.h>
#include <stdio.h>
uint32_t k = 17;
char foo(uint8_t *data) {
return *data < (uint8_t)(1<<k);
/*
with gcc -O3 -S: (gcc version 4.7.2 (Debian 4.7.2-5))
movzbl (%rdi), %eax
movl k(%rip), %ecx
shrb %cl, %al
testb %al, %al
sete %al
ret
*/
}
char foo1(uint8_t *data) {
return (((uint32_t)*data) >> k) < 1;
/*
movzbl (%rdi), %eax
movl k(%rip), %ecx
shrl %cl, %eax
testl %eax, %eax
sete %al
ret
*/
}
char bar(uint8_t data) {
return data < (uint8_t)(1<<k);
/*
movl k(%rip), %ecx
movl $1, %eax
sall %cl, %eax
cmpb %al, %dil
setb %al
ret
*/
}
int main() {
uint8_t v = 0;
printf("All should be 0: %i %i %i\n", foo(&v), foo1(&v), bar(v));
return 0;
}
If your int is 16-bit long, you're running into undefined behavior and either result is "OK".
Shifting N-bit integers by N or more bit positions left or right results in undefined behavior.
Since this happens with 32-bit ints, this is a bug in the compiler.
Here are some more data points:
basically, it looks like gcc optimizes (even in when the -O flag is off and -g is on):
[variable] < (type-cast)(1 << [variable2])
to
((type-cast)[variable] >> [variable2]) == 0
and
[variable] >= (type-cast)(1 << [variable2])
to
((type-cast)[variable] >> [variable2]) != 0
where [variable] needs to be an array access.
I guess the advantage here is that it doesn't have to load the literal 1 into a register, which saves 1 register.
So here are the data points:
changing 1 to a number > 1 forces it to implement the correct version.
changing any of the variables to a literal forces it to implement the correct version
changing [variable] to a non array access forces it to implement the correct version
[variable] > (type-cast)(1 << [variable2]) implements the correct version.
I suspect this is all trying to save a register. When [variable] is an array access, it needs to also keep an index. Someone probably thought this is so clever, until it's wrong.
Using code from the bug report http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56051
#include <stdio.h>
int main(void)
{
int a, s = 8;
unsigned char data[1] = {0};
a = data[0] < (unsigned char) (1 << s);
printf("%d\n", a);
return 0;
}
compiled with gcc -O2 -S
.globl main
.type main, #function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $8, %esp
pushl $1 ***** seems it already precomputed the result to be 1
pushl $.LC0
pushl $1
call __printf_chk
xorl %eax, %eax
movl -4(%ebp), %ecx
leave
leal -4(%ecx), %esp
ret
compile with just gcc -S
.globl main
.type main, #function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %ecx
subl $16, %esp
movl $8, -12(%ebp)
movb $0, -17(%ebp)
movb -17(%ebp), %dl
movl -12(%ebp), %eax
movb %dl, %bl
movb %al, %cl
shrb %cl, %bl ****** (unsigned char)data[0] >> s => %bl
movb %bl, %al %bl => %al
testb %al, %al %al = 0?
sete %dl
movl $0, %eax
movb %dl, %al
movl %eax, -16(%ebp)
movl $.LC0, %eax
subl $8, %esp
pushl -16(%ebp)
pushl %eax
call printf
addl $16, %esp
movl $0, %eax
leal -8(%ebp), %esp
addl $0, %esp
popl %ecx
popl %ebx
popl %ebp
leal -4(%ecx), %esp
ret
I guess the next step is to dig through gcc's source code.
I'm pretty sure we're talking undefined behaviour here - converting a "large" integer to a smaller, of a value that doesn't fit in the size of the new value, is undefined as far as I know. 131072 definitely doesn't fit in a uint_8.
Although looking at the code generated, I'd say that it's probably not quite right, since it does "sete" rather than "setb"??? That does seem very suspicios to me.
If I turn the expression around:
return (T)(1<<k[i]) > data[0];
then it uses a "seta" instruction, which is what I'd expect. I'll do a bit more digging - but something seems a bit wrong.
Inside a large loop, I currently have a statement similar to
if (ptr == NULL || ptr->calculate() > 5)
{do something}
where ptr is an object pointer set before the loop and never changed.
I would like to avoid comparing ptr to NULL in every iteration of the loop. (The current final program does that, right?) A simple solution would be to write the loop code once for (ptr == NULL) and once for (ptr != NULL). But this would increase the amount of code making it more difficult to maintain, plus it looks silly if the same large loop appears twice with only one or two lines changed.
What can I do? Use dynamically-valued constants maybe and hope the compiler is smart? How?
Many thanks!
EDIT by Luther Blissett. The OP wants to know if there is a better way to remove the pointer check here:
loop {
A;
if (ptr==0 || ptr->calculate()>5) B;
C;
}
than duplicating the loop as shown here:
if (ptr==0)
loop {
A;
B;
C;
}
else loop {
A;
if (ptr->calculate()>5) B;
C;
}
I just wanted to inform you, that apparently GCC can do this requested hoisting in the optimizer. Here's a model loop (in C):
struct C
{
int (*calculate)();
};
void sideeffect1();
void sideeffect2();
void sideeffect3();
void foo(struct C *ptr)
{
int i;
for (i=0;i<1000;i++)
{
sideeffect1();
if (ptr == 0 || ptr->calculate()>5) sideeffect2();
sideeffect3();
}
}
Compiling this with gcc 4.5 and -O3 gives:
.globl foo
.type foo, #function
foo:
.LFB0:
pushq %rbp
.LCFI0:
movq %rdi, %rbp
pushq %rbx
.LCFI1:
subq $8, %rsp
.LCFI2:
testq %rdi, %rdi # ptr==0? -> .L2, see below
je .L2
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L4:
xorl %eax, %eax
call sideeffect1 # sideeffect1
xorl %eax, %eax
call *0(%rbp) # call p->calculate, no check for ptr==0
cmpl $5, %eax
jle .L3
xorl %eax, %eax
call sideeffect2 # ok, call sideeffect2
.L3:
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L4
addq $8, %rsp
.LCFI3:
xorl %eax, %eax
popq %rbx
.LCFI4:
popq %rbp
.LCFI5:
ret
.L2: # here's the loop with ptr==0
.LCFI6:
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L6:
xorl %eax, %eax
call sideeffect1 # does not try to call ptr->calculate() anymore
xorl %eax, %eax
call sideeffect2
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L6
addq $8, %rsp
.LCFI7:
xorl %eax, %eax
popq %rbx
.LCFI8:
popq %rbp
.LCFI9:
ret
And so does clang 2.7 (-O3):
foo:
.Leh_func_begin1:
pushq %rbp
.Llabel1:
movq %rsp, %rbp
.Llabel2:
pushq %r14
pushq %rbx
.Llabel3:
testq %rdi, %rdi # ptr==NULL -> .LBB1_5
je .LBB1_5
movq %rdi, %rbx
movl $1000, %r14d
.align 16, 0x90
.LBB1_2:
xorb %al, %al # here's the loop with the ptr->calculate check()
callq sideeffect1
xorb %al, %al
callq *(%rbx)
cmpl $6, %eax
jl .LBB1_4
xorb %al, %al
callq sideeffect2
.LBB1_4:
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_2
jmp .LBB1_7
.LBB1_5:
movl $1000, %r14d
.align 16, 0x90
.LBB1_6:
xorb %al, %al # and here's the loop for the ptr==NULL case
callq sideeffect1
xorb %al, %al
callq sideeffect2
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_6
.LBB1_7:
popq %rbx
popq %r14
popq %rbp
ret
In C++, although completely overkill you can put the loop in a function and use a template. This will generate twice the body of the function, but eliminate the extra check which will be optimized out. While I certainly don't recommend it, here is the code:
template<bool ptr_is_null>
void loop() {
for(int i = x; i != y; ++i) {
/**/
if(ptr_is_null || ptr->calculate() > 5) {
/**/
}
/**/
}
}
You call it with:
if (ptr==NULL) loop<true>(); else loop<false>();
You are better off without this "optimization", the compiler will probably do the RightThing(TM) for you.
Why do you want to avoid comparing to NULL?
Creating a variant for each of the NULL and non-NULL cases just gives you almost twice as much code to write, test and more importantly maintain.
A 'large loop' smells like an opportunity to refactor the loop into separate functions, in order to make the code easier to maintain. Then you can easily have two variants of the loop, one for ptr == null and one for ptr != null, calling different functions, with just a rough similarity in the overall structure of the loop.
Since
ptr is an object pointer set before the loop and never changed
can't you just check if it is null before the loop and not check again... since you don't change it.
If it is not valid for your pointer to be NULL, you could use a reference instead.
If it is valid for your pointer to be NULL, but if so then you skip all processing, then you could either wrap your code with one check at the beginning, or return early from your function:
if (ptr != NULL)
{
// your function
}
or
if (ptr == NULL) { return; }
If it is valid for your pointer to be NULL, but only some processing is skipped, then keep it like it is.
if (ptr == NULL || ptr->calculate() > 5)
{do something}
I would simply think in terms of what is done if the condition is true.
If "do something" is really the exact same stuff for (ptr == NULL) or (ptr->calculate() > 5), then I hardly see a reason to split up anything.
If "do something" contains particular cases for either condition, then I would consider to refactor into separate loops to get rid of extra special case checking. Depends on the special cases involved.
Eliminating code duplication is good up to a point. You should not care too much about optimizing until your program does what it should do and until performance becomes a problem.
[...] Premature optimization is the root of all evil
http://en.wikipedia.org/wiki/Program_optimization