Does const-correctness matter with short inline functions? - c++

I have written two short tests and compiled both with "g++ -S" (gcc version 4.7 on Arch Linux):
test1.cpp
inline int func(int a, int b) {
return a+b;
}
int main() {
int c = func(5,5);
return 0;
}
test2.cpp
inline int func(const int& a, const int& b) {
return a+b;
}
int main() {
int c = func(5,5);
return 0;
}
diff test1.s test2.s
1,5c1,5
< .file "test1.cpp"
< .section .text._Z4funcii,"axG",#progbits,_Z4funcii,comdat
< .weak _Z4funcii
< .type _Z4funcii, #function
< _Z4funcii:
---
> .file "test2.cpp"
> .section .text._Z4funcRKiS0_,"axG",#progbits,_Z4funcRKiS0_,comdat
> .weak _Z4funcRKiS0_
> .type _Z4funcRKiS0_, #function
> _Z4funcRKiS0_:
12a13,14
> movl 8(%ebp), %eax
> movl (%eax), %edx
14c16
< movl 8(%ebp), %edx
---
> movl (%eax), %eax
22c24
< .size _Z4funcii, .-_Z4funcii
---
> .size _Z4funcRKiS0_, .-_Z4funcRKiS0_
36,38c38,44
< movl $5, 4(%esp)
< movl $5, (%esp)
< call _Z4funcii
---
> movl $5, 20(%esp)
> movl $5, 24(%esp)
> leal 20(%esp), %eax
> movl %eax, 4(%esp)
> leal 24(%esp), %eax
> movl %eax, (%esp)
> call _Z4funcRKiS0_
However, I don't really know how to interpret the results. All I see is that test2 apparently generates longer code, but I can't really tell what the differences are.
A follow-up question: Does it matter with member functions?

const correctness is for humans, not for code generation
it makes it easier for humans to understand code, by introducing constraints on what can change
so, yes it matters for short functions also, since they can be called from code that is not as short and easy to comprehend in full at a glance
that said, your example functions do not illustrate const correctness
so, while this answers the literal question, probably you have misunderstood what const correctness is about, and meant to ask some other question, which i will refrain from guessing at

Those functions aren't equivalent, one takes argument by value and the other by reference. Note that if you drop the references & then you are left with exactly the same functions, except that one enforces that you don't change the value of the copied arguments while the other doesn't.

Related

Hidden parameter in C++ function call [duplicate]

The return value of a function is usually stored on the stack or in a register. But for a large structure, it has to be on the stack. How much copying has to happen in a real compiler for this code? Or is it optimized away?
For example:
struct Data {
unsigned values[256];
};
Data createData()
{
Data data;
// initialize data values...
return data;
}
(Assuming the function cannot be inlined..)
None; no copies are done.
The address of the caller's Data return value is actually passed as a hidden argument to the function, and the createData function simply writes into the caller's stack frame.
This is known as the named return value optimisation. Also see the c++ faq on this topic.
commercial-grade C++ compilers implement return-by-value in a way that lets them eliminate the overhead, at least in simple cases
...
When yourCode() calls rbv(), the compiler secretly passes a pointer to the location where rbv() is supposed to construct the "returned" object.
You can demonstrate that this has been done by adding a destructor with a printf to your struct. The destructor should only be called once if this return-by-value optimisation is in operation, otherwise twice.
Also you can check the assembly to see that this happens:
Data createData()
{
Data data;
// initialize data values...
data.values[5] = 6;
return data;
}
here's the assembly:
__Z10createDatav:
LFB2:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
subl $1032, %esp
LCFI2:
movl 8(%ebp), %eax
movl $6, 20(%eax)
leave
ret $4
LFE2:
Curiously, it allocated enough space on the stack for the data item subl $1032, %esp, but note that it takes the first argument on the stack 8(%ebp) as the base address of the object, and then initialises element 6 of that item. Since we didn't specify any arguments to createData, this is curious until you realise this is the secret hidden pointer to the parent's version of Data.
But for a large structure, it has to be on the heap stack.
Indeed so! A large structure declared as a local variable is allocated on the stack. Glad to have that cleared up.
As for avoiding copying, as others have noted:
Most calling conventions deal with "function returning struct" by passing an additional parameter that points the location in the caller's stack frame in which the struct should be placed. This is definitely a matter for the calling convention and not the language.
With this calling convention, it becomes possible for even a relatively simple compiler to notice when a code path is definitely going to return a struct, and for it to fix assignments to that struct's members so that they go directly into the caller's frame and don't have to be copied. The key is for the compiler to notice that all terminating code paths through the function return the same struct variable. If that's the case, the compiler can safely use the space in the caller's frame, eliminating the need for a copy at the point of return.
There are many examples given, but basically
This question does not have any definite answer. it will depend on the compiler.
C does not specify how large structs are returned from a function.
Here's some tests for one particular compiler, gcc 4.1.2 on x86 RHEL 5.4
gcc trivial case, no copying
[00:05:21 1 ~] $ gcc -O2 -S -c t.c
[00:05:23 1 ~] $ cat t.s
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
movl $1, 24(%eax)
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
gcc more realistic case , allocate on stack, memcpy to caller
#include <stdlib.h>
struct Data {
unsigned values[256];
};
struct Data createData()
{
struct Data data;
int i;
for(i = 0; i < 256 ; i++)
data.values[i] = rand();
return data;
}
[00:06:08 1 ~] $ gcc -O2 -S -c t.c
[00:06:10 1 ~] $ cat t.s
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl $1, %ebx
subl $1036, %esp
movl 8(%ebp), %edi
leal -1036(%ebp), %esi
.p2align 4,,7
.L2:
call rand
movl %eax, -4(%esi,%ebx,4)
addl $1, %ebx
cmpl $257, %ebx
jne .L2
movl %esi, 4(%esp)
movl %edi, (%esp)
movl $1024, 8(%esp)
call memcpy
addl $1036, %esp
movl %edi, %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
gcc 4.4.2### has grown a lot, and does not copy for the above non-trivial case.
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl $1, %ebx
subl $1036, %esp
movl 8(%ebp), %edi
leal -1036(%ebp), %esi
.p2align 4,,7
.L2:
call rand
movl %eax, -4(%esi,%ebx,4)
addl $1, %ebx
cmpl $257, %ebx
jne .L2
movl %esi, 4(%esp)
movl %edi, (%esp)
movl $1024, 8(%esp)
call memcpy
addl $1036, %esp
movl %edi, %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
In addition, VS2008 (compiled the above as C) will reserve struct Data on the stack of createData() and do a rep movsd loop to copy it back to the caller in Debug mode, in Release mode it will move the return value of rand() (%eax) directly back to the caller
typedef struct {
unsigned value[256];
} Data;
Data createData(void) {
Data r;
calcualte(&r);
return r;
}
Data d = createData();
msvc(6,8,9) and gcc mingw(3.4.5,4.4.0) will generate code like the following pseudocode
void createData(Data* r) {
calculate(&r)
}
Data d;
createData(&d);
gcc on linux will issue a memcpy() to copy the struct back on the stack of the caller. If the function has internal linkage, more optimizations become available though.

Performance of chained public member access compared to pointer

Since I couldn't find any question relating to chained member access, but only chained function access, I would like to ask a couple of questions about it.
I have the following situation:
for(int i = 0; i < largeNumber; ++i)
{
//do calculations with the same chained struct:
//myStruct1.myStruct2.myStruct3.myStruct4.member1
//myStruct1.myStruct2.myStruct3.myStruct4.member2
//etc.
}
It is obviously possible to break this down using a pointer:
MyStruct4* myStruct4_pt = &myStruct1.myStruct2.myStruct3.myStruct4;
for(int i = 0; i < largeNumber; ++i)
{
//do calculations with pointer:
//(*myStruct4_pt).member1
//(*myStruct4_pt).member2
//etc.
}
Is there a difference between member access (.) and a function access that, e.g., returns a pointer to a private variable?
Will/Can the first example be optimized by the compiler and does that strongly depend on the compiler?
If no optimizations are done during compilation time, will/can the CPU optimize the behaviour (e.g. keeping it in the L1 cache)?
Does a chained member access make a difference at all in terms of performance, since variables are "wildly reassigned" during compilation time anyway?
I would kindly ask to leave discussions out regarding readability and maintainability of code, as the chained access is, for my purposes, clearer.
Update:
Everything is running in a single thread.
This is a constant offset that you're modifying, a modern compiler will realize that.
But - don't trust me, lets ask a compiler (see here).
#include <stdio.h>
struct D { float _; int i; int j; };
struct C { double _; D d; };
struct B { char _; C c; };
struct A { int _; B b; };
int bar(int i);
int foo(int i);
void foo(A &a) {
for (int i = 0; i < 10; i++) {
a.b.c.d.i += bar(i);
a.b.c.d.j += foo(i);
}
}
Compiles to
foo(A&):
pushq %rbp
movq %rdi, %rbp
pushq %rbx
xorl %ebx, %ebx
subq $8, %rsp
.L3:
movl %ebx, %edi
call bar(int)
addl %eax, 28(%rbp)
movl %ebx, %edi
addl $1, %ebx
call foo(int)
addl %eax, 32(%rbp)
cmpl $10, %ebx
jne .L3
addq $8, %rsp
popq %rbx
popq %rbp
ret
As you see, the chaining has been translated to a single offset in both cases: 28(%rbp) and 32(%rbp).

Why does this loop produce "warning: iteration 3u invokes undefined behavior" and output more than 4 lines?

Compiling this:
#include <iostream>
int main()
{
for (int i = 0; i < 4; ++i)
std::cout << i*1000000000 << std::endl;
}
and gcc produces the following warning:
warning: iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
std::cout << i*1000000000 << std::endl;
^
I understand there is a signed integer overflow.
What I cannot get is why i value is broken by that overflow operation?
I've read the answers to Why does integer overflow on x86 with GCC cause an infinite loop?, but I'm still not clear on why this happens - I get that "undefined" means "anything can happen", but what's the underlying cause of this specific behavior?
Online: http://ideone.com/dMrRKR
Compiler: gcc (4.8)
Signed integer overflow (as strictly speaking, there is no such thing as "unsigned integer overflow") means undefined behaviour. And this means anything can happen, and discussing why does it happen under the rules of C++ doesn't make sense.
C++11 draft N3337: §5.4:1
If during the evaluation of an expression, the result is not mathematically defined or not in the range of
representable values for its type, the behavior is undefined. [ Note: most existing implementations of C++
ignore integer overflows. Treatment of division by zero, forming a remainder using a zero divisor, and all
floating point exceptions vary among machines, and is usually adjustable by a library function. —end note ]
Your code compiled with g++ -O3 emits warning (even without -Wall)
a.cpp: In function 'int main()':
a.cpp:11:18: warning: iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
std::cout << i*1000000000 << std::endl;
^
a.cpp:9:2: note: containing loop
for (int i = 0; i < 4; ++i)
^
The only way we can analyze what the program is doing, is by reading the generated assembly code.
Here is the full assembly listing:
.file "a.cpp"
.section .text$_ZNKSt5ctypeIcE8do_widenEc,"x"
.linkonce discard
.align 2
LCOLDB0:
LHOTB0:
.align 2
.p2align 4,,15
.globl __ZNKSt5ctypeIcE8do_widenEc
.def __ZNKSt5ctypeIcE8do_widenEc; .scl 2; .type 32; .endef
__ZNKSt5ctypeIcE8do_widenEc:
LFB860:
.cfi_startproc
movzbl 4(%esp), %eax
ret $4
.cfi_endproc
LFE860:
LCOLDE0:
LHOTE0:
.section .text.unlikely,"x"
LCOLDB1:
.text
LHOTB1:
.p2align 4,,15
.def ___tcf_0; .scl 3; .type 32; .endef
___tcf_0:
LFB1091:
.cfi_startproc
movl $__ZStL8__ioinit, %ecx
jmp __ZNSt8ios_base4InitD1Ev
.cfi_endproc
LFE1091:
.section .text.unlikely,"x"
LCOLDE1:
.text
LHOTE1:
.def ___main; .scl 2; .type 32; .endef
.section .text.unlikely,"x"
LCOLDB2:
.section .text.startup,"x"
LHOTB2:
.p2align 4,,15
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB1084:
.cfi_startproc
leal 4(%esp), %ecx
.cfi_def_cfa 1, 0
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
.cfi_escape 0x10,0x5,0x2,0x75,0
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
pushl %ecx
.cfi_escape 0xf,0x3,0x75,0x70,0x6
.cfi_escape 0x10,0x7,0x2,0x75,0x7c
.cfi_escape 0x10,0x6,0x2,0x75,0x78
.cfi_escape 0x10,0x3,0x2,0x75,0x74
xorl %edi, %edi
subl $24, %esp
call ___main
L4:
movl %edi, (%esp)
movl $__ZSt4cout, %ecx
call __ZNSolsEi
movl %eax, %esi
movl (%eax), %eax
subl $4, %esp
movl -12(%eax), %eax
movl 124(%esi,%eax), %ebx
testl %ebx, %ebx
je L15
cmpb $0, 28(%ebx)
je L5
movsbl 39(%ebx), %eax
L6:
movl %esi, %ecx
movl %eax, (%esp)
addl $1000000000, %edi
call __ZNSo3putEc
subl $4, %esp
movl %eax, %ecx
call __ZNSo5flushEv
jmp L4
.p2align 4,,10
L5:
movl %ebx, %ecx
call __ZNKSt5ctypeIcE13_M_widen_initEv
movl (%ebx), %eax
movl 24(%eax), %edx
movl $10, %eax
cmpl $__ZNKSt5ctypeIcE8do_widenEc, %edx
je L6
movl $10, (%esp)
movl %ebx, %ecx
call *%edx
movsbl %al, %eax
pushl %edx
jmp L6
L15:
call __ZSt16__throw_bad_castv
.cfi_endproc
LFE1084:
.section .text.unlikely,"x"
LCOLDE2:
.section .text.startup,"x"
LHOTE2:
.section .text.unlikely,"x"
LCOLDB3:
.section .text.startup,"x"
LHOTB3:
.p2align 4,,15
.def __GLOBAL__sub_I_main; .scl 3; .type 32; .endef
__GLOBAL__sub_I_main:
LFB1092:
.cfi_startproc
subl $28, %esp
.cfi_def_cfa_offset 32
movl $__ZStL8__ioinit, %ecx
call __ZNSt8ios_base4InitC1Ev
movl $___tcf_0, (%esp)
call _atexit
addl $28, %esp
.cfi_def_cfa_offset 4
ret
.cfi_endproc
LFE1092:
.section .text.unlikely,"x"
LCOLDE3:
.section .text.startup,"x"
LHOTE3:
.section .ctors,"w"
.align 4
.long __GLOBAL__sub_I_main
.lcomm __ZStL8__ioinit,1,1
.ident "GCC: (i686-posix-dwarf-rev1, Built by MinGW-W64 project) 4.9.0"
.def __ZNSt8ios_base4InitD1Ev; .scl 2; .type 32; .endef
.def __ZNSolsEi; .scl 2; .type 32; .endef
.def __ZNSo3putEc; .scl 2; .type 32; .endef
.def __ZNSo5flushEv; .scl 2; .type 32; .endef
.def __ZNKSt5ctypeIcE13_M_widen_initEv; .scl 2; .type 32; .endef
.def __ZSt16__throw_bad_castv; .scl 2; .type 32; .endef
.def __ZNSt8ios_base4InitC1Ev; .scl 2; .type 32; .endef
.def _atexit; .scl 2; .type 32; .endef
I can barely even read assembly, but even I can see the addl $1000000000, %edi line.
The resulting code looks more like
for(int i = 0; /* nothing, that is - infinite loop */; i += 1000000000)
std::cout << i << std::endl;
This comment of #T.C.:
I suspect that it's something like: (1) because every iteration with i of any value larger than 2 has undefined behavior -> (2) we can assume that i <= 2 for optimization purposes -> (3) the loop condition is always true -> (4) it's optimized away into an infinite loop.
gave me idea to compare the assembly code of the OP's code to the assembly code of the following code, with no undefined behaviour.
#include <iostream>
int main()
{
// changed the termination condition
for (int i = 0; i < 3; ++i)
std::cout << i*1000000000 << std::endl;
}
And, in fact, the correct code has termination condition.
; ...snip...
L6:
mov ecx, edi
mov DWORD PTR [esp], eax
add esi, 1000000000
call __ZNSo3putEc
sub esp, 4
mov ecx, eax
call __ZNSo5flushEv
cmp esi, -1294967296 // here it is
jne L7
lea esp, [ebp-16]
xor eax, eax
pop ecx
; ...snip...
Unfortunately this is the consequences of writing buggy code.
Fortunately you can make use of better diagnostics and better debugging tools - that's what they are for:
enable all warnings
-Wall is the gcc option that enables all useful warnings with no false positives. This is a bare minimum that you should always use.
gcc has many other warning options, however, they are not enabled with -Wall as they may warn on false positives
Visual C++ unfortunately is lagging behind with the ability to give useful warnings. At least the IDE enables some by default.
use debug flags for debugging
for integer overflow -ftrapv traps the program on overflow,
Clang compiler is excellent for this: -fcatch-undefined-behavior catches a lot of instances of undefined behaviour (note: "a lot of" != "all of them")
I have a spaghetti mess of a program not written by me that needs to be shipped tomorrow! HELP!!!!!!111oneone
Use gcc's -fwrapv
This option instructs the compiler to assume that signed arithmetic overflow of addition, subtraction and multiplication wraps around using twos-complement representation.
1 - this rule does not apply to "unsigned integer overflow", as §3.9.1.4 says that
Unsigned integers, declared unsigned, shall obey the laws of arithmetic modulo 2n where n is the number
of bits in the value representation of that particular size of integer.
and e.g. result of UINT_MAX + 1 is mathematically defined - by the rules of arithmetic modulo 2n
Short answer, gcc specifically has documented this problem, we can see that in the gcc 4.8 release notes which says (emphasis mine going forward):
GCC now uses a more aggressive analysis to derive an upper bound for
the number of iterations of loops using constraints imposed by
language standards. This may cause non-conforming programs to no
longer work as expected, such as SPEC CPU 2006 464.h264ref and
416.gamess. A new option, -fno-aggressive-loop-optimizations, was added to disable this aggressive analysis. In some loops that have
known constant number of iterations, but undefined behavior is known
to occur in the loop before reaching or during the last iteration, GCC
will warn about the undefined behavior in the loop instead of deriving
lower upper bound of the number of iterations for the loop. The
warning can be disabled with -Wno-aggressive-loop-optimizations.
and indeed if we use -fno-aggressive-loop-optimizations the infinite loop behavior should cease and it does in all the cases I have tested.
The long answer starts with knowing that signed integer overflow is undefined behavior by looking at the draft C++ standard section 5 Expressions paragraph 4 which says:
If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable values for
its type, the behavior is undefined. [ Note: most existing
implementations of C++ ignore integer overflows. Treatment of division
by zero, forming a remainder using a zero divisor, and all floating
point exceptions vary among machines, and is usually adjustable by a
library function. —end note
We know that the standard says undefined behavior is unpredictable from the note that come with the definition which says:
[ Note: Undefined behavior may be expected when this International
Standard omits any explicit definition of behavior or when a program
uses an erroneous construct or erroneous data. Permissible undefined
behavior ranges from ignoring the situation completely with
unpredictable results, to behaving during translation or program
execution in a documented manner characteristic of the environment
(with or without the issuance of a diagnostic message), to terminating
a translation or execution (with the issuance of a diagnostic
message). Many erroneous program constructs do not engender undefined
behavior; they are required to be diagnosed. —end note ]
But what in the world can the gcc optimizer be doing to turn this into an infinite loop? It sounds completely wacky. But thankfully gcc gives us a clue to figuring it out in the warning:
warning: iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
std::cout << i*1000000000 << std::endl;
^
The clue is the Waggressive-loop-optimizations, what does that mean? Fortunately for us this is not the first time this optimization has broken code in this way and we are lucky because John Regehr has documented a case in the article GCC pre-4.8 Breaks Broken SPEC 2006 Benchmarks which shows the following code:
int d[16];
int SATD (void)
{
int satd = 0, dd, k;
for (dd=d[k=0]; k<16; dd=d[++k]) {
satd += (dd < 0 ? -dd : dd);
}
return satd;
}
the article says:
The undefined behavior is accessing d[16] just before exiting the
loop. In C99 it is legal to create a pointer to an element one
position past the end of the array, but that pointer must not be
dereferenced.
and later on says:
In detail, here is what’s going on. A C compiler, upon seeing d[++k],
is permitted to assume that the incremented value of k is within the
array bounds, since otherwise undefined behavior occurs. For the code
here, GCC can infer that k is in the range 0..15. A bit later, when
GCC sees k<16, it says to itself: “Aha– that expression is always
true, so we have an infinite loop.” The situation here, where the
compiler uses the assumption of well-definedness to infer a useful
dataflow fact,
So what the compiler must be doing in some cases is assuming since signed integer overflow is undefined behavior then i must always be less than 4 and thus we have an infinite loop.
He explains this is very similar to the infamous Linux kernel null pointer check removal where in seeing this code:
struct foo *s = ...;
int x = s->f;
if (!s) return ERROR;
gcc inferred that since s was deferenced in s->f; and since dereferencing a null pointer is undefined behavior then s must not be null and therefore optimizes away the if (!s) check on the next line.
The lesson here is that modern optimizers are very aggressive about exploiting undefined behavior and most likely will only get more aggressive. Clearly with just a few examples we can see the optimizer does things that seem completely unreasonable to a programmer but in retrospect from the optimizers perspective make sense.
tl;dr The code generates a test that integer + positive integer == negative integer. Usually the optimizer does not optimize this out, but in the specific case of std::endl being used next, the compiler does optimize this test out. I haven't figured out what's special about endl yet.
From the assembly code at -O1 and higher levels, it is clear that gcc refactors the loop to:
i = 0;
do {
cout << i << endl;
i += NUMBER;
}
while (i != NUMBER * 4)
The biggest value that works correctly is 715827882, i.e. floor(INT_MAX/3). The assembly snippet at -O1 is:
L4:
movsbl %al, %eax
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
addl $715827882, %esi
cmpl $-1431655768, %esi
jne L6
// fallthrough to "return" code
Note, the -1431655768 is 4 * 715827882 in 2's complement.
Hitting -O2 optimizes that to the following:
L4:
movsbl %al, %eax
addl $715827882, %esi
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
cmpl $-1431655768, %esi
jne L6
leal -8(%ebp), %esp
jne L6
// fallthrough to "return" code
So the optimization that has been made is merely that the addl was moved higher up.
If we recompile with 715827883 instead then the -O1 version is identical apart from the changed number and test value. However, -O2 then makes a change:
L4:
movsbl %al, %eax
addl $715827883, %esi
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
jmp L2
Where there was cmpl $-1431655764, %esi at -O1, that line has been removed for -O2. The optimizer must have decided that adding 715827883 to %esi can never equal -1431655764.
This is pretty puzzling. Adding that to INT_MIN+1 does generate the expected result, so the optimizer must have decided that %esi can never be INT_MIN+1 and I'm not sure why it would decide that.
In the working example it seems it'd be equally valid to conclude that adding 715827882 to a number cannot equal INT_MIN + 715827882 - 2 ! (this is only possible if wraparound does actually occur), yet it does not optimize the line out in that example.
The code I was using is:
#include <iostream>
#include <cstdio>
int main()
{
for (int i = 0; i < 4; ++i)
{
//volatile int j = i*715827883;
volatile int j = i*715827882;
printf("%d\n", j);
std::endl(std::cout);
}
}
If the std::endl(std::cout) is removed then the optimization no longer occurs. In fact replacing it with std::cout.put('\n'); std::flush(std::cout); also causes the optimization to not happen, even though std::endl is inlined.
The inlining of std::endl seems to affect the earlier part of the loop structure (which I don't quite understand what it is doing but I'll post it here in case someone else does):
With original code and -O2:
L2:
movl %esi, 28(%esp)
movl 28(%esp), %eax
movl $LC0, (%esp)
movl %eax, 4(%esp)
call _printf
movl __ZSt4cout, %eax
movl -12(%eax), %eax
movl __ZSt4cout+124(%eax), %ebx
testl %ebx, %ebx
je L10
cmpb $0, 28(%ebx)
je L3
movzbl 39(%ebx), %eax
L4:
movsbl %al, %eax
addl $715827883, %esi
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
jmp L2 // no test
With mymanual inlining of std::endl, -O2:
L3:
movl %ebx, 28(%esp)
movl 28(%esp), %eax
addl $715827883, %ebx
movl $LC0, (%esp)
movl %eax, 4(%esp)
call _printf
movl $10, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl $__ZSt4cout, (%esp)
call __ZNSo5flushEv
cmpl $-1431655764, %ebx
jne L3
xorl %eax, %eax
One difference between these two is that %esi is used in the original , and %ebx in the second version; is there any difference in semantics defined between %esi and %ebx in general? (I don't know much about x86 assembly).
Another example of this error being reported in gcc is when you have a loop that executes for a constant number of iterations, but you are using the counter variable as an index into an array that has less than that number of items, such as:
int a[50], x;
for( i=0; i < 1000; i++) x = a[i];
The compiler can determine that this loop will try to access memory outside of the array 'a'. The compiler complains about this with this rather cryptic message:
iteration xxu invokes undefined behavior [-Werror=aggressive-loop-optimizations]
What I cannot get is why i value is broken by that overflow operation?
It seems that integer overflow occurs in 4th iteration (for i = 3).
signed integer overflow invokes undefined behavior. In this case nothing can be predicted. The loop may iterate only 4 times or it may go to infinite or anything else!
Result may vary compiler to compiler or even for different versions of same compiler.
C11: 1.3.24 undefined behavior:
behavior for which this International Standard imposes no requirements
[ Note: Undefined behavior may be expected when this International Standard omits any explicit definition of behavior or when a program uses an erroneous construct or erroneous data. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message). Many erroneous program constructs do not engender undefined behavior; they are required to be diagnosed.
—end note ]

Inline functions mechanism

I know that an inline function does not use the stack for copying the parameters but it just replaces the body of the function wherever it is called.
Consider these two functions:
inline void add(int a) {
a++;
} // does nothing, a won't be changed
inline void add(int &a) {
a++;
} // changes the value of a
If the stack is not used for sending the parameters, how does the compiler know if a variable will be modified or not? What does the code looks like after replacing the calls of these two functions?
What makes you think there is a stack ? And even if there is, what makes you think it would be use for passing parameters ?
You have to understand that there are two levels of reasoning:
the language level: where the semantics of what should happen are defined
the machine level: where said semantics, encoded into CPU instructions, are carried out
At the language level, if you pass a parameter by non-const reference it might be modified by the function. The language level knows not what this mysterious "stack" is. Note: the inline keyword has little to no effect on whether a function call is inlined, it just says that the definition is in-line.
At machine level... there are many ways to achieve this. When making a function call, you have to obey a calling convention. This convention defines how the function parameters (and return types) are exchanged between caller and callee and who among them is responsible for saving/restoring the CPU registers. In general, because it is so low-level, this convention changes on a per CPU family basis.
For example, on x86, a couple parameters will be passed directly in CPU registers (if they fit) whilst remaining parameters (if any) will be passed on the stack.
I have checked what at least GCC does with it if you force it to inline the methods:
inline static void add1(int a) __attribute__((always_inline));
void add1(int a) {
a++;
} // does nothing, a won't be changed
inline static void add2(int &a) __attribute__((always_inline));
void add2(int &a) {
a++;
} // changes the value of a
int main() {
label1:
int b = 0;
add1(b);
label2:
int a = 0;
add2(a);
return 0;
}
The assembly output for this looks like:
.file "test.cpp"
.text
.globl main
.type main, #function
main:
.LFB2:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
subl $16, %esp
.L2:
movl $0, -4(%ebp)
movl -4(%ebp), %eax
movl %eax, -8(%ebp)
addl $1, -8(%ebp)
.L3:
movl $0, -12(%ebp)
movl -12(%ebp), %eax
addl $1, %eax
movl %eax, -12(%ebp)
movl $0, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE2:
Interestingly even the first call of add1() that effectively does nothing as a result outside of the function call, isn't optimized out.
If the stack is not used for sending the parameters, how does the
compiler know if a variable will be modified or not?
As Matthieu M. already pointed out the language construction itself knows nothing about stack.You specify inline keyword to the function just to give a compiler a hint and express a wish that you would prefer this routine to be inlined. If this happens depends completely on the compiler.
The compiler tries to predict what the advantages of this process given particular circumstances might be. If the compiler decides that inlining the function will make the code slower, or unacceptably larger, it will not inline it. Or, if it simply cannot because of a syntactical dependency, such as other code using a function pointer for callbacks, or exporting the function externally as in a dynamic/static code library.
What does the code looks like after replacing the calls of these two
functions?
At he moment none of this function is being inlined when compiled with
g++ -finline-functions -S main.cpp
and you can see it because in disassembly of main
void add1(int a) {
a++;
}
void add2(int &a) {
a++;
}
inline void add3(int a) {
a++;
} // does nothing, a won't be changed
inline void add4(int &a) {
a++;
} // changes the value of a
inline int f() { return 43; }
int main(int argc, char** argv) {
int a = 31;
add1(a);
add2(a);
add3(a);
add4(a);
return 0;
}
we see a call to each routine being made:
main:
.LFB8:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $31, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, %edi
call _Z4add1i // function call
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Z4add2Ri // function call
movl -4(%rbp), %eax
movl %eax, %edi
call _Z4add3i // function call
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Z4add4Ri // function call
movl $0, %eax
leave
ret
.cfi_endproc
compiling with -O1 will remove all functions from program at all because they do nothing.
However addition of
__attribute__((always_inline))
allows us to see what happens when code is inlined:
void add1(int a) {
a++;
}
void add2(int &a) {
a++;
}
inline static void add3(int a) __attribute__((always_inline));
inline void add3(int a) {
a++;
} // does nothing, a won't be changed
inline static void add4(int& a) __attribute__((always_inline));
inline void add4(int &a) {
a++;
} // changes the value of a
int main(int argc, char** argv) {
int a = 31;
add1(a);
add2(a);
add3(a);
add4(a);
return 0;
}
now: g++ -finline-functions -S main.cpp results with:
main:
.LFB9:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $31, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, %edi
call _Z4add1i // function call
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Z4add2Ri // function call
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
addl $1, -8(%rbp) // addition is here, there is no call
movl -4(%rbp), %eax
addl $1, %eax // addition is here, no call again
movl %eax, -4(%rbp)
movl $0, %eax
leave
ret
.cfi_endproc
The inline keyword has two key effects. One effect is that it is a hint to the implementation that "inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism." This usage is a hint, not a mandate, because "an implementation is not required to perform this inline substitution at the point of call".
The other principal effect is how it modifies the one definition rule. Per the ODR, a program must contain exactly one definition of any given non-inline function that is odr-used in the program. That doesn't quite work with an inline function because "An inline function shall be defined in every translation unit in which it is odr-used ...". Use the same inline function in one hundred different translation units and the linker will be confronted with one hundred definitions of the function. This isn't a problem because those multiple implementations of the same function "... shall have exactly the same definition in every case." One way to look at this: There still is only one definition; it just looks like there are a whole bunch to the linker.
Note: All quoted material are from section 7.1.2 of the C++11 standard.

Slow XOR operator

EDIT: Indeed, I had a weird error in my timing code leading to these results. When I fixed my error, the smart version ended up faster as expected. My timing code looked like this:
bool x = false;
before = now();
for (int i=0; i<N; ++i) {
x ^= smart_xor(A[i],B[i]);
}
after = now();
I had done the ^= to discourage my compiler from optimizing the for-loop away. But I think that the ^= somehow interacts strangely with the two xor functions. I changed my timing code to simply fill out an array of the xor results, and then do computation with that array outside of the timed code. And that fixed things.
Should I delete this question?
END EDIT
I defined two C++ functions as follows:
bool smart_xor(bool a, bool b) {
return a^b;
}
bool dumb_xor(bool a, bool b) {
return a?!b:b;
}
My timing tests indicate that dumb_xor() is slightly faster (1.31ns vs 1.90ns when inlined, 1.92ns vs 2.21ns when not inlined). This puzzles me, as the ^ operator should be a single machine operation. I'm wondering if anyone has an explanation.
The assembly looks like this (when not inlined):
.file "xor.cpp"
.text
.p2align 4,,15
.globl _Z9smart_xorbb
.type _Z9smart_xorbb, #function
_Z9smart_xorbb:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl %esi, %eax
xorl %edi, %eax
ret
.cfi_endproc
.LFE0:
.size _Z9smart_xorbb, .-_Z9smart_xorbb
.p2align 4,,15
.globl _Z8dumb_xorbb
.type _Z8dumb_xorbb, #function
_Z8dumb_xorbb:
.LFB1:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl %esi, %edx
movl %esi, %eax
xorl $1, %edx
testb %dil, %dil
cmovne %edx, %eax
ret
.cfi_endproc
.LFE1:
.size _Z8dumb_xorbb, .-_Z8dumb_xorbb
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
I'm using g++ 4.4.3-4ubuntu5 on an Intel Xeon X5570. I compiled with -O3.
I don't think you benchmarked your code correctly.
We can see in the generated assembly that your smart_xor function is:
movl %esi, %eax
xorl %edi, %eax
while your dumb_xor function is:
movl %esi, %edx
movl %esi, %eax
xorl $1, %edx
testb %dil, %dil
cmovne %edx, %eax
So obviously, the first one will be faster.
If not, then you have benchmarking issues.
So you may want to tune your benchmarking code... And remember you'll need to run a lot of calls to have a good and meaningful average.
Given that your "dumb XOR" code is significantly longer (and most instructions are dependent on a previous one, so it won't run in parallel), I suspect that you have some sort of measurement error in your results.
The compiler will need to produce two instructions for the out-of-line version of "smart XOR" because the registers that the data comes in as is not the register to give the return result in, so the data has to move from EDI and ESI to EAX. In an inline version, the code should be able to use whatever register the data is in before the call, and if the code allows it, result stays in the register it came in as.
Calling a function is out-of-line is probably at least as long in execution time as the actual code in the function.
It would help if you showes your test-harness that you use for benchmarking too...