Could someone please tell me whether such a construct is valid (i.e., not UB) in C++? I get segfaults because of it and have spent a couple of days trying to figure out what is going on.
// Synthetic example
#include <iostream>

int main(int argc, char** argv)
{
    int array[2] = {99, 99};
    /*
    The point is here. Is it legal? Does it have defined behaviour?
    Will it increment first and then access the element, or vice versa?
    */
    std::cout << array[argc += 7]; // Use argc just to avoid some optimisations
}
So, of course, I did some analysis; both GCC (5/7) and Clang (3.8) generate the same code: first the add, then the access.
Clang (3.8): clang++ -O3 -S test.cpp
leal 7(%rdi), %ebx
movl .L_ZZ4mainE5array+28(,%rax,4), %esi
movl $_ZSt4cout, %edi
callq _ZNSolsEi
movl $.L.str, %esi
movl $1, %edx
movq %rax, %rdi
callq _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
GCC (5/7): g++-7 -O3 -S test.cpp
leal 7(%rdi), %ebx
movl $_ZSt4cout, %edi
subq $16, %rsp
.cfi_def_cfa_offset 32
movq %fs:40, %rax
movq %rax, 8(%rsp)
xorl %eax, %eax
movabsq $425201762403, %rax
movq %rax, (%rsp)
movslq %ebx, %rax
movl (%rsp,%rax,4), %esi
call _ZNSolsEi
movl $.LC0, %esi
movq %rax, %rdi
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl %ebx, %esi
So, can I assume such behaviour is standard?
By itself, array[argc += 7] is OK: the result of argc + 7 will be used as the index into array.
However, in your example array has just 2 elements and argc is never negative, so your code will always result in UB due to an out-of-bounds array access.
In the case of a[i += N], the expression i += N will always be evaluated before the element is accessed (as the sketch below demonstrates). But the example you provided invokes UB, as your array contains only two elements and you are therefore accessing out of bounds.
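To see the sequencing in isolation, here is a minimal sketch with an array large enough that the access stays in bounds (the array size and values are chosen purely for illustration):

#include <iostream>

int main() {
    int i = 0;
    int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    // i += 7 is evaluated first; its result (7) is used as the subscript
    std::cout << a[i += 7] << '\n'; // prints 7
    std::cout << i << '\n';         // prints 7: i was updated before the access
}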
Your case is clearly undefined behaviour, since you will exceed array bounds for the following reasons:
First, the expression array[argc += 7] is equivalent to *((array) + (argc += 7)), and the values of the operands are evaluated before + is evaluated. The operator += is an assignment (not merely a side effect), and the value of the assignment is the value of argc (in this case) after the addition. Hence, the += 7 takes effect for the subscripting.
Second, argc is defined in C++ to never be negative, so argc += 7 will always be >= 7 (or a signed integer overflow in very unrealistic scenarios, but still UB then).
Hence, UB.
It's normal behaviour. The name of an array decays to a pointer to its first element, and array[n] is the same as *(array + n).
So I get that pre-increment can be faster than post-increment, as no copy of the value is made. But let's say I have this:
const char* temp = "abc"; // const: a string literal may not bind to a plain char* in modern C++
char c = 0;
Now if I want to assign 'a' to c and increment temp so that it then points to 'b', I would do it like this:
c = *temp++;
But pre-increment should be faster, so I thought:
c = *temp;
++temp;
But it turns out *temp++ is faster, according to my measurements.
Now I don't quite get why and how, so if someone is willing to enlighten me, please do.
First, pre-increment is only potentially faster for the reason you state. But for basic types like pointers, in practice that's not the case, because the compiler can generate optimized code. See Is there a performance difference between i++ and ++i in C++? for more details.
Second, to a decent optimizing compiler, there's no difference between the two alternatives you mentioned. It's very likely that the exact same machine code will be generated for both cases (unless you disable optimizations).
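For reference, a minimal compilable version of the two variants might look like this (a sketch: it assumes temp and c are file-scope variables, which matches the temp(%rip)/c(%rip) addressing in the listings below):

// The two variants under comparison; temp and c are globals here.
const char* temp = "abc";
char c = 0;

void post_variant() { c = *temp++; }       // post-increment form
void pre_variant()  { c = *temp; ++temp; } // pre-increment form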
To illustrate this: on my compiler, the following code is generated when optimizations are disabled:
# c = *temp++;
movq temp(%rip), %rax
leaq 1(%rax), %rdx
movq %rdx, temp(%rip)
movzbl (%rax), %eax
movb %al, c(%rip)
# c = *temp;
movq temp(%rip), %rax
movzbl (%rax), %eax
movb %al, c(%rip)
# ++temp;
movq temp(%rip), %rax
addq $1, %rax
movq %rax, temp(%rip)
Notice there's an additional movq instruction in the latter, which could account for the slower run time.
When enabling optimizations, however, this turns into:
# c = *temp++;
movq temp(%rip), %rax
leaq 1(%rax), %rdx
movq %rdx, temp(%rip)
movzbl (%rax), %eax
movb %al, c(%rip)
# c = *temp;
# ++temp;
movq temp(%rip), %rax
movzbl (%rax), %edx
addq $1, %rax
movq %rax, temp(%rip)
movb %dl, c(%rip)
Other than a different instruction order, and the choice of addq vs. leaq for the increment, there's no real difference between the two. If you do get (measurably) different performance, that's likely due to the specific CPU architecture (e.g., a more optimal use of the pipeline).
If a variable is never read from, it is obviously optimized out. However, the only store to that variable is the result of the only read of another variable, so this second variable should also be optimized out. Why is this not being done?
#include <sys/time.h>

void foo(); // defined elsewhere

int main() {
    timeval a, b, c;
    // First and only logical use of a
    gettimeofday(&a, NULL);
    // Junk function
    foo();
    // First and only logical use of b
    gettimeofday(&b, NULL);
    // This gets optimized out, as c is never read from.
    c.tv_sec = a.tv_sec - b.tv_sec;
    //std::cout << c;
}
Assembly (gcc 4.8.2 with -O3):
subq $40, %rsp
xorl %esi, %esi
movq %rsp, %rdi
call gettimeofday
call foo()
leaq 16(%rsp), %rdi
xorl %esi, %esi
call gettimeofday
xorl %eax, %eax
addq $40, %rsp
ret
subq $8, %rsp
Edit: The results are the same when using rand().
There's no store operation! There are two calls to gettimeofday, yes, but those are visible effects. And visible effects are precisely the things that may not be optimized away.
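Conversely, giving the result a visible effect forces the computation to stay. A minimal sketch (printing via printf is just one way to create a visible effect):

#include <cstdio>
#include <sys/time.h>

void foo(); // junk function, assumed defined elsewhere

int main() {
    timeval a, b, c;
    gettimeofday(&a, NULL);
    foo();
    gettimeofday(&b, NULL);
    c.tv_sec = a.tv_sec - b.tv_sec;
    // Printing c.tv_sec is a visible effect, so the subtraction (and the
    // stores it depends on) can no longer be eliminated.
    std::printf("%ld\n", (long)c.tv_sec);
}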
In the following code:
int count(){
static int n(5);
n = n + 1;
return n;
}
the variable n is instantiated only once, at the first call to the function.
There should be a flag or something so that it initializes the variable only once. I tried to look at the generated assembly code from gcc, but didn't get any clue.
How does the compiler handle this?
This is, of course, compiler-specific.
The reason you didn't see any checks in the generated assembly is that, since n is an int variable, g++ simply treats it as a global variable pre-initialized to 5.
Let's see what happens if we do the same with a std::string:
#include <string>
void count() {
static std::string str;
str += ' ';
}
The generated assembly goes like this:
_Z5countv:
.LFB544:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
.cfi_lsda 0x3,.LLSDA544
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
pushq %r13
pushq %r12
pushq %rbx
subq $8, %rsp
movl $_ZGVZ5countvE3str, %eax
movzbl (%rax), %eax
testb %al, %al
jne .L2 ; <======= bypass initialization
.cfi_offset 3, -40
.cfi_offset 12, -32
.cfi_offset 13, -24
movl $_ZGVZ5countvE3str, %edi
call __cxa_guard_acquire ; acquire the lock
testl %eax, %eax
setne %al
testb %al, %al
je .L2 ; check again
movl $0, %ebx
movl $_ZZ5countvE3str, %edi
.LEHB0:
call _ZNSsC1Ev ; call the constructor
.LEHE0:
movl $_ZGVZ5countvE3str, %edi
call __cxa_guard_release ; release the lock
movl $_ZNSsD1Ev, %eax
movl $__dso_handle, %edx
movl $_ZZ5countvE3str, %esi
movq %rax, %rdi
call __cxa_atexit ; schedule the destructor to be called at exit
jmp .L2
.L7:
.L3:
movl %edx, %r12d
movq %rax, %r13
testb %bl, %bl
jne .L5
.L4:
movl $_ZGVZ5countvE3str, %edi
call __cxa_guard_abort
.L5:
movq %r13, %rax
movslq %r12d,%rdx
movq %rax, %rdi
.LEHB1:
call _Unwind_Resume
.L2:
movl $32, %esi
movl $_ZZ5countvE3str, %edi
call _ZNSspLEc
.LEHE1:
addq $8, %rsp
popq %rbx
popq %r12
popq %r13
leave
ret
.cfi_endproc
The line I've marked with the bypass initialization comment is the conditional jump instruction that skips the construction if the guard variable indicates the object has already been constructed.
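In C++ terms, the generated code corresponds roughly to the following hand-written equivalent (a simplified sketch only: the real lowering uses __cxa_guard_acquire/__cxa_guard_release for thread safety and __cxa_atexit for destruction, as seen above):

#include <new>
#include <string>

void count() {
    static bool constructed = false; // the hidden guard flag (not thread-safe here)
    alignas(std::string) static unsigned char storage[sizeof(std::string)];
    if (!constructed) {              // the "bypass initialization" check
        new (storage) std::string(); // run the constructor exactly once
        constructed = true;
    }
    *reinterpret_cast<std::string*>(storage) += ' ';
}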
This is entirely up to the implementation; the language standard says nothing about that.
In practice, the compiler will usually include a hidden flag variable somewhere that indicates whether the static variable has already been instantiated. The static variable and the flag will probably be in the static storage area of the program (e.g., the data segment, not the stack segment), not in function-scope memory, so you may have to look around in the assembly. (The variable can't go on the call stack, for obvious reasons, so it's really like a global variable. "Static allocation" really covers all sorts of static variables!)
Update: As @aix points out, if the static variable is initialized with a constant expression, you may not even need a flag, because the initialization can be performed at load time rather than at the first function call. In C++11 you should be able to take advantage of that more often than in C++03, thanks to the wider availability of constant expressions.
It's quite likely that this variable will be handled just like an ordinary global variable by gcc. That means the initialization will be done statically, directly in the binary.
This is possible because you initialize it with a constant. If you initialized it, e.g., with another function's return value, the compiler would add a flag and skip the initialization based on that flag, as sketched below.
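A quick way to see the difference (a sketch; compute() is a hypothetical non-constant initializer):

int compute(); // hypothetical, defined elsewhere

int constant_init() {
    static int n = 5;          // constant-initialized directly in the binary; no flag
    return ++n;
}

int dynamic_init() {
    static int n = compute();  // needs a guard flag; initialized on the first call
    return ++n;
}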
Why does printf("hello world") end up using more CPU instructions in the assembled code (not counting the standard library code) than cout << "hello world"?
For C++ we have:
movl $.LC0, %esi
movl $_ZSt4cout, %edi
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
For C:
movl $.LC0, %eax
movq %rax, %rdi
movl $0, %eax
call printf
What are line 2 of the C++ code and lines 2 and 3 of the C code for?
I'm using gcc version 4.5.2
For 64-bit gcc -O3 (4.5.0) on Linux x86_64, this reads, for cout << "Hello World":
movl $11, %edx ; String length in EDX
movl $.LC0, %esi ; String pointer in ESI
movl $_ZSt4cout, %edi ; load virtual table entry of "cout" for "ostream"
call _ZSt16__ostream_insertIcSt11char_traits...basic_ostreamIT_T0_ES6_PKS3_l
and, for printf("Hello World")
movl $.LC0, %edi ; String pointer to EDI
xorl %eax, %eax ; clear EAX (maybe flag for printf=>no stack arguments)
call printf
which means your sequence depends entirely on the specific compiler implementation, its version, and probably compiler options.
Your edit states you use gcc 4.5.2 (which is fairly new).
It seems 4.5.2 introduces additional 64-bit register fiddling into this sequence for whatever reason. It saves the 64-bit RAX to RDI before zeroing it out, which makes absolutely no sense (at least to me).
Much more interesting: the 3-argument call sequence (g++ -O1 -S source.cpp):
void c_proc()
{
printf("%s %s %s", "Hello", "World", "!") ;
}
void cpp_proc()
{
std::cout << "Hello " << "World " << "!";
}
leads to (c_proc):
movl $.LC0, %ecx
movl $.LC1, %edx
movl $.LC2, %esi
movl $.LC3, %edi
movl $0, %eax
call printf
with .LCx being the strings, and no stack pointer involved!
For cpp_proc:
movl $6, %edx
movl $.LC4, %esi
movl $_ZSt4cout, %edi
call _ZSt16__ostream_insertIcSt11char_traits...basic_ostreamIT_T0_ES6_PKS3_l
movl $6, %edx
movl $.LC5, %esi
movl $_ZSt4cout, %edi
call _ZSt16__ostream_insertIcSt11char_traits...basic_ostreamIT_T0_ES6_PKS3_l
movl $1, %edx
movl $.LC0, %esi
movl $_ZSt4cout, %edi
call _ZSt16__ostream_insertIcSt11char_traits...basic_ostreamIT_T0_ES6_PKS3_l
You see now what this is all about.
The caller code is most of the time irrelevant to performance.
I guess line 2 of the C++ code stores the address of std::cout as the implicit 'this' argument of the operator<< method.
And I might be wrong on the C part, but it seems to me that it is incomplete: the upper 32-bit part of rax is not initialized in this snippet; it might be initialized earlier. (No, I'm wrong here.)
From what I understand (I might be wrong), the problem with 64-bit registers is that most of the time they cannot be initialized with immediates, so you have to play with 32-bit operations to get the desired result. So the compiler plays with 32-bit registers to initialize the 64-bit rdi register.
And it seems that printf takes the value of al (the least significant byte of eax) as an input that tells printf() how many 128-bit xmm registers are used as inputs. It looks like an optimization to be able to pass arguments in the xmm registers, or some other funny business.
int printf(const char*, ...) is a variadic function that can take one or more arguments, whereas ostream& operator<<(ostream&, const char*) takes exactly two. I believe that accounts for the difference in the instructions needed to invoke them.
Line 2 in the C++ disassembly is where it passes the ostream& (in this case cout), so the function knows what stream object it is outputting to.
Since both end up making a function call, the comparison is largely irrelevant; the code executed within the call will be far more significant. operator<< is overloaded for a number of right-hand-side types and is resolved at compile time; printf(), on the other hand, must parse the format string at run time to determine the data types, so it may incur additional overhead (the contrast is sketched below). Either way, the amount of code executed within the functions will swamp the call overhead in terms of instructions executed, and will almost certainly be dominated by the OS code required to render the text on a display. In short, you are sweating the small stuff.
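A small sketch of that contrast (nothing here is specific to any one compiler):

#include <cstdio>
#include <iostream>

int main() {
    // operator<< is selected by overload resolution at compile time:
    std::cout << 42;     // the int overload
    std::cout << "text"; // the const char* overload
    // printf must parse "%d %s" at run time to discover the argument types:
    std::printf("%d %s", 42, "text");
}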
movl is move long, 32-bit move
movq is move quad, 64-bit move
printf has a return value (the number of characters written, or a negative value on failure), and that value is returned in %eax; that's all the extra line is concerned with.
Let's say we have two functions:
int f();
int g();
I want to get the sum of f() and g().
First way:
int fRes = f();
int gRes = g();
int sum = fRes + gRes;
Second way:
int sum = f() + g();
Will there be any difference in performance between these two cases?
The same question for complex types instead of ints.
EDIT
Do I understand correctly that I should not worry about performance in such cases (in every situation, including frequently performed tasks) and should use temporary variables to increase readability and simplify the code?
You can answer questions like this for yourself by compiling to assembly language (with optimization on, of course) and inspecting the output. If I flesh your example out into a complete, compilable program...
extern int f();
extern int g();
int direct()
{
return f() + g();
}
int indirect()
{
int F = f();
int G = g();
return F + G;
}
and compile it (g++ -S -O2 -fomit-frame-pointer -fno-exceptions test.cc; the last two switches eliminate a bunch of distractions from the output), I get this (further distractions deleted):
__Z8indirectv:
pushq %rbx
call __Z1fv
movl %eax, %ebx
call __Z1gv
addl %ebx, %eax
popq %rbx
ret
__Z6directv:
pushq %rbx
call __Z1fv
movl %eax, %ebx
call __Z1gv
addl %ebx, %eax
popq %rbx
ret
As you can see, the code generated for both functions is identical, so the answer to your question is no, there will be no performance difference. Now let's look at complex numbers -- same code, but s/int/std::complex<double>/g throughout and #include <complex> at the top; same compilation switches --
__Z8indirectv:
subq $72, %rsp
call __Z1fv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 48(%rsp)
movq 8(%rsp), %rax
movq %rax, 56(%rsp)
call __Z1gv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 32(%rsp)
movq 8(%rsp), %rax
movq %rax, 40(%rsp)
movsd 48(%rsp), %xmm0
addsd 32(%rsp), %xmm0
movsd 56(%rsp), %xmm1
addsd 40(%rsp), %xmm1
addq $72, %rsp
ret
__Z6directv:
subq $72, %rsp
call __Z1gv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 32(%rsp)
movq 8(%rsp), %rax
movq %rax, 40(%rsp)
call __Z1fv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 48(%rsp)
movq 8(%rsp), %rax
movq %rax, 56(%rsp)
movsd 48(%rsp), %xmm0
addsd 32(%rsp), %xmm0
movsd 56(%rsp), %xmm1
addsd 40(%rsp), %xmm1
addq $72, %rsp
ret
That's a lot more instructions, and the compiler isn't doing a perfect optimization job, it seems; but nonetheless the code generated for both functions is identical.
I think in the second way the result is assigned to a temporary variable when the function returns anyway. However, it becomes somewhat significant when you need to use the values from f() and g() more than once, in which case storing them in variables instead of recalculating them each time can help.
If you have optimization turned off, there likely will be. If you have it turned on, they will likely result in identical code. This is especially true if you declare fRes and gRes as const.
Because it's legal for the compiler to elide the call to the copy constructor, the two will not differ in performance even if fRes and gRes are complex types.
Someone mentioned using fRes and gRes more than once. And of course, not storing them is potentially less optimal, as you would then have to call f() or g() more than once.
As you wrote it, there's only a subtle difference (which another answer addresses: there's a sequence point in the one but not the other).
They would be different if you had done this instead:
int fRes;
int gRes;
fRes = f();
gRes = g();
int sum = fRes + gRes;
(Imagining that int is actually some other type with a non-trivial default constructor.)
In the case here, you invoke default constructors and then assignment operators, which is potentially more work.
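To make that extra work visible, here is a sketch with a hypothetical type that logs its special member functions:

#include <iostream>

struct Tracer {
    Tracer()                         { std::cout << "default ctor\n"; }
    Tracer(const Tracer&)            { std::cout << "copy ctor\n"; }
    Tracer& operator=(const Tracer&) { std::cout << "assignment\n"; return *this; }
};

Tracer f() { return Tracer(); }

int main() {
    Tracer a = f(); // initialization: at most a copy, usually elided entirely
    Tracer b;       // default ctor...
    b = f();        // ...followed by an assignment: potentially more work
}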
It depends entirely on what optimizations the compiler performs. The two could compile to slightly different or exactly the same machine code. Even if slightly different, you couldn't measure a statistically significant difference in time and space costs for these particular samples.
On my platform with full optimization turned on, a function returning the sum from both different cases compiled to exactly the same machine code.
The only minor difference between the two examples is that the first guarantees the order in which f() and g() are called, so in theory the second allows the compiler slightly more flexibility. Whether this ever makes a difference would depend on what f() and g() actually do and, perhaps, whether they can be inlined.
There is a slight difference between the two examples. In the expression f() + g() there is no sequence point, whereas when the calls are made in separate statements there are sequence points at the end of each statement.
The absence of a sequence point means the order in which these two functions are called is unspecified: they can be called in either order, which might help the compiler optimize.
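A short sketch of why the order can matter when f() and g() have side effects (the globals here are purely illustrative):

#include <iostream>

int x = 0;
int f() { x = 1; return 10; }
int g() { x = 2; return 20; }

int main() {
    int sum = f() + g(); // f and g may be called in either order
    // sum is always 30, but x may end up 1 or 2 depending on the call order.
    std::cout << sum << ' ' << x << '\n';
}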