Why does printf("hello world") end up using more CPU instructions in the assembled code (not counting the standard library itself) than cout << "hello world"?
For C++ we have:
movl $.LC0, %esi
movl $_ZSt4cout, %edi
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
For C:
movl $.LC0, %eax
movq %rax, %rdi
movl $0, %eax
call printf
What are line 2 of the C++ code and lines 2 and 3 of the C code for?
I'm using gcc version 4.5.2
For 64-bit gcc -O3 (4.5.0) on Linux x86_64, cout << "Hello World" compiles to:
movl $11, %edx ; String length in EDX
movl $.LC0, %esi ; String pointer in ESI
movl $_ZSt4cout, %edi ; address of the "cout" object in EDI
call _ZSt16__ostream_insertIcSt11char_traits...basic_ostreamIT_T0_ES6_PKS3_l
and, for printf("Hello World"):
movl $.LC0, %edi ; String pointer to EDI
xorl %eax, %eax ; clear EAX (AL = number of vector registers used by this variadic call)
call printf
which means the sequence you get depends entirely on the specific
compiler implementation, its version, and probably the compiler options.
Your edit states you use gcc 4.5.2 (which is fairly new).
It seems 4.5.2 introduces additional 64-bit register fiddling in
this sequence for whatever reason: it saves the 64-bit RAX to RDI
before zeroing it out, which makes absolutely no sense (at least to me).
Much more interesting is the three-argument call sequence (g++ -O1 -S source.cpp):
void c_proc()
{
printf("%s %s %s", "Hello", "World", "!") ;
}
void cpp_proc()
{
std::cout << "Hello " << "World " << "!";
}
leads to (c_proc):
movl $.LC0, %ecx
movl $.LC1, %edx
movl $.LC2, %esi
movl $.LC3, %edi
movl $0, %eax
call printf
with .LCx being the strings, and no stack pointer involved!
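(For reference, and assuming the x86-64 System V calling convention: the first six integer or pointer arguments travel in rdi, rsi, rdx, rcx, r8 and r9; only from the seventh onward does the stack come into play. A hypothetical sketch:)
const char *a = "1", *b = "2", *c = "3", *d = "4", *e = "5", *f = "6";
printf("%s %s %s %s %s %s\n", a, b, c, d, e, f);
/* format string in rdi, a..e in rsi..r9; f is the seventh
   integer argument and is the first to be passed on the stack */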
For cpp_proc:
movl $6, %edx
movl $.LC4, %esi
movl $_ZSt4cout, %edi
call _ZSt16__ostream_insertIcSt11char_traits...basic_ostreamIT_T0_ES6_PKS3_l
movl $6, %edx
movl $.LC5, %esi
movl $_ZSt4cout, %edi
call _ZSt16__ostream_insertIcSt11char_traits...basic_ostreamIT_T0_ES6_PKS3_l
movl $1, %edx
movl $.LC0, %esi
movl $_ZSt4cout, %edi
call _ZSt16__ostream_insertIcSt11char_traits...basic_ostreamIT_T0_ES6_PKS3_l
You see now what this is all about.
Regards
rbo
Most of the time, the caller-side code is irrelevant to performance.
I guess line 2 of the C++ code passes the address of std::cout as the stream argument of operator<<.
And I might be wrong on the C part, but it seems to me that it is incomplete: the upper 32-bit half of rax is not initialized in this snippet, so it might be initialized earlier. (No, I'm wrong here: a 32-bit mov zeroes the upper half.)
From what I understand (I might be wrong), the problem with 64-bit registers is that most instructions cannot take 64-bit immediates, so you have to play with 32-bit operations to get the desired result. That is why the compiler uses 32-bit moves to initialize the 64-bit rdi register.
And it seems that printf takes the value of al (the low byte of eax) as an input that tells printf() how many of the 128-bit xmm registers are used to pass arguments. It looks like an optimization so that a variadic function knows whether it has to spill the xmm argument registers at all.
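A minimal illustration of that convention, assuming the x86-64 System V ABI:
printf("%d\n", 42);   /* no floating-point arguments: the compiler typically
                         emits xorl %eax, %eax (AL = 0) before the call */
printf("%f\n", 3.5);  /* one double passed in xmm0: the compiler typically
                         emits movl $1, %eax (AL = 1) before the call */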
int printf(const char*, ...) is a variadic function that can take one or more arguments, whereas ostream& operator<<(ostream&, const char*) takes exactly two. I believe that accounts for the difference in instructions needed to invoke them.
Line 2 in the C++ disassembly is where it passes the ostream& (in this case cout), so the function knows what stream object it is outputting to.
Since both end up making a function call, the comparison is largely irrelevant; the code executed within the function call will be far more significant.
The operator<< is overloaded for a number of right-hand-side types and is resolved at compile time; printf() on the other hand must parse the format string at runtime to determine the data type, so it may incur additional overhead.
Either way, the amount of code executed within the functions will swamp the call overhead in terms of instructions executed, and will almost certainly be dominated by the OS code required to render the text on a graphical display. So in short you are sweating the small stuff.
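The contrast is easy to see in a two-line sketch:
std::cout << 42;   // the int overload of operator<< is selected at compile time
printf("%d", 42);  // printf must scan "%d" at run time to learn the type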
movl is move long, 32-bit move
movq is move quad, 64-bit move
printf has a return value (the number of characters written, or a negative value on failure), and that value comes back in %eax. The zeroing of %eax before the call is something else, though: in the x86-64 ABI, AL tells a variadic function how many vector registers carry arguments, and here that number is zero.
Could someone please tell me whether or not such a construct is valid (i.e. not UB) in C++? I have some segfaults because of it and spent a couple of days trying to figure out what is going on.
// Synthetic example
int main(int argc, char** argv)
{
int array[2] = {99, 99};
/*
The point is here. Is it legal? Does it have defined behaviour?
Will it increment first and then access the element, or vice versa?
*/
std::cout << array[argc += 7]; // Use argc just to avoid some optimisations
}
So, of course, I did some analysis: both GCC (5/7) and Clang (3.8) generate the same code. First the add, then the access.
Clang(3.8): clang++ -O3 -S test.cpp
leal 7(%rdi), %ebx
movl .L_ZZ4mainE5array+28(,%rax,4), %esi
movl $_ZSt4cout, %edi
callq _ZNSolsEi
movl $.L.str, %esi
movl $1, %edx
movq %rax, %rdi
callq _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
GCC(5/7) g++-7 -O3 -S test.cpp
leal 7(%rdi), %ebx
movl $_ZSt4cout, %edi
subq $16, %rsp
.cfi_def_cfa_offset 32
movq %fs:40, %rax
movq %rax, 8(%rsp)
xorl %eax, %eax
movabsq $425201762403, %rax
movq %rax, (%rsp)
movslq %ebx, %rax
movl (%rsp,%rax,4), %esi
call _ZNSolsEi
movl $.LC0, %esi
movq %rax, %rdi
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl %ebx, %esi
So, can I assume such behaviour is standard?
By itself array[argc += 7] is OK; the result of argc += 7 will be used as the index into array.
However, in your example array has just 2 elements, and argc is never negative, so your code will always result in UB due to an out-of-bounds array access.
In the case of a[i += N], the expression i += N will always be fully evaluated before the element is accessed. But the example you provided invokes UB, since your array contains only two elements and you are thus accessing out of bounds of the array.
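A variant of the same pattern without the out-of-bounds access (sizes and values are illustrative):
int i = 0;
int a[10] = {0};
a[i += 7] = 1;   // i += 7 is fully evaluated first, so this writes a[7]
// afterwards: i == 7 and a[7] == 1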
Your case is clearly undefined behaviour, since you will exceed array bounds for the following reasons:
First, the expression array[argc += 7] is equivalent to *((array) + (argc += 7)), and the values of the operands are evaluated before the + itself is evaluated (cf. here). Operator += is an assignment (not merely a side effect), and the value of an assignment is the value of argc (in this case) after the assignment (cf. here). Hence, the += 7 takes effect for the subscripting.
Second, argc is defined in C++ to be never negative (cf. here), so argc += 7 will always be >= 7 (or a signed integer overflow in very unrealistic scenarios, but that is still UB).
Hence, UB.
It's normal behavior. The name of an array decays to a pointer to the array's first element, and array[n] is the same as *(array + n).
If a variable is never read from, it is obviously optimized out. However, the only store to that variable is the result of the only reads of two other variables. So those variables should also be optimized out. Why is this not being done?
int main() {
timeval a,b,c;
// First and only logical use of a
gettimeofday(&a,NULL);
// Junk function
foo();
// First and only logical use of b
gettimeofday(&b,NULL);
// This gets optimized out as c is never read from.
c.tv_sec = a.tv_sec - b.tv_sec;
//std::cout << c;
}
Assembly (gcc 4.8.2 with -O3):
subq $40, %rsp
xorl %esi, %esi
movq %rsp, %rdi
call gettimeofday
call foo()
leaq 16(%rsp), %rdi
xorl %esi, %esi
call gettimeofday
xorl %eax, %eax
addq $40, %rsp
ret
subq $8, %rsp
Edit: the results are the same when using rand().
There's no store operation! There are two calls to gettimeofday, yes, but those have visible effects, and visible effects are precisely the things that may not be optimized away. The store to c, by contrast, feeds nothing observable, so it is dead and removed.
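Conversely, restoring the commented-out read of c (or adding any other observable use of it) forces the subtraction to survive; a sketch:
c.tv_sec = a.tv_sec - b.tv_sec;
printf("%ld\n", (long)c.tv_sec);  // c is now observable, so the store stays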
While playing about with optimisation settings, I noticed an interesting phenomenon: functions taking a variable number of arguments (...) never seemed to get inlined. (Obviously this behavior is compiler-specific, but I've tested on a couple of different systems.)
For example, compiling the following small program:
#include <stdarg.h>
#include <stdio.h>
static inline void test(const char *format, ...)
{
va_list ap;
va_start(ap, format);
vprintf(format, ap);
va_end(ap);
}
int main()
{
test("Hello %s\n", "world");
return 0;
}
will seemingly always result in a (possibly mangled) test symbol appearing in the resulting executable (tested with Clang and GCC in both C and C++ modes on MacOS and Linux). If one modifies the signature of test() to take a plain string which is passed to printf(), the function is inlined from -O1 upwards by both compilers as you'd expect.
I suspect this is to do with the voodoo magic used to implement varargs, but how exactly this is usually done is a mystery to me. Can anybody enlighten me as to how compilers typically implement vararg functions, and why this seemingly prevents inlining?
At least on x86-64, the passing of varargs is quite complex (due to arguments being passed in registers). Other architectures may not be quite so complex, but it is rarely trivial. In particular, a stack frame or frame pointer to refer to when fetching each argument may be required. These sorts of rules may well stop the compiler from inlining the function.
The code for x86-64 includes spilling all the integer argument registers, plus the 8 SSE registers, onto the stack.
This is the function from the original code compiled with Clang:
test: # #test
subq $200, %rsp
testb %al, %al
je .LBB1_2
# BB#1: # %entry
movaps %xmm0, 48(%rsp)
movaps %xmm1, 64(%rsp)
movaps %xmm2, 80(%rsp)
movaps %xmm3, 96(%rsp)
movaps %xmm4, 112(%rsp)
movaps %xmm5, 128(%rsp)
movaps %xmm6, 144(%rsp)
movaps %xmm7, 160(%rsp)
.LBB1_2: # %entry
movq %r9, 40(%rsp)
movq %r8, 32(%rsp)
movq %rcx, 24(%rsp)
movq %rdx, 16(%rsp)
movq %rsi, 8(%rsp)
leaq (%rsp), %rax
movq %rax, 192(%rsp)
leaq 208(%rsp), %rax
movq %rax, 184(%rsp)
movl $48, 180(%rsp)
movl $8, 176(%rsp)
movq stdout(%rip), %rdi
leaq 176(%rsp), %rdx
movl $.L.str, %esi
callq vfprintf
addq $200, %rsp
retq
and from gcc:
test.constprop.0:
.cfi_startproc
subq $216, %rsp
.cfi_def_cfa_offset 224
testb %al, %al
movq %rsi, 40(%rsp)
movq %rdx, 48(%rsp)
movq %rcx, 56(%rsp)
movq %r8, 64(%rsp)
movq %r9, 72(%rsp)
je .L2
movaps %xmm0, 80(%rsp)
movaps %xmm1, 96(%rsp)
movaps %xmm2, 112(%rsp)
movaps %xmm3, 128(%rsp)
movaps %xmm4, 144(%rsp)
movaps %xmm5, 160(%rsp)
movaps %xmm6, 176(%rsp)
movaps %xmm7, 192(%rsp)
.L2:
leaq 224(%rsp), %rax
leaq 8(%rsp), %rdx
movl $.LC0, %esi
movq stdout(%rip), %rdi
movq %rax, 16(%rsp)
leaq 32(%rsp), %rax
movl $8, 8(%rsp)
movl $48, 12(%rsp)
movq %rax, 24(%rsp)
call vfprintf
addq $216, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
In clang for x86, it is much simpler:
test: # #test
subl $28, %esp
leal 36(%esp), %eax
movl %eax, 24(%esp)
movl stdout, %ecx
movl %eax, 8(%esp)
movl %ecx, (%esp)
movl $.L.str, 4(%esp)
calll vfprintf
addl $28, %esp
retl
There's nothing really stopping any of the above code from being inlined as such, so it would appear to be simply a policy decision on the compiler writers' part. Of course, for a call to something like printf, it's pretty meaningless to optimise away a call/return pair at the cost of the code expansion; after all, printf is NOT a small, short function.
(A decent part of my work for most of the past year has been to implement printf in an OpenCL environment, so I know far more than most people will ever even look up about format specifiers and various other tricky parts of printf)
Edit: The OpenCL compiler we use WILL inline calls to var_args functions, so it is possible to implement such a thing. It won't do it for calls to printf, because it bloats the code very much, but by default, our compiler inlines EVERYTHING, all the time, no matter what it is... And it does work, but we found that having 2-3 copies of printf in the code makes it REALLY huge (with all sorts of other drawbacks, including final code generation taking a lot longer due to some bad choices of algorithms in the compiler backend), so we had to add code to STOP the compiler doing that...
Variable-argument implementations generally follow this algorithm: take the first address on the stack after the format string; then, while parsing the format string, interpret the value at the current position as the indicated data type, increment the parsing pointer by the size of that type, advance in the format string, and repeat.
Some values automatically get converted (i.e. promoted) to "larger" types (this is more or less implementation-dependent), such as char or short being promoted to int, and float to double.
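For example:
char ch = 'A';
float f = 0.5f;
printf("%d %f\n", ch, f);  /* ch arrives as an int, f as a double,
                              due to the default argument promotions */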
Certainly, you do not need a format string, but in this case you need to know the type of the arguments passed in (such as: all ints, or all doubles, or the first 3 ints, then 3 more doubles ..).
So this is the short theory.
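A minimal sketch of that scheme, using a count argument in place of a format string (the function name is illustrative):
#include <stdarg.h>

static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, count);            /* cursor just past the last fixed argument */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);   /* read an int, advance the cursor */
    va_end(ap);
    return total;
}

/* sum_ints(3, 1, 2, 3) == 6 */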
Now, to the practice: as the comment from n.m. above shows, gcc does not inline functions that handle variable arguments. Presumably the fairly complex operations involved in handling the variable arguments would bloat the code to a suboptimal size, so it is simply not worth inlining these functions.
EDIT:
After doing a quick test with VS2012 I don't seem to be able to convince the compiler to inline the function with the variable arguments.
Regardless of the combination of flags in the "Optimization" tab of the project, there is always a call to test and there is always a test method. And indeed:
http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx
says that
Even with __forceinline, the compiler cannot inline code in all circumstances. The compiler cannot inline a function if:
...
The function has a variable argument list.
The point of inlining is that it reduces function call overhead.
But for varargs, there is very little to be gained in general.
Consider this code in the body of that function:
if (blah)
{
printf("%d", va_arg(vl, int));
}
else
{
printf("%s", va_arg(vl, char *));
}
How is the compiler supposed to inline it? Doing that requires the compiler to push everything on the stack in the correct order anyway, even though there isn't any function being called. The only thing that's optimized away is a call/ret instruction pair (and maybe pushing/popping ebp and whatnot). The memory manipulations cannot be optimized away, and the parameters cannot be passed in registers. So it's unlikely that you'll gain anything notable by inlining varargs.
I do not expect that it would ever be possible to inline a varargs function, except in the most trivial case.
A varargs function that had no arguments, or that did not access any of its arguments, or that accessed only the fixed arguments preceding the variable ones could be inlined by rewriting it as an equivalent function that did not use varargs. This is the trivial case.
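A sketch of such a trivially rewritable function (the name is hypothetical):
#include <stdio.h>

/* The variadic arguments are never accessed, so this is equivalent
   to a non-variadic function and could in principle be inlined: */
static void log_tag(const char *tag, ...)
{
    fputs(tag, stderr);
}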
A varargs function that accesses its variadic arguments does so by executing code generated by the va_start and va_arg macros, which rely on the arguments being laid out in memory in some way. A compiler that performed inlining simply to remove the overhead of a function call would still need to create the data structure to support those macros. A compiler that attempted to remove all the machinery of function call would have to analyse and optimise away those macros as well. And it would still fail if the variadic function made a call to another function passing va_list as an argument.
I do not see a feasible path for this second case.
Hi, I have a question on possible stack optimization by gcc (or g++).
Sample code under FreeBSD (does the UNIX variant matter here?):
int main() {
char bing[100];
..
string buffer = ....;
..
}
What I found in gdb for a coredump of this program is that the address of bing is actually lower than that of buffer (namely, &bing[0] < &buffer). I think this is completely contrary to what the textbook says. Could there be some compiler optimization that re-organizes the stack layout in such a way? That seems to be the only possible explanation, but I'm not sure.
In case you're interested, the coredump is due to a buffer overflow from bing into buffer (which also confirms &bing[0] < &buffer).
Thanks!
Compilers are free to organise stack frames (assuming they even use stacks) any way they wish.
They may do it for alignment reasons, or for performance reasons, or for no reason at all. You would be unwise to assume any specific order.
If you hadn't invoked undefined behavior by overflowing the buffer, you probably never would have known, and that's the way it should be.
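If you are curious, you can observe (but must never rely on) the layout your compiler chose; a minimal sketch:
#include <cstdio>
#include <string>

int main()
{
    char bing[100];
    std::string buffer;
    // print both addresses; their relative order is an implementation detail
    std::printf("%p %p\n", static_cast<void *>(bing),
                static_cast<void *>(&buffer));
}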
A compiler can not only re-organise your variables, it can optimise them out of existence if it can establish they're not used. With the code:
#include <stdio.h>
int main (void) {
char bing[71];
int x = 7;
bing[0] = 11;
return 0;
}
Compare the normal assembler output:
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $80, %esp
movl %gs:20, %eax
movl %eax, 76(%esp)
xorl %eax, %eax
movl $7, (%esp)
movb $11, 5(%esp)
movl $0, %eax
movl 76(%esp), %edx
xorl %gs:20, %edx
je .L3
call __stack_chk_fail
.L3:
leave
ret
with the insanely optimised:
main:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
popl %ebp
ret
Notice anything missing from the latter? Yes, there are no stack manipulations to create space for either bing or x. They don't exist. In fact, the entire code sequence boils down to:
set return code to 0.
return.
A compiler is free to layout local variables on the stack (or keep them in register or do something else with them) however it sees fit: the C and C++ language standards don't say anything about these implementation details, and neither does POSIX or UNIX. I doubt that your textbook told you otherwise, and if it did, I would look for a new textbook.
This is not academic code or a hypothetical question. The original problem was converting code from HP11 to HP1123 Itanium. Basically it boils down to a compile error on HP1123 Itanium. It had me really scratching my head when I reproduced it on Windows for study. I have stripped all but the most basic aspects... You may have to press Ctrl-D to exit the console window if you run it as is:
#include "stdafx.h"
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
char blah[6];
const int IAMCONST = 3;
int *pTOCONST;
pTOCONST = (int *) &IAMCONST;
(*pTOCONST) = 7;
printf("IAMCONST %d \n",IAMCONST);
printf("WHATISPOINTEDAT %d \n",(*pTOCONST));
printf("Address of IAMCONST %x pTOCONST %x\n",&IAMCONST, (pTOCONST));
cin >> blah;
return 0;
}
Here is the output
IAMCONST 3
WHATISPOINTEDAT 7
Address of IAMCONST 35f9f0 pTOCONST 35f9f0
All I can say is: what the heck? Is it undefined to do this? It is the most counterintuitive thing I have seen in such a simple example.
Update:
Indeed, after searching for a while, the menu Debug >> Windows >> Disassembly showed exactly the optimization described below.
printf("IAMCONST %d \n",IAMCONST);
0024360E mov esi,esp
00243610 push 3
00243612 push offset string "IAMCONST %d \n" (2458D0h)
00243617 call dword ptr [__imp__printf (248338h)]
0024361D add esp,8
00243620 cmp esi,esp
00243622 call #ILT+325(__RTC_CheckEsp) (24114Ah)
Thank you all!
Looks like the compiler is optimizing
printf("IAMCONST %d \n",IAMCONST);
into
printf("IAMCONST %d \n",3);
since you said that IAMCONST is a const int.
But since you're taking the address of IAMCONST, it has to actually be located on the stack somewhere, and the constness can't be enforced, so the memory at that location (*pTOCONST) is mutable after all.
In short: you cast away the constness; don't do that. Poor, defenseless C...
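The well-defined alternative is to start from an object that was never const:
int iam = 3;   /* not const, so writing through a pointer is fine */
int *p = &iam;
*p = 7;
printf("iam %d\n", iam);   /* reliably prints 7 */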
Addendum
Using GCC for x86, with -O0 (no optimizations), the generated assembly
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $36, %esp
movl $3, -12(%ebp)
leal -12(%ebp), %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
movl $7, (%eax)
movl -12(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC0, (%esp)
call printf
movl -8(%ebp), %eax
movl (%eax), %eax
movl %eax, 4(%esp)
movl $.LC1, (%esp)
call printf
copies from *(bp-12) on the stack to printf's arguments. However, using -O1 (as well as -Os, -O2, -O3, and other optimization levels),
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $20, %esp
movl $3, 4(%esp)
movl $.LC0, (%esp)
call printf
movl $7, 4(%esp)
movl $.LC1, (%esp)
call printf
you can clearly see that the constant 3 is used instead.
If you are using Visual Studio's CL.EXE, /Od disables optimization. This varies from compiler to compiler.
Be warned that the C specification allows the C compiler to assume that the target of any int * pointer never overlaps the memory location of a const int, so you really shouldn't be doing this at all if you want predictable behavior.
The constant value IAMCONST is being inlined into the printf call.
What you're doing is at best wrong and in all likelihood is undefined by the C++ standard. My guess is that the C++ standard leaves the compiler free to inline a const primitive which is local to a function declaration. The reason being that the value should not be able to change.
Then again, it's C++ where should and can are very different words.
You are lucky the compiler is doing the optimization. An alternative treatment would be to place the const integer into read-only memory, whereupon trying to modify the value would cause a core dump.
Writing to a const object through a cast that removes the const is undefined behavior - so at the point where you do this:
(*pTOCONST) = 7;
all bets are off.
From the C++ standard 7.1.5.1 (The cv-qualifiers):
Except that any class member declared mutable (7.1.1) can be modified, any attempt to modify a const object during its lifetime (3.8) results in undefined behavior.
Because of this, the compiler is free to assume that the value of IAMCONST will not change, so it can optimize away the access to the actual storage. In fact, if the address of the const object is never taken, the compiler may eliminate the storage for the object altogether.
Also note that (again in 7.1.5.1):
A variable of non-volatile const-qualified integral or enumeration type initialized by an integral constant expression can be used in integral constant expressions (5.19).
Which means IAMCONST can be used in compile-time constant expressions (i.e., to provide the value of an enumerator or the size of an array). What would it even mean to change that at runtime?
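For instance, both of the following require IAMCONST to be usable at compile time:
const int IAMCONST = 3;
int table[IAMCONST];     // array bound must be a constant expression
enum { E = IAMCONST };   // so must an enumerator's value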
It doesn't matter if the compiler optimizes or not. You asked for trouble and you're lucky you got the trouble yourself instead of waiting for customers to report it to you.
"All I can say is what the heck? Is it undefined to do this? It is the most counterintuitive thing I have seen for such a simple example."
If you really believe that then you need to switch to a language you can understand, or change professions. For the sake of yourself and your customers, stop using C or C++ or C#.
const int IAMCONST = 3;
You said it.
int *pTOCONST;
pTOCONST = (int *) &IAMCONST;
Guess why the compiler complained if you omitted your evil cast. The compiler might have been telling the truth before you lied to it.
"Does the evil cast get trumped by the evil compiler?"
No. The evil cast gets trumped by itself. Whether or not your compiler tried to tell you the truth, the compiler was not evil.