As is discussed here Which memory area is a const object in in C++?, a compiler will probably not allocate any storage for the constants when compiling the code, they may probable be embedded directly into the machine code. Then how does a compiler get a constant's address?
C++ code:
void f()
{
const int a = 99;
const int *p = &a;
printf("constant's value: %d\n", *p);
}
Whether a constant will be allocated any storage or not is completely dependent on the compiler. Compilers are allowed to perform optimizations as per the As-If rule, as long as the observable behavior of the program doesn't change the compiler may allocate storage for a or may not. Note that these optimizations are not demanded but allowed by the standard.
Obviously, when you take the address of this const the compiler will have to return you a address by which you can refer a so it will have to place a in memory or atleast pretend it does so.
All variables must be addressable. The compiler may optimize out constant variables, but in a case like your where you use the variable address it can't do it.
compiler might do many tricks that it is allowed to do up to the point when it will affect visible behaviour of the program (see as-if rule). So it might not allocate storage for const object const int a = 99; ,but in the case when you take an address of variable - it has to be allocated some storage or at least pretended to be and you will get memory address which allows you to refer toa.
code:
#include <cstdlib>
using namespace std;
#include <cstdio>
const int a = 98;
void f()
{
const int a = 99;
const int *p = &a;
printf("constant's value: %d\n", *p);
}
int main(int argc, char** argv)
{
int b=100;
f();
return 0;
}
gcc -S main.cpp:
.file "main.cpp"
.section .rodata
.LC0:
.string "constant's value: %d\n"
.text
.globl _Z1fv
.type _Z1fv, #function
_Z1fv:
.LFB4:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp
movl $99, -4(%rbp)
leaq -4(%rbp), %rax
movq %rax, -16(%rbp)
movq -16(%rbp), %rax
movl (%rax), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
leave
ret
.cfi_endproc
.LFE4:
.size _Z1fv, .-_Z1fv
.globl main
.type main, #function
main:
.LFB5:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $100, -4(%rbp)
call _Z1fv
movl $0, %eax
leave
ret
.cfi_endproc
.LFE5:
.size main, .-main
.section .rodata
.align 4
.type _ZL1a, #object
.size _ZL1a, 4
_ZL1a:
.long 98
.ident "GCC: (Ubuntu/Linaro 4.4.7-2ubuntu1) 4.4.7"
.section .note.GNU-stack,"",#progbits
so what we see is that the variable const int a = 99; is in fact built in machine code, do not reside in specific area of memory (there is no memory on the stack, the heap or the data segment allocated to it). Please correct me if I am wrong.
Related
The return value of a function is usually stored on the stack or in a register. But for a large structure, it has to be on the stack. How much copying has to happen in a real compiler for this code? Or is it optimized away?
For example:
struct Data {
unsigned values[256];
};
Data createData()
{
Data data;
// initialize data values...
return data;
}
(Assuming the function cannot be inlined..)
None; no copies are done.
The address of the caller's Data return value is actually passed as a hidden argument to the function, and the createData function simply writes into the caller's stack frame.
This is known as the named return value optimisation. Also see the c++ faq on this topic.
commercial-grade C++ compilers implement return-by-value in a way that lets them eliminate the overhead, at least in simple cases
...
When yourCode() calls rbv(), the compiler secretly passes a pointer to the location where rbv() is supposed to construct the "returned" object.
You can demonstrate that this has been done by adding a destructor with a printf to your struct. The destructor should only be called once if this return-by-value optimisation is in operation, otherwise twice.
Also you can check the assembly to see that this happens:
Data createData()
{
Data data;
// initialize data values...
data.values[5] = 6;
return data;
}
here's the assembly:
__Z10createDatav:
LFB2:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
subl $1032, %esp
LCFI2:
movl 8(%ebp), %eax
movl $6, 20(%eax)
leave
ret $4
LFE2:
Curiously, it allocated enough space on the stack for the data item subl $1032, %esp, but note that it takes the first argument on the stack 8(%ebp) as the base address of the object, and then initialises element 6 of that item. Since we didn't specify any arguments to createData, this is curious until you realise this is the secret hidden pointer to the parent's version of Data.
But for a large structure, it has to be on the heap stack.
Indeed so! A large structure declared as a local variable is allocated on the stack. Glad to have that cleared up.
As for avoiding copying, as others have noted:
Most calling conventions deal with "function returning struct" by passing an additional parameter that points the location in the caller's stack frame in which the struct should be placed. This is definitely a matter for the calling convention and not the language.
With this calling convention, it becomes possible for even a relatively simple compiler to notice when a code path is definitely going to return a struct, and for it to fix assignments to that struct's members so that they go directly into the caller's frame and don't have to be copied. The key is for the compiler to notice that all terminating code paths through the function return the same struct variable. If that's the case, the compiler can safely use the space in the caller's frame, eliminating the need for a copy at the point of return.
There are many examples given, but basically
This question does not have any definite answer. it will depend on the compiler.
C does not specify how large structs are returned from a function.
Here's some tests for one particular compiler, gcc 4.1.2 on x86 RHEL 5.4
gcc trivial case, no copying
[00:05:21 1 ~] $ gcc -O2 -S -c t.c
[00:05:23 1 ~] $ cat t.s
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
movl $1, 24(%eax)
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
gcc more realistic case , allocate on stack, memcpy to caller
#include <stdlib.h>
struct Data {
unsigned values[256];
};
struct Data createData()
{
struct Data data;
int i;
for(i = 0; i < 256 ; i++)
data.values[i] = rand();
return data;
}
[00:06:08 1 ~] $ gcc -O2 -S -c t.c
[00:06:10 1 ~] $ cat t.s
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl $1, %ebx
subl $1036, %esp
movl 8(%ebp), %edi
leal -1036(%ebp), %esi
.p2align 4,,7
.L2:
call rand
movl %eax, -4(%esi,%ebx,4)
addl $1, %ebx
cmpl $257, %ebx
jne .L2
movl %esi, 4(%esp)
movl %edi, (%esp)
movl $1024, 8(%esp)
call memcpy
addl $1036, %esp
movl %edi, %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
gcc 4.4.2### has grown a lot, and does not copy for the above non-trivial case.
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl $1, %ebx
subl $1036, %esp
movl 8(%ebp), %edi
leal -1036(%ebp), %esi
.p2align 4,,7
.L2:
call rand
movl %eax, -4(%esi,%ebx,4)
addl $1, %ebx
cmpl $257, %ebx
jne .L2
movl %esi, 4(%esp)
movl %edi, (%esp)
movl $1024, 8(%esp)
call memcpy
addl $1036, %esp
movl %edi, %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
In addition, VS2008 (compiled the above as C) will reserve struct Data on the stack of createData() and do a rep movsd loop to copy it back to the caller in Debug mode, in Release mode it will move the return value of rand() (%eax) directly back to the caller
typedef struct {
unsigned value[256];
} Data;
Data createData(void) {
Data r;
calcualte(&r);
return r;
}
Data d = createData();
msvc(6,8,9) and gcc mingw(3.4.5,4.4.0) will generate code like the following pseudocode
void createData(Data* r) {
calculate(&r)
}
Data d;
createData(&d);
gcc on linux will issue a memcpy() to copy the struct back on the stack of the caller. If the function has internal linkage, more optimizations become available though.
The C++ object model is such that it does not contain any table for non virtual member functions. When there is a call of such a function
a.my_function();
with name mangling it becomes something like
my_function__5AclassKd(&a)
The object contains only data members. There are no table for non virtual functions. So in such a circumstances how calling mechanism finds out which function to call?
What's going on under the hood?
Formally the standard doesn't require them to work in any specific way, but usually they work exactly like plain functions, but with an extra invisible parameter: a pointer to the object instance they're called on.
Of course a compiler might be able to optimize that, e.g. don't pass the pointer if the member function doesn't use this or any member variables or member functions requiring this.
The compiler's job is to lay out the data and code the program needs into memory addresses. Each non-virtual function - whether member or non-member - gets a fixed virtual memory address at which it can be called. Calling machine code then hardcodes an absolute (or with position independent code a calling-address-relative offset) address of the function to call.
For example, say your compiler is compiling a non-virtual member function that takes 20 bytes of machine code, and it's putting the executable code at virtual addresses from offset 0x1000 and has already generated 10 bytes of executable code for other functions, then it will start the code of this function at virtual address 0x100A. Code that wants to call the function then generates machine code for "call 0x100A" after pushing any function call arguments (including a this pointer to the object to be operated upon) onto the stack.
You can easily see all this happening:
~/dev > cat example.cc
#include <cstdio>
struct X
{
int f(int n) { return n + 3; }
};
int main()
{
X x;
printf("%d\n", x.f(7));
}
~/dev > g++ example.cc -S; c++filt < example.s
.file "example.cc"
.section .text._ZN1X1fEi,"axG",#progbits,X::f(int),comdat
.align 2
.weak X::f(int)
.type X::f(int), #function
X::f(int): // code to execute X::f(int) starts at label .LFB0
.LFB0: // when this assembly is covered to machine code
.cfi_startproc // it's given a virtual address
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movl %esi, -12(%rbp)
movl -12(%rbp), %eax
addl $3, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size X::f(int), .-X::f(int)
.section .rodata
.LC0:
.string "%d\n"
.text
.globl main
.type main, #function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
leaq -9(%rbp), %rax
movl $7, %esi
movq %rax, %rdi
call X::f(int) // call non-member member function
// machine code will hardcoded address
movl %eax, %esi
leaq .LC0(%rip), %rdi
movl $0, %eax
call printf#PLT
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L5
call __stack_chk_fail#PLT
.L5:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Ubuntu 7.2.0-8ubuntu3) 7.2.0"
.section .note.GNU-stack,"",#progbits
If you compile a program then look at the disassembly it'll usually show the actual virtual address offsets too.
With non-virtual functions, there is no need to determine at runtime which function to call; so the resulting machine code will typically look the same as a normal function call, just with an extra argument for this as indicated in your example. (Though it's not always identical - for example, I think MSVC compiling 32-bit programs, in at least some versions, passes this in the ECX register instead of on the stack as for usual function parameters.)
Thus, the determination of which function to call is made by the compiler at compile time. At that time, it has the information determined from parsing class declarations that it can use, for example to do method overload resolution, and from there to either calculate or look up the mangled name to put into assembly code.
EDIT: Indeed, I had a weird error in my timing code leading to these results. When I fixed my error, the smart version ended up faster as expected. My timing code looked like this:
bool x = false;
before = now();
for (int i=0; i<N; ++i) {
x ^= smart_xor(A[i],B[i]);
}
after = now();
I had done the ^= to discourage my compiler from optimizing the for-loop away. But I think that the ^= somehow interacts strangely with the two xor functions. I changed my timing code to simply fill out an array of the xor results, and then do computation with that array outside of the timed code. And that fixed things.
Should I delete this question?
END EDIT
I defined two C++ functions as follows:
bool smart_xor(bool a, bool b) {
return a^b;
}
bool dumb_xor(bool a, bool b) {
return a?!b:b;
}
My timing tests indicate that dumb_xor() is slightly faster (1.31ns vs 1.90ns when inlined, 1.92ns vs 2.21ns when not inlined). This puzzles me, as the ^ operator should be a single machine operation. I'm wondering if anyone has an explanation.
The assembly looks like this (when not inlined):
.file "xor.cpp"
.text
.p2align 4,,15
.globl _Z9smart_xorbb
.type _Z9smart_xorbb, #function
_Z9smart_xorbb:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl %esi, %eax
xorl %edi, %eax
ret
.cfi_endproc
.LFE0:
.size _Z9smart_xorbb, .-_Z9smart_xorbb
.p2align 4,,15
.globl _Z8dumb_xorbb
.type _Z8dumb_xorbb, #function
_Z8dumb_xorbb:
.LFB1:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl %esi, %edx
movl %esi, %eax
xorl $1, %edx
testb %dil, %dil
cmovne %edx, %eax
ret
.cfi_endproc
.LFE1:
.size _Z8dumb_xorbb, .-_Z8dumb_xorbb
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
I'm using g++ 4.4.3-4ubuntu5 on an Intel Xeon X5570. I compiled with -O3.
I don't think you benchmarked your code correctly.
We can see in the generated assembly that your smart_xor function is:
movl %esi, %eax
xorl %edi, %eax
while your dumb_xor function is:
movl %esi, %edx
movl %esi, %eax
xorl $1, %edx
testb %dil, %dil
cmovne %edx, %eax
So obviously, the first one will be faster.
If not, then you have benchmarking issues.
So you may want to tune your benchmarking code... And remember you'll need to run a lot of calls to have a good and meaningful average.
Given that your "dumb XOR" code is significantly longer (and most instructions are dependent on a previous one, so it won't run in parallel), I suspect that you have some sort of measurement error in your results.
The compiler will need to produce two instructions for the out-of-line version of "smart XOR" because the registers that the data comes in as is not the register to give the return result in, so the data has to move from EDI and ESI to EAX. In an inline version, the code should be able to use whatever register the data is in before the call, and if the code allows it, result stays in the register it came in as.
Calling a function is out-of-line is probably at least as long in execution time as the actual code in the function.
It would help if you showes your test-harness that you use for benchmarking too...
I know that the postfix versions of the increment/decrement operators will generally be optimised by the compiler for built-in types (i.e. no copy will be made), but is this the case for iterators?
They are essentially just overloaded operators, and could be implemented in any number of ways, but since their behaviour is strictly defined, can they be optimised, and if so, are they by any/many compilers?
#include <vector>
void foo(std::vector<int>& v){
for (std::vector<int>::iterator i = v.begin();
i!=v.end();
i++){ //will this get optimised by the compiler?
*i += 20;
}
}
In the specific case of std::vector on GNU GCC's STL implementation (version 4.6.1), I don't think there would be a performance difference on sufficiently high optimization levels.
The implementation for forward iterators on vector is provided by __gnu_cxx::__normal_iterator<typename _Iterator, typename _Container>. Let's look at its constructor and postfix ++ operator:
explicit
__normal_iterator(const _Iterator& __i) : _M_current(__i) { }
__normal_iterator
operator++(int)
{ return __normal_iterator(_M_current++); }
And its instantiation in vector:
typedef __gnu_cxx::__normal_iterator<pointer, vector> iterator;
As you can see, it internally performs a postfix increment on an ordinary pointer, then passes the original value through its own constructor, which saves it to a local member. This code should be trivial to eliminate through dead value analysis.
But is it optimized really? Let's find out. Test code:
#include <vector>
void test_prefix(std::vector<int>::iterator &it)
{
++it;
}
void test_postfix(std::vector<int>::iterator &it)
{
it++;
}
Output assembly (on -Os):
.file "test.cpp"
.text
.globl _Z11test_prefixRN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEE
.type _Z11test_prefixRN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEE, #function
_Z11test_prefixRN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEE:
.LFB442:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
movl 8(%ebp), %eax
addl $4, (%eax)
popl %ebp
.cfi_def_cfa 4, 4
.cfi_restore 5
ret
.cfi_endproc
.LFE442:
.size _Z11test_prefixRN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEE, .-_Z11test_prefixRN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEE
.globl _Z12test_postfixRN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEE
.type _Z12test_postfixRN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEE, #function
_Z12test_postfixRN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEE:
.LFB443:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
movl 8(%ebp), %eax
addl $4, (%eax)
popl %ebp
.cfi_def_cfa 4, 4
.cfi_restore 5
ret
.cfi_endproc
.LFE443:
.size _Z12test_postfixRN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEE, .-_Z12test_postfixRN9__gnu_cxx17__normal_iteratorIPiSt6vectorIiSaIiEEEE
.ident "GCC: (Debian 4.6.0-10) 4.6.1 20110526 (prerelease)"
.section .note.GNU-stack,"",#progbits
As you can see, exactly the same assembly is output in both cases.
Of course, this may not necessarily be the case for custom iterators, or more complex data types. But it appears that, for vector specifically, prefix and postfix (without capturing the postfix return value) have identical performance.
I guessed it but was still surprised to see that the output of these two programs, written in C and C++, when compiled were very different.
That makes me think that the concept of objects still exist at even the lowest level.
Does this add overhead? If so is it currently an impossible optimization to convert object oriented code to procedural style or just very hard to do?
helloworld.c
#include <stdio.h>
int main(void) {
printf("Hello World!\n");
return 0;
}
helloworld.cpp
#include <iostream>
int main() {
std::cout << "Hello World!" << std::endl;
return 0;
}
Compiled like this:
gcc helloworld.cpp -o hwcpp.S -S -O2
gcc helloworld.c -o hwc.S -S -O2
Produced this code:
C assembly
.file "helloworld.c"
.section .rodata.str1.1,"aMS",#progbits,1
.LC0:
.string "Hello World!\n"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl $.LC0, 4(%esp)
movl $1, (%esp)
call __printf_chk
xorl %eax, %eax
leave
ret
.size main, .-main
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
C++ assembly
.file "helloworld.cpp"
.text
.p2align 4,,15
.type _GLOBAL__I_main, #function
_GLOBAL__I_main:
.LFB1007:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
subl $24, %esp
movl $_ZStL8__ioinit, (%esp)
call _ZNSt8ios_base4InitC1Ev
movl $__dso_handle, 8(%esp)
movl $_ZStL8__ioinit, 4(%esp)
movl $_ZNSt8ios_base4InitD1Ev, (%esp)
call __cxa_atexit
leave
ret
.cfi_endproc
.LFE1007:
.size _GLOBAL__I_main, .-_GLOBAL__I_main
.section .ctors,"aw",#progbits
.align 4
.long _GLOBAL__I_main
.section .rodata.str1.1,"aMS",#progbits,1
.LC0:
.string "Hello World!"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB997:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
andl $-16, %esp
pushl %ebx
subl $28, %esp
movl $12, 8(%esp)
movl $.LC0, 4(%esp)
movl $_ZSt4cout, (%esp)
.cfi_escape 0x10,0x3,0x7,0x55,0x9,0xf0,0x1a,0x9,0xfc,0x22
call _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_i
movl _ZSt4cout, %eax
movl -12(%eax), %eax
movl _ZSt4cout+124(%eax), %ebx
testl %ebx, %ebx
je .L9
cmpb $0, 28(%ebx)
je .L5
movzbl 39(%ebx), %eax
.L6:
movsbl %al,%eax
movl %eax, 4(%esp)
movl $_ZSt4cout, (%esp)
call _ZNSo3putEc
movl %eax, (%esp)
call _ZNSo5flushEv
addl $28, %esp
xorl %eax, %eax
popl %ebx
movl %ebp, %esp
popl %ebp
ret
.p2align 4,,7
.p2align 3
.L5:
movl %ebx, (%esp)
call _ZNKSt5ctypeIcE13_M_widen_initEv
movl (%ebx), %eax
movl $10, 4(%esp)
movl %ebx, (%esp)
call *24(%eax)
jmp .L6
.L9:
call _ZSt16__throw_bad_castv
.cfi_endproc
.LFE997:
.size main, .-main
.local _ZStL8__ioinit
.comm _ZStL8__ioinit,1,1
.weakref _ZL20__gthrw_pthread_oncePiPFvvE,pthread_once
.weakref _ZL27__gthrw_pthread_getspecificj,pthread_getspecific
.weakref _ZL27__gthrw_pthread_setspecificjPKv,pthread_setspecific
.weakref _ZL22__gthrw_pthread_createPmPK14pthread_attr_tPFPvS3_ES3_,pthread_create
.weakref _ZL20__gthrw_pthread_joinmPPv,pthread_join
.weakref _ZL21__gthrw_pthread_equalmm,pthread_equal
.weakref _ZL20__gthrw_pthread_selfv,pthread_self
.weakref _ZL22__gthrw_pthread_detachm,pthread_detach
.weakref _ZL22__gthrw_pthread_cancelm,pthread_cancel
.weakref _ZL19__gthrw_sched_yieldv,sched_yield
.weakref _ZL26__gthrw_pthread_mutex_lockP15pthread_mutex_t,pthread_mutex_lock
.weakref _ZL29__gthrw_pthread_mutex_trylockP15pthread_mutex_t,pthread_mutex_trylock
.weakref _ZL31__gthrw_pthread_mutex_timedlockP15pthread_mutex_tPK8timespec,pthread_mutex_timedlock
.weakref _ZL28__gthrw_pthread_mutex_unlockP15pthread_mutex_t,pthread_mutex_unlock
.weakref _ZL26__gthrw_pthread_mutex_initP15pthread_mutex_tPK19pthread_mutexattr_t,pthread_mutex_init
.weakref _ZL29__gthrw_pthread_mutex_destroyP15pthread_mutex_t,pthread_mutex_destroy
.weakref _ZL30__gthrw_pthread_cond_broadcastP14pthread_cond_t,pthread_cond_broadcast
.weakref _ZL27__gthrw_pthread_cond_signalP14pthread_cond_t,pthread_cond_signal
.weakref _ZL25__gthrw_pthread_cond_waitP14pthread_cond_tP15pthread_mutex_t,pthread_cond_wait
.weakref _ZL30__gthrw_pthread_cond_timedwaitP14pthread_cond_tP15pthread_mutex_tPK8timespec,pthread_cond_timedwait
.weakref _ZL28__gthrw_pthread_cond_destroyP14pthread_cond_t,pthread_cond_destroy
.weakref _ZL26__gthrw_pthread_key_createPjPFvPvE,pthread_key_create
.weakref _ZL26__gthrw_pthread_key_deletej,pthread_key_delete
.weakref _ZL30__gthrw_pthread_mutexattr_initP19pthread_mutexattr_t,pthread_mutexattr_init
.weakref _ZL33__gthrw_pthread_mutexattr_settypeP19pthread_mutexattr_ti,pthread_mutexattr_settype
.weakref _ZL33__gthrw_pthread_mutexattr_destroyP19pthread_mutexattr_t,pthread_mutexattr_destroy
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
Different compilers produce different code. An early version of gcc versus the current version of gcc likely produce different code.
More importantly, iostream handles a lot of things stdio doesn't, so there's obviously going to be some substantial overhead. I understand that, in theory, these could be compiled down to indentical code, but what they're doing is not technically identical.
Your issue here isn't about objects or optimization: it's that printf and cout are fundamentally very different beasts. For a more fair comparison, replace your cout statement in the C++ code with printf. Optimization is a moot point when you're outputting to stdout, as the bottleneck will certainly be the terminal's buffer.
You're not calling the same functions in the C++ example as the C example. Replace the std::cout pipes with plain old printf just like the C code and you should see a much greater correlation between the output of the two compilers.
You have to realize that there are a whole lot of "other" things going on in C++.
Global constructors for example. Also the libraries are different.
the C++ stream object is far more complicated than C io, and if you look through the assembler you can see all the code for pthreads in the C++ version.
It's not necessarily slower but it's certainly different.
I guessed it but was still surprised to see that the output of these two programs, written in C and C++, when compiled were very different.
I'm surprised you were surprised - they're totally different programs.
That makes me think that the concept of objects still exist at even the lowest level.
Absolutely... objects are the way memory is laid out and used during program execution (subject to optimisations).
Does this add overhead?
Not necessarily or typically - the same data would have to be somewhere anyway if the work was being coordinated in the same logical way.
If so is it currently an impossible optimization to convert object oriented code to procedural style or just very hard to do?
The issue has nothing to do with OO vs procedural code, or one being more efficient than the other. The main issue you observe here is that C++'s ostreams require a bit more setup and tear-down, and have more of the I/O coordinated by inline code, while printf() has more out-of-line in the precompiled library so you can't see it in your little code listing. It's not clear really which is "better", and unless you have a performance problem that profiling shows is related you should forget about it and get some useful programming done.
EDIT in response to comment:
Fair call - was a bit harshly worded - sorry. It's a difficult distinction to make actually... "only the compiler [knows] of objects" is true in one sense - they're not encapsulated, half-holy discrete "things" to the compiler the way they can be to the programmer. And, we could write an object that could be used exactly like you have used cout that would disappear during compilation and produce code that was equivalent to the printf() version. But, cout and iostreams involves some setup because it's thread safe and more inlined and handles different locales, and it's a real object with storage requirements because it carries around more independent information about error state, whether you want exceptions thrown, end-of-file conditions (printf() affects "errno", which is then clobbered by the next library/OS call)....
What you might find more insightful is to compare how much extra code is generated when you print one more string, as the amount of code is basically some constant overhead + some per-usage amount, and in latter regard ostream-based code can be as or more efficient than printf(), depending on the types and formatting requested. It's also worth noting that...
std::cout << "Hello world!\n";
...is correct and more analogous to your printf() statement... std::endl explicitly requests an unnecessary flushing operation, as a Standard-compliant C++ program will flush and close its buffers as the stream goes out of scope anyway (that said, there's an interesting post today where it seems someone's Microsoft VisualC++ compiler's not doing that for them! - worth keeping an eye on but hard to believe).