What is the mechanism of calling nonvirtual member function in C++?

What is the mechanism of calling nonvirtual member function in C++? - c++

The C++ object model is such that it does not contain any table for non virtual member functions. When there is a call of such a function
a.my_function();
with name mangling it becomes something like
my_function__5AclassKd(&a)
The object contains only data members. There are no table for non virtual functions. So in such a circumstances how calling mechanism finds out which function to call?
What's going on under the hood?

Formally the standard doesn't require them to work in any specific way, but usually they work exactly like plain functions, but with an extra invisible parameter: a pointer to the object instance they're called on.
Of course a compiler might be able to optimize that, e.g. don't pass the pointer if the member function doesn't use this or any member variables or member functions requiring this.

The compiler's job is to lay out the data and code the program needs into memory addresses. Each non-virtual function - whether member or non-member - gets a fixed virtual memory address at which it can be called. Calling machine code then hardcodes an absolute (or with position independent code a calling-address-relative offset) address of the function to call.
For example, say your compiler is compiling a non-virtual member function that takes 20 bytes of machine code, and it's putting the executable code at virtual addresses from offset 0x1000 and has already generated 10 bytes of executable code for other functions, then it will start the code of this function at virtual address 0x100A. Code that wants to call the function then generates machine code for "call 0x100A" after pushing any function call arguments (including a this pointer to the object to be operated upon) onto the stack.
You can easily see all this happening:
~/dev > cat example.cc
#include <cstdio>
struct X
{
int f(int n) { return n + 3; }
};
int main()
{
X x;
printf("%d\n", x.f(7));
}
~/dev > g++ example.cc -S; c++filt < example.s
.file "example.cc"
.section .text._ZN1X1fEi,"axG",#progbits,X::f(int),comdat
.align 2
.weak X::f(int)
.type X::f(int), #function
X::f(int): // code to execute X::f(int) starts at label .LFB0
.LFB0: // when this assembly is covered to machine code
.cfi_startproc // it's given a virtual address
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movl %esi, -12(%rbp)
movl -12(%rbp), %eax
addl $3, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size X::f(int), .-X::f(int)
.section .rodata
.LC0:
.string "%d\n"
.text
.globl main
.type main, #function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
leaq -9(%rbp), %rax
movl $7, %esi
movq %rax, %rdi
call X::f(int) // call non-member member function
// machine code will hardcoded address
movl %eax, %esi
leaq .LC0(%rip), %rdi
movl $0, %eax
call printf#PLT
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L5
call __stack_chk_fail#PLT
.L5:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Ubuntu 7.2.0-8ubuntu3) 7.2.0"
.section .note.GNU-stack,"",#progbits
If you compile a program then look at the disassembly it'll usually show the actual virtual address offsets too.

With non-virtual functions, there is no need to determine at runtime which function to call; so the resulting machine code will typically look the same as a normal function call, just with an extra argument for this as indicated in your example. (Though it's not always identical - for example, I think MSVC compiling 32-bit programs, in at least some versions, passes this in the ECX register instead of on the stack as for usual function parameters.)
Thus, the determination of which function to call is made by the compiler at compile time. At that time, it has the information determined from parsing class declarations that it can use, for example to do method overload resolution, and from there to either calculate or look up the mangled name to put into assembly code.

Related

Hidden parameter in C++ function call [duplicate]

The return value of a function is usually stored on the stack or in a register. But for a large structure, it has to be on the stack. How much copying has to happen in a real compiler for this code? Or is it optimized away?
For example:
struct Data {
unsigned values[256];
};
Data createData()
{
Data data;
// initialize data values...
return data;
}
(Assuming the function cannot be inlined..)

None; no copies are done.
The address of the caller's Data return value is actually passed as a hidden argument to the function, and the createData function simply writes into the caller's stack frame.
This is known as the named return value optimisation. Also see the c++ faq on this topic.
commercial-grade C++ compilers implement return-by-value in a way that lets them eliminate the overhead, at least in simple cases
...
When yourCode() calls rbv(), the compiler secretly passes a pointer to the location where rbv() is supposed to construct the "returned" object.
You can demonstrate that this has been done by adding a destructor with a printf to your struct. The destructor should only be called once if this return-by-value optimisation is in operation, otherwise twice.
Also you can check the assembly to see that this happens:
Data createData()
{
Data data;
// initialize data values...
data.values[5] = 6;
return data;
}
here's the assembly:
__Z10createDatav:
LFB2:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
subl $1032, %esp
LCFI2:
movl 8(%ebp), %eax
movl $6, 20(%eax)
leave
ret $4
LFE2:
Curiously, it allocated enough space on the stack for the data item subl $1032, %esp, but note that it takes the first argument on the stack 8(%ebp) as the base address of the object, and then initialises element 6 of that item. Since we didn't specify any arguments to createData, this is curious until you realise this is the secret hidden pointer to the parent's version of Data.

But for a large structure, it has to be on the heap stack.
Indeed so! A large structure declared as a local variable is allocated on the stack. Glad to have that cleared up.
As for avoiding copying, as others have noted:
Most calling conventions deal with "function returning struct" by passing an additional parameter that points the location in the caller's stack frame in which the struct should be placed. This is definitely a matter for the calling convention and not the language.
With this calling convention, it becomes possible for even a relatively simple compiler to notice when a code path is definitely going to return a struct, and for it to fix assignments to that struct's members so that they go directly into the caller's frame and don't have to be copied. The key is for the compiler to notice that all terminating code paths through the function return the same struct variable. If that's the case, the compiler can safely use the space in the caller's frame, eliminating the need for a copy at the point of return.

There are many examples given, but basically
This question does not have any definite answer. it will depend on the compiler.
C does not specify how large structs are returned from a function.
Here's some tests for one particular compiler, gcc 4.1.2 on x86 RHEL 5.4
gcc trivial case, no copying
[00:05:21 1 ~] $ gcc -O2 -S -c t.c
[00:05:23 1 ~] $ cat t.s
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
movl $1, 24(%eax)
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
gcc more realistic case , allocate on stack, memcpy to caller
#include <stdlib.h>
struct Data {
unsigned values[256];
};
struct Data createData()
{
struct Data data;
int i;
for(i = 0; i < 256 ; i++)
data.values[i] = rand();
return data;
}
[00:06:08 1 ~] $ gcc -O2 -S -c t.c
[00:06:10 1 ~] $ cat t.s
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl $1, %ebx
subl $1036, %esp
movl 8(%ebp), %edi
leal -1036(%ebp), %esi
.p2align 4,,7
.L2:
call rand
movl %eax, -4(%esi,%ebx,4)
addl $1, %ebx
cmpl $257, %ebx
jne .L2
movl %esi, 4(%esp)
movl %edi, (%esp)
movl $1024, 8(%esp)
call memcpy
addl $1036, %esp
movl %edi, %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
gcc 4.4.2### has grown a lot, and does not copy for the above non-trivial case.
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl $1, %ebx
subl $1036, %esp
movl 8(%ebp), %edi
leal -1036(%ebp), %esi
.p2align 4,,7
.L2:
call rand
movl %eax, -4(%esi,%ebx,4)
addl $1, %ebx
cmpl $257, %ebx
jne .L2
movl %esi, 4(%esp)
movl %edi, (%esp)
movl $1024, 8(%esp)
call memcpy
addl $1036, %esp
movl %edi, %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
In addition, VS2008 (compiled the above as C) will reserve struct Data on the stack of createData() and do a rep movsd loop to copy it back to the caller in Debug mode, in Release mode it will move the return value of rand() (%eax) directly back to the caller

typedef struct {
unsigned value[256];
} Data;
Data createData(void) {
Data r;
calcualte(&r);
return r;
}
Data d = createData();
msvc(6,8,9) and gcc mingw(3.4.5,4.4.0) will generate code like the following pseudocode
void createData(Data* r) {
calculate(&r)
}
Data d;
createData(&d);

gcc on linux will issue a memcpy() to copy the struct back on the stack of the caller. If the function has internal linkage, more optimizations become available though.

Inline functions mechanism

I know that an inline function does not use the stack for copying the parameters but it just replaces the body of the function wherever it is called.
Consider these two functions:
inline void add(int a) {
a++;
} // does nothing, a won't be changed
inline void add(int &a) {
a++;
} // changes the value of a
If the stack is not used for sending the parameters, how does the compiler know if a variable will be modified or not? What does the code looks like after replacing the calls of these two functions?

What makes you think there is a stack ? And even if there is, what makes you think it would be use for passing parameters ?
You have to understand that there are two levels of reasoning:
the language level: where the semantics of what should happen are defined
the machine level: where said semantics, encoded into CPU instructions, are carried out
At the language level, if you pass a parameter by non-const reference it might be modified by the function. The language level knows not what this mysterious "stack" is. Note: the inline keyword has little to no effect on whether a function call is inlined, it just says that the definition is in-line.
At machine level... there are many ways to achieve this. When making a function call, you have to obey a calling convention. This convention defines how the function parameters (and return types) are exchanged between caller and callee and who among them is responsible for saving/restoring the CPU registers. In general, because it is so low-level, this convention changes on a per CPU family basis.
For example, on x86, a couple parameters will be passed directly in CPU registers (if they fit) whilst remaining parameters (if any) will be passed on the stack.

I have checked what at least GCC does with it if you force it to inline the methods:
inline static void add1(int a) __attribute__((always_inline));
void add1(int a) {
a++;
} // does nothing, a won't be changed
inline static void add2(int &a) __attribute__((always_inline));
void add2(int &a) {
a++;
} // changes the value of a
int main() {
label1:
int b = 0;
add1(b);
label2:
int a = 0;
add2(a);
return 0;
}
The assembly output for this looks like:
.file "test.cpp"
.text
.globl main
.type main, #function
main:
.LFB2:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
subl $16, %esp
.L2:
movl $0, -4(%ebp)
movl -4(%ebp), %eax
movl %eax, -8(%ebp)
addl $1, -8(%ebp)
.L3:
movl $0, -12(%ebp)
movl -12(%ebp), %eax
addl $1, %eax
movl %eax, -12(%ebp)
movl $0, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE2:
Interestingly even the first call of add1() that effectively does nothing as a result outside of the function call, isn't optimized out.

If the stack is not used for sending the parameters, how does the
compiler know if a variable will be modified or not?
As Matthieu M. already pointed out the language construction itself knows nothing about stack.You specify inline keyword to the function just to give a compiler a hint and express a wish that you would prefer this routine to be inlined. If this happens depends completely on the compiler.
The compiler tries to predict what the advantages of this process given particular circumstances might be. If the compiler decides that inlining the function will make the code slower, or unacceptably larger, it will not inline it. Or, if it simply cannot because of a syntactical dependency, such as other code using a function pointer for callbacks, or exporting the function externally as in a dynamic/static code library.
What does the code looks like after replacing the calls of these two
functions?
At he moment none of this function is being inlined when compiled with
g++ -finline-functions -S main.cpp
and you can see it because in disassembly of main
void add1(int a) {
a++;
}
void add2(int &a) {
a++;
}
inline void add3(int a) {
a++;
} // does nothing, a won't be changed
inline void add4(int &a) {
a++;
} // changes the value of a
inline int f() { return 43; }
int main(int argc, char** argv) {
int a = 31;
add1(a);
add2(a);
add3(a);
add4(a);
return 0;
}
we see a call to each routine being made:
main:
.LFB8:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $31, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, %edi
call _Z4add1i // function call
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Z4add2Ri // function call
movl -4(%rbp), %eax
movl %eax, %edi
call _Z4add3i // function call
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Z4add4Ri // function call
movl $0, %eax
leave
ret
.cfi_endproc
compiling with -O1 will remove all functions from program at all because they do nothing.
However addition of
__attribute__((always_inline))
allows us to see what happens when code is inlined:
void add1(int a) {
a++;
}
void add2(int &a) {
a++;
}
inline static void add3(int a) __attribute__((always_inline));
inline void add3(int a) {
a++;
} // does nothing, a won't be changed
inline static void add4(int& a) __attribute__((always_inline));
inline void add4(int &a) {
a++;
} // changes the value of a
int main(int argc, char** argv) {
int a = 31;
add1(a);
add2(a);
add3(a);
add4(a);
return 0;
}
now: g++ -finline-functions -S main.cpp results with:
main:
.LFB9:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $31, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, %edi
call _Z4add1i // function call
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Z4add2Ri // function call
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
addl $1, -8(%rbp) // addition is here, there is no call
movl -4(%rbp), %eax
addl $1, %eax // addition is here, no call again
movl %eax, -4(%rbp)
movl $0, %eax
leave
ret
.cfi_endproc

The inline keyword has two key effects. One effect is that it is a hint to the implementation that "inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism." This usage is a hint, not a mandate, because "an implementation is not required to perform this inline substitution at the point of call".
The other principal effect is how it modifies the one definition rule. Per the ODR, a program must contain exactly one definition of any given non-inline function that is odr-used in the program. That doesn't quite work with an inline function because "An inline function shall be defined in every translation unit in which it is odr-used ...". Use the same inline function in one hundred different translation units and the linker will be confronted with one hundred definitions of the function. This isn't a problem because those multiple implementations of the same function "... shall have exactly the same definition in every case." One way to look at this: There still is only one definition; it just looks like there are a whole bunch to the linker.
Note: All quoted material are from section 7.1.2 of the C++11 standard.

Slow XOR operator

EDIT: Indeed, I had a weird error in my timing code leading to these results. When I fixed my error, the smart version ended up faster as expected. My timing code looked like this:
bool x = false;
before = now();
for (int i=0; i<N; ++i) {
x ^= smart_xor(A[i],B[i]);
}
after = now();
I had done the ^= to discourage my compiler from optimizing the for-loop away. But I think that the ^= somehow interacts strangely with the two xor functions. I changed my timing code to simply fill out an array of the xor results, and then do computation with that array outside of the timed code. And that fixed things.
Should I delete this question?
END EDIT
I defined two C++ functions as follows:
bool smart_xor(bool a, bool b) {
return a^b;
}
bool dumb_xor(bool a, bool b) {
return a?!b:b;
}
My timing tests indicate that dumb_xor() is slightly faster (1.31ns vs 1.90ns when inlined, 1.92ns vs 2.21ns when not inlined). This puzzles me, as the ^ operator should be a single machine operation. I'm wondering if anyone has an explanation.
The assembly looks like this (when not inlined):
.file "xor.cpp"
.text
.p2align 4,,15
.globl _Z9smart_xorbb
.type _Z9smart_xorbb, #function
_Z9smart_xorbb:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl %esi, %eax
xorl %edi, %eax
ret
.cfi_endproc
.LFE0:
.size _Z9smart_xorbb, .-_Z9smart_xorbb
.p2align 4,,15
.globl _Z8dumb_xorbb
.type _Z8dumb_xorbb, #function
_Z8dumb_xorbb:
.LFB1:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl %esi, %edx
movl %esi, %eax
xorl $1, %edx
testb %dil, %dil
cmovne %edx, %eax
ret
.cfi_endproc
.LFE1:
.size _Z8dumb_xorbb, .-_Z8dumb_xorbb
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
I'm using g++ 4.4.3-4ubuntu5 on an Intel Xeon X5570. I compiled with -O3.

I don't think you benchmarked your code correctly.
We can see in the generated assembly that your smart_xor function is:
movl %esi, %eax
xorl %edi, %eax
while your dumb_xor function is:
movl %esi, %edx
movl %esi, %eax
xorl $1, %edx
testb %dil, %dil
cmovne %edx, %eax
So obviously, the first one will be faster.
If not, then you have benchmarking issues.
So you may want to tune your benchmarking code... And remember you'll need to run a lot of calls to have a good and meaningful average.

Given that your "dumb XOR" code is significantly longer (and most instructions are dependent on a previous one, so it won't run in parallel), I suspect that you have some sort of measurement error in your results.
The compiler will need to produce two instructions for the out-of-line version of "smart XOR" because the registers that the data comes in as is not the register to give the return result in, so the data has to move from EDI and ESI to EAX. In an inline version, the code should be able to use whatever register the data is in before the call, and if the code allows it, result stays in the register it came in as.
Calling a function is out-of-line is probably at least as long in execution time as the actual code in the function.
It would help if you showes your test-harness that you use for benchmarking too...

How does a compiler get a const's address in C++?

As is discussed here Which memory area is a const object in in C++?, a compiler will probably not allocate any storage for the constants when compiling the code, they may probable be embedded directly into the machine code. Then how does a compiler get a constant's address?
C++ code:
void f()
{
const int a = 99;
const int *p = &a;
printf("constant's value: %d\n", *p);
}

Whether a constant will be allocated any storage or not is completely dependent on the compiler. Compilers are allowed to perform optimizations as per the As-If rule, as long as the observable behavior of the program doesn't change the compiler may allocate storage for a or may not. Note that these optimizations are not demanded but allowed by the standard.
Obviously, when you take the address of this const the compiler will have to return you a address by which you can refer a so it will have to place a in memory or atleast pretend it does so.

All variables must be addressable. The compiler may optimize out constant variables, but in a case like your where you use the variable address it can't do it.

compiler might do many tricks that it is allowed to do up to the point when it will affect visible behaviour of the program (see as-if rule). So it might not allocate storage for const object const int a = 99; ,but in the case when you take an address of variable - it has to be allocated some storage or at least pretended to be and you will get memory address which allows you to refer toa.
code:
#include <cstdlib>
using namespace std;
#include <cstdio>
const int a = 98;
void f()
{
const int a = 99;
const int *p = &a;
printf("constant's value: %d\n", *p);
}
int main(int argc, char** argv)
{
int b=100;
f();
return 0;
}
gcc -S main.cpp:
.file "main.cpp"
.section .rodata
.LC0:
.string "constant's value: %d\n"
.text
.globl _Z1fv
.type _Z1fv, #function
_Z1fv:
.LFB4:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp
movl $99, -4(%rbp)
leaq -4(%rbp), %rax
movq %rax, -16(%rbp)
movq -16(%rbp), %rax
movl (%rax), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
leave
ret
.cfi_endproc
.LFE4:
.size _Z1fv, .-_Z1fv
.globl main
.type main, #function
main:
.LFB5:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $100, -4(%rbp)
call _Z1fv
movl $0, %eax
leave
ret
.cfi_endproc
.LFE5:
.size main, .-main
.section .rodata
.align 4
.type _ZL1a, #object
.size _ZL1a, 4
_ZL1a:
.long 98
.ident "GCC: (Ubuntu/Linaro 4.4.7-2ubuntu1) 4.4.7"
.section .note.GNU-stack,"",#progbits
so what we see is that the variable const int a = 99; is in fact built in machine code, do not reside in specific area of memory (there is no memory on the stack, the heap or the data segment allocated to it). Please correct me if I am wrong.

Why are C and C++ different even after compilation?

I guessed it but was still surprised to see that the output of these two programs, written in C and C++, when compiled were very different.
That makes me think that the concept of objects still exist at even the lowest level.
Does this add overhead? If so is it currently an impossible optimization to convert object oriented code to procedural style or just very hard to do?
helloworld.c
#include <stdio.h>
int main(void) {
printf("Hello World!\n");
return 0;
}
helloworld.cpp
#include <iostream>
int main() {
std::cout << "Hello World!" << std::endl;
return 0;
}
Compiled like this:
gcc helloworld.cpp -o hwcpp.S -S -O2
gcc helloworld.c -o hwc.S -S -O2
Produced this code:
C assembly
.file "helloworld.c"
.section .rodata.str1.1,"aMS",#progbits,1
.LC0:
.string "Hello World!\n"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl $.LC0, 4(%esp)
movl $1, (%esp)
call __printf_chk
xorl %eax, %eax
leave
ret
.size main, .-main
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
C++ assembly
.file "helloworld.cpp"
.text
.p2align 4,,15
.type _GLOBAL__I_main, #function
_GLOBAL__I_main:
.LFB1007:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
subl $24, %esp
movl $_ZStL8__ioinit, (%esp)
call _ZNSt8ios_base4InitC1Ev
movl $__dso_handle, 8(%esp)
movl $_ZStL8__ioinit, 4(%esp)
movl $_ZNSt8ios_base4InitD1Ev, (%esp)
call __cxa_atexit
leave
ret
.cfi_endproc
.LFE1007:
.size _GLOBAL__I_main, .-_GLOBAL__I_main
.section .ctors,"aw",#progbits
.align 4
.long _GLOBAL__I_main
.section .rodata.str1.1,"aMS",#progbits,1
.LC0:
.string "Hello World!"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB997:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
andl $-16, %esp
pushl %ebx
subl $28, %esp
movl $12, 8(%esp)
movl $.LC0, 4(%esp)
movl $_ZSt4cout, (%esp)
.cfi_escape 0x10,0x3,0x7,0x55,0x9,0xf0,0x1a,0x9,0xfc,0x22
call _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_i
movl _ZSt4cout, %eax
movl -12(%eax), %eax
movl _ZSt4cout+124(%eax), %ebx
testl %ebx, %ebx
je .L9
cmpb $0, 28(%ebx)
je .L5
movzbl 39(%ebx), %eax
.L6:
movsbl %al,%eax
movl %eax, 4(%esp)
movl $_ZSt4cout, (%esp)
call _ZNSo3putEc
movl %eax, (%esp)
call _ZNSo5flushEv
addl $28, %esp
xorl %eax, %eax
popl %ebx
movl %ebp, %esp
popl %ebp
ret
.p2align 4,,7
.p2align 3
.L5:
movl %ebx, (%esp)
call _ZNKSt5ctypeIcE13_M_widen_initEv
movl (%ebx), %eax
movl $10, 4(%esp)
movl %ebx, (%esp)
call *24(%eax)
jmp .L6
.L9:
call _ZSt16__throw_bad_castv
.cfi_endproc
.LFE997:
.size main, .-main
.local _ZStL8__ioinit
.comm _ZStL8__ioinit,1,1
.weakref _ZL20__gthrw_pthread_oncePiPFvvE,pthread_once
.weakref _ZL27__gthrw_pthread_getspecificj,pthread_getspecific
.weakref _ZL27__gthrw_pthread_setspecificjPKv,pthread_setspecific
.weakref _ZL22__gthrw_pthread_createPmPK14pthread_attr_tPFPvS3_ES3_,pthread_create
.weakref _ZL20__gthrw_pthread_joinmPPv,pthread_join
.weakref _ZL21__gthrw_pthread_equalmm,pthread_equal
.weakref _ZL20__gthrw_pthread_selfv,pthread_self
.weakref _ZL22__gthrw_pthread_detachm,pthread_detach
.weakref _ZL22__gthrw_pthread_cancelm,pthread_cancel
.weakref _ZL19__gthrw_sched_yieldv,sched_yield
.weakref _ZL26__gthrw_pthread_mutex_lockP15pthread_mutex_t,pthread_mutex_lock
.weakref _ZL29__gthrw_pthread_mutex_trylockP15pthread_mutex_t,pthread_mutex_trylock
.weakref _ZL31__gthrw_pthread_mutex_timedlockP15pthread_mutex_tPK8timespec,pthread_mutex_timedlock
.weakref _ZL28__gthrw_pthread_mutex_unlockP15pthread_mutex_t,pthread_mutex_unlock
.weakref _ZL26__gthrw_pthread_mutex_initP15pthread_mutex_tPK19pthread_mutexattr_t,pthread_mutex_init
.weakref _ZL29__gthrw_pthread_mutex_destroyP15pthread_mutex_t,pthread_mutex_destroy
.weakref _ZL30__gthrw_pthread_cond_broadcastP14pthread_cond_t,pthread_cond_broadcast
.weakref _ZL27__gthrw_pthread_cond_signalP14pthread_cond_t,pthread_cond_signal
.weakref _ZL25__gthrw_pthread_cond_waitP14pthread_cond_tP15pthread_mutex_t,pthread_cond_wait
.weakref _ZL30__gthrw_pthread_cond_timedwaitP14pthread_cond_tP15pthread_mutex_tPK8timespec,pthread_cond_timedwait
.weakref _ZL28__gthrw_pthread_cond_destroyP14pthread_cond_t,pthread_cond_destroy
.weakref _ZL26__gthrw_pthread_key_createPjPFvPvE,pthread_key_create
.weakref _ZL26__gthrw_pthread_key_deletej,pthread_key_delete
.weakref _ZL30__gthrw_pthread_mutexattr_initP19pthread_mutexattr_t,pthread_mutexattr_init
.weakref _ZL33__gthrw_pthread_mutexattr_settypeP19pthread_mutexattr_ti,pthread_mutexattr_settype
.weakref _ZL33__gthrw_pthread_mutexattr_destroyP19pthread_mutexattr_t,pthread_mutexattr_destroy
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits

Different compilers produce different code. An early version of gcc versus the current version of gcc likely produce different code.
More importantly, iostream handles a lot of things stdio doesn't, so there's obviously going to be some substantial overhead. I understand that, in theory, these could be compiled down to indentical code, but what they're doing is not technically identical.

Your issue here isn't about objects or optimization: it's that printf and cout are fundamentally very different beasts. For a more fair comparison, replace your cout statement in the C++ code with printf. Optimization is a moot point when you're outputting to stdout, as the bottleneck will certainly be the terminal's buffer.

You're not calling the same functions in the C++ example as the C example. Replace the std::cout pipes with plain old printf just like the C code and you should see a much greater correlation between the output of the two compilers.

You have to realize that there are a whole lot of "other" things going on in C++.
Global constructors for example. Also the libraries are different.
the C++ stream object is far more complicated than C io, and if you look through the assembler you can see all the code for pthreads in the C++ version.
It's not necessarily slower but it's certainly different.

I guessed it but was still surprised to see that the output of these two programs, written in C and C++, when compiled were very different.
I'm surprised you were surprised - they're totally different programs.
That makes me think that the concept of objects still exist at even the lowest level.
Absolutely... objects are the way memory is laid out and used during program execution (subject to optimisations).
Does this add overhead?
Not necessarily or typically - the same data would have to be somewhere anyway if the work was being coordinated in the same logical way.
If so is it currently an impossible optimization to convert object oriented code to procedural style or just very hard to do?
The issue has nothing to do with OO vs procedural code, or one being more efficient than the other. The main issue you observe here is that C++'s ostreams require a bit more setup and tear-down, and have more of the I/O coordinated by inline code, while printf() has more out-of-line in the precompiled library so you can't see it in your little code listing. It's not clear really which is "better", and unless you have a performance problem that profiling shows is related you should forget about it and get some useful programming done.
EDIT in response to comment:
Fair call - was a bit harshly worded - sorry. It's a difficult distinction to make actually... "only the compiler [knows] of objects" is true in one sense - they're not encapsulated, half-holy discrete "things" to the compiler the way they can be to the programmer. And, we could write an object that could be used exactly like you have used cout that would disappear during compilation and produce code that was equivalent to the printf() version. But, cout and iostreams involves some setup because it's thread safe and more inlined and handles different locales, and it's a real object with storage requirements because it carries around more independent information about error state, whether you want exceptions thrown, end-of-file conditions (printf() affects "errno", which is then clobbered by the next library/OS call)....
What you might find more insightful is to compare how much extra code is generated when you print one more string, as the amount of code is basically some constant overhead + some per-usage amount, and in latter regard ostream-based code can be as or more efficient than printf(), depending on the types and formatting requested. It's also worth noting that...
std::cout << "Hello world!\n";
...is correct and more analogous to your printf() statement... std::endl explicitly requests an unnecessary flushing operation, as a Standard-compliant C++ program will flush and close its buffers as the stream goes out of scope anyway (that said, there's an interesting post today where it seems someone's Microsoft VisualC++ compiler's not doing that for them! - worth keeping an eye on but hard to believe).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js