How to enable optimization in G++ with #pragma - c++

I want to enable optimization in g++ without a command-line parameter.
I know GCC can do it if I write #pragma GCC optimize (2) in my code.
But it doesn't seem to work in G++.
This page may help: http://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html
My compiler version:
$ g++ --version
g++ (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1
<suppressed copyright message>
$ gcc --version
gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1
<suppressed copyright message>
I wrote some code like this:
#pragma GCC optimize (2)
int main(){
long x;
x=11;
x+=12;
x*=13;
x/=14;
return 0;
}
I compiled it with GCC, not G++. Then I ran objdump, which output:
08048300 <main>:
8048300: 55 push %ebp
8048301: 31 c0 xor %eax,%eax
8048303: 89 e5 mov %esp,%ebp
8048305: 5d pop %ebp
8048306: c3 ret
8048307: 90 nop
When I removed #pragma GCC optimize (2), objdump output:
080483b4 <main>:
80483b4: 55 push %ebp
80483b5: 89 e5 mov %esp,%ebp
80483b7: 83 ec 10 sub $0x10,%esp
80483ba: c7 45 fc 0b 00 00 00 movl $0xb,-0x4(%ebp)
80483c1: 83 45 fc 0c addl $0xc,-0x4(%ebp)
80483c5: 8b 55 fc mov -0x4(%ebp),%edx
80483c8: 89 d0 mov %edx,%eax
80483ca: 01 c0 add %eax,%eax
80483cc: 01 d0 add %edx,%eax
80483ce: c1 e0 02 shl $0x2,%eax
80483d1: 01 d0 add %edx,%eax
80483d3: 89 45 fc mov %eax,-0x4(%ebp)
80483d6: 8b 4d fc mov -0x4(%ebp),%ecx
80483d9: ba 93 24 49 92 mov $0x92492493,%edx
80483de: 89 c8 mov %ecx,%eax
80483e0: f7 ea imul %edx
80483e2: 8d 04 0a lea (%edx,%ecx,1),%eax
80483e5: 89 c2 mov %eax,%edx
80483e7: c1 fa 03 sar $0x3,%edx
80483ea: 89 c8 mov %ecx,%eax
80483ec: c1 f8 1f sar $0x1f,%eax
80483ef: 89 d1 mov %edx,%ecx
80483f1: 29 c1 sub %eax,%ecx
80483f3: 89 c8 mov %ecx,%eax
80483f5: 89 45 fc mov %eax,-0x4(%ebp)
80483f8: b8 00 00 00 00 mov $0x0,%eax
80483fd: c9 leave
80483fe: c3 ret
80483ff: 90 nop
However, it won't work with G++!

This appears to be a bug in g++ (Bug 48026, which references another related issue).
As a workaround, you can mark each function with __attribute__((optimize("whatever"))). Not great.
int main() __attribute__((optimize("-O2")));
int main()
{
long x;
x=11;
x+=12;
x*=13;
x/=14;
return 0;
}
$ g++ -Wall -c t.c
$ objdump -d t.o
t.o: file format elf64-x86-64
Disassembly of section .text.startup:
0000000000000000 <main>:
0: 55 push %rbp
1: 31 c0 xor %eax,%eax
3: 48 89 e5 mov %rsp,%rbp
6: 5d pop %rbp
7: c3 retq
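If many functions need the attribute, one way to keep the workaround tidy is to hide it behind a macro. The following is a minimal sketch, not part of the original answer: the macro name OPTIMIZE_O2 and the function scaled are made up for illustration, and the guard assumes that only GCC proper should see the attribute.

```cpp
#include <cassert>

// Hypothetical convenience macro: expands to GCC's per-function optimize
// attribute on GCC, and to nothing on other compilers.
#if defined(__GNUC__) && !defined(__clang__)
#define OPTIMIZE_O2 __attribute__((optimize("O2")))
#else
#define OPTIMIZE_O2
#endif

// Same arithmetic as in the question, but returning the result so that the
// computation is observable.
OPTIMIZE_O2 long scaled(long x) {
    x += 12;
    x *= 13;
    x /= 14;
    return x;
}

// Small sanity check: (11 + 12) * 13 / 14 == 21 with integer division.
void self_check() {
    assert(scaled(11) == 21);
}
```

The attribute only changes how the marked function is compiled; the result of the function is the same with or without it.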

Related

finstrument-functions-exclude-function-list appears to not handle commas properly

Attempting to compile with finstrument-functions and exclude a template function with multiple template parameters, using the \ method to escape commas (as described for exclude-file-list here) fails to properly disable instrumenting the function passed.
GCC command used:
gcc -finstrument-functions -finstrument-functions-exclude-function-list='test<float\, int>' main.cpp -o a.out -O0
Above creates a binary file with the "test" function instrumented. Assembly snippet and main.cpp file included below
gcc -dumpversion returns "6.2.0"; the command above was run on Red Hat Enterprise Linux 7.4.
Contents of main.cpp:
template<class T, class U>
T test(int a, T b){
int res = 0;
for(int i = 0; i < 1000; i++){
res += i;
}
return(res);
}
int main(int argc, char** argv){
float a = test<float, int>(argc, 1.0);
return(0);
}
objdumped output for "test" function:
000000000040059f <float test<float, int>(int, float)>:
40059f: 55 push %rbp
4005a0: 48 89 e5 mov %rsp,%rbp
4005a3: 48 83 ec 20 sub $0x20,%rsp
4005a7: 89 7d ec mov %edi,-0x14(%rbp)
4005aa: f3 0f 11 45 e8 movss %xmm0,-0x18(%rbp)
4005af: 48 8b 45 08 mov 0x8(%rbp),%rax
4005b3: 48 89 c6 mov %rax,%rsi
4005b6: bf 9f 05 40 00 mov $0x40059f,%edi
4005bb: e8 70 fe ff ff callq 400430 <__cyg_profile_func_enter@plt>
4005c0: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%rbp)
4005c7: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
4005ce: 81 7d fc e7 03 00 00 cmpl $0x3e7,-0x4(%rbp)
4005d5: 7f 11 jg 4005e8 <float test<float, int>(int, float)+0x49>
4005d7: 8b 55 f8 mov -0x8(%rbp),%edx
4005da: 8b 45 fc mov -0x4(%rbp),%eax
4005dd: 01 d0 add %edx,%eax
4005df: 89 45 f8 mov %eax,-0x8(%rbp)
4005e2: 83 45 fc 01 addl $0x1,-0x4(%rbp)
4005e6: eb e6 jmp 4005ce <float test<float, int>(int, float)+0x2f>
4005e8: 8b 45 f8 mov -0x8(%rbp),%eax
4005eb: 66 0f ef c9 pxor %xmm1,%xmm1
4005ef: f3 0f 2a c8 cvtsi2ss %eax,%xmm1
4005f3: f3 0f 11 4d e4 movss %xmm1,-0x1c(%rbp)
4005f8: 48 8b 45 08 mov 0x8(%rbp),%rax
4005fc: 48 89 c6 mov %rax,%rsi
4005ff: bf 9f 05 40 00 mov $0x40059f,%edi
400604: e8 17 fe ff ff callq 400420 <__cyg_profile_func_exit@plt>
400609: f3 0f 10 45 e4 movss -0x1c(%rbp),%xmm0
40060e: c9 leaveq
40060f: c3 retq
I expected the test function to not be instrumented, but it is. Does anyone know why this is?
Compiler explorer example
Just in case anyone comes by this way, both this and finstrument-functions-exclude-function-list not respecting namespace parts of a function are bugs, and I have filed against both. Hopefully a fix will be implemented soon (working on one currently).
Namespace / class mishandling
Comma mishandling
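Until the option parsing is fixed, a possible alternative is to mark the function in the source instead of on the command line. This is a sketch under that assumption, not something proposed in the thread: it uses GCC's no_instrument_function attribute on the template from the question, and the self_check function is added only for illustration.

```cpp
#include <cassert>

// The attribute asks the compiler not to emit __cyg_profile_func_enter/exit
// calls for this function when -finstrument-functions is in effect.
template<class T, class U>
__attribute__((no_instrument_function))
T test(int a, T b) {
    (void)a;  // unused, kept to match the signature in the question
    (void)b;
    int res = 0;
    for (int i = 0; i < 1000; i++) {
        res += i;
    }
    return res;  // converted to T on return
}

// Sanity check: the sum 0 + 1 + ... + 999 is 499500, which a float holds exactly.
void self_check() {
    assert((test<float, int>(1, 1.0f)) == 499500.0f);
}
```

Whether the function is really left uninstrumented can be checked the same way as in the question, by looking for the __cyg_profile_func_enter call in the objdump output.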

Very strange segfault calling WinUsb_GetOverlappedResult

I have this code:
void GetResult(WINUSB_INTERFACE_HANDLE InterfaceHandle, LPOVERLAPPED lpOverlapped)
{
DWORD numBytes = 0;
WinUsb_GetOverlappedResult(
InterfaceHandle,
lpOverlapped,
&numBytes,
TRUE
);
return;
uint8_t stack[64];
}
WinUsb_GetOverlappedResult is a __stdcall function declared as follows:
WINBOOL WINAPI WinUsb_GetOverlappedResult (WINUSB_INTERFACE_HANDLE InterfaceHandle, LPOVERLAPPED lpOverlapped, LPDWORD lpNumberOfBytesTransferred, WINBOOL bWait);
Compiling in debug mode with GCC 5.3.0 (MinGW) it all works fine. (I can't compile with VC++ because I'm using GCC extensions.)
However if I change it to stack[80] then it segfaults!!
Here is the disassembly in each case. 64 (doesn't crash):
Dump of assembler code for function GetResult(void*, _OVERLAPPED*):
88 {
0x00408523 <+0>: push %ebp
0x00408524 <+1>: mov %esp,%ebp
0x00408526 <+3>: sub $0x68,%esp
89 DWORD numBytes = 0;
0x00408529 <+6>: movl $0x0,-0xc(%ebp)
90 WinUsb_GetOverlappedResult(
91 InterfaceHandle,
92 lpOverlapped,
93 &numBytes,
94 TRUE
95 );
=> 0x00408530 <+13>: movl $0x1,0xc(%esp)
0x00408538 <+21>: lea -0xc(%ebp),%eax
0x0040853b <+24>: mov %eax,0x8(%esp)
0x0040853f <+28>: mov 0xc(%ebp),%eax
0x00408542 <+31>: mov %eax,0x4(%esp)
0x00408546 <+35>: mov 0x8(%ebp),%eax
0x00408549 <+38>: mov %eax,(%esp)
0x0040854c <+41>: call 0x409d58 <WinUsb_GetOverlappedResult@16>
0x00408551 <+46>: sub $0x10,%esp
96 return;
0x00408554 <+49>: nop
97
98 uint8_t stack[64];
99 }
0x00408555 <+50>: leave
0x00408556 <+51>: ret
And 80 (does crash):
Dump of assembler code for function GetResult(void*, _OVERLAPPED*):
88 {
0x00408523 <+0>: push %ebp
0x00408524 <+1>: mov %esp,%ebp
0x00408526 <+3>: sub $0x78,%esp
89 DWORD numBytes = 0;
0x00408529 <+6>: movl $0x0,-0xc(%ebp)
90 WinUsb_GetOverlappedResult(
91 InterfaceHandle,
92 lpOverlapped,
93 &numBytes,
94 TRUE
95 );
=> 0x00408530 <+13>: movl $0x1,0xc(%esp)
0x00408538 <+21>: lea -0xc(%ebp),%eax
0x0040853b <+24>: mov %eax,0x8(%esp)
0x0040853f <+28>: mov 0xc(%ebp),%eax
0x00408542 <+31>: mov %eax,0x4(%esp)
0x00408546 <+35>: mov 0x8(%ebp),%eax
0x00408549 <+38>: mov %eax,(%esp)
0x0040854c <+41>: call 0x409d58 <WinUsb_GetOverlappedResult@16>
0x00408551 <+46>: sub $0x10,%esp
96 return;
0x00408554 <+49>: nop
97
98 uint8_t stack[80];
99 }
0x00408555 <+50>: leave
0x00408556 <+51>: ret
The effect of the __stdcall is to add the line sub $0x10,%esp which I guess is to cancel out ret $0x10 in the function.
In any case these seem very similar and I have no idea why it is crashing. I'm not even 100% sure where it is crashing (GDB is rather unhelpful) but it is somewhere around WinUsb function call.
It's quite hard to debug because if I run the debugger with any breakpoints set, it doesn't crash. I suspect it may be timing related -
I can also prevent the crash with a few extra Sleep(100)s. Once it seemed to crash in PerfIncrementULongLongCounterValue() but who knows...
Does anyone have any clue why this might be happening?
Edit
WinUsb_GetOverlappedResult() just calls straight through to GetOverlappedResult() according to its assembly, so I replaced the call with that. Now you need stack[96] to cause the crash, but when it does crash it at least tells me where the real crash is (I think)!
Here is the disassembly of GetOverlappedResult(). It crashes where indicated because ebp is 0.
0x76feaba0 8b ff mov %edi,%edi
0x76feaba2 <+0x0002> 55 push %ebp
0x76feaba3 <+0x0003> 8b ec mov %esp,%ebp
0x76feaba5 <+0x0005> 83 ec 0c sub $0xc,%esp
0x76feaba8 <+0x0008> a1 94 4b 09 77 mov 0x77094b94,%eax
0x76feabad <+0x000d> 33 c5 xor %ebp,%eax
0x76feabaf <+0x000f> 89 45 fc mov %eax,-0x4(%ebp)
0x76feabb2 <+0x0012> 83 7d 14 00 cmpl $0x0,0x14(%ebp)
0x76feabb6 <+0x0016> 53 push %ebx
0x76feabb7 <+0x0017> 56 push %esi
0x76feabb8 <+0x0018> 57 push %edi
0x76feabb9 <+0x0019> 0f 84 b3 00 00 00 je 0x76feac72 <KERNELBASE!GetOverlappedResult+210>
0x76feabbf <+0x001f> 83 cf ff or $0xffffffff,%edi
0x76feabc2 <+0x0022> 8b 5d 08 mov 0x8(%ebp),%ebx
0x76feabc5 <+0x0025> 83 cb 01 or $0x1,%ebx
0x76feabc8 <+0x0028> 85 ff test %edi,%edi
0x76feabca <+0x002a> 0f 84 a9 00 00 00 je 0x76feac79 <KERNELBASE!GetOverlappedResult+217>
0x76feabd0 <+0x0030> b8 01 00 00 00 mov $0x1,%eax
0x76feabd5 <+0x0035> c7 45 f4 01 00 00 00 movl $0x1,-0xc(%ebp)
0x76feabdc <+0x003c> 89 45 f8 mov %eax,-0x8(%ebp)
0x76feabdf <+0x003f> 84 d8 test %bl,%al
0x76feabe1 <+0x0041> 0f 84 5e f3 03 00 je 0x77029f45 <KERNELBASE!GetCurrentProcess+43221>
0x76feabe7 <+0x0047> 6a 00 push $0x0
0x76feabe9 <+0x0049> 68 dc 10 f2 76 push $0x76f210dc
0x76feabee <+0x004e> 50 push %eax
0x76feabef <+0x004f> 68 ab ab ab ab push $0xabababab
0x76feabf4 <+0x0054> ff 15 68 80 09 77 call *0x77098068
0x76feabfa <+0x005a> 8b f0 mov %eax,%esi
0x76feabfc <+0x005c> 85 f6 test %esi,%esi
0x76feabfe <+0x005e> 74 0e je 0x76feac0e <KERNELBASE!GetOverlappedResult+110>
0x76feac00 <+0x0060> 8d 45 f4 lea -0xc(%ebp),%eax
0x76feac03 <+0x0063> 8b ce mov %esi,%ecx
0x76feac05 <+0x0065> 50 push %eax
0x76feac06 <+0x0066> ff 15 5c 8a 09 77 call *0x77098a5c
0x76feac0c <+0x006c> ff d6 call *%esi
0x76feac0e <+0x006e> 33 c0 xor %eax,%eax
0x76feac10 <+0x0070> 83 e3 fe and $0xfffffffe,%ebx
0x76feac13 <+0x0073> 89 45 f8 mov %eax,-0x8(%ebp)
0x76feac16 <+0x0076> 39 45 f4 cmp %eax,-0xc(%ebp)
0x76feac19 <+0x0079> 0f 85 26 f3 03 00 jne 0x77029f45 <KERNELBASE!GetCurrentProcess+43221>
0x76feac1f <+0x007f> 8b 75 0c mov 0xc(%ebp),%esi
0x76feac22 <+0x0082> 81 3e 03 01 00 00 cmpl $0x103,(%esi)
0x76feac28 <+0x0088> 74 26 je 0x76feac50 <KERNELBASE!GetOverlappedResult+176>
Crash:
0x76feac2a <+0x008a> 8b 45 10 mov 0x10(%ebp),%eax
0x76feac2d <+0x008d> 8b 4e 04 mov 0x4(%esi),%ecx
0x76feac30 <+0x0090> 89 08 mov %ecx,(%eax)
0x76feac32 <+0x0092> 8b 0e mov (%esi),%ecx
0x76feac34 <+0x0094> 85 c9 test %ecx,%ecx
0x76feac36 <+0x0096> 78 31 js 0x76feac69 <KERNELBASE!GetOverlappedResult+201>
0x76feac38 <+0x0098> b8 01 00 00 00 mov $0x1,%eax
0x76feac3d <+0x009d> 8b 4d fc mov -0x4(%ebp),%ecx
0x76feac40 <+0x00a0> 5f pop %edi
0x76feac41 <+0x00a1> 5e pop %esi
0x76feac42 <+0x00a2> 33 cd xor %ebp,%ecx
0x76feac44 <+0x00a4> 5b pop %ebx
0x76feac45 <+0x00a5> e8 0b f0 02 00 call 0x77019c55 <PerfIncrementULongLongCounterValue+197>
0x76feac4a <+0x00aa> 8b e5 mov %ebp,%esp
0x76feac4c <+0x00ac> 5d pop %ebp
0x76feac4d <+0x00ad> c2 10 00 ret $0x10
0x76feac50 <+0x00b0> 8b 46 10 mov 0x10(%esi),%eax
0x76feac53 <+0x00b3> 85 c0 test %eax,%eax
0x76feac55 <+0x00b5> 74 46 je 0x76feac9d <KERNELBASE!GetOverlappedResult+253>
0x76feac57 <+0x00b7> 6a 00 push $0x0
0x76feac59 <+0x00b9> 57 push %edi
0x76feac5a <+0x00ba> 50 push %eax
0x76feac5b <+0x00bb> e8 50 01 00 00 call 0x76feadb0 <WaitForSingleObjectEx>
0x76feac60 <+0x00c0> 85 c0 test %eax,%eax
0x76feac62 <+0x00c2> 74 c6 je 0x76feac2a <KERNELBASE!GetOverlappedResult+138>
0x76feac64 <+0x00c4> e9 fb f2 03 00 jmp 0x77029f64 <KERNELBASE!GetCurrentProcess+43252>
0x76feac69 <+0x00c9> e8 d2 f1 ff ff call 0x76fe9e40 <OpenThreadToken+64>
0x76feac6e <+0x00ce> 33 c0 xor %eax,%eax
0x76feac70 <+0x00d0> eb cb jmp 0x76feac3d <KERNELBASE!GetOverlappedResult+157>
0x76feac72 <+0x00d2> 33 ff xor %edi,%edi
0x76feac74 <+0x00d4> e9 49 ff ff ff jmp 0x76feabc2 <KERNELBASE!GetOverlappedResult+34>
0x76feac79 <+0x00d9> 8b 75 0c mov 0xc(%ebp),%esi
0x76feac7c <+0x00dc> 81 3e 03 01 00 00 cmpl $0x103,(%esi)
0x76feac82 <+0x00e2> 74 0a je 0x76feac8e <KERNELBASE!GetOverlappedResult+238>
0x76feac84 <+0x00e4> 33 c9 xor %ecx,%ecx
0x76feac86 <+0x00e6> 8d 45 f8 lea -0x8(%ebp),%eax
0x76feac89 <+0x00e9> f0 09 08 lock or %ecx,(%eax)
0x76feac8c <+0x00ec> eb 9c jmp 0x76feac2a <KERNELBASE!GetOverlappedResult+138>
0x76feac8e <+0x00ee> 68 e4 03 00 00 push $0x3e4
0x76feac93 <+0x00f3> ff 15 c4 80 09 77 call *0x770980c4
0x76feac99 <+0x00f9> 33 c0 xor %eax,%eax
0x76feac9b <+0x00fb> eb a0 jmp 0x76feac3d <KERNELBASE!GetOverlappedResult+157>
0x76feac9d <+0x00fd> 8b c3 mov %ebx,%eax
0x76feac9f <+0x00ff> eb b6 jmp 0x76feac57 <KERNELBASE!GetOverlappedResult+183>
0x76feaca1 <+0x0101> cc int3
0x76feaca2 <+0x0102> cc int3
0x76feaca3 <+0x0103> cc int3
0x76feaca4 <+0x0104> cc int3
0x76feaca5 <+0x0105> cc int3
0x76feaca6 <+0x0106> cc int3
0x76feaca7 <+0x0107> cc int3
0x76feaca8 <+0x0108> cc int3
0x76feaca9 <+0x0109> cc int3
0x76feacaa <+0x010a> cc int3
0x76feacab <+0x010b> cc int3
0x76feacac <+0x010c> cc int3
0x76feacad <+0x010d> cc int3
0x76feacae <+0x010e> cc int3
0x76feacaf <+0x010f> cc int3
Well I think I figured this out. Maybe. The thing I changed is that I don't move my OVERLAPPED structure any more. I can only assume that WinUsb retains a pointer to the OVERLAPPED you pass when you start the write. If it moves then presumably things break.
This isn't mentioned anywhere I can find in the documentation for OVERLAPPED, but changing my code so that the OVERLAPPED is dynamically allocated once and never moved seems to stop the crashes.
Unfortunately I never found a good way to debug it. The best way would be a reversible debugger but they don't seem to exist for Windows.
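The "allocate once and never move" idea can be sketched in portable C++ as follows. This is only an illustration of the ownership pattern and does not call WinUSB: FakeOverlapped is a hypothetical stand-in for the real OVERLAPPED structure, and Transfer and address_survives_move are invented names.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for OVERLAPPED; the real structure comes from <windows.h>.
struct FakeOverlapped {
    unsigned long internal = 0;
    void* hEvent = nullptr;
};

// Hypothetical transfer object. It owns its overlapped block on the heap, so
// moving the Transfer moves only the owning pointer, never the block itself.
class Transfer {
public:
    Transfer() : ov_(std::make_unique<FakeOverlapped>()) {}
    Transfer(Transfer&&) noexcept = default;
    Transfer& operator=(Transfer&&) noexcept = default;

    // Address that would be handed to the asynchronous API.
    FakeOverlapped* overlapped() const {
        assert(ov_ && "Transfer has been moved from");
        return ov_.get();
    }

private:
    std::unique_ptr<FakeOverlapped> ov_;
};

// Returns true if the address seen before a move is still the address in use
// after the move.
bool address_survives_move() {
    Transfer a;
    FakeOverlapped* registered = a.overlapped();  // what the driver would keep
    Transfer b = std::move(a);                    // owner moves, heap block stays
    return b.overlapped() == registered;
}
```

If the block were instead a plain member of an object that gets copied or moved (for example, stored by value in a growing container), the address captured when the transfer started could go stale while the operation is still pending.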

strange behavior when trying to compile a source with tcc against gcc generated .o file

I am trying to compile a source with tcc (ver 0.9.26) against a gcc-generated .o file, but it behaves strangely. The gcc (ver 5.3.0) is from MinGW 64-bit.
More specifically, I have the following two files (te1.c te2.c). I did the following commands on windows7 box
c:\tcc> gcc -c te1.c
c:\tcc> objcopy -O elf64-x86-64 te1.o #this is needed because te1.o from previous step is in COFF format, tcc only understand ELF format
c:\tcc> tcc te2.c te1.o
c:\tcc> te2.exe
567in dummy!!!
Note that it cut off 4 bytes from the string 1234567in dummy!!!\n. I wonder what could have gone wrong.
Thanks
Jin
========file te1.c===========
#include <stdio.h>
void dummy () {
printf1("1234567in dummy!!!\n");
}
========file te2.c===========
#include <stdio.h>
void printf1(char *p) {
printf("%s\n",p);
}
extern void dummy();
int main(int argc, char *argv[]) {
dummy();
return 0;
}
Update 1
Saw a difference in assembly between te1.o (te1.c compiled by tcc) and te1_gcc.o (te1.c compiled by gcc). In the tcc compiled, I saw lea -0x4(%rip),%rcx, on the gcc compiled, I saw lea 0x0(%rip),%rcx.
Not sure why.
C:\temp>objdump -d te1.o
te1.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <dummy>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 81 ec 20 00 00 00 sub $0x20,%rsp
b: 48 8d 0d fc ff ff ff lea -0x4(%rip),%rcx # e <dummy+0xe>
12: e8 fc ff ff ff callq 13 <dummy+0x13>
17: c9 leaveq
18: c3 retq
19: 00 00 add %al,(%rax)
1b: 00 01 add %al,(%rcx)
1d: 04 02 add $0x2,%al
1f: 05 04 03 01 50 add $0x50010304,%eax
C:\temp>objdump -d te1_gcc.o
te1_gcc.o: file format pe-x86-64
Disassembly of section .text:
0000000000000000 <dummy>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 20 sub $0x20,%rsp
8: 48 8d 0d 00 00 00 00 lea 0x0(%rip),%rcx # f <dummy+0xf>
f: e8 00 00 00 00 callq 14 <dummy+0x14>
14: 90 nop
15: 48 83 c4 20 add $0x20,%rsp
19: 5d pop %rbp
1a: c3 retq
1b: 90 nop
1c: 90 nop
1d: 90 nop
1e: 90 nop
1f: 90 nop
Update2
Using a binary editor, I changed the machine code in te1.o (produced by gcc) from lea 0(%rip),%rcx to lea -0x4(%rip),%rcx. After linking it with tcc, the resulting exe works fine.
More precisely, I did
c:\tcc> gcc -c te1.c
c:\tcc> objcopy -O elf64-x86-64 te1.o
c:\tcc> use a binary editor to change the bytes from (48 8d 0d 00 00 00 00) to (48 8d 0d fc ff ff ff)
c:\tcc> tcc te2.c te1.o
c:\tcc> te2
1234567in dummy!!!
Update 3
As requested, here is the output of objdump -r te1.o
C:\temp>gcc -c te1.c
C:\temp>objdump -r te1.o
te1.o: file format pe-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
000000000000000b R_X86_64_PC32 .rdata
0000000000000010 R_X86_64_PC32 printf1
RELOCATION RECORDS FOR [.pdata]:
OFFSET TYPE VALUE
0000000000000000 rva32 .text
0000000000000004 rva32 .text
0000000000000008 rva32 .xdata
C:\temp>objdump -d te1.o
te1.o: file format pe-x86-64
Disassembly of section .text:
0000000000000000 <dummy>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 20 sub $0x20,%rsp
8: 48 8d 0d 00 00 00 00 lea 0x0(%rip),%rcx # f <dummy+0xf>
f: e8 00 00 00 00 callq 14 <dummy+0x14>
14: 90 nop
15: 48 83 c4 20 add $0x20,%rsp
19: 5d pop %rbp
1a: c3 retq
1b: 90 nop
1c: 90 nop
1d: 90 nop
1e: 90 nop
1f: 90 nop
This has nothing to do with tcc or calling conventions. It has to do with different linker conventions for the elf64-x86-64 and pe-x86-64 formats.
With PE, the linker implicitly subtracts 4 when it calculates the final offset.
With ELF, it does not do this. Because of this, 0 is the correct initial value for PE, and -4 is correct for ELF.
Unfortunately, objcopy does not convert this, which is a bug in objcopy.
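The arithmetic behind this can be written out as a small sketch. The formulas are the usual PC-relative rules for the two formats (ELF R_X86_64_PC32 computes S + A - P; the PE/COFF 32-bit relative relocation is taken relative to the byte following the field). The function names and the addresses are invented for illustration and are not taken from the actual binaries.

```cpp
#include <cassert>
#include <cstdint>

// S = address of the symbol, A = addend stored in the instruction field,
// P = address of the 4-byte displacement field being patched.

// ELF R_X86_64_PC32: the linker stores S + A - P.
constexpr std::int64_t elf_pc32(std::int64_t S, std::int64_t A, std::int64_t P) {
    return S + A - P;
}

// PE/COFF REL32: relative to the byte after the field, so the 4 is implicit.
constexpr std::int64_t pe_rel32(std::int64_t S, std::int64_t A, std::int64_t P) {
    return S + A - (P + 4);
}

// What the CPU does with a rip-relative displacement that ends the instruction:
// rip already points past the 4-byte field when the displacement is applied.
constexpr std::int64_t resolved_target(std::int64_t P, std::int64_t disp) {
    return (P + 4) + disp;
}

void check_example() {
    const std::int64_t S = 0x402000;  // made-up address of the string
    const std::int64_t P = 0x40100b;  // made-up address of the lea displacement
    assert(resolved_target(P, pe_rel32(S, 0, P)) == S);       // PE with addend 0
    assert(resolved_target(P, elf_pc32(S, -4, P)) == S);      // ELF with addend -4
    assert(resolved_target(P, elf_pc32(S, 0, P)) == S + 4);   // converted file: 4 bytes late
}
```

The last case corresponds to the object file after objcopy: the PE addend 0 is kept but resolved with the ELF rule, so every PC-relative reference lands 4 bytes past its target. That is consistent with the string losing its first 4 characters and with the call going to 0x401004 instead of 0x401000.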
Add
extern void printf1(char *p);
to your te1.c file.
Otherwise, since there is no prototype, the compiler will assume the argument is a 32-bit integer, while pointers are 64 bits long.
Edit: this is still not working. I found out that the function never returns (since calling the printf1 a second time does nothing!). Seems that the 4 first bytes are consumed as return address or something like that. In gcc 32-bit mode it works fine.
Sounds like a calling convention problem to me but still cannot figure it out.
Another clue: calling printf from te1.c side (gcc, using tcc stdlib bindings) crashes with segv.
I disassembled the executable. The first part is the repeated call from the tcc side:
40104f: 48 8d 05 b3 0f 00 00 lea 0xfb3(%rip),%rax # 0x402009
401056: 48 89 45 f8 mov %rax,-0x8(%rbp)
40105a: 48 8b 4d f8 mov -0x8(%rbp),%rcx
40105e: e8 9d ff ff ff callq 0x401000
401063: 48 8b 4d f8 mov -0x8(%rbp),%rcx
401067: e8 94 ff ff ff callq 0x401000
40106c: 48 8b 4d f8 mov -0x8(%rbp),%rcx
401070: e8 8b ff ff ff callq 0x401000
401075: 48 8b 4d f8 mov -0x8(%rbp),%rcx
401079: e8 82 ff ff ff callq 0x401000
40107e: e8 0d 00 00 00 callq 0x401090
401083: b8 00 00 00 00 mov $0x0,%eax
401088: e9 00 00 00 00 jmpq 0x40108d
40108d: c9 leaveq
40108e: c3 retq
The second part is the repeated call (6 times) to the same function. As you can see, the address is different (shifted by 4 bytes, like your data)! It sort of works just once because the first 4 bytes of the function are the following instructions:
401000: 55 push %rbp
401001: 48 89 e5 mov %rsp,%rbp
so the stack is corrupted if those are skipped!
40109f: 48 89 45 f8 mov %rax,-0x8(%rbp)
4010a3: 48 8b 45 f8 mov -0x8(%rbp),%rax
4010a7: 48 89 c1 mov %rax,%rcx
4010aa: e8 55 ff ff ff callq 0x401004
4010af: 48 8b 45 f8 mov -0x8(%rbp),%rax
4010b3: 48 89 c1 mov %rax,%rcx
4010b6: e8 49 ff ff ff callq 0x401004
4010bb: 48 8b 45 f8 mov -0x8(%rbp),%rax
4010bf: 48 89 c1 mov %rax,%rcx
4010c2: e8 3d ff ff ff callq 0x401004
4010c7: 48 8b 45 f8 mov -0x8(%rbp),%rax
4010cb: 48 89 c1 mov %rax,%rcx
4010ce: e8 31 ff ff ff callq 0x401004
4010d3: 48 8b 45 f8 mov -0x8(%rbp),%rax
4010d7: 48 89 c1 mov %rax,%rcx
4010da: e8 25 ff ff ff callq 0x401004
4010df: 48 8b 45 f8 mov -0x8(%rbp),%rax
4010e3: 48 89 c1 mov %rax,%rcx
4010e6: e8 19 ff ff ff callq 0x401004
4010eb: 90 nop

Why does GAS inline assembly wrapped in a function generate different instructions for the caller than a pure assembly function

I've been writing some basic functions using GCC's asm to practice for an actual application.
My functions pretty, wrap, and pure generate the same instructions to unpack a 64 bit integer into a 128 bit vector. add1 and add2 which call pretty and wrap respectively also generate the same instructions. But add3 differs by saving its xmm0 register by pushing it to the stack rather than by copying it to another xmm register. This I don't understand because the compiler can see the details of pure to know none of the other xmm registers will be clobbered.
Here is the C++
#include <immintrin.h>
__m128i pretty(long long b) { return (__m128i){b,b}; }
__m128i wrap(long long b) {
asm ("mov qword ptr [rsp-0x10], rdi\n"
"vmovddup xmm0, qword ptr [rsp-0x10]\n"
:
: "r"(b)
);
}
extern "C" __m128i pure(long long b);
asm (".text\n.global pure\n\t.type pure, @function\n"
"pure:\n\t"
"mov qword ptr [rsp-0x10], rdi\n\t"
"vmovddup xmm0, qword ptr [rsp-0x10]\n\t"
"ret\n\t"
);
__m128i add1(__m128i in, long long in2) { return in + pretty(in2);}
__m128i add2(__m128i in, long long in2) { return in + wrap(in2);}
__m128i add3(__m128i in, long long in2) { return in + pure(in2);}
Compiled with g++ -c so.cpp -march=native -masm=intel -O3 -fno-inline and disassembled with objdump -d -M intel so.o | c++filt.
so.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <pure>:
0: 48 89 7c 24 f0 mov QWORD PTR [rsp-0x10],rdi
5: c5 fb 12 44 24 f0 vmovddup xmm0,QWORD PTR [rsp-0x10]
b: c3 ret
c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
0000000000000010 <pretty(long long)>:
10: 48 89 7c 24 f0 mov QWORD PTR [rsp-0x10],rdi
15: c5 fb 12 44 24 f0 vmovddup xmm0,QWORD PTR [rsp-0x10]
1b: c3 ret
1c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
0000000000000020 <wrap(long long)>:
20: 48 89 7c 24 f0 mov QWORD PTR [rsp-0x10],rdi
25: c5 fb 12 44 24 f0 vmovddup xmm0,QWORD PTR [rsp-0x10]
2b: c3 ret
2c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
0000000000000030 <add1(long long __vector(2), long long)>:
30: c5 f8 28 c8 vmovaps xmm1,xmm0
34: 48 83 ec 08 sub rsp,0x8
38: e8 00 00 00 00 call 3d <add1(long long __vector(2), long long)+0xd>
3d: 48 83 c4 08 add rsp,0x8
41: c5 f9 d4 c1 vpaddq xmm0,xmm0,xmm1
45: c3 ret
46: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
4d: 00 00 00
0000000000000050 <add2(long long __vector(2), long long)>:
50: c5 f8 28 c8 vmovaps xmm1,xmm0
54: 48 83 ec 08 sub rsp,0x8
58: e8 00 00 00 00 call 5d <add2(long long __vector(2), long long)+0xd>
5d: 48 83 c4 08 add rsp,0x8
61: c5 f9 d4 c1 vpaddq xmm0,xmm0,xmm1
65: c3 ret
66: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
6d: 00 00 00
0000000000000070 <add3(long long __vector(2), long long)>:
70: 48 83 ec 18 sub rsp,0x18
74: c5 f8 29 04 24 vmovaps XMMWORD PTR [rsp],xmm0
79: e8 00 00 00 00 call 7e <add3(long long __vector(2), long long)+0xe>
7e: c5 f9 d4 04 24 vpaddq xmm0,xmm0,XMMWORD PTR [rsp]
83: 48 83 c4 18 add rsp,0x18
87: c3 ret
GCC does not understand assembly language.
Since pure is an external function, GCC cannot determine which registers it alters, so according to the ABI it has to assume that all the xmm registers are changed.
wrap has undefined behaviour: the asm statement clobbers xmm0 and [rsp-0x10], which are not listed as clobbers or outputs (setting them to a value which may or may not depend on b), and the function has no return statement.
Edit: The ABI does not apply to inline assembly. I expect your program will not work if you remove -fno-inline from the command line.
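For comparison, here is a sketch of how wrap could describe its effects to the compiler through operand constraints instead of hard-coding rdi, xmm0 and the stack slot. It is not the asker's code: it assumes x86-64, uses AT&T syntax (GCC's default, so it is not meant to be built with -masm=intel), and replaces vmovddup with the SSE2 pair movq/punpcklqdq. The names wrap_fixed and lane are invented.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2: __m128i, _mm_storeu_si128

// Broadcast a 64-bit integer into both lanes of an SSE register.
// The compiler picks the registers; the asm only names operands %0 and %1.
__m128i wrap_fixed(long long b) {
    __m128i result;
    asm("movq %1, %0\n\t"      // copy b from a general-purpose register into the low lane
        "punpcklqdq %0, %0"    // duplicate the low lane into the high lane
        : "=x"(result)         // output: any SSE register
        : "r"(b));             // input: any general-purpose register
    return result;             // explicit return, unlike the original wrap
}

// Helper for inspection: read back one 64-bit lane (0 or 1).
long long lane(__m128i v, int i) {
    assert(i == 0 || i == 1);
    long long out[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
    return out[i];
}
```

Because the output is declared, the compiler knows which register holds the result, and the function no longer depends on what happens to be in rdi or xmm0 after inlining.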

Is this an optimization bug in g++?

I'm not sure whether I've found a bug in g++ (4.4.1-4ubuntu9), or if I'm doing
something wrong. What I believe I'm seeing is a bug introduced by enabling
optimization with g++ -O2. I've tried to distill the code down to just the
relevant parts.
When optimization is enabled, I have an ASSERT which is failing. When
optimization is disabled, the same ASSERT does not fail. I think I've tracked
it down to the optimization of one function and its callers.
The System
Language: C++
Ubuntu 9.10
g++-4.4.real (Ubuntu 4.4.1-4ubuntu9) 4.4.1
Linux 2.6.31-22-server x86_64
Optimization Enabled
Object compiled with:
g++ -DHAVE_CONFIG_H -I. -fPIC -g -O2 -MT file.o -MD -MP -MF .deps/file.Tpo -c -o file.o file.cpp
And here is the relevant code from objdump -dg file.o.
00000000000018b0 <helper_function>:
;; This function takes two parameters:
;; pointer to int: %rdi
;; pointer to int[]: %rsi
18b0: 0f b6 07 movzbl (%rdi),%eax
18b3: 83 f8 12 cmp $0x12,%eax
18b6: 74 60 je 1918 <helper_function+0x68>
18b8: 83 f8 17 cmp $0x17,%eax
18bb: 74 5b je 1918 <helper_function+0x68>
...
1918: c7 06 32 00 00 00 movl $0x32,(%rsi)
191e: 66 90 xchg %ax,%ax
1920: c3 retq
0000000000005290 <buggy_invoker>:
... snip ...
52a0: 48 81 ec c8 01 00 00 sub $0x1c8,%rsp
52a7: 48 8d 84 24 a0 01 00 lea 0x1a0(%rsp),%rax
52ae: 00
52af: 48 c7 84 24 a0 01 00 movq $0x0,0x1a0(%rsp)
52b6: 00 00 00 00 00
52bb: 48 c7 84 24 a8 01 00 movq $0x0,0x1a8(%rsp)
52c2: 00 00 00 00 00
52c7: c7 84 24 b0 01 00 00 movl $0x0,0x1b0(%rsp)
52ce: 00 00 00 00
52d2: 4c 8d 7c 24 20 lea 0x20(%rsp),%r15
52d7: 48 89 c6 mov %rax,%rsi
52da: 48 89 44 24 08 mov %rax,0x8(%rsp)
;; ***** BUG HERE *****
;; Pointer to int[] loaded into %rsi
;; But where is %rdi populated?
52df: e8 cc c5 ff ff callq 18b0 <helper_function>
0000000000005494 <perfectly_fine_invoker>:
5494: 48 83 ec 20 sub $0x20,%rsp
5498: 0f ae f0 mfence
549b: 48 8d 7c 24 30 lea 0x30(%rsp),%rdi
54a0: 48 89 e6 mov %rsp,%rsi
54a3: 48 c7 04 24 00 00 00 movq $0x0,(%rsp)
54aa: 00
54ab: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
54b2: 00 00
54b4: c7 44 24 10 00 00 00 movl $0x0,0x10(%rsp)
54bb: 00
;; Non buggy invocation here: both %rdi and %rsi loaded correctly.
54bc: e8 ef c3 ff ff callq 18b0 <helper_function>
Optimization Disabled
Now compiled with:
g++ -DHAVE_CONFIG_H -I. -fPIC -g -O0 -MT file.o -MD -MP -MF .deps/file.Tpo -c -o file.o file.cpp
0000000000008d27 <helper_function>:
;; Still the same parameters here, but it looks a little different.
... snip ...
8d2b: 48 89 7d e8 mov %rdi,-0x18(%rbp)
8d2f: 48 89 75 e0 mov %rsi,-0x20(%rbp)
8d33: 48 8b 45 e8 mov -0x18(%rbp),%rax
8d37: 0f b6 00 movzbl (%rax),%eax
8d3a: 0f b6 c0 movzbl %al,%eax
8d3d: 89 45 fc mov %eax,-0x4(%rbp)
8d40: 8b 45 fc mov -0x4(%rbp),%eax
8d43: 83 f8 17 cmp $0x17,%eax
8d46: 74 40 je 8d88 <helper_function+0x61>
...
000000000000948a <buggy_invoker>:
948a: 55 push %rbp
948b: 48 89 e5 mov %rsp,%rbp
948e: 41 54 push %r12
9490: 53 push %rbx
9491: 48 81 ec c0 01 00 00 sub $0x1c0,%rsp
9498: 48 89 bd 38 fe ff ff mov %rdi,-0x1c8(%rbp)
949f: 48 89 b5 30 fe ff ff mov %rsi,-0x1d0(%rbp)
94a6: 48 c7 45 c0 00 00 00 movq $0x0,-0x40(%rbp)
94ad: 00
94ae: 48 c7 45 c8 00 00 00 movq $0x0,-0x38(%rbp)
94b5: 00
94b6: c7 45 d0 00 00 00 00 movl $0x0,-0x30(%rbp)
94bd: 48 8d 55 c0 lea -0x40(%rbp),%rdx
94c1: 48 8b 85 38 fe ff ff mov -0x1c8(%rbp),%rax
94c8: 48 89 d6 mov %rdx,%rsi
94cb: 48 89 c7 mov %rax,%rdi
;; ***** NOT BUGGY HERE *****
;; Now, without optimization, both %rdi and %rsi loaded correctly.
94ce: e8 54 f8 ff ff callq 8d27 <helper_function>
0000000000008eec <different_perfectly_fine_invoker>:
8eec: 55 push %rbp
8eed: 48 89 e5 mov %rsp,%rbp
8ef0: 48 83 ec 30 sub $0x30,%rsp
8ef4: 48 89 7d d8 mov %rdi,-0x28(%rbp)
8ef8: 48 c7 45 e0 00 00 00 movq $0x0,-0x20(%rbp)
8eff: 00
8f00: 48 c7 45 e8 00 00 00 movq $0x0,-0x18(%rbp)
8f07: 00
8f08: c7 45 f0 00 00 00 00 movl $0x0,-0x10(%rbp)
8f0f: 48 8d 55 e0 lea -0x20(%rbp),%rdx
8f13: 48 8b 45 d8 mov -0x28(%rbp),%rax
8f17: 48 89 d6 mov %rdx,%rsi
8f1a: 48 89 c7 mov %rax,%rdi
;; Another example of non-optimized call to that function.
8f1d: e8 05 fe ff ff callq 8d27 <helper_function>
The Original C++ Code
This is a sanitized version of the original C++. I've just changed some names
and removed irrelevant code. Forgive my paranoia, I just don't want to expose
too much code from unpublished and unreleased work :-).
static void helper_function(my_struct_t *e, int *outArr)
{
unsigned char event_type = e->header.type;
if (event_type == event_A || event_type == event_B) {
outArr[0] = action_one;
} else if (event_type == event_C) {
outArr[0] = action_one;
outArr[1] = action_two;
} else if (...) { ... }
}
static void buggy_invoker(my_struct_t *e, predicate_t pred)
{
// MAX_ACTIONS is #defined to 5
int action_array[MAX_ACTIONS] = {0};
helper_function(e, action_array);
...
}
static int has_any_actions(my_struct_t *e)
{
int actions[MAX_ACTIONS] = {0};
helper_function(e, actions);
return actions[0] != 0;
}
// *** ENTRY POINT to this code is this function (note not static).
void perfectly_fine_invoker(my_struct_t e, predicate_t pred)
{
memfence();
if (has_any_actions(&e)) {
buggy_invoker(&e, pred);
}
...
}
If you think I've obfuscated or eliminated too much, let me know. Users of
this code call 'perfectly_fine_invoker'. With optimization, g++ optimizes the
'has_any_actions' function away into a direct call to 'helper_function', which
you can see in the assembly.
The Question
So, my question is, does it look like a buggy optimization to anyone else?
If it would be helpful, I could post a sanitized version of the original C++ code.
This is my first posting to Stack Overflow, so please let me know if I can do
anything to make the question clearer, or provide any additional information.
The Answer
Edit (several days after the fact):
I accepted an answer below to my question -- it was not an optimization bug in g++, I was just looking at the assembly code wrong.
However, for whoever may be viewing this question in the future, I've found the answer. I did some reading on undefined behavior in C ( http://blog.regehr.org/archives/213 and http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html ) and some of the descriptions of the compiler optimizing away functions with undefined behavior seemed eerily familiar.
I added some NULL-pointer checks to the function 'helper_function' and lo and behold... bug goes away. I should have had the NULL-pointer checks to begin with, but apparently not having them allowed g++ to do whatever it wanted (in my case, optimize away the call).
Hope this information helps someone down the road.
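A minimal sketch of what such NULL-pointer checks might look like is shown below. The real types were sanitized out of the question, so header_t, my_struct_t and the enum values are stand-ins (the values 0x12, 0x17 and 0x32 are guessed from the constants in the disassembly), and returning bool rather than void is an assumption made so that the caller can see the failure.

```cpp
#include <cassert>

// Hypothetical stand-ins for the sanitized types in the question.
struct header_t { unsigned char type; };
struct my_struct_t { header_t header; };
enum event_t { event_A = 0x12, event_B = 0x17 };  // values assumed from the disassembly
enum action_t { action_one = 0x32 };               // value assumed from the disassembly

// Returns false instead of dereferencing when either pointer is null.
bool helper_function(const my_struct_t* e, int* outArr) {
    if (e == nullptr || outArr == nullptr) {
        return false;
    }
    unsigned char event_type = e->header.type;
    if (event_type == event_A || event_type == event_B) {
        outArr[0] = action_one;
    }
    return true;
}

// Small sanity check of both paths.
void self_check() {
    int actions[5] = {0};
    my_struct_t e = {{event_A}};
    assert(!helper_function(nullptr, actions));
    assert(helper_function(&e, actions) && actions[0] == action_one);
}
```

Only the first branch of the original if/else chain is reproduced here; the remaining branches are elided in the question and are left out rather than guessed.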
I think you are looking at the wrong thing. I imagine the compiler noticed that your function is short and doesn't touch the %rdi register, so it just leaves it alone (you pass the same variable as the first parameter, which I guess is what is placed in %rdi; see page 21 of http://www.x86-64.org/documentation/abi.pdf).
If you look at the unoptimized version it saves the %rdi register on this line
9498: 48 89 bd 38 fe ff ff mov %rdi,-0x1c8(%rbp)
...and then later, just before calling helper_function, it moves the saved value into %rax, which is then moved into %rdi.
94c1: 48 8b 85 38 fe ff ff mov -0x1c8(%rbp),%rax
94c8: 48 89 d6 mov %rdx,%rsi
94cb: 48 89 c7 mov %rax,%rdi
When optimizing, the compiler just gets rid of all that moving back and forth.