I have written the following function:
inline void putc(int c)
{
static Cons_serial serial;
if (serial.enabled())
serial.putc(c);
}
Where Cons_serial is a class with a non-trivial default constructor. I believe the exact class definition is not important here, but you can correct me on that. I'm compiling for 32-bit x86 with g++ using the following flags: -m32 -fno-PIC -ffreestanding -fno-rtti -fno-exceptions -fno-threadsafe-statics -O0. The generated assembly code for putc looks like this:
00100221 <_Z4putci>:
100221: 55 push %ebp
100222: 89 e5 mov %esp,%ebp
100224: 83 ec 08 sub $0x8,%esp
100227: b8 f8 02 10 00 mov $0x1002f8,%eax
10022c: 0f b6 00 movzbl (%eax),%eax
10022f: 84 c0 test %al,%al
100231: 75 18 jne 10024b <_Z4putci+0x2a>
100233: 83 ec 0c sub $0xc,%esp
100236: 68 f0 02 10 00 push $0x1002f0
10023b: e8 36 fe ff ff call 100076 <_ZN11Cons_serialC1Ev>
100240: 83 c4 10 add $0x10,%esp
100243: b8 f8 02 10 00 mov $0x1002f8,%eax
100248: c6 00 01 movb $0x1,(%eax)
10024b: 83 ec 0c sub $0xc,%esp
10024e: 68 f0 02 10 00 push $0x1002f0
100253: e8 4c fe ff ff call 1000a4 <_ZNK11Cons_serial7enabledEv>
100258: 83 c4 10 add $0x10,%esp
10025b: 84 c0 test %al,%al
10025d: 74 13 je 100272 <_Z4putci+0x51>
10025f: 83 ec 08 sub $0x8,%esp
100262: ff 75 08 pushl 0x8(%ebp)
100265: 68 f0 02 10 00 push $0x1002f0
10026a: e8 41 fe ff ff call 1000b0 <_ZN11Cons_serial4putcEi>
10026f: 83 c4 10 add $0x10,%esp
100272: 90 nop
100273: c9 leave
100274: c3 ret
During execution, the jump at 100231 is taken the first time the function runs, so the Cons_serial constructor is never called. My knowledge of x86 assembly is questionable: what do the instructions leading up to that jump actually do? I assume the code is meant to skip the constructor call on subsequent function calls. But then why is it skipped the first time the function runs as well?
EDIT: This code is part of a kernel I'm writing and I suspect the root cause might be an issue with my kernel's .bss section. Here is the linker script I use:
OUTPUT_FORMAT("elf32-i386")
ENTRY(_start)
SECTIONS
{
. = 0x100000;
.text : AT(0x100000) {
*(.text)
}
.data : SUBALIGN(2) {
*(.data);
*(.rodata*);
}
.bss : SUBALIGN(4) {
__bss_start = .;
*(.COMMON);
*(.bss*)
. = ALIGN(4);
__bss_end = .;
}
/DISCARD/ : {
*(.eh_frame)
*(.comment)
}
}
And here's the code I use to zero the .bss section:
extern uint32_t __bss_start;
extern uint32_t __bss_end;
void zero_bss()
{
for (uint32_t bss_addr = __bss_start; bss_addr < __bss_end; ++bss_addr)
*reinterpret_cast<uint8_t *>(bss_addr) = 0x00;
}
But when zero_bss runs, __bss_start is 0x27 and __bss_end is 0x101, which is not at all what I'd expect (the BSS should encompass address 0x1002f8 after all).
I've solved it now. The hint from @user3124812 was what got me there, thanks again.
My zero_bss code was faulty: I needed to take the addresses of the __bss* markers from the linker script, i.e.:
extern uint8_t __bss_start;
extern uint8_t __bss_end;
void zero_bss()
{
uint8_t *bss_start = reinterpret_cast<uint8_t *>(&__bss_start);
uint8_t *bss_end = reinterpret_cast<uint8_t *>(&__bss_end);
for (uint8_t *bss_addr = bss_start; bss_addr < bss_end; ++bss_addr)
*bss_addr = 0x00;
}
Now everything works.
Related
In a large framework which used to use many smart pointers and now uses raw pointers, I come across situations like this quite often:
class A {
public:
int* m;
A() : m() {}
};
The reason is that int* m used to be a smart pointer, so the initializer list called its default constructor. Now that int* m is a raw pointer, I am not certain whether this is equivalent to:
class A {
public:
int* m;
A() : m(nullptr) {}
};
Without the explicit nullptr, is A::m still initialized to zero? The unoptimized objdump -d output suggests yes, but I am not certain. The reason I think the answer is yes is this line in the objdump -d output (I posted more of it below):
400644: 48 c7 00 00 00 00 00 movq $0x0,(%rax)
Little program that tries to find undefined behavior:
class A {
public:
int* m;
A() : m(nullptr) {}
};
int main() {
A buf[1000000];
unsigned int count = 0;
for (unsigned int i = 0; i < 1000000; ++i) {
count += buf[i].m ? 1 : 0;
}
return count;
}
Compilation, execution, and return value:
g++ -std=c++14 -O0 foo.cpp
./a.out; echo $?
0
Relevant assembly sections from objdump -d:
00000000004005b8 <main>:
4005b8: 55 push %rbp
4005b9: 48 89 e5 mov %rsp,%rbp
4005bc: 41 54 push %r12
4005be: 53 push %rbx
4005bf: 48 81 ec 10 12 7a 00 sub $0x7a1210,%rsp
4005c6: 48 8d 85 e0 ed 85 ff lea -0x7a1220(%rbp),%rax
4005cd: bb 3f 42 0f 00 mov $0xf423f,%ebx
4005d2: 49 89 c4 mov %rax,%r12
4005d5: eb 10 jmp 4005e7 <main+0x2f>
4005d7: 4c 89 e7 mov %r12,%rdi
4005da: e8 59 00 00 00 callq 400638 <_ZN1AC1Ev>
4005df: 49 83 c4 08 add $0x8,%r12
4005e3: 48 83 eb 01 sub $0x1,%rbx
4005e7: 48 83 fb ff cmp $0xffffffffffffffff,%rbx
4005eb: 75 ea jne 4005d7 <main+0x1f>
4005ed: c7 45 ec 00 00 00 00 movl $0x0,-0x14(%rbp)
4005f4: c7 45 e8 00 00 00 00 movl $0x0,-0x18(%rbp)
4005fb: eb 23 jmp 400620 <main+0x68>
4005fd: 8b 45 e8 mov -0x18(%rbp),%eax
400600: 48 8b 84 c5 e0 ed 85 mov -0x7a1220(%rbp,%rax,8),%rax
400607: ff
400608: 48 85 c0 test %rax,%rax
40060b: 74 07 je 400614 <main+0x5c>
40060d: b8 01 00 00 00 mov $0x1,%eax
400612: eb 05 jmp 400619 <main+0x61>
400614: b8 00 00 00 00 mov $0x0,%eax
400619: 01 45 ec add %eax,-0x14(%rbp)
40061c: 83 45 e8 01 addl $0x1,-0x18(%rbp)
400620: 81 7d e8 3f 42 0f 00 cmpl $0xf423f,-0x18(%rbp)
400627: 76 d4 jbe 4005fd <main+0x45>
400629: 8b 45 ec mov -0x14(%rbp),%eax
40062c: 48 81 c4 10 12 7a 00 add $0x7a1210,%rsp
400633: 5b pop %rbx
400634: 41 5c pop %r12
400636: 5d pop %rbp
400637: c3 retq
0000000000400638 <_ZN1AC1Ev>:
400638: 55 push %rbp
400639: 48 89 e5 mov %rsp,%rbp
40063c: 48 89 7d f8 mov %rdi,-0x8(%rbp)
400640: 48 8b 45 f8 mov -0x8(%rbp),%rax
400644: 48 c7 00 00 00 00 00 movq $0x0,(%rax)
40064b: 5d pop %rbp
40064c: c3 retq
40064d: 0f 1f 00 nopl (%rax)
An empty () initializer stands for default-initialization in C++98 and for value-initialization in C++03 and later. For scalar types (including pointers), value-initialization/default-initialization leads to zero-initialization.
This means that in your case m() and m(nullptr) have exactly the same effect: in both cases m is initialized as a null pointer. It has been that way in C++ since the first standard.
I have the following code:
void function(char *str)
{
int i;
char buffer[strlen(str) + 1];
strcpy(buffer, str);
buffer[strlen(str)] = '\0';
printf("Buffer: %s\n", buffer);
}
I would expect this code to throw a compile time error, as the 'buffer' being allocated on the stack has a runtime dependent length (based on strlen()). However in GCC the compilation passes. How does this work? Is the buffer dynamically allocated, or if it is still stack local, what is the size allocated?
C99 allows variable-length arrays. Compiling your code in a mode other than C99 will not give an error either, because GCC also allows variable-length arrays as an extension.
6.19 Arrays of Variable Length:
Variable-length automatic arrays are allowed in ISO C99, and as an extension GCC accepts them in C90 mode and in C++.
By disassembling your function you could easily verify this:
$ objdump -S <yourprogram>
...
void function(char *str)
{
4011a0: 55 push %ebp
4011a1: 89 e5 mov %esp,%ebp
4011a3: 53 push %ebx
4011a4: 83 ec 24 sub $0x24,%esp
4011a7: 89 e0 mov %esp,%eax
4011a9: 89 c3 mov %eax,%ebx
int i;
char buffer[strlen(str) + 1];
4011ab: 8b 45 08 mov 0x8(%ebp),%eax
4011ae: 89 04 24 mov %eax,(%esp)
4011b1: e8 42 01 00 00 call 4012f8 <_strlen>
4011b6: 83 c0 01 add $0x1,%eax
4011b9: 89 c2 mov %eax,%edx
4011bb: 83 ea 01 sub $0x1,%edx
4011be: 89 55 f4 mov %edx,-0xc(%ebp)
4011c1: ba 10 00 00 00 mov $0x10,%edx
4011c6: 83 ea 01 sub $0x1,%edx
4011c9: 01 d0 add %edx,%eax
4011cb: b9 10 00 00 00 mov $0x10,%ecx
4011d0: ba 00 00 00 00 mov $0x0,%edx
4011d5: f7 f1 div %ecx
4011d7: 6b c0 10 imul $0x10,%eax,%eax
4011da: e8 6d 00 00 00 call 40124c <___chkstk_ms>
4011df: 29 c4 sub %eax,%esp
4011e1: 8d 44 24 08 lea 0x8(%esp),%eax
4011e5: 83 c0 00 add $0x0,%eax
4011e8: 89 45 f0 mov %eax,-0x10(%ebp)
....
The relevant instruction here is sub %eax,%esp. It shows that the stack was grown by an amount derived from whatever strlen returned earlier, which makes room for your buffer.
I read some old articles about problems with the initialization of function-scoped static variables:
C++ scoped static initialization is not thread-safe (2004), and
Function Static Variables in Multi-Threaded Environments (2006).
Then I wrote an example and checked what my compiler, gcc 4.4.7, generates:
int calcSomething(){}
void foo(){
static int x = calcSomething();
}
int main(){
foo();
return 0;
}
The output of objdump shows:
000000000040061a <_Z3foov>:
40061a: 55 push %rbp
40061b: 48 89 e5 mov %rsp,%rbp
40061e: b8 d0 0a 60 00 mov $0x600ad0,%eax
400623: 0f b6 00 movzbl (%rax),%eax
400626: 84 c0 test %al,%al
400628: 75 28 jne 400652 <_Z3foov+0x38>
40062a: bf d0 0a 60 00 mov $0x600ad0,%edi
40062f: e8 bc fe ff ff callq 4004f0 <__cxa_guard_acquire@plt>
400634: 85 c0 test %eax,%eax
400636: 0f 95 c0 setne %al
400639: 84 c0 test %al,%al
40063b: 74 15 je 400652 <_Z3foov+0x38>
40063d: e8 d2 ff ff ff callq 400614 <_Z13calcSomethingv>
400642: 89 05 90 04 20 00 mov %eax,0x200490(%rip) # 600ad8 <_ZZ3foovE1x>
400648: bf d0 0a 60 00 mov $0x600ad0,%edi
40064d: e8 be fe ff ff callq 400510 <__cxa_guard_release@plt>
400652: c9 leaveq
400653: c3 retq
Unfortunately, my knowledge of assembly code is so limited that I cannot tell what the compiler does here. Can anyone shed some light on what this assembly code does? And is it still not thread-safe? I would really appreciate some "pseudo code" showing what gcc is doing here.
EDIT-1:
As Jerry commented, I enabled optimization with -O2; the assembly code is:
0000000000400620 <_Z3foov>:
400620: 48 83 ec 08 sub $0x8,%rsp
400624: 80 3d 85 04 20 00 00 cmpb $0x0,0x200485(%rip) # 600ab0 <_ZGVZ3foovE1x>
40062b: 74 0b je 400638 <_Z3foov+0x18>
40062d: 48 83 c4 08 add $0x8,%rsp
400631: c3 retq
400632: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400638: bf b0 0a 60 00 mov $0x600ab0,%edi
40063d: e8 9e fe ff ff callq 4004e0 <__cxa_guard_acquire@plt>
400642: 85 c0 test %eax,%eax
400644: 74 e7 je 40062d <_Z3foov+0xd>
400646: c7 05 68 04 20 00 00 movl $0x0,0x200468(%rip) # 600ab8 <_ZZ3foovE1x>
40064d: 00 00 00
400650: bf b0 0a 60 00 mov $0x600ab0,%edi
400655: 48 83 c4 08 add $0x8,%rsp
400659: e9 a2 fe ff ff jmpq 400500 <__cxa_guard_release@plt>
40065e: 66 90 xchg %ax,%ax
Yes. In pseudocode (for the un-optimized case) it's something like:
if (flag_val() != 0) goto done;
if (guard_acquire() == 0) goto done;
x = calcSomething();
guard_release_and_set_flag();
// Note releasing the guard lock causes later
// calls to flag_val() to return non-zero.
done: return
The flag_val() is really a non-blocking check, apparently for efficiency to avoid calling the acquire primitive unless necessary. The flag must be set by guard_release as shown. The acquire seems to be the synchronized call to grab the lock. Only one thread will get a true value back and perform the initialization. After it releases the lock, the non-zero flag prevents any further touches of the lock.
Another interesting tidbit is that the guard data structure is 8 bytes away from the value of x itself in static memory.
Those familiar with the singleton pattern in languages with built-in threads e.g. Java will recognize this!
Addition
A bit more time now, so in a bit more detail:
000000000040061a <_Z3foov>:
; Prepare to access stack variables (never used in un-optimized code).
40061a: 55 push %rbp
40061b: 48 89 e5 mov %rsp,%rbp
; Test a byte 8 away from the static int x. This is apparently an "initialized" flag.
40061e: b8 d0 0a 60 00 mov $0x600ad0,%eax
400623: 0f b6 00 movzbl (%rax),%eax
400626: 84 c0 test %al,%al
; Goto the end of the function if the byte was non-zero.
400628: 75 28 jne 400652 <_Z3foov+0x38>
; Load the same byte address into %edi: the argument for the call to
; acquire the guard lock.
40062a: bf d0 0a 60 00 mov $0x600ad0,%edi
40062f: e8 bc fe ff ff callq 4004f0 <__cxa_guard_acquire@plt>
; Test the return value. Goto end of function if zero (non-optimized code).
400634: 85 c0 test %eax,%eax
400636: 0f 95 c0 setne %al
400639: 84 c0 test %al,%al
40063b: 74 15 je 400652 <_Z3foov+0x38>
; Call the user's initialization function and move result into x.
40063d: e8 d2 ff ff ff callq 400614 <_Z13calcSomethingv>
400642: 89 05 90 04 20 00 mov %eax,0x200490(%rip) # 600ad8 <_ZZ3foovE1x>
; Load the guard byte's address again and call the release routine.
; This must set the flag to non-zero.
400648: bf d0 0a 60 00 mov $0x600ad0,%edi
40064d: e8 be fe ff ff callq 400510 <__cxa_guard_release@plt>
; Restore state and return.
400652: c9 leaveq
400653: c3 retq
This listing, although for the LLVM compiler rather than g++ (are you running OS X? OS X aliases g++ to LLVM), agrees with the guesswork above. The set_initialized routine is setting a flag value in guard_release.
Consider the following code:
typedef void (*Fn)();
volatile long sum = 0;
inline void accu() {
sum+=4;
}
static const Fn map[4] = {&accu, &accu, &accu, &accu};
int main(int argc, char** argv) {
static const long N = 10000000L;
if (argc == 1)
{
for (long i = 0; i < N; i++)
{
accu();
accu();
accu();
accu();
}
}
else
{
for (long i = 0; i < N; i++)
{
for (int j = 0; j < 4; j++)
(*map[j])();
}
}
}
When I compiled it with:
g++ -O3 test.cpp
I expected the first branch to run faster because the compiler could inline the calls to accu, whereas the second branch cannot be inlined because accu is called through function pointers stored in an array.
But the results surprised me:
time ./a.out
real 0m0.108s
user 0m0.104s
sys 0m0.000s
time ./a.out 1
real 0m0.095s
user 0m0.088s
sys 0m0.004s
I don't understand why, so I did an objdump:
objdump -DStTrR a.out > a.s
and the disassembly doesn't seem to explain the performance result I got:
8048300 <main>:
8048300: 55 push %ebp
8048301: 89 e5 mov %esp,%ebp
8048303: 53 push %ebx
8048304: bb 80 96 98 00 mov $0x989680,%ebx
8048309: 83 e4 f0 and $0xfffffff0,%esp
804830c: 83 7d 08 01 cmpl $0x1,0x8(%ebp)
8048310: 74 27 je 8048339 <main+0x39>
8048312: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
8048318: e8 23 01 00 00 call 8048440 <_Z4accuv>
804831d: e8 1e 01 00 00 call 8048440 <_Z4accuv>
8048322: e8 19 01 00 00 call 8048440 <_Z4accuv>
8048327: e8 14 01 00 00 call 8048440 <_Z4accuv>
804832c: 83 eb 01 sub $0x1,%ebx
804832f: 90 nop
8048330: 75 e6 jne 8048318 <main+0x18>
8048332: 31 c0 xor %eax,%eax
8048334: 8b 5d fc mov -0x4(%ebp),%ebx
8048337: c9 leave
8048338: c3 ret
8048339: b8 80 96 98 00 mov $0x989680,%eax
804833e: 66 90 xchg %ax,%ax
8048340: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048346: 83 c2 04 add $0x4,%edx
8048349: 89 15 18 a0 04 08 mov %edx,0x804a018
804834f: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048355: 83 c2 04 add $0x4,%edx
8048358: 89 15 18 a0 04 08 mov %edx,0x804a018
804835e: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048364: 83 c2 04 add $0x4,%edx
8048367: 89 15 18 a0 04 08 mov %edx,0x804a018
804836d: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048373: 83 c2 04 add $0x4,%edx
8048376: 83 e8 01 sub $0x1,%eax
8048379: 89 15 18 a0 04 08 mov %edx,0x804a018
804837f: 75 bf jne 8048340 <main+0x40>
8048381: eb af jmp 8048332 <main+0x32>
8048383: 90 nop
...
8048440 <_Z4accuv>:
8048440: a1 18 a0 04 08 mov 0x804a018,%eax
8048445: 83 c0 04 add $0x4,%eax
8048448: a3 18 a0 04 08 mov %eax,0x804a018
804844d: c3 ret
804844e: 90 nop
804844f: 90 nop
It seems the direct call branch is definitely doing less than the function pointer branch.
But why does the function pointer branch run faster than the direct call?
Note that the timings above were measured with "time" only; I have also used clock_gettime to do the measurement and got similar results.
It is not completely true that the second branch cannot be inlined. In fact, all the function pointers stored in the array are visible at compile time, so the compiler can replace the indirect function calls with direct calls (and it does so). In theory it could go further and inline them (in which case the two branches would be identical), but this particular compiler is not smart enough to do so.
As a result, the first branch is optimized "better", with one exception: the compiler is not allowed to optimize away accesses to the volatile variable sum. As you can see from the disassembled code, this produces store instructions immediately followed by load instructions that depend on those stores:
mov %edx,0x804a018
mov 0x804a018,%edx
Intel's Software Optimization Manual (section 3.6.5.2) does not recommend arranging instructions like this:
... if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.
The second branch avoids this problem because of the additional call/return instructions between the store and the load, so it performs better.
A similar improvement can be made to the first branch by adding some (not very expensive) calculations in between:
long x1 = 0;
for (long i = 0; i < N; i++)
{
x1 ^= i<<8;
accu();
x1 ^= i<<1;
accu();
x1 ^= i<<2;
accu();
x1 ^= i<<4;
accu();
}
sum += x1;
I'm not sure whether I've found a bug in g++ (4.4.1-4ubuntu9), or if I'm doing
something wrong. What I believe I'm seeing is a bug introduced by enabling
optimization with g++ -O2. I've tried to distill the code down to just the
relevant parts.
When optimization is enabled, I have an ASSERT which is failing. When
optimization is disabled, the same ASSERT does not fail. I think I've tracked
it down to the optimization of one function and its callers.
The System
Language: C++
Ubuntu 9.10
g++-4.4.real (Ubuntu 4.4.1-4ubuntu9) 4.4.1
Linux 2.6.31-22-server x86_64
Optimization Enabled
Object compiled with:
g++ -DHAVE_CONFIG_H -I. -fPIC -g -O2 -MT file.o -MD -MP -MF .deps/file.Tpo -c -o file.o file.cpp
And here is the relevant code from objdump -dg file.o.
00000000000018b0 <helper_function>:
;; This function takes two parameters:
;; pointer to int: %rdi
;; pointer to int[]: %rsi
18b0: 0f b6 07 movzbl (%rdi),%eax
18b3: 83 f8 12 cmp $0x12,%eax
18b6: 74 60 je 1918 <helper_function+0x68>
18b8: 83 f8 17 cmp $0x17,%eax
18bb: 74 5b je 1918 <helper_function+0x68>
...
1918: c7 06 32 00 00 00 movl $0x32,(%rsi)
191e: 66 90 xchg %ax,%ax
1920: c3 retq
0000000000005290 <buggy_invoker>:
... snip ...
52a0: 48 81 ec c8 01 00 00 sub $0x1c8,%rsp
52a7: 48 8d 84 24 a0 01 00 lea 0x1a0(%rsp),%rax
52ae: 00
52af: 48 c7 84 24 a0 01 00 movq $0x0,0x1a0(%rsp)
52b6: 00 00 00 00 00
52bb: 48 c7 84 24 a8 01 00 movq $0x0,0x1a8(%rsp)
52c2: 00 00 00 00 00
52c7: c7 84 24 b0 01 00 00 movl $0x0,0x1b0(%rsp)
52ce: 00 00 00 00
52d2: 4c 8d 7c 24 20 lea 0x20(%rsp),%r15
52d7: 48 89 c6 mov %rax,%rsi
52da: 48 89 44 24 08 mov %rax,0x8(%rsp)
;; ***** BUG HERE *****
;; Pointer to int[] loaded into %rsi
;; But where is %rdi populated?
52df: e8 cc c5 ff ff callq 18b0 <helper_function>
0000000000005494 <perfectly_fine_invoker>:
5494: 48 83 ec 20 sub $0x20,%rsp
5498: 0f ae f0 mfence
549b: 48 8d 7c 24 30 lea 0x30(%rsp),%rdi
54a0: 48 89 e6 mov %rsp,%rsi
54a3: 48 c7 04 24 00 00 00 movq $0x0,(%rsp)
54aa: 00
54ab: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
54b2: 00 00
54b4: c7 44 24 10 00 00 00 movl $0x0,0x10(%rsp)
54bb: 00
;; Non buggy invocation here: both %rdi and %rsi loaded correctly.
54bc: e8 ef c3 ff ff callq 18b0 <helper_function>
Optimization Disabled
Now compiled with:
g++ -DHAVE_CONFIG_H -I. -fPIC -g -O0 -MT file.o -MD -MP -MF .deps/file.Tpo -c -o file.o file.cpp
0000000000008d27 <helper_function>:
;; Still the same parameters here, but it looks a little different.
... snip ...
8d2b: 48 89 7d e8 mov %rdi,-0x18(%rbp)
8d2f: 48 89 75 e0 mov %rsi,-0x20(%rbp)
8d33: 48 8b 45 e8 mov -0x18(%rbp),%rax
8d37: 0f b6 00 movzbl (%rax),%eax
8d3a: 0f b6 c0 movzbl %al,%eax
8d3d: 89 45 fc mov %eax,-0x4(%rbp)
8d40: 8b 45 fc mov -0x4(%rbp),%eax
8d43: 83 f8 17 cmp $0x17,%eax
8d46: 74 40 je 8d88 <helper_function+0x61>
...
000000000000948a <buggy_invoker>:
948a: 55 push %rbp
948b: 48 89 e5 mov %rsp,%rbp
948e: 41 54 push %r12
9490: 53 push %rbx
9491: 48 81 ec c0 01 00 00 sub $0x1c0,%rsp
9498: 48 89 bd 38 fe ff ff mov %rdi,-0x1c8(%rbp)
949f: 48 89 b5 30 fe ff ff mov %rsi,-0x1d0(%rbp)
94a6: 48 c7 45 c0 00 00 00 movq $0x0,-0x40(%rbp)
94ad: 00
94ae: 48 c7 45 c8 00 00 00 movq $0x0,-0x38(%rbp)
94b5: 00
94b6: c7 45 d0 00 00 00 00 movl $0x0,-0x30(%rbp)
94bd: 48 8d 55 c0 lea -0x40(%rbp),%rdx
94c1: 48 8b 85 38 fe ff ff mov -0x1c8(%rbp),%rax
94c8: 48 89 d6 mov %rdx,%rsi
94cb: 48 89 c7 mov %rax,%rdi
;; ***** NOT BUGGY HERE *****
;; Now, without optimization, both %rdi and %rsi loaded correctly.
94ce: e8 54 f8 ff ff callq 8d27 <helper_function>
0000000000008eec <different_perfectly_fine_invoker>:
8eec: 55 push %rbp
8eed: 48 89 e5 mov %rsp,%rbp
8ef0: 48 83 ec 30 sub $0x30,%rsp
8ef4: 48 89 7d d8 mov %rdi,-0x28(%rbp)
8ef8: 48 c7 45 e0 00 00 00 movq $0x0,-0x20(%rbp)
8eff: 00
8f00: 48 c7 45 e8 00 00 00 movq $0x0,-0x18(%rbp)
8f07: 00
8f08: c7 45 f0 00 00 00 00 movl $0x0,-0x10(%rbp)
8f0f: 48 8d 55 e0 lea -0x20(%rbp),%rdx
8f13: 48 8b 45 d8 mov -0x28(%rbp),%rax
8f17: 48 89 d6 mov %rdx,%rsi
8f1a: 48 89 c7 mov %rax,%rdi
;; Another example of non-optimized call to that function.
8f1d: e8 05 fe ff ff callq 8d27 <helper_function>
The Original C++ Code
This is a sanitized version of the original C++. I've just changed some names
and removed irrelevant code. Forgive my paranoia, I just don't want to expose
too much code from unpublished and unreleased work :-).
static void helper_function(my_struct_t *e, int *outArr)
{
unsigned char event_type = e->header.type;
if (event_type == event_A || event_type == event_B) {
outArr[0] = action_one;
} else if (event_type == event_C) {
outArr[0] = action_one;
outArr[1] = action_two;
} else if (...) { ... }
}
static void buggy_invoker(my_struct_t *e, predicate_t pred)
{
// MAX_ACTIONS is #defined to 5
int action_array[MAX_ACTIONS] = {0};
helper_function(e, action_array);
...
}
static int has_any_actions(my_struct_t *e)
{
int actions[MAX_ACTIONS] = {0};
helper_function(e, actions);
return actions[0] != 0;
}
// *** ENTRY POINT to this code is this function (note not static).
void perfectly_fine_invoker(my_struct_t e, predicate_t pred)
{
memfence();
if (has_any_actions(&e)) {
buggy_invoker(&e, pred);
}
...
}
If you think I've obfuscated or eliminated too much, let me know. Users of
this code call 'perfectly_fine_invoker'. With optimization, g++ optimizes the
'has_any_actions' function away into a direct call to 'helper_function', which
you can see in the assembly.
The Question
So, my question is, does it look like a buggy optimization to anyone else?
If it would be helpful, I could post a sanitized version of the original C++ code.
This is my first posting to Stack Overflow, so please let me know if I can do
anything to make the question clearer, or provide any additional information.
The Answer
Edit (several days after the fact):
I accepted an answer below to my question -- it was not an optimization bug in g++, I was just looking at the assembly code wrong.
However, for whoever may be viewing this question in the future, I've found the answer. I did some reading on undefined behavior in C ( http://blog.regehr.org/archives/213 and http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html ) and some of the descriptions of the compiler optimizing away functions with undefined behavior seemed eerily familiar.
I added some NULL-pointer checks to the function 'helper_function' and lo and behold... bug goes away. I should have had the NULL-pointer checks to begin with, but apparently not having them allowed g++ to do whatever it wanted (in my case, optimize away the call).
Hope this information helps someone down the road.
I think you are looking at the wrong thing. I imagine the compiler noticed that your function is short and doesn't touch the %rdi register, so it just leaves it alone (you pass the same variable as the first parameter, which I guess is what is placed in %rdi; see page 21 here: http://www.x86-64.org/documentation/abi.pdf).
If you look at the unoptimized version, it saves the %rdi register on this line:
9498: 48 89 bd 38 fe ff ff mov %rdi,-0x1c8(%rbp)
...and then later, just before calling helper_function, it moves the saved value into %rax, which is then moved into %rdi.
94c1: 48 8b 85 38 fe ff ff mov -0x1c8(%rbp),%rax
94c8: 48 89 d6 mov %rdx,%rsi
94cb: 48 89 c7 mov %rax,%rdi
When optimizing, the compiler just gets rid of all that moving back and forth.