The program crashes with a segmentation fault, so I ran it with gdb. I found out that a member of a class is overridden during the runtime. Few facts:
the class is not used in the function where it is overridden.
the piece of the code where it is overridden is in omp critical section.
I checked the line where the debugger stops - the variables that are accessed in this line have different memory address.
Piece of gdb code:
(gdb) p SrcVelField
$1 = (SNGM::FieldVector *) 0x555555a237a0
(gdb) p *SrcVelField
$2 = {MeshRef = 0x55555592e530, Values = {data_ = 0x555555a237f0}, NbVar = 3,
NbVal = 8820, VarName = {data_ = 0x55555592f000},
FieldName = "VelocityField"}
(gdb) p (*SrcVelField).NbVar
$3 = 3
(gdb) p &(*SrcVelField).NbVar
$4 = (unsigned int *) 0x555555a237b0
(gdb) watch -l *0x555555a237b0
Hardware watchpoint 2: -location *0x555555a237b0
(gdb) cont
Thread 1 "SNGM" hit Hardware watchpoint 2: -location *0x555555a237b0
Old value = 3
New value = -118991381
SNGM::GlobalEval<SNGM::FilterGaussian, 3u, 1u>::PerformGlobalEval ()
at ../SNGM/Numerics/GlobalEval.cpp:288
288 for (unsigned ii = 0; ii < nx; ii++){
(gdb) p &ii
$5 = (unsigned int *) 0x7fffffffc93c
(gdb) p &nx
$6 = (const unsigned int *) 0x7fffffffc950
I will be glad for hints why the value of (*SrcVelField).NbVar is changing there.
UPDATE
Disassembling the code in this line gives next:
- problematic rax value
(gdb) info registers
rax 0x555555a237b0 93824997275568
rbx 0x61a8 25000
And instructions to reach it
280 delete[] point;
0x0000555555595bcc <+1113>: cmpq $0x0,-0x58(%rbp)
0x0000555555595bd1 <+1118>: je 0x555555595adc
<PerformGlobalEvalEv._omp_fn.0(void)+873>
0x0000555555595bd7 <+1124>: mov -0x58(%rbp),%rax
0x0000555555595bdb <+1128>: mov %rax,%rdi
0x0000555555595bde <+1131>: callq 0x555555564758
0x0000555555595be3 <+1136>: jmpq 0x555555595adc
<PerformGlobalEvalEv._omp_fn.0(void)+873>
281 } //End for all particles
282
283 #pragma omp critical
0x0000555555595f5b <+2024>: callq 0x555555564ae8
0x0000555555595f78 <+2053>: callq 0x555555564a40
284 {
285
286 for (unsigned kk = 0; kk < nz; kk++){
0x0000555555595f60 <+2029>: movl $0x0,-0x15c(%rbp)
0x0000555555595f6a <+2039>: mov -0x15c(%rbp),%eax
0x0000555555595f70 <+2045>: cmp -0x148(%rbp),%eax
0x0000555555595f76 <+2051>: jb 0x555555595fed
<PerformGlobalEvalEv._omp_fn.0(void)+2170>
0x0000555555596005 <+2194>: addl $0x1,-0x15c(%rbp)
0x000055555559600c <+2201>: jmpq 0x555555595f6a
<PerformGlobalEvalEv._omp_fn.0(void)+2039>
287 for (unsigned jj = 0; jj < ny; jj++){
0x0000555555595fed <+2170>: movl $0x0,-0x158(%rbp)
0x0000555555595ff7 <+2180>: mov -0x158(%rbp),%eax
0x0000555555595ffd <+2186>: cmp -0x144(%rbp),%eax
0x0000555555596003 <+2192>: jb 0x555555596011
<PerformGlobalEvalEv._omp_fn.0(void)+2206>
0x0000555555596029 <+2230>: addl $0x1,-0x158(%rbp)
0x0000555555596030 <+2237>: jmp 0x555555595ff7
<PerformGlobalEvalEv._omp_fn.0(void)+2180>
288 for (unsigned ii = 0; ii < nx; ii++){
0x0000555555596011 <+2206>: movl $0x0,-0x154(%rbp)
0x000055555559601b <+2216>: mov -0x154(%rbp),%eax
0x0000555555596021 <+2222>: cmp -0x140(%rbp),%eax
0x0000555555596027 <+2228>: jb 0x555555596032
<PerformGlobalEvalEv._omp_fn.0(void)+2239>
=> 0x0000555555596177 <+2564>: addl $0x1,-0x154(%rbp)
This is common behaviour when running out of stack/heap size or when an array is indexed out of bounds. Check your stack and heap sizes and verify that you are not exceeding them. Then check all instances of arrays/lists and verify that they are protected against out-of-bounds errors.
Related
C++ standard says that it is unspecified whether or not a reference requires storage (3.7).. However, as far as I understand, gcc implements C++ references as pointers and as such they can be corrupted.
Is it possible to get an address of a reference in gdb and put a hardware breakpoint on that address in order to find out what corrupts the memory where the reference resides? How can one set such a breakpoint?
GDB may does hardware watchpointing. You can use command watch for this. Example:
main.cpp:
int main(int argc, char **argv)
{
int a = 0;
int& b = a;
int* c = &a;
*c = 1;
return 0;
}
Start debugging and set breakpoint on start main function and end main function:
(gdb) b main
Breakpoint 1 at 0x401bc8: file /../main.cpp, line 60.
(gdb) b main.cpp:65
Breakpoint 2 at 0x401be9: file /../main.cpp, line 65.
(gdb) r
Get address of reference b:
Breakpoint 1, main (argc=1, argv=0x7fffffffddd8) at /../main.cpp:60
60 int a = 0;
(gdb) disas /m
Dump of assembler code for function main(int, char**):
59 {
... Something code
60 int a = 0;
=> 0x0000000000401bc8 <+11>: movl $0x0,-0x14(%rbp)
61 int& b = a;
0x0000000000401bcf <+18>: lea -0x14(%rbp),%rax
0x0000000000401bd3 <+22>: mov %rax,-0x10(%rbp)
62 int* c = &a;
0x0000000000401bd7 <+26>: lea -0x14(%rbp),%rax
0x0000000000401bdb <+30>: mov %rax,-0x8(%rbp)
63 *c = 1;
0x0000000000401bdf <+34>: mov -0x8(%rbp),%rax
0x0000000000401be3 <+38>: movl $0x1,(%rax)
64
65 return 0;
0x0000000000401be9 <+44>: mov $0x0,%eax
66 }
0x0000000000401bee <+49>: pop %rbp
0x0000000000401bef <+50>: retq
End of assembler dump.
(gdb) p $rbp-0x10
$1 = (void *) 0x7fffffffdce0
p $rbp-0x10 is printing address of reference b. It is 0x7fffffffdce0.
Set this address for watching:
(gdb) watch *0x7fffffffdce0
Hardware watchpoint 3: *0x7fffffffdce0
(gdb) c
GDB break only if value is changed:
(gdb) c
Continuing.
Hardware watchpoint 3: *0x7fffffffdce0
Old value = -8752
New value = -8996
main (argc=1, argv=0x7fffffffddd8) at /../main.cpp:62
62 int* c = &a;
Sorry for my english!
According to https://www.ethicalhacker.net/columns/heffner/intro-to-assembly-and-reverse-engineering
mov 0xffffffb4,0x1
moves the number 1 into 0xffffffb4.
So, I decided to test this on my own.
In GDB, x is the command to print the value of memory address.
However, when I run
x 0x00000000004004fc
I'm not getting the value of 133 (decimal) or 85 (hexadecimal)
Instead, I'm getting 0x85f445c7. Any idea what is this?
me#box:~/c$ gdb -q test
Reading symbols from test...done.
(gdb) l
1 #include <stdio.h>
2
3 int main(){
4 int a = 1;
5 int b = 13;
6 int c = 133;
7 printf("Value of C : %d\n",c);
8 return 0;
9 }
(gdb) b 7
Breakpoint 1 at 0x400503: file test.c, line 7.
(gdb) r
Starting program: /home/me/c/test
Breakpoint 1, main () at test.c:7
7 printf("Value of C : %d\n",c);
(gdb)
Disassemble
(gdb) disas
Dump of assembler code for function main:
0x00000000004004e6 <+0>: push %rbp
0x00000000004004e7 <+1>: mov %rsp,%rbp
0x00000000004004ea <+4>: sub $0x10,%rsp
0x00000000004004ee <+8>: movl $0x1,-0x4(%rbp)
0x00000000004004f5 <+15>: movl $0xd,-0x8(%rbp)
0x00000000004004fc <+22>: movl $0x85,-0xc(%rbp)
=> 0x0000000000400503 <+29>: mov -0xc(%rbp),%eax
0x0000000000400506 <+32>: mov %eax,%esi
0x0000000000400508 <+34>: mov $0x4005a4,%edi
0x000000000040050d <+39>: mov $0x0,%eax
0x0000000000400512 <+44>: callq 0x4003c0 <printf#plt>
0x0000000000400517 <+49>: mov $0x0,%eax
0x000000000040051c <+54>: leaveq
0x000000000040051d <+55>: retq
End of assembler dump.
(gdb) x 0x00000000004004fc
0x4004fc <main+22>: 0x85f445c7
(gdb)
;DRTL
To print a value in GDB use print or (p in short form) command.
in your command
x 0x00000000004004fc
You have missed p command. You have to use x with p command pair to print value as hexadecimal format, like below:
(gdb) p/x 0x00000000004004fc
If the memory address is some pointer to some structure then you have to cast the memory location before using the pointer. For example,
struct node {
int data;
struct node *next
};
is some structure and you have the address of that structure pointer, then to view the contents of that memory location you have to use
(gdb) p *(struct node *) 0x00000000004004fc
Notable:
The command
x 0x00000000004004fc
Will look at the instruction and related data for this instruction:
0x00000000004004fc <+22>: movl $0x85,-0xc(%rbp)
... as you can see that the left column (address) is equal to the value used for the command (the address to read)
In the instruction 0x85 is clearly the destination address for the mov, and reflected in the printed value; 0x85f445c7 - which stored as MSB (most significant byte) at the address.
I've got a very simple main function, with a for loop inside it, like below:
#include<stdio.h>
int main()
{
for(int i=0;i<30;++i)
printf("%d\n",i);
return 0;
}
I tried to compile it like:
gcc 4.c -g
Then I debug it with gdb:
$ gdb a.out
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04)...
Reading symbols from a.out...done.
(gdb) list
1 #include<stdio.h>
2 int main()
3 {
4 for(int i=0;i<30;++i)
5 printf("%d\n",i);
6 return 0;
7 }
(gdb) b 5
Breakpoint 1 at 0x400537: file 4.c, line 5.
(gdb) b 6
Breakpoint 2 at 0x400555: file 4.c, line 6.
(gdb) r
Starting program: /home/a/cpp/a.out
Breakpoint 1, main () at 4.c:5
5 printf("%d\n",i);
(gdb) p i
$1 = 0
(gdb) u
0
4 for(int i=0;i<30;++i)
(gdb) u //not exiting for loop?
Breakpoint 1, main () at 4.c:5
5 printf("%d\n",i);
(gdb)
1
4 for(int i=0;i<30;++i)
(gdb) u
Seems the "u" command doesn't help to execute the whole for loop and come to next break point, but something like a "n" command.
Why? Any misunderstanding from my description?
Thanks.
It appears that gdb has to go through the loop once in order understand the loop structure.
(gdb) list
1 #include<stdio.h>
2 int main()
3 {
4 for(int i=0;i<5;++i)
5 {
6 printf("%d\n",i);
7 }
8 return 0;
9 }
10
(gdb) b main
Breakpoint 1 at 0x400535: file junk.cpp, line 4.
(gdb) b 8
Breakpoint 2 at 0x40055c: file junk.cpp, line 8.
(gdb) r
Starting program: /tmp/local/matcher_server/bin/a.out
Breakpoint 1, main () at junk.cpp:4
4 for(int i=0;i<30;++i)
(gdb) n
6 printf("%d\n",i);
(gdb) n
0
4 for(int i=0;i<30;++i)
(gdb) u
1
2
3
4
Breakpoint 2, main () at junk.cpp:8
8 return 0;
To understand why, we need to look at the assembler for main
(gdb) disass
Dump of assembler code for function main():
0x000000000040052d <+0>: push %rbp
0x000000000040052e <+1>: mov %rsp,%rbp
0x0000000000400531 <+4>: sub $0x10,%rsp
0x0000000000400535 <+8>: movl $0x0,-0x4(%rbp)
0x000000000040053c <+15>: jmp 0x400556 <main()+41>
0x000000000040053e <+17>: mov -0x4(%rbp),%eax
0x0000000000400541 <+20>: mov %eax,%esi
0x0000000000400543 <+22>: mov $0x4005f4,%edi
0x0000000000400548 <+27>: mov $0x0,%eax
0x000000000040054d <+32>: callq 0x400410 <printf#plt>
=> 0x0000000000400552 <+37>: addl $0x1,-0x4(%rbp)
0x0000000000400556 <+41>: cmpl $0x1d,-0x4(%rbp)
0x000000000040055a <+45>: jle 0x40053e <main()+17>
0x000000000040055c <+47>: mov $0x0,%eax
0x0000000000400561 <+52>: leaveq
0x0000000000400562 <+53>: retq
End of assembler dump.
and the line details from dwarfdump
.debug_line: line number info for a single cu
Source lines (from CU-DIE at .debug_info offset 0x0000000b):
<pc> [row,col] NS BB ET PE EB IS= DI= uri: "filepath"
NS new statement, BB new basic block, ET end of text sequence
PE prologue end, EB epilogue begin
IA=val ISA number, DI=val discriminator value
0x0040052d [ 3, 0] NS uri: "/tmp/local/matcher_server/bin/junk.cpp"
0x00400535 [ 4, 0] NS
0x0040053e [ 6, 0] NS DI=0x2
0x00400552 [ 4, 0] NS DI=0x2
0x00400556 [ 4, 0] DI=0x1
0x0040055c [ 8, 0] NS
0x00400561 [ 9, 0] NS
0x00400563 [ 9, 0] NS ET
The [ 3,0] column is the line and column number. As we can see the loop causes the line numbers to be non-sequential, 3,4,6,4.
I suspect that first time the program hits line 6 and the 'u' command is given gdb is confused about the loop in the DWARF symbols. On the second loop, it gets it right however. Perhaps a small bug or an artifact of how the 'u' command is implemented.
Note that gdb will still hit breakpoints during the 'u' command. In your example, you will need to remove the breakpoint on the printf.
(gdb) n
134 a = b = c = 0xdeadbeef + ((uint32_t)length) + initval;
(gdb) n
(gdb) p a
$30 = <value optimized out>
(gdb) p b
$31 = <value optimized out>
(gdb) p c
$32 = 3735928563
How can gdb optimize out my value??
It means you compiled with e.g. gcc -O3 and the gcc optimiser found that some of your variables were redundant in some way that allowed them to be optimised away. In this particular case you appear to have three variables a, b, c with the same value and presumably they can all be aliassed to a single variable. Compile with optimisation disabled, e.g. gcc -O0, if you want to see such variables (this is generally a good idea for debug builds in any case).
Minimal runnable example with disassembly analysis
As usual, I like to see some disassembly to get a better understanding of what is going on.
In this case, the insight we obtain is that if a variable is optimized to be stored only in a register rather than the stack, and then the register it was in gets overwritten, then it shows as <optimized out> as mentioned by R..
Of course, this can only happen if the variable in question is not needed anymore, otherwise the program would lose its value. Therefore it tends to happen that at the start of the function you can see the variable value, but then at the end it becomes <optimized out>.
One typical case which we often are interested in of this is that of the function arguments themselves, since these are:
always defined at the start of the function
may not get used towards the end of the function as more intermediate values are calculated.
tend to get overwritten by further function subcalls which must setup the exact same registers to satisfy the calling convention
This understanding actually has a concrete application: when using reverse debugging, you might be able to recover the value of variables of interest simply by stepping back to their last point of usage: How do I view the value of an <optimized out> variable in C++?
main.c
#include <stdio.h>
int __attribute__((noinline)) f3(int i) {
return i + 1;
}
int __attribute__((noinline)) f2(int i) {
return f3(i) + 1;
}
int __attribute__((noinline)) f1(int i) {
int j = 1, k = 2, l = 3;
i += 1;
j += f2(i);
k += f2(j);
l += f2(k);
return l;
}
int main(int argc, char *argv[]) {
printf("%d\n", f1(argc));
return 0;
}
Compile and run:
gcc -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
gdb -q -nh main.out
Then inside GDB, we have the following session:
Breakpoint 1, f1 (i=1) at main.c:13
13 i += 1;
(gdb) disas
Dump of assembler code for function f1:
=> 0x00005555555546c0 <+0>: add $0x1,%edi
0x00005555555546c3 <+3>: callq 0x5555555546b0 <f2>
0x00005555555546c8 <+8>: lea 0x1(%rax),%edi
0x00005555555546cb <+11>: callq 0x5555555546b0 <f2>
0x00005555555546d0 <+16>: lea 0x2(%rax),%edi
0x00005555555546d3 <+19>: callq 0x5555555546b0 <f2>
0x00005555555546d8 <+24>: add $0x3,%eax
0x00005555555546db <+27>: retq
End of assembler dump.
(gdb) p i
$1 = 1
(gdb) p j
$2 = 1
(gdb) n
14 j += f2(i);
(gdb) disas
Dump of assembler code for function f1:
0x00005555555546c0 <+0>: add $0x1,%edi
=> 0x00005555555546c3 <+3>: callq 0x5555555546b0 <f2>
0x00005555555546c8 <+8>: lea 0x1(%rax),%edi
0x00005555555546cb <+11>: callq 0x5555555546b0 <f2>
0x00005555555546d0 <+16>: lea 0x2(%rax),%edi
0x00005555555546d3 <+19>: callq 0x5555555546b0 <f2>
0x00005555555546d8 <+24>: add $0x3,%eax
0x00005555555546db <+27>: retq
End of assembler dump.
(gdb) p i
$3 = 2
(gdb) p j
$4 = 1
(gdb) n
15 k += f2(j);
(gdb) disas
Dump of assembler code for function f1:
0x00005555555546c0 <+0>: add $0x1,%edi
0x00005555555546c3 <+3>: callq 0x5555555546b0 <f2>
0x00005555555546c8 <+8>: lea 0x1(%rax),%edi
=> 0x00005555555546cb <+11>: callq 0x5555555546b0 <f2>
0x00005555555546d0 <+16>: lea 0x2(%rax),%edi
0x00005555555546d3 <+19>: callq 0x5555555546b0 <f2>
0x00005555555546d8 <+24>: add $0x3,%eax
0x00005555555546db <+27>: retq
End of assembler dump.
(gdb) p i
$5 = <optimized out>
(gdb) p j
$6 = 5
(gdb) n
16 l += f2(k);
(gdb) disas
Dump of assembler code for function f1:
0x00005555555546c0 <+0>: add $0x1,%edi
0x00005555555546c3 <+3>: callq 0x5555555546b0 <f2>
0x00005555555546c8 <+8>: lea 0x1(%rax),%edi
0x00005555555546cb <+11>: callq 0x5555555546b0 <f2>
0x00005555555546d0 <+16>: lea 0x2(%rax),%edi
=> 0x00005555555546d3 <+19>: callq 0x5555555546b0 <f2>
0x00005555555546d8 <+24>: add $0x3,%eax
0x00005555555546db <+27>: retq
End of assembler dump.
(gdb) p i
$7 = <optimized out>
(gdb) p j
$8 = <optimized out>
To understand what is going on, remember from the x86 Linux calling convention: What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 you should know that:
RDI contains the first argument
RDI can get destroyed in function calls
RAX contains the return value
From this we deduce that:
add $0x1,%edi
corresponds to the:
i += 1;
since i is the first argument of f1, and therefore stored in RDI.
Now, while we were at both:
i += 1;
j += f2(i);
the value of RDI hadn't been modified, and therefore GDB could just query it at anytime in those lines.
However, as soon as the f2 call is made:
the value of i is not needed anymore in the program
lea 0x1(%rax),%edi does EDI = j + RAX + 1, which both:
initializes j = 1
sets up the first argument of the next f2 call to RDI = j
Therefore, when the following line is reached:
k += f2(j);
both of the following instructions have/may have modified RDI, which is the only place i was being stored (f2 may use it as a scratch register, and lea definitely set it to RAX + 1):
0x00005555555546c3 <+3>: callq 0x5555555546b0 <f2>
0x00005555555546c8 <+8>: lea 0x1(%rax),%edi
and so RDI does not contain the value of i anymore. In fact, the value of i was completely lost! Therefore the only possible outcome is:
$3 = <optimized out>
A similar thing happens to the value of j, although j only becomes unnecessary one line later afer the call to k += f2(j);.
Thinking about j also gives us some insight on how smart GDB is. Notably, at i += 1;, the value of j had not yet materialized in any register or memory address, and GDB must have known it based solely on debug information metadata.
-O0 analysis
If we use -O0 instead of -O3 for compilation:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
then the disassembly would look like:
11 int __attribute__((noinline)) f1(int i) {
=> 0x0000555555554673 <+0>: 55 push %rbp
0x0000555555554674 <+1>: 48 89 e5 mov %rsp,%rbp
0x0000555555554677 <+4>: 48 83 ec 18 sub $0x18,%rsp
0x000055555555467b <+8>: 89 7d ec mov %edi,-0x14(%rbp)
12 int j = 1, k = 2, l = 3;
0x000055555555467e <+11>: c7 45 f4 01 00 00 00 movl $0x1,-0xc(%rbp)
0x0000555555554685 <+18>: c7 45 f8 02 00 00 00 movl $0x2,-0x8(%rbp)
0x000055555555468c <+25>: c7 45 fc 03 00 00 00 movl $0x3,-0x4(%rbp)
13 i += 1;
0x0000555555554693 <+32>: 83 45 ec 01 addl $0x1,-0x14(%rbp)
14 j += f2(i);
0x0000555555554697 <+36>: 8b 45 ec mov -0x14(%rbp),%eax
0x000055555555469a <+39>: 89 c7 mov %eax,%edi
0x000055555555469c <+41>: e8 b8 ff ff ff callq 0x555555554659 <f2>
0x00005555555546a1 <+46>: 01 45 f4 add %eax,-0xc(%rbp)
15 k += f2(j);
0x00005555555546a4 <+49>: 8b 45 f4 mov -0xc(%rbp),%eax
0x00005555555546a7 <+52>: 89 c7 mov %eax,%edi
0x00005555555546a9 <+54>: e8 ab ff ff ff callq 0x555555554659 <f2>
0x00005555555546ae <+59>: 01 45 f8 add %eax,-0x8(%rbp)
16 l += f2(k);
0x00005555555546b1 <+62>: 8b 45 f8 mov -0x8(%rbp),%eax
0x00005555555546b4 <+65>: 89 c7 mov %eax,%edi
0x00005555555546b6 <+67>: e8 9e ff ff ff callq 0x555555554659 <f2>
0x00005555555546bb <+72>: 01 45 fc add %eax,-0x4(%rbp)
17 return l;
0x00005555555546be <+75>: 8b 45 fc mov -0x4(%rbp),%eax
18 }
0x00005555555546c1 <+78>: c9 leaveq
0x00005555555546c2 <+79>: c3 retq
From this horrendous disassembly, we see that the value of RDI is moved to the stack at the very start of program execution at:
mov %edi,-0x14(%rbp)
and it then gets retrieved from memory into registers whenever needed, e.g. at:
14 j += f2(i);
0x0000555555554697 <+36>: 8b 45 ec mov -0x14(%rbp),%eax
0x000055555555469a <+39>: 89 c7 mov %eax,%edi
0x000055555555469c <+41>: e8 b8 ff ff ff callq 0x555555554659 <f2>
0x00005555555546a1 <+46>: 01 45 f4 add %eax,-0xc(%rbp)
The same basically happens to j which gets immediately pushed to the stack when when it is initialized:
0x000055555555467e <+11>: c7 45 f4 01 00 00 00 movl $0x1,-0xc(%rbp)
Therefore, it is easy for GDB to find the values of those variables at any time: they are always present in memory!
This also gives us some insight on why it is not possible to avoid <optimized out> in optimized code: since the number of registers is limited, the only way to do that would be to actually push unneeded registers to memory, which would partly defeat the benefit of -O3.
Extend the lifetime of i
If we edited f1 to return l + i as in:
int __attribute__((noinline)) f1(int i) {
int j = 1, k = 2, l = 3;
i += 1;
j += f2(i);
k += f2(j);
l += f2(k);
return l + i;
}
then we observe that this effectively extends the visibility of i until the end of the function.
This is because with this we force GCC to use an extra variable to keep i around until the end:
0x00005555555546c0 <+0>: lea 0x1(%rdi),%edx
0x00005555555546c3 <+3>: mov %edx,%edi
0x00005555555546c5 <+5>: callq 0x5555555546b0 <f2>
0x00005555555546ca <+10>: lea 0x1(%rax),%edi
0x00005555555546cd <+13>: callq 0x5555555546b0 <f2>
0x00005555555546d2 <+18>: lea 0x2(%rax),%edi
0x00005555555546d5 <+21>: callq 0x5555555546b0 <f2>
0x00005555555546da <+26>: lea 0x3(%rdx,%rax,1),%eax
0x00005555555546de <+30>: retq
which the compiler does by storing i += i in RDX at the very first instruction.
Tested in Ubuntu 18.04, GCC 7.4.0, GDB 8.1.0.
It didn't. Your compiler did, but there's still a debug symbol for the original variable name.
From https://idlebox.net/2010/apidocs/gdb-7.0.zip/gdb_9.html
The values of arguments that were not saved in their stack frames are shown as `value optimized out'.
I'm guessing you compiled with -O(somevalue) and are accessing variables a,b,c in a function where optimization has occurred.
You need to turn off the compiler optimisation.
If you are interested in a particular variable in gdb, you can delare the variable as "volatile" and recompile the code. This will make the compiler turn off compiler optimization for that variable.
volatile int quantity = 0;
Just run "export COPTS='-g -O0';" and rebuild your code. After rebuild, debug it using gdb. You'll not see such error. Thanks.
I have a simple function with an inner loop - it scales the input value, looks up an output value in a lookup table, and copies it to the destination. (ftol_ambient is a trick I copied from the web for fast conversion of float to int).
for (i = 0; i < iCount; ++i)
{
iScaled = ftol_ambient(*pSource * PRECISION3);
if (iScaled <= 0)
*pDestination = 0;
else if (iScaled >= PRECISION3)
*pDestination = 255;
else
{
iSRGB = FloatToSRGBTable3[iScaled];
*pDestination = iSRGB;
}
pSource++;
pDestination++;
}
Now my lookup table is finite, and floats are infinite, so there's a possibility of off-by-one errors. I created a copy of the function with some code to handle that case. Notice that the only difference is the added 2 lines of code - please ignore the ugly pointer casting.
for (i = 0; i < iCount; ++i)
{
iScaled = ftol_ambient(*pSource * PRECISION3);
if (iScaled <= 0)
*pDestination = 0;
else if (iScaled >= PRECISION3)
*pDestination = 255;
else
{
iSRGB = FloatToSRGBTable3[iScaled];
if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))
++iSRGB;
*pDestination = (unsigned char) iSRGB;
}
pSource++;
pDestination++;
}
Here's the strange part. I'm testing both versions with identical input of 100000 elements, repeated 100 times. On my Athlon 64 1.8 GHz (32 bit mode), the first function takes 0.231 seconds, and the second (longer) function takes 0.185 seconds. Both functions are adjacent in the same source file, so there's no possibility of different compiler settings. I've run the tests many times, reversing the order they're run in, and the timings are roughly the same every time.
I know there's a lot of mystery in modern processors, but how is this possible?
Here for comparison are the relevant assembler outputs from the Microsoft VC++6 compiler.
; 173 : for (i = 0; i < iCount; ++i)
$L4455:
; 174 : {
; 175 : iScaled = ftol_ambient(*pSource * PRECISION3);
fld DWORD PTR [esi]
fmul DWORD PTR __real#4#400b8000000000000000
fstp QWORD PTR $T5011[ebp]
; 170 : int i;
; 171 : int iScaled;
; 172 : unsigned int iSRGB;
fld QWORD PTR $T5011[ebp]
; 173 : for (i = 0; i < iCount; ++i)
fistp DWORD PTR _i$5009[ebp]
; 176 : if (iScaled <= 0)
mov edx, DWORD PTR _i$5009[ebp]
test edx, edx
jg SHORT $L4458
; 177 : *pDestination = 0;
mov BYTE PTR [ecx], 0
; 178 : else if (iScaled >= PRECISION3)
jmp SHORT $L4461
$L4458:
cmp edx, 4096 ; 00001000H
jl SHORT $L4460
; 179 : *pDestination = 255;
mov BYTE PTR [ecx], 255 ; 000000ffH
; 180 : else
jmp SHORT $L4461
$L4460:
; 181 : {
; 182 : iSRGB = FloatToSRGBTable3[iScaled];
; 183 : *pDestination = (unsigned char) iSRGB;
mov dl, BYTE PTR _FloatToSRGBTable3[edx]
mov BYTE PTR [ecx], dl
$L4461:
; 184 : }
; 185 : pSource++;
add esi, 4
; 186 : pDestination++;
inc ecx
dec edi
jne SHORT $L4455
$L4472:
; 199 : {
; 200 : iScaled = ftol_ambient(*pSource * PRECISION3);
fld DWORD PTR [esi]
fmul DWORD PTR __real#4#400b8000000000000000
fstp QWORD PTR $T4865[ebp]
; 195 : int i;
; 196 : int iScaled;
; 197 : unsigned int iSRGB;
fld QWORD PTR $T4865[ebp]
; 198 : for (i = 0; i < iCount; ++i)
fistp DWORD PTR _i$4863[ebp]
; 201 : if (iScaled <= 0)
mov edx, DWORD PTR _i$4863[ebp]
test edx, edx
jg SHORT $L4475
; 202 : *pDestination = 0;
mov BYTE PTR [edi], 0
; 203 : else if (iScaled >= PRECISION3)
jmp SHORT $L4478
$L4475:
cmp edx, 4096 ; 00001000H
jl SHORT $L4477
; 204 : *pDestination = 255;
mov BYTE PTR [edi], 255 ; 000000ffH
; 205 : else
jmp SHORT $L4478
$L4477:
; 206 : {
; 207 : iSRGB = FloatToSRGBTable3[iScaled];
xor ecx, ecx
mov cl, BYTE PTR _FloatToSRGBTable3[edx]
; 208 : if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))
mov edx, DWORD PTR _SRGBCeiling[ecx*4]
cmp edx, DWORD PTR [esi]
jg SHORT $L4481
; 209 : ++iSRGB;
inc ecx
$L4481:
; 210 : *pDestination = (unsigned char) iSRGB;
mov BYTE PTR [edi], cl
$L4478:
; 211 : }
; 212 : pSource++;
add esi, 4
; 213 : pDestination++;
inc edi
dec eax
jne SHORT $L4472
Edit: Trying to test Nils Pipenbrinck's hypothesis, I added a couple of lines before and inside of the loop of the first function:
int one = 1;
int two = 2;
if (one == two)
++iSRGB;
The run time of the first function is now down to 0.152 seconds. Interesting.
Edit 2: Nils pointed out that the comparison would be optimized out of a release build, and indeed it is. The changes in the assembly code are very subtle, I will post it here to see if it provides any clues. At this point I'm wondering if it's code alignment?
; 175 : for (i = 0; i < iCount; ++i)
$L4457:
; 176 : {
; 177 : iScaled = ftol_ambient(*pSource * PRECISION3);
fld DWORD PTR [edi]
fmul DWORD PTR __real#4#400b8000000000000000
fstp QWORD PTR $T5014[ebp]
; 170 : int i;
; 171 : int iScaled;
; 172 : int one = 1;
fld QWORD PTR $T5014[ebp]
; 173 : int two = 2;
fistp DWORD PTR _i$5012[ebp]
; 178 : if (iScaled <= 0)
mov esi, DWORD PTR _i$5012[ebp]
test esi, esi
jg SHORT $L4460
; 179 : *pDestination = 0;
mov BYTE PTR [edx], 0
; 180 : else if (iScaled >= PRECISION3)
jmp SHORT $L4463
$L4460:
cmp esi, 4096 ; 00001000H
jl SHORT $L4462
; 181 : *pDestination = 255;
mov BYTE PTR [edx], 255 ; 000000ffH
; 182 : else
jmp SHORT $L4463
$L4462:
; 183 : {
; 184 : iSRGB = FloatToSRGBTable3[iScaled];
xor ecx, ecx
mov cl, BYTE PTR _FloatToSRGBTable3[esi]
; 185 : if (one == two)
; 186 : ++iSRGB;
; 187 : *pDestination = (unsigned char) iSRGB;
mov BYTE PTR [edx], cl
$L4463:
; 188 : }
; 189 : pSource++;
add edi, 4
; 190 : pDestination++;
inc edx
dec eax
jne SHORT $L4457
My guess is, that in the first case two different branches end up in the same branch-prediction slot on the CPU. If these two branches predict different each time the code will slow down.
In the second loop, the added code may just be enough to move one of the branches to a different branch prediction slot.
To be sure you can give the Intel VTune analyzer or the AMD CodeAnalyst tool a try. These tools will show you what's exactly going on in your code.
However, keep in mind that it's most probably not worth to optimize this code further. If you tune your code to be faster on your CPU it may at the same time become slower on a different brand.
EDIT:
If you want to read on the branch-prediction give Agner Fog's excellent web-site a try: http://www.agner.org/optimize/
This pdf explains the branch-prediction slot allocation in detail: http://www.agner.org/optimize/microarchitecture.pdf
My first guess is that the branch is being predicted better in the second case. Possibly because the nested if gives whatever algorithm the processor's using more information to guess from. Just out of curiousity, what happens when you remove the line
if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))
?
How are you timing these routines? I wonder if paging or caching is having an effect on the timings? It's possible that calling the first routine loads both into memory, crosses a page boundary or causes the stack to cross into an invalid page (causing a page-in), but only the first routine pays the price.
You may want to to run through both functions once before making the calls that take the measurements to reduce the effects that virtual memory and caching might have.
Are you just testing this inner loop, or are you testing your undisclosed outer loop as well? If so, look at these three lines:
if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))
++iSRGB;
*pDestination = (unsigned char) iSRGB;
Now, it looks like *pDestination is the counter for the outer loop. So by sometimes doing an extra increment of the iSRGB value you get to skip some of the iterations in the outer loop, thereby reducing the total amount of work the code needs to do.
I once had a similar situation. I hoisted some code out of a loop to make it faster, but it got slower. Confusing. Turns out, the average number of times though the loop was less than 1.
The lesson (which you don't need, obviously) is that a change doesn't make your code faster unless you measure it actually running faster.