I am looking for a specific series of bytes in the memory of a program in GDB.
'find' starting above a certain address (0x104f90) works, but 'find' starting below that address does not:
(gdb) find /w 0x104f90, 0x108fe4, 0x6863203b
0x108e08
0x108e58
0x108ee8
vs
(gdb) find /w 0x104f80, 0x108fe4, 0x6863203b
Pattern not found.
The memory around this address is (seemingly) accessible by GDB:
(gdb) x/12x 0x104f80
0x104f80: 0x00000000 0x00000000 0x00000000 0x00000000
0x104f90: 0x00000000 0x00000000 0x00000000 0x00000000
0x104fa0: 0x00000000 0x00000000 0x00000000 0x00000000
And both of these addresses are on the heap -- info proc mappings says the heap runs from 0xe7000 - 0x109000
Can anyone advise on what I'm missing here? Thank you!
The problem was that I was using gdbserver, and there is a bug in gdbserver where the 'find' function gives up if it doesn't find what it's looking for in 16,000 bytes. See https://sourceware.org/pipermail/gdb-patches/2020-April/167829.html for the official bug report.
The solution is either to update to GDB 10 (which will have a fix) or to limit 'find' queries to less than 16,000 bytes.
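If upgrading is not an option, the second workaround can be scripted. The following is a minimal sketch, not part of GDB: the helper name chunked_find_commands and the 0x3000-byte chunk size are assumptions chosen for illustration, and the code only builds the command strings to paste into GDB. Consecutive ranges overlap by width - 1 bytes so that a match straddling a chunk boundary is not missed.

```python
# Hypothetical helper: split one large 'find' into several smaller ones.
# GDB size letters for the 'find' command, keyed by value width in bytes.
SIZE_LETTER = {1: "b", 2: "h", 4: "w", 8: "g"}


def chunked_find_commands(start, end, value, chunk=0x3000, width=4):
    """Yield 'find' commands that together cover [start, end] (end inclusive)."""
    if width not in SIZE_LETTER:
        raise ValueError("width must be 1, 2, 4 or 8 bytes")
    if chunk < width:
        raise ValueError("chunk must be at least as large as the value")
    # Advance by less than a full chunk so neighbouring ranges overlap.
    step = chunk - (width - 1)
    addr = start
    while addr <= end:
        last = min(addr + chunk - 1, end)
        yield f"find /{SIZE_LETTER[width]} {addr:#x}, {last:#x}, {value:#x}"
        if last == end:
            break
        addr += step


if __name__ == "__main__":
    # The failing query from the question, split into pieces below 16,000 bytes.
    for command in chunked_find_commands(0x104F80, 0x108FE4, 0x6863203B):
        print(command)
```

Because the ranges overlap, a match that lies entirely inside the overlap may be reported by two consecutive commands.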
I use OpenOCD + GDB to debug the firmware. When I type load it loads the code to the FLASH memory:
Loading section ExtFlashSection, size 0x3fe000 lma 0x90000000
Loading section .isr_vector, size 0x1f8 lma 0x8000000
Loading section .text, size 0x19978 lma 0x8000200
Loading section .rodata, size 0x52d0 lma 0x8019b78
Loading section .ARM, size 0x8 lma 0x801ee48
Loading section .init_array, size 0x10 lma 0x801ee50
Loading section .fini_array, size 0x4 lma 0x801ee60
Loading section TextFlashSection, size 0x8 lma 0x801ee64
Loading section FontFlashSection, size 0x30b1c lma 0x801ee6c
Loading section .data, size 0x9c lma 0x804f988
Start address 0x80005f0, load size 4512284
Transfer rate: 89 KB/sec, 2926 bytes/write.
However, I do not want ExtFlashSection to be loaded, because I load it manually with an external tool (it extracts the contents from the ELF and flashes them). I tried adding the NOLOAD attribute to that section, but then its contents are not present in the final ELF file (so I cannot extract them).
How can I tell GDB or OpenOCD to discard the contents of ExtFlashSection?
I have a server/client system that runs well on my machines, but it dumps core on one user's machine (OS: CentOS 5). Since I don't have access to the user's machine, I built a debug-mode binary and asked the user to try it. The crash happened again after around 2 days of running, and he sent me the core dump file. Loading the core dump file into gdb shows the crash location, but I don't understand the reason (sorry, my previous experience is mostly with Windows; I don't have much experience with Linux/gdb). I would like to have your input. Thanks!
1. the /var/log/messages at the user's machine shows the segfault:
Jan 16 09:20:39 LPZ08945 kernel: LSystem[4688]: segfault at 0000000000000000 rip 00000000080e6433 rsp 00000000f2afd4e0 error 4
This message indicates a segfault at instruction pointer 80e6433 and stack pointer f2afd4e0. It looks like the program tried to read or write address 0.
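The trailing "error 4" narrows this down. A minimal sketch, assuming the standard x86 page-fault error-code bits (bit 0: page present, bit 1: write access, bit 2: user mode) and the log format shown above; the function name decode_segfault is hypothetical:

```python
import re

# x86 page-fault error-code bits, printed after "error" in the kernel log.
PF_PRESENT = 0x1  # 0: page not present, 1: protection violation
PF_WRITE = 0x2    # 0: read access,      1: write access
PF_USER = 0x4     # 0: kernel mode,      1: user mode

# Matches the older "rip ... rsp ..." wording used in the log line above.
SEGFAULT_RE = re.compile(
    r"(?P<name>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<ip>[0-9a-f]+) rsp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Parse one kernel segfault line and decode its error code."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("not a segfault line")
    err = int(match.group("err"), 16)
    return {
        "name": match.group("name"),
        "pid": int(match.group("pid")),
        "addr": int(match.group("addr"), 16),
        "ip": int(match.group("ip"), 16),
        "sp": int(match.group("sp"), 16),
        "access": "write" if err & PF_WRITE else "read",
        "user_mode": bool(err & PF_USER),
        "page_present": bool(err & PF_PRESENT),
    }


if __name__ == "__main__":
    line = ("Jan 16 09:20:39 LPZ08945 kernel: LSystem[4688]: segfault at "
            "0000000000000000 rip 00000000080e6433 rsp 00000000f2afd4e0 error 4")
    print(decode_segfault(line))
```

For this line the decoded access is a user-mode read of a page that is not mapped, which is consistent with reading through a null pointer rather than writing through one.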
2. load the core dump file into gdb and it shows the crash location:
$gdb LSystem core.19009
GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
... (many lines of outputs from gdb omitted)
Core was generated by `./LSystem'.
Program terminated with signal 11, Segmentation fault.
#0 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214
214 memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
gdb says the crash occurs at Line 214?
3. Frame information. (at Frame #0)
(gdb) info frame
Stack level 0, frame at 0xf2afd7e0:
eip = 0x80e6433 in CLClient::connectToServer (liccomm/LClient.cpp:214); saved eip 0x80e6701
called by frame at 0xf2afd820
source language c++.
Arglist at 0xf2afd7d8, args: this=0xf2afd898, conn=11
Locals at 0xf2afd7d8, Previous frame's sp is 0xf2afd7e0
Saved registers:
ebx at 0xf2afd7cc, ebp at 0xf2afd7d8, esi at 0xf2afd7d0, edi at 0xf2afd7d4, eip at 0xf2afd7dc
The frame is at f2afd7e0; why is it different from the rsp in Part 1, which is f2afd4e0? I guess the user may have sent me a mismatched core dump file (whose pid is 19009) and /var/log/messages file (which indicates pid 4688).
4. The source
(gdb) list +
209
210 //pHost is declared as struct hostent* and 'pHost = gethostbyname(serverAddress);'
211 memset( &a4, 0, sizeof(a4) );
212 a4.sin_family = AF_INET;
213 a4.sin_port = htons( nPort );
214 memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
215
216 aalen = sizeof(a4);
217 aa = (struct sockaddr *)&a4;
I could not see anything wrong with Line 214, and this part of the code must have run many times during the 2 days of runtime.
5. The variables
Since gdb indicated that Line 214 was the culprit, I printed everything.
memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
(gdb) print a4.sin_addr
$1 = {s_addr = 0}
(gdb) print &(a4.sin_addr)
$2 = (in_addr *) 0xf2afd794
(gdb) print pHost->h_addr_list[0]
$3 = 0xa24af30 "\202}\204\250"
(gdb) print pHost->h_length
$4 = 4
(gdb) print memcpy
$5 = {} 0x2fcf90
So I basically printed everything that's at Line 214. ('pHost->h_addr_list[0]' is 'pHost->h_addr' due to '#define h_addr h_addr_list[0]')
I was not able to catch anything wrong. Did you catch anything fishy? Is it possible the memory has been corrupted somewhere else? I appreciate your help!
[edited] 6. back trace
(gdb) bt
#0 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214
#1 0x080e6701 in CLClient::connectToLMServer (this=0xf2afd898) at liccomm/LClient.cpp:121
... (Frames 2~7 omitted, not relevant)
#8 0x080937f2 in handleConnectionStarter (par=0xf3563f98) at LManager.cpp:166
#9 0xf7f5fb41 in ?? ()
#10 0xf3563f98 in ?? ()
#11 0xf2aff31c in ?? ()
#12 0x00000000 in ?? ()
I followed the nested calls. They are correct.
The problem with the memcpy is that the source location is not of the same type as the destination.
You should use inet_addr to convert addresses from string to binary:
a4.sin_addr = inet_addr(pHost->h_addr);
The previous code may not work as written, depending on the implementation (some may return struct in_addr, others will return unsigned long), but the principle is the same.
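Before replacing the memcpy, it may be worth checking what pHost->h_addr actually holds in this core dump. The sketch below only reinterprets the four bytes GDB printed for pHost->h_addr_list[0] in the question, assuming they are an IPv4 address in network byte order (which is what gethostbyname returns for AF_INET):

```python
import socket

# Bytes GDB printed for pHost->h_addr_list[0], copied with the same octal escapes.
raw = b"\202}\204\250"

# inet_ntoa turns a packed 4-byte IPv4 address into dotted-decimal text.
dotted = socket.inet_ntoa(raw)
print(len(raw), dotted)

# Round trip: packing the text again must give back the original bytes.
assert socket.inet_aton(dotted) == raw
```

If the bytes decode to a plausible address, as they appear to here, then h_addr is already in binary form and its size matches h_length = 4. In that case inet_addr, which expects a dotted-decimal string, would not be the right conversion for this value, and the cause of the crash is more likely to lie elsewhere.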
I have two C files:
a.c
void main(){
    ...
    getvtable()->function();
}
The vtable points to a function that is located in b.c:
void function(){
    malloc(42);
}
Now if I trace the program in valgrind, I get the following:
==29994== 4,155 bytes in 831 blocks are definitely lost in loss record 26 of 28
==29994== at 0x402CB7A: malloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==29994== by 0x40A24D2: (below main) (libc-start.c:226)
So the call to function is completely omitted from the stack! How is that possible? If I use GDB, a correct stack including "function" is shown.
Debug symbols are included, Linux, 32-bit.
Upd:
To answer the first question: I get the following output when debugging through Valgrind's GDB server. The breakpoint is never hit, whereas it is hit when I debug directly with GDB.
stasik@gemini:~$ gdb -q
(gdb) set confirm off
(gdb) target remote | vgdb
Remote debugging using | vgdb
relaying data between gdb and process 11665
[Switching to Thread 11665]
0x040011d0 in ?? ()
(gdb) file /home/stasik/leak.so
Reading symbols from /home/stasik/leak.so...done.
(gdb) break function
Breakpoint 1 at 0x110c: file ../../source/leakclass.c, line 32.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>end
(gdb) continue
Continuing.
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0404efcb in ?? ()
(gdb) source thread-frames.py
Stack level 0, frame at 0x42348a0:
eip = 0x404efcb; saved eip 0x4f2f544c
called by frame at 0x42348a4
Arglist at 0x4234898, args:
Locals at 0x4234898, Previous frame's sp is 0x42348a0
Saved registers:
ebp at 0x4234898, eip at 0x423489c
Stack level 1, frame at 0x42348a4:
eip = 0x4f2f544c; saved eip 0x6e492056
called by frame at 0x42348a8, caller of frame at 0x42348a0
Arglist at 0x423489c, args:
Locals at 0x423489c, Previous frame's sp is 0x42348a4
Saved registers:
eip at 0x42348a0
Stack level 2, frame at 0x42348a8:
eip = 0x6e492056; saved eip 0x205d6f66
called by frame at 0x42348ac, caller of frame at 0x42348a4
Arglist at 0x42348a0, args:
Locals at 0x42348a0, Previous frame's sp is 0x42348a8
Saved registers:
eip at 0x42348a4
Stack level 3, frame at 0x42348ac:
eip = 0x205d6f66; saved eip 0x61746144
---Type <return> to continue, or q <return> to quit---
called by frame at 0x42348b0, caller of frame at 0x42348a8
Arglist at 0x42348a4, args:
Locals at 0x42348a4, Previous frame's sp is 0x42348ac
Saved registers:
eip at 0x42348a8
Stack level 4, frame at 0x42348b0:
eip = 0x61746144; saved eip 0x65736162
called by frame at 0x42348b4, caller of frame at 0x42348ac
Arglist at 0x42348a8, args:
Locals at 0x42348a8, Previous frame's sp is 0x42348b0
Saved registers:
eip at 0x42348ac
Stack level 5, frame at 0x42348b4:
eip = 0x65736162; saved eip 0x70616d20
called by frame at 0x42348b8, caller of frame at 0x42348b0
Arglist at 0x42348ac, args:
Locals at 0x42348ac, Previous frame's sp is 0x42348b4
Saved registers:
eip at 0x42348b0
Stack level 6, frame at 0x42348b8:
eip = 0x70616d20; saved eip 0x2e646570
called by frame at 0x42348bc, caller of frame at 0x42348b4
Arglist at 0x42348b0, args:
---Type <return> to continue, or q <return> to quit---
Locals at 0x42348b0, Previous frame's sp is 0x42348b8
Saved registers:
eip at 0x42348b4
Stack level 7, frame at 0x42348bc:
eip = 0x2e646570; saved eip 0x0
called by frame at 0x42348c0, caller of frame at 0x42348b8
Arglist at 0x42348b4, args:
Locals at 0x42348b4, Previous frame's sp is 0x42348bc
Saved registers:
eip at 0x42348b8
Stack level 8, frame at 0x42348c0:
eip = 0x0; saved eip 0x0
caller of frame at 0x42348bc
Arglist at 0x42348b8, args:
Locals at 0x42348b8, Previous frame's sp is 0x42348c0
Saved registers:
eip at 0x42348bc
(gdb) continue
Continuing.
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0404efcb in ?? ()
(gdb) continue
Continuing.
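The saved eip values in this output do not look like code addresses. A quick check, sketched below under the assumption that the target is little-endian x86 (the question says 32-bit Linux), is to reinterpret them as the bytes they occupy in memory:

```python
import struct

# "saved eip" values from stack levels 0 to 6 in the transcript above.
saved_eips = [
    0x4F2F544C, 0x6E492056, 0x205D6F66, 0x61746144,
    0x65736162, 0x70616D20, 0x2E646570,
]

# Pack each 32-bit value little-endian to recover its in-memory byte order.
raw = b"".join(struct.pack("<I", value) for value in saved_eips)
text = raw.decode("ascii")
print(repr(text))
```

Readable text here would suggest that GDB is walking through a string buffer rather than real return addresses, so the frames listed in this session should probably not be trusted as a call stack.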
I see two possible reasons:
Valgrind is using a different stack unwind method than GDB
The address space layout is different while running your program under the two environments and you're only hitting stack corruption under Valgrind.
We can gain more insight by using Valgrind's builtin gdbserver.
Save this Python snippet to thread-frames.py
import gdb
f = gdb.newest_frame()
while f is not None:
    f.select()
    gdb.execute('info frame')
    f = f.older()
t.gdb
set confirm off
file MY-PROGRAM
break function
commands
silent
end
run
source thread-frames.py
quit
v.gdb
set confirm off
target remote | vgdb
file MY-PROGRAM
break function
commands
silent
end
continue
source thread-frames.py
quit
(Change MY-PROGRAM, function in the scripts above and the commands below as required)
Get details about the stack frames under GDB:
$ gdb -q -x t.gdb
Breakpoint 1 at 0x80484a2: file valgrind-unwind.c, line 6.
Stack level 0, frame at 0xbffff2f0:
eip = 0x80484a2 in function (valgrind-unwind.c:6); saved eip 0x8048384
called by frame at 0xbffff310
source language c.
Arglist at 0xbffff2e8, args:
Locals at 0xbffff2e8, Previous frame's sp is 0xbffff2f0
Saved registers:
ebp at 0xbffff2e8, eip at 0xbffff2ec
Stack level 1, frame at 0xbffff310:
eip = 0x8048384 in main (valgrind-unwind.c:17); saved eip 0xb7e33963
caller of frame at 0xbffff2f0
source language c.
Arglist at 0xbffff2f8, args:
Locals at 0xbffff2f8, Previous frame's sp is 0xbffff310
Saved registers:
ebp at 0xbffff2f8, eip at 0xbffff30c
Get the same data under Valgrind:
$ valgrind --vgdb=full --vgdb-error=0 ./MY-PROGRAM
In another shell:
$ gdb -q -x v.gdb
relaying data between gdb and process 574
0x04001020 in ?? ()
Breakpoint 1 at 0x80484a2: file valgrind-unwind.c, line 6.
Stack level 0, frame at 0xbe88e2c0:
eip = 0x80484a2 in function (valgrind-unwind.c:6); saved eip 0x8048384
called by frame at 0xbe88e2e0
source language c.
Arglist at 0xbe88e2b8, args:
Locals at 0xbe88e2b8, Previous frame's sp is 0xbe88e2c0
Saved registers:
ebp at 0xbe88e2b8, eip at 0xbe88e2bc
Stack level 1, frame at 0xbe88e2e0:
eip = 0x8048384 in main (valgrind-unwind.c:17); saved eip 0x4051963
caller of frame at 0xbe88e2c0
source language c.
Arglist at 0xbe88e2c8, args:
Locals at 0xbe88e2c8, Previous frame's sp is 0xbe88e2e0
Saved registers:
ebp at 0xbe88e2c8, eip at 0xbe88e2dc
If GDB can successfully unwind the stack while connected to "valgrind --vgdb", then it's a problem with Valgrind's stack unwind algorithm. You can inspect the "info frame" output carefully for inline and tail-call frames, or for some other reason that could throw Valgrind off. Otherwise it's probably stack corruption.
OK, compiling all .so parts and the main program with an explicit -O0 seems to solve the problem. It seems that some of the optimizations applied to the 'core' program that loads the .so (the .so itself was always compiled unoptimized) were breaking the stack.
This is tail-call optimization in action.
The function named function calls malloc as the last thing it does. The compiler sees this and tears down the stack frame for function before it calls malloc. The advantage is that when malloc returns, it returns directly to whichever function called function, i.e. it avoids malloc returning to function only to hit yet another return instruction.
In this case the optimization has avoided an unnecessary jump and made stack usage slightly more efficient, which is nice. For a recursive tail call, though, this optimization is a huge win, as it turns the recursion into something more like iteration.
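To make the recursion-to-iteration point concrete, here is an illustrative sketch. Python itself does not perform tail-call elimination, so the second function is written by hand to show roughly what an optimizing C compiler produces from the first; both function names are made up for this example.

```python
def sum_to_recursive(n, acc=0):
    """Tail-recursive sum of 1..n: the recursive call is the very last action."""
    if n == 0:
        return acc
    # Nothing remains to be done in this frame after the call returns.
    return sum_to_recursive(n - 1, acc + n)


def sum_to_iterative(n, acc=0):
    """The same computation after tail-call elimination: reuse one frame and loop."""
    while n != 0:
        # Rebind the "arguments" and jump back instead of making a new call.
        n, acc = n - 1, acc + n
    return acc


if __name__ == "__main__":
    print(sum_to_recursive(100), sum_to_iterative(100))
```

The recursive version needs one stack frame per step, while the loop needs only one frame in total. That is also why a caller can disappear from a backtrace: its frame has already been reused by the time the callee runs.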
As you've already discovered, disabling optimization makes debugging much easier. If you want to debug optimized code (for performance testing, perhaps), then, as @Zang MingJie already said, you can disable this one optimization with -fno-optimize-sibling-calls.
When loading an executable onto a board using OpenOCD and GDB, I get something similar to (snippet taken from here):
$ arm-none-eabi-gdb example.elf
(gdb) target remote localhost:3333
Remote debugging using localhost:3333
...
(gdb) monitor reset halt
...
(gdb) load
Loading section .vectors, size 0x100 lma 0x20000000
Loading section .text, size 0x5a0 lma 0x20000100
Loading section .data, size 0x18 lma 0x200006a0
Start address 0x2000061c, load size 1720
Transfer rate: 22 KB/sec, 573 bytes/write.
(gdb) continue
Continuing.
...
What does lma mean in this context?
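LMA stands for "Load Memory Address": the address at which a section is loaded. It is distinct from the VMA ("Virtual Memory Address"), the address the section has when the program runs. The two are usually the same, but they differ when, for example, initialized data is stored in ROM and copied to RAM at startup: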
http://www.embeddedrelated.com/usenet/embedded/show/77071-1.php
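The lma and size columns can be cross-checked against the rest of the load output. A minimal sketch that parses the "Loading section" lines from the question and adds up the sizes; the function name parse_load_output is hypothetical, and the regular expression assumes the exact wording shown in the transcript:

```python
import re

# Matches lines such as "Loading section .text, size 0x5a0 lma 0x20000100".
LOAD_RE = re.compile(r"Loading section (\S+), size (0x[0-9a-f]+) lma (0x[0-9a-f]+)")


def parse_load_output(text):
    """Return a (name, size, lma) tuple for each 'Loading section' line."""
    return [
        (name, int(size, 16), int(lma, 16))
        for name, size, lma in LOAD_RE.findall(text)
    ]


output = """\
Loading section .vectors, size 0x100 lma 0x20000000
Loading section .text, size 0x5a0 lma 0x20000100
Loading section .data, size 0x18 lma 0x200006a0
Start address 0x2000061c, load size 1720
"""

sections = parse_load_output(output)
total = sum(size for _, size, _ in sections)
print(total)  # 1720, matching the "load size" GDB reports

# Each section starts exactly where the previous one ends in this example.
for (_, size, lma), (_, _, next_lma) in zip(sections, sections[1:]):
    assert lma + size == next_lma
```

In this transcript the three sections are written back to back starting at 0x20000000, and their sizes add up to the reported load size.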