How does `gdb` compute the bounds of a stack frame? - c++

I am debugging a new thread library, in which I set the stack register rsp manually (to switch to a user-managed stack), and then invoke a function which never returns.
When I try to get a backtrace in gdb, I get the following output.
(gdb) bt
#0 load (_m=std::memory_order_seq_cst, this=<error reading variable: Asked for position 0 of stack, stack only has 0 elements on it.>)
at /usr/include/c++/4.9/atomic:209
#1 Arachne::schedulerMainLoop () at Arachne.cc:236
#2 0x000000000040268d in Arachne::threadMainFunction (id=<optimized out>) at Arachne.cc:135
#3 0x0000000000000000 in ?? ()
How does gdb determine that the stack has 0 elements in it?
More generally, how does gdb determine how many elements the stack has?

The message Asked for position 0 of stack, stack only has 0 elements on it. comes from gdb/dwarf2/expr.c and doesn't refer to your CPU stack. It complains about the operand stack that DWARF expressions are evaluated on.
This might mean that your DWARF debug info is invalid, or that what you did with rsp confused GDB's DWARF expression evaluator.

Related

How to debug crash, when backtrace starts with zero

My long-running application crashes randomly with a segmentation fault. When trying to debug the generated core dump, I get stuck with this weird stack trace:
(gdb) bt full
#0 __memmove_ssse3 () at ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:2582
No locals.
#1 0x00000000 in ?? ()
No symbol table info available.
How can it happen that the backtrace starts at 0x00000000?
What can I do to debug this issue further? I can't run it under gdb, as it may take up to a week until the crash occurs.
Generally this means that the return address on the stack has been overwritten with 0, probably due to overrunning the end of an on-stack array. You can try building with AddressSanitizer (-fsanitize=address) on gcc or clang (if you are using them). Or you can try running under valgrind to see if it will tell you about invalid memory writes.

GDB - Reading 1 word from the stack

I want to print one word from the top of the stack in hexadecimal. To do so, I typed the following:
(gdb) x/1xw $esp
but GDB keeps replying:
0xffffffffffffe030: Cannot access memory at address 0xffffffffffffe030
The program I'm trying to debug has already pushed a value onto the stack; in case you're wondering whether I might be trying to access kernel addresses at the very beginning of the program, that's not the case.
Any idea?
0xffffffffffffe030 is a 64-bit value, so you are running in 64-bit mode. But $esp is a 32-bit register (which GDB sign-extends to 64 bits in this context), so it doesn't point at your stack. The 64-bit stack pointer is called $rsp. Try this instead:
(gdb) x/1xw $rsp

Interpreting cause of segfault with GDB nexti

So, I'm debugging a program which mysteriously crashes via a SIGSEGV. The program is single-threaded.
I've debugged many segfaults before - most of them come down to stack or heap corruption. It's usually easy to debug heap corruption problems with valgrind. Stack corruption is trickier, but you can usually at least tell that stack corruption is the problem when GDB shows that your stack is mangled.
But, here I've encountered a very bizarre problem which I've never seen before. Using GDB to go instruction by instruction, I see that the segfault happens immediately after a callq instruction. Except the callq address is not dynamically loaded from a register or from memory - it's just a static function address:
(gdb) ni
0x00007ffff659c423 223 setPolicyDocumentLoader(docLoader);
1: x/i $pc
=> 0x7ffff659c423 <WebCore::FrameLoader::init()+351>: mov %rdx,%rsi
(gdb)
0x00007ffff659c426 223 setPolicyDocumentLoader(docLoader);
1: x/i $pc
=> 0x7ffff659c426 <WebCore::FrameLoader::init()+354>: mov %rax,%rdi
(gdb)
0x00007ffff659c429 223 setPolicyDocumentLoader(docLoader);
1: x/i $pc
=> 0x7ffff659c429 <WebCore::FrameLoader::init()+357>:
callq 0x7ffff53a2d50 <_ZN7WebCore11FrameLoader23setPolicyDocumentLoaderEPNS_14DocumentLoaderE@plt>
(gdb) ni
Program received signal SIGSEGV, Segmentation fault.
0x0000000000683670 in ?? ()
1: x/i $pc
=> 0x683670: add %al,(%rax)
(gdb)
So, as soon as it executes callq to the address 0x7ffff53a2d50, it suddenly segfaults.
I realize that, in general, Stackoverflow can't possibly be too helpful for most segfaults or problems like this, because the reasons tend to be extremely specific to some particular circumstance, and usually just come down to memory corruption via programmer error.
But I still thought it would be worth asking this question because this fundamentally doesn't even make any sense to me. How is it even possible for the OS to deliver a SIGSEGV when a program executes a callq instruction to a legitimate statically determined function address?
nexti will execute the next instruction, but if the instruction is a call then it executes until the function returns. From the GDB manual:
nexti, nexti arg, ni
Execute one machine instruction, but if it is a function call, proceed until the function returns. An argument is a repeat count, as in next.
When you ran nexti on the callq, the debugger stepped over the call, and the program crashed somewhere during execution of that function. If you want to step into a function call, I'd recommend stepi when you hit callq 0x7ffff53a2d50.
as soon as it executes callq to the address 0x7ffff53a2d50, it suddenly segfaults.
This is usually caused by stack overflow.
Look for deep recursion (using the where command). Also look at your stack region (the mapping containing the current $rsp value) in the output of info proc mappings.

GDB backtrace from hard fault handler on ARM cortex M0 NRF51822

I'm on an ARM Cortex M0 (Nordic NRF51822) using the Segger JLink. When my code hard faults (say due to a dereferencing an invalid pointer), I see only the following stack trace:
(gdb) bt
#0 HardFault_HandlerC (hardfault_args=<optimized out>) at main_display.cpp:440
#1 0x00011290 in ?? ()
I have a hard fault handler installed and it can give me the lr and pc:
(gdb) p/x stacked_pc
$1 = 0x18ea6
(gdb) p/x stacked_lr
$2 = 0x18b35
And I know I can use addr2line to translate these to source code lines:
> arm-none-eabi-addr2line -e main_display.elf 0x18ea6
/Users/cmason/code/nrf/src/../libs/epaper/EPD_Display.cpp:33
> arm-none-eabi-addr2line -e main_display.elf 0x18b35
/Users/cmason/code/nrf/src/../libs/epaper/EPD.cpp:414
Can I get the rest of the backtrace somehow? If I stop at a normal breakpoint I can get a backtrace, so I know GDB can do the (somewhat involved) algorithm to unwind the stack on ARM. I understand that, in the general case, the stack may be screwed up by my code to the point where it's unreadable, but I don't think that's what's happening in this case.
I think this may be complicated by Nordic's memory protection scheme. Their bluetooth stack installs its own interrupt vector and prevents access to certain memory regions. Or maybe this is Segger's fault? On other examples of Cortex M0 do most people see regular back traces from hard faults?
Thanks!
-c
Cortex-M0 and Cortex-M3 are close enough that you can use the answer from this question:
Stack Backtrace for ARM core using GCC compiler (when there is a MSP to PSP switch)
In short: GCC has a function _Unwind_Backtrace that generates a full call stack; this needs to be hacked up a bit to simulate a backtrace from before the exception entry happened. Details are in the linked question.

c++ new operator takes lots of memory (67MB) via libstdc++

I have some issues with the new operator in libstdc++. I wrote a program in C++ and ran into memory-management problems.
After debugging with gdb to determine what is eating up my RAM, I got the following from info proc mappings:
Mapped address spaces:
Start Addr End Addr Size Offset objfile
0x400000 0x404000 0x4000 0 /home/sebastian/Developement/powerserverplus-svn/psp-job-distributor/Release/psp-job-distributor
0x604000 0x605000 0x1000 0x4000 /home/sebastian/Developement/powerserverplus-svn/psp-job-distributor/Release/psp-job-distributor
0x605000 0x626000 0x21000 0 [heap]
0x7ffff0000000 0x7ffff0021000 0x21000 0
0x7ffff0021000 0x7ffff4000000 0x3fdf000 0
0x7ffff6c7f000 0x7ffff6c80000 0x1000 0
0x7ffff6c80000 0x7ffff6c83000 0x3000 0
0x7ffff6c83000 0x7ffff6c84000 0x1000 0
0x7ffff6c84000 0x7ffff6c87000 0x3000 0
0x7ffff6c87000 0x7ffff6c88000 0x1000 0
0x7ffff6c88000 0x7ffff6c8b000 0x3000 0
0x7ffff6c8b000 0x7ffff6c8c000 0x1000 0
0x7ffff6c8c000 0x7ffff6c8f000 0x3000 0
0x7ffff6c8f000 0x7ffff6e0f000 0x180000 0 /lib/x86_64-linux-gnu/libc-2.13.so
0x7ffff6e0f000 0x7ffff700f000 0x200000 0x180000 /lib/x86_64-linux-gnu/libc-2.13.so
0x7ffff700f000 0x7ffff7013000 0x4000 0x180000 /lib/x86_64-linux-gnu/libc-2.13.so
0x7ffff7013000 0x7ffff7014000 0x1000 0x184000 /lib/x86_64-linux-gnu/libc-2.13.so
That's just a snippet of it. However, most of this is normal: some of it belongs to the code of the standard libraries, some of it is heap, and some of it is stack sections for the threads I created.
But there is one section I could not figure out the reason for:
0x7ffff0000000 0x7ffff0021000 0x21000 0
0x7ffff0021000 0x7ffff4000000 0x3fdf000 0
These two sections are created at a seemingly random time. Over several hours of debugging I found no pattern in the timing, nor any correlation with the creation of a particular thread. I set a hardware watchpoint with awatch *0x7ffff0000000 and gave it several more runs.
The two sections are created at nearly the same time, inside a function with no debug info (gdb shows it in the stack as ?? () from /lib/x86_64-linux-gnu/libc.so.6). More precisely, this is a sample stack where it occurred:
#0 0x00007ffff6d091d5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff6d0b2bd in calloc () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff7dee28f in _dl_allocate_tls () from /lib64/ld-linux-x86-64.so.2
#3 0x00007ffff77c0484 in pthread_create@@GLIBC_2.2.5 () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007ffff79d670e in Thread::start (this=0x6077c0) at ../src/Thread.cpp:42
#5 0x000000000040193d in MultiThreadedServer<JobDistributionServer_Thread>::Main (this=0x7fffffffe170) at /home/sebastian/Developement/powerserverplus-svn/mtserversock/src/MultiThreadedServer.hpp:55
#6 0x0000000000401601 in main (argc=1, argv=0x7fffffffe298) at ../src/main.cpp:29
Another example, from a different run:
#0 0x00007ffff6d091d5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff6d0bc2d in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff751607d in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x000000000040191b in MultiThreadedServer<JobDistributionServer_Thread>::Main (this=0x7fffffffe170) at /home/sebastian/Developement/powerserverplus-svn/mtserversock/src/MultiThreadedServer.hpp:53
#4 0x0000000000401601 in main (argc=1, argv=0x7fffffffe298) at ../src/main.cpp:29
So the allocation happens in the calloc called from the pthread library, or, in the other situation, in malloc called from operator new. It doesn't matter which new it is: across several runs it occurred at nearly every new or thread creation in my code. The only constant is that it always happens inside libc.so.6.
No matter at which point of the code,
no matter if used with malloc or calloc,
no matter after how much time the program ran,
no matter after how many threads have been created,
it is always that section: 0x7ffff0000000 - 0x7ffff4000000.
Every time the program runs, but each time at a different point in the program. I am really confused because it allocates 67 MB of virtual space but does not use it.
Watching the variables created there, especially the ones created when malloc or calloc were called by libc, I see that none of this space is used by them: they are created in a heap section far away from that address range (0x7ffff0000000 - 0x7ffff4000000).
Edit:
I checked the stack size of the parent process too and got a usage of 8388608 Bytes, which is 0x800000 (~8MB). To get these values I did:
pthread_attr_t attr;
size_t stacksize;
struct rlimit rlim;

pthread_attr_init(&attr);
pthread_attr_getstacksize(&attr, &stacksize);
getrlimit(RLIMIT_STACK, &rlim);

/* rlim_cur has type rlim_t, which may be wider than size_t;
   cast so it is guaranteed to fit into a size_t variable. */
printf("Resource limit: %zd\n", (size_t) rlim.rlim_cur);
printf("Stacksize: %zd\n", stacksize);

pthread_attr_destroy(&attr);
Please help me with that. I am really confused about that.
It looks like it is allocating stack space for a thread.
The space will be used as you make function calls in that thread.
But really, what it is doing is none of your business: it is part of the internal implementation of pthread_create(), and it can do anything it likes in there.