GDB: recover control from blocked process - C++

I have the following problem: I want to recover control of gdb when a process enters a blocking situation, e.g. a blocking function or a polling loop.
Let's illustrate it with an example: I have process A, which forks process B. B does its work and then gets stuck waiting for an event from A. I want to switch GDB to A so I can run it separately until the event is generated. However, I cannot recover control of GDB from B. Of course I can press Ctrl+C in B, which generates a SIGINT, and then change to A; but when I go back to B, even though I set handle SIGINT pass, B finishes.
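For concreteness, the setup might look roughly like this (a minimal sketch; the pipe used as the A-to-B event channel is my assumption, not necessarily the asker's actual mechanism):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    int fd[2];
    pipe(fd);                     // hypothetical event channel from A to B
    pid_t pid = fork();
    if (pid == 0) {               // process B
        // ... B does its work ...
        char buf;
        read(fd[0], &buf, 1);     // B blocks here, waiting for the event from A
        return 0;
    }
    // process A: runs for a while, then generates the event
    write(fd[1], "x", 1);         // the event B is waiting for
    waitpid(pid, nullptr, 0);
    return 0;
}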
Log:
Program received signal SIGINT, Interrupt.
[Switching to Thread 0xb68feb40 (LWP 3177)]
0xb7fdeb0c in ?? ()
(gdb) handle SIGINT pass
SIGINT is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal        Stop      Print   Pass to program Description
SIGINT        Yes       Yes     Yes             Interrupt
(gdb) c
Continuing.
[Thread 0xb7abcb40 (LWP 3178) exited]
[Thread 0xb68feb40 (LWP 3177) exited]
Couldn't get registers: No such process.
(gdb) info inferiors
  Num  Description
* 2    <null>
  1    process 3168
Is there a way to recover control of GDB and switch processes without killing B?
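One observation first: the log behaves as documented, because handle SIGINT pass means the SIGINT is delivered to B on continue, and if B installs no SIGINT handler the default disposition terminates it. Beyond that, GDB's multi-inferior facilities look like the relevant machinery here; a sketch of how one might drive them (untested against this exact setup, and details vary between GDB versions):

(gdb) set detach-on-fork off    # keep debugging both A and B after the fork
(gdb) set non-stop on           # must be set before 'run'; inferiors stop/run independently
(gdb) run
# ... B blocks waiting for the event ...
(gdb) interrupt                 # give control back to GDB without B consuming a SIGINT
(gdb) inferior 1                # switch to A
(gdb) continue &                # resume A in the background
# once A has generated the event:
(gdb) inferior 2                # switch back to B
(gdb) continue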

Related

gdb stack trace shows no backtrace - Selected thread is running

I have a program with multiple threads. This program has been compiled with the -g option and has debug symbols.
(gdb) i threads
Id Target Id Frame
1 Thread 0x7fbb5256c9c0 (LWP 15799) "main" 0x00007fbb40dfdc93 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
2 Thread 0x7fbb149ff700 (LWP 15858) "IOService" (running)
3 Thread 0x7fbb135ff700 (LWP 15860) "EvoAftManBt-mai" (running)
4 Thread 0x7fbb0cfff700 (LWP 15868) "main" (running)
5 Thread 0x7fbaecfff700 (LWP 15873) "myTimer-0" (running)
* 6 Thread 0x7fbadffff700 (LWP 15882) "stats-root" (running)
What seems odd is the status at the end of each thread's entry, which says (running)...
Now if I switch to the thread I am interested in:
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fbadffff700 (LWP 15882))](running)
(gdb) bt
Selected thread is running
No backtrace is printed.
However, if I strip the symbols from the same binary, I can see the stack trace, which is baffling.
Any pointers?
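One guess, since the question doesn't show how GDB was attached: "(running)" and "Selected thread is running" are what GDB prints in non-stop mode for threads that have not been stopped, and a running thread cannot be backtraced. Stopping the threads first should help; a sketch:

(gdb) interrupt -a      # in non-stop mode, stop all running threads
(gdb) thread 6
(gdb) bt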

Segfault process id and core dump process id are different. Why?

In the Linux messages file, I notice that a segfault is reported for process 14947, but I did not get a core dump for process 14947; instead I got 14069.core (its generation time matches the time the segfault was hit).
Then I open the core in gdb and find:
Program terminated with signal 11, Segmentation fault.
[New process 14947]
[New process 26131]
[New process 26130]
[New process 26129]
[New process 26128]
[New process 14945]
[New process 14842]
[New process 14726]
[New process 14598]
[New process 14069]
When I run "info thread", I get:
(gdb) info thread
10 process 14069 0xffffe410 in __kernel_vsyscall ()
9 process 14598 0xffffe410 in __kernel_vsyscall ()
8 process 14726 0xffffe410 in __kernel_vsyscall ()
7 process 14842 0xffffe410 in __kernel_vsyscall ()
6 process 14945 0xffffe410 in __kernel_vsyscall ()
5 process 26128 0xffffe410 in __kernel_vsyscall ()
4 process 26129 0xffffe410 in __kernel_vsyscall ()
3 process 26130 0xffffe410 in __kernel_vsyscall ()
2 process 26131 0xffffe410 in __kernel_vsyscall ()
* 1 process 14947 0x006a8300 in pthread_mutex_lock ()
So here go my questions:
Why does the core dump file name not match the segfault process ID from the messages file?
I thought the core dump was for a single process, so why are there so many lines like "[New process 26130]"?
Why does "info thread" display processes rather than threads?
Thanks!
PS: My OS is RHEL 5.
In Linux, threads are implemented as light-weight processes (processes whose virtual memory is marked as shared with the parent process rather than copy-on-write), and hence the process IDs that you see listed are really thread IDs. This is just a guess, but the ID in the core file name is probably that of the thread that handled the signal, which need not be the same as the main thread.
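To look at all of this from the core, something along these lines should work (the binary path is a placeholder):

$ gdb /path/to/your/binary 14069.core
(gdb) thread apply all bt    # stack of every thread in the core
(gdb) thread 1               # GDB marks the thread that took the signal with '*'
(gdb) bt full                # crashing thread's stack with local variables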

C++ Linux Binary terminated with signal SIGKILL - why? (loaded in GDB)

So I fire up my C++ application in GDB, and when it quits, I basically get:
[Thread 0x7fff76e07700 (LWP 6170) exited]
[Thread 0x7fff76f08700 (LWP 6169) exited]
[Thread 0x7fff77009700 (LWP 6168) exited]
...
Program terminated with signal SIGKILL, Killed. The program no longer exists.
(gdb)
I literally have no idea why this is occurring. Why can't I do a backtrace to see how it exited? Anyone have any ideas? It should never end :(
Thanks!
I literally have no idea why this is occurring,
This usually means that either
some other process executed a kill -9 <your-pid>, or
the kernel OOM killer decided that your process consumed too many resources, and terminated it (effectively the kernel executed kill -9 for it). You should look in /var/log/messages (/var/log/syslog on Ubuntu variants) for traces of that -- the kernel usually logs a message when it OOMs some process.
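For example, something along these lines should surface OOM-killer traces (the exact log wording varies by kernel version):

$ dmesg | grep -iE 'out of memory|killed process'
$ grep -i 'killed process' /var/log/messages    # /var/log/syslog on Ubuntu variants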
why can't I do a backtrace to see how it exited?
Because in order to see a backtrace, the process must exist. If it doesn't exist, it doesn't have a stack, and so can't have a backtrace.
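If you need a post-mortem stack from a process that something keeps killing, one workaround is to snapshot it while it is still alive, e.g. with gcore (shipped with gdb), and debug the resulting core afterwards; <pid> below is a placeholder:

$ gcore <pid>    # writes core.<pid> without killing the process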
If you are using Unix/Linux, you should also be able to type dmesg in your terminal and see the cause of the process termination. In my case it was indeed OOM. Here is a screenshot of my kernel log from shortly after the termination.
It is possible that the process ran into the CPU-time ulimit. Check with ulimit -a, in the environment where the process is actually started, whether "cpu time" is set to anything other than "unlimited".
In my case it was a crash (an access violation). Even with GDB attached I couldn't catch the violation.
Hope it helps.

OpenMP - Hanging during execution

I'm experiencing inconsistent behavior in a program parallelized with OpenMP.
When I run it, it prints out its current stage, so the expected output is: "2 3 4 5" etc.
Time between the first few stages is usually 1 to 2 seconds (when running in parallel on 4 cores).
However, without recompiling or altering anything, sometimes when I run the software it hangs right after printing 2 (which is printed before the first parallel code is executed).
It doesn't become slow; it literally stops computing. I've run this under gdb and confirmed that it hangs inside OpenMP:
(there are more than 4 threads because of hyperthreading)
[New Thread 0x7ffff6c78700 (LWP 25878)]
[New Thread 0x7ffff6477700 (LWP 25879)]
[New Thread 0x7ffff5c76700 (LWP 25880)]
[New Thread 0x7ffff5475700 (LWP 25881)]
[New Thread 0x7ffff4c74700 (LWP 25882)]
[New Thread 0x7ffff4473700 (LWP 25883)]
[New Thread 0x7ffff3c72700 (LWP 25884)]
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7641fd4 in ?? () from /usr/lib/libgomp.so.1
(gdb) up
#1 0x00007ffff7640a9e in ?? () from /usr/lib/libgomp.so.1
(gdb)
#2 0x0000000000408ae8 in Redcraft::createStructures (this=0x7fffffffd8d0) at source/redcraft.cpp:512
512 #pragma omp parallel for private(node)
Originally the pragma specified schedule(dynamic), but keeping or removing that makes no difference to this hang.
Lastly, I tried enabling/disabling omp_set_dynamic() and that had no effect either.
Any suggestions for debugging?
This usually happens when there is a data race. You'll have to post the code block that is being parallelized; what needs to be worked out is how the threads are using the data. Rerunning without recompiling doesn't guarantee the same thread execution sequence, which is why these kinds of problems show up intermittently. Are you working with files? You'll have to close them before rerunning.
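To make the data-race point concrete, here is a made-up minimal example (not the asker's code) of a race that shows up as a hang rather than as wrong results: one thread spins on a plain shared flag set by another, the compiler is allowed to hoist the load out of the loop, and the spinning thread then never reaches the implicit barrier at the end of the parallel region.

#include <omp.h>
#include <cstdio>

int main() {
    bool ready = false;                  // shared and unsynchronized: a data race
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            ready = true;                // "producer" signals the event
        } else {
            while (!ready) { }           // "consumer" may spin forever: with
                                         // optimization the load can be hoisted
        }
    }                                    // implicit barrier: never reached if the consumer spins
    std::printf("done\n");
    return 0;
}

When the program does hang, (gdb) thread apply all bt shows where every OpenMP worker is parked, which usually narrows down which construct the threads are stuck in.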