gdb stack trace shows no backtrace - "Selected thread is running"

I have a program with multiple threads. This program has been compiled with the -g option and has debug symbols.
(gdb) i threads
Id Target Id Frame
1 Thread 0x7fbb5256c9c0 (LWP 15799) "main" 0x00007fbb40dfdc93 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
2 Thread 0x7fbb149ff700 (LWP 15858) "IOService" (running)
3 Thread 0x7fbb135ff700 (LWP 15860) "EvoAftManBt-mai" (running)
4 Thread 0x7fbb0cfff700 (LWP 15868) "main" (running)
5 Thread 0x7fbaecfff700 (LWP 15873) "myTimer-0" (running)
* 6 Thread 0x7fbadffff700 (LWP 15882) "stats-root" (running)
What seems odd is that every thread except thread 1 shows the status (running) where its frame should be.
Now if I switch to the thread I am interested in:
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fbadffff700 (LWP 15882))](running)
(gdb) bt
Selected thread is running
No backtrace is printed.
However, if I strip the symbols from the same binary, I can see the stack trace, which is baffling.
Any pointers?

Related

How to get gdb to break on "CHKP: Bounds check error" from icc `-check-pointers=write`

The Intel icc compiler has a run-time check feature, -check-pointers=write, that does some sort of magic to check whether a pointer writes beyond the data it is supposed to. When I run my code with it, I get hundreds of these errors rolling by in gdb. I would like gdb to break on the first instance of this error, but it is not implemented as an exception or signal, so catch throw and catch signal don't work, and I have no idea whether there is a function name associated with this feature.
Is there any way to have the debugger "break" when the run-time checker hits it?
The -check-pointers feature pulls in code from libchkp.so, and all of its functions have chkp in their names. Searching the functions in gdb with info functions chkp showed that the traceback function is called "chkp_print_traceback", so this sets a breakpoint that is hit when the traceback is printed:
break chkp_print_traceback
and now it stops!
[New Thread 0x7fffce34c700 (LWP 41385)]
[New Thread 0x7fffceb4d700 (LWP 41384)]
[New Thread 0x7fffd034e700 (LWP 41383)]
CHKP: Bounds check error ptr=0x7ffe24685870 sz=4 lb=0x7ffe24685860 ub=0x7ffe2468586f loc=0xb170b0
[New Thread 0x7ffec47fc700 (LWP 41621)]
[New Thread 0x7ffe29fff700 (LWP 41622)]
[New Thread 0x7ffed47fe700 (LWP 41603)]
[New Thread 0x7ffecc7fe700 (LWP 41605)]
[New Thread 0x7ffef07f8700 (LWP 41598)]
[New Thread 0x7fff147f8700 (LWP 41597)]
[New Thread 0x7fff387f8700 (LWP 41595)]
[New Thread 0x7fff687f8700 (LWP 41594)]
[New Thread 0x7fff707f8700 (LWP 41590)]
[New Thread 0x7fff907f8700 (LWP 41589)]
[New Thread 0x7fffb45ec700 (LWP 41587)]
[New Thread 0x7ffec4ffd700 (LWP 41577)]
[New Thread 0x7ffec57fe700 (LWP 41442)]
[New Thread 0x7ffec7fff700 (LWP 41441)]
[New Thread 0x7ffecefff700 (LWP 41440)]
[New Thread 0x7ffed5fff700 (LWP 41439)]
[New Thread 0x7ffef0ff9700 (LWP 41438)]
[Switching to Thread 0x7ffec47fc700 (LWP 41621)]
Breakpoint 1, 0x00007ffff5f32d74 in chkp_print_traceback () from /opt/intel/composer_xe_2015.2.164/compiler/lib/intel64/libchkp.so
(gdb) where
#0 0x00007ffff5f32d74 in chkp_print_traceback () from /opt/intel/composer_xe_2015.2.164/compiler/lib/intel64/libchkp.so
#1 0x00007ffff5f31706 in __chkp_check_bounds () from /opt/intel/composer_xe_2015.2.164/compiler/lib/intel64/libchkp.so
#2 0x0000000000b170b0 in redacted

GDB: recover control from blocked process

I have the following problem: I want to regain control of gdb when a process blocks, i.e. in a blocking function or a polling loop.
Let's illustrate it with an example: I have process A, which forks process B. B does its work and then gets stuck waiting for an event from A. I want to switch GDB to A so I can run it separately until it generates the event. However, I cannot get control of GDB back from B. Of course I can press Ctrl+C in B, which generates a SIGINT, and then switch to A, but when I go back to B, B exits even if I set handle SIGINT pass.
Log:
Program received signal SIGINT, Interrupt.
[Switching to Thread 0xb68feb40 (LWP 3177)]
0xb7fdeb0c in ?? ()
(gdb) handle SIGINT pass
SIGINT is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal Stop Print Pass to program Description
SIGINT Yes Yes Yes Interrupt
(gdb) c
Continuing.
[Thread 0xb7abcb40 (LWP 3178) exited]
[Thread 0xb68feb40 (LWP 3177) exited]
Couldn't get registers: No such process.
(gdb) info inferiors
Num Description
* 2 <null>
1 process 3168
Is there a way to recover control of GDB and switch process without killing it?

Segfault process id and core dump process id are different. Why?

In the Linux messages file, I notice that a segfault is reported for process 14947, but I did not get a core dump for process 14947; instead I got 14069.core (its creation time matches the time of the segfault).
Then I opened it in gdb and found:
Program terminated with signal 11, Segmentation fault.
[New process 14947]
[New process 26131]
[New process 26130]
[New process 26129]
[New process 26128]
[New process 14945]
[New process 14842]
[New process 14726]
[New process 14598]
[New process 14069]
When I run "info thread", I get:
(gdb) info thread
10 process 14069 0xffffe410 in __kernel_vsyscall ()
9 process 14598 0xffffe410 in __kernel_vsyscall ()
8 process 14726 0xffffe410 in __kernel_vsyscall ()
7 process 14842 0xffffe410 in __kernel_vsyscall ()
6 process 14945 0xffffe410 in __kernel_vsyscall ()
5 process 26128 0xffffe410 in __kernel_vsyscall ()
4 process 26129 0xffffe410 in __kernel_vsyscall ()
3 process 26130 0xffffe410 in __kernel_vsyscall ()
2 process 26131 0xffffe410 in __kernel_vsyscall ()
* 1 process 14947 0x006a8300 in pthread_mutex_lock ()
So here are my questions:
Why doesn't the core dump file name match the segfault process ID in the messages file?
I thought a core dump is for a single process, so why are there so many lines like "[New process 26130]" here?
Why does "info thread" display processes, not threads?
Thanks!
P.S. My OS is RHEL5.
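In Linux, threads are implemented by the kernel as light-weight processes (processes whose virtual memory is shared with the parent process rather than marked as copy-on-write), and hence the process IDs that you see listed are really the thread IDs. That is why a single core dump contains many "processes" and why "info thread" lists each thread as a process. It also explains the mismatched numbers: the kernel's segfault message reports the ID of the thread that faulted (14947, which gdb shows as the current thread 1, stopped in pthread_mutex_lock), whereas the core file is named after the process ID, i.e. the thread group ID shared by all threads, which equals the ID of the main thread (14069).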

OpenMP - Hanging during execution

I'm experiencing inconsistent behavior in a program that's parallelized using OpenMP.
When I run it, it prints out its current stage, so the expected output is: "2 3 4 5" etc.
Time between the first few stages is usually 1 to 2 seconds (when running in parallel on 4 cores).
However, without recompiling or altering anything, sometimes when I run the software it hangs right after printing 2 (which is printed before the first parallel code is executed).
It doesn't become slow, it literally stops computing. I've run this under gdb and confirmed that it hangs inside of OpenMP:
(there are more than 4 threads because of hyperthreading)
[New Thread 0x7ffff6c78700 (LWP 25878)]
[New Thread 0x7ffff6477700 (LWP 25879)]
[New Thread 0x7ffff5c76700 (LWP 25880)]
[New Thread 0x7ffff5475700 (LWP 25881)]
[New Thread 0x7ffff4c74700 (LWP 25882)]
[New Thread 0x7ffff4473700 (LWP 25883)]
[New Thread 0x7ffff3c72700 (LWP 25884)]
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7641fd4 in ?? () from /usr/lib/libgomp.so.1
(gdb) up
#1 0x00007ffff7640a9e in ?? () from /usr/lib/libgomp.so.1
(gdb)
#2 0x0000000000408ae8 in Redcraft::createStructures (this=0x7fffffffd8d0) at source/redcraft.cpp:512
512 #pragma omp parallel for private(node)
Originally the pragma specified schedule(dynamic), but keeping or removing it doesn't change how often the hang occurs.
Lastly, I tried enabling and disabling omp_set_dynamic(), and that had no effect either.
Any suggestions for debugging?
This usually happens when there is a data race. You'll have to post the code block that is being parallelized; basically, what needs to be found out is how the threads are using the data. Rerunning without recompiling doesn't guarantee the same thread execution order, which is why this kind of problem shows up only on some runs. Are you working with files? You'll have to close them before rerunning.

switching block focus in cuda-gdb

Pretty simple... I want to change focus in cuda-gdb. I can change to a different thread within the current block (block 0), but not to a different block. I'm using cuda/cuda-gdb 3.0.
The way described in the 3.0 manual:
(cuda-gdb) cuda block
Current CUDA focus: block (0,0).
(cuda-gdb) cuda block (9,0)
CUDA focus unchanged.
(cuda-gdb) cuda thread (9,0,0)
New CUDA focus: device 0, sm 1, warp 0, lane 9, grid 42672, block (0,0), thread (9,0,0).
Or the other way (from the 3.2 manual):
(cuda-gdb) thread
[Current Thread 2 (Thread 140272898447104 (LWP 28681))]
[Current CUDA Thread <<<(0,0),(0,0,0)>>>]
(cuda-gdb) thread <<<(9),(10)>>>
Switching to <<<(9,0),(10,0,0)>>> 0x000000000246a5c8 in my_kernel
<<<(16,1),(128,1,1)>>> ...
(cuda-gdb) thread
[Current Thread 2 (Thread 140272898447104 (LWP 28681))]
[Current CUDA Thread <<<(0,0),(0,0,0)>>>]
(cuda-gdb) thread <<<20>>>
Switching to <<<(0,0),(20,0,0)>>> 0x000000000246a5c8 in my_kernel
<<<(16,1),(128,1,1)>>> ...
(cuda-gdb) thread
[Current Thread 2 (Thread 140272898447104 (LWP 28681))]
[Current CUDA Thread <<<(0,0),(20,0,0)>>>]
What am I doing wrong?
cuda 3.0 | ubuntu 9.04 | gtx 480
If you run info cuda sm (IIRC), you can see the currently active blocks. It's not possible to switch to a block (or a warp within a block) that has already completed execution.
If you want to look at a specific block then you should be able to break on the kernel function itself, then change focus, then continue the debugging session.