OpenMP - Hanging during execution - c++

I'm seeing inconsistent behavior in a program that's parallelized using OpenMP.
When I run it, it prints out its current stage, so the expected output is: "2 3 4 5" etc.
Time between the first few stages is usually 1 to 2 seconds (when running in parallel on 4 cores).
However, without recompiling or altering anything, sometimes when I run the software it hangs right after printing 2 (which is printed before the first parallel code is executed).
It doesn't become slow; it literally stops computing. I've run this under gdb and confirmed that it hangs inside of OpenMP:
(there are more than 4 threads because of hyperthreading)
[New Thread 0x7ffff6c78700 (LWP 25878)]
[New Thread 0x7ffff6477700 (LWP 25879)]
[New Thread 0x7ffff5c76700 (LWP 25880)]
[New Thread 0x7ffff5475700 (LWP 25881)]
[New Thread 0x7ffff4c74700 (LWP 25882)]
[New Thread 0x7ffff4473700 (LWP 25883)]
[New Thread 0x7ffff3c72700 (LWP 25884)]
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7641fd4 in ?? () from /usr/lib/libgomp.so.1
(gdb) up
#1 0x00007ffff7640a9e in ?? () from /usr/lib/libgomp.so.1
(gdb)
#2 0x0000000000408ae8 in Redcraft::createStructures (this=0x7fffffffd8d0) at source/redcraft.cpp:512
512 #pragma omp parallel for private(node)
Originally the pragma specified schedule(dynamic), but the hang is just as intermittent with or without it.
Lastly, I tried enabling/disabling omp_set_dynamic() and that had no effect either.
Any suggestions for debugging?

This usually happens when there is a data race. You'll have to post the code block that is being parallelized; what needs to be found out is how the threads are using the data. Rerunning without recompiling doesn't guarantee the same thread execution sequence, which is why these kinds of problems appear only on some runs. Are you working with files? You'll have to close them before rerunning.
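Since the loop body was not posted, the following is only a minimal sketch of what a race-free loop can look like, not the code from the question. The names processNode and buildTotal are hypothetical. It assumes each iteration can write to its own slot and that the only shared accumulator can be combined with a reduction clause:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-node work done in the real loop.
static int processNode(int i) { return i * 2; }

// Sums the per-node results without any unsynchronized shared writes.
int buildTotal(int n) {
    assert(n >= 0);
    std::vector<int> results(static_cast<std::size_t>(n));
    int total = 0;

    // `total` is combined through a reduction, so each thread works on its
    // own copy that OpenMP merges at the end of the loop.
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        // Declared inside the loop body, so every thread gets its own `node`
        // without needing a private() clause.
        int node = processNode(i);
        results[static_cast<std::size_t>(i)] = node;  // one slot per iteration
        total += node;
    }
    return total;
}
```

If the real loop updates a shared container or counter without a reduction, a critical section, or an atomic, the result depends on thread timing, which would fit a failure that shows up only on some runs. The sketch also compiles without OpenMP enabled, in which case the pragma is ignored and the loop runs serially.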

Related

gdb stack trace shows no backtrace - Selected thread is running

I have a program with multiple threads. This program has been compiled with the -g option and has debug symbols.
(gdb) i threads
Id Target Id Frame
1 Thread 0x7fbb5256c9c0 (LWP 15799) "main" 0x00007fbb40dfdc93 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
2 Thread 0x7fbb149ff700 (LWP 15858) "IOService" (running)
3 Thread 0x7fbb135ff700 (LWP 15860) "EvoAftManBt-mai" (running)
4 Thread 0x7fbb0cfff700 (LWP 15868) "main" (running)
5 Thread 0x7fbaecfff700 (LWP 15873) "myTimer-0" (running)
* 6 Thread 0x7fbadffff700 (LWP 15882) "stats-root" (running)>>>>>
What seems odd is the status at the end of each thread's entry, which says (running).
Now if I switch to the thread I am interested in:
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fbadffff700 (LWP 15882))](running)
(gdb) bt
Selected thread is running
No backtrace is printed.
However, if I strip the symbols from the same binary, I can see the stack trace, which is baffling.
Any pointers?

How to force ROS running in single thread mode

I am trying to debug my ROS code in gdb; however, when I start the node in gdb, it always gives me:
Starting program: /home/uav/catkin_ws/devel/lib/my_package/my_node
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff13b9700 (LWP 28089)]
[New Thread 0x7ffff0bb8700 (LWP 28090)]
[New Thread 0x7fffebfff700 (LWP 28091)]
[New Thread 0x7fffeb7fe700 (LWP 28096)]
[New Thread 0x7fffeaffd700 (LWP 28098)]
[New Thread 0x7fffea7fc700 (LWP 28121)]
and hangs forever. I haven't seen this issue before and have no clue why it always starts in multi-threaded mode. My main function looks like this:
int main(int argc, char** argv)
{
    ros::init(argc, argv, "my_node");
    ros::NodeHandle nodeHandle("~");
    ros::Rate rate(10);
    while (ros::ok()) {
        // Do Something
        ros::spinOnce();
        rate.sleep();
    }
    return 0;
}
How can I fix this problem? I think I should build a single-threaded version of my node and debug that. How would I do that?
There are two ways to talk about multithreading in ROS.
The whole node
The callbackQueue.
In the first case, when you use:
ros::NodeHandle nodeHandle("~");
ros::ok()
...
These calls make requests to the ROS Master, and roscpp handles that network communication with multiple threads. There is no way to change that. If single threading is very important for you, you should try rospy, rosnodejs or others.
"roscpp does not try to specify a threading model for your application. This means that while roscpp may use threads behind the scenes to do network management, scheduling etc., it will never expose its threads to your application."
In the second case, we are talking about handling topics, services and actions with multiple threads. By default, a topic is handled with one thread. But if you send more data than your node can handle, you can use multithreading.
"roscpp does, however, allow your callbacks to be called from any number of threads if that's what you want."
For more details, see:
http://wiki.ros.org/roscpp/Overview/Callbacks%20and%20Spinning
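To make the distinction between the two cases concrete, here is a simplified model of a callback queue in plain C++. This is a sketch only and is not the roscpp implementation: CallbackQueue, enqueue, spinOnce and callbacksRunOnCallerThread are hypothetical names chosen for illustration. It assumes a background thread stands in for the network threads that deliver messages, while the callbacks themselves run only on the thread that calls spinOnce():

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical, simplified model of a callback queue (not roscpp code).
class CallbackQueue {
public:
    // Called from producer threads (e.g. network threads) when data arrives.
    void enqueue(std::function<void()> cb) {
        assert(cb && "callback must be callable");
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push(std::move(cb));
    }

    // Runs all pending callbacks on the calling thread; returns how many ran.
    std::size_t spinOnce() {
        std::queue<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(batch, pending_);  // take the work, release the lock
        }
        const std::size_t count = batch.size();
        while (!batch.empty()) {
            batch.front()();
            batch.pop();
        }
        return count;
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> pending_;
};

// Checks that callbacks queued by another thread run on the caller's thread.
bool callbacksRunOnCallerThread() {
    CallbackQueue queue;
    std::vector<std::thread::id> seen;

    std::thread producer([&queue, &seen] {
        for (int i = 0; i < 3; ++i) {
            queue.enqueue([&seen] { seen.push_back(std::this_thread::get_id()); });
        }
    });
    producer.join();

    if (queue.spinOnce() != 3 || seen.size() != 3) return false;
    for (const auto& id : seen) {
        if (id != std::this_thread::get_id()) return false;
    }
    return true;
}
```

In this model the extra thread exists, much like the threads gdb reports at startup, but user callbacks never execute on it. That is the sense in which the main loop in the question is already single-threaded from the application's point of view.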

How to get gdb to break on "CHKP: Bounds check error" from icc `-check-pointers=write`

The Intel icc compiler has a run-time check feature -check-pointers=write that does some sort of magic to check if a pointer writes beyond data it is supposed to. When I run this on my code, I get hundreds of these errors rolling by in gdb. I would like to have gdb break on the first instance of this error, but it is not implemented as an exception or signal, so catch throw or catch signal doesn't work, and I have no idea if there is a function name associated with this feature.
Is there any way to have the debugger "break" when the run-time checker hits it?
The -check-pointers feature installs code from libchkp.so and all of the functions have the chkp prefix in them. A quick search of the functions in gdb using info functions chkp showed that the traceback function is called "chkp_print_traceback", so this will install a break point when the traceback happens:
break chkp_print_traceback
and now it stops!
[New Thread 0x7fffce34c700 (LWP 41385)]
[New Thread 0x7fffceb4d700 (LWP 41384)]
[New Thread 0x7fffd034e700 (LWP 41383)]
CHKP: Bounds check error ptr=0x7ffe24685870 sz=4 lb=0x7ffe24685860 ub=0x7ffe2468586f loc=0xb170b0
[New Thread 0x7ffec47fc700 (LWP 41621)]
[New Thread 0x7ffe29fff700 (LWP 41622)]
[New Thread 0x7ffed47fe700 (LWP 41603)]
[New Thread 0x7ffecc7fe700 (LWP 41605)]
[New Thread 0x7ffef07f8700 (LWP 41598)]
[New Thread 0x7fff147f8700 (LWP 41597)]
[New Thread 0x7fff387f8700 (LWP 41595)]
[New Thread 0x7fff687f8700 (LWP 41594)]
[New Thread 0x7fff707f8700 (LWP 41590)]
[New Thread 0x7fff907f8700 (LWP 41589)]
[New Thread 0x7fffb45ec700 (LWP 41587)]
[New Thread 0x7ffec4ffd700 (LWP 41577)]
[New Thread 0x7ffec57fe700 (LWP 41442)]
[New Thread 0x7ffec7fff700 (LWP 41441)]
[New Thread 0x7ffecefff700 (LWP 41440)]
[New Thread 0x7ffed5fff700 (LWP 41439)]
[New Thread 0x7ffef0ff9700 (LWP 41438)]
[Switching to Thread 0x7ffec47fc700 (LWP 41621)]
Breakpoint 1, 0x00007ffff5f32d74 in chkp_print_traceback () from /opt/intel/composer_xe_2015.2.164/compiler/lib/intel64/libchkp.so
(gdb) where
#0 0x00007ffff5f32d74 in chkp_print_traceback () from /opt/intel/composer_xe_2015.2.164/compiler/lib/intel64/libchkp.so
#1 0x00007ffff5f31706 in __chkp_check_bounds () from /opt/intel/composer_xe_2015.2.164/compiler/lib/intel64/libchkp.so
#2 0x0000000000b170b0 in redacted

GDB: recover control from blocked process

I have the following problem: I want to recover control of gdb when a process enters a blocking situation, i.e. a blocking function or a polling loop.
Let's illustrate it with an example: I have process A, which forks process B. B does its work and then gets stuck waiting for an event from A. I want to switch GDB to A so I can run it separately until it generates the event. However, I cannot recover control of GDB from B. Of course I can press Ctrl+C in B, which generates a SIGINT signal, and then switch to A, but when I go back to B, B terminates even if I set handle SIGINT pass.
Log:
Program received signal SIGINT, Interrupt.
[Switching to Thread 0xb68feb40 (LWP 3177)]
0xb7fdeb0c in ?? ()
(gdb) handle SIGINT pass
SIGINT is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal Stop Print Pass to program Description
SIGINT Yes Yes Yes Interrupt
(gdb) c
Continuing.
[Thread 0xb7abcb40 (LWP 3178) exited]
[Thread 0xb68feb40 (LWP 3177) exited]
Couldn't get registers: No such process.
(gdb) info inferiors
Num Description
* 2 <null>
1 process 3168
Is there a way to recover control of GDB and switch process without killing it?

Segfault process id and core dump process id are different. Why?

In the Linux message file, I notice that a segfault is reported for process 14947, but I did not get a core dump for process 14947; instead I got 14069.core. (Its creation time matches the time of the segfault.)
Then I use gdb and find:-
Program terminated with signal 11, Segmentation fault.
[New process 14947]
[New process 26131]
[New process 26130]
[New process 26129]
[New process 26128]
[New process 14945]
[New process 14842]
[New process 14726]
[New process 14598]
[New process 14069]
When I run "info thread", I get:-
(gdb) info thread
10 process 14069 0xffffe410 in __kernel_vsyscall ()
9 process 14598 0xffffe410 in __kernel_vsyscall ()
8 process 14726 0xffffe410 in __kernel_vsyscall ()
7 process 14842 0xffffe410 in __kernel_vsyscall ()
6 process 14945 0xffffe410 in __kernel_vsyscall ()
5 process 26128 0xffffe410 in __kernel_vsyscall ()
4 process 26129 0xffffe410 in __kernel_vsyscall ()
3 process 26130 0xffffe410 in __kernel_vsyscall ()
2 process 26131 0xffffe410 in __kernel_vsyscall ()
* 1 process 14947 0x006a8300 in pthread_mutex_lock ()
So here are my questions:
Why does the core dump file name not match the segfault process ID in the message file?
I thought a core dump is for one particular process, so why are there so many lines like "[New process 26130]" here?
Why does "info thread" display info for processes, not threads?
Thanks!
Plus: My OS is RHEL5.
In Linux, kernel threads are simply light-weight processes (processes where the virtual memory is marked as shared with the parent process rather than marked as copy-on-write), and hence the process IDs that you see listed are the same as the thread IDs. This is just a guess, but probably the ID for the core is the same as the thread that handled the signal, which might not be the same as the main thread.
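The lightweight-process model can be observed directly with a small Linux-only sketch. ThreadIds, currentIds and idsFromNewThread are hypothetical names used for illustration. The sketch calls syscall(SYS_gettid) rather than a gettid() wrapper, on the assumption that older C libraries may not provide one:

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <thread>

// Hypothetical helper type: the two IDs the kernel reports for a thread.
struct ThreadIds {
    long pid;  // process ID, shared by every thread in the process
    long tid;  // kernel thread ID (the LWP number that gdb prints)
};

// IDs as seen by whichever thread calls this function.
ThreadIds currentIds() {
    return {static_cast<long>(getpid()),
            static_cast<long>(syscall(SYS_gettid))};
}

// IDs as seen from inside a newly started thread.
ThreadIds idsFromNewThread() {
    ThreadIds ids{0, 0};
    std::thread worker([&ids] { ids = currentIds(); });
    worker.join();
    assert(ids.tid > 0);  // the worker must have filled this in
    return ids;
}
```

Comparing currentIds() with idsFromNewThread() should show the same pid but different tid values. Each thread has its own kernel-level ID, which is what older gdb versions list as "process N" in the output above.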