Locating segmentation fault for multithread program running on cluster - c++

It's quite straightforward to use gdb in order to locate a segmentation fault while running a simple program in interactive mode. But consider we have a multithread program - written by pthread - submitted to a cluster node (by qsub command). So we don't have an interactive operation.
How we can locate the segmentation fault? I am looking for a general approach, a program or test tool. I can not provide a reproducible example as the program is really big and crashes on the cluster in some unknown situations.
I need to find a problem in such hard situation because the program runs correctly on the local machine with any number of threads.

The "normal" approach is to have the environment produce a core file and get hold of those. If this isn't an option, you might want to try installing a signal handler for SIGSEGV which obtains, at least, a stack trace dumped somewhere. Of course, this immediately leads to the question "how to get a stack trace" but this is answered elsewhere.
The easiest approach is probably to get hold of a core file. Assuming you have a similar machine where the core file can be read, you can use gdb program corefile to debug the program program which produced the core file corefile: You should be able to look at the different threads, their data (to some extend), etc. If you don't have a suitable machine it may be necessary to cross-compile gdb matching the hardware of the machine where it was run.
I'm a bit confused about the statement that the core files are empty: You can set the limits for core files using ulimit on the shell. If the size for cores is set to zero it shouldn't produce any core file. Producing an empty one seems odd. However, if you cannot change the limits on your program you are probably down to installing a signal handler and dumping out a stack trace from the offending thread.
Thinking of it, you may be able to put the program to sleep in the signal handler and attach to it using a debugger, assuming you can run a debugger on the corresponding machine. You would determine the process ID (using, e.g., ps -elf | grep program) and then attach to it using
gdb program pid
I'm not sure how to put a program to sleep from within the program, though (possibly installing the handler for SIGSTOP for SIGSEGV...).
That said, I assume you tried running your program on your local machine...? Some problems are more fundamental than needing a distributed system of many threads running on each node. This is, obviously, not a replacement for the approach above.

Related

Detecting not using MPI when running with mpirun/mpiexec

I am writing a program (in C++11) that can optionally be run in parallel using MPI. The project uses CMake for its configuration, and CMake automatically disables MPI if it cannot be found and displays a warning message about it.
However, I am worrying about a perfectly plausible use case whereby a user configures and compiles the program on an HPC cluster, forgets to load the MPI module, and does not notice the warning. That same user might then try to run the program, notice that mpirun is not found, include the MPI module, but forget to recompile. If the user then runs the program with mpirun, this will work, but the program will just run a number of times without any parallelization, as MPI was disabled at compile time. To prevent the user from thinking the program is running in parallel, I would like to make the program display an error message in this case.
My questions is: how can I detect that my program is being run in parallel without using MPI library functions (as MPI was disabled at compile time)? mpirun just launches the program a number of times, but does not tell the processes it launches about them being run in parallel, as far as I know.
I thought about letting the program write some test file, and then check if that file already exists, but apart from the fact that this might be tricky to do due to concurrency problems, there is no guarantee that mpirun will even launch the various processes on nodes that share a file system.
I also considered using a system variable to communicate between the two processes, but as far as I know, there is no system independent way of doing this (and again, this might cause concurrency issues, as there is no way to coordinate system calls between the various processes).
So at the moment, I have run out of ideas, and I would very much appreciate any suggestions that might help me achieve this. Preferred solutions should by operating system independent, although a UNIX-only solution would already be of great help.
Basically, you want to run a a detection of whether you are being run by mpirun etc. in your non-MPI code-path. There is a very similar question: How can my program detect, whether it was launch via mpirun that already presents one non-portable solution.
Check for environment variables that are set by mpirun. See e.g.:
http://www.open-mpi.org/faq/?category=running#mpi-environmental-variables
As another option, you could get the process id of the parent process and it's process name and compare it with a list of known MPI launcher binaries such as orted,slurmstepd,hydra??1. Everything about that is unfortunately again non-portable.
Since launching itself is not clearly defined by the MPI standard, there cannot be a standard way to detect it.
1: Only from my memory, please don't take the list literally.
From a user experience point of view, I would argue that always showing a clear message how the program is being run, such as:
Running FancySimulator serially. If you see this as part of mpirun, rebuild FancySimuilator with FANCYSIM_MPI=True.
or
Running FancySimulator in parallel with 120 MPI processes.
would "solve" the problem. A user getting 120 garbled messages will hopefully notice.

How to get linux kernel coredump for later analysis using gdb tool?

Is it possible to intentionally crash the kernel at specific point during the course of its execution (by inserting some C statement there Or otherwise) and then collect the corefile for analysis using normal gdb program ?
Can somebody pls share the steps and what needs to be done.
Is it possible to intentionally crash the kernel
Sure: just insert a call to panic() in desired place.
The easiest way to do this is using user-mode linux. The kernel becomes just a regular program, and you can execute it under GDB the usual way, setting breakpoints, looking at variables, etc.
If you need to do "bare metal" execution, you should probably start here or here.

Recover from crash with a core dump

A C++ program crashed on FreeBSD 6.2 and OS was kind enough to create a core dump. Is it possible to amputate some stack frames, reset the instruction pointer and restart the process in gdb, and how?
Is it possible to amputate some stack frames, reset the instruction pointer and restart the process in gdb?
I assume you mean: change the process state, and set it to start executing again (as if it never crashed in the first place).
No. For one thing, how do you propose GDB (if it magically had this capability) would handle your file descriptors (which the kernel automatically closed when your process died)?
Yes, gdb can debug core dumps just as well as running programs. Assuming that a.out is the name of your program's executable and that a.core is the name of your core file, invoke gdb like so:
gdb a.out a.core
And then you can debug like normal, except you cannot continue execution in any way (even if you could, the program would just crash again). You can examine the stack trace, registers, memory, etc.
Possible duplicate of this: Best practices for recovering from a segmentation fault
Summary: It is possible but not recommended. The way to do it is to usse setjmp() and longjmp() from a signal handler. (Please look at complete source code example in duplicate post.

Corrupt stack problem in C/C++ program

I am running a C/C++ program in linux servers to serve videos. The program's(say named Plugin) core functionality is to convert videos and we fork a separate Plugin process for each video request. But I am having a weird problem for which sometimes server load average gets unexpectedly high. What I see from top command at this stage is that there are some processes which are running for long time and taking some huge CPU's.
When I debug this running program with gdb and backtrace stack,what I found is the corrupt stack: "Previous frame inner to this frame (corrupt stack?)". I have searched the net and found that this occurs if the program gets segmentation fault.
But what I know if the program gets segmentation fault, the program should crash and exit at that point. But surprisingly the program still running after segmentation fault.
What can be the causes of this? I know there must be some big problems in the program but I just can't understand from where to start fixing the problem...It would be great if any of you can show me some lights...
Thanks in advance
Attaching the debugger changes the behavior of the process so you won't get reliable investigation results most probably. Corrupted stack message from the debugger can mean that the particular debugger does not understand text info from the binary.
I would recommend running pstack several time subsequently on the problematic (this is known as "Monte Carlo performance profiling") and also attach strace or truss to the problematic and check what system calls is the process doing when consuming CPU.
Run your program under Valgrind and fix any invalid memory writes that it finds.
Certain optimisations, such as frame pointer omission, can make it harder for the debugger to understand the stack.
If you have the code, compile the program in debug and run Valgrind on it.
If you don't have the code, contact the author/provider of the program.
The corrupt stack message simply means the code is doing something weird with the memory. It does not mean the program has a segmentation fault. Also, the program can still run if it choose to handle the SIGSEGV signal.
If by forking you mean that you have some process which spawn and run other smaller processes, just monitor for such spikes and restart the process. This assumes that you have no access to the fix the program.
There could be some interesting manipulation of the stack taking place through assembly code manipulation, such as true tail-recursion optimization, self-modifying code, non-returning functions, etc. that may cause the debugger to be incapable of properly back-tracing the stack and causing it to trigger a corrupted stack error, but that doesn't necessarily mean that memory is corrupted ... but definitely something non-traditional is happening under the hood.

Segmentation fault only when running on multi-core

I am using a c++ library that is meant to be multi-threaded and the number of working threads can be set using a variable. The library uses pthreads. The problem appears when I run the application ,that is provided as a test of library, on a quad-core machine using 3 threads or more. The application exits with a segmentation fault runtime error. When I try to insert some tracing "cout"s in some parts of library, the problem is solved and application finishes normally.
When running on single-core machine, no matter what number of threads are used, the application finishes normally.
How can I figure out where the problem seam from?
Is it a kind of synchronization error? how can I find it? is there any tool I can use too check the code ?
Sounds like you're using Linux (you mention pthreads). Have you considered running valgrind?
Valgrind has tools for checking for data race conditions (helgrind) and memory problems (memcheck). Valgrind may be to find such an error in debug mode without needing to produce the crash that release mode produces.
Some general debugging recommendations.
Make sure your build has symbols (compile with -g). This option is orthogonal to other build options (i.e. the decision to build with symbols is independent of the optimization level).
Once you have symbols, take a close look at the call stack of where the seg fault occurs. To do this, first make sure your environment is configured to generate core files (ulimit -c unlimited) and then after the crash, load the program/core in the debugger (gdb /path/to/prog /path/to/core). Once you know what part of your code is causing the crash, that should give you a better idea of what is going wrong.
You are running into a race condition.
Where multiple threads are interacting on the same resource.
There are a whole host of possible culprits, but without the source anything we say is a guess.
You want to create a core file; then debug the application with the core file. This will set up the debugger to the state of the application at the point it crashed. This will allow you to examin the variables/registers etc.
How to do this will very depending on your system.
A quick Google revealed this:
http://www.codeguru.com/forum/archive/index.php/t-299035.html
Hope this helps.