Cygwin OMP Core Dump - Fortran

I have been trying to parallelize an optimization algorithm written in Fortran 90, compiled and run under Cygwin with gfortran XXXXX -fopenmp.
The algorithm computes a gradient and a Hessian matrix by finite differences from a subroutine call. The subroutine is fairly large and manipulates a ~2 MB matrix on each call. For the purposes of discussion I'll use "call srtin()" as the example subroutine call.
The code compiles without a hitch, and regular compilation and execution with gfortran causes no issues. However, the moment I add the -fopenmp option, even with no OpenMP code anywhere in the program, the resulting executable causes a segmentation fault if a single call to srtin() is present.
I've read on this site that a common issue with OMP is stack size. I've inferred (possibly wrongly) that the master thread's stack size is at fault, because I haven't yet added any code that would create slave threads. On a typical Linux machine, my understanding is that I would use "ulimit -s XXX" to raise the stack size high enough that the error no longer occurs. I've tried this through my Cygwin interface, but the error persists. I've also tried using the peflags command to set a larger stack for this executable, with no success, and I have increased the OMP_STACKSIZE environment variable, also with no success.
Does anyone have any suggestions?

Enabling OpenMP in GCC disables the automatic placement of large arrays on the heap, so it can make your program crash even if there are no OpenMP constructs in the code. Windows has no equivalent of ulimit -s: the stack size of the main thread is read from the PE header of the executable file. OMP_STACKSIZE controls the stack size of the worker threads and does not affect that of the master thread.
Use -Wl,--stack,some_big_value as advised by @tim18 instead of editing the PE header with peflags. some_big_value is in bytes. See here for more information.
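For example, a minimal sketch of such a compile line (the source file name and the 256 MB value are placeholders, not taken from the question):

```shell
# Ask the linker to reserve a 256 MB (268435456-byte) stack for the
# main thread; myprog.f90 stands in for the actual source file.
gfortran -fopenmp -Wl,--stack,268435456 myprog.f90 -o myprog
```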

Related

Debugging an "Invalid address space" error

I've built some C++ code that uses OpenACC and compiled it with the PGI compiler for use on the Tesla GPU.
Compilation succeeds without any warnings.
I run the program and get two errors:
call to cuStreamSynchronize returned error 717: Invalid address space
call to cuMemFreeHost returned error 717: Invalid address space
The internet doesn't seem to know much about this, other than to suggest enabling unified memory so that the problem is automatically swept under the rug. I'm not into that kind of solution.
How do I go about debugging this?
With C++ code running only on the CPU, I'd fire up gdb, do a backtrace, and say, "Ah ha!"
But now I have code living on the CPU and the GPU and data flowing between the two. I don't even know what tools to use.
A fallback is to start commenting out lines until the problem goes away, but that seems suboptimal too.
You can use "cuda-gdb" to debug the device code or use "cuda-memcheck" to check for memory errors.
Though I'm not sure either will help here. The error is indicating that the device code is issuing an instruction using an address from the wrong memory space. For example, using a shared memory pointer with an instruction that expects a global memory pointer.
I have not seen this error before, nor do I see any previous bug reports for it, so I can only theorize as to the cause. One possibility is a shared memory variable (a scalar or array in a "private" clause, or a "cache" directive) that's passed from an outer gang loop to a vector routine. In this case, the vector routine may be accessing the variable as if it were in global memory.
Whatever the cause, it's most likely a compiler error. If possible, please post or send to PGI Customer Service (trs@pgroup.com) a reproducing example and I'll get it to our compiler engineers for investigation.
I can also try to get you a work-around once I better understand the cause. Though in the meantime you can try compiling with "-ta=tesla:nollvm,keepgpu". "nollvm" will cause the compiler to generate an intermediary CUDA C version of the OpenACC kernels as opposed to the default LLVM device code generator. "keepgpu" will keep the intermediary ".gpu" file which you can inspect.
There are some helpful environment variables that aid in debugging. Any combination can be enabled:
export PGI_ACC_TIME=1 #Profile time usage
export PGI_ACC_NOTIFY=1 #Set to values 0-3 where 3 is the most detailed
export PGI_ACC_DEBUG=1 #Extra debugging info

Problems spawning a multi-threaded process from a separate single-threaded process in Linux

I've had some rather unusual behavior from my OpenMP program, which I think must be caused by some inner-workings of Linux processes that I am unaware of.
When I run a C benchmark binary which has been compiled with OpenMP support, the program executes successfully, no problems at all:
>:$ OMP_NUM_THREADS=4 GOMP_CPU_AFFINITY="0-3" /home/me/bin/benchmark
>:$ ...benchmark complete...
When I run the benchmark from a separate C++ program that I start, where the code to execute it (for example) looks like:
#include <cstdlib>

int main(int argc, char* argv[]) {
    system("/home/me/bin/benchmark");
    return 0;
}
The output gives me warnings:
>:$ /home/me/bin/my_cpp_program
OMP: Warning #123: Ignoring invalid OS proc ID 1.
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
These warnings are the same warnings I get when I try to set CPU affinity to CPUs that don't exist, and run the OpenMP benchmark directly.
I therefore assume that the only CPU my_cpp_program knows about is processor ID 0. I get the same error when running as root, so I don't think it is a permissions problem. I've also checked that the code executed by system() sees the correct environment variables and the same linked libraries.
Does anyone know what is causing this to occur?
Following a suggestion by Zulan, I'm pretty sure the reason this occurred was the nature and compilation of bin/my_cpp_program. Importantly, it was also compiled with -fopenmp. It turns out that bin/my_cpp_program was compiled against the GNU OpenMP runtime library, libgomp, whereas bin/benchmark was compiled against the LLVM OpenMP runtime library, libomp.
I am not certain, but here is what I think happened. GOMP_CPU_AFFINITY is the GNU environment variable for setting CPU affinity in libgomp. Because GOMP_CPU_AFFINITY was set, the single thread running in bin/my_cpp_program was bound to CPU 0, and I guess any child process spawned by that thread sees only CPU 0 as a potential CPU when applying further GOMP_CPU_AFFINITY assignments. That produced the warnings when the spawned process tried to find CPUs 1-3.
To work around this I used KMP_AFFINITY, Intel's CPU affinity environment variable. The libomp OpenMP runtime used by bin/benchmark gives KMP_AFFINITY higher precedence than GOMP_CPU_AFFINITY when both are set, and for whatever reason this allows the spawned child process to be assigned to the other CPUs. So I used:
KMP_AFFINITY=logical,granularity=fine ./bin/benchmark
This means the program works as expected (each logical core is bound in ascending order from 0 to 3) in both situations, as the CPU assignments of the bin/my_cpp_program no longer screw with the assignments of bin/benchmark, when one spawns the other. This can be checked as truly occurring by adding verbose to the comma-separated list.

Using call system() causes program to hang... 50% of the time

I have developed a Fortran code, compiled with ifort, whose memory requirements scale with the size of the problem. After initialization of the problem (allocation of arrays, etc.) the main part of the code loops through a series of function calls.
One of these functions includes 3-5 call system() commands. Some are simple and are only copying directories such as:
call system('cp -r plot_files plot_files1')
While there is another which actually calls an mpiexec that runs a separate program.
The problem is that the program 'hangs' on the system calls about 50% of the time but only for large problems in which there have been arrays allocated (~ array(300000)).
By "hang" I mean that qstat still shows the job as running, but searching for the PID using pstack, strace, or cat /proc/PID/status reveals that the PID no longer exists.
There are call system() commands earlier in the code, before the bulk of the initialization, and they never fail there, only after the arrays have been allocated. This made me believe it was a memory issue, but monitoring during a hang shows that there is plenty of memory available.
(See the output of top -cbp PID captured before the program hangs.)
I was originally compiling it with OpenMP with the hopes of future parallelization of the code. With OpenMP the failure rate was around 80% of the time. When OpenMP was taken out the failure rate fell to around 50%. I've searched and searched for possible reasons for this problem and have come up empty handed.

Locating segmentation fault for multithread program running on cluster

It's quite straightforward to use gdb to locate a segmentation fault while running a simple program interactively. But consider a multithreaded program, written with pthreads, submitted to a cluster node (by the qsub command), so we have no interactive session.
How can we locate the segmentation fault? I am looking for a general approach, a program, or a test tool. I cannot provide a reproducible example, as the program is really big and crashes on the cluster only in some unknown situations.
I need to find the problem in this hard situation because the program runs correctly on the local machine with any number of threads.
The "normal" approach is to have the environment produce a core file and get hold of those. If this isn't an option, you might want to try installing a signal handler for SIGSEGV which obtains, at least, a stack trace dumped somewhere. Of course, this immediately leads to the question "how to get a stack trace" but this is answered elsewhere.
The easiest approach is probably to get hold of a core file. Assuming you have a similar machine where the core file can be read, you can use gdb program corefile to debug the program program which produced the core file corefile: you should be able to look at the different threads, their data (to some extent), etc. If you don't have a suitable machine, it may be necessary to cross-compile gdb to match the hardware of the machine where the program ran.
I'm a bit confused about the statement that the core files are empty: You can set the limits for core files using ulimit on the shell. If the size for cores is set to zero it shouldn't produce any core file. Producing an empty one seems odd. However, if you cannot change the limits on your program you are probably down to installing a signal handler and dumping out a stack trace from the offending thread.
Thinking of it, you may be able to put the program to sleep in the signal handler and attach to it using a debugger, assuming you can run a debugger on the corresponding machine. You would determine the process ID (using, e.g., ps -elf | grep program) and then attach to it using
gdb program pid
I'm not sure how to put a program to sleep from within the program itself, though (possibly by raising SIGSTOP from the SIGSEGV handler...).
That said, I assume you tried running your program on your local machine...? Some problems are more fundamental than needing a distributed system of many threads running on each node. This is, obviously, not a replacement for the approach above.

OpenMP in Fortran

I very rarely use Fortran; however, I have been tasked with taking legacy code and rewriting it to run in parallel. I'm using gfortran as my compiler. I found some excellent resources at https://computing.llnl.gov/tutorials/openMP/ as well as a few others.
My problem is this, before I add any OpenMP directives, if I simply compile the legacy program:
gfortran Example1.F90 -o Example1
everything works, but turning on the OpenMP compiler option, even without adding any directives:
gfortran -fopenmp Example1.F90 -o Example1
ends up with a segmentation fault when I run the legacy program. Using smaller test programs that I wrote, I've successfully compiled other programs with -fopenmp that run on multiple threads, but I'm rather at a loss as to why enabling the option alone, with no directives, results in a seg fault.
I apologize if my question is rather simple. I could post code but it is rather long. It faults as I assign initial values:
REAL, DIMENSION(da,da) :: uconsold
REAL, DIMENSION(da,da,dr,dk) :: uconsolde
...
uconsold=0.0
uconsolde=0.0
The first assignment, to "uconsold", works fine; the second seems to be the source of the fault, since when I comment that line out the next several lines execute merrily until "uconsolde" is used again.
Thank you for any help in this matter.
Perhaps you are running out of stack space? With OpenMP, variables go on the stack so that each thread has its own copy. Perhaps your arrays are large and, even with a single thread (no OpenMP directives), they are using up the stack. Just a guess... Try your operating system's method for increasing the stack size and see if the segmentation fault goes away.
Another approach: to specify that the array should go on the heap, you could make it "allocatable". OpenMP version 3.0 allows more uses of Fortran allocatable arrays -- I'm not sure of the details.
I had this problem. It's spooky: I get segfaults just for declaring 33x33 arrays or 11x11x11 arrays with no OpenMP directives; these segfaults occur on an Intel Mac with 4 GB RAM. Making them "allocatable" rather than statically allocated fixed this problem.