Why does MPI_Barrier cause a segmentation fault in C++ - c++

I have reduced my program to the following example:
#include <mpi.h>

int main(int argc, char * argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
I compile and run the code, and get the following result:
My-MacBook-Pro-2:xCode_TrapSim user$ mpicxx -g -O0 -Wall barrierTest.cpp -o barrierTestExec
My-MacBook-Pro-2:xCode_TrapSim user$ mpiexec -n 2 ./barrierTestExec
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 21633 RUNNING AT My-MacBook-Pro-2.local
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
If I comment out the MPI_Barrier, or run the program on only one node, the code runs fine. I am using the following compilers:
My-MacBook-Pro-2:xCode_TrapSim user$ mpiexec --version
HYDRA build details:
Version: 3.2
Release Date: Wed Nov 11 22:06:48 CST 2015
CC: clang
CXX: clang++
F77: /usr/local/bin/gfortran
F90: /usr/local/bin/gfortran
Configure options: '--disable-option-checking' '--prefix=/usr/local/Cellar/mpich/3.2_1' '--disable-dependency-tracking' '--disable-silent-rules' '--mandir=/usr/local/Cellar/mpich/3.2_1/share/man' 'CC=clang' 'CXX=clang++' 'FC=/usr/local/bin/gfortran' 'F77=/usr/local/bin/gfortran' '--cache-file=/dev/null' '--srcdir=.' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=-lpthread ' 'CPPFLAGS= -I/private/tmp/mpich-20160606-48824-1qsaqn8/mpich-3.2/src/mpl/include -I/private/tmp/mpich-20160606-48824-1qsaqn8/mpich-3.2/src/mpl/include -I/private/tmp/mpich-20160606-48824-1qsaqn8/mpich-3.2/src/openpa/src -I/private/tmp/mpich-20160606-48824-1qsaqn8/mpich-3.2/src/openpa/src -D_REENTRANT -I/private/tmp/mpich-20160606-48824-1qsaqn8/mpich-3.2/src/mpi/romio/include'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Checkpointing libraries available:
Demux engines available: poll select
My-MacBook-Pro-2:xCode_TrapSim user$ clang --version
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
This seems like it should be a trivial problem, but I can't seem to figure it out. Why would MPI_Barrier cause this simple code to seg fault?

It is super hard to tell what is wrong with your installation. However, if you are free to use any MPI flavor, you could try this one:
http://www.owsiak.org/?p=3492
All I can say is that it works with Open MPI:
~/opt/usr/local/bin/mpicxx -g -O0 -Wall barrierTestExec.cpp -o barrierTestExec
~/opt/usr/local/bin/mpiexec -n 2 ./barrierTestExec
and I get no error in my case. It really seems to be environment-specific.

Related

Using gprof with LULESH benchmark

I've been trying to compile and run LULESH benchmark
https://codesign.llnl.gov/lulesh.php
https://codesign.llnl.gov/lulesh/lulesh2.0.3.tgz
with gprof, but I always get a segmentation fault. I changed these lines in the Makefile:
CXXFLAGS = -g -pg -O3 -I. -Wall
LDFLAGS = -g -pg -O3
[andrestoga#n01 lulesh2.0.3]$ mpirun -np 8 ./lulesh2.0 -s 16 -p -i 10
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 30557 on node n01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The gprof web page says the following:
If you are running the program on a system which supports shared
libraries you may run into problems with the profiling support code in
a shared library being called before that library has been fully
initialised. This is usually detected by the program encountering a
segmentation fault as soon as it is run. The solution is to link
against a static version of the library containing the profiling
support code, which for gcc users can be done via the '-static' or
'-static-libgcc' command line option. For example:
gcc -g -pg -static-libgcc myprog.c utils.c -o myprog
I added the -static command line option and still got a segmentation fault.
I found a PDF in which LULESH was profiled by adding the -pg command line option to the Makefile, although it does not say exactly which changes were made.
http://periscope.in.tum.de/releases/latest/pdf/PTF_Best_Practices_Guide.pdf
Page 11
Could someone help me out please?
Best,
Make sure all libraries are loaded:
openmpi (which you have done already)
gcc
You can try parameters that help you determine whether the problem is a lack of resources on your machine. If the machine does not support that number of processes, check how many processes (MPI or not) it does support by looking at the architecture topology. That tells you how many jobs/processes you can reasonably launch on the system.
Very quick run:
mpirun -np 1 ./lulesh2.0 -s 1 -p -i 1
Running problem size 1^3 per domain until completion
Num processors: 1
Num threads: 2
Total number of elements: 1
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
cycle = 1, time = 1.000000e-02, dt=1.000000e-02
Run completed:
Problem size = 1
MPI tasks = 1
Iteration count = 1
Final Origin Energy = 4.333329e+02
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 0.000000e+00
TotalAbsDiff = 0.000000e+00
MaxRelDiff = 0.000000e+00
Elapsed time = 0.00 (s)
Grind time (us/z/c) = 518 (per dom) ( 518 overall)
FOM = 1.9305019 (z/s)

"sh: ./<file> not found" error when trying to execute a file

I've come across the weirdest problem I have ever met. I'm cross-compiling an app for an ARM CPU with Linux on board. I'm using buildroot, and all goes well until I try to run the application on the target: I get -sh: ./hw: not found. E.g.:
$ cat /tmp/test.cpp
#include <cstdio>
#include <vector>

int main(int argc, char** argv){
    printf("Hello Kitty!\n");
    return 0;
}
$ ./arm-linux-g++ -march=armv7-a /tmp/test.cpp -o /tftpboot/hw
Then I load the executable onto the target and run the following on the target:
# ./hw
-sh: ./hw: Permission denied
# chmod +x ./hw
# ./hw
-sh: ./hw: not found
# ls -l ./hw
-rwxr-xr-x 1 root root 6103 Jan 1 03:40 ./hw
There's more to it: when I build with the distro compiler, e.g. arm-linux-gnueabi-g++ -march=armv7-a /tmp/test.cpp -o /tftpboot/hw, the app runs fine!
I compared the executables using readelf -a -W /tftpboot/hw, but didn't notice much difference. I pasted both outputs here. The only thing I noticed is the line Version5 EABI, soft-float ABI vs. Version5 EABI. I tried removing the difference by passing either -mfloat-abi=softfp or -mfloat-abi=soft, but the compiler seems to ignore it. I suppose this doesn't really matter, though, as the compiler doesn't even warn.
I also thought, perhaps sh outputs this error if an executable is incompatible in some way. But on my host PC I see another error in this case, e.g.:
$ sh /tftpboot/hw
/tftpboot/hw: 1: /tftpboot/hw: Syntax error: word unexpected (expecting ")")
sh prints this weird error because it is trying to run your program as a shell script!
Your error ./hw: not found is probably caused by the dynamic linker (AKA ELF interpreter) not being found. Try compiling it as a static program with -static or running it with your dynamic loader: # /lib/ld-linux.so.2 ./hw or something like that.
If the problem is that the dynamic loader is named differently in your tool-chain and in your runtime environment you can fix it:
In the runtime environment: with a symbolic link.
In the tool-chain: use -Wl,--dynamic-linker=/lib/ld-linux.so.2

Application on another system crashes on startup without error message for sudo, Segmentation Fault for non-sudo

I have written a websocket++ server on Ubuntu 13.10 and am trying to execute it on Linux Mint 16.
I have installed all dependencies, and the first line under main is a cout which never fires.
This is the compile command:
g++ -o Dgn Dgn.cpp ed25519-donna-master/ed25519.o
-Og -std=c++0x -I ~/Dgn -D_WEBSOCKETPP_CPP11_STL_ -D_WEBSOCKETPP_NO_CPP11_REGEX_
-lboost_regex -lboost_system -L/usr/lib -lssl -lcrypto -pthread -lpqxx
-lboost_thread -ljson_spirit -lgmp -lgmpxx
If I execute it with sudo to use restricted ports, it fails immediately without an error message and returns to the command line.
If I execute it without sudo, it prints Segmentation Fault and returns immediately to the command line.
The directories in ~/Dgn are present on the new system.
I did a quick, simple test and checked to see if a basic websocket++ example could compile and execute normally, and it was successful.
Both systems are 64-bit. The only difference is the distro, but Linux Mint 16 is based on Ubuntu 13.10, and all setup commands were identical.
How can this be compiled so that it can execute on another system?
As a further test, I compiled it on the new system, and it works.
Is it not possible to compile on one system and run on another?
GDB
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7de58da in ?? () from /lib64/ld-linux-x86-64.so.2
The fact that the cout line never fires (I assume it ends with << std::endl) means that the crash happens before main is entered, typically in a static object constructor. The most straightforward way to debug it is to enable core dumps (e.g. ulimit -c unlimited; see man limits.conf) and inspect the dump with gdb. So far that's all I can think of. More details will help.

dtrace on linux does not always remove userland probes at program exit -- why?

(See end of this post if you want to see how I installed dtrace - for now I'll assume you already have it installed)
I made a custom probe with no problems at all by following these steps:
A. Create thing.d with my probe definition
provider thing {
probe test();
};
B. Create a simple main.cpp
#include <unistd.h> // sleep()
#include "thing.h"

int main()
{
    // Fire my probe
    THING_TEST();
    // Something to prevent immediate exit
    for(;;)
        sleep(1);
    // Bye
    return 0;
}
C. Compile (but don't link) main.cpp. Note how you must define _DTRACE_VERSION or else your probes will be commented out in thing.h.
g++ -D _DTRACE_VERSION -c main.cpp -o main.o
D. Build the probe object file (note that you must include main.o as part of this)
dtrace -G -s thing.d -o thing.o main.o
E. Link it all up
g++ main.o thing.o -o thing
Here's the problem: Run the app, and terminate with CTRL-C (obviously the app won't stop by itself because of the infinite loop...).
In fact, do this a few times.
Now, from a superuser terminal:
# dtrace -l | grep thing
322991 thing28217 thing main test
322992 thing28403 thing main test
322994 thing28636 thing main test
These guys are just hanging around... It's like they never got de-registered or something.
I've run "ps" to see if there are any procs with those pids (28217, 28403, 28636) and nope, nothing there.
INTERESTINGLY, if I remove the infinite loop from main.cpp (the sleep() loop) and just let the app immediately exit, then the probes are properly removed. So it seems like the issue has to do with CTRL-C being detected inside of sleep() - perhaps some kind of atexit() handler isn't being called?
Here's my system info:
$ uname -a
Linux beavis 3.5.0-26-generic #42-Ubuntu SMP Fri Mar 8 23:18:20 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ g++ --version
g++ (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2
DTRACE INSTALLATION
I am not using the default dtrace that comes with Ubuntu, but rather the dtrace4linux stuff which I installed like this:
http://askubuntu.com/questions/60940/how-do-i-install-dtrace
NOTE: I am using the latest version from Paul Fox's site:
ftp://crisp.dyndns-server.com/pub/release/website/dtrace/dtrace-20130317.tar.bz2

Problems installing srilm 1.6.0 with llvm-gcc 4.x

The problem
I am trying to build SRI's Language Modeling tool, srilm version 1.6.0 on my mac, and coming across some fairly strange compilation problems ("strange" in that a few hours of Google-fu did not help), so I am turning to you to see if anyone sees how I can fix this.
I have already checked that I have the required dependencies and followed the install instructions as well as gone through the build troubleshooting section of the FAQ.
System Specifications
I have a pretty vanilla install of OS X, with some packages installed through homebrew. XCode 4.3.2 (latest version) is installed. Here are the other relevant system details.
OS version
Mac OS X 10.7.4
gcc -v printout
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-apple-darwin11.0.0/4.6.1/lto-wrapper
Target: x86_64-apple-darwin11.0.0
Configured with: ../gcc-4.6.1/configure --enable-languages=fortran,c++
Thread model: posix
gcc version 4.6.1 (GCC)
g++ -v printout
$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-apple-darwin11.0.0/4.6.1/lto-wrapper
Target: x86_64-apple-darwin11.0.0
Configured with: ../gcc-4.6.1/configure --enable-languages=fortran,c++
Thread model: posix
gcc version 4.6.1 (GCC)
uname -a printout
$ uname -a
Darwin MacBook-Air.local 11.4.0 Darwin Kernel Version 11.4.0: Mon Apr 9 19:32:15 PDT 2012; root:xnu-1699.26.8~1/RELEASE_X86_64 x86_64
The error itself
The following is the end of the output produced by running make World from the srilm top-level directory. Everything up until this point compiles fine in any of the following circumstances:
I run make World on its own.
I run make World MACHINE_TYPE=macosx
I run make World MACHINE_TYPE=macosx-m64 (specific makefile for 64bit processors)
I run make World MACHINE_TYPE=macosx-m32 (specific makefile for 32bit processors)
And the error that pops up is always the same (shown below).
stderr printout
$ make World
(...) # a bunch of stuff compiles with no errors or warnings
c++ -Wreturn-type -Wimplicit -DINSTANTIATE_TEMPLATES -DHAVE_ZOPEN -I/usr/include -I. -I../../include -DHAVE_ZOPEN -c -g -O2 -fno-common -o ../obj/macosx/LatticeIndex.o LatticeIndex.cc
LatticeIndex.cc:78:6: error: variable length array of non-POD element type
'NBestWordInfo'
makeArray(NBestWordInfo, roundedNgram, len + 1);
^
../../include/Array.h:93:33: note: expanded from macro 'makeArray'
# define makeArray(T, A, n) T A[n]
^
LatticeIndex.cc:126:4: warning: data argument not used by format string
[-Wformat-extra-args]
(float)ngram[0].start);
^
LatticeIndex.cc:128:4: warning: data argument not used by format string
[-Wformat-extra-args]
(float)(ngram[len-1].start + ngram[len-1].duration));
^
2 warnings and 1 error generated.
make[2]: *** [../obj/macosx/LatticeIndex.o] Error 1
make[1]: *** [release-libraries] Error 1
make: *** [World] Error 2
Any idea what could be going wrong? It seems to compile fine on some other people's macs in my department, and I've checked their makefiles for differences, but nothing popped up. No one here has any idea why the build fails, but we'd really appreciate it if you can help us out. Thanks in advance for any help you can provide me with! :-)
The problem is due to Apple using llvm-gcc/clang, which does not support variable length arrays of non-POD element types (this is what the error message reports for NBestWordInfo). This problem can actually be addressed by modifying $SRILM/dstruct/src/Array.h, and has been noted and addressed in the upcoming release of srilm.
For the time being, on a mac, build srilm using g++ 4.2 instead, using the following command:
$ make MACHINE_TYPE=macosx-m64 CXX=g++-4.2 World
This builds srilm without problem on all my macs.