mpiexec fails as MPI init aborts - mpich

I am trying to install MPICH 2 on a 64-bit machine running on Ubuntu 11.04 (Natty Narwhal). I used
sudo apt-get install mpich2
First I was surprised to see that mpd was not installed. On looking up on Google, I saw that Hydra is the new default package manager.
So I tried to run my MPI code. I got the following error.
> -------------------------------------------------------------------------------------------
> [ip-10-99-75-58:02212] [[INVALID],INVALID] ORTE_ERROR_LOG: A
> system-required executable either could not be found or was not
> executable by this user in file
> ../../../../../../orte/mca/ess/singleton/ess_singleton_module.c at
> line 357 [ip-10-99-75-58:02212] [[INVALID],INVALID] ORTE_ERROR_LOG: A
> system-required executable either could not be found or was not
> executable by this user in file
> ../../../../../../orte/mca/ess/singleton/ess_singleton_module.c at
> line 230 [ip-10-99-75-58:02212] [[INVALID],INVALID] ORTE_ERROR_LOG: A
> system-required executable either could not be found or was not
> executable by this user in file ../../../orte/runtime/orte_init.c at
> line 132
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process
> is likely to abort. There are many reasons that a parallel process
> can fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_set_name failed --> Returned value A system-required
> executable either could not be found or was not executable by this
> user (-127) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process
> is likely to abort. There are many reasons that a parallel process
> can fail during MPI_INIT; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> ompi_mpi_init: orte_init failed --> Returned "A system-required
> executable either could not be found or was not executable by this
> user" (-127) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> -------------------------------------------------------------------------------------------
First of all, it looks to me as an Open MPI error. But I installed MPICH 2 and not Open MPI.
Secondly, I am at a fix on how to handle this as all help seems to be directed to Open MPI users. Am I missing something?

I have the same problem on Ubuntu 12.04. I find my problem is because I have both open-mpi and mpich2 on my computer. When I compile my program using mpicc, it will be linked to open-mpi not mpich2. To fix this problem, you can use "mpicc.mpich2" to compile your program and then use "mpiexec.mpich2" to execute your code.

Indeed, these error messages are all Open MPI errors. For some reason, you appear to also have a (badly configured?) copy of Open MPI installed somewhere. You can check which particular file you are executing when you type mpiexec by running which mpiexec. I believe that you can compare this with the result of:
dpkg --listfiles mpich2
(or similar) in order to figure out where the MPICH2 package was installed.

I had this happen to me, and I found the issue. Somewhere on your system during startup LD_PRELOAD was set to point to libmpi.so in OpenMPI.
Example:
export LD_PRELOAD=<some_directory>/openmpi/1.4.4/lib/libmpi.so
The result is that MPICH2 fails. Just do 'unset LD_PRELOAD' before running MPICH2 and the issue goes away.
Note that setting LD_PRELOAD to OpenMPI's libmpi.so is actually required sometimes for OpenMPI to work, so unsetting may break OpenMPI for you. Just remember to reset it if you need to use OpenMPI.

Related

Fatal error: debugger does not support channel locks

I am trying to use ocamldebug with my project, to understand why a 3rd party lib I'm using is not behaving the way I expected.
https://ocaml.org/manual/debugger.html
The OCaml debugger is invoked by running the program ocamldebug with the name of the bytecode executable file as first argument
I have added (modes byte exe) to my dune file.
When I run dune build I can see the bytecode file output, alongside the exe, as _build/default/bin/cli.bc
When I pass this to ocamldebug I get the following error:
ocamldebug _build/default/bin/cli.bc
OCaml Debugger version 4.12.0
(ocd) r
Loading program... done.
Fatal error: debugger does not support channel locks
Lost connection with process 33035 (active process)
between time 170000 and time 180000
Restart from time 170000 and try to get closer of the problem ? (y or n)
If I choose y the console seems to hang indefinitely.
I found the source of the error here:
https://github.com/ocaml/ocaml/blob/f68acd1a618ac54790a8347fad466084f15a9a9e/runtime/debugger.c#L144
/* The code in this file does not bracket channel I/O operations with
Lock and Unlock, so fail if those are not no-ops. */
if (caml_channel_mutex_lock != NULL ||
caml_channel_mutex_unlock != NULL ||
caml_channel_mutex_unlock_exn != NULL)
caml_fatal_error("debugger does not support channel locks");
...but I don't know what might be triggering it.
My project is using cmdliner and lwt ...I think at this early point of execution it hasn't hit any lwt code though.
Is ocamldebug incompatible with cmdliner?
If that's the case then I will need to make a new entrypoint just for debugging I guess. (currently the bin/cli is the only executable artefact in my project, the code I need to debug is all under lib/s)
It looks like that the OCaml debugger is broken for your version of macOS. Please, report the issue to the OCaml issue tracker including the detailed information on your system. I can't reproduce it on my machine, but I am using a pretty old version of macOS (10.11.6) and I have the 4.12 debugger working flawlessly.
As a workaround, try using an older version of OCaml, as this channel lock test was introduced very recently you can install any version prior to 4.12,
opam switch create 4.11.0
eval $(opam env)
Then, do not forget to rebuild your project (previously installing the required dependencies),
opam install lwt cmdliner
dune build
and then you can use the debugger to your taste.

No source file for Netaccel_link error on running program

I have an OCaml program that worked fine on Ubuntu 16 but when recompiled and run on Ubuntu 20 I get the following error:-
$ ocamldebug ./linearizer
OCaml Debugger version 4.08.1
(ocd) r
Loading program... done.
Time: 89534
Program end.
Uncaught exception: Sys_error "Illegal seek"
(ocd) b
Time: 89533 - pc: 624888 - module Netaccel_link
No source file for Netaccel_link.
I thought this was due to missing dev libraries but:-
$ sudo apt install libocamlnet-ocaml-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
libocamlnet-ocaml-dev is already the newest version (4.1.6-1build6).
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.
What setup step am I missing on Ubuntu 20?
This looks like a regression bug in libocamlnet and you should report an issue there or, I am a bit pessimistic that you will get any response, you can try to debug the issue yourself.
The problem that you are facing has nothing to do with missing libraries (they will be reported during installation or, if the package is broken, end up in linker errors). It may result, however, from some misconfiguration of the system. If that is true, then you're lucky as you can fix it yourself.
I will give you some advice that might help you in debugging this issue. For more, please try using discuss.ocaml.org as a more suitable media (SO doesn't favor this kind of a discussion and we might get deleted by admins).
The illegal seek exception is thrown when the seek operation is applied on a non-regular file, aka ESPIPE Unix error. So check your inputs. It could be that what was previously regarded as a file in Ubuntu is now a pipe or a socket.
Try to use ltrace or strace to pinpoint the culprit e.g.,
ltrace ./linearizer
or, if it overwhelms you, try strace
strace ./linearizer
Instead of using ocamldebug you can use plain gdb. You can use gdb's interfaces to provide the path to the source code (though most likely it won't work since ocamlnet is not compiled with debug information). I believe that it will give you a more meaningful backtrace.
Instead of using the system installation try using opam. Install your dependencies with opam and try older versions as well as newer versions of the OCaml compiler. Also, try different versions of ocamlnet. Ideally, try to reproduce the environment that used to work for you.
When nothing else works, you can use objdump -d and look at the disassembly of your binary. OCaml is using a pretty readable and intuitive name mangling scheme (<module_name>__<function_name>_<uid>), so you can easily find the source code (search for <module_name>.ml file and look for the <function_name> there)
Finally, just use docker or any other container to run your application. Consider switching from ocamlnet to something more modern and supported.

error: initial process state wasn't stopped: exited

I'm having a problem debugging a command line program on OS X. I've used this same source file with the same g++ command line hundreds of times to test things with the Crypto++ library.
Under GDB, I get the following after loading the EXE:
$ gdb ./cryptopp-test.exe
...
(gdb) r
Starting program: /Users/jwalton/cryptopp-test.exe
Unable to find Mach task port for process-id 42811: (os/kern) failure (0x5).
Under LLDB, I get the following:
$ lldb ./cryptopp-test.exe
Current executable set to './cryptopp-test.exe' (x86_64).
(lldb) r
error: initial process state wasn't stopped: exited
I've recompiled the program a few times, and I can't get it to run under a debugger. I'm getting a segfault when trying to run outside the debugger too, so that may be a symptom here also.
OS X is 10.8.5, and Xcode is 5.1.1 (5B1008). Everything is fully patched. The only thing to change recently is signing up for a developer account, which is broken thanks to Apple's DRM crap. I can't seem to get any of it to work with Xcode or the command line even though Roots and Certificates are in my Keychain. But this program does not use code signing.
What is causing the initial process state wasn't stopped: exited error, and how do I fix it?
The errors that you have received are usually a direct correlation of a codesigning issue, not with your executable, but with gdb and lldb themselves.
You have a couple of options:
Launch gdb or lldb as sudo (which ignores the codesign req to run executables)
Create a codesigning certificate for gdb or lldb in Keychain.app
Obviously the first option is quickest, but probably should be avoided as it opens up the possibility of bad things happening with elevated permissions.
With option #2 you can likely get gdb or lldb properly working by doing this:
Launch /Applications/Utilities/Keychain Access.app
Select the Keychain Access -> Certificate Assistant -> Create a Certificate...
Choose a name for the new certificate (for example lldb-cert or gdb-cert)
Set Identity Type to Self Signed Root
Set Certificate Type to Code Signing
Select the Let me override defaults option
Continue until the "Specify a Location For The Certificate" screen appears
Set Keychain to System and Continue
In the view showing your certificates, double-click on the one just created and then set "When using this certificate" to "Always Trust"
In Terminal:
codesign -f -s "gdb-cert" /path/to/gdb (or) "lldb-cert" /path/to/lldb
You might have to restart for this to effectively take hold.
There are more concise instructions here for gdb and here for lldb on the codesigning process.

strange gdbserver output shows at my target device

when I run gdbserver on uclinux target device blackfin bfin537/stamp it work perfectly but it always generates annoying output
Request to get for unknown register 232
Request to get for unknown register 236
it is extremely annoying since each step out or step in gdb client results several of that error on the output screen terminal RS232 I was recommended to change the bfin compiler version and rebuild gdb server with different version of uclinux ,.... none of them worked and even compiling my code with different versions of bfin-uclinux-gcc didn't solve my problem.
I decided to recompile gdbserver.c and eliminate the line that generates the error but in fact that line does not exists in any of the gdbserver related files for compiling.
I decided to suppress the stderr output of gdb server by running gdbserver :3298 process 1>/dev/null 2>/dev/null but this didn't solve it
how can I configure my gdb client to asks for specific registers (bfin-uclinux-gdb) related to bfin537-stamp?
I think this error originates somewhere else in uclinux system background system processes.
I want to find which process writes in stderr,stdout which I am unaware of It and I want to suppress its outputs?
Shall I change something in the busybox shell or /bin/bash to eliminates all stderr outputs
which means if I send all the parent shell output or stderr to /dev/null
Thanks

Need GLIBC debug information from rpmbuild of updated source

I'm working on RHEL WS 4.5.
I've obtained the glibc source rpm matching this system, opened it to get its contents using rpm2cpio.
Working in that tree, I've created a patch to mtrace.c (i want to add more stack backtrace levels) and incorporated it in the spec file and created a new set of RPMs including the debuginfo rpms.
I installed all of these on a test vm (created from the same RH base image) and can confirm that my changes are included.
But with more complex executions, I crash in mtrace.c ... but gdb can't find the debug information so I don't get line number info and I can't actually debug the failure.
Based on dates, I think I can confirm that the debug information is installed on the test system in /usr/src/debug/glibc-2.3.6/
I tried
sharedlibrary libc*
in gdb and it tells me the symbols are already loaded.
My test includes a locally built python and full symbols are found for python.
My sense is that perhaps glibc isn't being built under rpmbuild with debug enabled. I've reviewed the glibc.spec file and even built with
_enable_debug_packages
defined as 1 which looked like it might influence the result. My review of the configure scripts invoked during the rpmbuild build step didn't give me any hints.
Hmmmm .. just found /usr/lib/debug/lib/libc-2.3.4.so.debug
and /usr/lib/debug/lib/tls/i486/libc-2.3.4.so.debug
but both of these are reported as stripped by the file command.
It appears that you are installing non-matching RPMs:
/usr/src/debug/glibc-2.3.6
just found /usr/lib/debug/lib/libc-2.3.4.so.debug
There are not for the same version; there is no way they came from the same -debuginfo RPM.
both of these are reported as stripped by the file command.
These should not show as stripped. Either they were not built correctly, or your strip is busted.
Also note that you don't actually have to get all of this working to debug your problem. In the RPMBUILD directory, you should be able to find the glibc build directory, with full-debug libc.so.6. Just copy that library into your VM, and you wouldn't have to worry about the debuginfo RPM.
Try verifying that debug info for mtrace.c is indeed present. First see if the separate debug info for GLIBC knows about a compilation unit called mtrace.c:
$ eu-readelf -w /usr/lib/debug/lib64/libc-2.15.so.debug > t
$ grep mtrace t
name (strp) "mtrace.c"
name (strp) "mtrace"
1 0 0 0 mtrace.c
[10480] "mtrace.c"
[104bb] "mtrace"
[5052] symbol: mtrace, CUs: 446
Then see if GDB actually finds the source file from the glibc-debuginfo RPM:
(gdb) set pagination off
(gdb) start # pause your test program right after main()
(gdb) set logging on
Copying output to gdb.txt.
(gdb) info sources
Quit GDB then grep for mtrace in gdb.txt and you should find something like /usr/src/debug/glibc-2.15-a316c1f/malloc/mtrace.c
This works with GDB 7.4. I'm not sure the GDB version shipped with RHEL 4.5 supports all the command used above. Building upstream GDB from source is in fact easier than Python though.
When trying to add strack traces to mtrace, make sure you don't call malloc() directly or indirectly in the GLIBC malloc hooks.