I have been running a program I developed in C++ with OpenMPI version 1.6.5 in Ubuntu 14.04.
Everything was working fine (i.e. the program was executing as it was supposed to) until I quit it using Ctrl+C at a point, as I realised I ran it with a wrong input value and could not be bothered to wait for it to complete (big, rookie mistake!).
After I changed the variable value and recompiled the program (successfully), I tried to run the program again with mpirun -np 8 program_name. However, OpenMPI returned the following error:
mpirun has exited due to process rank 5 with PID 3363 on
node Hal exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
I tried to recompile the program and run it multiple times, but always got the same error, with a different process rank each time depending on whichever was called first (I assume). Since I think the problem is related to my "not so nice" way of quitting the previous run, I restarted the computer, but the error is still there.
Is there a command to shut down all MPI runs or a file to clear to see if that was the problem?
Thank you very much in advance!
While writing an x86 WinAPI-based debugger, I've encountered a rare condition where the debuggee (which usually works well) suddenly terminates with EXCEPTION_ACCESS_VIOLATION after I attach to it with my native debugger. I can reliably reproduce this on seemingly any application (I tried a .NET Hello World-style application and notepad.exe on multiple Windows 10 machines).
Essentially I've written a simple WaitForDebugEvent loop:
CreateProcessW(L"C:\\Windows\\SYSWOW64\\notepad.exe", […], CREATE_SUSPENDED, […]);
DebugActiveProcess(processId);
DEBUG_EVENT debugEvent = {};
while (WaitForDebugEvent(&debugEvent, INFINITE)) {
switch (debugEvent.dwDebugEventCode) {
// log all the events
}
ContinueDebugEvent(debugEvent.dwProcessId, debugEvent.dwThreadId, DBG_EXCEPTION_NOT_HANDLED);
}
DebugActiveProcessStop(processId);
(here's the full listing: I won't paste it all here, because there's some additional non-essential boilerplate there; the MCVE is 136 lines long)
For the sake of an example, I'll just log all the debugger events and detect whether the debuggee is ready to "proceed normally" or it will terminate due to an exception.
Most of the time, my debugging session looks like that:
CREATE_PROCESS_DEBUG_EVENT (which reports creation of both the process and its initial thread)
LOAD_DLL_DEBUG_EVENT (I was never able to get the name for this DLL, but this is documented in MSDN)
CREATE_THREAD_DEBUG_EVENT (which, I suspect, is a thread injected by debugger)
LOAD_DLL_DEBUG_EVENT […] — after this, many DLLs get loaded into the target process and everything looks okay, the process works as intended
But sometimes (in about 1.5% of all runs), the event sequence changes:
CREATE_PROCESS_DEBUG_EVENT
LOAD_DLL_DEBUG_EVENT
CREATE_THREAD_DEBUG_EVENT
EXCEPTION_DEBUG_EVENT: EXCEPTION_ACCESS_VIOLATION (which I never was able to gather details for: it reports a DEP violation, and the address is empty)
After that, I cannot proceed with debugging, because my debuggee is in exception state and will terminate soon. I was never able to catch notepad.exe crash without my debugger attached (and I doubt it is that bad and will crash for no reason), so I suspect that my debugger causes these exceptions.
One bizarre detail is that I could "fix" the situation by calling Sleep(1) immediately after WaitForDebugEvent. So, this is possibly some sort of race condition, but race condition between what? Between the debugger thread and other threads in the debuggee? Is it a thing? How are we supposed to debug other applications, then? How could actual debuggers work if it is a thing?
I couldn't reproduce the issue with the same code compiled for x64 CPU (and debugging an x64 process).
What could actually cause this erroneous behavior? I've carefully read the documentation about the API functions I call, and checked some other debugger examples online, but still wasn't able to find what's wrong with my debugger: it looks like I follow all the right conventions.
I have tried to debug my debuggee with WinDBG while it is still paused in my debugger, but had no luck doing that. First of all, it's difficult to attach to the debuggee with another debugger (WinDBG only allows to use non-intrusive mode, which is less functional it seems?), and the call stacks for the process' threads aren't usually meaningful.
Steps to reproduce
Checkout this repository, compile with MSVC and then execute in cmd:
Debug\NetRuntimeWaiter.exe > log.txt
It is important to redirect output to the log file and not show it in the terminal: without that, timings for the log writer get changed, and the issue won't reproduce (due to a possible race condition I mentioned earlier?).
Usually the program will start and terminate 1000 notepads in about 10 seconds, and 10-15 of 1000 invocations will hold the error condition (i.e. EXCEPTION_ACCESS_VIOLATION).
DebugActiveProcess (and the undocumented DbgUiDebugActiveProcess, which DebugActiveProcess calls internally) has a serious design problem: after calling NtDebugActiveProcess, it creates a remote thread in the target process via a DbgUiIssueRemoteBreakin call. The new thread runs DbgUiRemoteBreakin, which calls DbgBreakPoint and then RtlExitUserThread.
None of this is documented or explained. There is only this note in the DebugActiveProcess documentation:
After all of this is done, the system resumes all threads in the
process. When the first thread in the process resumes, it executes a
breakpoint instruction that causes an EXCEPTION_DEBUG_EVENT
debugging event to be sent to the debugger.
Of course, this is wrong. Why would DbgUiRemoteBreakin be the first (??) thread? Which thread resumes first is undefined. Why not state it precisely: an additional (but not the first) thread is created in the process, and that thread executes the breakpoint?
When the process is already running, creating this additional thread causes no problems. But if we create the process in a suspended state and then just call DebugActiveProcess, DbgUiRemoteBreakin really does become the first thread to execute in the process, and process initialization is done on that thread instead of on the first thread that was created. On XP this always made process initialization fail at the connect-to-csrss phase (csrss only waits for a connection from the first thread created in the process). On later systems this is fixed and the process can execute as usual. But it also may not, because the thread on which it was initialized has exited. This can cause subtle problems.
The solution here is to use NtDebugActiveProcess in place of DebugActiveProcess.
The debug object can be created either via DbgUiConnectToDbg() and then retrieved via DbgUiGetThreadDebugObject() (the system stores the debug object in the thread's TEB), or directly by calling NtCreateDebugObject.
Also, if the debuggee process is created from another process (B), we can do the following:
Duplicate the debug object from the debugger process into process B.
Call DbgUiSetThreadDebugObject(hDdg) just before calling CreateProcessW with DEBUG_ONLY_THIS_PROCESS or DEBUG_PROCESS. The system will use DbgUiGetThreadDebugObject() to get the debug object from your thread and pass it to the low-level process creation API.
Remove the debug object from your thread via DbgUiSetThreadDebugObject(0).
It really does not matter who creates the process with the debug object. What matters is who handles the events posted to that debug object.
All the undocumented API definitions can be taken from ntdbg.h; then link with ntdll.lib or ntdllp.lib.
I have written a C++ program and I am executing it in GNOME Terminal (I am on Ubuntu). I press Ctrl+Z, which suspends the process. Later on, I execute % in the same terminal, which resumes execution.
From what I've read, Ctrl+Z sends a TSTP signal to the process, which tells it to stop execution. But TSTP is polite, in the sense that the process is allowed to continue until it decides it can stop. In my C++ program code, I didn't do anything to explicitly deal with TSTP signals. So, my question is: what things inside my C++ code will continue running in spite of the TSTP signal? For example, if I have a file stream open, will it wait until it is closed? I expect an overall answer, not too deep or covering all the details. I just want an idea of how this happens.
Your program continues running while the SIGTSTP handler executes. Since you haven't set one up, you get the default signal handling behavior, which is for the process to be stopped.
While your process is stopped, it simply isn't scheduled for execution. Files don't get closed, nor is stopping delayed until files get closed (unless done in the signal handler).
This website looks like it has a helpful explanation of how a handler can be installed to perform some tasks and then have the default stopping behavior:
http://man7.org/tlpi/code/online/dist/pgsjc/handling_SIGTSTP.c.html
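As a rough illustration of that pattern (not the code from the linked page), here is a minimal POSIX sketch. The names on_tstp, install_tstp_handler, tstp_requested and stop_now_and_rearm are made up for this example. The handler only sets a flag; the normal program flow is assumed to poll that flag, do whatever cleanup it wants (flush files, restore the terminal), and then stop itself with the default action.

```cpp
#include <signal.h>

// Written by the handler, read by normal code. sig_atomic_t is the type
// that is safe to assign from inside a signal handler.
static volatile sig_atomic_t g_tstp_seen = 0;

// Hypothetical handler name: it only records that SIGTSTP arrived.
extern "C" void on_tstp(int) { g_tstp_seen = 1; }

// Replace the default "stop the process" action with our handler.
bool install_tstp_handler() {
    struct sigaction sa {};
    sa.sa_handler = on_tstp;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  // restart interrupted system calls where possible
    return sigaction(SIGTSTP, &sa, nullptr) == 0;
}

// True once Ctrl+Z (or kill -TSTP) has been seen since the last stop.
bool tstp_requested() { return g_tstp_seen != 0; }

// Call from normal code after finishing any cleanup: restore the default
// action, stop for real, and re-install the handler once resumed.
void stop_now_and_rearm() {
    g_tstp_seen = 0;
    signal(SIGTSTP, SIG_DFL);
    raise(SIGTSTP);            // the process is stopped here until SIGCONT
    install_tstp_handler();
}
```

A main loop would call install_tstp_handler() once at startup, check tstp_requested() at convenient points, and call stop_now_and_rearm() when it is ready to be suspended. Without any of this, the default action applies and the process is stopped immediately, as described above.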
A very odd one: I have a Qt 4 embedded app that runs on the framebuffer; it normally runs from inittab as the only UI on the box. There is an option to put the machine to sleep. I do the normal thing: open /sys/power/state, write "mem" and close it (using QFile). Very straightforward, and it works fine EXCEPT the first time the app runs after booting. If it sleeps then, it receives SIGUSR2 and just hangs forever with a blank screen. The hang occurs after resume.
But, if I manually kill it and run it again .. sleep works fine again. Note that it must do the failed sleep attempt and be killed - whereafter all is peachy every time it runs, SIGUSR2 never shows up again.
I have already tried to trap the signal - doesn't trap. No idea why - except that I know that pthreads uses SIGUSR2 ..
Stumped. Ideas? Clues?
[edit] I found that if I fork() and write to /sys/power/state in the child and exit it sort of solves the problem, but it doesn't solve the mystery..
[edit 2] I subsequently found that in fact the child is still hanging when the machine is shut down (causing it to hang forever without shutting down..), although the ugly hack just mentioned did fix the hang coming out of suspend - I have not figured out what is happening but finally solved it by just using a script/daemon: in a while loop it checks a file in /tmp for an action and either halts or suspends and restarts the app afterwards .. not pretty but it works.
And still the mystery of the SIGUSR2 hang remains ..
I am using waitpid as given
waitpid(childPID, &status, WNOHANG);
This is used in a program inside an infinite loop that forks when needed, and the parent waits for the child process to return. But recently I have come across a problem where the program exits after printing this to cerr:
waitpid: No child processes
This is always the last log from the program before it crashes/exits. I know that it does not segfault or anything, because I have a traceback function that prints the last 10 addresses the program accessed. So does it mean that the program exited the loop after finding that there is no child process? Or is there something sinister at work here?
I guess what is happening here is that the fork system call is failing due to a lack of available entries in the process table. You can call perror when fork returns -1. I think the error should be EAGAIN, which perror reports as "Resource temporarily unavailable".
I am facing a strange issue on Windows CE:
I am running 3 EXEs:
1) The first EXE does some work every 8 minutes unless the exit event is signaled.
2) The second EXE does some work every 5 minutes unless the exit event is signaled.
3) The third EXE runs a while loop, and inside the loop it does some work at random times.
This while loop continues until the exit event is signaled.
The exit event is a global event and can be signaled by any process.
The problem is:
When I run the first EXE, it works fine.
When I run the second EXE, it works fine.
When I run the third EXE, it works fine.
When I run all the EXEs together, only the third EXE runs, and no instructions get executed in the first and second.
As soon as the third EXE is terminated, the first and second start processing.
Can it be that the while loop in the third EXE is taking all the CPU cycles?
I haven't tried adding a Sleep, but I think that might do the trick.
But the OS should give CPU time to all processes ...
Any thoughts?
Put a Sleep in the while loop of the third EXE, so it sleeps each time through the loop, and see what happens. Even if it doesn't fix this particular problem, it is never good practice to poll with a while loop, and even using Sleep inside a loop is a poor substitute for a proper timer.
On the MSDN, I also read that CE allows for (less than) 32 processes simultaneously. (However, the context switches are lightning fast...). Some are already taken by system services.
(From Memory) Processes in Windows CE run until completion if there are no higher priority processes running, or they run for their time slice (100ms) if there are other processes of equal priority running. I'm not sure if Windows CE gives the process with the active/foreground window a small priority boost (just like desktop Windows), or not.
In your situation the first two processes are starved of processor time so they never run until the third process exits. Some ways to solve this are:
Make the third process wait/block on some multi-process primitives (mutex, semaphore, etc) and a short timeout. Using WaitForMultipleObjects/WaitForSingleObject etc.
Make the third process wait using a call to Sleep every time around the processing loop.
Boost the priority of the other processes so when they need to run they will interrupt the third process and actually run. I would probably make the least often called process have the highest priority of the three processes.
The other thing to check is that the third process does actually complete its tasks in time, and does not peg the CPU trying to do its thing normally.
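The first suggestion (block on the exit event with a short timeout instead of spinning) can be sketched as follows. This is only an illustration in standard C++11, used as a portable stand-in for a Win32 event handle and WaitForSingleObject with a timeout; on Windows CE itself you would use the native calls. ExitEvent and run_until_exit are hypothetical names.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Stand-in for a manual-reset event: once signaled, it stays signaled.
class ExitEvent {
public:
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_all();
    }

    // Blocks for at most `timeout`. Returns true if the event is signaled,
    // false if the timeout expired first. The thread uses no CPU while blocked.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return signaled_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Calls work() once per period until the exit event is signaled.
// Returns how many times work() ran.
template <typename Work>
int run_until_exit(ExitEvent& exit_event, std::chrono::milliseconds period, Work work) {
    int iterations = 0;
    while (!exit_event.wait_for(period)) {  // sleeps here instead of spinning
        work();
        ++iterations;
    }
    return iterations;
}
```

Because the loop blocks in wait_for between iterations, the scheduler is free to run other processes of the same priority, and the loop still reacts to the exit event as soon as it is signaled.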
Yeah, I think that is not a good solution. I may try to use a timer and see the results.