Graceful signal handling in Slurm - C++

I have an issue with gracefully exiting my Slurm jobs while saving data, etc.
I have a signal handler in my program which sets a flag, which is then queried in a main loop and a graceful exit with data saving follows. The general scheme is something like this:
#include <atomic>
#include <csignal>
#include <fstream>
#include <unistd.h>

namespace {
    std::atomic<bool> sigint_received = false;
}

void sigint_handler(int) {
    sigint_received = true;
}

int main() {
    std::signal(SIGTERM, sigint_handler);
    while (true) {
        usleep(10); // There are around 100 iterations per second
        if (sigint_received)
            break;
    }
    std::ofstream out("result.dat");
    if (!out)
        return 1;
    out << "Here I save the data";
    return 0;
}
Batch scripts are unfortunately complicated because:
I want hundreds of parallel, low-thread-count independent tasks, but my cluster allows only 16 jobs per user
srun in my cluster always claims a whole node, even if I don't want all cores, so in order to run multiple processes on a single node I have to use bash
Because of this, the batch script is this mess (2 nodes for 4 processes):
#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.
srun -N 1 -n 1 bash -c '
    ./my_program input1 &
    ./my_program input2 &
    wait
' &
srun -N 1 -n 1 bash -c '
    ./my_program input3 &
    ./my_program input4 &
    wait
' &
wait
Now, to propagate signals sent by Slurm, I have an even bigger mess like this (following this answer, in particular the double waits):
#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.
trap 'kill $(jobs -p) && wait' TERM
srun -N 1 -n 1 bash -c '
    trap '"'"'kill $(jobs -p) && wait'"'"' TERM
    ./my_program input1 &
    ./my_program input2 &
    wait
' &
srun -N 1 -n 1 bash -c '
    trap '"'"'kill $(jobs -p) && wait'"'"' TERM
    ./my_program input3 &
    ./my_program input4 &
    wait
' &
wait
For the most part it is working. But, firstly, I am getting error messages at the end of the output:
srun: error: nid00682: task 0: Exited with exit code 143
srun: Terminating job step 732774.7
srun: error: nid00541: task 0: Exited with exit code 143
srun: Terminating job step 732774.4
...
and, what is worse, something like 4-6 out of over 300 processes actually fail on if (!out) - errno gives "Interrupted system call". Again, guided by this, I guess that my signal handler is called twice - the second time during some syscall inside the std::ofstream constructor.
Now,
How do I get rid of the Slurm errors and get an actual graceful exit?
Am I correct that the signal is sent twice? If so, why, and how can I fix it?

Suggestions:
trap EXIT, not a signal. EXIT happens once; TERM can be delivered multiple times.
use declare -f to transfer code and declare -p to transfer variables to an unrelated subshell
kill can fail; I do not think you should && on it
use xargs (or parallel) instead of reinventing the wheel with kill $(jobs -p)
extract the "data" (input1 input2 ...) from the "code" (the work to be done)
Something along these lines:
# The input.
input="$(cat <<'EOF'
input1
input2
input3
input4
EOF
)"

work() {
    # Normally, write the work to be done here.
    # For each argument, run `my_program` in parallel.
    printf "%s\n" "$@" | xargs -d'\n' -n1 -P0 ./my_program
}

# For every two arguments, run `srun ...` with a shell that runs `work` in parallel.
# Note: declare -f outputs a source-able definition of the function.
# "No more hand escaping!"
# The work function is then called with the arguments passed by xargs inside the spawned shell.
xargs -P0 -n2 -d'\n' <<<"$input" \
    srun -N 1 -n 1 \
    bash -c "$(declare -f work)"'; work "$@"' --
The -P0 is specific to GNU xargs. GNU xargs handles exit status 255 specially; you can write a wrapper like xargs ... bash -c './my_program "$@" || exit 255' -- || exit 255 if you want xargs to terminate when any of the programs fails.
If srun preserves environment variables, then export the work function with export -f work and just call it within the child shell, like xargs ... srun ... bash -c 'work "$@"' --.
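On the C++ side, one way to address the second question (the EINTR hitting the std::ofstream constructor) is to install the handler with sigaction and SA_RESTART, so that a repeated SIGTERM restarts the interrupted open() instead of failing it. A minimal sketch, not part of the original program, assuming the same flag-based scheme as in the question:

#include <atomic>
#include <fstream>
#include <signal.h>
#include <unistd.h>

namespace {
    std::atomic<bool> sigterm_received = false;
}

extern "C" void sigterm_handler(int) {
    sigterm_received = true;
}

int main() {
    struct sigaction sa {};
    sa.sa_handler = sigterm_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;        // restart interrupted syscalls instead of failing with EINTR
    sigaction(SIGTERM, &sa, nullptr);

    while (!sigterm_received)
        usleep(10);

    std::ofstream out("result.dat"); // a late second SIGTERM no longer aborts the underlying open()
    if (!out)
        return 1;
    out << "Here I save the data";
    return 0;
}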

Related

How to build single-threaded TensorFlow 2.x from source

While building TensorFlow 2.x (for CPU) from source, what change should I make to force TensorFlow not to use more than 1 thread? If this is not possible, which specific C++ statements (and in which .cpp files) should I change to suppress the creation of multiple threads?
No matter how many CPUs/cores there are, I need 1 thread in total from TensorFlow 2.x.
Use top -H -b -n1 | grep program_name | wc -l to count the total number of threads.
The solution, in C++, is the options you can give to a session:
// set the number of worker threads
tensorflow::SessionOptions options;
tensorflow::ConfigProto & configuration = options.config;
configuration.set_inter_op_parallelism_threads(1);
configuration.set_intra_op_parallelism_threads(1);
configuration.set_use_per_session_threads(false);
mySession->reset(tensorflow::NewSession(options));
This way you will have only one worker thread.
But this does not ensure that the top -H -b -n1 | grep program_name | wc -l command returns only 1 thread. In fact, in the above example we are speaking about a worker thread; there is at least the main thread as well, which manages spawning the worker threads and collecting their results.

Bash run command in background inside subshell

I want to be able to bg a process inside a subshell as if it were not in a subshell.
$( sleep 3 & ) just ignores the ampersand.
I've tried:
$( sleep 3 & )
$( sleep 3 & ) &
$( sleep 3 ) &
but nothing changes.
Then I tried $( disown sleep 3 & ) which returned
disown: can't manipulate jobs in subshell
which led me to try $( set -m; disown sleep 3 & ) but I got the same output.
I even tried creating a C++ program that would daemonize itself:
#include <unistd.h>
#include <chrono>
#include <thread>
using namespace std;
int main() {
    int ret = fork();
    if (ret < 0) return ret; // fork error
    if (ret > 0) return 0;   // parent exits
    this_thread::sleep_for(chrono::milliseconds(3000));
    return 0;
}
But after running it, I realized that because I am only forking (instead of detaching from the parent and letting the parent die) the subshell will still wait for the process to end.
To step out of my MCVE: a function is being called from a subshell, and in that function I need to pull data from a server, and that needs to run in the background. My only constraint is that I can't edit the function call in the subshell.
Is there any way to not just fork but separate from the parent process in a C++ program so that it can die without consequence, or to force a command to separate from a subshell in bash?
Preferably the latter.
The $(...) command substitution mechanism waits for EOF on the pipe that the subshell's stdout is connected to. So even if you background a command in the subshell, the main shell will still wait for it to finish and close its stdout. To avoid waiting for this, you need to redirect its output away from the pipe.
echo "$( cat file1; sleep 3 >/dev/null & cat file2 )"
I hope I've got you right. Correct me if I'm wrong - you want your main thread to be able to die before the sub-threads end?
If this is the situation, you can use the detach method on the thread.

Shell script terminates program but causes output file to not be written

I want to run a program in the background that collects some performance data, and then run an application in the foreground. When the foreground application finishes, the script detects this and closes the application in the background. The issue is that when the background application is closed without first closing its file (I'm assuming), the output file remains empty. Is there a way to write the output file continuously so that if the background application closes unexpectedly the output is preserved?
Here is my shell script:
./background_application -o=output.csv &
background_pid=$!
./foreground_application
ps -a | grep foreground_application
if pgrep foreground_application > /dev/null
then
    result=1
else
    result=0
fi
while [ "$result" -ne 0 ]
do
    if pgrep RPx > /dev/null
    then
        result=1
    else
        result=0
    fi
    sleep 10
done
kill $background_pid
echo "Finished"
I have access to the source code of the background application, written in C++; it is a basic loop and runs fflush(outputfile) every loop iteration.
This would be shorter:
./background_application -o=output.csv &
background_pid=$!
./foreground_application
cp output.csv output_last_look.csv
kill $background_pid
echo "Finished"

C++ program significantly slower when run in bash

I have a query regarding bash. I have been running some of my own C++ programs in conjunction with commercial programs and controlling their interaction (via input and output files) through Bash scripting. I am finding that if I run my C++ program alone in a terminal it completes in around 10–15 seconds, but when I run the same program through the bash script it can take up to 5 minutes to complete in each case.
Using System Monitor, I find that 100% of one CPU is consistently used when I run the program directly in a terminal, whereas when I run it from bash (in a loop) a maximum of 60% CPU usage is recorded, which seems to be linked to the longer completion time (although the average CPU usage is higher across the 4 processors).
This is quite frustrating as until recently this was not a problem.
An example of the code:
#!/usr/bin/bash
DIR="$1"
TRCKDIR=$DIR/TRCKRSLTS
STRUCTDIR=$DIR
SHRTTRCKDIR=$TRCKDIR/SHRT_TCK_FILES
VTAL=VTAL.png
VTAR=VTAR.png
NAL=$(find $STRUCTDIR | grep NAL)
NAR=$(find $STRUCTDIR | grep NAR)
AMYL=$(find $STRUCTDIR | grep AMYL)
AMYR=$(find $STRUCTDIR | grep AMYR)
TCKFLS=($(find $TRCKDIR -maxdepth 1 | grep .fls))
numTCKFLS=${#TCKFLS[@]}
for i in $(seq 0 $[numTCKFLS-1]); do
    filenme=${TCKFLS[i]}
    filenme=${filenme%.t*}
    filenme=${filenme##*/}
    if [[ "$filenme" == *VTAL* || "$filenme" == *VTA_L* ]]; then
        STREAMLINE_CUTTER -MRT ${TCKFLS[i]} -ROI1 $VTAL -ROI2 $NAL -op "$SHRTTRCKDIR"/"$filenme"_VTAL_NAL.fls
        STREAMLINE_CUTTER -MRT ${TCKFLS[i]} -ROI1 $VTAL -ROI2 $AMYL -op "$SHRTTRCKDIR"/"$filenme"_VTAL_AMYL.fls
    fi
    if [[ "$filenme" == *VTAR* || "$filenme" == *VTA_R* ]]; then
        STREAMLINE_CUTTER -MRT ${TCKFLS[i]} -ROI1 $VTAR -ROI2 $NAR -op "$SHRTTRCKDIR"/"$filenme"_VTAR_NAR.fls
        STREAMLINE_CUTTER -MRT ${TCKFLS[i]} -ROI1 $VTAR -ROI2 $AMYR -op "$SHRTTRCKDIR"/"$filenme"_VTAR_AMYR.fls
    fi
done

Executing commands with pipes and timeout in C++ (and reading stdout)

I need your help !
I made a reporting daemon (in C++) which needs to periodically execute a bunch of commands on a server.
A simple example command would be : "/bin/ps aux | /usr/bin/wc -l"
The first idea was to fork a child process that runs the command with popen() and set up an alarm() in the parent process to kill the child after 5 seconds if the command has not exited already.
I tried using "sleep 20000" as the command; the child process is killed but the sleep command is still running... not good.
The second idea was to use execlp() instead of popen(); it works with simple commands (i.e. with no pipes) such as "ls -lisa" or "sleep 20000". I can get the result and the processes are killed if they're not done after 5 seconds.
Now I need to execute that "/bin/ps aux | /usr/bin/wc -l" command; obviously it won't work with execlp() directly, so I tried this "hack":
execlp("sh","sh","-c","/bin/ps aux | /usr/bin/wc -l",NULL);
It works... or so I thought... I tried
execlp("sh","sh","-c","sleep 20000",NULL);
just to be sure, and the child process is killed after 5 seconds (my timeout) but the sleep command just stays there...
I'm open to suggestions (I'd settle for a hack)!
Thanks in advance!
TLDR;
I need a way to:
execute a "complex" command such as "/bin/ps aux | /usr/bin/wc -l"
get its output
make sure it's killed if it takes more than 5 seconds (the ps command is just an example; actual commands may hang forever)
Use timeout command from coreutils:
/usr/bin/timeout 5 /bin/sh -c "/bin/ps aux | /usr/bin/wc -l"
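Wrapping that in the popen() approach from the question might look roughly like the following. This is a sketch only: run_with_timeout is a made-up helper, and the naive quoting assumes the command itself contains no single quotes.

#include <cstdio>
#include <iostream>
#include <string>
#include <sys/wait.h>

// Run a shell pipeline under coreutils `timeout` and capture its stdout.
std::string run_with_timeout(const std::string& cmd, int seconds, bool& timed_out) {
    std::string full = "/usr/bin/timeout " + std::to_string(seconds) + " /bin/sh -c '" + cmd + "'";
    std::string output;
    timed_out = false;

    FILE* pipe = popen(full.c_str(), "r");
    if (!pipe)
        return output;

    char buf[256];
    while (fgets(buf, sizeof buf, pipe))
        output += buf;

    int status = pclose(pipe);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 124) // timeout(1) exits with 124 on expiry
        timed_out = true;
    return output;
}

int main() {
    bool timed_out = false;
    std::string out = run_with_timeout("/bin/ps aux | /usr/bin/wc -l", 5, timed_out);
    std::cout << (timed_out ? "timed out\n" : out);
}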