I need your help!
I made a reporting daemon (in C++) which needs to periodically execute a bunch of commands on a server.
A simple example command would be: "/bin/ps aux | /usr/bin/wc -l"
The first idea was to fork a child process that runs the command with popen() and set up an alarm() in the parent process to kill the child after 5 seconds if the command has not exited already.
I tried using "sleep 20000" as the command: the child process is killed, but the sleep command is still running... not good.
The second idea was to use execlp() instead of popen(). It works with simple commands (i.e. with no pipes) such as "ls -lisa" or "sleep 20000": I can get the result, and the processes are killed if they're not done after 5 seconds.
Now I need to execute that "/bin/ps aux | /usr/bin/wc -l" command. Obviously it won't work with execlp() directly, so I tried this "hack":
execlp("sh","sh","-c","/bin/ps aux | /usr/bin/wc -l",NULL);
It works... or so I thought. I tried
execlp("sh","sh","-c","sleep 20000",NULL);
just to be sure, and the child process is killed after 5 seconds (my timeout), but the sleep command just stays there...
I'm open to suggestions (I'd settle for a hack)!
Thanks in advance!
TL;DR:
I need a way to :
execute a "complex" command such as "/bin/ps aux | /usr/bin/wc -l"
get its output
make sure it's killed if it takes more than 5 seconds (the ps command is just an example; actual commands may hang forever)
Use the timeout command from coreutils:
/usr/bin/timeout 5 /bin/sh -c "/bin/ps aux | /usr/bin/wc -l"
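In C++ that can be wrapped with popen(); below is a minimal sketch, assuming the coreutils timeout binary lives at /usr/bin/timeout and that the command contains no single quotes (the helper name and error handling are mine, not part of the answer):
#include <cstdio>
#include <string>

// Hypothetical helper: run cmd under a 5 second limit and return its output.
// In its default mode, coreutils timeout also signals children of the command,
// so a stray "sleep 20000" inside the pipeline is terminated as well.
std::string run_with_timeout(const std::string& cmd)
{
    std::string wrapped = "/usr/bin/timeout 5 /bin/sh -c '" + cmd + "'";
    std::string output;
    FILE* pipe = popen(wrapped.c_str(), "r");
    if (!pipe)
        return output;
    char buf[256];
    while (fgets(buf, sizeof(buf), pipe) != nullptr)
        output += buf;
    pclose(pipe); // returns the wait status; timeout exits with 124 when the limit is hit
    return output;
}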
I have an issue with gracefully exiting my Slurm jobs while saving data, etc.
I have a signal handler in my program which sets a flag, which is then queried in a main loop and a graceful exit with data saving follows. The general scheme is something like this:
#include <utility>
#include <atomic>
#include <csignal>
#include <fstream>
#include <unistd.h>

namespace {
std::atomic<bool> sigint_received = false;
}

void sigint_handler(int) {
    sigint_received = true;
}

int main() {
    std::signal(SIGTERM, sigint_handler);
    while (true) {
        usleep(10); // There are around 100 iterations per second
        if (sigint_received)
            break;
    }
    std::ofstream out("result.dat");
    if (!out)
        return 1;
    out << "Here I save the data";
    return 0;
}
Batch scripts are unfortunately complicated because:
I want hundreds of parallel, low-thread-count independent tasks, but my cluster allows only 16 jobs per user
srun in my cluster always claims a whole node, even if I don't want all cores, so in order to run multiple processes on a single node I have to use bash
Because of this, the batch script is this mess (2 nodes for 4 processes):
#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.
srun -N 1 -n 1 bash -c '
./my_program input1 &
./my_program input2 &
wait
' &
srun -N 1 -n 1 bash -c '
./my_program input3 &
./my_program input4 &
wait
' &
wait
Now, to propagate signals sent by Slurm, I have an even bigger mess like this (following this answer, in particular the double waits):
#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.
trap 'kill $(jobs -p) && wait' TERM
srun -N 1 -n 1 bash -c '
trap '"'"'kill $(jobs -p) && wait'"'"' TERM
./my_program input1 &
./my_program input2 &
wait
' &
srun -N 1 -n 1 bash -c '
trap '"'"'kill $(jobs -p) && wait'"'"' TERM
./my_program input3 &
./my_program input4 &
wait
' &
wait
For the most part it is working. But, firstly, I am getting error messages at the end of the output:
srun: error: nid00682: task 0: Exited with exit code 143
srun: Terminating job step 732774.7
srun: error: nid00541: task 0: Exited with exit code 143
srun: Terminating job step 732774.4
...
and, what is worse, some 4-6 out of over 300 processes actually fail on if (!out); errno gives "Interrupted system call". Again, guided by this, I guess that my signal handler is called twice, the second time during some syscall inside the std::ofstream constructor.
Now:
How do I get rid of the Slurm errors and have an actual graceful exit?
Am I correct that the signal is sent twice? If so, why, and how can I fix it?
Suggestions:
trap EXIT, not a signal. EXIT happens once, TERM can be delivered multiple times.
use declare -f to transfer code and declare -p to transfer variables to an unrelated subshell
kill can fail; I do not think you should && on it
use xargs (or parallel) instead of reinventing the wheel with kill $(jobs -p)
extract "data" (input1 input2 ...) from "code" (work to be done)
Something along these lines:
# The input.
input="$(cat <<'EOF'
input1
input2
input3
input4
EOF
)"
work() {
# Normally write work to be done.
# For each argument, run `my_program` in parallel.
printf "%s\n" "$@" | xargs -d'\n' -P0 ./my_program
}
# For each two arguments run `srun....` with a shell that runs `work` in parallel.
# Note - declare -f outputs source-able definition of the function.
# "No more hand escaping!"
# Then the work function is called with arguments passed by xargs inside the spawned shell.
xargs -P0 -n2 -d'\n' <<<"$input" \
srun -N 1 -n 1 \
bash -c "$(declare -f work)"'; work "$@"' --
The -P0 is specific to GNU xargs. GNU xargs also handles exit status 255 specially; you can write a wrapper like xargs ... bash -c './my_program "$@" || exit 255' -- || exit 255 if you want xargs to terminate when any of the programs fails.
If srun preserves environment variables, then export the work function with export -f work and just call it within the child shell, like xargs ... srun ... bash -c 'work "$@"' --.
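Regarding the handful of processes failing on if (!out) with EINTR: one possible fix on the C++ side (a sketch, separate from the script changes above) is to install the handler with sigaction() and SA_RESTART, so that interruptible syscalls such as the open() inside the std::ofstream constructor are restarted instead of failing:
#include <atomic>
#include <signal.h>

namespace {
std::atomic<bool> sigint_received{false};
}

extern "C" void sigint_handler(int) {
    sigint_received = true;
}

int main() {
    struct sigaction sa{};
    sa.sa_handler = sigint_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;          // restart interrupted syscalls instead of returning EINTR
    sigaction(SIGTERM, &sa, nullptr);
    // ... main loop and result.dat writing exactly as in the question ...
    return 0;
}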
I'm coding an application on a Raspberry Pi/Raspbian in C++. I create a named pipe (FIFO) with mkfifo(), then I start raspiyuv to grab an image from my camera. For reference, raspiyuv is the Raspberry Pi command-line application that takes still images and saves them as YUV files.
I'm using g++ 6.3 and Boost 1.64 with -std=c++17. The FIFO I create is correct in the sense that I can use it from the command line; it works as expected.
The bug is that the application raspiyuv I spawn returns immediately with exit code 0.
My code:
void myFunction()
{
    // Create the FIFO here with mkfifo(); // Works fine...
    boost::filesystem::path lExecPath =
        boost::process::search_path( "raspiyuv" ); // returns correct path
    boost::process::child lProcess( lExecPath, "-w 2592 -h 1944 -o - -t 0 -y -s >> /var/tmp/myfifo" );
    int lPID = lProcess.id(); // Seems to be correct
    int lExitCode = lProcess.exit_code(); // Returns immediately with 0
}
The command $ raspiyuv -w 2592 -h 1944 -o - -t 0 -y -s is correct when I enter it directly at the command line, and the redirection to the FIFO works correctly. -w 2592 -h 1944 gives the size of the grabbed image, -o - means output the image to stdout, -t 0 means wait forever, -y means save only the Y channel, and -s means wait for SIGUSR1 to trigger an image capture.
When I call it from the command line, the application is idle until I send a SIGUSR1; then it captures an image, streams it to the FIFO, and returns to idle. That's fine.
When I spawn it by creating the boost::process::child object, it returns immediately.
Any idea how to correct this so that the boost::process::child stays alive as long as my application (the parent process) is alive and I don't send a SIGKILL, etc.?
Thanks for your help!
The boost::process::child runs asynchronously to the main process. If you want synchronous behavior then you must either call lProcess.wait() or launch raspiyuv by synchronous means, such as boost::process::system.
EDIT:
If you want the other process to be asynchronous (as your edits and comment seem to indicate), then what you are really trying to do becomes unclear to me:
The bug is that the application raspiyuv I spawn returns immediately
with exit code 0.
What bug? By the docs of boost::process::child this is exactly what is expected:
int exit_code() const;
Get the exit_code. The return value is without
any meaning if the child wasn't waited for or if it was terminated.
Your code never waits for the child to end, and so gets an exit code exactly as documented.
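For completeness, a minimal sketch of the wait()-based variant; the argument splitting, the bp::std_out redirection to the FIFO, and the function name are assumptions on top of the question's code, not the only way to do it:
#include <boost/process.hpp>
#include <boost/filesystem.hpp>

namespace bp = boost::process;

int runRaspiyuv()
{
    boost::filesystem::path lExecPath = bp::search_path("raspiyuv");
    // Pass the arguments individually and let Boost.Process redirect stdout to
    // the FIFO instead of relying on a shell ">>". Note that opening a FIFO for
    // writing typically blocks until a reader has opened it.
    bp::child lProcess(lExecPath,
                       "-w", "2592", "-h", "1944",
                       "-o", "-", "-t", "0", "-y", "-s",
                       bp::std_out > "/var/tmp/myfifo");
    // ... send SIGUSR1 to lProcess.id() whenever a capture is wanted ...
    lProcess.wait();              // block until raspiyuv really exits
    return lProcess.exit_code();  // meaningful only after wait()
}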
I want to run a program in the background that collects some performance data and then run an application in the foreground. When the foreground application finishes, the script detects this and closes the background application. The issue is that when the background application closes without first closing the file (I'm assuming), the output file remains empty. Is there a way to continuously write the output file so that the output is preserved if the background application closes unexpectedly?
Here is my shell script:
./background_application -o=output.csv &
background_pid=$!
./foreground_application
ps -a | grep foreground_application
if pgrep foreground_application > /dev/null
then
result=1
else
result=0
fi
while [ "$result" -ne 0 ]
do
if pgrep RPx > /dev/null
then
result=1
else
result=0
fi
sleep 10
done
kill $background_pid
echo "Finished"
I have access to the source code of the background application, written in C++; it is a basic loop and calls fflush(outputfile) every iteration.
This would be shorter:
./background_application -o=output.csv &
background_pid=$!
./foreground_application
cp output.csv output_last_look.csv
kill $background_pid
echo "Finished"
I launch a master script: master.ksh.
I want to run some background tasks during the work of master.ksh.
For this, I created a script, slave.ksh, started at the beginning of master.ksh with a &:
./slave.ksh &
Here is the code of slave.ksh:
#!/bin/ksh
touch tmpfile
export thepid=$!
while [[`if [ -n "$thepid" ];fi`]]; do
pwd >> tmpfile
#other set of commands ...
export thepid=$!
done
thepid is used to monitor the PID of master.ksh: when master.ksh ends, I expect slave.ksh to end and exit too.
But I get an error from slave.ksh:
syntax error at line 5; fi unexpected
If I delete fi, I get another error. What is the right way to test $thepid?
...
I'm not sure where to begin. This is broken in at least three ways: shell variables don't work that way, if statements don't work that way, and conditionals don't work that way.
Here's one way to do it (tested on 93u+):
> cat master.ksh
#!/bin/ksh -eu
print master says hi
./slave.ksh&
sleep 5
print master says bye
> cat slave.ksh
#!/bin/ksh -eu
print slave says hi
while (($(ps oppid= $$)==$PPID))
do
# work
print slave working....
sleep 1
done
print slave says bye
> ./master.ksh
master says hi
slave says hi
slave working....
slave working....
slave working....
slave working....
slave working....
master says bye
> slave says bye
This compares the PPID shell variable, which appears to be set at process launch, to the parent process id as returned by the Linux ps tool, which returns the true current value. This works because when a process dies, any child processes it had have their parent process changed to 1 (init). So the slave works as long as its original PPID matches its current PPID, and then exits.
I want to write a script for gdb which will save the backtrace (stack) of a process every 10 ms. How can I do this?
It could be something like call-graph profiling for the 'penniless' (for people who can't use any sort of advanced profiler).
Yes, there are a lot of advanced profilers. For popular CPUs and for popular OSes. Shark is very impressive and easy to use, but I want to get a basic functionality with such script, working with gdb.
Can you get lsstack? Perhaps you could run that from a script outside your app. Why 10ms? Percentages will be about the same at 100ms or more. If the app is too fast, you could artificially slow it down with an outer loop, and that wouldn't change the percentages either. For that matter, you could just use Ctrl-C to get the samples manually under gdb, if the app runs long enough and if your goal is to find out where the performance problems are.
(1) Manual: execute the following in a shell, and keep pressing Ctrl+C repeatedly at the shell prompt:
gdb -x print_callstack.gdb -p pid
or, (2) send the signal to the pid repeatedly, the same number of times, from another shell, as in the loop below:
let count=0; \
while [ $count -le 100 ]; do \
kill -INT pid ; sleep 0.10; \
let count=$count+1; \
done
The source of print_callstack.gdb from (1) is as below:
set pagination 0
set $count = 0
while $count < 100
backtrace
continue
set $count = $count + 1
end
detach
quit
man page of pstack https://linux.die.net/man/1/pstack
cat > gdb.run
set pagination 0
backtrace
continue
backtrace
continue
... as many more backtrace + continue's as needed
backtrace
continue
detach
quit
gdb -x gdb.run -p $pid
Then just run
kill -INT $pid ; sleep 0.01
in a loop in another script.
kill -INT is what the OS does when you hit Ctrl-C. Exercise for the reader: make the gdb script use a loop with $n iterations.