Bash run command in background inside subshell - c++

I want to be able to bg a process inside a subshell as if it were not in a subshell.
$( sleep 3 & ) just ignores the ampersand.
I've tried:
$( sleep 3 & )
$( sleep 3 & ) &
$( sleep 3 ) &
but nothing changes.
Then I tried $( disown sleep 3 & ) which returned
disown: can't manipulate jobs in subshell
which led me to try $( set -m; disown sleep 3 & ) but I got the same output.
I even tried creating a c++ program that would daemonize itself:
#include <unistd.h>
#include <chrono>
#include <thread>
using namespace std;
int main() {
int ret = fork();
if (ret < 0) return ret; // fork error
if (ret > 0) return 0; // parent exits
this_thread::sleep_for(chrono::milliseconds(3000));
return 0;
}
But after running it, realized that because I am forking instead of separate_from_parent_and_let_parent_dieing the subshell will still wait for the process to end.
To step out of my MCVE, a function is being called from a subshell, and in that function, I need to pull data from a server and it needs to be run in the bg. My only constraint is that I can't edit the function call in the subshell.
Is there any way to not fork but separate from the parent process in a c++ program so that it can die without consequence or force a command to separate from a subshell in bash?
Preferably the latter.

The $(...) command substitution mechanism waits for EOF on the pipe that the subshell's stdout is connected to. So even if you background a command in the subshell, the main shell will still wait for it to finish and close its stdout. To avoid waiting for this, you need to redirect its output away from the pipe.
echo "$( cat file1; sleep 3 >/dev/null & cat file2 )"

I hope I've understood you correctly; correct me if I'm wrong. You want your main thread to be able to exit before the sub-threads end?
If that is the situation, you can call the detach method on the thread.

Related

Graceful signal handling in slurm

I have an issue with graceful exiting my slurm jobs with saving data, etc.
I have a signal handler in my program which sets a flag, which is then queried in a main loop and a graceful exit with data saving follows. The general scheme is something like this:
#include <utility>
#include <atomic>
#include <csignal>
#include <fstream>
#include <unistd.h>
namespace {
std::atomic<bool> sigint_received = false;
}
void sigint_handler(int) {
sigint_received = true;
}
int main() {
std::signal(SIGTERM, sigint_handler);
while(true) {
usleep(10); // There are around 100 iterations per second
if (sigint_received)
break;
}
std::ofstream out("result.dat");
if (!out)
return 1;
out << "Here I save the data";
return 0;
}
Batch scripts are unfortunately complicated because:
I want hundreds of parallel, low-thread-count independent tasks, but my cluster allows only 16 jobs per user
srun in my cluster always claims a whole node, even if I don't want all cores, so in order to run multiple processes on a single node I have to use bash
Because of it, batch script is this mess (2 nodes for 4 processes):
#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.
srun -N 1 -n 1 bash -c '
./my_program input1 &
./my_program input2 &
wait
' &
srun -N 1 -n 1 bash -c '
./my_program input3 &
./my_program input4 &
wait
' &
wait
Now, to propagate signals sent by slurm, I have even a bigger mess like this (following this answer, in particular double waits):
#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.
trap 'kill $(jobs -p) && wait' TERM
srun -N 1 -n 1 bash -c '
trap '"'"'kill $(jobs -p) && wait'"'"' TERM
./my_program input1 &
./my_program input2 &
wait
' &
srun -N 1 -n 1 bash -c '
trap '"'"'kill $(jobs -p) && wait'"'"' TERM
./my_program input3 &
./my_program input4 &
wait
' &
wait
For the most part it works. But, firstly, I am getting error messages at the end of the output:
run: error: nid00682: task 0: Exited with exit code 143
srun: Terminating job step 732774.7
srun: error: nid00541: task 0: Exited with exit code 143
srun: Terminating job step 732774.4
...
and, what is worse, like 4-6 out of over 300 processes actually fail on if (!out) - errno gives "Interrupted system call". Again, guided by this, I guess that my signal handler is called two times - the second one during some syscall under std::ofstream constructor.
Now,
How to get rid of slurm errors and have an actual graceful exit?
Am I correct that signal is sent two times? If so, why, and how can I fix it?
Suggestions:
trap EXIT, not a signal. EXIT happens once, TERM can be delivered multiple times.
use declare -f to transfer code and declare -p to transfer variables to an unrelated subshell
kill can fail, I do not think you should && on it
use xargs (or parallel) instead of reinventing the wheel with kill $(jobs -p)
extract "data" (input1 input2 ...) from "code" (work to be done)
Something along:
# The input.
input="$(cat <<'EOF'
input1
input2
input3
input4
EOF
)"
work() {
# Normally write work to be done.
# For each argument, run `my_program` in parallel.
printf "%s\n" "$@" | xargs -d'\n' -P0 ./my_program
}
# For each two arguments run `srun....` with a shell that runs `work` in parallel.
# Note - declare -f outputs source-able definition of the function.
# "No more hand escaping!"
# Then the work function is called with arguments passed by xargs inside the spawned shell.
xargs -P0 -n2 -d'\n' <<<"$input" \
srun -N 1 -n 1 \
bash -c "$(declare -f work)"'; work "$@"' --
The -P0 is specific to GNU xargs. GNU xargs specially handles exit status 255, you can write a wrapper like xargs ... bash -c './my_program "$@" || exit 255' -- || exit 255 if you want xargs to terminate if any of programs fail.
If srun preserves environment variables, then export work function export -f work and just call it within child shell like xargs ... srun ... bash -c 'work "$@"' --.
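On the "Interrupted system call" part of the question, which the suggestions above do not address directly: whether a system call interrupted by a handler is restarted depends on how the handler was installed, and sigaction() with SA_RESTART requests restarting explicitly. The following is a minimal sketch under that assumption. install_term_handler and term_requested are hypothetical names, and whether this actually cures the failures seen on that cluster is not something the thread confirms.

```cpp
#include <atomic>
#include <signal.h>

namespace {
// Assumed to be lock-free on the target platform, so it is safe to set
// from a signal handler.
std::atomic<bool> term_received{false};
}

extern "C" void on_term(int) { term_received.store(true); }

// Hypothetical helper: install on_term for SIGTERM and ask the kernel to
// restart interruptible system calls instead of failing them with EINTR.
bool install_term_handler() {
    struct sigaction sa {};
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGTERM, &sa, nullptr) == 0;
}

// Polled from the main loop, like sigint_received in the question.
bool term_requested() { return term_received.load(); }
```

With this in place a second SIGTERM still runs the handler, but it only sets the flag again. Not every system call is restartable, so checking the result of the save and retrying once is still worth considering.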

Using two threads and system() command to run shell scripts: how to make sure that one shell script is started before another

There are two shell scripts:
#shell_script_1
nc -l -p 2234
#shell_script_2
echo "hello" | nc -p 1234 localhost 2234 -w0
From inside the C++ program, I want to run shell script no.1 first, and then run shell script no.2. What I have now is something like this:
#include <string>
#include <thread>
#include <cstdlib>
#include <unistd.h>
int main()
{
std::string sh_1 = "./shell_script_1";
std::string sh_2 = "./shell_script_2";
std::thread t1( &system, sh_1.c_str() );
usleep( 5000000 ); //wait for 5 seconds
std::thread t2( &system, sh_2.c_str() );
t1.join();
t2.join();
}
When I run the program above, shell_script_1 runs before shell_script_2, as expected. However, is a 5-second wait enough to make sure that the two shell scripts start in order? Is there any way I can enforce the order other than setting a timer and crossing my fingers? Thanks.
It is not enough to "start" the first script before the second. You want the first script to actually be listening on the port you've specified. To make that happen, you need to check periodically. This will be platform dependent, but on Linux you could check /proc/PID of the first child to know what file descriptors it has open, and/or run nc -z to check if the port is listening.
A simpler approach would be to automatically retry the second script a few times if it fails to connect and the first thread is still running.
A more sophisticated approach would be make your C++ program bind two ports and listen on both, and change your first script to connect instead of listen. This way both scripts would act as clients, and your C++ launcher would act as the server (even if all it does is pass the data between the two children), giving you more control and avoiding a race.
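The "retry the second script" approach could look roughly like the sketch below. run_with_retry is a hypothetical helper. It assumes the script exits with a non-zero status when the connection fails, and it leaves out the check that the first thread is still running.

```cpp
#include <chrono>
#include <cstdlib>
#include <string>
#include <thread>

// Hypothetical helper: run cmd through std::system until it exits with
// status 0, at most max_attempts times, sleeping `delay` between attempts.
bool run_with_retry(const std::string& cmd, int max_attempts,
                    std::chrono::milliseconds delay) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (std::system(cmd.c_str()) == 0) return true;  // succeeded
        if (attempt + 1 < max_attempts) std::this_thread::sleep_for(delay);
    }
    return false;  // every attempt failed
}
```

In the program from the question, the second thread would then call run_with_retry("./shell_script_2", 10, std::chrono::milliseconds(500)) instead of relying on a fixed 5-second sleep; the attempt count and delay here are arbitrary.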

how waitpid() function is implemented in system() function in linux

I was going through Richard Stevens' "Advanced Programming in the UNIX Environment" and I found this topic.
8.13. system Function
Because system is implemented by calling fork, exec, and waitpid, there are three types of return values.
1. If either the fork fails or waitpid returns an error other than EINTR, system returns –1 with errno set to indicate the error.
2. If the exec fails, implying that the shell can't be executed, the return value is as if the shell had executed exit(127).
3. Otherwise, all three functions—fork, exec, and waitpid—succeed, and the return value from system is the termination status of the shell, in the format specified for waitpid.
As I understand it, we fork() a process named by cmdstring, and exec() makes it separate from the parent process.
But I am unable to figure out how the waitpid() function is part of the system() function call.
The linked question, "ambiguous constructor call while object creation", didn't give me the correct answer.
After you fork() off, your original process continues immediately, i.e. fork() returns at once. At that point, the new process is still running. Since system() is supposed to be synchronous, i.e. must only return after the executed program finishes, the original program now needs to call waitpid() on the PID of the new process to wait for its termination.
In a picture:
  [main process]
         .
         .
         .
       fork()     [new process]
         A
        / \
       |   \
       |    \___  exec()
    waitpid()        .
         z           .
         z           .  (running)
         z           .
         z         Done!
         z           |
         +----<----<-+
         |
         V
     (continue)
The system() call would, in a Unix environment look something like this:
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int system(const char *cmd)
{
    pid_t pid = fork();
    if (pid == 0)       // We are in the child process.
    {
        // Ok, so it's more complicated than this; the essential part is that
        // the command string is handed to a shell.
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);     // exec failed, return 127. [exec doesn't return unless it failed!]
    }
    else
    {
        if (pid < 0)
        {
            return -1;  // Failed to fork!
        }
        int status;
        if (waitpid(pid, &status, 0) > 0)
        {
            return status;
        }
    }
    return -1;
}
Please do note that this is SYMBOLICALLY what system does - it's a fair bit more complicated, because waitpid can give other values, and all sorts of other things that need checking.
From the man pages:
system() executes a command specified in command by calling /bin/sh -c command, and returns after the command has been completed. During execution of the command, SIGCHLD will be blocked, and SIGINT and SIGQUIT will be ignored.
system() presumably uses waitpid() to wait until the shell command finishes.
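Because the value returned by system() is in the format specified for waitpid, the caller decodes it with the same macros. A minimal sketch follows. exit_code_of is a hypothetical helper, and the -1 and -2 return values are conventions chosen for this example only.

```cpp
#include <cstdlib>
#include <sys/wait.h>

// Hypothetical helper: run cmd via system() and return the shell's exit code.
// Returns -1 if system() itself failed (for example, fork failed) and -2 if
// the shell did not exit normally (for example, it was killed by a signal).
int exit_code_of(const char* cmd) {
    int status = std::system(cmd);
    if (status == -1) return -1;
    if (WIFEXITED(status)) return WEXITSTATUS(status);
    return -2;
}
```

A command that the shell cannot find yields 127 here, which matches case 2 in the quoted text.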

Executing commands with pipes and timeout in c++ (and reading stdout)

I need your help !
I made a reporting daemon (in C++) which needs to periodically execute a bunch of commands on a server.
A simple example command would be : "/bin/ps aux | /usr/bin/wc -l"
The first idea was to fork a child process that runs the command with popen() and set up an alarm() in the parent process to kill the child after 5 seconds if the command has not exited already.
I tried using "sleep 20000" as command, the child process is killed but the sleep command is still running... not good.
The second idea was to use execlp() instead of popen(), it works with simple commands (ie with no pipes) such as "ls -lisa" or "sleep 20000". I can get the result and the processes are killed if they're not done after 5 seconds.
Now I need to execute that "/bin/ps aux | /usr/bin/wc -l" command, obviously it won't work with execlp() directly, so I tried that "hack" :
execlp("sh","sh","-c","/bin/ps aux | /usr/bin/wc -l",NULL);
It works... or so I thought... I tried
execlp("sh","sh","-c","sleep 20000",NULL);
just to be sure and the child process is killed after 5 secs (my timeout) but the sleep command just stays there...
I'm open to suggestions (I'd settle for a hack)!
Thanks in advance !
TLDR;
I need a way to :
execute a "complex" command such as "/bin/ps aux | /usr/bin/wc -l"
get its output
make sure it's killed if it takes more than 5 seconds (the ps command is just an example, actual commands may hang forever)
Use timeout command from coreutils:
/usr/bin/timeout 5 /bin/sh -c "/bin/ps aux | /usr/bin/wc -l"
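If the timeout has to live inside the daemon instead, one way to avoid the leftover sleep is to put the shell in its own process group and signal the whole group. This is a sketch under stated assumptions, not what the answer above does: run_with_timeout is a hypothetical helper, it sends SIGKILL with no grace period, and it assumes the command does not move itself into another process group.

```cpp
#include <cerrno>
#include <chrono>
#include <poll.h>
#include <signal.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run cmd through /bin/sh, append its stdout to `out`,
// and kill the whole process group once `timeout` has elapsed.
// Returns true only if the command finished in time with exit status 0.
bool run_with_timeout(const std::string& cmd, std::chrono::milliseconds timeout,
                      std::string& out) {
    int fds[2];
    if (pipe(fds) < 0) return false;
    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return false; }
    if (pid == 0) {
        setpgid(0, 0);                    // own group: sh plus everything it starts
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", cmd.c_str(), (char*)nullptr);
        _exit(127);                       // only reached if exec failed
    }
    setpgid(pid, pid);                    // same call in the parent avoids a race
    close(fds[1]);

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    bool timed_out = false;
    char buf[4096];
    for (;;) {
        const auto left = std::chrono::duration_cast<std::chrono::milliseconds>(
            deadline - std::chrono::steady_clock::now()).count();
        if (left <= 0) { timed_out = true; break; }
        pollfd pfd{fds[0], POLLIN, 0};
        const int ready = poll(&pfd, 1, static_cast<int>(left));
        if (ready < 0) { if (errno == EINTR) continue; break; }
        if (ready == 0) { timed_out = true; break; }
        const ssize_t n = read(fds[0], buf, sizeof buf);
        if (n <= 0) break;                // EOF or error: no writers are left
        out.append(buf, static_cast<size_t>(n));
    }
    close(fds[0]);
    if (timed_out) kill(-pid, SIGKILL);   // negative pid signals the whole group
    int status = 0;
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}
    return !timed_out && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Called with "/bin/ps aux | /usr/bin/wc -l" and a 5-second timeout, the pipeline's output ends up in out. One limitation: a command that closes its stdout but keeps running would make the final waitpid() block past the deadline.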

Way to force file descriptor to close so that pclose() will not block?

I am creating a pipe using popen() and the process is invoking a third party tool which in some rare cases I need to terminate.
::popen(thirdPartyCommand.c_str(), "w");
If I just throw an exception and unwind the stack, my unwind attempts to call pclose() on the third party process whose results I no longer need. However, pclose() never returns as it blocks with the following stack trace on Centos 4:
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x00807dc3 in __waitpid_nocancel () from /lib/libc.so.6
#2 0x007d0abe in _IO_proc_close@@GLIBC_2.1 () from /lib/libc.so.6
#3 0x007daf38 in _IO_new_file_close_it () from /lib/libc.so.6
#4 0x007cec6e in fclose@@GLIBC_2.1 () from /lib/libc.so.6
#5 0x007d6cfd in pclose@@GLIBC_2.1 () from /lib/libc.so.6
Is there any way to force the call to pclose() to be successful before calling it so I can programmatically avoid this situation of my process getting hung up waiting for pclose() to succeed when it never will because I've stopped supplying input to the popen()ed process and wish to throw away its work?
Should I write an end of file somehow to the popen()ed file descriptor before trying to close it?
Note that the third party software is forking itself. At the point where pclose() has hung, there are four processes, one of which is defunct:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
abc 6870 0.0 0.0 8696 972 ? S 04:39 0:00 sh -c /usr/local/bin/third_party /home/arg1 /home/arg2 2>&1
abc 6871 0.0 0.0 10172 4296 ? S 04:39 0:00 /usr/local/bin/third_party /home/arg1 /home/arg2
abc 6874 99.8 0.0 10180 1604 ? R 04:39 141:44 /usr/local/bin/third_party /home/arg1 /home/arg2
abc 6875 0.0 0.0 0 0 ? Z 04:39 0:00 [third_party] <defunct>
I see two solutions here:
The neat one: you fork(), pipe() and execve() (or anything in the exec family of course...) "manually", then it is going to be up to you to decide if you want to let your children become zombies or not. (i.e. to wait() for them or not)
The ugly one: if you're sure you only have one of this child process running at any given time, you could use sysctl() to check if there is any process running with this name before you call pclose()... yuk.
I strongly advise the neat way here, or you could just ask whomever responsible to fix that infinite loop in your third party tool haha.
Good luck!
EDIT:
For your first question: I don't know. Doing some research on how to find processes by name using sysctl() should tell you what you need to know; I myself have never pushed it this far.
For your second and third question: popen() is basically a wrapper to fork() + pipe() + dup2() + execl().
fork() duplicates the process, execl() replaces the duplicated process' image with a new one, pipe() handles inter process communication and dup2() is used to redirect the output... And then pclose() will wait() for the duplicated process to die, which is why we're here.
If you want to know more, you should check this answer where I've recently explained how to perform a simple fork with standard IPC. In this case, it's just a bit more complicated as you have to use dup2() to redirect the standard output to your pipe.
You should also take a look at popen()/pclose() source codes, as they are of course open source.
Finally, here's a brief example, I cannot make it clearer than that:
int pipefd[2];
char buf;
pipe(pipefd);
if (fork() == 0) // I'm the child
{
    close(pipefd[0]); // I'm not going to read from this pipe
    dup2(pipefd[1], 1); // redirect standard output to the pipe
    close(pipefd[1]); // it has been duplicated, close it as we don't need it anymore
    execl("/bin/ls", "ls", (char *)NULL); // placeholder: execve()/execl()/... the program you want
    _exit(127); // only reached if the exec failed
}
else // I'm the parent
{
    close(pipefd[1]); // I'm not going to write to this pipe
    while (read(pipefd[0], &buf, 1) > 0) // read until EOF
        write(1, &buf, 1);
    close(pipefd[0]); // cleaning: close the read end
}
And as always, remember to read the man pages and to check all your return values.
Again, good luck!
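Since the question opens the pipe in "w" mode, the write-direction counterpart may be the more relevant one. The sketch below is not from the answer above: open_writer and close_writer are hypothetical names, and the 100 ms polling interval is arbitrary. The point is that doing the fork yourself gives you the PID, so the close can give up and kill instead of blocking forever.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for popen(cmd, "w") that also reports the child's PID.
// Returns the write end of the pipe (the child's stdin), or -1 on failure.
int open_writer(const char* cmd, pid_t* child_pid) {
    int fds[2];
    if (pipe(fds) < 0) return -1;
    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return -1; }
    if (pid == 0) {
        close(fds[1]);                    // the child only reads
        dup2(fds[0], STDIN_FILENO);
        close(fds[0]);
        execl("/bin/sh", "sh", "-c", cmd, (char*)nullptr);
        _exit(127);                       // only reached if exec failed
    }
    close(fds[0]);                        // the parent only writes
    *child_pid = pid;
    return fds[1];
}

// Close the pipe, poll for the child's exit `tries` times at 100 ms
// intervals, then send SIGKILL. Returns the waitpid status, or -1 on error.
int close_writer(int fd, pid_t pid, int tries) {
    close(fd);                            // the child sees EOF on its stdin
    int status = 0;
    for (int i = 0; i < tries; ++i) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) return status;
        if (r < 0) return -1;
        usleep(100000);
    }
    kill(pid, SIGKILL);                   // give up waiting
    return waitpid(pid, &status, 0) == pid ? status : -1;
}
```

This only signals the direct child (the sh). A tool that forks further processes, as the one in the question does, would additionally need something like a process group to be cleaned up completely.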
Another solution is to kill all your children. If you know that the only child processes you have are processes that get started when you do popen(), then it's easy enough. Otherwise you may need some more work or use the fork() + execve() combo, in which case you will know the first child's PID.
Whenever you run a child process, its PPID (parent process ID) is your own PID. It is easy enough to read the list of currently running processes and gather those that have their PPID = getpid(). Repeat the loop looking for processes that have their PPID equal to one of your children's PID. In the end you build a whole tree of child processes.
Since your child processes may end up creating other child processes, to make it safe, you will want to block those processes by sending a SIGSTOP. That way they will stop creating new children. As far as I know, you can't prevent the SIGSTOP from doing its deed.
The process is therefore:
void kill_all_children()
{
    std::vector<pid_t> me_and_children;
    me_and_children.push_back(getpid());
    bool found_child = false;
    do
    {
        found_child = false;
        // get_processes() and the process type are left out of this answer
        std::vector<process> processes(get_processes());
        for(auto p : processes)
        {
            // skip processes that were already stopped and recorded
            if(std::find(me_and_children.begin(),
                         me_and_children.end(),
                         p.pid()) != me_and_children.end())
            {
                continue;
            }
            // i.e. if this process is the child of any one of those processes
            if(std::find(me_and_children.begin(),
                         me_and_children.end(),
                         p.ppid()) != me_and_children.end())
            {
                kill(p.pid(), SIGSTOP);
                me_and_children.push_back(p.pid());
                found_child = true;
            }
        }
    }
    while(found_child);
    for(auto c : me_and_children)
    {
        // ignore ourselves
        if(c == getpid())
        {
            continue;
        }
        kill(c, SIGTERM);
        kill(c, SIGCONT); // make sure it continues now
    }
}
This is probably not the best way to close your pipe, though, since you probably need to let the command time to handle your data. So what you want is execute that code only after a timeout. So your regular code could look something like this:
void handle_alarm(int)
{
    kill_all_children();
}
void send_data(...)
{
    signal(SIGALRM, handle_alarm);
    FILE *f = popen("command", "w");
    // do some work...
    alarm(60); // give it a minute
    pclose(f);
    alarm(0); // remove alarm
}
-- about the alarm(60);, the location is up to you, it could also be placed before the popen() if you're afraid that the popen() or the work after it could also fail (i.e. I've had problems where the pipe fills up and I don't even reach the pclose() because then the child process loops forever.)
Note that the alarm() may not be the best idea in the world. You may prefer using a thread with a sleep made of a poll() or select() on an fd which you can wake up as required. That way the thread would call the kill_all_children() function after the sleep, but you can send it a message to wake it up early and let it know that the pclose() happened as expected.
Note: I left the implementation of the get_processes() out of this answer. You can read that from /proc or with the libprocps library. I have such an implementation in my snapwebsites project. It's called process_list. You could just reap off that class.
I'm using popen() to invoke a child process which doesn't need any stdin or stdout, it just runs for a short time to do its work, then it stops all by itself. Arguably, invoking this type of child process should rather be done with system() ? Anyway, pclose() is used afterwards to verify that the child process exited cleanly.
Under certain conditions, this child process keeps on running indefinitely. pclose() blocks forever, so then my parent process is also stuck. CPU usage runs to 100%, other executables get starved, and my whole embedded system crumbles. I came here looking for solutions.
Solution 1 by @cmc: decomposing popen() into fork(), pipe(), dup2() and execl().
It might just be a matter of personal taste, but I'm reluctant to rewrite perfectly fine system calls myself. I would just end up introducing new bugs.
Solution 2 by @cmc: verifying that the child process actually exists with sysctl(), to make sure that pclose() will return successfully. I find that this somehow sidesteps the problem from the OP @WilliamKF - there is definitely a child process, it just has become unresponsive. Forgoing the pclose() call won't solve that. [As an aside, in the 7 years since @cmc wrote this answer, sysctl() seems to have become deprecated.]
Solution 3 by @Alexis Wilke: killing the child process. I like this approach best. It basically automates what I did when I stepped in manually to resuscitate my dying embedded system. The problem with my stubborn adherence to popen(), is that I get no PID from the child process. I have been trying in vain with
waitid(P_PGID, getpgrp(), &child_info, WNOHANG);
but all I get on my Debian Linux 4.19 system is EINVAL.
So here's what I cobbled together. I'm searching for the child process by name; I can afford to take a few shortcuts, as I'm sure there will only be one process with this name. Ironically, commandline utility ps is invoked by yet another popen(). This won't win any elegance prizes, but at least my embedded system stays afloat now.
FILE* child = popen("child", "r");
if (child)
{
    int nr_loops;
    int child_pid;
    for (nr_loops=10; nr_loops; nr_loops--)
    {
        FILE* ps = popen("ps | grep child | grep -v grep | grep -v \"sh -c \" | sed \'s/^ *//\' | sed \'s/ .*$//\'", "r");
        child_pid = 0;
        int found = fscanf(ps, "%d", &child_pid);
        pclose(ps);
        if (found != 1)
            // The child process is no longer running, no risk of blocking pclose()
            break;
        syslog(LOG_WARNING, "child running PID %d", child_pid);
        usleep(1000000); // 1 second
    }
    if (!nr_loops)
    {
        // Time to kill this runaway child
        syslog(LOG_ERR, "killing PID %d", child_pid);
        kill(child_pid, SIGTERM);
    }
    pclose(child); // Even after it had to be killed
} /* if (child) */
I learned the hard way that I have to pair every popen() with a pclose(), otherwise the zombie processes pile up. I find it remarkable that this is needed even after a direct kill; I figure that's because, according to the manpage, popen() actually launches sh -c with the child process in it, and it's this surrounding sh that becomes a zombie.