Child processes not sharing parent's cgroup - cgroups

I am new to cgroups. I am managing cgroups using libcgroup on CentOS 64. I have managed to create some cgroups, e.g.
[ehsan.haq@datavault ~]$ cgcreate -g blkio:/test
[ehsan.haq@datavault ~]$ cgcreate -g cpu:/test
Let's say I have a parent process 125672 and its child processes 33117, 33403, 33404, 33880, 34663:
[ehsan.haq@datavault ~]$ pgrep -P 125672
33117
33403
33404
33880
34663
What I want is to move the parent process 125672, its existing children 33117, 33403, 33404, 33880, 34663, and also any future children into the cgroups blkio:/test and cpu:/test. What is the correct way to achieve this?
I have tried
cgclassify -g blkio:/test [--sticky] 125672
cgclassify -g cpu:/test [--sticky] 125672
but the cgroup.procs and tasks files contain only the parent process ID.
[ehsan.haq@datavault ~]$ cat /cgroup/blkio/test/cgroup.procs
125672
[ehsan.haq@datavault ~]$ cat /cgroup/blkio/test/tasks
125672
Does it mean that the child processes are not in the parent's cgroup? If they are not, how should I put them there?
Update
I found out that instead of using cgclassify, if I do
echo "{parent PID}" > /cgroup/blkio/test/tasks
then all future child processes also end up in /cgroup/blkio/test/cgroup.procs.
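For reference, here is an untested sketch that combines cgclassify with pgrep to move the existing children along with the parent (paths follow the /cgroup mount layout shown above; children forked afterwards inherit the parent's cgroups automatically):
parent=125672
# classify the parent and every current child into both cgroups
cgclassify -g blkio:/test -g cpu:/test --sticky "$parent" $(pgrep -P "$parent")
cat /cgroup/blkio/test/cgroup.procs   # should now list the parent and its children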


Linux: setting process priority AND dynamically loading libraries

I have a Linux application which loads *.so libraries using a modified rpath (set during installation). It also needs to run with realtime priority.
To get realtime priority it does this:
#include <sched.h>   // sched_setscheduler, SCHED_FIFO, struct sched_param
#include <unistd.h>  // getpid

struct sched_param sched = {};
sched.sched_priority = 70;
sched_setscheduler(getpid(), SCHED_FIFO, &sched);
However, sched_setscheduler is a privileged call, protected by the CAP_SYS_NICE capability. Therefore, to get realtime priority without running as root, I add setcap to my postinst:
setcap cap_sys_nice+ep /path/to/myapp
However, Linux decides that programs should not be allowed to load libraries from rpath if they have extra capabilities.
Is there a way for me to set my own priority and load rpath libraries?
Note: I'd prefer to do this in the application or in the postinst. I'd like to avoid deploying scripts as the only way to launch the application. I know sudo chrt -f -p 70 $! could do it from a script.
I have two solutions which do not involve modifying libc. Both solutions require us to replace the calls to sched_setscheduler() with a call to launch another process directly.
Install a file to /etc/sudoers.d/ with the following line:
%users ALL=NOPASSWD: /usr/bin/chrt
Then from our application launch sudo as a process with arguments chrt -f -p X Y where X is the configured priority and Y is the result of getpid().
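A quick, hypothetical way to verify the sudoers rule from a shell before wiring it into the application (the PID 12345 and priority 70 are placeholders):
sudo -n chrt -f -p 70 12345   # -n fails instead of prompting if the sudoers entry is wrong
chrt -p 12345                 # should now report SCHED_FIFO with priority 70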
Create a custom chrt with:
cp $(which chrt) $(DESTDIR)/bin/chrt
sudo setcap cap_sys_nice+ep $(DESTDIR)/bin/chrt
sudo chmod 755 $(DESTDIR)/bin/chrt
Then from our application launch chrt as a process with arguments -f -p X Y
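And a hypothetical sanity check for this variant, again with placeholder PID 12345 and priority 70, assuming DESTDIR is the install prefix used above:
getcap "$DESTDIR"/bin/chrt          # should report cap_sys_nice on the copied binary
"$DESTDIR"/bin/chrt -f -p 70 12345
chrt -p 12345                       # verify that the policy and priority took effect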
I'm not sure which solution is better. Note this is effectively embedded (or at least purpose-built), so I'm not too worried about the security exposure.

Running parallel code on a PC

I have Fortran code that has been parallelized with OpenMP. I want to test my code on my PC before running it on an HPC system. My PC has a dual-core CPU and I work on Linux Mint. I installed gfortran-multilib and this is my script:
#!/bin/bash
### Job name
#PBS -N pme
### Keep Output and Error
#PBS -j eo
### Specify the number of nodes and thread (ppn) for your job.
#PBS -l nodes=1:ppn=2
### Switch to the working directory;
cd $PBS_O_WORKDIR
### Run:
OMP_NUM_THREADS=$PBS_NUM_PPN
export OMP_NUM_THREADS
ulimit -s unlimited
./a.out
echo 'done'
What more should I do to run my code?
OK, I changed the script as suggested in the answers:
#!/bin/bash
### Switch to the working directory;
cd Desktop/test
### Run:
OMP_NUM_THREADS=2
export OMP_NUM_THREADS
ulimit -s unlimited
./a.out
echo 'done'
My code and its executable file are in the folder test on the Desktop, so:
cd Desktop/test
Is this correct?
then I compile my simple code:
implicit none
!$OMP PARALLEL
write(6,*)'hi'
!$OMP END PARALLEL
end
by command:
gfortran -fopenmp test.f
and then run by:
./a.out
but only one "hi" is printed as output. What should I do?
(And a question about this site: in a situation like this, should I edit my post or just add a comment?)
You don't need, and probably don't want, to use the script on your PC, not even to learn how to use such a script, because these scripts are tied closely to the specifics of each supercomputer.
I use several supercomputers/clusters and I cannot just reuse the script from one on another, because they differ so much.
On your PC you should just do:
optional, it is probably the default
export OMP_NUM_THREADS=2
to set the number of OpenMP threads to 2. Adjust if you need some other number.
cd to the working directory
cd my_working_directory
Your working directory is the directory where you have the required data or where the executable resides. In your case it seems to be the directory where a.out is.
run the damn thing
ulimit -s unlimited
./a.out
That's it.
You can also store the standard output and error output in files
./a.out > out.txt 2> err.txt
to mimic the supercomputer behaviour.
The PBS variables are only set when you run the script using qsub. You probably don't have that on your PC and you probably don't want to have it either.
$PBS_O_WORKDIR is the directory where you run the qsub command, unless you set it differently by other means.
$PBS_NUM_PPN is the number you indicated in #PBS -l nodes=1:ppn=2. The queue system reads that and sets this variable for you.
The script you posted is for the Portable Batch System (https://en.wikipedia.org/wiki/Portable_Batch_System) queue system. That means that the job you want to run on the HPC infrastructure has to go into the queue system first, and when the resources are available the job will run on the system.
Some of the commands (those starting with #PBS) are specific to this queue system. Among them, some allow the user to indicate the application process hierarchy (i.e. the number of processes and threads). Also, keep in mind that since all the PBS commands start with #, they are ignored by regular shell script execution. In the case you presented, that is given by
### Specify the number of nodes and thread (ppn) for your job.
#PBS -l nodes=1:ppn=2
which, as the comment indicates, tells the queue system that you want to run 1 process and that each process will have 2 threads. The queue system is likely to pass these parameters to the process launcher (srun/mpirun/aprun/... for MPI apps, in addition to OMP_NUM_THREADS for OpenMP apps).
If you want to run this job on a computer that does not have a PBS queue, you should be aware of at least two things.
1) The following command
### Switch to the working directory;
cd $PBS_O_WORKDIR
will be translated into a plain "cd" because the environment variable PBS_O_WORKDIR is only defined within the PBS job context. So you should change this command (or execute another cd command just before the execution) in order to set where you want to run the job.
2) Similarly for PBS_NUM_PPN environment variable,
OMP_NUM_THREADS=$PBS_NUM_PPN
export OMP_NUM_THREADS
this variable won't be defined if you don't run this within a PBS job context, so you should set OMP_NUM_THREADS to the value you want (2, according to your question) manually.
If you want your Linux box environment to be like an HPC login node, you can do the following:
Make sure that your compiler supports OpenMP and test a simple hello world program with the OpenMP flags
Install OpenMPI on your system from your favourite package manager or download the source/binary from the website (OpenMPI Download)
I would not recommend installing a cluster manager like Slurm for your experiments
After you are done, you can execute your MPI programs through the mpirun wrapper
mpirun -n <no_of_cores> <executable>
EDIT:
This assumes that you are running MPI only. Note that OpenMP utilizes the cores as well: if you are running MPI+OpenMP, then n * OMP_NUM_THREADS = cores on a single node.
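As a sketch, a hybrid MPI+OpenMP launch on a single 8-core node might look like this (2 ranks times 4 threads per rank = 8 cores; the executable name is a placeholder):
export OMP_NUM_THREADS=4   # OpenMP threads per MPI rank
mpirun -n 2 ./a.out        # 2 MPI ranks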

Process gets killed after xterm terminates

I want to run an xterm terminal from C++ to create a Linux process, like this:
system("xterm -e adb start-server")
The adb process is created, but after that command it gets killed. I tried to solve this problem using nohup and screen, but nothing works. I know that I have to put the adb process into the background, but how do I do that with xterm?
Edit:
I'm looking for a solution that will terminate/close the xterm window but not the adb process. Later I want to use multiple commands in the same xterm window, like
system("xterm -e \"adb start-server; adb connect 192.168.X.XXX;\"");
and I want to see all output (and any errors) in the same xterm.
You can do it like this:
xterm -e /bin/bash -c "adb start-server; /bin/bash"
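Applied to the multi-command case from the question, the same pattern might look like this (untested; the trailing /bin/bash is what keeps the xterm window open after the adb commands finish, and from C++ the whole line would be passed to system() with the inner quotes escaped):
xterm -e /bin/bash -c "adb start-server; adb connect 192.168.X.XXX; /bin/bash"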

Have GNU Screen Pass SIGTERM Signal to Child Processes, Allowing Them To Shut Down Cleanly

We are using Upstart to launch/terminate an in-house developed binary.
In the Upstart configuration file for this binary, we define the script as such:
script
exec su - user -c "screen -D -m -S $product /opt/bin/prog /opt/cfg/$product -v 5 --max_log_size=7"
end script
When the runlevel is set to 5, Upstart launches the script. When the runlevel is set to 3, Upstart terminates the script.
My problem is that Upstart sends a SIGTERM and then a SIGKILL.
The SIGTERM is being 'handled' by screen, not by my custom binary, so the signal handlers in our binary don't get the SIGTERM and thus cannot shut down cleanly.
I've verified that the signal handlers in our binary do allow it to shut down cleanly when it is NOT launched via screen.
It turns out I had to approach this from a different perspective and handle it via Upstart. Adding a pre-stop script allowed me to identify the Screen session and then stuff in the commands ("quit\n" and then "y\n") to cleanly shut down the binary that Screen was running.
pre-stop script
SESSID=`ps -elf | grep '/opt/bin/prog /opt/cfg/$product' | grep SCREEN | awk '{print $4}'`
QUIT_CMD="screen -S $SESSID.$product -X stuff \"exit"$'\n'"\""
exec `su spuser -c "$QUIT_CMD"`
QUIT_CMD="screen -S $SESSID.$product -X stuff \"y"$'\n'"\""
exec `su spuser -c "$QUIT_CMD"`
sleep 20
end script
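A hypothetical way to double-check the session name and that the stuffed keystrokes arrive (12345.myproduct is a placeholder; screen -ls shows the real <pid>.<name> string that -S expects):
su - user -c "screen -ls"                                                   # e.g. "12345.myproduct (Detached)"
su - user -c "screen -S 12345.myproduct -X hardcopy /tmp/screen_dump.txt"   # dump the window contents to inspect the output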

Why can't an environment variable be seen by an executable if it is run on two or more nodes?

I am writing a program (I'll call it the "launcher") in C++ using MPI to "Spawn" a second executable (the "slave"). Depending on how many nodes a cluster has available for the launcher, it will launch slaves on each node, and each slave communicates back with the launcher, also through MPI. When a slave is done with its math, it tells the launcher that the node is now available and the launcher Spawns another slave on the free node. The point is to run 1000 independent calculations, which depend on a second executable, on a heterogeneous group of machines.
This works on my own computer, where I create a "fake" machinefile (or hostfile) giving two nodes to the program: localhost and localhost. The launcher Spawns two slaves, and when one of them ends another slave is launched. This tells me that the spawning process is working correctly.
When I move it to the cluster at my lab (we use Torque/Maui to manage it), it also works if I ask for 1 (one) node. If I ask for more, I get a missing library error (libimf.so, to be precise, a library from the Intel compilers). The lib is there and the node can see it, since the program runs if I ask for just one node.
My PBS script that works looks like this:
#!/bin/bash
#PBS -q small
#PBS -l nodes=1:ppn=8:xeon
#PBS -l walltime=1:00:00
#PBS -N MyJob
#PBS -V
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/mpich2.shared.exec/lib/:/opt/intel/composerxe-2011.3.174/compiler/lib/intel64/:/usr/local/boost/lib/
log_file="output_pbs.txt"
cd $PBS_O_WORKDIR
echo "Beginning PBS script." > $log_file
echo "Executing on hosts ($PBS_NODEFILE): " >> $log_file
cat $PBS_NODEFILE >> $log_file
echo "Running your stuff now!" >> $log_file
# mpiexec is needed in order to let "launcher" call MPI_Comm_spawn.
/usr/local/mpich2.shared.exec/bin/mpiexec -hostfile $PBS_NODEFILE -n 1 /home/user/launcher --hostfile $PBS_NODEFILE -r 1 >> $log_file 2>&1
echo "Fim do pbs." >> $log_file
When I try two or more nodes, the launcher doesn't Spawn any executables.
I get an output like this:
Beginning PBS script.
Executing on hosts (/var/spool/torque/aux//2742.cluster):
node3
node3
node3
node3
node3
node3
node3
node3
node2
node2
node2
node2
node2
node2
node2
node2
Running your stuff now!
(Bla bla bla from launcher initialization)
Spawning!
/usr/local/mpich2.shared.exec/bin/hydra_pmi_proxy: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
I found one other person with a problem like mine on a mailing list, but no solution (http://lists.mcs.anl.gov/pipermail/mpich-discuss/2011-July/010442.html). The only answer suggested checking whether the node can see the lib (i.e. whether the directory where the lib is stored is mounted on the node), so I tried an
ssh node2 ls /opt/intel/composerxe-2011.3.174/compiler/lib/intel64/libimf.so >> $log_file
inside my PBS script and the lib exists in a folder that the node can see.
In my opinion, it seems that torque/maui is not exporting the environment variables to all nodes (even though I don't know why it wouldn't), so when I try to use MPI_Spawn to run another executable in another node, it can't find the lib.
Does that make any sense? If so, could you suggest a solution?
Can anyone offer any other ideas?
Thanks in advance,
Marcelo
EDIT:
Following the suggestion in one of the answers, I installed OpenMPI to test the option "-x VARNAME" with mpiexec. In the PBS script I changed the execution line to the following:
/usr/local/openmpi144/bin/mpiexec -x LD_LIBRARY_PATH -hostfile $PBS_NODEFILE -n 1 /var/dipro/melomcr/GSAFold_2/gsafold --hostfile $PBS_NODEFILE -r 1 >> $log_file 2>&1
but got the following error messages:
[node5:02982] [[3837,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 105
[node5:02982] [[3837,1],0] could not get route to [[INVALID],INVALID]
[node5:02982] [[3837,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/plm_base_proxy.c at line 86
From the internet I gathered that this error usually comes from executing mpiexec more than once, as in /path/to/mpiexec mpiexec -n 2 my_program, which is not my case.
I believe I should add that the spawned "slave" program communicates with the "launcher" program using a port. The launcher opens a port with MPI_Open_port and MPI_Comm_accept, then it waits for the slave program to connect when the slave runs MPI_Comm_connect.
Like I said above, all of this works (with MPICH2) when I ask for just one node. With OpenMPI I get the above error even when I ask for only one node.
You're correct. The remoting calls far below the clustering software do not transfer environment variables.
You could use the -x option to mpiexec to pass environment variables to other nodes.
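For reference, with Open MPI the -x flag can be repeated, once per variable to forward; a hedged example based on the command line from the question (./launcher stands in for the real executable path):
mpiexec -x LD_LIBRARY_PATH -x PATH -hostfile $PBS_NODEFILE -n 1 ./launcher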