While building TensorFlow 2.x (for CPU) from source, what change should I make to force TensorFlow not to use more than one thread? If this is not possible, which specific C++ statements (and in which .cpp files) should I change to suppress the creation of multiple threads?
No matter how many CPUs/cores there are, I need one thread in total from TensorFlow 2.x.
Use top -H -b -n1 | grep program_name | wc -l to count the total number of threads.
In C++, the solution is in the options you pass to a session:
// set the number of worker threads
tensorflow::SessionOptions options;
tensorflow::ConfigProto& configuration = options.config;
configuration.set_inter_op_parallelism_threads(1);
configuration.set_intra_op_parallelism_threads(1);
configuration.set_use_per_session_threads(false);

// assuming mySession is a std::unique_ptr<tensorflow::Session>
mySession.reset(tensorflow::NewSession(options));
This way you will have only one worker thread.
But it does not guarantee that top -H -b -n1 | grep program_name | wc -l returns 1: the settings above only control the worker threads, and there is always at least the main thread, which manages spawning the workers and collecting their results.
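To double-check from inside the process, here is a minimal Linux-only sketch (my own addition, not part of the answer above) that reads the Threads: field the kernel exposes in /proc/self/status; it should give the same count as the top pipeline above:

// Sketch: count this process's threads by parsing /proc/self/status (Linux).
#include <fstream>
#include <iostream>
#include <string>

static int CountOwnThreads() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        // The line looks like "Threads:\t12".
        if (line.rfind("Threads:", 0) == 0)
            return std::stoi(line.substr(sizeof("Threads:") - 1));
    }
    return -1;  // field not found
}

int main() {
    std::cout << "threads: " << CountOwnThreads() << "\n";
    return 0;
}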
I have an issue with gracefully exiting my Slurm jobs while saving data, etc.
I have a signal handler in my program which sets a flag; the flag is then checked in the main loop, and a graceful exit with data saving follows. The general scheme is something like this:
#include <utility>
#include <atomic>
#include <csignal>
#include <fstream>
#include <unistd.h>

namespace {
    std::atomic<bool> sigint_received{false};
}

void sigint_handler(int) {
    sigint_received = true;
}

int main() {
    std::signal(SIGTERM, sigint_handler);
    while (true) {
        usleep(10); // There are around 100 iterations per second
        if (sigint_received)
            break;
    }

    std::ofstream out("result.dat");
    if (!out)
        return 1;
    out << "Here I save the data";
    return 0;
}
Batch scripts are unfortunately complicated because:
I want hundreds of parallel, low-thread-count independent tasks, but my cluster allows only 16 jobs per user
srun in my cluster always claims a whole node, even if I don't want all cores, so in order to run multiple processes on a single node I have to use bash
Because of this, the batch script is this mess (2 nodes for 4 processes):
#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.
srun -N 1 -n 1 bash -c '
./my_program input1 &
./my_program input2 &
wait
' &
srun -N 1 -n 1 bash -c '
./my_program input3 &
./my_program input4 &
wait
' &
wait
Now, to propagate signals sent by Slurm, I have an even bigger mess like this (following this answer, in particular the double waits):
#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.
trap 'kill $(jobs -p) && wait' TERM
srun -N 1 -n 1 bash -c '
trap '"'"'kill $(jobs -p) && wait'"'"' TERM
./my_program input1 &
./my_program input2 &
wait
' &
srun -N 1 -n 1 bash -c '
trap '"'"'kill $(jobs -p) && wait'"'"' TERM
./my_program input3 &
./my_program input4 &
wait
' &
wait
For the most part it is working. But, firstly, I am getting error messages at the end of the output:
srun: error: nid00682: task 0: Exited with exit code 143
srun: Terminating job step 732774.7
srun: error: nid00541: task 0: Exited with exit code 143
srun: Terminating job step 732774.4
...
and, what is worse, some 4-6 out of over 300 processes actually fail on if (!out) - errno gives "Interrupted system call". Again, guided by this, I guess that my signal handler is called twice - the second time during some syscall inside the std::ofstream constructor.
Now,
How do I get rid of the Slurm errors and achieve an actual graceful exit?
Am I correct that the signal is sent twice? If so, why, and how can I fix it?
Suggestions:
trap EXIT, not a signal. EXIT happens once, TERM can be delivered multiple times.
use declare -f to transfer code and declare -p to transfer variables to an unrelated subshell
kill can fail; I do not think you should && on it
use xargs (or parallel) instead of reinventing the wheel with kill $(jobs -p)
extract "data" (input1 input2 ...) from "code" (work to be done)
Something along:
# The input.
input="$(cat <<'EOF'
input1
input2
input3
input4
EOF
)"
work() {
    # Normally the work to be done goes here.
    # For each argument, run `my_program`, all in parallel.
    printf "%s\n" "$@" | xargs -d'\n' -n1 -P0 ./my_program
}
# For every two arguments, run `srun ...` with a shell that runs `work` in parallel.
# Note: declare -f outputs a source-able definition of the function.
# "No more hand escaping!"
# The work function is then called with the arguments passed by xargs inside the spawned shell.
xargs -P0 -n2 -d'\n' <<<"$input" \
    srun -N 1 -n 1 \
    bash -c "$(declare -f work)"'; work "$@"' --
The -P0 is specific to GNU xargs. GNU xargs handles exit status 255 specially; you can write a wrapper like xargs ... bash -c './my_program "$@" || exit 255' -- || exit 255 if you want xargs to terminate when any of the programs fails.
If srun preserves environment variables, then export the work function with export -f work and just call it within the child shell like xargs ... srun ... bash -c 'work "$@"' --.
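On the C++ side, one hedged guess about the "Interrupted system call" failures: if the handler is installed with sigaction() and SA_RESTART rather than std::signal(), a TERM arriving during the open() behind the std::ofstream constructor restarts the call instead of failing with EINTR. A minimal sketch:

// Sketch: install the TERM handler with SA_RESTART so blocking syscalls
// (such as the open() behind std::ofstream) are restarted rather than
// failing with EINTR when a second signal arrives.
#include <atomic>
#include <signal.h>

namespace {
    std::atomic<bool> term_received{false};
}

void handle_term(int) { term_received = true; }

void install_term_handler() {
    struct sigaction sa{};
    sa.sa_handler = handle_term;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  // restart interrupted syscalls
    sigaction(SIGTERM, &sa, nullptr);
}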
I'm trying to use Torque's (5.1.1) qsub command to launch multiple OpenMPI processes, one process per node, and to have each process launch a single process on its own local node using MPI_Comm_spawn(). MPI_Comm_spawn() is reporting:
All nodes which are allocated for this job are already filled.
My OpenMPI version is 4.0.1.
I am following the instructions here to control the mapping of nodes:
Controlling node mapping of MPI_COMM_SPAWN
using the --map-by ppr:1:node option to mpiexec, and a hostfile (programmatically derived from the ${PBS_NODEFILE} file that Torque produces). My derived file MyHostFile looks like this:
n001.cluster.com slots=2 max_slots=2
n002.cluster.com slots=2 max_slots=2
while the original ${PBS_NODEFILE} only has the node names, and no slot specifications.
My qsub command is
qsub -V -j oe -e ./tempdir -o ./tempdir -N MyJob MyJob.bash
The mpiexec command from MyJob.bash is
mpiexec --display-map --np 2 --hostfile MyHostFile --map-by ppr:1:node <executable>.
MPI_Comm_spawn() causes this error to be printed:
Data for JOB [22220,1] offset 0 Total slots allocated 1 <=====
======================== JOB MAP ========================
Data for node: n001 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [22220,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]
=============================================================
All nodes which are allocated for this job are already filled.
There are two things that occur to me:
(1) "Total slots allocated" is 1 above, but I need at least two slots available.
(2) It may not be right to try to specify a hostfile to mpiexec when
using Torque (though it is derived from the Torque hostfile ${PBS_NODEFILE}). Maybe my derived hostfile is being ignored.
Is there a way to make this work? I've tried recompiling OpenMPI without Torque support, hoping to prevent OpenMPI from interacting with it, but it didn't change the error message.
Answering my own question: adding the argument -l nodes=1:ppn=2 to the qsub command reserves 2 processors on the node, even though mpiexec is launching only one process. MPI_Comm_spawn() can then spawn the new process on the second reserved slot.
I also had to compile OpenMPI without Torque support, since including it causes my hostfile argument to be ignored and the Torque-generated hostfile to be used.
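For reference, a minimal sketch of the spawn itself (the ./worker binary name is hypothetical). With the extra slot reserved by -l nodes=1:ppn=2, the spawned process can land on the second slot:

// Sketch: parent spawns one child process via MPI_Comm_spawn.
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm intercomm;
    // Spawn a single copy of ./worker; it occupies the second reserved slot.
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, /*maxprocs=*/1,
                   MPI_INFO_NULL, /*root=*/0, MPI_COMM_SELF,
                   &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Finalize();
    return 0;
}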
I have a program written by someone else that uses OpenMP. I am running it on a cluster that uses Slurm as its job manager. Despite setting OMP_NUM_THREADS=72 and properly requesting 72 cores for the job, the job is only using four cores.
I have already used scontrol show job <job_id> --details to verify that 72 cores are assigned to the job. I have also remoted into the node the job is running on and used htop to inspect it. It was running 72 threads, all on four cores. It is worth noting that this is on an SMT4 POWER9 CPU, meaning that each physical core executes 4 simultaneous threads. Ultimately, it looks like OpenMP is putting all threads on one physical core. This is further complicated by the fact that this is an IBM system. I can't seem to find any useful documentation on finer control of the OpenMP environment; everything I find is for Intel.
I have also tried using taskset to manually change the affinity. This worked as intended and moved one of the threads to an unused core. The program continued to work as intended after this.
I could theoretically write a script to find all of the threads and call taskset to assign them to cores in a logical way, but I am afraid to do this. It seems like a bad idea to me. It would also take a while.
I guess my main question would be: is this a Slurm problem, an OpenMP problem, an IBM problem, or user error? Is there some environment variable I don't know about that I need to set? Will it break Slurm if I manually call taskset from a script? I would use scontrol to figure out which CPUs are assigned to the job if I did that. I don't want to anger the people who run the cluster by messing things up, though.
Here is the submission script. I can't include any of the actual running code due to license issues though. I'm hoping this will just be a simple matter of fixing an environment variable. The MPI_OPTIONS variables were recommended by the guy who administers the system. If by some chance someone here has worked with the ENKI cluster before, that's where this is running.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module load openmpi/3.1.3/2019
module load pgi/2019
export OMP_NUM_THREADS=72
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
Edit: The fix resulted in a 7x speedup when using 72 cores vs. just running on 4 cores. Considering the nature of the calculations being run, this is pretty good.
Edit 2: The fix resulted in a 17x speedup when using 160 cores vs. just running on 4 cores.
This might not work for everyone, but I have a really hacky solution. I wrote a python script that uses psutil to find all threads that are children of the running process and set their affinity manually. This script uses scontrol to figure out which cpus are assigned to the job and uses taskset to force the threads to distribute across those cpus.
So far the process is running a lot faster. I'm sure that forcing CPU affinity isn't the best way to do it, but it's a lot better than not using the available resources at all.
Here is the basic idea behind the code. The program I am running is called pgmc, hence the variable names. You will need to create an anaconda environment with psutil installed if you are running on a system like mine.
import psutil
import subprocess
import os
import sys
import time

# Gets the id for the current job.
def get_job_id():
    return os.environ["SLURM_JOB_ID"]

# Returns a list of processors assigned to the job and the total number of cpus
# assigned to the job. (Assumes CPU_IDs is a single contiguous range like "0-71".)
def get_proc_info():
    run_str = 'scontrol show job %s --details' % get_job_id()
    stdout = subprocess.getoutput(run_str)
    id_spec = None
    num_cpus = None
    chunks = stdout.split(' ')
    for chunk in chunks:
        if chunk.lower().startswith("cpu_ids"):
            id_spec = chunk.split('=')[1]
            start, stop = id_spec.split('-')
            id_spec = list(range(int(start), int(stop) + 1))
        if chunk.lower().startswith('numcpus'):
            num_cpus = int(chunk.split('=')[1])
        if id_spec is not None and num_cpus is not None:
            return id_spec, num_cpus
    raise Exception("Couldn't find information about the allocated cpus.")

if __name__ == '__main__':
    # Before we do anything, make sure that we can get the list of cpus
    # assigned to the job. Once we have that, run the command line supplied.
    cpus, cpu_count = get_proc_info()
    if len(cpus) != cpu_count:
        raise Exception("CPU list didn't match CPU count.")

    # If we successfully got to here, run the command line.
    pgmc = subprocess.Popen(sys.argv[1:])
    time.sleep(10)
    pid = [proc for proc in psutil.process_iter()
           if proc.name() == "your program name here"][0].pid

    # Now that we have the pid of the pgmc process, we need to get all
    # child threads of the process.
    pgmc_proc = psutil.Process(pid)
    pgmc_threads = list(pgmc_proc.threads())

    # Now that we have a list of threads, we loop over available cores and
    # assign threads to them. Once this is done, we wait for the process
    # to complete.
    while len(pgmc_threads) != 0:
        for core_id in cpus:
            if len(pgmc_threads) != 0:
                thread_id = pgmc_threads[-1].id
                pgmc_threads.remove(pgmc_threads[-1])
                taskset_string = 'taskset -cp %i %i' % (core_id, thread_id)
                print(taskset_string)
                subprocess.getoutput(taskset_string)
            else:
                break

    # All of the threads should now be assigned to a core.
    # Wait for the process to exit.
    pgmc.wait()
    print("program terminated, exiting . . . ")
Here is the submission script used.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module purge
module load openmpi/3.1.3/2019
module load pgi/2019
module load anaconda3
# This is the anaconda environment I created with psutil installed.
conda activate psutil-node
export OMP_NUM_THREADS=72
# The two MPI_OPTIONS lines are specific to this cluster if I'm not mistaken.
# You probably won't need them.
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time python3 affinity_set.py mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
My main reason for including the submission script is to demonstrate how the Python script is used. More specifically, you call it with your real job as an argument.
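Independent of the script above, a small diagnostic (a sketch of my own, not part of the fix) can confirm where threads actually land without remoting in with htop: each OpenMP thread prints the logical CPU it is running on via sched_getcpu().

// Sketch: print each OpenMP thread's current logical CPU (glibc/Linux).
// Compile with: g++ -fopenmp placement_check.cpp
#include <cstdio>
#include <omp.h>
#include <sched.h>

int main() {
    #pragma omp parallel
    {
        // If binding is broken as described above, most threads will report
        // hardware threads belonging to the same physical core.
        std::printf("thread %d of %d on cpu %d\n",
                    omp_get_thread_num(), omp_get_num_threads(),
                    sched_getcpu());
    }
    return 0;
}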
I need your help!
I made a reporting daemon (in C++) which needs to periodically execute a bunch of commands on a server.
A simple example command would be: "/bin/ps aux | /usr/bin/wc -l"
The first idea was to fork a child process that runs the command with popen() and set up an alarm() in the parent process to kill the child after 5 seconds if the command has not exited already.
I tried using "sleep 20000" as the command; the child process is killed, but the sleep command is still running... not good.
The second idea was to use execlp() instead of popen(). It works with simple commands (i.e. with no pipes) such as "ls -lisa" or "sleep 20000": I can get the result, and the processes are killed if they're not done after 5 seconds.
Now I need to execute that "/bin/ps aux | /usr/bin/wc -l" command. Obviously it won't work with execlp() directly, so I tried this "hack":
execlp("sh","sh","-c","/bin/ps aux | /usr/bin/wc -l",NULL);
It works... or so I thought. I tried
execlp("sh","sh","-c","sleep 20000",NULL);
just to be sure, and the child process is killed after 5 seconds (my timeout) but the sleep command just stays there...
I'm open to suggestions (I'd settle for a hack)!
Thanks in advance!
TL;DR:
I need a way to:
execute a "complex" command such as "/bin/ps aux | /usr/bin/wc -l"
get its output
make sure it's killed if it takes more than 5 seconds (the ps command is just an example; actual commands may hang forever)
Use the timeout command from coreutils:
/usr/bin/timeout 5 /bin/sh -c "/bin/ps aux | /usr/bin/wc -l"
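If shelling out to timeout is not an option, here is a hedged C++ sketch of the same idea: run the pipeline under /bin/sh in its own process group, capture stdout through a pipe, and kill the entire group on timeout. Killing the group (note the negative pid) is what prevents the orphaned sleep 20000 from the question, since killing only the shell leaves its children running.

// Sketch (POSIX): run `cmd` under /bin/sh in a fresh process group, capture
// its stdout, and SIGKILL the whole group if it outlives `timeout_ms`.
#include <poll.h>
#include <signal.h>
#include <string>
#include <sys/wait.h>
#include <unistd.h>

bool run_with_timeout(const std::string& cmd, int timeout_ms, std::string& out) {
    int fds[2];
    if (pipe(fds) != 0)
        return false;

    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return false; }
    if (pid == 0) {                 // child
        setpgid(0, 0);              // own process group: pgid == child pid
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", cmd.c_str(), (char*)NULL);
        _exit(127);                 // exec failed
    }

    close(fds[1]);
    struct pollfd pfd = {fds[0], POLLIN, 0};
    char buf[4096];
    bool timed_out = false;
    for (;;) {
        // Simplification: the timeout restarts after every read; a robust
        // version would recompute the remaining time from a clock.
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc == 0) { timed_out = true; break; }   // deadline hit
        if (rc < 0) break;                          // poll error
        ssize_t n = read(fds[0], buf, sizeof buf);
        if (n <= 0) break;                          // EOF: pipeline done
        out.append(buf, n);
    }
    if (timed_out)
        kill(-pid, SIGKILL);        // negative pid: signal the whole group
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return !timed_out;
}

Usage would then be along the lines of: std::string out; run_with_timeout("/bin/ps aux | /usr/bin/wc -l", 5000, out);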
I want to write a script for gdb which will save the backtrace (stack) of a process every 10 ms. How can I do this?
It could be something like call-graph profiling for the 'penniless' (for people who can't use any sort of advanced profiler).
Yes, there are a lot of advanced profilers, for popular CPUs and popular OSes. Shark is very impressive and easy to use, but I want to get basic functionality with such a script, working with gdb.
Can you get lsstack? Perhaps you could run that from a script outside your app. Why 10 ms? Percentages will be about the same at 100 ms or more. If the app is too fast, you could artificially slow it down with an outer loop, and that wouldn't change the percentages either. For that matter, you could just use Ctrl-C to get the samples manually under gdb, if the app runs long enough and your goal is to find out where the performance problems are.
(1) Manual. Execute the following in a shell, and keep pressing Ctrl+C repeatedly at the shell prompt.
gdb -x print_callstack.gdb -p pid
or, (2) send signals to the pid repeatedly, the same number of times, from another shell, as in the loop below:
let count=0; \
while [ $count -le 100 ]; do \
    kill -INT pid ; sleep 0.10; \
    let count=$count+1; \
done
The source of print_callstack.gdb from (1) is as below:
set pagination 0
set $count = 0
while $count < 100
backtrace
continue
set $count = $count + 1
end
detach
quit
man page of pstack https://linux.die.net/man/1/pstack
cat > gdb.run
set pagination 0
backtrace
continue
backtrace
continue
... as many more backtrace + continue's as needed
backtrace
continue
detach
quit
gdb -x gdb.run -p $pid
Then just do
kill -INT $pid ; sleep 0.01
in a loop in another script.
kill -INT sends the same signal the terminal sends when you hit Ctrl-C. Exercise for the reader: make the gdb script use a loop with $n iterations.