SLURM: how should I understand the ntasks parameter?

I am playing with a cluster using SLURM on AWS. I have defined the following parameters:
#!/bin/sh
[...]
#SBATCH --ntasks=216
#SBATCH --constraint=c5n.18xlarge
Now how should I understand ntasks? What exactly is this parameter? How does it relate to the number of vCPUs, and therefore to the number of nodes that will be provisioned?
AFAIK, it does not correspond to the number of vCPUs, because I tried selecting a multiple of 72 (a c5n.18xlarge has 72 vCPUs) and it did not correspond to the number of EC2 instances provisioned.
I saw I can also use other parameters such as:
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
but again, the ntasks parameter remains unclear to me. For information, I then use the cluster to run an OpenMPI process using the $SLURM_NTASKS variable, as advised in an AWS workshop, i.e.:
mpirun -np $SLURM_NTASKS some_process
Thanks for your help

In Slurm, the number of tasks is essentially the number of parallel programs you can start in your allocation. By default, each task gets access to one CPU (which can be a core or a thread, depending on the configuration); this can be changed with --cpus-per-task=#.
This by itself does not tell you anything about the number of nodes you will get. If you only specify --ntasks (or just -n), your job will be spread over however many nodes are available. You can limit this with --nodes <min>-<max> or --nodes <exact>.
Another way to specify the number of tasks is --ntasks-per-node, which does exactly what it says and is best used in conjunction with --nodes (not with --ntasks; combined with --ntasks it only sets the maximum number of tasks per node).
So, if you want three nodes with 72 tasks each (each task getting the default single CPU), try:
#SBATCH --ntasks=216
#SBATCH --nodes=3
#SBATCH --constraint=c5n.18xlarge
or:
#SBATCH --ntasks-per-node=72
#SBATCH --nodes=3
#SBATCH --constraint=c5n.18xlarge
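To tie this back to the mpirun line from the question, here is a minimal sketch of a complete job script (assuming the OpenMPI setup from the question; some_process is just the question's placeholder binary):
#!/bin/sh
#SBATCH --ntasks=216              # 216 MPI ranks in total
#SBATCH --nodes=3                 # spread over exactly 3 nodes, i.e. 72 ranks per node
#SBATCH --constraint=c5n.18xlarge

# SLURM_NTASKS expands to 216 here, so mpirun starts one rank per Slurm task
mpirun -np $SLURM_NTASKS some_process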

Related

How to use two nodes for one OpenMp Fortran90 code in SLURM Cluster?

I am brand new to using SLURM on a cluster.
I am now struggling with OpenMP Fortran 90.
I try to calculate integrals using two nodes (node1 and node2) through SLURM.
What I want is to return one value by combining the calculations of node 1 and node 2 using Fortran OpenMP.
However, when I use "srun", it appears that the two nodes compute the same executable file independently.
For example, if I run the code as below, each node returns the same result, so I get two identical values. On the other hand, if I execute it without "srun", it looks fine, but actually it is not: when I check with the "squeue" command, it seems to be using 100 CPUs across the two nodes (which looks fine!), but in reality, if I "ssh node#" (#=1,2) and check each of the two nodes, only node1 is using 100 CPUs and node2 is not working at all.
Can someone shed some light on this for me?
----source code----
program integral
  use omp_lib
  implicit none
  integer :: i, n
  real :: x, y1, y2, xs, xe, dx, sum, dsum

  n = 100000000
  xs = 0.
  xe = 3.
  sum = 0.
  dx = (xe - xs) / real(n)

  ! y1 and y2 are written by every thread, so they must be private as well
  !$omp parallel do default(shared) private(i,x,y1,y2,dsum) reduction(+:sum)
  do i = 1, n
    x = xs + real(i-1)*dx
    y1 = x**2
    y2 = (x+dx)**2
    dsum = (y1+y2)*dx/2
    sum = sum + dsum
  enddo
  !$omp end parallel do

  print*, sum
end program
----job script----
#!/bin/sh
#SBATCH -J test
#SBATCH -p oldbatch
#SBATCH -o test%j.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=50
export OMP_NUM_THREADS=50
srun ./a.out

MPI_Comm_spawn fails with "All nodes which are allocated for this job are already filled"

I'm trying to use Torque's (5.1.1) qsub command to launch multiple OpenMPI processes, one process per node, and to have each process launch a single process on its own local node using MPI_Comm_spawn(). MPI_Comm_spawn() is reporting:
All nodes which are allocated for this job are already filled.
My OpenMPI version is 4.0.1.
I am following the instructions here to control the mapping of nodes.
Controlling node mapping of MPI_COMM_SPAWN
using the --map-by ppr:1:node option to mpiexec, and a hostfile (programmatically derived from the ${PBS_NODEFILE} file that Torque produces). My derived file MyHostFile looks like this:
n001.cluster.com slots=2 max_slots=2
n002.cluster.com slots=2 max_slots=2
while the original ${PBS_NODEFILE} only has the node names, and no slot specifications.
My qsub command is
qsub -V -j oe -e ./tempdir -o ./tempdir -N MyJob MyJob.bash
The mpiexec command from MyJob.bash is
mpiexec --display-map --np 2 --hostfile MyNodefile --map-by ppr:1:node <executable>.
MPI_Comm_spawn() causes this error to be printed:
Data for JOB [22220,1] offset 0 Total slots allocated 1 <=====
======================== JOB MAP ========================
Data for node: n001 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [22220,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]
=============================================================
All nodes which are allocated for this job are already filled.
There are two things that occur to me:
(1) "Total slots allocated" is 1 above, but I need at least two slots available.
(2) It may not be right to try to specify a hostfile to mpiexec when
using Torque (though it is derived from the Torque hostfile ${PBS_NODEFILE}). Maybe my derived hostfile is being ignored.
Is there a way to make this work? I've tried recompiling OpenMPI without Torque support, hoping to prevent OpenMPI from interacting with it, but it didn't change the error message.
Answering my own question: adding the argument -l nodes=1:ppn=2 to the qsub command reserves 2 processors on the node, even though mpiexec is launching only one process. MPI_Comm_spawn() can then spawn the new process on the second reserved slot.
I also had to compile OpenMPI without Torque support, since including it causes my hostfile argument to be ignored and the Torque-generated hostfile to be used.
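As a sketch of the fix (reusing the qsub command from the question, with only the resource request added), the adjusted submission would look something like:
# -l nodes=1:ppn=2 reserves two processors on one node: one for the initial rank,
# one for the process later spawned by MPI_Comm_spawn()
qsub -V -j oe -e ./tempdir -o ./tempdir -N MyJob -l nodes=1:ppn=2 MyJob.bash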

OpenMP code is using only 4 threads instead of the specified 72

I have a program written by someone else that uses OpenMP. I am running it on a cluster that uses Slurm as its job manager. Despite setting OMP_NUM_THREADS=72 and properly requesting 72 cores for the job, the job is only using four cores.
I have already used scontrol show job <job_id> --details to verify that there are 72 cores assigned to the job. I have also remoted into the node that the job is running on and used htop to inspect it. It was running 72 threads, all on four cores. It is worth noting that this is on an SMT4 POWER9 CPU, meaning that each physical core executes 4 simultaneous threads. Ultimately, it looks like OpenMP is putting all threads on one physical core. This is further complicated by the fact that this is an IBM system. I can't seem to find any useful documentation on finer control of the OpenMP environment; everything I find is for Intel.
I have also tried using taskset to manually change the affinity. This worked as intended and moved one of the threads to an unused core. The program continued to work as intended after this.
I could theoretically write a script to find all of the threads and call taskset to assign them to cores in a logical way, but I am afraid to do this. It seems like a bad idea to me. It would also take a while.
I guess my main question would be: is this a Slurm problem, an OpenMP problem, an IBM problem, or a user error? Is there some environment variable I don't know about that I need to set? Will it break Slurm if I manually call taskset from a script? If I did that, I would use scontrol to figure out which CPUs are assigned to the job. I don't want to anger the people who run the cluster by messing things up, though.
Here is the submission script. I can't include any of the actual running code due to license issues though. I'm hoping this will just be a simple matter of fixing an environment variable. The MPI_OPTIONS variables were recommended by the guy who administers the system. If by some chance someone here has worked with the ENKI cluster before, that's where this is running.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module load openmpi/3.1.3/2019
module load pgi/2019
export OMP_NUM_THREADS=72
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
Edit: Fix resulted in a 7x speedup when using 72 cores, vs. just running on 4 cores. Considering the nature of the calculations being run, this is pretty good.
Edit 2: Fix resulted in a 17x speedup when using 160 cores, vs. just running on 4 cores.
This might not work for everyone, but I have a really hacky solution. I wrote a python script that uses psutil to find all threads that are children of the running process and set their affinity manually. This script uses scontrol to figure out which cpus are assigned to the job and uses taskset to force the threads to distribute across those cpus.
So far the process is running a lot faster. I'm sure that forcing CPU affinity isn't the best way to do it, but it's a lot better than not using the available resources at all.
Here is the basic idea behind the code. The program I am running is called pgmc, hence the variable names. You will need to create an anaconda environment with psutil installed if you are running on a system like mine.
import psutil
import subprocess
import os
import sys
import time

# Gets the id for the current job.
def get_job_id():
    return os.environ["SLURM_JOB_ID"]

# Returns a list of processors assigned to the job and the total number of cpus
# assigned to the job.
def get_proc_info():
    run_str = 'scontrol show job %s --details' % get_job_id()
    stdout = subprocess.getoutput(run_str)
    id_spec = None
    num_cpus = None
    chunks = stdout.split(' ')
    for chunk in chunks:
        if chunk.lower().startswith("cpu_ids"):
            id_spec = chunk.split('=')[1]
            start, stop = id_spec.split('-')
            id_spec = list(range(int(start), int(stop) + 1))
        if chunk.lower().startswith('numcpus'):
            num_cpus = int(chunk.split('=')[1])
        if id_spec is not None and num_cpus is not None:
            return id_spec, num_cpus
    raise Exception("Couldn't find information about the allocated cpus.")

if __name__ == '__main__':
    # Before we do anything, make sure that we can get the list of cpus
    # assigned to the job. Once we have that, run the command line supplied.
    cpus, cpu_count = get_proc_info()
    if len(cpus) != cpu_count:
        raise Exception("CPU list didn't match CPU count.")

    # If we successfully got to here, run the command line.
    program_name = ' '.join(sys.argv[1:])
    pgmc = subprocess.Popen(sys.argv[1:])
    time.sleep(10)
    pid = [proc for proc in psutil.process_iter() if proc.name() == "your program name here"][0].pid

    # Now that we have the pid of the pgmc process, we need to get all
    # child threads of the process.
    pgmc_proc = psutil.Process(pid)
    pgmc_threads = list(pgmc_proc.threads())

    # Now that we have a list of threads, we loop over available cores and
    # assign threads to them. Once this is done, we wait for the process
    # to complete.
    while len(pgmc_threads) != 0:
        for core_id in cpus:
            if len(pgmc_threads) != 0:
                thread_id = pgmc_threads[-1].id
                pgmc_threads.remove(pgmc_threads[-1])
                taskset_string = 'taskset -cp %i %i' % (core_id, thread_id)
                print(taskset_string)
                subprocess.getoutput(taskset_string)
            else:
                break

    # All of the threads should now be assigned to a core.
    # Wait for the process to exit.
    pgmc.wait()
    print("program terminated, exiting . . . ")
Here is the submission script used.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module purge
module load openmpi/3.1.3/2019
module load pgi/2019
module load anaconda3
# This is the anaconda environment I created with psutil installed.
conda activate psutil-node
export OMP_NUM_THREADS=72
# The two MPI_OPTIONS lines are specific to this cluster if I'm not mistaken.
# You probably won't need them.
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time python3 affinity_set.py mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
My main reason for including the submission script is to demonstrate how the python script is used. More specifically, you call it, with your real job as an argument.
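As an aside that is not part of the fix above, and untested on this particular system: on runtimes that honor the standard OpenMP 4.0 affinity environment variables, it is sometimes enough to let the OpenMP runtime spread the threads itself instead of pinning them externally, e.g. by exporting, before the mpirun line:
export OMP_NUM_THREADS=72
export OMP_PLACES=cores      # one place per physical core
export OMP_PROC_BIND=spread  # spread the 72 threads across those places
Whether this helps depends on the compiler's OpenMP runtime and on how mpirun binds the rank, so treat it only as something to try.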

Create cron in chef

I want to create a cron job in Chef that checks the size of a log file and deletes the file if it is larger than 30 MB. Here is my code:
cron_d 'ganglia_tomcat_thread_max' do
  hour '0'
  minute '1'
  command "rm -f /srv/node/current/app/log/simplesamlphp.log"
  only_if { ::File.size('/srv/node/current/app/log/simplesamlphp.log').to_f / 1024000 > 30 }
end
Can you help me with this, please?

Welcome to Stack Overflow!
I suggest going with an existing tool like logrotate. There is a Chef cookbook available to manage logrotate.
Please note that "cron" in Chef manages the system cron service, which runs independently of Chef. You'll have to do the file size check within the "command". It's also better to use the cron_d resource as documented here.

The way you create the cron_d resource, the cron task will only be added when your log file is larger than 30 MB; in all other cases the cron_d resource will not be created.
You can use this Ruby code
File.size('file').to_f / 2**20
to get the file size in megabytes; there is a slight difference in the result, and I believe it is more correct.
So you can go with 2 solutions for your specific case:
1. Create a new cron_d resource when the log file is less than 30 MB, to remove the existing cron, and provision your node periodically.
2. Move the file size check into the command itself with bash and glue it together with &&; in that case the file will be deleted only if its size is greater than 30 MB (a sketch of such a command follows below). Note that something like
du -k file.txt | cut -f1
returns the size of the file in kilobytes.
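Putting solution 2 together, the full cron command could look something like this (only a sketch, assuming GNU stat is available on the node; 30 MB is 31457280 bytes and the path is the one from the question):
[ "$(stat -c%s /srv/node/current/app/log/simplesamlphp.log 2>/dev/null || echo 0)" -gt 31457280 ] && rm -f /srv/node/current/app/log/simplesamlphp.log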
To me, the correct way to do this is also to use the logrotate service and a Chef recipe for it.

AWS EMR Parallel Mappers?

I am trying to determine how many nodes I need for my EMR cluster. As part of best practices, the recommendation is:
(Total Mappers needed for your job + Time taken to process) / (per instance capacity + desired time) as outlined here: http://www.slideshare.net/AmazonWebServices/amazon-elastic-mapreduce-deep-dive-and-best-practices-bdt404-aws-reinvent-2013, page 89.
The question is how to determine how many parallel mappers each instance type will support, since AWS doesn't publish this: https://aws.amazon.com/emr/pricing/
Sorry if I missed something obvious.
Wayne
To determine the number of parallel mappers, you will need to check the EMR documentation called Task Configuration, where EMR has a predefined set of configurations for every instance type that determines the number of mappers/reducers.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html
For example: Let's say you have 5 m1.xlarge core nodes. According to the default mapred-site.xml configuration values for that instance type from the EMR docs, we have
mapreduce.map.memory.mb = 768
yarn.nodemanager.resource.memory-mb = 12288
yarn.scheduler.maximum-allocation-mb = 12288 (same as above)
You can simply divide the latter by the former to get the maximum number of mappers supported by one m1.xlarge node: 12288 / 768 = 16.
So, for the 5-node cluster, a maximum of 16 * 5 = 80 mappers can run in parallel (considering a map-only job). The same applies to the maximum number of parallel reducers (30). You can do similar math for a combination of mappers and reducers.
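The same arithmetic written out as a quick shell check (just a sketch; the numbers are the m1.xlarge defaults quoted above):
# mappers per node = NodeManager memory / memory per map container
echo $(( 12288 / 768 ))       # 16
# mappers across 5 core nodes
echo $(( 5 * 12288 / 768 ))   # 80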
So, if you want to run more mappers in parallel, you can either resize the cluster or reduce mapreduce.map.memory.mb (and its heap, mapreduce.map.java.opts) on every node and restart the NodeManager for the change to take effect.
To understand what the above mapred-site.xml properties mean and why you need to do these calculations, you can refer to this:
https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Note: The above calculations and statements are true if EMR stays in its default configuration, using the YARN capacity scheduler with the DefaultResourceCalculator. If, for example, you configure your capacity scheduler to use the DominantResourceCalculator, it will consider vCPUs + memory on every node (not just memory) to decide on the number of parallel mappers.