OpenMP code is using only 4 cores instead of the specified 72 - Fortran

I have a program written by someone else that uses OpenMP. I am running it on a cluster that uses Slurm as its job manager. Despite setting OMP_NUM_THREADS=72 and properly requesting 72 cores for the job, the job is only using four cores.
I have already used scontrol show job <job_id> --details to verify that 72 cores are assigned to the job. I have also remoted into the node the job is running on and inspected it with htop. The program was running 72 threads, all on four logical CPUs. It is worth noting that this is an SMT4 POWER9 CPU, meaning that each physical core executes 4 simultaneous hardware threads. In other words, it looks like OpenMP is putting all of the threads on one physical core. This is further complicated by the fact that this is an IBM system; I can't seem to find any useful documentation on finer control of the OpenMP environment. Everything I find is for Intel.
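For context on the SMT4 mapping, here is a minimal sketch (assuming psutil is available on the compute node) that prints the logical and physical CPU counts along with the affinity mask the process inherits. On an SMT4 POWER9 node the logical count should be four times the physical count, and an inherited mask of only four logical CPUs would explain all threads landing on one physical core:
import os
import psutil

# Logical CPUs include the SMT hardware threads; "physical" counts cores only.
# On an SMT4 POWER9 node the first number should be four times the second.
print("logical CPUs:  ", psutil.cpu_count(logical=True))
print("physical cores:", psutil.cpu_count(logical=False))

# The affinity mask this process inherited; if it lists only four logical
# CPUs, all OpenMP threads will end up sharing one physical core.
print("inherited affinity:", sorted(os.sched_getaffinity(0)))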
I have also tried using taskset to manually change the affinity. This worked and moved one of the threads to an unused core, and the program continued to run correctly afterwards.
I could theoretically write a script to find all of the threads and call taskset to assign them to cores in a logical way, but I am afraid to do this. It seems like a bad idea to me. It would also take a while.
I guess my main question is: is this a Slurm problem, an OpenMP problem, an IBM problem, or user error? Is there some environment variable I don't know about that I need to set? Will it break Slurm if I manually call taskset from a script? If I did that, I would use scontrol to figure out which CPUs are assigned to the job. I don't want to anger the people who run the cluster by messing things up, though.
Here is the submission script. I can't include any of the actual running code due to license issues though. I'm hoping this will just be a simple matter of fixing an environment variable. The MPI_OPTIONS variables were recommended by the guy who administers the system. If by some chance someone here has worked with the ENKI cluster before, that's where this is running.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module load openmpi/3.1.3/2019
module load pgi/2019
export OMP_NUM_THREADS=72
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh

Edit: Fix resulted in a 7x speedup when using 72 cores, vs. just running on 4 cores. Considering the nature of the calculations being run, this is pretty good.
Edit 2: The fix resulted in a 17x speedup when using 160 cores vs. just running on 4.
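To put those numbers in perspective, a rough parallel-efficiency calculation, treating the original 4-CPU run as the baseline (the exact figures obviously depend on the workload):
# Parallel efficiency relative to the original 4-CPU baseline run.
def efficiency(speedup, cpus, baseline_cpus=4):
    return speedup / (cpus / baseline_cpus)

print("72 CPUs:  %.0f%%" % (100 * efficiency(7, 72)))    # about 39%
print("160 CPUs: %.0f%%" % (100 * efficiency(17, 160)))  # roughly 42-43%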
This might not work for everyone, but I have a really hacky solution. I wrote a Python script that uses psutil to find all threads that are children of the running process and set their affinity manually. The script uses scontrol to figure out which CPUs are assigned to the job and calls taskset to distribute the threads across those CPUs.
So far the process is running a lot faster. I'm sure that forcing CPU affinity isn't the best way to do it, but it's a lot better than not using the available resources at all.
Here is the basic idea behind the code. The program I am running is called pgmc, hence the variable names. You will need to create an Anaconda environment with psutil installed if you are running on a system like mine.
import psutil
import subprocess
import os
import sys
import time

# Gets the id for the current job.
def get_job_id():
    return os.environ["SLURM_JOB_ID"]

# Returns a list of processors assigned to the job and the total number of cpus
# assigned to the job.
def get_proc_info():
    run_str = 'scontrol show job %s --details' % get_job_id()
    stdout = subprocess.getoutput(run_str)
    id_spec = None
    num_cpus = None
    chunks = stdout.split(' ')
    for chunk in chunks:
        if chunk.lower().startswith("cpu_ids"):
            id_spec = chunk.split('=')[1]
            start, stop = id_spec.split('-')
            id_spec = list(range(int(start), int(stop) + 1))
        if chunk.lower().startswith('numcpus'):
            num_cpus = int(chunk.split('=')[1])
    if id_spec is not None and num_cpus is not None:
        return id_spec, num_cpus
    raise Exception("Couldn't find information about the allocated cpus.")

if __name__ == '__main__':
    # Before we do anything, make sure that we can get the list of cpus
    # assigned to the job. Once we have that, run the command line supplied.
    cpus, cpu_count = get_proc_info()
    if len(cpus) != cpu_count:
        raise Exception("CPU list didn't match CPU count.")

    # If we successfully got to here, run the command line.
    program_name = ' '.join(sys.argv[1:])
    pgmc = subprocess.Popen(sys.argv[1:])
    time.sleep(10)
    pid = [proc for proc in psutil.process_iter() if proc.name() == "your program name here"][0].pid

    # Now that we have the pid of the pgmc process, we need to get all
    # child threads of the process.
    pgmc_proc = psutil.Process(pid)
    pgmc_threads = list(pgmc_proc.threads())

    # Now that we have a list of threads, we loop over available cores and
    # assign threads to them. Once this is done, we wait for the process
    # to complete.
    while len(pgmc_threads) != 0:
        for core_id in cpus:
            if len(pgmc_threads) != 0:
                thread_id = pgmc_threads[-1].id
                pgmc_threads.remove(pgmc_threads[-1])
                taskset_string = 'taskset -cp %i %i' % (core_id, thread_id)
                print(taskset_string)
                subprocess.getoutput(taskset_string)
            else:
                break

    # All of the threads should now be assigned to a core.
    # Wait for the process to exit.
    pgmc.wait()
    print("program terminated, exiting . . . ")
Here is the submission script used.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module purge
module load openmpi/3.1.3/2019
module load pgi/2019
module load anaconda3
# This is the anaconda environment I created with psutil installed.
conda activate psutil-node
export OMP_NUM_THREADS=72
# The two MPI_OPTIONS lines are specific to this cluster if I'm not mistaken.
# You probably won't need them.
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time python3 affinity_set.py mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
My main reason for including the submission script is to demonstrate how the Python script is used. More specifically, you call it with your real job command line as its arguments.

Related

process 'forkPoolworker-5' pid:111 exited with 'signal 9 (SIGKILL)'

Hello everyone, I hope someone can help me.
python:3.8
Django==4.0.4
celery==5.2.1
I am using Python/Django/Celery. When I fetch data from Hive via SQL, my Celery worker gets the error "process 'forkPoolworker-5' pid:111 exited with 'signal 9 (SIGKILL)'", my task never finishes, and the TCP connection is closed. What can I do to solve this?
I have tried:
CELERYD_MAX_TASKS_PER_CHILD = 1  # maximum number of tasks per worker child
CELERYD_CONCURRENCY = 3  # maximum concurrency per worker
CELERYD_MAX_MEMORY_PER_CHILD = 1024*1024*2  # each task may occupy up to 2 GB of memory
CELERY_TASK_RESULT_EXPIRES = 60 * 60 * 24 * 3
-Ofair
but none of these solved the problem.
SIGKILL is raised by the system, most likely due to memory (or storage) pressure. Monitor how much memory a Celery task takes by running with the -P solo option or -c 1, and allocate sufficient memory accordingly.
To check memory usage, use either pmap <pid> or ps -a -o rss,vsz. Look up rss and vsz for more details (in short, rss is resident RAM and vsz is virtual memory).
CELERYD_MAX_TASKS_PER_CHILD = 1 kills the process after every task, so CELERYD_MAX_MEMORY_PER_CHILD has no effect, i.e. the worker waits for the task to complete before enforcing the limit on the running child process.
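If psutil is available, a small monitoring sketch along these lines can also log each worker's RSS over time to confirm whether memory is the trigger (the "celery" name match is an assumption; adjust it to however your workers appear in ps):
import time
import psutil

def report_celery_memory(name_fragment="celery", interval=5):
    # Periodically print the resident set size (RSS) of every process whose
    # command line mentions the given fragment.
    while True:
        for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
            cmdline = " ".join(proc.info["cmdline"] or [])
            meminfo = proc.info["memory_info"]
            if name_fragment in cmdline and meminfo is not None:
                rss_mb = meminfo.rss / (1024 * 1024)
                print("pid %d: %.1f MiB RSS" % (proc.info["pid"], rss_mb))
        time.sleep(interval)

if __name__ == "__main__":
    report_celery_memory()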

SLURM: how should I understand the ntasks parameter?

I am playing with a cluster using SLURM on AWS. I have defined the following parameters:
#!/bin/sh
[...]
#SBATCH --ntasks=216
#SBATCH --constraint=c5n.18xlarge
Now how should I understand ntasks? What exactly is this parameter? How does it relate to the number of vCPUs, and therefore to the number of nodes that will be provisioned?
AFAIK, it does not simply correspond to the number of vCPUs, because I tried selecting a multiple of 72 (a c5n.18xlarge has 72 vCPUs) and it did not match the number of EC2 instances provisioned.
I saw I can also use other parameters such as:
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
but again, the ntasks parameter remains unclear to me. For information, I then use the cluster to run an Open MPI process using the $SLURM_NTASKS variable, as advised in an AWS workshop, i.e.:
mpirun -np $SLURM_NTASKS some_process
Thanks for your help
In Slurm the number of tasks is essentially the number of parallel programs you can start in your allocation. By default, each task can access one CPU (which can be a core or a thread, depending on the configuration); this can be modified with --cpus-per-task=#.
This in itself does not tell you anything about the number of nodes you will get. If you just specify --ntasks (or just -n), your job will be spread over many nodes, depending on what's available. You can limit this with --nodes <min>-<max> or --nodes <exact>.
Another way to specify the number of tasks is --ntasks-per-node, which does exactly what it says and is best used in conjunction with --nodes (not with --ntasks; combined with --ntasks it is only the maximum number of tasks per node).
So, if you want three nodes with 72 tasks (each with the one default CPU), try:
#SBATCH --ntasks=216
#SBATCH --nodes=3
#SBATCH --constraint=c5n.18xlarge
or:
#SBATCH --ntasks-per-node=72
#SBATCH --nodes=3
#SBATCH --constraint=c5n.18xlarge
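As a quick sanity check from inside a job, the relationship between these parameters can be read back from the environment variables Slurm sets. A minimal sketch (note that SLURM_CPUS_PER_TASK is only set when --cpus-per-task is given, hence the fallback to 1):
import os

ntasks = int(os.environ["SLURM_NTASKS"])
cpus_per_task = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))
nodes = int(os.environ["SLURM_JOB_NUM_NODES"])

# Total CPUs the allocation can use, and how the tasks spread over nodes.
print("tasks: %d, CPUs per task: %d, total CPUs: %d" % (ntasks, cpus_per_task, ntasks * cpus_per_task))
print("nodes: %d, tasks per node (if evenly spread): %d" % (nodes, ntasks // nodes))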

python multiprocessing Pool in Cluster

I am using the following code (example) to do parallel processing in a cluster.
It runs perfectly on the cluster, using all available cores in a node, just by typing:
python test.py
from multiprocessing import Pool
import glob
from astropy.io import fits

def read(files):
    data = fits.open(files)

if __name__ == '__main__':
    files = glob.glob("*.fits*")
    nfiles = len(files)
    pool = Pool()
    pool.map(read, files)
However, when I submit the batch job using
srun -N 1 python test.py
It seems to use only 1 core, not all the available cores in that node.
What should I change to distribute nfiles in a node among all the cores, such that each core gets nfiles/ncores?
That is because you are specifying -N 1, which means one node is assigned to the call python test.py. Also, the default value for the other flag, -n, is 1. Change it to -N nnodes, where nnodes equals the number of nodes you intend to run your script on, and specify -n ncores, where ncores equals the number of cores on each node. If your nodes are heterogeneous, you run the risk of setting up more tasks than there are cores on a node, or of leaving cores idle.
For more documentation on the above flags see here.
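Independently of the srun flags, it can also help to size the Pool explicitly from what Slurm actually allocated instead of relying on the default os.cpu_count(). A minimal sketch under that assumption (the file list and the read body are placeholders):
import os
from multiprocessing import Pool

# Prefer the Slurm allocation; fall back to the CPUs this process may use.
ncores = int(os.environ.get("SLURM_CPUS_PER_TASK", len(os.sched_getaffinity(0))))

def read(path):
    pass  # placeholder for the real per-file work

if __name__ == '__main__':
    files = ["a.fits", "b.fits"]  # placeholder file list
    with Pool(processes=ncores) as pool:
        pool.map(read, files)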

Django Celery memory not released

In my django project I have the following dependencies:
django==1.5.4
django-celery==3.1.9
amqp==1.4.3
kombu==3.0.14
librabbitmq==1.0.3 (as suggested by https://stackoverflow.com/a/17541942/1452356)
In dev_settings.py:
DEBUG = False
BROKER_URL = "django://"
import djcelery
djcelery.setup_loader()
CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
CELERYD_CONCURRENCY = 2
# CELERYD_TASK_TIME_LIMIT = 10
CELERYD_TASK_TIME_LIMIT is commented out, as suggested in https://stackoverflow.com/a/17561747/1452356, along with debug_toolbar, as suggested by https://stackoverflow.com/a/19931261/1452356.
I start my worker in a shell with:
./manage.py celeryd --settings=dev_settings
Then I send a task:
class ExempleTask(Task):
    def run(self, piProjectId):
        table = []
        for i in range(50000000):
            table.append(1)
        return None
Using a django command:
class Command(BaseCommand):
    def handle(self, *plArgs, **pdKwargs):
        loResult = ExempleTask.delay(1)
        loResult.get()
        return None
With:
./manage.py purge_and_delete_test --settings=dev_settings
I monitor the memory usage with:
watch -n 1 'ps ax -o rss,user,command | sort -nr | grep celery |head -n 5'
Every time I call the task, it increases the memory consumption of the celeryd worker process, proportionally to the amount of data allocated in the task...
It seems like a common issue (cf. the other Stack Overflow links above); however, I couldn't fix it, even with the latest dependencies.
Thanks.
This is a Python and OS issue, not really a Django or Celery issue. Without getting too deep:
1) A process will never free memory address space once it has requested it from the OS. It never says "hey, I'm done here, you can have it back". In the example you've given, I'd expect the process size to grow for a while and then stabilize, possibly at a high baseline. After your example allocation, you could call the gc interface to force a garbage collection and see how much of that space actually gets reused (see the sketch after this list).
2) This isn't usually a problem, because unused pages are paged out by the OS once your process stops accessing the address space it has deallocated.
3) It is a problem if your process is leaking object references, preventing Python from garbage collecting and re-appropriating the space for later reuse by that process, and requiring your process to ask for more address space from the OS. At some point the OS cries uncle and will (probably) kill your process with its OOM killer or a similar mechanism.
4) If you are leaking, either fix the leak or set CELERYD_MAX_TASKS_PER_CHILD, and your child processes will (probably) exit and be replaced before upsetting the OS.
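To make point 1 concrete, here is a small sketch (assuming psutil is installed; the allocation mirrors the ExempleTask above) that records the RSS before the allocation, after it, and again after deleting the list and forcing a garbage collection. How much of that memory actually goes back to the OS depends on the allocator and the allocation pattern, which is exactly the behaviour described above:
import gc
import psutil

def rss_mb():
    # Resident set size of the current process, in MiB.
    return psutil.Process().memory_info().rss / (1024 * 1024)

print("start:            %.1f MiB" % rss_mb())
table = []
for i in range(50000000):
    table.append(1)
print("after allocation: %.1f MiB" % rss_mb())
del table
gc.collect()
print("after del + gc:   %.1f MiB" % rss_mb())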
This is a good general discussion on Python's memory management:
CPython memory allocation
And a few minor things:
Use xrange, not range: in Python 2, range generates the full list of values and then iterates over that list, while xrange is just a generator. Also, have you set Django DEBUG=False? (With DEBUG=True, Django keeps a record of every query it runs, which also grows memory.)
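A quick illustration of the range/xrange difference under Python 2 (in Python 3, range is already lazy, so this particular point no longer applies):
import sys

# Python 2: range builds the whole list in memory, xrange yields lazily.
values = range(10000000)      # a list of 10 million ints under Python 2
lazy = xrange(10000000)       # a lightweight, generator-like object
print(sys.getsizeof(values))  # tens of megabytes for the list alone
print(sys.getsizeof(lazy))    # a few dozen bytes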

Celery - run different workers on one server

I have 2 kinds of tasks:
Type1 - a few small, high-priority tasks.
Type2 - a lot of heavy tasks with lower priority.
Initially I had a simple configuration with default routing and no routing keys. It was not sufficient - sometimes all workers were busy with Type2 tasks, so Type1 tasks were delayed.
I've added routing keys:
CELERY_DEFAULT_QUEUE = "default"
CELERY_QUEUES = {
"default": {
"binding_key": "task.#",
},
"highs": {
"binding_key": "starter.#",
},
}
CELERY_DEFAULT_EXCHANGE = "tasks"
CELERY_DEFAULT_EXCHANGE_TYPE = "topic"
CELERY_DEFAULT_ROUTING_KEY = "task.default"
CELERY_ROUTES = {
"search.starter.start": {
"queue": "highs",
"routing_key": "starter.starter",
},
}
So now I have two queues - one for high-priority and one for low-priority tasks.
The problem is: how do I start two celeryd instances with different concurrency settings?
Previously celery was used in daemon mode (according to this), so only /etc/init.d/celeryd start was required, but now I have to run two different celeryd instances with different queues and concurrency. How can I do it?
Based on the above answer, I formulated the following /etc/default/celeryd file (originally based on the configuration described in the docs here: http://ask.github.com/celery/cookbook/daemonizing.html) which works for running two celery workers on the same machine, each worker servicing a different queue (in this case the queue names are "default" and "important").
Basically this answer is just an extension of the previous answer in that it simply shows how to do the same thing, but for celery in daemon mode. Please note that we are using django-celery here:
CELERYD_NODES="w1 w2"
# Where to chdir at start.
CELERYD_CHDIR="/home/peedee/projects/myproject/myproject"
# Python interpreter from environment.
#ENV_PYTHON="$CELERYD_CHDIR/env/bin/python"
ENV_PYTHON="/home/peedee/projects/myproject/myproject-env/bin/python"
# How to call "manage.py celeryd_multi"
CELERYD_MULTI="$ENV_PYTHON $CELERYD_CHDIR/manage.py celeryd_multi"
# How to call "manage.py celeryctl"
CELERYCTL="$ENV_PYTHON $CELERYD_CHDIR/manage.py celeryctl"
# Extra arguments to celeryd
# Longest task: 10 hrs (as of writing this, the UpdateQuanitites task takes 5.5 hrs)
CELERYD_OPTS="-Q:w1 default -c:w1 2 -Q:w2 important -c:w2 2 --time-limit=36000 -E"
# Name of the celery config module.
CELERY_CONFIG_MODULE="celeryconfig"
# %n will be replaced with the nodename.
CELERYD_LOG_FILE="/var/log/celery/celeryd.log"
CELERYD_PID_FILE="/var/run/celery/%n.pid"
# Name of the projects settings module.
export DJANGO_SETTINGS_MODULE="settings"
# celerycam configuration
CELERYEV_CAM="djcelery.snapshot.Camera"
CELERYEV="$ENV_PYTHON $CELERYD_CHDIR/manage.py celerycam"
CELERYEV_LOG_FILE="/var/log/celery/celerycam.log"
# Where to chdir at start.
CELERYBEAT_CHDIR="/home/peedee/projects/cottonon/cottonon"
# Path to celerybeat
CELERYBEAT="$ENV_PYTHON $CELERYBEAT_CHDIR/manage.py celerybeat"
# Extra arguments to celerybeat. This is a file that will get
# created for scheduled tasks. It's generated automatically
# when Celerybeat starts.
CELERYBEAT_OPTS="--schedule=/var/run/celerybeat-schedule"
# Log level. Can be one of DEBUG, INFO, WARNING, ERROR or CRITICAL.
CELERYBEAT_LOG_LEVEL="INFO"
# Log file locations
CELERYBEAT_LOGFILE="/var/log/celerybeat.log"
CELERYBEAT_PIDFILE="/var/run/celerybeat.pid"
It seems the answer - celeryd-multi - is currently not well documented.
What I needed can be done by the following command:
celeryd-multi start 2 -Q:1 default -Q:2 starters -c:1 5 -c:2 3 --loglevel=INFO --pidfile=/var/run/celery/${USER}%n.pid --logfile=/var/log/celeryd.${USER}%n.log
What we do is start 2 workers listening to different queues (-Q:1 is default, -Q:2 is starters) with different concurrencies (-c:1 5 and -c:2 3).
Another alternative is to give the worker process a unique name -- using the -n argument.
I have two Pyramid apps running on the same physical hardware, each with its own celery instance (within its own virtualenv).
Both are controlled by Supervisor, each with its own supervisord.conf file.
app1:
[program:celery]
autorestart=true
command=%(here)s/../bin/celery worker -n ${HOST}.app1 --app=app1.queue -l debug
directory=%(here)s
[2013-12-27 10:36:24,084: WARNING/MainProcess] celery@maz.local.app1 ready.
app2:
[program:celery]
autorestart=true
command=%(here)s/../bin/celery worker -n ${HOST}.app2 --app=app2.queue -l debug
directory=%(here)s
[2013-12-27 10:35:20,037: WARNING/MainProcess] celery@maz.local.app2 ready.
An update:
In Celery 4.x, below would work properly:
celery multi start 2 -Q:1 celery -Q:2 starters -A $proj_name
Or if you want to designate the instances' names, you could:
celery multi start name1 name2 -Q:name1 celery -Q:name2 queue_name -A $proj_name
However, I find that it does not print detailed logs on screen when we use celery multi, since it seems to be only a script shortcut for booting up these instances.
I guess it would also work to start these instances one by one manually, giving them different node names but the same -A $proj_name, though it's a bit of a waste of time.
Btw, according to the official documentation, you can kill all celery workers simply with:
ps auxww | grep 'celery worker' | awk '{print $2}' | xargs kill -9