process 'forkPoolworker-5' pid:111 exited with 'signal 9 (SIGKILL)' - django

Hello everyone, I hope someone can help me.
python:3.8
Django==4.0.4
celery==5.2.1
I am using Python/Django/Celery for a job that pulls data from Hive via SQL. While fetching the data, my Celery worker reports "process 'ForkPoolWorker-5' pid:111 exited with 'signal 9 (SIGKILL)'", the task never finishes, and the TCP connection is closed. What can I do to solve this?
I have tried:
CELERYD_MAX_TASKS_PER_CHILD = 1  # max number of tasks per worker child
CELERYD_CONCURRENCY = 3  # max concurrency per worker
CELERYD_MAX_MEMORY_PER_CHILD = 1024*1024*2  # each child may use up to 2 GB of memory
CELERY_TASK_RESULT_EXPIRES = 60 * 60 * 24 * 3
-Ofair
but none of these solved the problem.

SIGKILL is sent by the system, most likely because of memory (or storage) exhaustion. Monitor how much memory a Celery task takes by running the worker with the -P solo option or with -c 1, and allocate sufficient memory accordingly.
To check memory usage, use either pmap <pid> or ps -a -o rss,vsz; search for rss and vsz for more details (in short, rss is resident RAM and vsz is virtual memory).
CELERYD_MAX_TASKS_PER_CHILD = 1 kills the child process after every task, so CELERYD_MAX_MEMORY_PER_CHILD has no effect: the worker waits for a task to complete before enforcing the memory limit on the running child process.
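For illustration, here is a minimal sketch of how those settings could be combined so that the memory limit can actually trigger. The numbers are placeholders to tune after measuring a real task, and this is not a guaranteed fix: a single task that outgrows the host's RAM will still be SIGKILLed by the kernel.

# settings.py (sketch; values are placeholders, using the same setting names as above)
CELERYD_CONCURRENCY = 2                 # fewer children -> more RAM headroom per child
CELERYD_MAX_TASKS_PER_CHILD = 50        # recycle a child after 50 tasks, not after every
                                        # task, so the memory limit below can apply
CELERYD_MAX_MEMORY_PER_CHILD = 3 * 1024 * 1024   # limit is in KiB -> about 3 GiB per child
CELERY_TASK_RESULT_EXPIRES = 60 * 60 * 24 * 3    # keep results for 3 days (unchanged)
# Note: the memory limit is only checked after a task finishes, so the host still needs
# enough free RAM for CELERYD_CONCURRENCY children at their peak usage.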

Related

Gunicorn worker, threads for GPU tasks to increase concurrency/parallelism

I'm using Flask with Gunicorn to implement an AI server. The server takes in HTTP requests and calls the algorithm (built with PyTorch). The computation runs on an NVIDIA GPU.
I need some input on how I can achieve concurrency/parallelism in this case. The machine has 8 vCPUs, 20 GB memory and 1 GPU with 12 GB memory.
1 worker occupies 4 GB of memory and 2.2 GB of GPU memory.
The maximum number of workers I can run is 5 (because of GPU memory: 2.2 GB * 5 workers = 11 GB).
1 worker = 1 HTTP request (max simultaneous requests = 5)
The specific questions are:
How can I increase the concurrency/parallelism?
Do I have to specify number of threads for computation on GPU?
My current gunicorn command is
gunicorn --bind 0.0.0.0:8002 main:app --timeout 360 --workers=5 --worker-class=gevent --worker-connections=1000
Fast tokenizers are apparently not thread-safe.
AutoTokenizer is a wrapper that uses either the fast or the slow tokenizer internally. The default is the fast one (not thread-safe), so you have to switch to the slow (thread-safe) one; that is why you add the use_fast=False flag.
I was able to solve this by:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
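For context, here is a minimal sketch of how this fits into the Flask/Gunicorn setup described above. The model name and the route are placeholders of mine; only the use_fast=False flag is the actual fix.

# app.py -- hypothetical minimal Flask endpoint; only use_fast=False is the fix
from flask import Flask, request, jsonify
from transformers import AutoTokenizer

app = Flask(__name__)

# Loaded once per Gunicorn worker process. Per the note above, the slow
# tokenizer is used because the fast one is not thread-safe.
model_name = "bert-base-uncased"   # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

@app.route("/tokenize", methods=["POST"])
def tokenize():
    text = request.json["text"]
    return jsonify({"input_ids": tokenizer(text)["input_ids"]})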

MPI_Comm_spawn fails with "All nodes which are allocated for this job are already filled"

I'm trying to use Torque's (5.1.1) qsub command to launch multiple OpenMPI processes, one process per node, and have each process launch a single process on its own local node using MPI_Comm_spawn(). MPI_Comm_spawn() is reporting:
All nodes which are allocated for this job are already filled.
My OpenMPI version is 4.0.1.
I am following the instructions here to control the mapping of nodes:
Controlling node mapping of MPI_COMM_SPAWN
using the --map-by ppr:1:node option to mpiexec and a hostfile (programmatically derived from the ${PBS_NODEFILE} file that Torque produces). My derived file MyHostFile looks like this:
n001.cluster.com slots=2 max_slots=2
n002.cluster.com slots=2 max_slots=2
while the original ${PBS_NODEFILE} only has the node names, and no slot specifications.
My qsub command is
qsub -V -j oe -e ./tempdir -o ./tempdir -N MyJob MyJob.bash
The mpiexec command from MyJob.bash is
mpiexec --display-map --np 2 --hostfile MyNodefile --map-by ppr:1:node <executable>.
MPI_Comm_spawn() causes this error to be printed:
Data for JOB [22220,1] offset 0 Total slots allocated 1 <=====
======================== JOB MAP ========================
Data for node: n001 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [22220,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]
=============================================================
All nodes which are allocated for this job are already filled.
There are two things that occur to me:
(1) "Total slots allocated" is 1 above, but I need at least two slots available.
(2) It may not be right to try to specify a hostfile to mpiexec when
using Torque (though it is derived from the Torque hostfile ${PBS_NODEFILE}). Maybe my derived hostfile is being ignored.
Is there a way to make this work? I've tried recompiling OpenMPI
without Torque support, hopefully preventing OpenMPI from interacting
with it, but it didn't change the error message.
Answering my own question: adding the argument -l nodes=1:ppn=2 to the qsub command reserves 2 processors on the node, even though mpiexec is launching only one process. MPI_Comm_spawn() can then spawn the new process on the second reserved slot.
I also had to compile OpenMPI without Torque support, since including it causes my hostfile argument to be ignored and the Torque-generated hostfile to be used.
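For what it's worth, here is a minimal sketch of the spawn step written with mpi4py rather than the C API (the calls map one-to-one; the executable name and host are placeholders of mine). With -l nodes=1:ppn=2 the parent occupies one reserved slot and the spawned child can land on the other.

# spawn_sketch.py -- illustrative mpi4py version of the MPI_Comm_spawn call
from mpi4py import MPI

# Optional: request a specific host, e.g. one taken from the derived hostfile.
info = MPI.Info.Create()
info.Set("host", "n001.cluster.com")   # placeholder host name

# Spawn one child process; it should be placed in the second reserved slot.
child_comm = MPI.COMM_SELF.Spawn("./worker_exe", maxprocs=1, info=info)

# child_comm is an intercommunicator the parent can use to talk to the child.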

OpenMP code is using only 4 threads instead of the specified 72

I have a program written by someone else that uses OpenMP. I am running it on a cluster that uses Slurm as its job manager. Despite setting OMP_NUM_THREADS=72 and properly requesting 72 cores for the job, the job is only using four cores.
I have already used scontrol show job <job_id> --details to verify that 72 cores are assigned to the job. I have also remoted into the node that the job is running on and used htop to inspect it. It was running 72 threads, all on four cores. It is worth noting that this is an SMT4 POWER9 CPU, meaning that each physical core executes 4 simultaneous hardware threads. Ultimately, it looks like OpenMP is putting all threads on one physical core. This is further complicated by the fact that this is an IBM system; I can't seem to find any useful documentation on finer control of the OpenMP environment, as everything I find is for Intel.
I have also tried using taskset to manually change the affinity. This worked as intended and moved one of the threads to an unused core. The program continued to work as intended after this.
I could theoretically write a script to find all of the threads and call taskset to assign them to cores in a logical way, but I am afraid to do this. It seems like a bad idea to me. It would also take a while.
I guess my main question is: is this a Slurm problem, an OpenMP problem, an IBM problem or a user error? Is there some environment variable I don't know about that I need to set? Will it break Slurm if I manually call taskset from a script? I would use scontrol to figure out which CPUs are assigned to the job if I did that. I don't want to anger the people who run the cluster by messing things up, though.
Here is the submission script. I can't include any of the actual running code due to license issues though. I'm hoping this will just be a simple matter of fixing an environment variable. The MPI_OPTIONS variables were recommended by the guy who administers the system. If by some chance someone here has worked with the ENKI cluster before, that's where this is running.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module load openmpi/3.1.3/2019
module load pgi/2019
export OMP_NUM_THREADS=72
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
Edit: The fix resulted in a 7x speedup when using 72 cores, vs. just running on 4 cores. Considering the nature of the calculations being run, this is pretty good.
Edit 2: The fix resulted in a 17x speedup when using 160 cores, vs. just running on 4 cores.
This might not work for everyone, but I have a really hacky solution. I wrote a python script that uses psutil to find all threads that are children of the running process and set their affinity manually. This script uses scontrol to figure out which cpus are assigned to the job and uses taskset to force the threads to distribute across those cpus.
So far the process is running a lot faster. I'm sure that forcing CPU affinity isn't the best way to do it, but it's a lot better than not using the available resources at all.
Here is the basic idea behind the code. The program I am running is called pgmc, hence the variable names. You will need to create an anaconda environment with psutil installed if you are running on a system like mine.
import psutil
import subprocess
import os
import sys
import time

# Gets the id for the current job.
def get_job_id():
    return os.environ["SLURM_JOB_ID"]

# Returns a list of processors assigned to the job and the total number of cpus
# assigned to the job.
def get_proc_info():
    run_str = 'scontrol show job %s --details' % get_job_id()
    stdout = subprocess.getoutput(run_str)
    id_spec = None
    num_cpus = None
    chunks = stdout.split(' ')
    for chunk in chunks:
        if chunk.lower().startswith("cpu_ids"):
            id_spec = chunk.split('=')[1]
            start, stop = id_spec.split('-')
            id_spec = list(range(int(start), int(stop) + 1))
        if chunk.lower().startswith('numcpus'):
            num_cpus = int(chunk.split('=')[1])
        if id_spec is not None and num_cpus is not None:
            return id_spec, num_cpus
    raise Exception("Couldn't find information about the allocated cpus.")

if __name__ == '__main__':
    # Before we do anything, make sure that we can get the list of cpus
    # assigned to the job. Once we have that, run the command line supplied.
    cpus, cpu_count = get_proc_info()
    if len(cpus) != cpu_count:
        raise Exception("CPU list didn't match CPU count.")

    # If we successfully got to here, run the command line.
    program_name = ' '.join(sys.argv[1:])
    pgmc = subprocess.Popen(sys.argv[1:])
    time.sleep(10)
    pid = [proc for proc in psutil.process_iter() if proc.name() == "your program name here"][0].pid

    # Now that we have the pid of the pgmc process, we need to get all
    # child threads of the process.
    pgmc_proc = psutil.Process(pid)
    pgmc_threads = list(pgmc_proc.threads())

    # Now that we have a list of threads, we loop over available cores and
    # assign threads to them. Once this is done, we wait for the process
    # to complete.
    while len(pgmc_threads) != 0:
        for core_id in cpus:
            if len(pgmc_threads) != 0:
                thread_id = pgmc_threads[-1].id
                pgmc_threads.remove(pgmc_threads[-1])
                taskset_string = 'taskset -cp %i %i' % (core_id, thread_id)
                print(taskset_string)
                subprocess.getoutput(taskset_string)
            else:
                break

    # All of the threads should now be assigned to a core.
    # Wait for the process to exit.
    pgmc.wait()
    print("program terminated, exiting . . . ")
Here is the submission script used.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module purge
module load openmpi/3.1.3/2019
module load pgi/2019
module load anaconda3
# This is the anaconda environment I created with psutil installed.
conda activate psutil-node
export OMP_NUM_THREADS=72
# The two MPI_OPTIONS lines are specific to this cluster if I'm not mistaken.
# You probably won't need them.
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time python3 affinity_set.py mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
My main reason for including the submission script is to demonstrate how the Python script is used. More specifically, you call it with your real job command line as its arguments.

Ember CLI build killed

I build my Ember CLI app inside a Docker container on startup. The build fails without an error message; it just says "Killed":
root@fstaging:/frontend/source# node_modules/ember-cli/bin/ember build -prod
version: 1.13.15
Could not find watchman, falling back to NodeWatcher for file system events.
Visit http://www.ember-cli.com/user-guide/#watchman for more info.
Buildingember-auto-register-helpers is not required for Ember 2.0.0 and later please remove from your `package.json`.
Building.DEPRECATION: The `bind-attr` helper ('app/templates/components/file-selector.hbs' # L1:C7) is deprecated in favor of HTMLBars-style bound attributes.
at isBindAttrModifier (/app/source/bower_components/ember/ember-template-compiler.js:11751:34)
Killed
The same Docker image starts up successfully in another environment without hardware constraints. Does Ember CLI have hard-coded hardware requirements for the build process? The RAM is limited to 128 MB and swap to 2 GB.
That is likely not enough memory for Ember CLI to do what it needs. You are correct that the process is being killed because of an OOM (out-of-memory) situation. If you log in to the host and take a look at the dmesg output, you will probably see something like:
V8 WorkerThread invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
V8 WorkerThread cpuset=867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032 mems_allowed=0
CPU: 0 PID: 2027 Comm: V8 WorkerThread Tainted: G O 4.1.13-boot2docker #1
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
0000000000000000 00000000000000d0 ffffffff8154e053 ffff880039381000
ffffffff8154d3f7 ffff8800395db528 ffff8800392b4528 ffff88003e214580
ffff8800392b4000 ffff88003e217080 ffffffff81087faf ffff88003e217080
Call Trace:
[<ffffffff8154e053>] ? dump_stack+0x40/0x50
[<ffffffff8154d3f7>] ? dump_header.isra.10+0x8c/0x1f4
[<ffffffff81087faf>] ? finish_task_switch+0x4c/0xda
[<ffffffff810f46b1>] ? oom_kill_process+0x99/0x31c
[<ffffffff811340e6>] ? task_in_mem_cgroup+0x5d/0x6a
[<ffffffff81132ac5>] ? mem_cgroup_iter+0x1c/0x1b2
[<ffffffff81134984>] ? mem_cgroup_oom_synchronize+0x441/0x45a
[<ffffffff8113402f>] ? mem_cgroup_is_descendant+0x1d/0x1d
[<ffffffff810f4d77>] ? pagefault_out_of_memory+0x17/0x91
[<ffffffff815565d8>] ? page_fault+0x28/0x30
Task in /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032 killed as a result of limit of /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032
memory: usage 131072kB, limit 131072kB, failcnt 2284203
memory+swap: usage 262032kB, limit 262144kB, failcnt 970540
kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032: cache:340KB rss:130732KB rss_huge:10240KB mapped_file:8KB writeback:0KB swap:130960KB inactive_anon:72912KB active_anon:57880KB inactive_file:112KB active_file:40KB unevictable:0KB
[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 1993] 0 1993 380 1 6 3 17 0 sh
[ 2025] 0 2025 203490 32546 221 140 32713 0 npm
Memory cgroup out of memory: Kill process 2025 (npm) score 1001 or sacrifice child
Killed process 2025 (npm) total-vm:813960kB, anon-rss:130184kB, file-rss:0kB
It might be worthwhile to profile the container using something like https://github.com/google/cadvisor to find out what kind of memory maximums it may need.

Django, low requests per second with gunicorn 4 workers

I'm trying to see why my Django website (gunicorn, 4 workers) is slow under heavy load. I did some profiling (http://djangosnippets.org/snippets/186/) without any clear answer, so I started some load tests from scratch using ab -n 1000 -c 100 http://localhost:8888/
A simple HttpResponse("hello world") with no middleware ==> 3600 req/s
A simple HttpResponse("hello world") with middlewares (cached session, cached authentication) ==> 2300 req/s
A simple render_to_response that only prints a form (cached template) ==> 1200 req/s (response time was divided by 2)
A simple render_to_response with 50 memcached queries ==> 157 req/s
Shouldn't memcached queries be much faster than that (I'm using PyLibMCCache)?
Is template rendering really as slow as these results suggest?
I tried different profiling techniques without any success.
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 46936
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 400000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 46936
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ sysctl -p
fs.file-max = 700000
net.core.somaxconn = 5000
net.ipv4.tcp_keepalive_intvl = 30
I'm using Ubuntu 12.04 (6 GB of RAM, Core i5).
Any help please?
It really depends on how long it takes to do a memcached request and to open a new connection (Django closes the connection when the request finishes). Both your worker and memcached can handle much more stress, but of course if it takes 5-10 ms to do a memcached call, then 50 of them become the bottleneck, because the network latency is multiplied by the call count.
Right now you are just benchmarking Django, gunicorn, your machine and your network.
Unless you have something extremely wrong at this level, these tests are not going to point you to very interesting discoveries.
What is slowing down your app is very likely related to the way you use your database and memcached (and maybe template rendering).
For this reason I really suggest you get the Django Debug Toolbar and see what's happening in your real pages.
If it turns out that opening a connection to memcached is the bottleneck, you can try to use a connection pool and keep the connection open.
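If you go the Debug Toolbar route, the setup is mostly a settings change. Here is a sketch for recent django-debug-toolbar releases; the exact steps have changed between versions, so check the docs matching your Django version.

# settings.py -- sketch for a recent django-debug-toolbar release
INSTALLED_APPS += ["debug_toolbar"]
MIDDLEWARE += ["debug_toolbar.middleware.DebugToolbarMiddleware"]
INTERNAL_IPS = ["127.0.0.1"]   # the toolbar only renders for these client IPs

# urls.py
from django.urls import include, path
urlpatterns += [path("__debug__/", include("debug_toolbar.urls"))]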
You could investigate memcached performance.
$ python manage.py shell
>>> from django.core.cache import cache
>>> cache.set("unique_key_name_12345", "some value with a size representative of the real world memcached usage", timeout=3600)
>>> from datetime import datetime
>>> def how_long(n):
...     start = datetime.utcnow()
...     for _ in xrange(n):
...         cache.get("unique_key_name_12345")
...     return (datetime.utcnow() - start).total_seconds()
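Then time a batch of lookups and divide; for example (illustrative session, the exact number will differ on your setup):
>>> how_long(10000) / 10000   # average seconds per cache.get round trip
0.0002                        # i.e. roughly 0.2 ms per lookup on this server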
With this kind of round-trip test I am seeing that 1 memcached lookup will take about 0.2 ms on my server.
The problem with django.core.cache and pylibmc is that the calls are blocking, and you can pay for up to 50 of those round trips within a single HTTP request. 50 times 0.2 ms is already 10 ms.
If you were achieving 1200 req/s on 4 workers without memcached, the average HTTP round-trip time was 1/(1200/4) = 3.33 ms. Add 10 ms to that and it becomes 13.33 ms. The throughput with 4 workers would then drop to 300 req/s (which happens to be in the ballpark of your 157 number).